Here are my notes and some ancient PHP source.
For a modern JavaScript utility that does exactly what my tool used to do, I recommend js-escapes.
UTF-8 Conversion Notes
The following are shorthand notes for the conversion. For a thorough discussion (and for my notes below to mean anything), I recommend the Wikipedia article on UTF-8.
Rules for determining number of bytes used by character:
if first_byte >= 1111 0000 (F0h, 240d) then there will be 4 bytes in this character
else if first_byte >= 1110 0000 (E0h, 224d) then there will be 3 bytes in this character
else if first_byte >= 1100 0000 (C0h, 192d) then there will be 2 bytes in this character
else if first_byte >= 1000 0000 (80h, 128d) then ERROR, this is a continuation byte!
else there will be 1 byte in this character
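The rules above can be sketched as a small PHP function. The name `utf8_sequence_length` is hypothetical, and returning 0 for a continuation byte is just one way to signal the error. Like the notes, this sketch does not reject lead bytes of F8h and above, which are not valid UTF-8.

```php
<?php
// Hypothetical helper name; mirrors the byte-count rules in the notes.
function utf8_sequence_length(int $first_byte): int {
    if ($first_byte >= 0xF0) { return 4; } // 1111 0xxx
    if ($first_byte >= 0xE0) { return 3; } // 1110 xxxx
    if ($first_byte >= 0xC0) { return 2; } // 110x xxxx
    if ($first_byte >= 0x80) { return 0; } // 10xx xxxx: continuation byte, an error here
    return 1;                              // 0xxx xxxx: plain ASCII
}

// "é" is C3 A9 in UTF-8, so its first byte announces a 2-byte character.
echo utf8_sequence_length(ord("\xC3")), "\n"; // prints 2
```

The comparisons run from the largest threshold down, so each test only has to check the lower bound of its range.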
To build the Unicode code point from the UTF-8 bytes, AND each byte with its mask, multiply it by its multiplier, and add up the results:
AND MASK
Byte                                      Bin      Hex   Dec
-----------------------------------------------------------------
1st byte of 4 byte character (1111 0xxx)  111      7     7
1st byte of 3 byte character (1110 xxxx)  1111     F     15
1st byte of 2 byte character (110x xxxx)  11111    1F    31
2nd, 3rd, or 4th byte        (10xx xxxx)  111111   3F    63
1st byte of 1 byte character (0xxx xxxx)  n/a      n/a   n/a

Multiplier
Byte                            Bin                      Hex     Dec
------------------------------------------------------------------
1 of 4                          1 000000 000000 000000   40000   262144
1 of 3; 2 of 4                  1 000000 000000          1000    4096
1 of 2; 2 of 3; 3 of 4          1 000000                 40      64
1 of 1; 2 of 2; 3 of 3; 4 of 4  n/a                      n/a     n/a
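As a worked example of the two tables, here is a minimal PHP sketch that decodes a single character. The function name `utf8_decode_char` is hypothetical, and the sketch assumes well-formed input: it does not check for truncated sequences or for continuation bytes that lack the 10xx xxxx prefix. The last byte of a character has no multiplier in the table, so the sketch uses 1 for it.

```php
<?php
// Hypothetical helper (not from the original tool): decode the one UTF-8
// character starting at byte $offset, assuming well-formed input.
function utf8_decode_char(string $s, int $offset = 0): int {
    $first = ord($s[$offset]);
    if ($first >= 0xF0)     { $count = 4; $mask = 0x07; }
    elseif ($first >= 0xE0) { $count = 3; $mask = 0x0F; }
    elseif ($first >= 0xC0) { $count = 2; $mask = 0x1F; }
    elseif ($first >= 0x80) { throw new InvalidArgumentException('unexpected continuation byte'); }
    else                    { return $first; } // 1 byte: no mask, no multiplier

    // Multipliers keyed by how many bytes follow the current one.
    $multipliers = [3 => 262144, 2 => 4096, 1 => 64, 0 => 1];

    $code_point = ($first & $mask) * $multipliers[$count - 1];
    for ($i = 1; $i < $count; $i++) {
        $byte = ord($s[$offset + $i]) & 0x3F; // continuation-byte mask
        $code_point += $byte * $multipliers[$count - 1 - $i];
    }
    return $code_point;
}

// The euro sign is E2 82 AC: 2 * 4096 + 2 * 64 + 44 = 8364 = 20ACh.
echo strtoupper(dechex(utf8_decode_char("\xE2\x82\xAC"))), "\n"; // prints 20AC
```

Multiplying by 64, 4096, and 262144 is the same as shifting left by 6, 12, and 18 bits, which is why each continuation byte contributes exactly six bits.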
My old utility
For posterity (or laughs?), here is the ancient PHP that was still running up until 2018 to do these conversions for the poor souls who needed it:
<?php
// It was quite amusing to try to do this without making
// the browser use UTF-8 encoding in the early 2000s when
// I wrote this!
header("Content-type: text/html; charset=utf-8");

function getByte($offset){
    global $stuff;
    // note: bah is 'foo' in sheep
    $hexbyte = unpack("@{$offset}/C1bah", $stuff);
    return $hexbyte["bah"];
}

function getNum($byte){
    if($byte >= 240){ return 4; } // 1111 0000
    if($byte >= 224){ return 3; } // 1110 0000
    if($byte >= 192){ return 2; } // 1100 0000
    if($byte >= 128){ return 0; } // 1000 0000
    else{ return 1; }             // 0000 0000
}

// "Darts" was a template engine that ran ratfactor.com for over a decade.
// To run this, you'll have to remove it and replace its logic with vanilla PHP
// Try: $_POST[...]
if(Darts::isWebData('stuff')){
    $results = '';
    $escape = false;
    if(Darts::isWebData('escape')) $escape = true;
    $stuff = Darts::webData("stuff");
    $len = strlen($stuff);
    $results .= "$stuff <i>($len bytes)</i>: ";
    $output = array();
    $outputnum = 0;
    for($offset=0; $offset<$len; ){ // until end of string
        $first_byte = getByte($offset);
        $numBytes = getNum($first_byte);
        if(!$numBytes){
            $results .= "<b>Error! Unexpected 10b!</b>";
            $offset++; // skip the stray continuation byte so the loop cannot stall
            continue;
        }
        $output[$outputnum] = 0; // initialize output
        for($x=0; $x<$numBytes; $x++){ // do $numBytes of bytes
            $bytes[$x] = getByte($offset);
            $thistype = getNum($bytes[$x]);
            switch($thistype){ // mask it!
                case 4: $bytes[$x] &= 7;  break; // 1111 0xxx
                case 3: $bytes[$x] &= 15; break; // 1110 xxxx
                case 2: $bytes[$x] &= 31; break; // 110x xxxx
                case 0: $bytes[$x] &= 63; break; // 10xx xxxx
            }
            switch(($numBytes-1) - $x){
                case 3: $bytes[$x] *= 262144; break; // 3 bytes over
                case 2: $bytes[$x] *= 4096;   break; // 2 bytes over
                case 1: $bytes[$x] *= 64;     break; // 1 byte over
            }
            $output[$outputnum] += $bytes[$x]; // add it to output
            $offset++; // increment byte offset
        }
        $outputnum++; // increment output offset
    }
    if($escape) $results .= '"';
    for($i=0; $i<$outputnum; $i++){ // write the little stinker
        if($escape){ $results .= "\\u"; }
        else{ $results .= " "; }
        // pad with zeros to at least four hex digits
        $results .= str_pad(strtoupper(dechex($output[$i])), 4, "0", STR_PAD_LEFT);
    }
    if($escape) $results .= '"';
}
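The output step of the utility can also be sketched in isolation, without Darts. The function name `format_code_points` and its parameters are hypothetical. It takes already-decoded code points and writes each one as at least four uppercase hex digits, either as a quoted `\u` escape string or as a space-separated list. Code points above FFFFh come out with five or more digits, which a JavaScript-style `\uXXXX` escape cannot represent; this sketch makes no attempt to handle that case.

```php
<?php
// Hypothetical helper: format decoded code points either as a quoted
// string of \u escapes or as a space-separated list of hex values.
function format_code_points(array $code_points, bool $escape): string {
    $out = $escape ? '"' : '';
    foreach ($code_points as $cp) {
        $out .= $escape ? '\\u' : ' ';
        $out .= sprintf('%04X', $cp); // zero-pad to at least four hex digits
    }
    return $out . ($escape ? '"' : '');
}

// "Hé€" decodes to the code points 48h, E9h, and 20ACh.
echo format_code_points([0x48, 0xE9, 0x20AC], true), "\n";  // prints "\u0048\u00E9\u20AC"
echo format_code_points([0x48, 0xE9, 0x20AC], false), "\n"; // prints " 0048 00E9 20AC"
```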