Here are my notes and some ancient PHP source.
For a modern JavaScript utility that does exactly what my tool used to do, I recommend js-escapes.
UTF-8 Conversion Notes
The following are shorthand notes for the conversion. For a thorough discussion (and for my notes below to mean anything), I recommend the Wikipedia article on UTF-8.
Rules for determining number of bytes used by character:
if first_byte >= 1111 0000 (F0h, 240d) then there will be 4 bytes in this character
else if first_byte >= 1110 0000 (E0h, 224d) then there will be 3 bytes in this character
else if first_byte >= 1100 0000 (C0h, 192d) then there will be 2 bytes in this character
else if first_byte >= 1000 0000 (80h, 128d) then ERROR, this is a continuation byte!
else there will be 1 byte in this character
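The rules above can be sketched in a few lines of Python. This is only an illustration of the notes, not a validating decoder: the function name utf8_sequence_length is made up for this example, and like the notes it does not reject lead bytes of F8h and above, which never occur in valid UTF-8.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return how many bytes the character starting with first_byte uses.

    Returns 0 when first_byte is a continuation byte (10xx xxxx), which
    cannot start a character. The name is hypothetical, chosen for this sketch.
    """
    if first_byte >= 0xF0:  # 1111 0000
        return 4
    if first_byte >= 0xE0:  # 1110 0000
        return 3
    if first_byte >= 0xC0:  # 1100 0000
        return 2
    if first_byte >= 0x80:  # 1000 0000: continuation byte, an error here
        return 0
    return 1                # 0xxx xxxx: plain ASCII


if __name__ == "__main__":
    for ch in ("A", "é", "€", "😀"):
        encoded = ch.encode("utf-8")
        print(ch, len(encoded), utf8_sequence_length(encoded[0]))
```

The checks must run from the largest threshold down, because every 4-byte lead byte is also greater than the 3-byte and 2-byte thresholds.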
To create proper Unicode from the UTF multibytes:
AND MASK
Byte                                       Bin      Hex   Dec
-----------------------------------------------------------------
1st byte of 4 byte character (1111 0xxx)   111      7     7
1st byte of 3 byte character (1110 xxxx)   1111     F     15
1st byte of 2 byte character (110x xxxx)   11111    1F    31
2nd, 3rd, or 4th byte        (10xx xxxx)   111111   3F    63
1st byte of 1 byte character (0xxx xxxx)   n/a      n/a   n/a
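As a rough illustration of the AND mask table, the following Python sketch strips the marker bits from a single byte and keeps only the payload bits (the x positions). The function name payload_bits is hypothetical, and the sketch assumes the byte comes from well-formed UTF-8.

```python
def payload_bits(byte: int) -> int:
    """Return the data bits of one UTF-8 byte by applying the AND mask.

    Hypothetical helper for illustration; assumes well-formed input.
    """
    if byte >= 0xF0:        # 1111 0xxx: lead byte of a 4-byte character
        return byte & 0x07
    if byte >= 0xE0:        # 1110 xxxx: lead byte of a 3-byte character
        return byte & 0x0F
    if byte >= 0xC0:        # 110x xxxx: lead byte of a 2-byte character
        return byte & 0x1F
    if byte >= 0x80:        # 10xx xxxx: 2nd, 3rd, or 4th byte
        return byte & 0x3F
    return byte             # 0xxx xxxx: no mask needed


if __name__ == "__main__":
    # The euro sign is encoded as E2 82 AC.
    for b in "€".encode("utf-8"):
        print(f"{b:02X} -> {payload_bits(b):02X}")
```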
Multiplier
Byte                             Bin                      Hex     Dec
------------------------------------------------------------------
1 of 4                           1 000000 000000 000000   40000   262144
1 of 3; 2 of 4                   1 000000 000000          1000    4096
1 of 2; 2 of 3; 3 of 4           1 000000                 40      64
1 of 1; 2 of 2; 3 of 3; 4 of 4   n/a                      n/a     n/a
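Putting the two tables together, one character can be decoded by masking each byte, multiplying it by the value for its position, and summing the results. The Python below is a minimal sketch under the assumption that it receives exactly one complete, well-formed sequence of 1 to 4 bytes; the names MULTIPLIERS, LEAD_MASKS, and decode_code_point are invented for this example.

```python
# Multiplier keyed by how many bytes follow the current one in the sequence.
MULTIPLIERS = {3: 0x40000, 2: 0x1000, 1: 0x40, 0: 1}

# AND mask for the first byte, keyed by the total sequence length.
LEAD_MASKS = {1: 0x7F, 2: 0x1F, 3: 0x0F, 4: 0x07}


def decode_code_point(seq: bytes) -> int:
    """Combine one complete UTF-8 sequence into its Unicode code point.

    Hypothetical helper: it performs no validation and assumes seq holds
    exactly one well-formed character of 1 to 4 bytes.
    """
    total = 0
    for index, byte in enumerate(seq):
        # The first byte uses the lead mask; all others are continuation bytes.
        mask = LEAD_MASKS[len(seq)] if index == 0 else 0x3F
        bytes_after = len(seq) - 1 - index
        total += (byte & mask) * MULTIPLIERS[bytes_after]
    return total


if __name__ == "__main__":
    for ch in ("A", "é", "€", "😀"):
        print(ch, format(decode_code_point(ch.encode("utf-8")), "04X"))
```

Each multiplier is a power of 64, so multiplying is the same as shifting left by 6 bits for every byte that follows: 64 is a shift of 6, 4096 a shift of 12, and 262144 a shift of 18.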
My old utility
For posterity (or laughs?), here is the ancient PHP that was still running up until 2018 to do these conversions for the poor souls who needed it:
<?php
// It was quite amusing to try to do this without making
// the browser use UTF-8 encoding in the early 2000s when
// I wrote this!
header("Content-type: text/html; charset=utf-8");
function getByte($offset){
    global $stuff;
    // note: bah is 'foo' in sheep
    $hexbyte = unpack("@{$offset}/C1bah", $stuff);
    return $hexbyte["bah"];
}
function getNum($byte){
    if($byte >= 240){ return 4; } // 1111 0000
    if($byte >= 224){ return 3; } // 1110 0000
    if($byte >= 192){ return 2; } // 1100 0000
    if($byte >= 128){ return 0; } // 1000 0000
    else{ return 1; }             // 0000 0000
}
// "Darts" was a template engine that ran ratfactor.com for over a decade.
// To run this, you'll have to remove it and replace its logic with vanilla PHP
// Try: $_POST[...]
if(Darts::isWebData('stuff')){
    $results = '';
    $escape = false;   // default: plain hex output
    if(Darts::isWebData('escape')) $escape = true;
    $output = array(); // decoded code points
    $outputnum = 0;    // index of the next code point
    $stuff = Darts::webData("stuff");
    $len = strlen($stuff);
    $results .= "$stuff <i>($len bytes)</i>: ";
    for($offset=0; $offset<$len; ){ // until end of string
        $output[$outputnum] = 0; // initialize output
        $first_byte = getByte($offset);
        $numBytes = getNum($first_byte);
        if(!$numBytes){ // stray continuation byte: report it and skip it
            $results .= "<b>Error! Unexpected 10b!</b>";
            $offset++; // without this the loop would never advance
            continue;
        }
        for($x=0; $x<$numBytes; $x++){ // do $numBytes of bytes
            $bytes[$x] = getByte($offset);
            $thistype = getNum($bytes[$x]);
            switch($thistype){ // mask it!
                case 4: $bytes[$x] &= 7; break;  // 1111 0xxx
                case 3: $bytes[$x] &= 15; break; // 1110 xxxx
                case 2: $bytes[$x] &= 31; break; // 110x xxxx
                case 0: $bytes[$x] &= 63; break; // 10xx xxxx
            }
            switch(($numBytes-1) - $x){
                case 3: $bytes[$x] *= 262144; break; // 3 bytes over
                case 2: $bytes[$x] *= 4096; break;   // 2 bytes over
                case 1: $bytes[$x] *= 64; break;     // 1 byte over
            }
            $output[$outputnum] += $bytes[$x]; // add it to output
            $offset++; // increment byte offset
        }
        $outputnum++; // increment output offset
    }
    if($escape) $results .= '"';
    for($i=0; $i<$outputnum; $i++){ // write the little stinker
        if($escape){
            $results .= "\u";
        }
        else{
            $results .= " ";
        }
        // pad with leading zeros to at least 4 hex digits
        $results .= str_pad(strtoupper(dechex($output[$i])), 4, "0", STR_PAD_LEFT);
    }
    if($escape) $results .= '"';
}