UTF-8 to escape sequences

This used to be a PHP utility

Here are my notes and some ancient PHP source.

For a modern JavaScript utility that does exactly what my tool used to do, I recommend js-escapes.

UTF-8 Conversion Notes

The following are shorthand notes for the conversion. For a thorough discussion (and for my notes below to mean anything), I recommend the Wikipedia article on UTF-8.

Rules for determining the number of bytes used by a character (check them in this order; the first match wins):

if first_byte >= 1111 0000 (F0h, 240d) then there will be 4 bytes in this character
if first_byte >= 1110 0000 (E0h, 224d) then there will be 3 bytes in this character
if first_byte >= 1100 0000 (C0h, 192d) then there will be 2 bytes in this character
if first_byte >= 1000 0000 (80h, 128d) then ERROR, this is a continuation byte!
else there will be 1 byte in this character
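
As a rough illustration of the rules above, here is a minimal sketch in Python (my choice for the example; the old tool was PHP). The function name utf8_sequence_length is made up for this sketch, and returning 0 for a continuation byte is just one way to signal the error case. It compares with >= so that the boundary values F0h, E0h, C0h and 80h themselves land in the right class. Like the notes, it does not reject lead bytes above F7h, which are not valid UTF-8.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return how many bytes the character starting with first_byte uses.

    Returns 0 when first_byte is a continuation byte (10xx xxxx),
    which cannot start a character. The name and the 0 convention
    are assumptions made for this sketch.
    """
    if first_byte >= 0xF0:  # 1111 0xxx
        return 4
    if first_byte >= 0xE0:  # 1110 xxxx
        return 3
    if first_byte >= 0xC0:  # 110x xxxx
        return 2
    if first_byte >= 0x80:  # 10xx xxxx, continuation byte
        return 0
    return 1                # 0xxx xxxx, plain ASCII


# Compare the predicted length with what Python's own encoder produced
# for "A", e-acute, the euro sign, and an emoji.
for text in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    encoded = text.encode("utf-8")
    print(encoded.hex(), utf8_sequence_length(encoded[0]), len(encoded))
```

For each sample the last two numbers printed should agree, since the lead byte alone determines the length of a well-formed sequence.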

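To build the Unicode code point from the UTF-8 bytes, AND each byte with its mask, multiply the result by the multiplier for its position, and add up the products: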

                                                    AND MASK
  Byte                                        Bin       Hex    Dec
  -----------------------------------------------------------------
  1st byte of 4 byte character (1111 0xxx)       111      7      7
  1st byte of 3 byte character (1110 xxxx)      1111      F     15
  1st byte of 2 byte character (110x xxxx)     11111     1F     31
  2nd, 3rd, or 4th byte        (10xx xxxx)    111111     3F     63
  1st byte of 1 byte character (0xxx xxxx)       n/a    n/a    n/a


                                               Multiplier
  Byte                                   Bin        Hex    Dec
  ------------------------------------------------------------------
  1 of 4                   1 000000 000000 000000   40000  262144
  1 of 3; 2 of 4                  1 000000 000000    1000    4096
  1 of 2; 2 of 3; 3 of 4                 1 000000      40      64
  1 of 1; 2 of 2; 3 of 3; 4 of 4              n/a     n/a     n/a
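
To see the two tables working together, here is a minimal Python sketch (Python is my choice for the example; the name decode_utf8_char and the lookup dictionaries are made up for it). It masks each byte, multiplies by the multiplier for its position, and sums the results. It assumes it is handed exactly one well-formed character and does no validation at all.

```python
# AND masks for the first byte, keyed by sequence length (mask table).
# The table lists n/a for a 1-byte character; 0x7F is a harmless no-op there.
LEAD_MASK = {1: 0x7F, 2: 0x1F, 3: 0x0F, 4: 0x07}
CONTINUATION_MASK = 0x3F
# Multipliers keyed by how many bytes follow this one (multiplier table).
# The last byte of a character is n/a in the table, i.e. multiplied by 1.
MULTIPLIER = {0: 1, 1: 64, 2: 4096, 3: 262144}


def decode_utf8_char(data: bytes) -> int:
    """Combine the bytes of one UTF-8 character into a code point.

    Assumes data holds exactly one well-formed character (1 to 4 bytes).
    """
    count = len(data)
    code_point = 0
    for index, byte in enumerate(data):
        mask = LEAD_MASK[count] if index == 0 else CONTINUATION_MASK
        bytes_after = count - 1 - index
        code_point += (byte & mask) * MULTIPLIER[bytes_after]
    return code_point


# The euro sign is E2 82 AC in UTF-8:
#   (E2 & 0F) * 4096 = 2 * 4096 = 8192
#   (82 & 3F) * 64   = 2 * 64   =  128
#   (AC & 3F)                   =   44
#   8192 + 128 + 44 = 8364 = 20AC hex
print(f"{decode_utf8_char(bytes([0xE2, 0x82, 0xAC])):04X}")  # prints 20AC
```

Multiplying by 64, 4096 and 262144 is the same as shifting left by 6, 12 and 18 bits; the sketch keeps the multipliers so it reads like the tables.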

My old utility

For posterity (or laughs?), here is the ancient PHP that was still running up until 2018 to do these conversions for the poor souls who needed it:

<?php

// It was quite amusing to try to do this without making
// the browser use UTF-8 encoding in the early 2000s when
// I wrote this!
header("Content-type: text/html; charset=utf-8");

function getByte($offset){
  global $stuff;
  // note: bah is 'foo' in sheep
  $hexbyte = unpack("@{$offset}/C1bah", $stuff);
  return $hexbyte["bah"];
}

function getNum($byte){
    if($byte >= 240){ return 4; } // 1111 0000
    if($byte >= 224){ return 3; } // 1110 0000
    if($byte >= 192){ return 2; } // 1100 0000
    if($byte >= 128){ return 0; } // 1000 0000
    else{            return 1; } // 0000 0000
}

// "Darts" was a template engine that ran ratfactor.com for over a decade.
// To run this, you'll have to remove it and replace its logic with vanilla PHP
//   Try: $_POST[...]

if(Darts::isWebData('stuff')){
  $results = '';

  $escape = Darts::isWebData('escape');           // always defined: true or false
  $stuff = Darts::webData("stuff");

  $len = strlen($stuff);
  $results .= "$stuff <i>($len bytes)</i>: ";

  $output = array();                              // decoded code points
  $outputnum = 0;                                 // index of the next code point
  for($offset=0; $offset<$len; ){                 // until end of string
    $output[$outputnum] = 0;                      // initialize output
    $first_byte = getByte($offset);
    $numBytes = getNum($first_byte);
    if(!$numBytes){                               // stray continuation byte
      $results .= "<b>Error! Unexpected 10b!</b>";
      $offset++;                                  // skip it, or we loop forever
      continue;
    }

    for($x=0; $x<$numBytes; $x++){                // do $numBytes of bytes
      $bytes[$x] = getByte($offset);
      $thistype = getNum($bytes[$x]);
      switch($thistype){                          // mask it!
      case 4: $bytes[$x] &= 7;  break;            // 1111 0xxx
      case 3: $bytes[$x] &= 15; break;            // 1110 xxxx
      case 2: $bytes[$x] &= 31; break;            // 110x xxxx
      case 0: $bytes[$x] &= 63; break;            // 10xx xxxx
      }
      switch(($numBytes-1) - $x){
      case 3: $bytes[$x] *= 262144; break;        // 3 bytes over
      case 2: $bytes[$x] *= 4096;   break;        // 2 bytes over
      case 1: $bytes[$x] *= 64;     break;        // 1 byte  over
      }
      $output[$outputnum] += $bytes[$x];          // add it to output
      $offset++;                                  // increment byte offset
    }
    $outputnum++;                                 // increment output offset
  }

  if($escape) $results .= '"';

  for($i=0; $i<$outputnum; $i++){                 // write the little stinker
    if($escape){
      $results .= '\u';                           // single quotes: literal backslash-u
    }
    else{
      $results .= " ";
    }
    // pad with leading zeros to at least four hex digits
    $results .= str_pad(strtoupper(dechex($output[$i])), 4, "0", STR_PAD_LEFT);
  }

  if($escape) $results .= '"';
}