UTF-8 to Unicode Converter

Get Unicode Character Codes

This converts UTF-8 strings to sequences of hexadecimal Unicode character codes (code points). This can be handy when a program or tool only accepts escaped character codes rather than the characters themselves.
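As a rough illustration of the kind of output involved, here is a minimal Python sketch. It is not the code behind this page, and the function name to_hex_codes is made up for the example; it simply leans on Python's built-in UTF-8 decoder.

```python
def to_hex_codes(utf8_bytes):
    """Decode UTF-8 bytes and return each character's code point as hex text."""
    text = utf8_bytes.decode("utf-8")  # raises UnicodeDecodeError on invalid input
    # ord() gives the Unicode code point; format it as at least 4 hex digits.
    return [format(ord(ch), "04X") for ch in text]


# "A", e-acute, and the euro sign, written as escapes to keep this file ASCII.
sample = "A\u00e9\u20ac".encode("utf-8")
print(to_hex_codes(sample))  # ['0041', '00E9', '20AC']
```

The four-digit padding is only a presentation choice for this sketch; characters above FFFFh simply come out with five or six digits.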

Just as an example, Adobe (or was it Macromedia then?) Flash originally did not support non-European characters in the development interface, but it would display them in the runtime environment. The workaround was to put the characters into Flash as long strings of escaped Unicode hexadecimal codes.


UTF-8 Conversion Notes

The following are shorthand notes for the conversion. For a thorough discussion (and for my notes below to mean anything), I recommend the Wikipedia article on UTF-8.

Rules for determining the number of bytes used by a character (test the first byte in this order):

if first_byte >= 1111 1000 (F8h, 248d) then ERROR, this byte never appears in UTF-8
if first_byte >= 1111 0000 (F0h, 240d) then there will be 4 bytes in this character
if first_byte >= 1110 0000 (E0h, 224d) then there will be 3 bytes in this character
if first_byte >= 1100 0000 (C0h, 192d) then there will be 2 bytes in this character
if first_byte >= 1000 0000 (80h, 128d) then ERROR, this is a continuation byte!
else there will be 1 byte in this character
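The rules above can be sketched in Python roughly as follows. This is an illustration only: the function name sequence_length is hypothetical, the comparisons are written as "greater than or equal" so that boundary bytes such as F0h and C0h are classified correctly, and the extra check for F8h and above is my addition, since those bytes are not valid in UTF-8.

```python
def sequence_length(first_byte):
    """Return how many bytes a UTF-8 character uses, judged by its first byte."""
    if first_byte >= 0xF8:  # 1111 1xxx: not valid anywhere in UTF-8
        raise ValueError("invalid UTF-8 byte: %02Xh" % first_byte)
    if first_byte >= 0xF0:  # 1111 0xxx
        return 4
    if first_byte >= 0xE0:  # 1110 xxxx
        return 3
    if first_byte >= 0xC0:  # 110x xxxx
        return 2
    if first_byte >= 0x80:  # 10xx xxxx: continuation byte, cannot start a character
        raise ValueError("unexpected continuation byte: %02Xh" % first_byte)
    return 1                # 0xxx xxxx: plain ASCII


for b in (0x41, 0xC3, 0xE2, 0xF0):
    print("%02Xh -> %d byte(s)" % (b, sequence_length(b)))
```

The order of the tests matters: each check relies on the larger ranges above it having already been ruled out.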

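To create the Unicode character code from a UTF-8 multibyte sequence, AND each byte with its mask (first table), multiply the result by the multiplier for that byte's position (second table), and add the results together: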
                                                    AND MASK
  Byte                                        Bin       Hex    Dec
  -----------------------------------------------------------------
  1st byte of 4 byte character (1111 0xxx)       111      7      7
  1st byte of 3 byte character (1110 xxxx)      1111      F     15
  1st byte of 2 byte character (110x xxxx)     11111     1F     31
  2nd, 3rd, or 4th byte        (10xx xxxx)    111111     3F     63
  1st byte of 1 byte character (0xxx xxxx)       n/a    n/a    n/a


                                               Multiplier
  Byte                                   Bin        Hex    Dec
  ------------------------------------------------------------------
  1 of 4                   1 000000 000000 000000   40000  262144
  1 of 3; 2 of 4                  1 000000 000000    1000    4096
  1 of 2; 2 of 3; 3 of 4                 1 000000      40      64
  1 of 1; 2 of 2; 3 of 3; 4 of 4              n/a     n/a     n/a
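As a worked example, the euro sign is stored in UTF-8 as the three bytes E2 82 AC. Masking and multiplying gives (E2h AND 0Fh) * 4096 = 8192, (82h AND 3Fh) * 64 = 128, and (ACh AND 3Fh) = 44, which add up to 8364, or 20ACh.

The same arithmetic can be sketched in Python. This is a minimal illustration, not the code behind this page: the names FIRST_BYTE_MASK, CONTINUATION_MASK, and decode_character are invented for the example. Where the tables say n/a, the sketch assumes a mask of 7Fh for a 1 byte character and a multiplier of 1 for the last byte, both of which leave the value unchanged.

```python
# AND mask for the first byte, keyed by the number of bytes in the character.
# The 1-byte entry (7Fh) is an assumption standing in for "n/a" in the table.
FIRST_BYTE_MASK = {1: 0x7F, 2: 0x1F, 3: 0x0F, 4: 0x07}
CONTINUATION_MASK = 0x3F  # for the 2nd, 3rd, or 4th byte (10xx xxxx)


def decode_character(seq):
    """Return the Unicode code point for one complete UTF-8 character."""
    n = len(seq)
    if n not in FIRST_BYTE_MASK:
        raise ValueError("a UTF-8 character is 1 to 4 bytes long")
    code_point = 0
    for i, byte in enumerate(seq):
        if i == 0:
            mask = FIRST_BYTE_MASK[n]
        else:
            if byte & 0xC0 != 0x80:
                raise ValueError("byte %d is not a continuation byte" % (i + 1))
            mask = CONTINUATION_MASK
        # Each later position carries 6 fewer bits of weight:
        # multipliers are 262144, 4096, 64, and finally 1.
        multiplier = 64 ** (n - 1 - i)
        code_point += (byte & mask) * multiplier
    return code_point


print(format(decode_character(b"\xE2\x82\xAC"), "04X"))  # 20AC
```

The sketch trusts the caller to pass exactly one character's bytes and does not check that the first byte matches the sequence length, nor does it reject overlong encodings or surrogate code points, which a strict decoder would.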

This Page

To make sure the page displays correctly and receives your text input as UTF-8, I use PHP's header() function to send the following HTTP header with the server's response:

header("Content-type: text/html; charset=utf-8");