faceclick/data/README

This document explores the emoji data and how to pare it down in different
ways to make a subset that:

  * Works for my intended audience
  * Is as small on disk as possible
  * Has a great keyword search feature

After some research, I'm going with the emojibase.dev data set, which is based
on the official Unicode data files. It has excellent keywords ("tags") for
searching the emoji, labels, grouping, etc.

-------------------------------------------------------------------------------

Maybe I'm a dunce, but I had a heck of a time figuring out where to get the
JSON files from the https://emojibase.dev website. But I eventually found the
CDN where the raw files are hosted. I got the full raw JSON file here:

  https://cdn.jsdelivr.net/npm/emojibase-data@16.0.3/en/

Initially, I ran the data through `jq` to pretty-print it for readability while
I was getting started and named it emojibase.pretty.json. So any references to
the "pretty" file are talking about that.

(Side note: I started learning enough of the jq command-line tool to use it for
all of my JSON manipulation needs, but then I realized I already had a full
language I knew at my disposal: Ruby. It's three lines to read the file, and
then I can use familiar loops and such, pretty-print the data, and everything
else.)

The name, as downloaded, of the Emojibase data set in this directory is:

  data.json

That's the FULL data with labels, groups, keywords, etc.

-------------------------------------------------------------------------------

Tools:

customizer.rb - Makes personalized alterations to the full emojibase JSON.

getstats.rb   - Prints stats (including bytes used) about the relevant parts of
                a given JSON file.

makejs.rb     - Processes an emoji data set (presumably 'myemoj.json',
                generated by customizer.rb) and generates JavaScript (pretty
                much JSON, but namespaced to 'FC.data' for the final library,
                and with things that are not legal in JSON, such as comments
                and trailing commas).

go.sh         - Open it and see! (Automates my most common process, and is
                currently changing rapidly. It will probably end up doing the
                entire process of customizing the data, making an HTML contact
                sheet (to see which emoji are used), and exporting the
                JavaScript version of the data for use in the final picker.)

makesheet.rb  - Creates an HTML contact sheet for a given JSON input file. The
                sheet is a single page with all of the emoji and labels in
                tooltips. (Now includes stats from getstats.rb!)

-------------------------------------------------------------------------------

Is it worth trying to "compress" via indexed keywords, etc.? Let's look at gzip
compression for a rough idea:

               uncompressed   compressed
  data               385566        31635
  w/ groups          420085        32258
  emoj list           23137         5281
  emoj txt            11699         4762

Most of the raw data stats below were gathered with getstats.rb.

-------------------------------------------------------------------------------

Full list (full-base-stats.rb):

  (file size of emojibase.pretty.json is 1174981 bytes)

  list len:                    1941
  raw emoji len:               3377  (longer than list due to ligature combos!)
  raw emoji bytes:            12295  (much longer due to multibyte + ligatures)
  labels (bytes):             25721
  tags:                       10108
  tags (bytes):               56816
  unique tags:                 3615
  unique tags (bytes):        21079
  ---------------------------------
  tags+labels+emoji (bytes):  94832
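(For reference, the kind of counting getstats.rb does is simple enough to
sketch. This is NOT the actual script, and the field names "emoji", "label",
and "tags" are assumed from the emojibase-data layout, but it shows the shape
of it:)

  # stats_sketch.rb - rough sketch only; the real getstats.rb may differ.
  require "json"

  data = JSON.parse(File.read(ARGV[0] || "data.json"))

  emoji  = data.map { |e| e["emoji"] }            # assumed field names
  labels = data.map { |e| e["label"] }
  tags   = data.flat_map { |e| e["tags"] || [] }

  puts "list len:            #{data.length}"
  puts "raw emoji bytes:     #{emoji.join.bytesize}"
  puts "labels (bytes):      #{labels.join.bytesize}"
  puts "tags:                #{tags.length}"
  puts "tags (bytes):        #{tags.join.bytesize}"
  puts "unique tags:         #{tags.uniq.length}"
  puts "unique tags (bytes): #{tags.uniq.join.bytesize}"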
My list (myemoj.json):

NOTE: The exact numbers below were out of date almost immediately because I
found more items to remove. I'm not going to keep updating them here. But you
can always re-run the script(s) for your data pleasures.

  File sizes:
    869445  With all emojibase data
    310256  With just labels, emoji, group, and tags

  Raw data:
    list len:                    1778
    raw emoji len:               2758  (longer than list due to ligature combos!)
    raw emoji bytes:            10234  (much longer due to multibyte + ligatures)
    labels (bytes):             22946
    tags:                        8885
    tags (bytes):               49539
    unique tags:                 3571
    unique tags (bytes):        20776
    ---------------------------------
    tags+labels+emoji (bytes):  82719

JSON vs raw data (all relating to myemoj.json):

  310256  Pretty-formatted JSON
  185830  JSON       -124426 bytes  (40% smaller than pretty)
   82719  Raw data   -103111 bytes  (55% smaller than JSON)

So JSON encoding alone more than doubles the size of the raw data, and pretty
JSON nearly quadruples it. I'll need SOME sort of encoding, and I suspect I'm
going to end up with some sort of hybrid with the data packed into strings. It
will still be lightning fast to chop up.

-------------------------------------------------------------------------------

New idea: Deconstruct labels into synthetic tags by splitting on spaces, add
those to the tag list, and then re-construct the labels at runtime using tag
indexes! Here goes:

  78,393  mydata.js - with labels and tags
  88,663  mydata.js - with tags + label words + labels deconstructed
  87,932  mydata.js - same as above, but all lowercase (not worth it)

So that didn't work! The size went up because of all of the 4-digit index
numbers. But... it got me thinking that part of the reason the result was so
bulky is that the label and tag references are quite redundant - I don't need
to reference a tag from an emoji if I also reference that same tag from the
emoji's label.

So now I'm going to change it to a *simpler* system where I gather all "words"
from both tags and labels:

  2811  unique words - tags
  3218  unique words - tags + labels split into words

So labels only contain about 400 words not already in tags. This looks very
promising! Also:

  3169  unique words - if we also downcase the labels...

...which I've decided not to do. (It's commented out in customizer.rb.)

And then store the words to re-construct the labels first. And then ONLY store
the tags that aren't already part of the label...

  74,057  mydata.js - yeah! That's 4KB smaller than the raw labels and tags.

Conclusion: It's surprisingly hard to actually save any space when a small
4-digit number is actually stored as 4 whole characters, plus the surrounding
syntax of an array [] and commas to separate the values. It *would* be quite
interesting to pack bits...but I'm pretty sure the unpacking code would eat up
most of the savings, and I don't see any sense in making it more obfuscated
than it already is. Obfuscation has never been the goal...

In fact, rather than three separate lists, I think I should have the tags and
labels nested with the emoji so it's actually readable. I will pay for the
additional quotes '' around each emoji, which comes to 2KB...hmm... Totally
worth it.

Also, any reference to a single-use word can add up to 7 completely wasted
characters. So I need to only store words that are used more than once...and
even then, probably only words that are 2 or more characters long:

  emoji | label with ref  | tags
  ------+-----------------+------------
  ["X",  "Big $23 dog",     ["+",34,15]]

Wow, very surprised to find that there are only 404 unique words once you
de-dupe the synthetic tags from the labels.
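That word-gathering/thresholding step is roughly this (a sketch only, NOT the
actual customizer.rb logic; the field names and the threshold values here are
assumptions):

  # word_gather_sketch.rb - count how often each word from tags + labels is
  # used, then keep only the words that clear the usage and length thresholds
  # as candidates for the shared word list.
  require "json"

  MIN_WORD_USAGE_COUNT = 2
  MIN_WORD_LENGTH      = 1

  data = JSON.parse(File.read("myemoj.json"))

  counts = Hash.new(0)
  data.each do |e|
    words = (e["tags"] || []) + (e["label"] || "").split(" ")
    words.uniq.each { |w| counts[w] += 1 }   # de-dupe per emoji
  end

  shared = counts.select do |word, count|
    count >= MIN_WORD_USAGE_COUNT && word.length >= MIN_WORD_LENGTH
  end

  puts "unique words: #{counts.size}, shared word list: #{shared.size}"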
-------------------------------------------------------------------------------

Next day:

Okay, so now I've got BOTH keyword lists (labels and tags) stored as
space-separated strings, because that's way more compact (and readable!) than a
JS array and is trivial to split into an array in JS. The tags and labels are
de-duped *per emoji* because I'm going to search the terms in both anyway. In
fact, I think this will actually speed up search on the other end if I don't
ever even turn them into arrays, LOL. (There's a tiny sketch of that idea at
the bottom of this file.) Kind of amazing how going deep on a problem can
really turn it inside-out and end up simplifying things...but I'm getting ahead
of myself. Gotta see how big the output is and then find the right blend of
word usage vs. word length.

I have two parameters I can mess with now to try to make it as compact as
possible:

  min_word_usage_count = 2
  min_word_length      = 1

Those settings give me...

  65,514  mydata.js - Now we're talkin'!

The previous best was 74,057, so that's an 8.5KB savings. So I want to test a
small number of permutations to see if I can improve on that initial setting.
I'm going to write a little script to automate the testing...

  ruby word_experiment.rb

I put the bytes output at the beginning so I can sort, so let's see...

  62719 bytes. min usage count=4 min word length=4
  62759 bytes. min usage count=5 min word length=4
  62817 bytes. min usage count=3 min word length=5
  62837 bytes. min usage count=3 min word length=4
  62919 bytes. min usage count=4 min word length=5
  63073 bytes. min usage count=5 min word length=5
  63153 bytes. min usage count=5 min word length=3
  63210 bytes. min usage count=4 min word length=3
  63280 bytes. min usage count=5 min word length=1
  63280 bytes. min usage count=5 min word length=2
  63360 bytes. min usage count=4 min word length=2
  63388 bytes. min usage count=4 min word length=1
  63472 bytes. min usage count=3 min word length=3
  63631 bytes. min usage count=3 min word length=2
  63656 bytes. min usage count=2 min word length=5
  63763 bytes. min usage count=3 min word length=1
  64084 bytes. min usage count=2 min word length=4
  65049 bytes. min usage count=2 min word length=3
  65307 bytes. min usage count=2 min word length=2
  65514 bytes. min usage count=2 min word length=1
  73302 bytes. min usage count=1 min word length=5
  75830 bytes. min usage count=1 min word length=4
  77661 bytes. min usage count=1 min word length=3
  78104 bytes. min usage count=1 min word length=2
  78400 bytes. min usage count=1 min word length=1

Okay, that's awesome. The total size goes down as I increase the word usage
count and the minimum word length...until we get to the magical value of 4 for
each, and then it starts to creep back up again. There were too many
interacting variables to predict this by hand, so an experiment was, by far,
the easiest and quickest way to find the optimal settings and shave off an
additional 2.8KB.

To be clear, at 74KB I had something pretty obfuscated, but at 63KB it is
waaaay more understandable ("readable" would probably be overstating it).

Now to re-write everything that uses this data to see if it, you know, works!

2025-08-13

IT WORKS!!!! And the whole thing is under 70KB (or a little over if you include
the CSS).
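And that search idea from above, sketched out: because the keywords are one
de-duped, space-separated string per emoji, a search is just a substring test
per entry - no splitting required. (Sketch only, written in Ruby to match the
rest of this file and using made-up sample data; the real picker does this in
JavaScript against FC.data, and its actual field layout may differ.)

  # search_sketch.rb - keyword search against space-separated strings.
  SAMPLE = [
    # [emoji, "label + tags as one de-duped, space-separated string"]
    ["🐕", "dog pet puppy"],
    ["🌭", "hot dog food sausage frankfurter"],
    ["😀", "grinning face smile happy"],
  ]

  def search(data, query)
    q = query.downcase
    data.select { |_emoji, keywords| keywords.include?(q) }
  end

  p search(SAMPLE, "dog").map(&:first)   # => ["🐕", "🌭"]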