faceclick/data/README

This document explores the Emoji data and how to pare it down in different ways
to make a subset that:

  * Works for my intended audience
  * Is as small on disc as possible
  * Has a great keyword search feature

After some research, I'm going with the emojibase.dev data set, which is based
on the official Unicode data files. It has excellent keywords ("tags") for
searching the emoji, labels, grouping, etc.

-------------------------------------------------------------------------------

Maybe I'm a dunce, but I had a heck of a time figuring out where to get the
JSON files from the https://emojibase.dev website. But I eventually found the
CDN where the raw files are hosted. I got the full raw JSON file here:

    https://cdn.jsdelivr.net/npm/emojibase-data@16.0.3/en/

Initially, I ran the data through `jq` to pretty-print it for readability
while I was getting started and named it emojibase.pretty.json. So any
references to the "pretty" file are talking about that.

(Side note: I started learning enough of the jq command line tool to use it for
all of my JSON manipulation needs, but then I realized that I already had a
full language that I already knew at my disposal: Ruby. It's three lines to
read the file, and then I can just use familiar loops and such. You can
pretty-print the data and everything else.)

UPDATE: In September 2025, "demodulatingswan" sent me a jq version of the
customizer! See README-JQ.

The name, as downloaded, of the Emojibase data set in this directory is:

    data.json

That's the FULL data with labels, groups, keywords, etc.
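
The side note above says that reading the file in Ruby takes about three
lines. Here is a minimal sketch of what that might look like, assuming the
downloaded file is a JSON array of entries. The helper name write_pretty_json
is made up for this example and is not one of the tools in this directory.

```ruby
require "json"

# Read a JSON file, write a pretty-printed copy, and return the parsed data.
# (Hypothetical helper for illustration only.)
def write_pretty_json(src_path, dest_path)
  data = JSON.parse(File.read(src_path))
  File.write(dest_path, JSON.pretty_generate(data))
  data
end

# Only runs when executed directly and when data.json is actually present.
if __FILE__ == $PROGRAM_NAME && File.exist?("data.json")
  emoji = write_pretty_json("data.json", "emojibase.pretty.json")
  puts "#{emoji.length} entries"
end
```

Once the data is parsed, it is just ordinary Ruby arrays and hashes, so the
usual each/map/select loops apply.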

-------------------------------------------------------------------------------

Tools:

  customizer.rb - Makes personalized alterations to the full emojibase JSON

  getstats.rb   - Prints stats (including bytes used) about the relevant parts
                  of a given JSON file

  makejs.rb     - Processes an Emoji data set (presumably 'myemoj.json',
                  generated by customizer.rb) and generates JavaScript (pretty
                  much JSON, but namespaced to 'FC.data' for the final library
                  and with things that are not legal in JSON, such as comments
                  and trailing commas).

  go.sh         - Open it and see! (Automates my most common process, and is
                  currently changing rapidly. It will probably end up doing the
                  entire process of customizing the data, making an HTML
                  contact sheet (to see which emoji are used), and exporting
                  the JavaScript version of the data for use in the final
                  picker.)

  makesheet.rb  - Creates an HTML contact sheet for a given JSON input file.
                  The sheet is a single page with all of the emoji and labels
                  in tooltips.
                  (Now includes stats from getstats.rb!)

-------------------------------------------------------------------------------

Is it worth trying to "compress" via indexed keywords, etc?

Let's look at gzip compression for a rough idea (sizes in bytes):

                 uncompressed  compressed
    data               385566       31635
    w/ groups          420085       32258
    emoj list           23137        5281
    emoj txt            11699        4762

Most of the raw data stats below were gathered with getstats.rb.

-------------------------------------------------------------------------------

Full list (full-base-stats.rb):

    (file size of emojibase.pretty.json is 1174981 bytes)

    list len: 1941
    raw emoji len: 3377 (longer than list due to ligature combos!)
    raw emoji bytes: 12295 (much longer due to multibyte + ligatures)
    labels (bytes): 25721
    tags: 10108
    tags (bytes): 56816
    unique tags: 3615
    unique tags (bytes): 21079
    -----------------
    tags+labels+emoji (bytes): 94832

My list (myemoj.json):

    NOTE: The exact numbers below were out of date almost immediately because I
    found more items to remove. I'm not going to keep updating them here. But
    you can always re-run the script(s) for your data pleasures.

    File sizes:
      869445 With all emojibase data
      310256 With just labels, emoji, group, and tags

    Raw data:
      list len: 1778
      raw emoji len: 2758 (longer than list due to ligature combos!)
      raw emoji bytes: 10234 (much longer due to multibyte + ligatures)
      labels (bytes): 22946
      tags: 8885
      tags (bytes): 49539
      unique tags: 3571
      unique tags (bytes): 20776
      -----------------
      tags+labels+emoji (bytes): 82719

JSON vs Raw data - (all relating to myemoj.json)

    310256 Pretty formatted JSON
    185830 JSON      -124426 bytes (40% smaller than pretty)
     82719 Raw data  -103111 bytes (55% smaller than JSON)

So JSON encoding alone more than doubles the file size (about 2.2x the raw
data). Pretty JSON nearly quadruples it (about 3.8x).

I'll need SOME sort of encoding, and I suspect I'm going to end up with some
sort of hybrid with data packed into some sort of string. It will still be
lightning fast to chop up.

-------------------------------------------------------------------------------

New idea:

    Deconstruct labels into synthetic tags by splitting on space, then
    add those to the tag list and then re-construct the labels at runtime
    by using tag indexes!

    Here goes:

      78,393 mydata.js - with labels and tags
      88,663 mydata.js - with tags + label words + labels deconstructed
      87,932 mydata.js - same as above, but all lower case (not worth it)

    So that didn't work!
    The size went up because of all of the 4-digit index numbers.

    But... it got me thinking that part of the reason the result was so
    bulky is that the labels and tag references are quite redundant - I
    don't need to reference a tag from an emoji if I also reference that
    same tag from the emoji's label.

    So now I'm going to change it to a *simpler* system where I gather
    all "words" from both tags and labels:

      2811 unique words - tags
      3218 unique words - tags + labels split into words

    So labels only contain about 400 words not already in tags. This looks
    very promising!

    Also:
      3169 unique words - if we also make labels downcase...
      ...which I've decided to not do (It's commented out in customizer.rb)

    And then store the words to re-construct the labels first. And then
    ONLY store the tags that aren't already part of the label...

      74,057 mydata.js - yeah! that's 4kb smaller than the raw labels and tags

Conclusion: It's surprisingly hard to actually save any space when a small 4-digit
number is actually stored as 4 whole characters, plus the surrounding syntax
of an array [] and commas to separate values.

It *would* be quite interesting to pack bits...but I'm pretty sure the unpacking
code would eat up most of the savings, and I don't see any sense in making
it more obfuscated than it already is. Obfuscation has never been the goal...

In fact, rather than three separate lists, I think I should have the tags and
labels nested with the emoji so it's actually readable. I will pay for the
additional quotes '' around each emoji which comes to 2kb...hmm... Totally
worth it.

    Also, any reference to single-use words can add up to
    7 completely wasted characters.
    So I need to only store words that are used more than
    once...and even then, probably only words that are 2 or more
    characters in size:

      emoji | label with ref  | tags
      ------+-----------------+--------
      ["X",   "Big $23 dog",    ["+",34,15]]

    Wow, very surprised to find that there are only 404
    unique words once you de-dupe the synthetic tags from
    the labels.

-------------------------------------------------------------------------------
Next day:

Okay, so now I've got BOTH keyword lists (labels and tags) stored as
space-separated strings because that's way more compact (and readable!) than
a JS array and is trivial to split into an array in JS.

The tags and labels are de-duped *per emoji* because I'm going to search the
terms in both anyway. In fact, I think this will actually speed up search
on the other end if I don't ever even turn them into arrays, LOL. Kind of
amazing how going deep on a problem can really turn it inside-out and
end up simplifying...but I'm getting ahead of myself. Gotta see how big
the output is and then find the right blend of word usage vs word length.

I have two parameters I can mess with now to try to make it as compact
as possible:

    min_word_usage_count = 2
    min_word_length = 1

Those settings give me...

    65,514 mydata.js - Now we're talkin!

    The previous best was 74,057, so this is
    8.5Kb savings.

    So I want to test a small number of
    permutations to see if I can improve on that
    initial setting. I'm going to write a little
    script to automate testing...

    ruby word_experiment.rb

I put the bytes output at the beginning so I can sort, so let's see...

    62719 bytes. min usage count=4 min word length=4
    62759 bytes. min usage count=5 min word length=4
    62817 bytes. min usage count=3 min word length=5
    62837 bytes. min usage count=3 min word length=4
    62919 bytes. min usage count=4 min word length=5
    63073 bytes. min usage count=5 min word length=5
    63153 bytes. min usage count=5 min word length=3
    63210 bytes. min usage count=4 min word length=3
    63280 bytes. min usage count=5 min word length=1
    63280 bytes. min usage count=5 min word length=2
    63360 bytes. min usage count=4 min word length=2
    63388 bytes. min usage count=4 min word length=1
    63472 bytes. min usage count=3 min word length=3
    63631 bytes. min usage count=3 min word length=2
    63656 bytes. min usage count=2 min word length=5
    63763 bytes. min usage count=3 min word length=1
    64084 bytes. min usage count=2 min word length=4
    65049 bytes. min usage count=2 min word length=3
    65307 bytes. min usage count=2 min word length=2
    65514 bytes. min usage count=2 min word length=1
    73302 bytes. min usage count=1 min word length=5
    75830 bytes. min usage count=1 min word length=4
    77661 bytes. min usage count=1 min word length=3
    78104 bytes. min usage count=1 min word length=2
    78400 bytes. min usage count=1 min word length=1

Okay, that's awesome. The total size goes down as I increase the word usage
count and minimum word length...until we get to the magical value of 4 for each
and then it starts to creep back up again.

This was a highly variable problem, so an experiment was, by far, the easiest
and quickest way to find the optimal settings and shave off an additional
2.8Kb.

To be clear, at 74Kb, I had something pretty obfuscated, but at 63Kb, it is
waaaay more understandable ("readable" would probably be overstating it.)

Now to re-write everything that uses this data to see if it, you know, works!

2025-08-13 IT WORKS!!!! And the whole thing is under 70Kb (or a little over
if you include the CSS).
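
For anyone wanting to try the same kind of experiment, the parameter sweep
described above can be sketched roughly like this. To be clear, this is NOT
word_experiment.rb or customizer.rb; it is a small stand-alone illustration.
The sample data, the function names (word_counts, shared_words, encode,
decode_words, sweep), and the output layout are all made up for the example.
The "$<index>" reference style is borrowed from the "Big $23 dog" example, and
tags are assumed to be single words.

```ruby
require "json"

# Hypothetical sample data standing in for the customized emoji list.
SAMPLE = [
  { emoji: "🐶", label: "dog face",      tags: %w[dog pet face] },
  { emoji: "🐱", label: "cat face",      tags: %w[cat pet] },
  { emoji: "😀", label: "grinning face", tags: %w[smile happy] },
  { emoji: "😺", label: "grinning cat",  tags: %w[smile cat happy] }
].freeze

# Count how many emoji use each word (label words and tags, de-duped per emoji).
def word_counts(entries)
  counts = Hash.new(0)
  entries.each { |e| (e[:label].split | e[:tags]).each { |w| counts[w] += 1 } }
  counts
end

# Words worth moving into the shared list: used often enough and long enough.
def shared_words(entries, min_usage, min_length)
  word_counts(entries)
    .select { |word, count| count >= min_usage && word.length >= min_length }
    .keys
end

# Replace shared words with "$<index>" references; other words stay inline.
# Labels and tags are both stored as space-separated strings.
def encode(entries, min_usage, min_length)
  words = shared_words(entries, min_usage, min_length)
  index = words.each_with_index.to_h
  ref = ->(w) { index.key?(w) ? "$#{index[w]}" : w }
  rows = entries.map do |e|
    label_words = e[:label].split
    extra_tags = e[:tags] - label_words # skip tags already in the label
    [e[:emoji], label_words.map(&ref).join(" "), extra_tags.map(&ref).join(" ")]
  end
  { words: words.join(" "), rows: rows }
end

# Turn "$<index>" references in a string back into words.
def decode_words(text, words)
  text.split.map do |w|
    m = w.match(/\A\$(\d+)\z/)
    m ? words[m[1].to_i] : w
  end.join(" ")
end

# Size in bytes of the encoded data for one pair of settings.
def encoded_bytes(entries, min_usage, min_length)
  JSON.generate(encode(entries, min_usage, min_length)).bytesize
end

# Try every combination and sort by size, smallest first.
def sweep(entries, usages, lengths)
  usages.to_a.product(lengths.to_a)
        .map { |usage, length| [encoded_bytes(entries, usage, length), usage, length] }
        .sort
end

sweep(SAMPLE, 1..5, 1..5).each do |bytes, usage, length|
  puts "#{bytes} bytes. min usage count=#{usage} min word length=#{length}"
end
```

With only four sample emoji the numbers mean nothing, but the shape of the
experiment is the same: putting the byte count first means the results sort
with the smallest output at the top.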