faceclick/data/README

This document explores the Emoji data and how to pare it down in different ways
to make a subset that:

* Works for my intended audience
* Is as small on disk as possible
* Has a great keyword search feature

After some research, I'm going with the emojibase.dev data set, which is based
on the official Unicode data files. It has excellent keywords ("tags") for
searching the emoji, labels, grouping, etc.

-------------------------------------------------------------------------------

Maybe I'm a dunce, but I had a heck of a time figuring out where to get the
JSON files from the https://emojibase.dev website. But I eventually found the
CDN where the raw files are hosted. I got the full raw JSON file here:

https://cdn.jsdelivr.net/npm/emojibase-data@16.0.3/en/

Initially, I ran the data through `jq` to pretty-print it for readability
while I was getting started and named it emojibase.pretty.json. So any
references to the "pretty" file are talking about that.

(Side note: I started learning enough of the jq command line tool to use it for
all of my JSON manipulation needs, but then I realized that I already had a
full language I knew well at my disposal: Ruby. It's three lines to read the
file, and then I can just use familiar loops and such. You can pretty-print
the data and everything else.)
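
For reference, the Ruby version of that workflow looks something like the
sketch below. The inline sample data is a stand-in so the snippet runs
anywhere; the real scripts read data.json from disk.

```ruby
require 'json'

# The real scripts do JSON.parse(File.read('data.json')); a tiny inline
# sample stands in here so the snippet is self-contained.
raw = '[{"label": "grinning face", "hexcode": "1F600", "tags": ["face", "grin"]}]'
data = JSON.parse(raw)

# Familiar Ruby loops instead of jq filters:
labels = data.map { |e| e['label'] }

# ...and pretty-printing is one method call:
pretty = JSON.pretty_generate(data)
```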

The name, as downloaded, of the Emojibase data set in this directory is:

    data.json

That's the FULL data with labels, groups, keywords, etc.

-------------------------------------------------------------------------------

Tools:

customizer.rb - Makes personalized alterations to the full emojibase JSON

getstats.rb   - Prints stats (including bytes used) about the relevant parts
                of a given JSON file

makejs.rb     - Processes an Emoji data set (presumably 'myemoj.json',
                generated by customizer.rb) and generates JavaScript (pretty
                much JSON, but namespaced to 'FC.data' for the final library,
                and with things that are not legal in JSON, such as comments
                and trailing commas).

go.sh         - Open it and see! (Automates my most common process, and is
                currently changing rapidly. It will probably end up doing the
                entire process of customizing the data, making an HTML
                contact sheet (to see which emoji are used), and exporting the
                JavaScript version of the data for use in the final picker.)

makesheet.rb  - Creates an HTML contact sheet for a given JSON input file.
                Sheet is a single page with all of the emoji and labels in
                tooltips.
                (Now includes stats from getstats.rb!)

-------------------------------------------------------------------------------

Is it worth trying to "compress" via indexed keywords, etc.?

Let's look at gzip compression for a rough idea:

                 uncompressed   compressed
    data               385566        31635
    w/ groups          420085        32258
    emoj list           23137         5281
    emoj txt            11699         4762
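
A rough sketch of how numbers like these can be gathered with Ruby's built-in
Zlib (the gzip(1) command line tool will differ by a few header bytes):

```ruby
require 'zlib'

# Compare raw byte size to gzip-compressed size for a string,
# roughly what `gzip -9 < file | wc -c` reports.
def gzip_stats(text)
  [text.bytesize, Zlib.gzip(text, level: Zlib::BEST_COMPRESSION).bytesize]
end

# Repetitive text compresses extremely well, just like the keyword data:
raw, packed = gzip_stats('face grin smile ' * 500)
```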

Most of the raw data stats below were gathered with getstats.rb.

-------------------------------------------------------------------------------

Full list (full-base-stats.rb):

(file size of emojibase.pretty.json is 1174981 bytes)

    list len:                   1941
    raw emoji len:              3377  (longer than list due to ligature combos!)
    raw emoji bytes:           12295  (much longer due to multibyte + ligatures)
    labels (bytes):            25721
    tags:                      10108
    tags (bytes):              56816
    unique tags:                3615
    unique tags (bytes):       21079
    -----------------
    tags+labels+emoji (bytes): 94832

My list (myemoj.json):

NOTE: The exact numbers below were out of date almost immediately because I
found more items to remove. I'm not going to keep updating them here. But
you can always re-run the script(s) for your data pleasures.

File sizes:
    869445  With all emojibase data
    310256  With just labels, emoji, group, and tags

Raw data:
    list len:                   1778
    raw emoji len:              2758  (longer than list due to ligature combos!)
    raw emoji bytes:           10234  (much longer due to multibyte + ligatures)
    labels (bytes):            22946
    tags:                       8885
    tags (bytes):              49539
    unique tags:                3571
    unique tags (bytes):       20776
    -----------------
    tags+labels+emoji (bytes): 82719

JSON vs Raw data - (all relating to myemoj.json)

    310256  Pretty formatted JSON
    185830  JSON      -124426 bytes (40% smaller than pretty)
     82719  Raw data  -103111 bytes (55% smaller than JSON)

So JSON encoding alone more than doubles the size of the raw data, and
pretty-printed JSON nearly quadruples it.

I'll need SOME sort of encoding, and I suspect I'm going to end up with some
sort of hybrid with data packed into some sort of string. It will still be
lightning fast to chop up.

-------------------------------------------------------------------------------

New idea:

Deconstruct labels into synthetic tags by splitting on space, then add those
to the tag list, and then re-construct the labels at runtime by using tag
indexes!

Here goes:

    78,393  mydata.js - with labels and tags
    88,663  mydata.js - with tags + label words + labels deconstructed
    87,932  mydata.js - same as above, but all lower case (not worth it)

So that didn't work! The size went up because of all of the 4-digit
index numbers.

But... it got me thinking that part of the reason the result was so
bulky is that the labels and tag references are quite redundant - I
don't need to reference a tag from an emoji if I also reference that
same tag from the emoji's label.

So now I'm going to change it to a *simpler* system where I gather
all "words" from both tags and labels:

    2811 unique words - tags
    3218 unique words - tags + labels split into words

So labels only contain about 400 words not already in tags. This looks
very promising!
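
The word-gathering step above boils down to something like this sketch (the
two sample entries are toy data in the rough shape of the emojibase records):

```ruby
# Count unique "words" from tags alone vs tags plus label words,
# mirroring the comparison in the stats above.
emoji = [
  { 'label' => 'grinning face', 'tags' => ['face', 'grin', 'smile'] },
  { 'label' => 'red heart',     'tags' => ['heart', 'love'] },
]

tag_words   = emoji.flat_map { |e| e['tags'] }.uniq
label_words = emoji.flat_map { |e| e['label'].split(' ') }.uniq
all_words   = (tag_words + label_words).uniq

# Labels mostly re-use tag vocabulary; only the leftovers cost anything.
new_from_labels = all_words.size - tag_words.size
```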

Also:

    3169 unique words - if we also make labels downcase...
    ...which I've decided not to do (it's commented out in customizer.rb)

And then: store the words needed to re-construct the labels first, and then
ONLY store the tags that aren't already part of the label...

    74,057  mydata.js - yeah! that's 4kb smaller than the raw labels and tags

Conclusion: It's surprisingly hard to actually save any space when a small
4-digit number is actually stored as 4 whole characters, plus the surrounding
syntax of an array [] and commas to separate values.

It *would* be quite interesting to pack bits...but I'm pretty sure the
unpacking code would eat up most of the savings, and I don't see any sense in
making it more obfuscated than it already is. Obfuscation has never been the
goal...

In fact, rather than three separate lists, I think I should have the tags and
labels nested with the emoji so it's actually readable. I will pay for the
additional quotes '' around each emoji, which comes to 2kb...hmm... Totally
worth it.

Also, any reference to single-use words can add up to 7 completely wasted
characters. So I need to only store words that are used more than once...and
even then, probably only words that are 2 or more characters in size:

    emoji | label with ref | tags
    ------+----------------+--------
    ["X", "Big $23 dog", ["+",34,15]]
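
Decoding that layout could look like the sketch below. The `$N` marker and
the word table here are my invention for illustration, not necessarily the
project's actual format:

```ruby
# Hypothetical decoder: `$N` in a label is an index into a shared word
# table, so a repeated word is stored once; single-use words stay inline.
def expand_label(label, words)
  label.gsub(/\$(\d+)/) { words[Regexp.last_match(1).to_i] }
end

words = ['face', 'dog', 'heart']     # toy stand-in for the shared table
expand_label('Big $1 barks', words)  # index 1 expands back to 'dog'
```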

Wow, very surprised to find that there are only 404 unique words once you
de-dupe the synthetic tags from the labels.

-------------------------------------------------------------------------------
Next day:

Okay, so now I've got BOTH keyword lists (labels and tags) stored as
space-separated strings because that's way more compact (and readable!) than
a JS array, and is trivial to split into an array in JS.

The tags and labels are de-duped *per emoji* because I'm going to search the
terms in both anyway. In fact, I think this will actually speed up search
on the other end if I don't ever even turn them into arrays, LOL. Kind of
amazing how going deep on a problem can really turn it inside-out and
end up simplifying...but I'm getting ahead of myself. Gotta see how big
the output is and then find the right blend of word usage vs word length.
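
The "never even turn them into arrays" idea can be sketched like this (the
entry shape here is an assumption; the real picker does this in JavaScript,
but the technique is the same):

```ruby
# Search the space-separated term strings directly via substring match -
# no splitting into arrays at all.
ENTRIES = [
  ['A', 'grinning face', 'smile happy'],
  ['B', 'red heart',     'love valentine'],
]

def search(entries, query)
  q = query.downcase
  entries.select { |_emoji, label, tags| label.downcase.include?(q) || tags.include?(q) }
end
```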

I have two parameters I can mess with now to try to make it as compact
as possible:

    min_word_usage_count = 2
    min_word_length = 1

Those settings give me...

    65,514  mydata.js - Now we're talkin!

The previous best was 74,057, so this is an 8.5Kb savings.

So I want to test a small number of permutations to see if I can improve on
that initial setting. I'm going to write a little script to automate
testing...

    ruby word_experiment.rb
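
The experiment script presumably has roughly this shape: sweep both knobs,
measure the generated output, and sort by size. `build_output` below is a toy
stand-in for the real customizer + makejs pipeline:

```ruby
# Parameter sweep over the two knobs, smallest output first.
# build_output is a hypothetical stand-in whose sizes are made up so the
# sweep has a unique minimum; the real script measures mydata.js.
def build_output(min_count, min_len)
  'x' * (60_000 + (min_count - 4).abs * 500 + (min_len - 4).abs * 300)
end

results = []
(1..5).each do |min_count|
  (1..5).each do |min_len|
    results << [build_output(min_count, min_len).bytesize, min_count, min_len]
  end
end
results.sort!
best = results.first  # [bytes, min_count, min_len] of the smallest output
```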

I put the bytes output at the beginning so I can sort, so let's see...

    62719 bytes. min usage count=4 min word length=4
    62759 bytes. min usage count=5 min word length=4
    62817 bytes. min usage count=3 min word length=5
    62837 bytes. min usage count=3 min word length=4
    62919 bytes. min usage count=4 min word length=5
    63073 bytes. min usage count=5 min word length=5
    63153 bytes. min usage count=5 min word length=3
    63210 bytes. min usage count=4 min word length=3
    63280 bytes. min usage count=5 min word length=1
    63280 bytes. min usage count=5 min word length=2
    63360 bytes. min usage count=4 min word length=2
    63388 bytes. min usage count=4 min word length=1
    63472 bytes. min usage count=3 min word length=3
    63631 bytes. min usage count=3 min word length=2
    63656 bytes. min usage count=2 min word length=5
    63763 bytes. min usage count=3 min word length=1
    64084 bytes. min usage count=2 min word length=4
    65049 bytes. min usage count=2 min word length=3
    65307 bytes. min usage count=2 min word length=2
    65514 bytes. min usage count=2 min word length=1
    73302 bytes. min usage count=1 min word length=5
    75830 bytes. min usage count=1 min word length=4
    77661 bytes. min usage count=1 min word length=3
    78104 bytes. min usage count=1 min word length=2
    78400 bytes. min usage count=1 min word length=1

Okay, that's awesome. The total size goes down as I increase the word usage
count and minimum word length...until we get to the magical value of 4 for
each, and then it starts to creep back up again.

This was a highly variable problem, so an experiment was, by far, the easiest
and quickest way to find the optimal settings and shave off an additional
2.8Kb.

To be clear, at 74Kb, I had something pretty obfuscated, but at 63Kb, it is
waaaay more understandable ("readable" would probably be overstating it).

Now to re-write everything that uses this data to see if it, you know, works!

2025-08-13 IT WORKS!!!! And the whole thing is under 70Kb (or a little over
if you include the CSS).