JavaScript: Blank line scanning/parsing/lexing by hand vs regex
See also JavaScript string parsing performance test where I did a very similar performance test of
hand-written lexer/scanner versus regex. (Which you can run right on the page.) I also discuss
the merits of performance.now()
in testing in browsers.
For this particular test, I put the benchmark in a separate HTML file.
Check it out and run it here: line-scanning-bench.html
The benchmark finds blank line-separated segments (mostly paragraph breaks) in the complete text of Project Gutenberg’s edition of How to Live on 24 Hours a Day, which contains 91Kb of characters in 245 blank line-separated segments. (I also added variations of SPC, TAB, LF, and CR+LF to make sure each version was working correctly.)
Results
As with my previous test, regex won again. On short input (just 8 lines in 4 segments), my hand-written parser wins (up to twice as fast), but on larger amounts of input and on greater repetitions of the test, regex pulls away - quickly becoming about twice as fast.
The exact speeds depend on hardware and don’t really matter, but the relative differences do:
Repetitions | Hand-written scanner | Regex scanner |
---|---|---|
10 |
3 ms |
2 ms |
100 |
21 ms |
12 ms |
1000 |
232 ms |
91 ms |
10000 |
2069 ms |
960 ms |
(Smaller number is better.)
I ran each test multiple times and am showing the last time (the first "cold" time is always a bit longer - particularly for the regex version).
Another interesting result is that Firefox (SpiderMonkey) 143.0.4 was notably faster than Chrome (V8) 140.0.7339.80 in all of my testing. The relative difference between the hand-written scanner versus the regex scanner was about the same in both browsers. This test isn’t about testing browser differences, so I’m not even going to break those numbers out into a table - and you should do your own testing on your own system(s) if you want to get into that.
The two methods (source code)
Of course, maybe my hand-written scanner or regex-based scanners are deeply flawed. Judge for yourself!
Keep in mind that part of the difficulty of determining blank lines as I’m defining them is that:
-
A "blank" line can contain spaces and tabs
-
"Lines" are separated by either Unix/POSIX-style line endings: LR,
0x0A
or Web/Internet RFC-style line-endings: CR+LF0x0D 0x0A
(which are also used by Windows) See: https://en.wikipedia.org/wiki/Newline
I’ve cleaned up the source a little for readability, but these will be pretty much straight out of the rough-and-ready benchmark - bad variable names and all.
I’ll list the regex version first, which is shorter.
Regex scanner
NOTE: Variables start, end, and eof aren’t needed for the benchmark, but will be needed in my real-life project!
var start = 0; var end; var eof = false; var segment_count = 0; // g - populate and use the lastIndex property // d - provide substring match indices var blankrx = /\r?\n\s*\r?\n/gd; while(!eof){ z = blankrx.exec(input); segment_count++; // will include last one if(!z){ eof=true; break; } end = z.indices[0][0]; start = blankrx.lastIndex; }
Hand-written character-based scanner
NOTE: Some of the data I’m capturing here isn’t needed for the benchmark, but will be needed in my real-life project - so both benchmarks capture it.
var start = 0; var end; var txt; // the string we're parsin' var len; // length of the string var eof = false; // have we hit eof? var TAB = '\t'; // 0x09 (9) var LF = '\n'; // 0x0A (10) var CR = '\r'; // 0x0D (13) var SPC = ' '; // 0x20 (32) function scan_to_blank(){ var eol_count = 0; for(var i=start; i < len; i++){ var c = txt[i]; if(c === LF){ eol_count++; if(eol_count > 1) return; end = i; // So far. Back 1 for CR: if(i>0 && txt[i-1] === CR) end--; continue; } if(eol_count < 1) continue; if(c !== TAB && c !== CR && c !== SPC){ eol_count = 0; // not a blank line } } eof = true; end = len-1; } function consume_blanks(){ var c, last_end; for(i=end; i<len; i++){ c=txt[i]; if(c === LF){ last_end = i+1; if(i>0 && txt[i-1] === CR) last_end--; continue; } if(c !== TAB && c !== SPC && c !== CR) return last_end; } eof = true; return len; } function scan(input){ txt = input; len = txt.length; segment_count = 0; while(!eof){ scan_to_blank(); // At this point, there's a segment between 'start' and 'end' start = consume_blanks(); segment_count++; } return `${segment_count} segments.`; }
I worked out the core of scan_to_blank()
and consume_blanks()
on paper one
evening and I’m happy with the design. There is, at most, one character of
back-tracking per line and I believe that if this were written in a compiled
language (so my functions were native code), this would be way faster than the
regex!
(At the very least, I know it’s pretty decent because it is faster than the regex version for very short input and considerably faster on short input and a "cold" first run!)
As in my previous test, this shows that the compiled code of the regex engines will probably be faster for large chunks of processing than anything you write in JavaScript, even with the best interpreters in the world - even though the regex engines are at a disadvantage in terms of "understanding" the exact problem ahead of time versus a custom lexer.
Function overhead?
My hand-written parser splits the work into two functions. I would need them to be split up this way in my real project. But not in this benchmark. So does inlining the functions help?
No!
For some reason, inlining the functions actually gave slightly worse performance.
For strictly syntactical reasons, I needed to add if
conditions to test the
scanner position coming out of the loops (versus just falling through the loops in the
function versions) to check for the EOF condition.
Now, with some clever refactoring, the extra conditions could be eliminated. But
the point is, function calls in these JavaScript engines are extremely efficient -
lighter than if
statements. Don’t sweat function calls!