This is a card in Dave's Virtual Box of Cards.

JavaScript: Blank line scanning/parsing/lexing by hand vs regex

Page created: 2025-10-09

See also JavaScript string parsing performance test where I did a very similar performance test of hand-written lexer/scanner versus regex. (Which you can run right on the page.) I also discuss the merits of performance.now() in testing in browsers.

For this particular test, I put the benchmark in a separate HTML file.

Check it out and run it here: line-scanning-bench.html

The benchmark finds blank line-separated segments (mostly paragraph breaks) in the complete text of Project Gutenberg’s edition of How to Live on 24 Hours a Day, which contains 91Kb of characters in 245 blank line-separated segments. (I also added variations of SPC, TAB, LF, and CR+LF to make sure each version was working correctly.)

Results

As with my previous test, regex won again. On short input (just 8 lines in 4 segments), my hand-written parser wins (up to twice as fast), but on larger amounts of input and on greater repetitions of the test, regex pulls away - quickly becoming about twice as fast.

The exact speeds depend on hardware and don’t really matter, but the relative differences do:

Repetitions	Hand-written scanner	Regex scanner
10	3 ms	2 ms
100	21 ms	12 ms
1000	232 ms	91 ms
10000	2069 ms	960 ms

Repetitions

Hand-written scanner

Regex scanner

3 ms

2 ms

100

21 ms

12 ms

1000

232 ms

91 ms

10000

2069 ms

960 ms

(Smaller number is better.)

I ran each test multiple times and am showing the last time (the first "cold" time is always a bit longer - particularly for the regex version).

Another interesting result is that Firefox (SpiderMonkey) 143.0.4 was notably faster than Chrome (V8) 140.0.7339.80 in all of my testing. The relative difference between the hand-written scanner versus the regex scanner was about the same in both browsers. This test isn’t about testing browser differences, so I’m not even going to break those numbers out into a table - and you should do your own testing on your own system(s) if you want to get into that.

The two methods (source code)

Of course, maybe my hand-written scanner or regex-based scanners are deeply flawed. Judge for yourself!

Keep in mind that part of the difficulty of determining blank lines as I’m defining them is that:

A "blank" line can contain spaces and tabs
"Lines" are separated by either Unix/POSIX-style line endings: LR, 0x0A or Web/Internet RFC-style line-endings: CR+LF 0x0D 0x0A (which are also used by Windows) See: https://en.wikipedia.org/wiki/Newline

I’ve cleaned up the source a little for readability, but these will be pretty much straight out of the rough-and-ready benchmark - bad variable names and all.

I’ll list the regex version first, which is shorter.

Regex scanner

NOTE: Variables start, end, and eof aren’t needed for the benchmark, but will be needed in my real-life project!

var start = 0;
var end;
var eof = false;
var segment_count = 0;

// g - populate and use the lastIndex property
// d - provide substring match indices
var blankrx = /\r?\n\s*\r?\n/gd;

while(!eof){
    z = blankrx.exec(input);
    segment_count++; // will include last one

    if(!z){ eof=true; break; }

    end = z.indices[0][0];
    start = blankrx.lastIndex;
}

Hand-written character-based scanner

NOTE: Some of the data I’m capturing here isn’t needed for the benchmark, but will be needed in my real-life project - so both benchmarks capture it.

var start = 0;
var end;

var txt; // the string we're parsin'
var len; // length of the string
var eof = false; // have we hit eof?

var TAB = '\t'; // 0x09 (9)
var LF  = '\n'; // 0x0A (10)
var CR  = '\r'; // 0x0D (13)
var SPC = ' ';  // 0x20 (32)

function scan_to_blank(){
    var eol_count = 0;
    for(var i=start; i < len; i++){
        var c = txt[i];
        if(c === LF){
            eol_count++;
            if(eol_count > 1) return;
            end = i; // So far. Back 1 for CR:
            if(i>0 && txt[i-1] === CR) end--;
            continue;
        }
        if(eol_count < 1) continue;
        if(c !== TAB && c !== CR && c !== SPC){
            eol_count = 0; // not a blank line
        }
    }
    eof = true;
    end = len-1;
}

function consume_blanks(){
    var c, last_end;
    for(i=end; i<len; i++){
        c=txt[i];
        if(c === LF){
            last_end = i+1;
            if(i>0 && txt[i-1] === CR) last_end--;
            continue;
         }
        if(c !== TAB && c !== SPC && c !== CR) return last_end;
    }
    eof = true;
    return len;
}

function scan(input){
    txt = input;
    len = txt.length;

    segment_count = 0;

    while(!eof){
        scan_to_blank();
        // At this point, there's a segment between 'start' and 'end'
        start = consume_blanks();
        segment_count++;
    }
    return `${segment_count} segments.`;
}

I worked out the core of scan_to_blank() and consume_blanks() on paper one evening and I’m happy with the design. There is, at most, one character of back-tracking per line and I believe that if this were written in a compiled language (so my functions were native code), this would be way faster than the regex!

(At the very least, I know it’s pretty decent because it is faster than the regex version for very short input and considerably faster on short input and a "cold" first run!)

As in my previous test, this shows that the compiled code of the regex engines will probably be faster for large chunks of processing than anything you write in JavaScript, even with the best interpreters in the world - even though the regex engines are at a disadvantage in terms of "understanding" the exact problem ahead of time versus a custom lexer.

Function overhead?

My hand-written parser splits the work into two functions. I would need them to be split up this way in my real project. But not in this benchmark. So does inlining the functions help?

No!

For some reason, inlining the functions actually gave slightly worse performance. For strictly syntactical reasons, I needed to add if conditions to test the scanner position coming out of the loops (versus just falling through the loops in the function versions) to check for the EOF condition.

Now, with some clever refactoring, the extra conditions could be eliminated. But the point is, function calls in these JavaScript engines are extremely efficient - lighter than if statements. Don’t sweat function calls!