This is a card in Dave's Virtual Box of Cards.

Getting the wordcount of my email inbox (or: "Parsing email like a fool")

Page created: 2026-05-16
Updated: 2026-06-17

Writing Project Inbox 2026 right now and accidentally wondered about the wordcount in my current inbox. So of course, now I gotta find out because I’m really curious.

Keep in mind, this inbox has already been pared down to just high-quality correspondence from real people.

Getting the email out of IMAP via Thunderbird

I’m running this version of the Thunderbird email application:

thunderbird --version
Mozilla Thunderbird 144.0.1

First, I found that I can select all of the email messages in my inbox, right click, and Save As… to a directory. (Thunderbird saves them all instantly to the chosen directory as separate files, so you might want to create a fresh, empty directory first.)

The messages are saved as plain *.eml text files. Great, I can work with this!

So how do I extract just the plain text message body from these?

Determining the Content-Type from the header

An email’s header is separated from its body by a "blank line". On the wire, that’s a CR LF on a line by itself (so two CR LF pairs in a row), but I can chomp the lines to normalize the newline sequences away and detect the blank line as just an empty string:

Dir.glob('*.eml') do |fname|
  in_header = true  # every message starts out in its header
  File.readlines(fname, chomp:true).each do |line|

    if in_header
      if line == ''
        in_header = false
        next
      end
    else

      # Process body...

    end

  end
end

Distraction

I also noticed that all of the email text files end the headers with this property (the '#'s are a sequence of numbers):

X-Transit-Start: ###########

Super weird that a web search turns up zero hits on this property. DuckDuckGo admits it has nothing. Marginalia Search (marginalia-search.com) turns up nothing. Google turns up a handful of false positives. Microslop Bing doesn’t even pretend it cares what I searched for, it just wants to show me as much junk as possible. Oh, I know! I’ll use my Kagi trial account…​nope, no matches.

Whatever. Maybe it’s Thunderbird-specific. Resisting. Urge. To. Download. Thunderbird. Source…​ Okay, the fever passed.

Getting the Content-Type

Next, do all emails contain Content-Type? NO, one of them has the second word in the property in lower case, Content-type. Fine, whatever. I’ll do it case-insensitive and now the answer is YES.

How many different content types are there? I do a regex match on the Content-Type field in the header and…​ the results are in:

  53.6% text/plain
  34.5% multipart/alternative
   4.8% multipart/related
   4.8% multipart/mixed
   2.4% multipart/signed

Nice, more than half of the messages I get are plaintext! Thank you!
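
For the record, a tally like this can be produced with something along these lines. It’s a minimal sketch rather than the exact script behind the numbers above: the helper names content_type_of and tally_report are made up for illustration, the sample headers are invented, and it assumes the media type sits on the same line as the field name (no unfolding).

```ruby
# Return the lowercased media type from one email's header, or nil.
# `lines` is an array of chomped lines from a single .eml file.
def content_type_of(lines)
  lines.each do |line|
    break if line == ''   # blank line: the header is over
    # Case-insensitive, so "Content-type" and "Content-Type" both match.
    m = /\Acontent-type:\s*([^;\s]+)/i.match(line)
    return m[1].downcase if m
  end
  nil
end

# Turn a list of media types into "percent type" lines, most common first.
def tally_report(types)
  total = types.size.to_f
  types.tally.sort_by { |_type, n| -n }.map do |type, n|
    format('%6.1f%% %s', 100 * n / total, type)
  end
end

# Demo with invented headers (hypothetical data, not the real inbox).
samples = [
  ['From: a@example.com', 'Content-Type: text/plain; charset=UTF-8', '', 'hi'],
  ['Content-type: Text/Plain', '', 'hello'],
  ['Content-Type: multipart/alternative;', '    boundary="abc"', '', '--abc'],
  ['Subject: x', '', 'Content-Type: text/html']  # only in the body: ignored
]
types = samples.map { |lines| content_type_of(lines) || '(none)' }
puts tally_report(types)
```

Pointed at real files, the types list would come from something like Dir.glob('*.eml').map { |f| content_type_of(File.readlines(f, chomp: true)) }.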

The text/plain email messages are super easy to process. Any line after the header is part of the body, so I’ll just print those to STDOUT to redirect to a file later (or maybe just pipe directly to wc for the final answer).

Getting the multipart text/plain body

To do this right, we have to get the boundary marker from the Content-Type: multipart/* field.

It turns out that’s not fun to do because header lines over a certain length are "folded" onto more than one line with a newline followed by at least one whitespace character to indent the continued header. Here’s an example:

Content-Type: multipart/alternative;
    boundary="4F9FB2B9-63BF-4127-8F27-E62F95E339C7"

Which isn’t too bad, it’s just that I’m starting to collect all of these edge cases. Like, the double-quotes around the boundary are optional. And there can also sometimes be other properties after the boundary property. It adds up.
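
Here’s a rough sketch of handling those edge cases: unfold the header first, then pull the boundary out with or without quotes, ignoring anything after it. The names unfold_header and boundary_of are hypothetical, and joining a folded line with a single space is a simplification (strictly, unfolding just removes the line break).

```ruby
# Join folded header lines: a line that starts with a space or tab
# continues the previous header line. Stops at the blank line.
def unfold_header(lines)
  unfolded = []
  lines.each do |line|
    break if line == ''
    if line.start_with?(' ', "\t") && !unfolded.empty?
      unfolded[-1] += ' ' + line.strip
    else
      unfolded << line
    end
  end
  unfolded
end

# Extract the boundary from an unfolded multipart Content-Type line.
# Returns nil for any other header line.
def boundary_of(header_line)
  return nil unless header_line =~ /\Acontent-type:\s*multipart\//i
  # Either a quoted value, or an unquoted one ending at ';' or whitespace.
  m = /boundary=(?:"([^"]+)"|([^;\s]+))/i.match(header_line)
  m && (m[1] || m[2])
end

header = [
  'Subject: hello',
  'Content-Type: multipart/alternative;',
  '    boundary="4F9FB2B9-63BF-4127-8F27-E62F95E339C7"',
  'MIME-Version: 1.0',
  '',
  'body text'
]
unfold_header(header).each do |h|
  b = boundary_of(h)
  puts "boundary: #{b}" if b
end
```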

Anyway, with multipart, I’m looking for lines with pairs of dashes (--) followed by the boundary string followed by a little mini-header for that part with its own Content-Type, etc. like so:

--4F9FB2B9-63BF-4127-8F27-E62F95E339C7
Content-Type: text/plain; charset="UTF-8"
...
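
As a rough sketch (not the script used for the final count), scanning for the text/plain part of a single-level multipart message might look like this. The function name plain_part and the sample message are invented for illustration, and it assumes the boundary string is already known.

```ruby
# Return the body lines of the first text/plain part delimited by `boundary`.
# Handles one level of multipart only; nested multiparts are not followed.
def plain_part(lines, boundary)
  delimiter = "--#{boundary}"
  state = :seek          # :seek -> :part_header -> :body
  is_plain = false
  out = []
  lines.each do |line|
    if line.start_with?(delimiter)
      break if state == :body && is_plain          # end of the part we wanted
      break if line.start_with?("#{delimiter}--")  # closing delimiter
      state = :part_header
      is_plain = false
      next
    end
    case state
    when :part_header
      is_plain = true if line =~ /\Acontent-type:\s*text\/plain/i
      state = :body if line == ''   # blank line ends the mini-header
    when :body
      out << line if is_plain
    end
  end
  out
end

message = [
  'Content-Type: multipart/alternative; boundary="B1"',
  '',
  '--B1',
  'Content-Type: text/plain; charset="UTF-8"',
  '',
  'Hello there',
  'second line',
  '--B1',
  'Content-Type: text/html',
  '',
  '<p>Hello there</p>',
  '--B1--'
]
puts plain_part(message, 'B1')
```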

Great, this is working…

But, oh, come on! Some of these are nested multiparts. The email is multipart and contains one part that is itself multipart. Both multiparts have unique nested boundaries. Why does it always have to be like this?!

In meme format: "Yo dawg, I heard you liked multipart email bodies, so I made multipart multiparts so you can find boundaries while you’re finding boundaries."

I shake my fist at the sky.

Doing it the "right way" isn’t getting me anywhere, so…​

Fine! I’ll collect all your dang boundaries.

I had no intention of writing a freaking industrial-grade email parser today. I just wanted to satisfy a curiosity.

Hey, what if my script just stores any boundary=XXXX values no matter where they are? By golly, I think that will work just fine.

So here’s what I’m gonna do:

  1. Collect all of the boundary definitions, including nested ones in the body

  2. Find Content-Type: text/plain anywhere in the email

  3. Find the next blank line after that

  4. Stop on any line that begins with one of the boundaries

Ha ha, it worked!

And wow, that’s a lot of text. It’s legit, I scrolled through it all.

Here’s the entire Ruby script for the silly 4-step method listed above:

Dir.glob('*.eml') do |fname|
  #puts "================#{fname}================"
  scan_to_text = false
  in_text = false
  boundaries = []

  File.readlines(fname, chomp:true).each do |line|

    # Collect boundaries no matter WHERE they are!
    # (Stop at a quote, space, or ';' so trailing parameters aren't captured.)
    m = /boundary="?([^" ;]+)"?/.match(line)
    if m
      boundaries.push "--#{m[1]}"
    end

    if line =~ /^Content-[Tt]ype: text\/plain/
      scan_to_text = true
      next
    end

    if scan_to_text
      # header or multipart declaration ends with...
      if line == ''
        in_text = true
        scan_to_text = false
        next
      end
    end

    if !in_text
      next
    end

    # A boundary line ends the current text part; don't print it.
    if boundaries.any? { |b| line.start_with?(b) }
      in_text = false
      next
    end

    # Print that sweet, sweet plaintext!
    puts line
  end
end

Usage: I piped the output straight to wc for a word count:

$ ruby ebody.rb | wc -w
52358

Neat.

See also: Email or My Thunderbird Page