Battle of the tokenizers – delimited text parser performance

An interesting question about StringTokenizer popped up on Stack Overflow the other day. It was essentially about how to optimise reading delimited data, in this case lines of integers separated by spaces.

It demonstrated three things.

  1. Don’t fixate on micro-optimisations when you probably have big bottlenecks elsewhere
  2. String.split() is really slow
  3. The difference is pretty negligible even over millions of rows.

I ran some benchmarks to see how different methods compared. I’m not claiming these are great benchmarks, but they’re good enough to demonstrate the general performance of the various solutions. The results of parsing 100,000 rows of “12 34” on my MacBook were:

  • Split: 366ms
  • IndexOf: 50ms
  • StringTokenizer: 89ms
  • GuavaSplit: 109ms
  • IndexOf2 (a super optimised solution given in the above question): 14ms
  • CsvMapperSplit (mapping row by row): 326ms
  • CsvMapperSplit_DOC (building one doc and mapping all rows in one go): 177ms
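
For context, here is a rough sketch of what the three simplest approaches in that list might look like for a row such as “12 34”. This is not the code that was benchmarked: the class and method names are made up for illustration, and it assumes every row holds exactly two integers separated by a single space.

```java
import java.util.Arrays;
import java.util.StringTokenizer;

// Hypothetical illustration only; not the benchmarked test case.
class DelimitedParsers {

    // String.split treats the delimiter as a regular expression
    // and returns a new String[] for every row.
    static int[] viaSplit(String line) {
        String[] parts = line.split(" ");
        return new int[] { Integer.parseInt(parts[0]), Integer.parseInt(parts[1]) };
    }

    // StringTokenizer walks the string and hands back one token at a time.
    static int[] viaTokenizer(String line) {
        StringTokenizer tokens = new StringTokenizer(line, " ");
        int first = Integer.parseInt(tokens.nextToken());
        int second = Integer.parseInt(tokens.nextToken());
        return new int[] { first, second };
    }

    // indexOf finds the single separator, then substring cuts out both fields.
    static int[] viaIndexOf(String line) {
        int sep = line.indexOf(' ');
        return new int[] {
            Integer.parseInt(line.substring(0, sep)),
            Integer.parseInt(line.substring(sep + 1))
        };
    }

    public static void main(String[] args) {
        String line = "12 34";
        System.out.println(Arrays.toString(viaSplit(line)));
        System.out.println(Arrays.toString(viaTokenizer(line)));
        System.out.println(Arrays.toString(viaIndexOf(line)));
    }
}
```

All three return the same pair of ints for a well-formed row; they differ only in how they locate the fields.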

The surprising thing here is how slow split is compared to the other solutions. I really had no idea until I ran these tests. You can map to objects about twice as fast using the Jackson CsvMapper as with split, and parse about three times as fast using the Guava Splitter. Building your own using indexOf is faster still, but of course harder to code and maintain. I wouldn’t recommend anyone use the super optimised version unless it’s really, really important to shave off those milliseconds. One reason it’s faster is that it implements its own String-to-int conversion.
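
To give a feel for what “its own conversion” means, here is a minimal sketch of the idea. It is not the IndexOf2 code from the question: the names are hypothetical, and it assumes non-negative decimal integers with no validation or overflow checks, which is exactly the kind of corner-cutting that makes such code fast but fragile.

```java
// Hypothetical illustration of a hand-rolled int parser; not the IndexOf2 code.
class HandRolledIntParser {

    // Converts the decimal digits in line[from, to) to an int without
    // creating a substring. Assumes digits only: no sign, no whitespace,
    // no overflow handling.
    static int parseDigits(String line, int from, int to) {
        int value = 0;
        for (int i = from; i < to; i++) {
            value = value * 10 + (line.charAt(i) - '0');
        }
        return value;
    }

    // Parses a row of exactly two space-separated integers.
    static int[] parsePair(String line) {
        int sep = line.indexOf(' ');
        return new int[] {
            parseDigits(line, 0, sep),
            parseDigits(line, sep + 1, line.length())
        };
    }

    public static void main(String[] args) {
        int[] pair = parsePair("12 34");
        System.out.println(pair[0] + " " + pair[1]);
    }
}
```

Compared with the substring plus Integer.parseInt route, this avoids allocating intermediate strings, at the cost of silently producing garbage for input such as “-5” or “1a”.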

Here’s the complete JUnit test case I hacked together:


3 thoughts on “Battle of the tokenizers – delimited text parser performance”

    1. This was done in Java 6. I didn’t realise there’d be a difference; perhaps they’ve optimised it. I’ll run it again tomorrow on 6 and 7 and post up a comparison.

  1. Thank you for the interesting experiments.

    Here are the perf results with Java 7:

    Split: 262ms
    IndexOf: 68ms
    StringTokenizer: 122ms
    IndexOf2: 24ms

    Split: 154ms
    IndexOf: 50ms
    StringTokenizer: 96ms
    IndexOf2: 16ms

    Split: 156ms
    IndexOf: 48ms
    StringTokenizer: 98ms
    IndexOf2: 11ms
