Battle of the tokenizers – delimited text parser performance

An interesting question about StringTokenizer popped up on Stack Overflow the other day. It was essentially about how to optimise reading delimited data, in this case lines of integers separated by spaces.

It demonstrated three things.

  1. Don’t fixate on micro-optimisations when you probably have bigger bottlenecks elsewhere.
  2. String.split() is really slow compared with the alternatives.
  3. Even so, the absolute difference is pretty negligible, even over millions of rows.

I ran some benchmarks to see how different methods compared. I’m not claiming these are great benchmarks, but they’re good enough to demonstrate the general performance of the various solutions. The results of parsing 100,000 rows of “12 34” on my MacBook were:

  • Split: 366ms
  • IndexOf: 50ms
  • StringTokenizer: 89ms
  • GuavaSplit: 109ms
  • IndexOf2 (some super optimised solution given in the above question): 14ms
  • CsvMapperSplit (mapping row by row): 326ms
  • CsvMapperSplit_DOC (building one doc and mapping all rows in one go): 177ms

The surprising thing here is how slow split is compared to the other solutions; I really had no idea until I ran these tests. You can map to objects about twice as fast with Jackson’s CsvMapper (building one document and mapping all rows in one go) as you can with split, and the Guava Splitter is more than three times as fast as split. Building your own using indexOf is faster still, but of course harder to code and maintain. I wouldn’t recommend anyone use the super optimised version unless it’s really, really important to shave off those milliseconds. One reason it’s faster is that it implements its own String->int conversion.
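To make the comparison concrete, here is a minimal sketch of four of the approaches for a line such as “12 34”. It is not the benchmark code and not the IndexOf2 solution from the question; the class and method names are made up for illustration, and the hand-rolled version assumes exactly two non-negative integers separated by a single space, with no overflow or error checking.

```java
import java.util.Arrays;
import java.util.StringTokenizer;

// Hypothetical helper class; names are illustrative only.
final class DelimitedParsers {

    // String.split: shortest to write. The delimiter argument is a regular
    // expression, and an array plus one String per field is allocated.
    static int[] withSplit(String line) {
        String[] parts = line.split(" ");
        return new int[] { Integer.parseInt(parts[0]), Integer.parseInt(parts[1]) };
    }

    // StringTokenizer: no regular expression, but still one String per token.
    static int[] withTokenizer(String line) {
        StringTokenizer st = new StringTokenizer(line, " ");
        int first = Integer.parseInt(st.nextToken());
        int second = Integer.parseInt(st.nextToken());
        return new int[] { first, second };
    }

    // indexOf + substring: find the separator once, then cut the two fields out.
    static int[] withIndexOf(String line) {
        int sep = line.indexOf(' ');
        return new int[] {
            Integer.parseInt(line.substring(0, sep)),
            Integer.parseInt(line.substring(sep + 1))
        };
    }

    // Hand-rolled conversion: walks the characters once and builds the ints
    // directly, so no intermediate Strings are created.
    // Assumes only digits and a single space; anything else gives garbage.
    static int[] withManualParse(String line) {
        int[] out = new int[2];
        int field = 0;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == ' ') {
                field++;                                    // move on to the second number
            } else {
                out[field] = out[field] * 10 + (c - '0');   // append one decimal digit
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "12 34";
        System.out.println(Arrays.toString(withSplit(line)));
        System.out.println(Arrays.toString(withTokenizer(line)));
        System.out.println(Arrays.toString(withIndexOf(line)));
        System.out.println(Arrays.toString(withManualParse(line)));
    }
}
```

All four methods return the same two values for the sample line. The last one shows where the maintenance cost comes from: it only works while the input matches its assumptions exactly.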

Here’s the complete JUnit test case I hacked together:

 

Written by Tom

6 Comments

Thomas Johnson

Was this done in Java 6? String.split performs better in Java 7, at least for single-character delimiters.

Tom

This was done in Java 6. I didn’t realise there’d be a difference; perhaps they’ve optimised it. I’ll run it again tomorrow on 6 and 7 and post a comparison.

Rim

Thank you for the interesting experiments.

Here are the perf results with Java 7:
Split: 262ms
IndexOf: 68ms
StringTokenizer: 122ms
IndexOf2: 24ms
————-
Split: 154ms
IndexOf: 50ms
StringTokenizer: 96ms
IndexOf2: 16ms
————-
Split: 156ms
IndexOf: 48ms
StringTokenizer: 98ms
IndexOf2: 11ms

Ward

The word is spelled ‘optimized.’ I’d have thought that it was merely a typo the first time until you did it again later on.

dani

I guess System.nanoTime() is better for benchmarking, as it provides monotonic time.
