How to load the ORCID data dump into mongo, without dying of old age

I’ve recently been doing some analysis of the 2016 ORCID data dump.  This is essentially 2.5 million JSON documents of varying size in a giant 10GB zip file, and if you haven’t worked with this kind of thing before it can be painful.  I thought I’d post up some tips so that you don’t have to go through the same pain that I did.

1: Use wget, not curl.  wget retries broken connections by default; curl does not.  I found this out after several attempts to download the file, each taking hours before breaking.
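
For example, something like the following is a minimal sketch; the URL is just a placeholder for wherever the dump is hosted, --tries=0 means retry indefinitely and --continue resumes a partial download:

    # Retry indefinitely and resume a partial download if the connection drops.
    # Replace the URL with the actual location of the ORCID dump.
    wget --continue --tries=0 https://example.org/ORCID_public_data_file.zip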

2: Unzip the file.  This takes 8 hours on a MacBook Pro with an SSD.  Your mileage will vary.  All 2.5 million files will end up in a single directory.  Do not attempt to list the contents of this directory, as you’ll be waiting forever.  If you really, really want to list the contents, disable sorting when you ls (as below) and it’s not so bad.
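
If you do want to peek, something like this keeps ls from sorting; it is a quick sketch and the json directory name is a placeholder for wherever you unzipped to:

    # -f disables sorting, so ls streams entries immediately instead of
    # reading the whole 2.5-million-entry directory into memory first.
    ls -f ./json | head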

3: Start up mongo with 2GB of memory and journaling disabled.  Both of these will help speed things up.
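
If you’re starting mongod by hand, this is a minimal sketch assuming a 2016-era MongoDB 3.x build, taking “2GB of memory” to mean the WiredTiger cache size; the dbpath is a placeholder:

    # 2GB WiredTiger cache, journaling off; --dbpath is a placeholder.
    mongod --dbpath ./orcid-db --wiredTigerCacheSizeGB 2 --nojournal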

4: Batch up the JSON files and import them.  Batching will dramatically speed up the loading process.  However, try to do it too fast and mongo will (unhelpfully) silently fail to load some of them.  I found that 100 at a time meant I loaded them all in a reasonable timeframe – 9 hours.  If you do them one at a time it will take five times as long.  A sketch of the script I used to batch and load is below.  If you do not have jq installed, use Homebrew or whatever to install it.
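
Something along these lines works as a minimal sketch, not the exact original script: it assumes the unzipped files live in ./json, that a local mongod is running, and that “orcid” and “profiles” are placeholder database and collection names.  jq -c compacts each document onto one line so mongoimport can read the stream, and xargs -n 100 does the batching, running one mongoimport per 100 files.

    #!/usr/bin/env bash
    # Batch the ORCID JSON files 100 at a time, compact each document with jq,
    # and load each batch with mongoimport.  ./json, "orcid" and "profiles"
    # are placeholders for your extraction directory, database and collection.
    set -euo pipefail
    find ./json -maxdepth 1 -name '*.json' -print0 \
      | xargs -0 -n 100 sh -c 'jq -c . "$@" | mongoimport --db orcid --collection profiles --quiet' _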

5: Go to bed and wake up to a lovely mongo DB full of ORCID records.

6: Start aggregating!
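
As a starting point, here are a couple of sanity checks from the mongo shell; again a sketch, with “orcid” and “profiles” standing in for whatever database and collection names you used:

    # Count the loaded documents, then run a trivial aggregation over them.
    mongo orcid --eval 'print(db.profiles.count())'
    mongo orcid --eval 'printjson(db.profiles.aggregate([{ $group: { _id: null, total: { $sum: 1 } } }]).toArray())'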
