Parallel CSV reader

I had almost forgotten about this code, which I wrote a while back. Someone asked me about it yesterday, so I dusted off the parallel CSV reader this morning; here are the results:

This test reads a generated file containing 10M customer records (919,169,988 bytes) in 18.3 seconds, or about 50 MB/s. Obviously, my poor laptop disk can’t deliver data at that speed, so these results were obtained courtesy of the excellent Linux caching system :-)

In any case, the caching system simulates a faster disk subsystem.

On my computer, the system doesn’t really scale linearly (especially since, in this case, the OS uses up some CPU power too), but the speedup is noticeable: from 25.8 down to 18.3 seconds, about a 30% reduction in runtime.

The interesting thing is that if you have more CPUs at your disposal (both SMP and clustered setups work), you can probably make it scale to the full extent of your disk speed.
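
For the curious, the general technique looks something like this: split the file into byte ranges and let every reader align itself on row boundaries. Below is a minimal sketch of that idea (a simplified illustration, not the actual Kettle code; the file name, the plain row counting, and the 64KB boundary slack are assumptions): every chunk except the first discards the partial row at its start, and a chunk reads one row past its end when a row straddles the boundary, so each row is handled exactly once.

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ParallelCsvSketch {

        public static void main(String[] args) throws Exception {
            String path = args.length > 0 ? args[0] : "customers.csv"; // hypothetical input file
            int parallelism = Runtime.getRuntime().availableProcessors();

            try (RandomAccessFile file = new RandomAccessFile(path, "r");
                 FileChannel channel = file.getChannel()) {

                long size = channel.size();
                long chunk = (size + parallelism - 1) / parallelism;

                ExecutorService pool = Executors.newFixedThreadPool(parallelism);
                List<Future<Long>> counts = new ArrayList<>();
                for (int i = 0; i < parallelism; i++) {
                    long start = i * chunk;
                    long end = Math.min(start + chunk, size);
                    counts.add(pool.submit(() -> countRows(channel, start, end, size)));
                }

                long total = 0;
                for (Future<Long> f : counts) total += f.get();
                pool.shutdown();
                System.out.println("rows: " + total);
            }
        }

        // Reads the byte range [start, end): skips the partial row at the start
        // (it belongs to the previous chunk) and reads one row past 'end' when a
        // row straddles the boundary, so every row is handled exactly once.
        // Assumes chunks are much larger than any single row.
        static long countRows(FileChannel channel, long start, long end, long fileSize)
                throws IOException {
            final long SLACK = 64 * 1024; // assumption: longer than the longest row
            long mapEnd = Math.min(end + SLACK, fileSize);
            // Note: map() is limited to 2GB per region; fine for this sketch.
            MappedByteBuffer buf =
                    channel.map(FileChannel.MapMode.READ_ONLY, start, mapEnd - start);

            long pos = start;
            if (start > 0) { // discard everything up to and including the first newline
                while (pos < mapEnd && buf.get((int) (pos - start)) != '\n') pos++;
                pos++;
            }

            long rows = 0;
            while (pos <= end && pos < fileSize) {
                // Consume one full row; a real reader would split it into fields here.
                while (pos < mapEnd && buf.get((int) (pos - start)) != '\n') pos++;
                pos++; // step over the newline
                rows++;
            }
            return rows;
        }
    }

Because each reader touches only its own slice of the file, the throughput can keep scaling with the number of CPUs until the disk becomes the bottleneck.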

In the case where lazy conversion is disabled (classical database loading situations come to mind), you can see the read performance increase from around 75,000 rows/s to around 100,000 rows/s.
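
For those wondering what lazy conversion actually buys you: the reader hands rows downstream as raw bytes and defers charset decoding and type parsing until a value is really needed. Here is a toy illustration of the idea (simplified, not the actual Kettle classes; the class name and the UTF-8 charset are assumptions):

    import java.nio.charset.StandardCharsets;

    // Toy illustration of lazy conversion: keep the raw bytes of a CSV field
    // and only decode/parse them when the value is actually requested.
    final class LazyField {
        private final byte[] raw;   // the field exactly as read from disk
        private String decoded;     // cached result of the expensive conversion

        LazyField(byte[] raw) {
            this.raw = raw;
        }

        // Charset decoding happens here, on first use, not at read time.
        String asString() {
            if (decoded == null) {
                decoded = new String(raw, StandardCharsets.UTF_8);
            }
            return decoded;
        }

        // Numeric parsing is likewise deferred until someone asks for a number.
        long asLong() {
            return Long.parseLong(asString().trim());
        }

        // A straight copy to another text file never needs to convert at all.
        byte[] rawBytes() {
            return raw;
        }

        public static void main(String[] args) {
            LazyField f = new LazyField("12345".getBytes(StandardCharsets.UTF_8));
            System.out.println(f.asLong() + 1); // conversion happens only now
        }
    }

In a classical database load, every value has to be converted to a typed object anyway, so there is nothing to defer; that is why that scenario runs with lazy conversion disabled.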

In both scenarios, both CPUs in my machine are utilized at 100% (or at least very close to it), so there is good hope that this system will scale over more than 2 CPUs as well.

You can find this feature in the latest SVN builds (revision 7049 and up) or in the upcoming 3.1.0 release.

Feel free to let us know how you like this new performance enhancement!

Until next time,

Matt

4 comments

  • Cool, Matt. I’ll HAVE to get this going on the cloud and give it a go.

    Nick

  • I did some testing with this, running two slaves on my local box and two slaves and a master on my test box out at the data center. They all took their slice of the file and merrily went to work on it, burning through the 2M-record file at an amazing clip.

    My only problem is that the files I have to process are normally stored on the SAN in gzipped format, and of course the CSV input doesn’t currently support gunzipping. I could add an extra step to unpack the file manually and clean it up afterward, but I was actually wondering whether it wouldn’t be okay to relax those NIO restrictions somewhat. Since using a VFS URI causes the file to be extracted into the vfs_cache directory, I don’t see a real reason why you couldn’t just use NIO on that temp file.

    Anyway, exciting stuff.

    Daniel

  • Hi Daniel,

    Thanks for that information. I don’t think that going with GZip would matter that much as far as NIO is concerned. The I/O throughput is reduced dramatically in that case anyway (at a CPU cost), and that CPU cost can be offset by the parallel nature of the reads.
    It remains to be seen how quick and optimal GZIPInputStream.skip() is (needed for the parallel read), but I would propose that we give that a try ;-) (a quick probe of that is sketched below the comments)

    Matt
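
As a follow-up to the gzip question above: gzip streams are not seekable, so GZIPInputStream.skip() has to inflate and throw away every byte before the target offset. That means each parallel reader would still decompress the file from the beginning up to the start of its own slice. A quick probe of that cost (the file name and target offset below are made up):

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.zip.GZIPInputStream;

    public class GzipSkipProbe {
        public static void main(String[] args) throws IOException {
            String path = args.length > 0 ? args[0] : "customers.csv.gz"; // made-up file name
            long target = 100000000L; // uncompressed offset where a middle slice would start

            long t0 = System.nanoTime();
            try (InputStream in = new GZIPInputStream(new FileInputStream(path))) {
                long remaining = target;
                while (remaining > 0) {
                    // skip() has no seek to fall back on: it inflates into a
                    // scratch buffer and throws the bytes away.
                    long skipped = in.skip(remaining);
                    if (skipped <= 0) break; // end of stream reached early
                    remaining -= skipped;
                }
            }
            long ms = (System.nanoTime() - t0) / 1000000;
            System.out.println("Skipping " + target + " bytes took " + ms + " ms");
        }
    }

If that linear cost turns out to be too high, decompressing once to a temporary file (which is effectively what the VFS cache Daniel mentions does) and pointing the parallel NIO readers at that file remains the fallback.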