Test case: fast parallel flat file reading

Version 3 of Pentaho Data Integration will feature two new steps to load flat files:

  • CSV Input: to read delimited text files
  • Fixed Input: to read fixed width flat files

Neither step was designed to be as versatile as possible; we already have the regular “Text File Input” step for that. These steps were designed to be as fast as possible.

Here are the reasons why these steps are fast:

  • They use Java NIO to read data in big chunks at a time
  • They have greatly simplified algorithms without any bloat
  • They allow us to read data using lazy conversions

Besides these items, we also greatly reduced the overhead of garbage collection in the Java Virtual Machine and of metadata handling in this new version.
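To make the NIO and lazy-conversion points above a little more concrete, here is a minimal Java sketch of what such a reading loop can look like. This is illustration only, not the actual code of the new steps; the 1MB chunk size is just an example value and the lazy-conversion part is reduced to a comment.

import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class FixedReaderSketch {
    private static final int CHUNK_SIZE = 1024 * 1024; // read 1MB from disk at a time
    private static final int ROW_WIDTH  = 1529;        // bytes per row in our test file

    public static void main(String[] args) throws IOException {
        try (FileChannel channel = new FileInputStream(args[0]).getChannel()) {
            ByteBuffer buffer = ByteBuffer.allocateDirect(CHUNK_SIZE);
            byte[] row = new byte[ROW_WIDTH];
            long rows = 0;

            while (channel.read(buffer) > 0) {
                buffer.flip();
                // only hand out complete rows; a partial row stays behind for the next read
                while (buffer.remaining() >= ROW_WIDTH) {
                    buffer.get(row);
                    rows++;
                    // lazy conversion: keep the raw bytes and only turn a field into a
                    // String, Date or Number when a later step actually needs the value
                }
                buffer.compact();
            }
            System.out.println("rows read: " + rows);
        }
    }
}

Reading megabyte-sized chunks means the disk gets large sequential requests, and postponing the conversion of the raw bytes keeps both CPU time and garbage collection to a minimum while the data streams in.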

So where does that leave us? How fast can we read data now? Well, let’s take a really big file and try it out for ourselves to see.

Generate a text file

A test file is a bit of a problem: for reference you want everyone to use the same text file, yet you can’t just post a 15GB text file on an FTP server somewhere (it’s just not practical).

So you need to generate one. I created a small C program to handle this: printFixedRows.tar.gz

A Linux (x86) binary is included in the archive, but you can compile the C program for your own platform with the following command:

cc -o printFixedRows printFixedRows.c

Then you can launch this command to get a 15GB test file:

./printFixedRows 10000000 150 N > bigfixedfile.txt

This generates 10 million rows with 1529 bytes on each row (a total size of 15,290,001,529 bytes).
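To double-check that the generator did what we expect, a quick file listing should show the size mentioned above:

ls -l bigfixedfile.txt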

Reading a text file

Because the file is far larger than the machine’s RAM, we can be certain that the system cache is not going to make too much of a difference in the results. So let’s read this file.

Here is the transformation to read the file.

Parallel reading transformation

Performance monitoring

If we adjust the transformation and point it to the correct filename, we can run it.

We can then see that the performance is around 12,000 rows/s, or 12,000 × 1529 bytes/s ≈ 18MB/s.
Each row contains 152 fields of 4 different data types (String, Integer, Date and Number), so this step produces around 1.8 million data fields per second.

In fact, if we look at the output of “iostat -k 5”, we find pretty much the same numbers:

iostat results

We also note that the CPU usage is very low, at around 22%, and that the I/O wait percentage is very high. In other words: our transformation is waiting for the disk to deliver data. The NIO system really pays off here by reading large blocks from disk at a time, offering great performance.

Parallel reading

At first glance, parallel reading wouldn’t do us any good here since the disk can’t go any faster. Having 2 processes read the same file at the same time is not the solution.

So I copied the file over to another disk and made symbolic links in the /tmp folder:

  • /tmp/0 for step copy 0 with a sym-link to biginputfile.txt on disk 1
  • /tmp/1 for step copy 1 with a sym-link to biginputfile.txt on disk 2

In my case it’s this:

0/bigfixed.txt -> /bigone/bigfixed.txt
1/bigfixed.txt -> /parking/bigfixed.txt
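
For the record, that layout takes only a couple of standard commands to set up (adjust the source paths to wherever your two copies of the file live):

mkdir /tmp/0 /tmp/1
ln -s /bigone/bigfixed.txt /tmp/0/bigfixed.txt
ln -s /parking/bigfixed.txt /tmp/1/bigfixed.txt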

Please forget for a minute the fact that copying the data to another disk is slower than reading it in the first place. I’m sure you all have fancy RAID drives that do this transparently; I’m just a poor blogger, remember? 🙂 If you have a fast disk, you can simply fire up 2 or more copies of the step, without the trick above, to get the speedup.

As you have all figured out by now, the speed of the transformation doubled simply by setting the number of copies to start for the “Fixed Input” step to 2. Throughput went from 18MB/s to around 40MB/s. The second disk is a bit faster as it’s an internal disk, not a USB drive, and we’re now reading data at around 26,000 rows/s (around 4 million data fields per second).

iostat results
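
The general idea behind running the step in multiple copies is straightforward: each copy claims its own slice of the file and reads only that slice, which is easy to compute for a fixed-width file because every row has the same length. Here is a rough, hypothetical Java sketch of that idea, not the actual PDI implementation:

import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class ParallelSliceSketch {
    private static final int ROW_WIDTH  = 1529;        // bytes per row in our test file
    private static final int CHUNK_SIZE = 1024 * 1024; // still read in big chunks

    // read only the rows that belong to step copy 'copyNr' out of 'nrCopies'
    static long readSlice(String filename, int copyNr, int nrCopies) throws IOException {
        try (FileChannel channel = new FileInputStream(filename).getChannel()) {
            long totalRows = channel.size() / ROW_WIDTH;
            long firstRow  = copyNr * totalRows / nrCopies;       // first row of our slice
            long lastRow   = (copyNr + 1) * totalRows / nrCopies; // exclusive

            long start = firstRow * ROW_WIDTH;
            long end   = lastRow  * ROW_WIDTH;
            channel.position(start); // seek straight to our slice, skip everyone else's data

            ByteBuffer buffer = ByteBuffer.allocateDirect(CHUNK_SIZE);
            long bytesLeft = end - start;
            while (bytesLeft > 0) {
                buffer.clear();
                buffer.limit((int) Math.min(CHUNK_SIZE, bytesLeft));
                int read = channel.read(buffer);
                if (read <= 0) break;
                bytesLeft -= read;
                // ... split the chunk into rows and pass them on, as in the sketch above ...
            }
            return (end - start - bytesLeft) / ROW_WIDTH; // rows handled by this copy
        }
    }
}

With 2 copies, copy 0 ends up reading the first half of the rows and copy 1 the second half, so in the setup above each disk only has to deliver about half of the 15GB.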

Possibilities and conclusions

Because of all the optimizations we did, we can now process data faster than our limited I/O system can deliver it. That means that using faster disks, or in general getting more “spindles”, is a good idea if you want to read data faster.

Obviously, the same goes for the output side as well, but that’s a whole different story for another time. For now, we demonstrated that the new “Fixed Input” step can read a 15GB file in 400 seconds flat. That’s not a bad result for my laptop system, any way you look at it. The 2 CPUs are even 75% idle, leaving room for all the interesting transformations you might want to run on the data.

Until next time,

Matt