Being lazy

Dear Kettle fan,
Since our code is open, we have to be honest: in the past, the performance of Kettle was less than stellar in the “Text File” department. It’s true that we did offer some workarounds with respect to database loading, but there are cases when people don’t want to touch any database at all. Let’s take a closer look at that specific problem…

Reading and writing text files…
Let’s take a look at this delimited (CSV) file (28MB). Unzipped, the file is around 89MB in size.

Suppose you read this file using version 2.5.1 (soon to be out) with a single “Text File Input” step. On my machine, that process consumes most of the available CPU power and takes around 57 seconds to complete. (1M rows/minute or 60M rows/hour)

When we analyze what’s eating the CPU resources during that minute, there are a number of bottlenecks: metadata object allocation and garbage collection, byte[] to String conversion (by Java) and data type conversion (String to Date, Integer, etc)

Now suppose we write that data back to another, mostly identical file. On my machine, that transformation takes around 80 seconds. Again, it’s String-to-byte[] conversion that’s eating CPU, as well as the normal data type conversions taking place. (Date to String, Integer to String, etc.)
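To make that double work concrete, here is a minimal Java sketch (not actual Kettle code; the field value and date mask are just examples) of what happens to a single Date field on its way through a text-to-text transformation:

    // Not actual Kettle code: a bare-bones illustration of the conversions
    // a 2.5.1-style text-to-text transformation performs for every field.
    import java.nio.charset.StandardCharsets;
    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import java.util.Date;

    public class DoubleConversionSketch {
        public static void main(String[] args) throws ParseException {
            // Example field and date mask; the real values depend on your file layout.
            byte[] rawField = "2007/06/15 12:34:56.000".getBytes(StandardCharsets.UTF_8);
            SimpleDateFormat mask = new SimpleDateFormat("yyyy/MM/dd HH:mm:ss.SSS");

            // Input side: decode the bytes and parse the value...
            String asString = new String(rawField, StandardCharsets.UTF_8); // byte[] -> String (UTF-8 decode)
            Date asDate = mask.parse(asString);                             // String -> Date

            // Output side: ...only to format and encode it straight back.
            String backToString = mask.format(asDate);                          // Date -> String
            byte[] backToBytes = backToString.getBytes(StandardCharsets.UTF_8); // String -> byte[] (UTF-8 encode)

            System.out.println(new String(backToBytes, StandardCharsets.UTF_8));
        }
    }

Every one of those four steps is pure overhead when the output format matches the input format.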

So we made some improvements in the 3.0 version…

First of all, the metadata object allocation issue was solved by separating the data from the metadata. Before, we saw that the JVM was spending up to 40% of its time, and sometimes more, doing garbage collection. That performance hit is gone and has led to great improvements all around.
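Roughly sketched, and with illustrative classes rather than the real Kettle 3.0 API, the change looks like this: the field descriptions live in one shared object, and each row is nothing more than a bare Object[]:

    // Illustrative classes only, not the real Kettle 3.0 API.
    import java.util.Date;

    public class RowMetaSketch {
        // Field descriptor: allocated once per transformation, not once per row.
        static final class FieldMeta {
            final String name;
            final Class<?> type;
            FieldMeta(String name, Class<?> type) { this.name = name; this.type = type; }
        }

        public static void main(String[] args) {
            // Metadata: created a single time and shared by every row.
            FieldMeta[] rowMeta = {
                new FieldMeta("id", Long.class),
                new FieldMeta("name", String.class),
                new FieldMeta("birthDate", Date.class),
            };

            // Data: millions of these, but each one is just a bare object array.
            Object[] row = { 1L, "Matt", new Date() };

            for (int i = 0; i < rowMeta.length; i++) {
                System.out.println(rowMeta[i].name + " = " + row[i]);
            }
        }
    }

With millions of rows in flight, allocating only bare arrays instead of per-row metadata objects is what takes the pressure off the garbage collector.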

Then, during 3.0 development, we came up with a new way of handling the specific text-file-to-text-file problem: in a lot of cases we’re doing the same conversions twice:

  • byte[] (in a file) to Java String (UTF-8), then Java String (UTF-8) back to byte[]
  • String to Date using a certain mask, then Date to String
  • String to Integer using a certain mask, then Integer to String again

There is a symmetry to be found here. As it turns out, UTF-8 (Unicode) encoding is very expensive, CPU-wise. Unfortunately, it’s also an important cornerstone of Java and something we can’t really do without: it allows us to read files from countries around the globe in all sorts of languages and codepages. Throwing this out is not an option.

However… lazy bums that we are, we did come up with the concept of “Lazy Conversion”. Lazy conversion delays the conversion of data as long as possible, in the hope that it might never have to happen at all. Obviously, in the case where you read from a text file and write back to another one in the same format, conversion never occurs. In that particular case, we read bytes and dump them to file again without ever looking at the actual content (unless we have to).
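Here is a simplified sketch of that idea, using hypothetical classes rather than the actual CSV Input step code: a field hangs on to the raw bytes it was read with and only decodes them if a downstream step really asks for the value:

    // Hypothetical classes, not the actual CSV Input step code.
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class LazyValueSketch {
        static final class LazyValue {
            private final byte[] raw;        // the bytes exactly as read from the source file
            private final Charset encoding;  // the file's encoding, e.g. UTF-8
            private String decoded;          // only filled in if somebody asks

            LazyValue(byte[] raw, Charset encoding) {
                this.raw = raw;
                this.encoding = encoding;
            }

            // Fast path: the value goes to the output as-is, no conversion at all.
            void writeTo(ByteArrayOutputStream out) {
                out.write(raw, 0, raw.length);
            }

            // Slow path: decode only when a downstream step actually needs the String.
            String getString() {
                if (decoded == null) {
                    decoded = new String(raw, encoding);
                }
                return decoded;
            }
        }

        public static void main(String[] args) throws IOException {
            LazyValue field = new LazyValue("1234567".getBytes(StandardCharsets.UTF_8),
                                            StandardCharsets.UTF_8);
            ByteArrayOutputStream target = new ByteArrayOutputStream();
            field.writeTo(target);  // a plain byte copy; getString() was never called
            System.out.println(target.toString("UTF-8"));
        }
    }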

Let’s see how the 3.0 engine handles this.

  • When we run the exact same transformation from version 2.5.1 in version 3.0, it takes 39 seconds. (about half the time)
  • When we use the new CSV Input step and turn on Lazy Conversion, it takes about 10 seconds. (8 times faster, 6M rows/min or 360M rows/hour) The transformation used to create the image below can be found here: lazy-conversion-3.ktr

Lazy conversion sample

  • If we simply read the file, without writing it back, the CSV Input step is completely I/O bound. Since the file fits in my system’s disk cache, the transformation completes in about 2 seconds. (reading is obviously faster than writing)

Fast, faster, fastest...

Conclusion

Our “lazy conversion” algorithms show great promise in those situations where a lot of data goes through the transformation untouched. That usually includes sorts, aggregates on certain keys, etc. Because the data stays in its raw binary format all the time, serialization of objects (during sorts, to temporary files, over sockets, etc.) is also a lot faster.
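To illustrate the serialization point: with lazy conversion, writing a field to a temporary file during a sort can be as simple as a length prefix followed by the raw bytes, and reading it back leaves the value just as lazy as before. (a sketch, not the actual Kettle serialization format)

    // A sketch only, not the actual Kettle serialization format.
    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;

    public class LazySerializationSketch {
        static void writeField(DataOutputStream out, byte[] raw) throws IOException {
            out.writeInt(raw.length);  // length prefix
            out.write(raw);            // the raw bytes, untouched
        }

        static byte[] readField(DataInputStream in) throws IOException {
            byte[] raw = new byte[in.readInt()];
            in.readFully(raw);
            return raw;                // still raw: conversion stays deferred after the sort
        }

        public static void main(String[] args) throws IOException {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            writeField(new DataOutputStream(buffer), "2007/06/15".getBytes(StandardCharsets.UTF_8));
            byte[] roundTripped = readField(
                    new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
            System.out.println(new String(roundTripped, StandardCharsets.UTF_8));
        }
    }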

For the data warehouse architects and Kettle users out there, it’s another option to help you process your data as fast as possible.

Until next time,

Matt
