Data vs Metadata: Kettle 3.0

A few weeks ago we started work on Kettle 3.0.
One of the notable changes is a redesign of the internal data engine.
More to the point, we’re aiming for a strict separation of data and metadata.

The main reasons for doing so were to reduce object allocation and to allow us to extend the metadata without a performance impact.
We anticipated a performance gain here and there, and initial test-code gave us hope for a 15-20% increase in performance.
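
To make the idea concrete, here is a minimal Java sketch of what such a separation can look like (the class and field names are illustrative only, not the actual Kettle 3.0 API): one metadata object describes every row in a stream, and each row is a bare Object[], so no per-value wrapper objects have to be allocated and garbage-collected.

    // Hypothetical metadata for a whole stream of rows, allocated once.
    class RowMetaSketch {
        final String[] fieldNames;   // e.g. { "name", "firstname" }
        final int[] fieldTypes;      // e.g. { TYPE_STRING, TYPE_STRING }

        RowMetaSketch(String[] names, int[] types) {
            fieldNames = names;
            fieldTypes = types;
        }

        // Look up a field's position once; rows are then accessed by index.
        int indexOf(String name) {
            for (int i = 0; i < fieldNames.length; i++) {
                if (fieldNames[i].equals(name)) return i;
            }
            return -1;
        }
    }

    public class Demo {
        static final int TYPE_STRING = 2;

        public static void main(String[] args) {
            RowMetaSketch meta = new RowMetaSketch(
                new String[] { "name", "firstname" },
                new int[] { TYPE_STRING, TYPE_STRING });

            // Each row is a plain Object[]; the metadata is shared by all
            // rows, so nothing extra is allocated per value, per row.
            Object[] row = { "Casters", "Matt" };

            int idx = meta.indexOf("firstname");
            System.out.println(meta.fieldNames[idx] + " = " + row[idx]);
        }
    }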

Who could have guessed that reading this file, sorting it and writing it back to disk would turn out to be more than 5 times faster? (4.83 seconds!!!)  Granted, we’re taking the opportunity to do a full code review, and if we spot a possible performance problem, we fix it too.

We’re also building a library of test-cases to run performance and regression tests against the version 2.5 code.  The results, with a comparison (speedup calculation), are posted here.
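
For the curious, the speedup numbers are simply the 2.5 runtime divided by the 3.0 runtime for the same transformation. Below is a minimal Java sketch of that kind of timing harness; the workload and figures are placeholders, not the published results.

    public class SpeedupSketch {
        // Time a task and return elapsed seconds.
        static double time(Runnable task) {
            long start = System.nanoTime();
            task.run();
            return (System.nanoTime() - start) / 1e9;
        }

        public static void main(String[] args) {
            // Placeholder workload standing in for "run the transformation".
            Runnable workload = () -> {
                long sum = 0;
                for (int i = 0; i < 50_000_000; i++) sum += i;
            };
            double oldSecs = time(workload); // would be the 2.5 engine run
            double newSecs = time(workload); // would be the 3.0 engine run
            System.out.printf("Speedup: x%.2f%n", oldSecs / newSecs);
        }
    }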

One of the nice things about the code changes is that although they will break the API, it’s a Good Thing(TM) for the project in the long run.  It will give us breathing room to keep innovating in the future.  Speeding up steps by between 15% and 1700% is a good start for the first 20 steps that have been converted.  It’s also nice to see a lot of test-cases double or triple in speed:

  • Select Values: x9 (Random select), x6 (Delete field), x5.5 (Change metadata)
  • Calculations: x1.8 - x3.6
  • Add sequence: up to x3
  • Table Output: x1.15 - x2
  • Add constant values: x5
  • Filter rows: x1.5
  • Row generator: x2.5 - x3 (up to 1.2M empty rows/s)
  • Sort rows: x1.15 - x5
  • Stream Lookup: x1.24 - x1.36 (and this step already got some serious tuning in the past)
  • Table Input: x1.2
  • Text File Input: x1.6 - x3.4
  • Text File Output: x1.75 - x2.9

On top of that, memory usage should have been reduced as well, although we don’t have any exact figures on that.

Until next time,

Matt


2 comments

  • Joris

    Interesting!

    I’m trying to build the same job with Talend to make a benchmark. I downloaded your input file without any problem. But I need to know your sort criteria and a little bit about the hardware you used.

    Regards,
    Joris

  • The test was run on my poor 2.0 GHz Pentium M laptop. The sort criteria are mentioned in the test: name and firstname, alphabetic, ascending. All rows are sorted in memory (110k row buffer).
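
    For illustration, here is a minimal Java sketch of that sort order, assuming the rows are held in memory as plain string arrays (an illustrative example, not the actual Kettle sort code):

        import java.util.Arrays;
        import java.util.Comparator;

        public class SortSketch {
            public static void main(String[] args) {
                // Each row is { name, firstname }.
                String[][] rows = {
                    { "Smith", "Zoe" },
                    { "Jones", "Amy" },
                    { "Smith", "Al"  },
                };
                // Alphabetic, ascending: name first, then firstname.
                Arrays.sort(rows, Comparator
                    .comparing((String[] r) -> r[0])
                    .thenComparing(r -> r[1]));
                for (String[] r : rows) {
                    System.out.println(r[0] + ", " + r[1]);
                }
            }
        }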

    I couldn’t for the life of me figure out how to export the metadata from TOS, so here is the generated “documentation”: http://www.kettle.be/dloads/readSortWrite.zip

    Please note that I couldn’t figure out how to read date/time fields in TOS, so I used the “Day” data type; I’m guessing that the precision is a day. Also note that I couldn’t figure out how to set the desired output format for the fields, so I left it at the default.
    That is in comparison with the situation in Kettle, where you have full control over the layout, encoding, localisation, etc.

    BTW, the timing I did does not include the code-generation in TOS either.

    HTH,

    Matt