The Single Threader step

Dear Kettle fans,

At the end of last year, while we were doing a lot of optimization and testing around embedding Pentaho Data Integration in Hadoop, we came upon the brilliant idea of writing a single threaded engine.

The idea back then was that since Hadoop itself was already using parallelism, it might for once be more efficient to process rows of data in a single thread with minimal overhead.  This is very much like the approach that Talend takes: they generate a single threaded Java application that has very little overhead for data passing.  So an engine was written, for the Java fans materialized in the class SingleThreadedTransExecutor, to allow for that to happen.  The writing and testing were a lot of fun and, for a single threaded engine, the result is indeed very fast.  However, to make a long and tedious story short, the Pentaho Hadoop team tested the performance and found out that the regular parallel (multi-threaded) engine worked faster. (Duh!)  I guess it also has to do with the fact that if you use a single Hadoop node per server you indeed have multiple cores at your disposal.  So it might be that the test setup plays a role as well.
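
For the Java fans, driving that engine directly looks roughly like the snippet below.  Take it as a minimal sketch rather than a reference: it assumes the 4.x API (KettleEnvironment, TransMeta, Trans), assumes that oneIteration() reports whether any step still has work to do, uses a made-up file name and leaves out all error handling.

```java
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.SingleThreadedTransExecutor;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

public class SingleThreadedDemo {
  public static void main(String[] args) throws Exception {
    KettleEnvironment.init();

    // Load and prepare the transformation (the file name is just an example).
    TransMeta transMeta = new TransMeta("/tmp/my_transformation.ktr");
    Trans trans = new Trans(transMeta);
    trans.prepareExecution(null);

    // Drive all steps from this one thread instead of one thread per step.
    SingleThreadedTransExecutor executor = new SingleThreadedTransExecutor(trans);
    executor.init();

    // Each iteration lets every step process the rows waiting in its input
    // buffer; we stop once no step has any work left.
    while (executor.oneIteration()) {
      // nothing to do here: the executor calls the steps for us
    }

    executor.dispose();
  }
}
```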

Well, at that point we had an engine without a use-case which is always a bad place to end up.  So the engine risked being stuck on a one-way trip to Oblivion.

However, there is actually a use-case for the step.  Once every couple of months we get the question (from the sales-team usually, not from actual users) if it is possible to limit the number of threads or processors used in a transformation.  Up until now the answer was “No, if you have 20 steps you’ll have 20 threads, end of story”.

The new “Single Threader” step that we’re introducing, which uses the single threaded engine, changes that.  The most pressing problem this step addresses is the overhead of data passing and thread context switching.

Let’s take an example: a transformation with 100 dummy steps.  To make matters worse, the dummy steps don’t do anything, so all we’re measuring in this case is overhead:

Because this transformation uses over 100 threads on a 4-core system, a lot of thread context switching is taking place.  We also have over 100 row buffers and locks between the steps that lower performance.  Not by much, but as we’ll see it all adds up.
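
To see where that overhead comes from, here is a conceptual, plain-Java illustration of a single hop in the regular engine: a bounded row buffer with a step thread on either side of it.  This is not the actual Kettle code, just the pattern; every put and take goes through a lock and can trigger a context switch, and with 100 dummy steps you get roughly 100 of these buffers and thread pairs.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Conceptual model of ONE hop in the regular multi-threaded engine:
// a bounded row buffer with one step thread writing into it and another reading from it.
public class HopOverheadDemo {
  public static void main(String[] args) throws InterruptedException {
    final BlockingQueue<Object[]> rowBuffer = new ArrayBlockingQueue<Object[]>(10000);
    final Object[] endMarker = new Object[0];

    Thread producerStep = new Thread(new Runnable() {
      public void run() {
        try {
          for (int i = 0; i < 1000000; i++) {
            rowBuffer.put(new Object[] { i }); // lock + possible context switch
          }
          rowBuffer.put(endMarker);            // signal "no more rows"
        } catch (InterruptedException e) {
          // ignore for this demo
        }
      }
    });

    Thread consumerStep = new Thread(new Runnable() {
      public void run() {
        try {
          while (rowBuffer.take() != endMarker) { // lock + possible context switch
            // a dummy step: does nothing with the row
          }
        } catch (InterruptedException e) {
          // ignore for this demo
        }
      }
    });

    producerStep.start();
    consumerStep.start();
    producerStep.join();
    consumerStep.join();
    // With 100 dummy steps you end up with roughly 100 of these buffers and
    // thread pairs; the single threaded engine replaces them with plain method calls.
  }
}
```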

OK, now let’s put the 100 dummy steps in a sub-transformation:

For this we use one extra step, an Injector step that will accept the rows from the parent transformation:

Please note that we can execute the “Single Threader” step in multiple copies.  On my test computer I have 4 cores, so I can run it in 4 different threads.  In the “Single Threader” step we can specify the sub-transformation we defined above, as well as the number of rows we’ll pass through at once:
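
Conceptually, and simplifying quite a bit, each copy of the step gathers rows from the parent transformation into a batch of that size, hands the batch to the Injector step and then runs one pass of the single threaded engine over the sub-transformation.  The sketch below shows only that batching pattern; the names are illustrative and it is not the actual step code.

```java
import java.util.ArrayList;
import java.util.List;

// Conceptual sketch of what one copy of the "Single Threader" step does:
// collect rows coming from the parent transformation into a batch, then run
// one pass of the single threaded sub-transformation over that batch.
public class SingleThreaderBatchingSketch {
  static final int BATCH_SIZE = 1000; // the "number of rows to pass through at once"

  public static void main(String[] args) {
    List<Object[]> batch = new ArrayList<Object[]>();

    for (int i = 0; i < 10500; i++) {   // rows arriving from the parent transformation
      batch.add(new Object[] { i });
      if (batch.size() >= BATCH_SIZE) {
        singleThreadedPass(batch);
        batch.clear();
      }
    }
    if (!batch.isEmpty()) {
      singleThreadedPass(batch);        // flush the final, partial batch
    }
  }

  static void singleThreadedPass(List<Object[]> rows) {
    // Roughly speaking, the real step hands these rows to the Injector step of
    // the sub-transformation and then calls oneIteration() on the single
    // threaded engine so every step processes its waiting rows once.
  }
}
```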

When we then look at the performance of both solutions, we find that the original transformation runs in 105 seconds on my system.  The new solution completes the task in about 55 seconds: almost half the time, or nearly twice as fast.

Since this behaves very much like a Mapping or sub-transformation you can also use it as a way to execute re-usable logic. As an additional advantage it makes complex transformations perhaps a bit less cluttered.

Well, there you have it: another option to tune the performance of your transformations.  You can find this feature in new downloads from our Jenkins CI build server, or later on in 4.2.0-RC1.

Until next time,

Matt

5 comments

  • Brandon Maness

    After reading your post I was excited to see what difference refactoring 8 of 10 sub-transformations from a moderately complex transformation would yield. I should note that this is probably a best case scenario because all of the sub-transformations were compatible with single threaded mode and were primarily focused on processing the row coming in with minimal external lookups being used. Here’s what I noticed after testing the 2 variations (mapping vs single) a few times.

    Threads
    167 threads during peak (+154 vs single)
    12.84 or 1,284% decrease in threads used

    RAM
    120 MB RAM used (+80 MB vs single)
    3 or 300% decrease in RAM used

    Handles
    23,425 handles during peak (+1,166 vs single)
    1.05 or 105% decrease in handles used

    Total time to process 110k rows
    3 min 41 s vs 1 min 6 s (155 extra seconds to process)
    3.34 or 334% decrease in time

    I for one am very impressed with the results of about 30 minutes of re-factoring. I’d also like to note that I thought the Injection/Dummy step implementation was more streamlined compared to the Input/Output Mapping, but would like to add that it might be nice not to have to map a retrieval step if there isn’t one. A dummy step fixes the problem, so it’s not a big deal either way.

    That being said, Great job adding such an elegant solution that (in my test case) yielded a 300% improvement in total processing time, while using fewer resources to boot! Nice.

  • Brandon Maness

    edit: handles show a 5% decrease not 105%

  • Great to hear about these results, Brandon.
    Just for the record, you can use any step to read from, not just a dummy.
    Having support for headless or tailless sub-transformations makes sense indeed but we had to start somewhere. 😉

    Cheers,
    Matt

  • Felipe Torres

    Very interesting, Matt!
    Congrats!
    I’ll test this with more than one Single Threader, dividing the number of steps between them.
    There should be an ideal combination of the number of threads and the number of steps, shouldn’t there?

    Cheers,

    Felipe Torres de Oliveira

  • Indeed it gives you more control over the number of threads used in the transformation.
    Have fun with it,
    Matt