The Single Threader step
Dear Kettle fans,
At the end of last year, while we were doing a lot of optimization and testing around embedding Pentaho Data Integration in Hadoop, we came upon the brilliant idea of writing a single-threaded engine.
The idea back then was that, since Hadoop itself was already using parallelism, it might for once be more efficient to process rows of data in a single thread with minimal overhead. This is very much like the approach that Talend takes: they generate a single-threaded Java application that has very little overhead for data passing. So an engine was written to allow for that to happen (for the Java fans: it materialized in class SingleThreadedTransExecutor). The writing and testing were a lot of fun, and for a single-threaded engine the result is indeed very fast. However, to make a long and tedious story short, the Pentaho Hadoop team tested the performance and found out that the regular parallel (multi-threaded) engine worked faster. (Duh!) I guess it also has to do with the fact that if you use a single Hadoop node per server you do have multiple cores at your disposal. So the test setup might play a role as well.
Well, at that point we had an engine without a use-case which is always a bad place to end up. So the engine risked being stuck on a one-way trip to Oblivion.
However, there is actually a use-case for the engine. Once every couple of months we get the question (from the sales team usually, not from actual users) whether it is possible to limit the number of threads or processors used in a transformation. Up until now the answer was “No, if you have 20 steps you’ll have 20 threads, end of story”.
The new “Single Threader” step that we’re introducing, which uses the single-threaded engine, changes that. The most pressing problem that this step solves is the overhead of data passing and thread context switching, which it reduces.
Let’s take an example: a transformation with 100 dummy steps. To make matters worse, the dummy steps don’t do anything, so all we’re measuring in this case is overhead:
Because this transformation uses over 100 threads on a 4-core system, a lot of thread context switching is taking place. We also have over 100 row buffers and locks between the steps that lower performance. Not by much, but as we’ll see it all adds up.
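To make that overhead a bit more tangible, here is a minimal plain-Java sketch of the thread-per-step model. It is not Kettle code: the class name ThreadPerStepSketch, the END marker row and the use of ArrayBlockingQueue as a row buffer are all assumptions made for illustration. Every pass-through step gets its own thread, and every hop between two steps goes through a locked, bounded buffer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative only: a hypothetical stand-in for "one thread per step", not Kettle's engine.
final class ThreadPerStepSketch {
    // Sentinel row that marks the end of the stream (a convention of this sketch only).
    private static final Object[] END = new Object[0];

    // A "dummy step": copies rows from its input buffer to its output buffer
    // until the end marker has been forwarded.
    private static void passThrough(BlockingQueue<Object[]> in, BlockingQueue<Object[]> out) {
        try {
            Object[] row;
            do {
                row = in.take();
                out.put(row);
            } while (row != END);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Sends rowCount rows through stepCount dummy steps and returns how many rows came out.
    static long run(int stepCount, int rowCount, int bufferSize) {
        // stepCount steps need stepCount + 1 buffers: one before each step and one after the last.
        List<BlockingQueue<Object[]>> buffers = new ArrayList<>();
        for (int i = 0; i <= stepCount; i++) {
            buffers.add(new ArrayBlockingQueue<>(bufferSize));
        }

        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < stepCount; i++) {
            BlockingQueue<Object[]> in = buffers.get(i);
            BlockingQueue<Object[]> out = buffers.get(i + 1);
            threads.add(new Thread(() -> passThrough(in, out)));
        }

        // One more thread generates the rows and then signals the end of the stream.
        BlockingQueue<Object[]> first = buffers.get(0);
        threads.add(new Thread(() -> {
            try {
                for (int i = 0; i < rowCount; i++) {
                    first.put(new Object[] { i });
                }
                first.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }));
        for (Thread t : threads) {
            t.start();
        }

        // The calling thread drains the last buffer and counts the rows.
        BlockingQueue<Object[]> last = buffers.get(stepCount);
        long count = 0;
        try {
            while (last.take() != END) {
                count++;
            }
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the steps", e);
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(run(100, 100_000, 1000) + " rows passed through 100 dummy steps");
    }
}
```

Nothing useful happens to the rows here, so any time this takes is spent on thread scheduling and on the take/put calls of the buffers, which is the kind of cost described above.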
OK, now let’s put the 100 dummy steps in a sub-transformation:
For this we use 1 extra step, an Injector step that will accept the rows from the parent transformation:
Please note that we can execute the “Single Threader” step in multiple copies. On my test-computer I have 4 cores so I can run in 4 different threads. In the “Single Threader” step we can specify the sub-transformation we defined above as well as the number of rows we’ll pass through at once:
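For readers who like to see the idea in code, here is a rough plain-Java sketch of what "several copies, each pushing batches of rows through the whole sub-transformation in one thread" could look like. It is not the actual Single Threader implementation: the class SingleThreaderSketch, its methods and the round-robin row distribution are assumptions chosen to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.UnaryOperator;

// Illustrative only: hypothetical names, not the Kettle API.
final class SingleThreaderSketch {

    // Runs one batch through every step of the "sub-transformation" on the calling thread.
    // Steps are plain method calls, so there are no buffers or locks between them.
    static List<Object[]> processBatch(List<Object[]> batch, List<UnaryOperator<Object[]>> steps) {
        List<Object[]> current = batch;
        for (UnaryOperator<Object[]> step : steps) {
            List<Object[]> next = new ArrayList<>(current.size());
            for (Object[] row : current) {
                next.add(step.apply(row));
            }
            current = next;
        }
        return current;
    }

    // Starts `copies` threads. Each copy takes every copies-th row (round-robin),
    // collects up to batchSize rows and then processes them in one go.
    // Returns the total number of rows that came out of all copies.
    static long run(int copies, int batchSize, int rowCount, List<UnaryOperator<Object[]>> steps) {
        AtomicLong processed = new AtomicLong();
        Thread[] workers = new Thread[copies];
        for (int c = 0; c < copies; c++) {
            final int copy = c;
            workers[c] = new Thread(() -> {
                List<Object[]> batch = new ArrayList<>(batchSize);
                for (int i = copy; i < rowCount; i += copies) {
                    batch.add(new Object[] { i });
                    if (batch.size() == batchSize) {
                        processed.addAndGet(processBatch(batch, steps).size());
                        batch.clear();
                    }
                }
                // Flush the last, partially filled batch.
                if (!batch.isEmpty()) {
                    processed.addAndGet(processBatch(batch, steps).size());
                }
            });
            workers[c].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the copies", e);
        }
        return processed.get();
    }

    public static void main(String[] args) {
        // 100 do-nothing steps, comparable to the dummy steps in the example.
        List<UnaryOperator<Object[]>> steps = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            steps.add(UnaryOperator.identity());
        }
        System.out.println(run(4, 1000, 1_000_000, steps) + " rows processed by 4 copies");
    }
}
```

With 4 copies only 4 worker threads exist no matter how many steps the sub-transformation has, and the batch size plays the role of the "number of rows we'll pass through at once" setting.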
When we then look at the performance of both solutions, we find that our original transformation runs in 105 seconds on my system. The new solution completes the task in about 55 seconds. That is a speed-up of roughly 1.9x (105 / 55), so it takes almost half the time.
Since this behaves very much like a Mapping or sub-transformation, you can also use it as a way to execute re-usable logic. As an additional advantage, it perhaps makes complex transformations a bit less cluttered.
Until next time,