Pentaho Data Integration vs Talend (part 1)

Hello data integration fans,

In the course of the last year or so there have been a number of benchmarks on blogs here and there that claimed certain “results” pertaining to the performance of Talend and Pentaho Data Integration (PDI, a.k.a. Kettle).  I usually tend to ignore these results, but a recent benchmark got so far off track that I finally had to react.

Benchmarking itself is a very time-consuming and difficult process and in general I advise people to do their own.  That being said, let’s attack the first item that appears in most of these benchmarks: reading and copying a file.

Usually the reasoning goes like this: we want to see how fast a transformation can read a file and how fast it can write it back to another file.  I guess the idea behind it is to get a general sense of how long it takes to process a file.

Here is a PDF that describes the results I obtained when benchmarking PDI 3.1.0GA and TOS 3.0.2GA.  The specs of the test box etc. are also in there.

Reading a file

OK, so how fast can you read a file, how scalable is that process and what are the options?  In all situations we’re reading a 2.4GB test file with random customer data.  The download location is in the PDF on page 4 and elsewhere on this blog.

Remember that the architectures of PDI and Talend are vastly different so there are various options we can set, various configurations to try…

1) Simply reading with the “CSV File Input” step, lazy conversion enabled, 500K NIO buffer (see the sketch below): 150.8 seconds on average for PDI.  Talend performs this in 112.2 seconds.

2) This test configuration is identical to 1) except that PDI now runs 2 copies of the input step.  Result: 94.2 seconds for PDI.  This test is not possible in Talend since the generated Java software is single threaded.
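
For those wondering what that “NIO buffer” parameter actually does, here is a simplified Java sketch of reading a file through an NIO channel with a fixed-size buffer.  This is an illustration only, not the actual PDI code, and the file name is made up:

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;

    public class NioReadSketch {
        public static void main(String[] args) throws IOException {
            // The buffer size corresponds to the "NIO buffer" setting: 500K in test 1.
            ByteBuffer buffer = ByteBuffer.allocateDirect(500 * 1024);
            // "customers.txt" is a placeholder for the 2.4GB test file.
            FileChannel channel = new FileInputStream("customers.txt").getChannel();
            long total = 0;
            int n;
            while ((n = channel.read(buffer)) != -1) {
                buffer.flip();
                // ... scan the buffer for delimiters and newlines here ...
                total += n;
                buffer.clear();
            }
            channel.close();
            System.out.println("Read " + total + " bytes");
        }
    }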

Reading a delimited file, time in seconds, lower is better

There is a certain scalability advantage in being able to read and process files on multiple CPUs and even multiple systems across a SAN, and there is a serious limitation in Talend since it can’t do that.  A 19% speed advantage for PDI is inconsequential for simple reads but brutal for more complex situations, very large files and/or lots of CPUs/systems involved.  For example, we have customers that read large web log files in parallel over a high speed SAN across a cluster of 3 or 4 machines.  Trust me, a SAN is typically faster than what any single box can process.

Writing back the data

The test itself is kinda silly, but since it is being carried around in the blogosphere, let’s set a reference: a copy command.  I simply copied the file from my internal disk to an external USB 2.0 disk and timed the duration, which set a reference time of 122.2 seconds.  (For the exact configurations see the PDF.)

3) Since reading in parallel is the fastest option for PDI, we retain that option.  Then we write the data back to a single target file.  PDI handles this in 196.2 seconds.  Talend can’t read in parallel so we don’t have any results there.

4) A lot of times, these newly generated text files are just temporary files for upstream processes.  As such it might (or might not) be possible to create multiple files as the target.  This would increase the parallelism of both this task and the upstream tasks (see the sketch below).  PDI handles this task in 149.3 seconds.  Again I didn’t find any parallelization options in TOS.

5) Since neither 3) nor 4) is possible in Talend, I tried the single delimited reader/writer approach.  That one ran for 329.4 seconds.
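
To illustrate the multiple-target idea from test 4: each writer copy simply owns its own output file, so several threads can write at the same time without contending for a single file.  A minimal sketch (again, not the actual PDI code; the names are made up):

    import java.io.BufferedWriter;
    import java.io.FileWriter;
    import java.io.IOException;

    public class ParallelTargets {
        // Opens one target file per step copy, e.g. customers_0.txt, customers_1.txt, ...
        public static BufferedWriter[] openTargets(String baseName, int nrCopies) throws IOException {
            BufferedWriter[] writers = new BufferedWriter[nrCopies];
            for (int i = 0; i < nrCopies; i++) {
                writers[i] = new BufferedWriter(new FileWriter(baseName + "_" + i + ".txt"));
            }
            return writers;
        }
        // Reader copy i then streams its rows straight into writers[i]:
        // no shared file, no synchronization, at the cost of producing several files.
    }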

Reading/writing a delimited file, time in seconds, lower is better

CPU utilisation

I also monitored the CPU utilisation of the various Talend jobs and Kettle transformations and came to the conclusion that Talend will never utilize more than 1 CPU, while Kettle uses whatever it needs and can get its hands on.  In the single threaded scenario, the CPU utilisation is on par with the delivered performance of both tools; there doesn’t seem to be any large difference in efficiency.

Conclusion

Talend wins in the first test with their single threaded reading algorithm.  I think their overhead is lower because they don’t run in multiple threads.  (Don’t worry, we’re working on it :-))  In all other, more complex situations, where you can indeed run multiple threads, there is a severe performance advantage to using Kettle.  In the file reading/writing department, for example, PDI running 3 threads with lazy conversion beats Talend by being more than twice as fast in the best case scenario and 65% faster in the worst case.

Please remember that my laptop is not by any definition “high end” equipment, and that dual and quad core CPUs are commonplace these days.  It’s important to be able to use them properly.

The source code please!

Now obviously, I absolutely hate it when people post claims and benchmarks without backing them up.  Here you can find the PDI transformations and TOS jobs that were used.  With a little bit of modification I’m sure you can all run your own tests.  Just remember to be critical, even concerning these results!  Trust me when I say I’m not a TOS expert :-)  Who knows, perhaps I used a wrong setting in TOS here or there.  All I can say is that I tried various settings and that these seemed the fastest for TOS.

Remember also that if even a simple “file copy” can be approached with various scenarios, the same certainly goes for more complex situations.  The other tools out there deserve that much credit as well.  Just because Talend can’t run in multiple threads doesn’t mean that Informatica, IBM, SAP and the others aren’t capable of doing so.

If I find the time I’ll post a part 2 and 3 later on.  Feel free to propose your own scenarios to benchmark as well.  Whatever results come of it, it will lead to the betterment of both open source tools and communities.

Until next time,
Matt

18 comments

  • Pingback: Pentaho Data Integration (Kettle) V Talend Benchmark « Gobán Saor

  • Matt – I think you’ve hit the nail on the head.  One of the things I alluded to in my blog was that what really matters is that it scales (which Kettle does, with its clustering, partitioning, out-of-the-box pipeline parallelism and further configurable parallelism).

    Kettle IMHO made the correct choice.  Why spend developer hours streamlining every bit of every string parsing algorithm?  I don’t care if it’s faster (by marginal amounts) as long as it scales.

    Nice work!

  • Are you saying that reading CSV files is CPU-bound?

  • Daniel, as I demonstrate in this test, file reading is NOT CPU bound on my system.
    Then again, my laptop disk can only deliver 35-40MB/s.  Most “real” data warehouse systems would probably have a NAS, SAN, Fibre Channel or at least a SCSI system to work with.
    In most of the large data warehouse projects I did, reading CSV files was indeed CPU bound.  Heck, I had Oracle go CPU-bound with 4 CPUs when it read at 500MB/s or thereabouts.  (HP XP storage subsystem or something like that, I can’t remember the details.)

    The challenge these days is the parallelization to make it scale across multiple CPUs and hosts.

  • By the way, if you set the NIO buffer size to a very LOW value (5000 or something like that) PDI runs test 1 in 95 seconds. I think it has to do with the Linux disk-subsystem, latency, caching, whatever. Again: everyone should do their own tests to find their own version of the truth.

  • Sorry to bug you, but if you are able to go faster (even if just a little bit) using two readers instead of one, then isn’t it an indication that you were partially CPU-bound?

    If you were entirely I/O bound, then adding an extra core to the job would not help. Would it?

    Also, sorry if I am an idiot, but given that you have a single disk, how do you share the parsing?  I mean, do you have one reader read the second half of the file… or do you alternate, one reader reading one line and the other reading another line and so on…

    I have no idea how anyone can parse a CSV file with more than one thread. I know I must sound very naive here…

    Do you require that the order of the lines be preserved once they are loaded?

    (I am planning to teach a university-level course in DW next year, and I will have a week or so on Pentaho… so you are maybe not wasting entirely your time on me.)

  • No, I think there is some latency involved in the disk reading, perhaps something regarding disk spinning?  Linux buffering?  I have no idea actually.
    I tried a few settings and in the end it was just a question of picking the right buffering strategy.  As mentioned above, if you pick a lower buffer size, we’re just a little bit faster than Talend with a single thread as well.

    The thing is that the second thread kept the disk subsystem as busy as it possibly could.

    As far as parallel CSV reading goes, that was a bit tricky to set up…  The size of the file is measured and divided over the number of threads.  Each thread knows at that point what part of the file to read, since they all know what number they have and how many there are, even across a cluster.  The idea is then to skip ahead to the next newline/CR and continue reading until you hit your size limit (see the sketch below).  There are some borderline cases to handle, but that’s the general idea.
    As documented, the whole idea obviously only works with files that don’t have newlines embedded in individual fields (we do support those in single threaded mode).
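
    Here is a small sketch in simplified Java of how a reader copy could find its starting point.  This is not the actual Kettle code and it glosses over the borderline cases:

        import java.io.IOException;
        import java.io.RandomAccessFile;

        public class SplitSketch {
            // Where should copy 'copyNr' out of 'nrCopies' start reading?
            static long findStart(RandomAccessFile file, int copyNr, int nrCopies)
                    throws IOException {
                long size = file.length();
                long offset = size * copyNr / nrCopies;  // proportional starting point
                if (copyNr == 0) return 0;               // the first copy starts at byte 0
                file.seek(offset);                       // a disk seek, nothing is parsed
                int b;
                while ((b = file.read()) != -1 && b != '\n') {
                    offset++;                            // skip the partial line...
                }
                return offset + 1;                       // ...and start right after the newline
            }
            // Copy 'copyNr' then reads rows from findStart(copyNr) up to findStart(copyNr + 1).
        }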

    Preservation of the order of the lines is not always possible.  If you have a sequence number in the file, a timestamp, a sequential id, or if the file is sorted on some key, you can use a “Sorted Merge” step to stitch the rows back together, however (a simplified sketch follows below).  If the file is not sorted, you probably don’t want to preserve the order anyway :-)

    It’s not a common requirement though.  Usually you want to parse in parallel as much as possible, not put the rows back in order and cause a serial bottleneck.
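
    For the curious, the sorted merge itself boils down to a classic k-way merge.  A simplified illustration, not the actual Kettle step:

        import java.util.ArrayList;
        import java.util.Iterator;
        import java.util.List;
        import java.util.PriorityQueue;

        public class SortedMergeSketch {
            // Holds the current head row of one input stream.
            private static class Head<T extends Comparable<T>> implements Comparable<Head<T>> {
                final T row; final Iterator<T> stream;
                Head(T row, Iterator<T> stream) { this.row = row; this.stream = stream; }
                public int compareTo(Head<T> other) { return row.compareTo(other.row); }
            }

            // Merges N streams that are each already sorted on the same key.
            public static <T extends Comparable<T>> List<T> merge(List<Iterator<T>> streams) {
                PriorityQueue<Head<T>> heads = new PriorityQueue<Head<T>>();
                for (Iterator<T> s : streams) {
                    if (s.hasNext()) heads.add(new Head<T>(s.next(), s)); // prime every stream
                }
                List<T> out = new ArrayList<T>();
                while (!heads.isEmpty()) {
                    Head<T> lowest = heads.poll();        // the smallest key across all streams
                    out.add(lowest.row);
                    if (lowest.stream.hasNext()) {
                        heads.add(new Head<T>(lowest.stream.next(), lowest.stream));
                    }
                }
                return out;
            }
        }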

    Good questions, take care,

    Matt

  • Leandro Concon

    Matt,
    I would like to congratulate you on the publication of this benchmark.
    In my opinion Kettle is a powerful tool, backed by skilled people who are attentive to continuously improving it.
    I would also like to suggest that it would be interesting to create a benchmark section on the tool’s own site, so we could view tests performed in different environments and in different ways.

    Well, I am trying to view the source that was made available, but perhaps there was trouble exporting it: the Kettle files are zero bytes.

    Congratulations!!! ;)

  • >> The thing is that the second thread kept the disk subsystem as busy as it possibly could.

    That is what I meant by CPU-bound. That is, you must throw a second core to keep the I/O system busy. One core is not enough.

    Have you tried the same thing with a SSD drive? If I really wanted to load up CSV files fast, I’d store them on SSDs.

    >> As far as parallel CSV reading goes, that was a bit tricky to set up…  The size of the file is measured and divided over the number of threads.  Each thread knows at that point what part of the file to read, since they all know what number they have and how many there are, even across a cluster.  The idea is then to skip ahead to the next newline/CR and continue reading until you hit your size limit.  There are some borderline cases to handle, but that’s the general idea.

    I still do not get it. Suppose that I have the following CSV data:

    a1, a2, a3

    b1, b2, b3

    c1, c2, c3

    d1, d2, d3

    Now, the first thread starts and reads the first line.  What the heck can the second thread do?  It cannot jump and read b1, b2, b3 because it does not know where b1 starts before the first thread has read a3.

    What I am guessing is this. You start the first thread at a1. Meanwhile, you have the second thread jump forward in the file, at maybe c2… then it starts reading… of course, the first line it reads is wasted, but you keep going.

    That’s nice, but you have some overhead. For example, after each line the first thread reads, it must check whether it reached the area read by thread two.

    Then, also, you risk having your disk go crazy because you ask for a lot of random access. Of course, this would not be happening with SSD.

  • >> That is what I meant by CPU-bound. That is, you must throw a second core to keep the I/O system busy. One core is not enough.

    I think it is with the right settings in this case.  Personally I think it has more to do with disk latency.  However, for argument’s sake, it’s beside the point.  Like you said, if you would jump to the new types of SSD or very fast disk subsystems, one or even 2 cores would not be enough to process the +200MByte/s of throughput.  I’ve seen one particular situation (can’t disclose) where we needed 10 servers with 4 cores each to read the data at peak performance (a single fixed width flat file): 2GByte/s.  Of course it goes without saying that 20Gbit/s disk subsystems don’t come cheap and are not always readily available to us for testing :-)

    >> Now, the first thread starts and reads the first line.  What the heck can the second thread do?  It cannot jump and read b1, b2, b3 because it does not know where b1 starts before the first thread has read a3.

    That’s exactly what it does: jump to the vicinity of line 3.  The algorithm is not line based anyway, it’s size based.
    Positioning is done with a disk seek, not by actually reading the file.  In the worst case the disk heads have to reposition; no other time or CPU cycles are wasted on it.
    To prevent the disk from going “crazy” as you put it, it’s better to read in larger blocks. That is why the second test has a large NIO buffer size set.

    Obviously, looking at it from the position of a single disk spindle is a bit simplistic, since most enterprise data these days would be stored on disk subsystems, RAID-5, mirrored disks, etc.

    By the way, it’s a different subject, but you would need very new SSD drives to surpass a recent hard drive performance, especially when it comes to writing. Up until now, SSD has been marketing, nothing more. Check out Linus’ comments on the subject: http://torvalds-family.blogspot.com/2008/10/so-i-got-one-of-new-intel-ssds.html

  • Thanks for all the insight.

    Regarding the limitations: those show up when you use random writes; see my blog post:

    http://www.daniel-lemire.com/blog/archives/2008/02/02/random-write-performance-in-solid-state-drives/

    But if you are reading a CSV file, you are good. Of course, writing out a transformed output randomly on disk will get you in trouble. But thankfully, you are not doing that (I hope you are loading it up into a table or something).

  • “PDI expert writes benchmark that says PDI is faster” ;-)

    By the way, it would be much more interesting to compare PDI to proprietary solutions (Datastage, Informatica…).

    As I usually say, PDI & Talend are not competitors…

    Fabrice

  • You are absolutely right Fabrice! Then again, I didn’t get this thing started. ;-)

    Again, I think we need to define these tests in an open source way and find some DS, INFA, BODI, OWB, Talend, PDI experts to execute the tests.
    That sort of thing needs to happen on common grounds so I’m looking for a “neutral” site to host the benchmark suites and results.

    What do you think of that idea?

    All the best,

    Matt

  • Pingback: Parsing text files is CPU bound

  • Matt,

    take a look at this comment I put over on Vincent’s blog and tell me what you think about it.  I have to make these transformations available regardless, but as I read more comments about these benchmarks, the more I think that having a set of real world transformations to work against might be interesting for all the different vendors.

    http://it.toolbox.com/blogs/infosphere/was-the-manapps-etl-benchmark-test-flawed-or-baised-28697#2498479

  • Pingback: Parsing CSV files is CPU bound: a C++ test case

  • Pingback: Parsing CSV files is CPU bound: a C++ test case (Update 1)

  • Pingback: Comparando Kettle y Talend « Guía Mundial de Países