PDI cloud: massive performance roundup

Dear Kettle fans,

As expected there was a lot of interest in cloud computing at the MySQL conference last week. It felt really good to be able to pass the Bayon Technologies white paper around to friends, contacts and analysts. It's one thing to demonstrate a certain scalability on your blog; it's another thing entirely to have a smart man like Nicholas Goodman do the math.

Sorting massive amounts of rows is a hard problem to take on. Making it scale on low-cost EC2 instances is interesting as it proves a certain level of scalability. Nick ran 40 EC2 nodes in parallel to do the work and saw that it was good. 450,000 rows/s for US$4.00/hour is not bad. Note: the tests sort 300M, 600M and 1.8B line-item rows from TPC-H (scale factors 50, 100 and 300 respectively).
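
For a rough sense of the economics, here is a quick back-of-the-envelope check of those numbers. It's a sketch only: it assumes the US$4.00/hour corresponds to 40 small EC2 instances at US$0.10/hour each, and actual prices vary.

```python
# Back-of-the-envelope check of the EC2 sort figures quoted above.
ROWS = 1_800_000_000        # TPC-H line items at scale factor 300
THROUGHPUT = 450_000        # rows per second across the whole cluster
NODES = 40
PRICE_PER_NODE_HOUR = 0.10  # US$ per small EC2 instance per hour (assumed)

seconds = ROWS / THROUGHPUT                 # ~4,000 s, a bit over an hour
hours = seconds / 3600.0
cost = hours * NODES * PRICE_PER_NODE_HOUR  # roughly US$4.4 for the big run

print(f"{seconds:,.0f} s ({hours:.2f} h), about US${cost:.2f} in total")
```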

The paper certainly seemed to make it easier for me to point to PDI's scalability, and it opened some doors for further testing on big iron at Sun Microsystems. It was great to talk to so many people. I even walked up to the Amazon Web Services booth at the expo to ask about the performance bottleneck in EBS that was exposed by the white paper. "It's being worked on" was the reply :-)

The most interesting thing about the PDI cloud integration work is that there don't seem to be a lot of other ETL tool vendors doing it. In fact, after a Google search or two I could only find Informatica with a SaaS (not even IaaS) offering, and I kinda doubt that closed source software is a good match for cloud computing.

So I went out there and did a presentation on the subject to explain to people how they can set it up for themselves. The open source way is not only to do the marketing but also to let people run their own tests and see for themselves. That way you get valuable feedback to improve your offering.

Here is a copy of the presentation I gave: Cloud Computing with MySQL and Kettle.

I thought it was a good session, although for once I didn't get "The Question", you know, the one where people ask me how Kettle is different from Talend and where I get to comment on their lack of scalability. Oh well, I guess you can't win them all :-)

Finally, people have been asking me about integration with both SQLStream on the one hand and MapReduce/Hadoop/Hive/HDFS on the other hand.  I’m happy to say that the former is in progress and that I’ve started talks with the fine folks from Cloudera to get started on the latter.  I simply loved Aaron Kimball’s tutorial @ MySQL Conf on the MapReduce subject and think that there is a lot of potential for integration with PDI to make us scale even better.

Until next time,

Matt

13 comments

  • Hi Matt,

    These results are great! Good job…

    About Talend, your answer was good… 3 years ago!
    Since version 2.1, we have had some interesting features like Grid Computing and Massive Parallelization components that scale quite well.

    We ran some tests (using the same data as you, TPC-H, but only on 8 nodes) a couple of weeks ago; you can find our results here:

    http://blogs.sun.com/aja/entry/talend_s_new_data_processing

    Once again, our competitors are Informatica, IBM DataStage & Oracle DI, certainly not PDI!

    Regards,
    Fabrice
    Talend COO

  • Hi Fabrice,

    Software available under an open source license makes a lot of sense when you talk about massively parallel computing and cloud computing.
    Especially cloud computing, where computing time is cheap but you need a lot of hosts to run at a decent speed.

    Just for completeness, is your Grid Computing offering also available under the GPL license, like the rest of Talend?

    All the best,

    Matt

  • Hi Matt,
    >Just for completeness, is your Grid Computing offering also available under the GPL license, like the rest of Talend?
    As you know, Grid Computing is included in Talend Integration Suite and we distribute it under the Talend Integration Suite license. That license provides indemnification and guaranteed SLAs for support (with unlimited tickets)… important features that are not included in the GPL license but that our customers ask for every time.
    Of course, you can't redistribute Talend Integration Suite for free (we have an OEM program for that). Since the beginning, our business model has not been focused on services: we are a software vendor, not a services company. We spend all our time building the best data integration and data quality platform, and part of this platform (the Enterprise part) requires a commercial subscription.
    According to the slides attached to this post, Pentaho recently switched to the same model with PDI Enterprise Edition. I write "recently" because it's the first time I've seen this, and a Google query for site:pentaho.com "PDI EE" returns only results from your bug tracker, nothing on your web site!
    As you explain on slide 3, you have a PDI CE (aka Kettle) and a PDI EE with some advanced features like the Management Services Console, documentation, support…
    The only result for "PDI Enterprise Edition" on Google is a post from you:
    “The management side of it all (Sven’s JIRA case) is being worked on as well and the first part will be released under the name of the PDI Management Services console.
    That console will be closed source software for now, offered to customers of our PDI enterprise edition version.”
    Perhaps I don't correctly understand your slides / your forum post or your new business model, but you now have a PDI Community Edition under an open source license and a closed source product called PDI Enterprise Edition (and we have to pay to use this closed source software): don't hesitate to correct me!

    @Fabrice,
    I don't agree with you ;) For parallel computing, our competitors are only Ab Initio and DataStage PX.
    IMHO, ODI, DataStage Server… cannot reach 1 million rows per second reading 7.4 GB of data, sorting all 60 million rows and writing the sorted file on a dual-CPU quad-core box!
    If you want more figures or the complete description of this benchmark, check it out on the Sun Microsystems web site: http://blogs.sun.com/aja/entry/talend_s_new_data_processing !

    All the best,
    -cedric
    Talend CTO
    http://www.talend.com

  • Cedric, thanks for the lengthy explanation, but since, relatively speaking, nobody reads my blog anyway (only a few hundred people every day), a simple "no" would have been enough :-)

    I guess another difference between Talend and Kettle is that with Kettle ALL functionality is included under an LGPL license. Redistribution or OEM is possible without any problems. Only the enterprise *management* features are closed source. That means that if you are interested in plugging a TOS step into Kettle for cloud computing reasons, you are free to do so. (for example :-))

    Just for the record, by now you should know that benchmarks without all the details, without the source data (TPC-H in this case), without the code used, without the configuration, etc., are meaningless. Even with all of those, they are mostly good for marketing material. Potential users should always roll their own when possible.

    Fabrice, back to the subject of this blog entry: the cloud computing exercises are interesting, not because of the absolute throughput of 450,000 rows/s, but because of the rock-bottom price of US$4.00/hour. The high-end equipment you're referencing is usually a bit more expensive. I remember that a few years ago we pushed PDI to something like 2GB/s throughput on 10 quad-core boxes connected with a 20Gbit/s SAN. However, that sort of equipment is slightly more expensive than US$4.00/hour. :-)

    Have a good one,

    Matt

  • Cédric Carbone

    Yes, your Community Edition can be OEMed for free.

    About the benchmark, I agree with you. But in the Sun blog post you can find a lot of details, like the link to download the data generator (and an extract of the data). On the configuration side, the hardware is described, as well as what the count, average and sort operations mean…

    And yes, cloud computing is interesting, especially when your information systems are already in the cloud. If your data is on your own network, transporting it from your systems to the internet and back from the cloud can be very expensive (especially when you have TBs of data). In that case, doing the parallel computing very close to the SAN where your data is stored should be more interesting.

    -Cédric

  • Nice conversations happening here.

    I haven’t used the Talend grid capabilities because a) I’m so busy with all the customers using PDI that I don’t have time to go off learning the MPP capabilities of another tool and b) they aren’t open source. For reference, PDI clustering/partitioning/parallel capabilities are LGPL; only the monitoring/management is Enterprise. That being said, I’d welcome a chance to put Talend capabilities through the same exact tests if Talend is willing to do the MPP / tuning work since I’m no expert on their tool.

    It appears as if the benchmark Talend links to shows only parallel capabilities on a single machine. The whole point of my benchmark was to use cheap/crappy EC2 instances to prove you can scale out versus scaling up. Does Talend offer scale out capabilities (physically separate machines) instead of parallel capabilities within a machine? PDI offers both, for reference.

    Does Talend offer the ability to read in parallel? I.e., does the input step know that it's running in parallel and split the file into sections so that workers can use different I/O channels? This is not as important on the single-machine system in the tests you referred to above, but it's really important when you get to splitting the processing out over multiple machines. (A rough sketch of the idea follows right after this comment.)

    I'd offer to do an apples-to-apples comparison if Talend is actually interested in looking at identical-configuration benchmarks on 10/20/40 nodes using scales 50/100/300. You guys send me your optimized transforms that do the same SORT/READ as I outline in my paper, along with the AVG/SORT/COUNT you used @ Sun. I'll build the corresponding AVERAGE/SORT/COUNT for PDI and we can run them all through the tests on the same hardware configurations on EC2.

    Interested? Send an email on through to me. We can WebEx etc. to make sure a real Talend engineer can tune it.

    PS: One of the customers where I did PDI clustering was an Ab Initio replacement, so yes, you are right, that's the market we are talking about here! :)
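
To make the parallel-read question above concrete, here is a minimal sketch of the byte-range idea in Python. It is an illustration only, not PDI's or Talend's actual implementation, and the file name is hypothetical: each worker gets its own slice of the file and re-aligns itself to row boundaries so that every row is read by exactly one worker.

```python
import os

def byte_ranges(path, workers):
    """Split a file into roughly equal byte ranges, one per worker."""
    size = os.path.getsize(path)
    step = size // workers
    return [(i * step, size if i == workers - 1 else (i + 1) * step)
            for i in range(workers)]

def read_rows(path, start, end):
    """Yield every line whose first byte falls inside [start, end).
    A line straddling a boundary is handled by exactly one worker."""
    with open(path, "rb") as f:
        if start > 0:
            f.seek(start - 1)
            f.readline()        # finish the line owned by the previous range
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            yield line

# Example with 4 local workers; in a cluster each range would go to a
# different machine (and therefore a different I/O channel).
path = "lineitem.tbl"           # hypothetical TPC-H extract
for start, end in byte_ranges(path, 4):
    count = sum(1 for _ in read_rows(path, start, end))
    print(f"bytes {start}-{end}: {count} rows")
```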

  • Sean

    I use Talend for some work but mainly use Informatica. I have played with PDI but have not had time to do anything extensive. I think these are all good ETL tools. I like Talend for its ease of use and code generation ability.

    From Goodman's paper, it seems PDI has an edge when sorting large chunks, which is not possible on INFA due to licensing issues and the associated costs. I also like the fact that the PDI multi-tasking abilities are open source. PDI seems very close to what INFA can do (e.g. split the input file into sections and do the sort on the various parts).

    I have seen Talend go more and more closed source (some ELT and CDC capabilities are being developed as closed source). I think this is not going to be helpful for Talend. It feels like the Talend team is having "doubts" about the open source model. I think at least the basic functionality should be all open source. Management and monitoring can be closed, as they do not go to the core of the tool's functionality.

    Bottom line: are the Talend guys ready to take on PDI in this challenge and show how they can match or beat PDI in speed and setup?

  • NALIN JHA

    Hi Matt,
    Unfortunately I missed your webinar on "High Performance ETL using Cloud- and Cluster-based Deployment" on 27th May 2009.
    Do you intend to make the resources shared during the presentation available for folks?

    Thanks
    Nalin

  • Of course, Nalin, we will be making the recording of the webcast available in the next couple of days. I'll Twitter AND blog about it :-)

  • anton@ru

    Very interesting indeed.
    I looked at Talend's new benchmarks and it seems they are not a good test of its parallel capabilities. Most tests are certainly I/O bound, and I'm surprised that an expensive Sun server with a lot of memory cannot scale linearly from sorting 6 million records to sorting 60 million records with Talend's software.
    It would be interesting to see the same benchmarks run using the Unix sort command; that would give us a baseline.

  • Hi Anton, obviously a small EC2 instance isn't really fast at $0.10/hour :-)
    The test was about scalability, not about absolute performance. However, I can't help but agree with you that a box with fast I/O and a bunch of CPUs should be able to scale higher.

    The Unix sort command is interesting, but you have to keep in mind it only sorts strings. Date/time/number conversions etc. are a totally different beast, not to mention Unicode (see the small illustration right after this comment). I'm not certain Unix sort can do an external sort either, so I'm not even certain it's possible to sort the 1.8B lines on a single EC2 node. Given all that, I do agree it would be a baseline number.

    Cheers,
    Matt
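
As a tiny illustration of the strings-versus-types point (made-up values, not benchmark data): a plain lexicographic sort orders numbers and day-first dates incorrectly unless the sort key is converted first.

```python
from datetime import datetime

amounts = ["9.5", "100", "20.25"]
print(sorted(amounts))             # ['100', '20.25', '9.5']  (lexicographic)
print(sorted(amounts, key=float))  # ['9.5', '20.25', '100']  (numeric)

dates = ["2/12/1997", "10/1/1998"]  # day/month/year, no zero padding
print(sorted(dates))                # ['10/1/1998', '2/12/1997']: 1998 before 1997
print(sorted(dates, key=lambda d: datetime.strptime(d, "%d/%m/%Y")))
```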

  • anton@ru

    Hi Matt
    I ran a quick test on my $300 desktop PC, and sorting the 6-million-record lineitem.tbl TPC-H file on shipment date takes about 14 seconds using the standard Unix sort utility.
    In light of this, Talend's 12-second result on a high-end Sun box doesn't look so impressive…

    Regards,

    Anton

  • Anton, 6M rows can be sorted on just about any hardware; they fit nicely in memory in just about any tool. 400k rows/sec for this sort of deal sounds about right for reading/sorting a single thread of line items in memory.
    Where it gets interesting in these parallel tests is when you have, say, 5 machines with 1GB of memory each and 20GB of data to sort.
    Look at it this way: even if sorting those 6M rows took 4 times as long on Talend or PDI, why would you care? It's still more than fast enough. However, with 1.8B rows things are different. Then you have 300 times more data, and 300×14 seconds (about 70 minutes) is no longer a trivial duration. Factor in the external sort involved and the times climb even further, leading to hours and hours of processing time.
    This is where the parallel engines come in, splitting the hard work (read/encode/sort) over multiple CPUs and boxes. (A minimal single-machine sketch of the external sort idea follows right after this comment.)

    Anyway, with that baseline in mind, and keeping an eye on the fact that the PDI transformation also joins etc., I don't think those 40 cheap EC2 boxes did a bad job @ 4,000 seconds, slightly faster than the 300×14 s you would get on your cheap PC ;-)
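
To round off the thread, here is a minimal single-machine sketch in Python of the external sort technique under discussion: sort chunks that fit in memory, spill each sorted run to disk, then merge the runs lazily. It only illustrates the general idea (it is not PDI's or Talend's implementation), and the file names in the usage comment are hypothetical.

```python
import heapq
import tempfile

def _spill(sorted_chunk):
    """Write one sorted run to a temporary file and return its path."""
    tmp = tempfile.NamedTemporaryFile("w", delete=False, suffix=".run")
    tmp.writelines(sorted_chunk)
    tmp.close()
    return tmp.name

def external_sort(lines, chunk_size=1_000_000):
    """Sort an iterable of newline-terminated text lines that may not fit
    in memory: sort fixed-size chunks in RAM, spill each run to disk,
    then merge all runs with a heap. Yields lines in sorted order."""
    runs, chunk = [], []
    for line in lines:
        chunk.append(line)
        if len(chunk) >= chunk_size:
            runs.append(_spill(sorted(chunk)))
            chunk = []
    if chunk:
        runs.append(_spill(sorted(chunk)))
    files = [open(path) for path in runs]
    try:
        yield from heapq.merge(*files)
    finally:
        for f in files:
            f.close()

# Usage sketch (hypothetical file names):
# with open("lineitem.tbl") as src, open("lineitem.sorted", "w") as dst:
#     dst.writelines(external_sort(src))
```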