PDI cloud: massive performance roundup
Dear Kettle fans,
As expected, there was a lot of interest in cloud computing at the MySQL conference last week. It felt really good to be able to pass the Bayon Technologies white paper around to friends, contacts and analysts. It’s one thing to demonstrate a certain scalability on your blog; it’s another entirely to have a smart man like Nicholas Goodman do the math.
Sorting massive amounts of rows is a hard problem to take on. Making it scale on low-cost EC2 instances is interesting as it proves a certain level of scalability. Nick ran 40 EC2 nodes in parallel to do the work and saw that it was good. 450,000 rows/s for US$4.00/hour is not bad. Note: the tests sort 300M, 600M and 1.8B line-item rows from TPC-H, at scale factors 50, 100 and 300 respectively.
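For those who like to see the arithmetic, here is a back-of-the-envelope sketch in Python. It only uses the figures quoted above and makes a few assumptions that are mine, not the white paper's: that 450,000 rows/s is the aggregate throughput of all 40 nodes, that it holds steady for every data size, that the TPC-H line-item table has roughly 6 million rows per scale factor, and that cost scales linearly with run time (no rounding up to whole billed hours). The function names are made up for illustration.

```python
# Figures quoted in the post.
NODES = 40
ROWS_PER_SECOND = 450_000      # assumed: aggregate throughput across all nodes
COST_PER_HOUR_USD = 4.00       # quoted price for the whole 40-node cluster

# Assumption: the TPC-H lineitem table holds about 6 million rows per scale factor.
LINEITEM_ROWS_PER_SCALE_FACTOR = 6_000_000


def rows_for_scale_factor(scale_factor):
    """Approximate number of lineitem rows at a given TPC-H scale factor."""
    return scale_factor * LINEITEM_ROWS_PER_SCALE_FACTOR


def sort_time_and_cost(rows):
    """Return (seconds, US dollars) to sort `rows` at the quoted rate and price."""
    seconds = rows / ROWS_PER_SECOND
    cost = seconds / 3600 * COST_PER_HOUR_USD
    return seconds, cost


rows_per_node = ROWS_PER_SECOND / NODES
cost_per_node_hour = COST_PER_HOUR_USD / NODES
print(f"{rows_per_node:,.0f} rows/s per node at ${cost_per_node_hour:.2f} per node-hour")

for scale_factor in (50, 100, 300):
    rows = rows_for_scale_factor(scale_factor)
    seconds, cost = sort_time_and_cost(rows)
    print(f"SF {scale_factor}: {rows:,} rows, ~{seconds / 60:.0f} min, ~${cost:.2f}")
```

Treat the output as a rough estimate rather than a measured result: the white paper has the actual timings per data size.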
The paper certainly made it easier for me to point to PDI scalability, and it opened some doors for further testing on big iron at Sun Microsystems. It was great to talk to so many people. I even walked up to the Amazon Web Services booth at the expo to ask about the EBS performance bottleneck that was exposed by the white paper. “It’s being worked on” was the reply 🙂
The most interesting thing about the PDI cloud integration work is that there don’t seem to be a lot of other ETL tool vendors doing it. In fact, after a Google search or two I could only find Informatica with a SaaS (not even IaaS) offering, and I kinda doubt that closed source software is a good match for cloud computing.
So I went out there and did a presentation on the subject to explain to people how they would set it up for themselves. The open source way is to not only do the marketing but to allow people to run their own tests and see for themselves. That way you get valuable feedback to improve your offering.
Here is a copy of the presentation I gave: Cloud Computing with MySQL and Kettle.
I thought it was a good session although for once I didn’t get “The Question”, you know the one where people ask me how Kettle is different from Talend and where I get to comment on their lack of scalability. Oh well, I guess you can’t win them all 🙂
Finally, people have been asking me about integration with both SQLStream on the one hand and MapReduce/Hadoop/Hive/HDFS on the other hand. I’m happy to say that the former is in progress and that I’ve started talks with the fine folks from Cloudera to get started on the latter. I simply loved Aaron Kimball’s tutorial @ MySQL Conf on the MapReduce subject and think that there is a lot of potential for integration with PDI to make us scale even better.
Until next time,