Archive for April, 2008

April 28th 2008

EC2 : Scaling large files with S3 and Kettle

Dear Kettle fans,

Here is a problem: you want to process a large amount of data, coming from one or more CSV files.  Suppose for example you have a file like this one: http://mattcasters.s3.amazonaws.com/customers-25M.txt (don’t download all at once, it’s 25 million rows or 2.4GB big)

Now, reading that file is a problem.  You can do all sorts of interesting things in parallel, but unless you can read the file itself in parallel, you have yourself a nice bottleneck.

As such, we created the parallel CSV Input step to come to the rescue.  While this works fine on ordinary file systems, it doesn’t offer any help out on Amazon EC2.  Over there, we don’t really have a file system yet that offers parallel access in a scalable way.  I investigated the promising s3fs project, but that one retrieved complete files in stead of chunks of data and always saves to local file first.  That in itself then becomes the bottleneck.

JetS3t

After even more searching over the weekend, I ended up using the JetS3t Java API to create a new “S3 Csv Input” plugin (SVN: svn://source.pentaho.org/svnkettleroot/plugins/S3CsvInput/trunk) to read the data in parallel, unbuffered and in chunks using AWS (Amazon Web Services) in this case on the REST service they provide.

The plugin allowed us to learn some interesting things about EC2 and S3…

Performance throttling

The first thing we see is that a single node never consumes more than 40-something% CPU.  We suspect that this is because of price/performance reasons.  Remember, these servers cost $US0.10/hour.  Even given this fact and the fact that we’re basically running inside a virtual machine, we can still process around 30,000 rows/s on a single node.

It scales!

The performance of a single node isn’t really all that interesting.  What we want to know is : if we read the same file with 2, 5 or 10 nodes, how long does it take then?

Well, obviously I tried it and here are the results:

–> 1 node=842 seconds, 2 nodes=433 seconds, 5 nodes=170 seconds, 10 nodes=96 seconds

This means that on 10 nodes, you go 8.8 times faster which is actually quite nice considering startup costs, etc.

The transformation

The 10 nodes scenario transformation looks like this:

This means: A cluster schema with 10 slave servers (1 carte per server) and 4 step copies per slave server (total of 40 step copies each reading around 25M/40=625k rows).

Throughput

The throughput of this solution comes down to a staggering 260,000 rows/second or 935M rows/hour.

Costprice

If you would have processed these 935M rows, you would have spent $US 1,00 (EURO 0,64) in EC2 rental costs and a few cents extra for the data transfer costs.

Usability

I always make it a point to provide the technology for ETL developers to use, never to dictate how that technology should be used.  However, in this case, I’m sure you’ll agree that the possibilities are endless, certainly in those situations where you want to apply complex logic, heavy calculations on individual or aggregated rows, scoring, etc. Especially those situations where a single server isn’t enough and you want to throw a bunch of servers at it.   What we have shown is that you *can* split up the load over several servers to make it scale beyond a single machine.

Limitation

The main limitation seems to be how fast you can get data to upload to S3 for processing on EC2… using Kettle… and how fast you can get the data back to your own servers.  Obviously, if your services are already hosted on EC2, then that problem doesn’t exist.

Until next time,

Matt

2 Comments »

April 24th 2008

Hardy Heron arrives

For me that means Kubuntu 8.04

This time, I’m going to show restraint and NOT upgrade in the first 2 weeks.  I’m going to let the dust settle on the download servers first :-)   Then I’m going to install it on my test system and then maybe on my main laptop.

Matt

2 Comments »

April 23rd 2008

Give MySQL a break please

In a unique display of mass hysteria, one blogger after the other and even slashdot (no, I’m not going to link) managed to take the completely innocent message that certain new enterprise features might get released as closed source only and turn it into an ongoing bad press onslaught about “MySQL closing down source code”.

Why don’t you all give MySQL a break here please?  The rule is always the same for everybody: the one that writes the code gets to pick the license.  Listen, I 100% believe in open source and I consider myself to be a big advocate, but commercial open source companies like MySQL (and Pentaho) are commercial entities.  At lease try to put yourself in their position for a second.  For example, if a customer asks you to NOT to release a piece of software they paid for, you don’t release it, it’s that simple.

In the end, what MySQL is doing is simple: they are experimenting with a commercial open source  (COS) model.  Why are they experimenting?  Because the concept of COS is very new and there are no clear guidelines.  It simply hasn’t been done before.  How do you keep growing?  How do you keep paying more open source developers?  How do you pay for the millions of web hits each day?  How do you pay for the millions of downloads, the Tera bytes of internet traffic?  How do you guarantee your long term survival?  How do you strike a balance between commercial success and widespread open source adoption?  How do you keep your investors happy as well as your community?

I guess we learned one thing the past week : it’s easier to spout criticism than to give answers to these tough questions.

Matt

7 Comments »

April 21st 2008

3.1.0-M1 is out

For those of you that missed it: Pentaho Data Integration (Kettle) version 3.1.0-M1 is out.

A lot of new goodies are in that first milestone (development) release such as zooming, snap-to-grid, results pane integration, a series of bug fixes, improved translations, performance enhancements, parallel job entry execution, a new data validation step, a new Greenplum bulk loader step, a new Property Input step, new job entries SSH2GET and SSH2PUT, a new Split Field To Rows step (ported from plugin) and much much more.

Until next time,
Matt

No Comments yet »

April 21st 2008

Korean Kettle : ko_KR

Only a few weeks after we started to receive a lot of Japanese translations, Kim YoungWoo offered to do Korean translations.  A lot of translations came in over the weekend for both languages (as well as a host of other fixes) :

Here is the list of languages supported with the % complete next to it:

en_US : 99,80% complete  (7956)
fr_FR : 98,36% complete  (7841)
it_IT : 89,30% complete  (7119)
es_AR : 75,63% complete  (6029)
de_DE : 56,41% complete  (4497)
ja_JP : 56,26% complete  (4485)
zh_CN : 53,94% complete  (4300)
es_ES : 48,19% complete  (3842)
nl_NL : 18,09% complete  (1442)
pt_BR : 15,48% complete  (1234)
pt_PT : 15,48% complete  (1234)
ko_KR : 10,99% complete  (876)

Thanks again to all the great work done by all translators!

Matt

1 Comment »

Next »

Pentaho world image