EC2 : Scaling large files with S3 and Kettle
Dear Kettle fans,
Here is a problem: you want to process a large amount of data, coming from one or more CSV files. Suppose for example you have a file like this one: http://mattcasters.s3.amazonaws.com/customers-25M.txt (don’t download all at once, it’s 25 million rows or 2.4GB big)
Now, reading that file is a problem. You can do all sorts of interesting things in parallel, but unless you can read the file itself in parallel, you have yourself a nice bottleneck.
As such, we created the parallel CSV Input step to come to the rescue. While this works fine on ordinary file systems, it doesn’t offer any help out on Amazon EC2. Over there, we don’t really have a file system yet that offers parallel access in a scalable way. I investigated the promising s3fs project, but that one retrieved complete files in stead of chunks of data and always saves to local file first. That in itself then becomes the bottleneck.
After even more searching over the weekend, I ended up using the JetS3t Java API to create a new “S3 Csv Input” plugin (SVN: svn://source.pentaho.org/svnkettleroot/plugins/S3CsvInput/trunk) to read the data in parallel, unbuffered and in chunks using AWS (Amazon Web Services) in this case on the REST service they provide.
The plugin allowed us to learn some interesting things about EC2 and S3…
The first thing we see is that a single node never consumes more than 40-something% CPU. We suspect that this is because of price/performance reasons. Remember, these servers cost $US0.10/hour. Even given this fact and the fact that we’re basically running inside a virtual machine, we can still process around 30,000 rows/s on a single node.
The performance of a single node isn’t really all that interesting. What we want to know is : if we read the same file with 2, 5 or 10 nodes, how long does it take then?
Well, obviously I tried it and here are the results:
–> 1 node=842 seconds, 2 nodes=433 seconds, 5 nodes=170 seconds, 10 nodes=96 seconds
This means that on 10 nodes, you go 8.8 times faster which is actually quite nice considering startup costs, etc.
The 10 nodes scenario transformation looks like this:
This means: A cluster schema with 10 slave servers (1 carte per server) and 4 step copies per slave server (total of 40 step copies each reading around 25M/40=625k rows).
The throughput of this solution comes down to a staggering 260,000 rows/second or 935M rows/hour.
If you would have processed these 935M rows, you would have spent $US 1,00 (EURO 0,64) in EC2 rental costs and a few cents extra for the data transfer costs.
I always make it a point to provide the technology for ETL developers to use, never to dictate how that technology should be used. However, in this case, I’m sure you’ll agree that the possibilities are endless, certainly in those situations where you want to apply complex logic, heavy calculations on individual or aggregated rows, scoring, etc. Especially those situations where a single server isn’t enough and you want to throw a bunch of servers at it. What we have shown is that you *can* split up the load over several servers to make it scale beyond a single machine.
The main limitation seems to be how fast you can get data to upload to S3 for processing on EC2… using Kettle… and how fast you can get the data back to your own servers. Obviously, if your services are already hosted on EC2, then that problem doesn’t exist.
Until next time,