<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>
<channel>
	<title>Comments on: EC2 : Scaling large files with S3 and Kettle</title>
	<atom:link href="http://www.ibridge.be/?feed=rss2&#038;p=113" rel="self" type="application/rss+xml" />
	<link>http://www.ibridge.be/?p=113</link>
	<description>Venting steam after a long day of writing code...</description>
	<pubDate>Fri, 10 Sep 2010 23:55:10 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.5</generator>
		<item>
		<title>By: Matt Casters</title>
		<link>http://www.ibridge.be/?p=113#comment-22181</link>
		<dc:creator>Matt Casters</dc:creator>
		<pubDate>Sun, 11 May 2008 07:58:10 +0000</pubDate>
		<guid isPermaLink="false">http://www.ibridge.be/?p=113#comment-22181</guid>
		<description>Alex, lazy conversion will indeed speed that up.

I wrote another blog entry on the subject:  http://www.ibridge.be/?p=63
And another one on performance : http://www.ibridge.be/?p=78

The high CPU usage basically is the "Java" price you pay for Unicode conversions and data type conversions.
To get around that we invented the concept of lazy conversion to try and avoid doing unneeded data conversions.
If you're loading into a database, it doesn't really matter all that much since that's going to be the bottleneck.

On the slowness of the EC2 nodes: there is really nothing we can do about that. It's a price/performance thing. There are faster nodes to be had, but they go for US$0.30/hour. Also, please remember that the data is coming in over REST (web services). That is bound to be a bit slower than reading from local disks. Again, the point is that it scales. Perhaps if we added another 10 nodes, it would scale even further. I don't have the time to test that at the moment.

Matt</description>
		<content:encoded><![CDATA[<p>Alex, lazy conversion will indeed speed that up.</p>
<p>I wrote another blog entry on the subject:  <a href="http://www.ibridge.be/?p=63" rel="nofollow">http://www.ibridge.be/?p=63</a><br />
And another one on performance : <a href="http://www.ibridge.be/?p=78" rel="nofollow">http://www.ibridge.be/?p=78</a></p>
<p>The high CPU usage basically is the &#8220;Java&#8221; price you pay for Unicode conversions and data type conversions.<br />
To get around that we invented the concept of lazy conversion to try and avoid doing unneeded data conversions.<br />
If you&#8217;re loading into a database, it doesn&#8217;t really matter all that much since that&#8217;s going to be the bottleneck.</p>
<p>On the slowness of the EC2 nodes: there is really nothing we can do about that. It&#8217;s a price/performance thing. There are faster nodes to be had, but they go for US$0.30/hour. Also, please remember that the data is coming in over REST (web services). That is bound to be a bit slower than reading from local disks. Again, the point is that it scales. Perhaps if we added another 10 nodes, it would scale even further. I don&#8217;t have the time to test that at the moment.</p>
<p>Matt</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Alex</title>
		<link>http://www.ibridge.be/?p=113#comment-22180</link>
		<dc:creator>Alex</dc:creator>
		<pubDate>Sun, 11 May 2008 07:11:17 +0000</pubDate>
		<guid isPermaLink="false">http://www.ibridge.be/?p=113#comment-22180</guid>
		<description>Interesting stuff, especially the nearly perfect scaling. I've been digging into some aspects of Kettle's performance lately. It would seem that data conversion from source input is pretty CPU-hungry, because the per-node throughput seems to be:

1 node: 2400/842 = 2.85 MB/s
10 nodes: 2400/96 = 25 MB/s aggregate (2.5 MB/s per node)

It seems like a lot of effort to get 25 MB/s given that your average SATA disk can get sequential read speeds of 40-50 MB/s.

I just tested a simple transform against a 2.5 GB file with a series of generated UUIDs and random numbers to load the data and then sum up the random numbers. Sure enough, I can barely make the disk sweat, only managing 85k rows/s (4 MB/s). The CPU on this box is a bit old, but I would not have expected this.

Is this the primary goal of lazy conversion and parallel loading -- to relieve the CPU bottleneck of conversions?

Sorry if I'm going a bit off topic...</description>
		<content:encoded><![CDATA[<p>Interesting stuff, especially the nearly perfect scaling. I&#8217;ve been digging into some aspects of Kettle&#8217;s performance lately. It would seem that data conversion from source input is pretty CPU-hungry, because the per-node throughput seems to be:</p>
<p>1 node: 2400/842 = 2.85 MB/s<br />
10 nodes: 2400/96 = 25 MB/s aggregate (2.5 MB/s per node)</p>
<p>It seems like a lot of effort to get 25 MB/s given that your average SATA disk can get sequential read speeds of 40-50 MB/s.</p>
<p>I just tested a simple transform against a 2.5 GB file with a series of generated UUIDs and random numbers to load the data and then sum up the random numbers. Sure enough, I can barely make the disk sweat, only managing 85k rows/s (4 MB/s). The CPU on this box is a bit old, but I would not have expected this.</p>
<p>Is this the primary goal of lazy conversion and parallel loading &#8212; to relieve the CPU bottleneck of conversions?</p>
<p>Sorry if I&#8217;m going a bit off topic&#8230;</p>
]]></content:encoded>
	</item>
</channel>
</rss>
