<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>
<channel>
	<title>Comments on: Pentaho Data Integration vs Talend (part 1)</title>
	<atom:link href="http://www.ibridge.be/?feed=rss2&#038;p=150" rel="self" type="application/rss+xml" />
	<link>http://www.ibridge.be/?p=150</link>
	<description>Venting steam after a long day of writing code...</description>
	<pubDate>Wed, 08 Sep 2010 11:46:50 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.5</generator>
		<item>
		<title>By: Comparando Kettle y Talend &#171; Guía Mundial de Países</title>
		<link>http://www.ibridge.be/?p=150#comment-27074</link>
		<dc:creator>Comparando Kettle y Talend &#171; Guía Mundial de Países</dc:creator>
		<pubDate>Sat, 14 Mar 2009 09:46:29 +0000</pubDate>
		<guid isPermaLink="false">http://www.ibridge.be/?p=150#comment-27074</guid>
		<description>[...] version by Matt Casters (Kettle). One example is Matt's latest post, in which he makes an interesting comparison between the two ETL tools, [...]</description>
		<content:encoded><![CDATA[<p>[&#8230;] version by Matt Casters (Kettle). One example is Matt&#8217;s latest post, in which he makes an interesting comparison between the two ETL tools, [&#8230;]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Parsing CSV files is CPU bound: a C++ test case (Update 1)</title>
		<link>http://www.ibridge.be/?p=150#comment-26292</link>
		<dc:creator>Parsing CSV files is CPU bound: a C++ test case (Update 1)</dc:creator>
		<pubDate>Fri, 19 Dec 2008 05:23:55 +0000</pubDate>
		<guid isPermaLink="false">http://www.ibridge.be/?p=150#comment-26292</guid>
		<description>[...] experiments are motivated by a post by Matt Casters and some said that Java was guilty. I use C++ and I get a similar result. So far at least. Can you [...]</description>
		<content:encoded><![CDATA[<p>[&#8230;] experiments are motivated by a post by Matt Casters and some said that Java was guilty. I use C++ and I get a similar result. So far at least. Can you [&#8230;]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Parsing CSV files is CPU bound: a C++ test case</title>
		<link>http://www.ibridge.be/?p=150#comment-26247</link>
		<dc:creator>Parsing CSV files is CPU bound: a C++ test case</dc:creator>
		<pubDate>Tue, 16 Dec 2008 15:12:51 +0000</pubDate>
		<guid isPermaLink="false">http://www.ibridge.be/?p=150#comment-26247</guid>
		<description>[...] This quest started out from a post by Matt Casters where he reported that you could parse a CSV file faster using two CPU cores instead of just [...]</description>
		<content:encoded><![CDATA[<p>[&#8230;] This quest started out from a post by Matt Casters where he reported that you could parse a CSV file faster using two CPU cores instead of just [&#8230;]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Daniel Einspanjer</title>
		<link>http://www.ibridge.be/?p=150#comment-26216</link>
		<dc:creator>Daniel Einspanjer</dc:creator>
		<pubDate>Sat, 13 Dec 2008 18:34:15 +0000</pubDate>
		<guid isPermaLink="false">http://www.ibridge.be/?p=150#comment-26216</guid>
		<description>Matt,

take a look at this comment I put over on Vincent's blog and tell me what you think about it.  I have to make these transformations available regardless, but the more comments I read about these benchmarks, the more I think that having a set of real-world transformations to work against might be interesting for all the different vendors.

http://it.toolbox.com/blogs/infosphere/was-the-manapps-etl-benchmark-test-flawed-or-baised-28697#2498479</description>
		<content:encoded><![CDATA[<p>Matt,</p>
<p>take a look at this comment I put over on Vincent&#8217;s blog and tell me what you think about it.  I have to make these transformations available regardless, but the more comments I read about these benchmarks, the more I think that having a set of real-world transformations to work against might be interesting for all the different vendors.</p>
<p><a href="http://it.toolbox.com/blogs/infosphere/was-the-manapps-etl-benchmark-test-flawed-or-baised-28697#2498479" rel="nofollow">http://it.toolbox.com/blogs/infosphere/was-the-manapps-etl-benchmark-test-flawed-or-baised-28697#2498479</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Parsing text files is CPU bound</title>
		<link>http://www.ibridge.be/?p=150#comment-26166</link>
		<dc:creator>Parsing text files is CPU bound</dc:creator>
		<pubDate>Mon, 08 Dec 2008 20:14:02 +0000</pubDate>
		<guid isPermaLink="false">http://www.ibridge.be/?p=150#comment-26166</guid>
		<description>[...] Casters showed that using an open source data warehousing tool, parsing simple CSV files is CPU bound. That is, he [...]</description>
		<content:encoded><![CDATA[<p>[&#8230;] Casters showed that using an open source data warehousing tool, parsing simple CSV files is CPU bound. That is, he [&#8230;]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Matt Casters</title>
		<link>http://www.ibridge.be/?p=150#comment-26165</link>
		<dc:creator>Matt Casters</dc:creator>
		<pubDate>Mon, 08 Dec 2008 12:19:42 +0000</pubDate>
		<guid isPermaLink="false">http://www.ibridge.be/?p=150#comment-26165</guid>
		<description>You are absolutely right Fabrice!  Then again, I didn't get this thing started. ;-)

Again, I think we need to define these tests in an open source way and find some DS, INFA, BODI, OWB, Talend, PDI experts to execute the tests.
That sort of thing needs to happen on common ground, so I'm looking for a "neutral" site to host the benchmark suites and results.

What do you think of that idea?

All the best,

Matt</description>
		<content:encoded><![CDATA[<p>You are absolutely right Fabrice!  Then again, I didn&#8217;t get this thing started. <img src='http://www.ibridge.be/wp-includes/images/smilies/icon_wink.gif' alt=';-)' class='wp-smiley' /> </p>
<p>Again, I think we need to define these tests in an open source way and find some DS, INFA, BODI, OWB, Talend, PDI experts to execute the tests.<br />
That sort of thing needs to happen on common ground, so I&#8217;m looking for a &#8220;neutral&#8221; site to host the benchmark suites and results.</p>
<p>What do you think of that idea?</p>
<p>All the best,</p>
<p>Matt</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Fabrice</title>
		<link>http://www.ibridge.be/?p=150#comment-26163</link>
		<dc:creator>Fabrice</dc:creator>
		<pubDate>Mon, 08 Dec 2008 09:26:53 +0000</pubDate>
		<guid isPermaLink="false">http://www.ibridge.be/?p=150#comment-26163</guid>
		<description>“PDI expert writes benchmark that says PDI is faster” ;-)

By the way, it would be much more interesting to compare PDI to proprietary solutions (Datastage, Informatica...).

As I usually say, PDI &#38; Talend are not competitors...

Fabrice</description>
		<content:encoded><![CDATA[<p>“PDI expert writes benchmark that says PDI is faster” <img src='http://www.ibridge.be/wp-includes/images/smilies/icon_wink.gif' alt=';-)' class='wp-smiley' /> </p>
<p>By the way, it would be much more interesting to compare PDI to proprietary solutions (Datastage, Informatica&#8230;).</p>
<p>As I usually say, PDI &amp; Talend are not competitors&#8230;</p>
<p>Fabrice</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Daniel Lemire</title>
		<link>http://www.ibridge.be/?p=150#comment-26152</link>
		<dc:creator>Daniel Lemire</dc:creator>
		<pubDate>Sat, 06 Dec 2008 03:05:10 +0000</pubDate>
		<guid isPermaLink="false">http://www.ibridge.be/?p=150#comment-26152</guid>
		<description>Thanks for all the insight.

Regarding the limitations: they apply when you use random writes; see my blog post:

http://www.daniel-lemire.com/blog/archives/2008/02/02/random-write-performance-in-solid-state-drives/

But if you are reading a CSV file, you are good. Of course, writing out a transformed output randomly on disk will get you in trouble.  But thankfully, you are not doing that (I hope you are loading it up into a table or something).</description>
		<content:encoded><![CDATA[<p>Thanks for all the insight.</p>
<p>Regarding the limitations: they apply when you use random writes; see my blog post:</p>
<p><a href="http://www.daniel-lemire.com/blog/archives/2008/02/02/random-write-performance-in-solid-state-drives/" rel="nofollow">http://www.daniel-lemire.com/blog/archives/2008/02/02/random-write-performance-in-solid-state-drives/</a></p>
<p>But if you are reading a CSV file, you are good. Of course, writing out a transformed output randomly on disk will get you in trouble.  But thankfully, you are not doing that (I hope you are loading it up into a table or something).</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Matt Casters</title>
		<link>http://www.ibridge.be/?p=150#comment-26146</link>
		<dc:creator>Matt Casters</dc:creator>
		<pubDate>Fri, 05 Dec 2008 18:11:42 +0000</pubDate>
		<guid isPermaLink="false">http://www.ibridge.be/?p=150#comment-26146</guid>
		<description>&gt;&gt; That is what I meant by CPU-bound. That is, you must throw a second core to keep the I/O system busy. One core is not enough.

I think one core is enough with the right settings in this case.  Personally, I think it has more to do with disk latency.  However, for argument's sake, that's beside the point. Like you said, if you moved to the new types of SSD or very fast disk subsystems, one or even 2 cores would not be enough to process the 200+ MByte/s of throughput.  I've seen one particular situation (can't disclose) where we needed 10 servers with 4 cores to read the data (a single fixed-width flat file) at peak performance: 2GByte/s.   Of course, it goes without saying that 20Gbit/s disk subsystems don't come cheap and are not always readily available to us for testing :-)

&gt;&gt; Now, the first thread starts and reads the first line. What the heck can the second thread do? It cannot jump and read b1, b2, b3 because it does not know where b1 is before the first thread has read a3.

That's exactly what it does: jump to a neighbor of line 3.  The algorithm is not line based anyway, it's size based.
Positioning is done with a disk seek, not by actually reading the file.  In the worst case the disk heads have to reposition, no other time or CPU cycles are wasted on it.
To prevent the disk from going "crazy" as you put it, it's better to read in larger blocks.  That is why the second test has a large NIO buffer size set.

Obviously, looking at it from the position of a single disk spindle is a bit simple since most enterprise data these days would be stored on disk subsystems, RAID-5, mirrored disks, etc.

By the way, it's a different subject, but you would need very new SSD drives to surpass a recent hard drive's performance, especially when it comes to writing.  Up until now, SSD has been marketing, nothing more.  Check out Linus' comments on the subject: http://torvalds-family.blogspot.com/2008/10/so-i-got-one-of-new-intel-ssds.html</description>
		<content:encoded><![CDATA[<p>>> That is what I meant by CPU-bound. That is, you must throw a second core to keep the I/O system busy. One core is not enough.</p>
<p>I think one core is enough with the right settings in this case.  Personally, I think it has more to do with disk latency.  However, for argument&#8217;s sake, that&#8217;s beside the point. Like you said, if you moved to the new types of SSD or very fast disk subsystems, one or even 2 cores would not be enough to process the 200+ MByte/s of throughput.  I&#8217;ve seen one particular situation (can&#8217;t disclose) where we needed 10 servers with 4 cores to read the data (a single fixed-width flat file) at peak performance: 2GByte/s.   Of course, it goes without saying that 20Gbit/s disk subsystems don&#8217;t come cheap and are not always readily available to us for testing <img src='http://www.ibridge.be/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<p>>> Now, the first thread starts and reads the first line. What the heck can the second thread do? It cannot jump and read b1, b2, b3 because it does not know where b1 is before the first thread has read a3.</p>
<p>That&#8217;s exactly what it does: jump to a neighbor of line 3.  The algorithm is not line based anyway, it&#8217;s size based.<br />
Positioning is done with a disk seek, not by actually reading the file.  In the worst case the disk heads have to reposition, no other time or CPU cycles are wasted on it.<br />
To prevent the disk from going &#8220;crazy&#8221; as you put it, it&#8217;s better to read in larger blocks.  That is why the second test has a large NIO buffer size set.</p>
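<p>The size-based split described above could be sketched roughly as follows. This is an illustration only, not the actual Kettle/PDI code: the class name ChunkedLineReader and the method readChunk are hypothetical, it uses a plain RandomAccessFile instead of large NIO buffers for brevity, and it reads the chunks one after another where a real reader would hand each chunk to its own thread.</p>

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of size-based chunked line reading (not PDI source code).
public class ChunkedLineReader {

    /** Returns the lines owned by chunk {@code index} out of {@code total} equal-sized chunks. */
    public static List<String> readChunk(Path file, int index, int total) throws IOException {
        List<String> lines = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            long size = raf.length();
            long start = size * index / total;       // first byte of this chunk
            long end = size * (index + 1) / total;   // first byte of the next chunk

            if (start > 0) {
                // Position with a seek, then discard everything up to the next
                // line break. Starting one byte early means a line that begins
                // exactly at 'start' is kept rather than thrown away.
                raf.seek(start - 1);
                raf.readLine();
            }
            // A line belongs to the chunk in which its first byte falls, so the
            // last line may run past 'end'; the next chunk skips that remainder.
            while (raf.getFilePointer() < end) {
                String line = raf.readLine();
                if (line == null) {
                    break; // end of file
                }
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("chunks", ".csv");
        String data = "a1, a2, a3\nb1, b2, b3\nc1, c2, c3\nd1, d2, d3\n";
        Files.write(file, data.getBytes(StandardCharsets.US_ASCII));
        int readers = 2; // assumed number of parallel readers
        for (int i = 0; i < readers; i++) {
            System.out.println("chunk " + i + ": " + readChunk(file, i, readers));
        }
        Files.delete(file);
    }
}
```

<p>With two chunks, the four-line sample quoted in this thread splits into the a and b lines for chunk 0 and the c and d lines for chunk 1. Each reader only compares its own file position to its own end offset, so it never needs to know how far the other readers have got.</p>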
<p>Obviously, looking at it from the position of a single disk spindle is a bit simple since most enterprise data these days would be stored on disk subsystems, RAID-5, mirrored disks, etc.</p>
<p>By the way, it&#8217;s a different subject, but you would need very new SSD drives to surpass a recent hard drive&#8217;s performance, especially when it comes to writing.  Up until now, SSD has been marketing, nothing more.  Check out Linus&#8217; comments on the subject: <a href="http://torvalds-family.blogspot.com/2008/10/so-i-got-one-of-new-intel-ssds.html" rel="nofollow">http://torvalds-family.blogspot.com/2008/10/so-i-got-one-of-new-intel-ssds.html</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Daniel Lemire</title>
		<link>http://www.ibridge.be/?p=150#comment-26144</link>
		<dc:creator>Daniel Lemire</dc:creator>
		<pubDate>Fri, 05 Dec 2008 17:43:33 +0000</pubDate>
		<guid isPermaLink="false">http://www.ibridge.be/?p=150#comment-26144</guid>
		<description>&lt;i&gt;The thing is that the second thread kept the disk subsystem as busy as it possibly could.&lt;/i&gt;

That is what I meant by CPU-bound. That is, you must throw a second core to keep the I/O system busy. One core is not enough.

Have you tried the same thing with an SSD? If I really wanted to load up CSV files fast, I'd store them on SSDs.

&lt;i&gt;As far as parallel CSV reading, that was a bit tricky to set up… The size of the file is measured and split into the number of threads. All threads know at that point in time what part to read since they all know what number they have and how many there are, even across a cluster. The idea is then to read until you find the next newline/CR and continue until you hit the size limit. There are some borderline cases to handle, but that’s the general idea.&lt;/i&gt;

I still do not get it. Suppose that I have the following CSV data:

a1, a2, a3

b1, b2, b3

c1, c2, c3

d1, d2, d3


Now, the first thread starts and reads the first line. What the heck can the second thread do? It cannot jump and read b1, b2, b3 because it does not know where b1 is before the first thread has read a3.

What I am guessing is this. You start the first thread at a1. Meanwhile, you have the second thread jump forward in the file, at maybe c2... then it starts reading... of course, the first line it reads is wasted, but you keep going.

That's nice, but you have some overhead. For example, after each line the first thread reads, it must check whether it reached the area read by thread two.

Then, also, you risk having your disk go crazy because you ask for a lot of random access. Of course, this would not happen with an SSD.</description>
		<content:encoded><![CDATA[<p><i>The thing is that the second thread kept the disk subsystem as busy as it possibly could.</i></p>
<p>That is what I meant by CPU-bound. That is, you must throw a second core to keep the I/O system busy. One core is not enough.</p>
<p>Have you tried the same thing with an SSD? If I really wanted to load up CSV files fast, I&#8217;d store them on SSDs.</p>
<p><i>As far as parallel CSV reading, that was a bit tricky to set up… The size of the file is measured and split into the number of threads. All threads know at that point in time what part to read since they all know what number they have and how many there are, even across a cluster. The idea is then to read until you find the next newline/CR and continue until you hit the size limit. There are some borderline cases to handle, but that’s the general idea.</i></p>
<p>I still do not get it. Suppose that I have the following CSV data:</p>
<p>a1, a2, a3</p>
<p>b1, b2, b3</p>
<p>c1, c2, c3</p>
<p>d1, d2, d3</p>
<p>Now, the first thread starts and reads the first line. What the heck can the second thread do? It cannot jump and read b1, b2, b3 because it does not know where b1 is before the first thread has read a3.</p>
<p>What I am guessing is this. You start the first thread at a1. Meanwhile, you have the second thread jump forward in the file, at maybe c2&#8230; then it starts reading&#8230; of course, the first line it reads is wasted, but you keep going.</p>
<p>That&#8217;s nice, but you have some overhead. For example, after each line the first thread reads, it must check whether it reached the area read by thread two.</p>
<p>Then, also, you risk having your disk go crazy because you ask for a lot of random access. Of course, this would not happen with an SSD.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
