Archive for October, 2007

October 30th 2007

Kettle at Talend

Dear open source ETL friends,

Today I met with the nice folks from Talend in their offices in Paris.  Contrary to what some of you might conclude from the sometimes heated technical debates on the internet, it was nice to hear that both the folks from Talend and I have the same opinion on our position in the market: we don’t see each other as competitors. (**)

Some of you might think that this sounds a lot like a marketing slogan, so let me expand upon this a bit.  The fact of the matter is that Talend and Pentaho Data Integration are sufficiently different in conception, architecture and implementation that they are in fact two distinct choices on the ETL market.  While some people prefer one tool and some prefer another, the choice is there.  Having the opportunity to try out both tools for free and to have this choice is one of the most important differentiators with the traditional closed source ETL companies.

Now, don’t get me wrong, obviously as the Pentaho Chief Data Integration, I think that our architecture is a lot better. :-)  However, the people from Talend have the same opinion about their ETL tool.  That’s the way it should be.  This is a good and healthy thing.

That should not relieve us of our responsibility as well-behaved citizens in Open Source land to at least try and find common ground on certain issues.
If KDE and Gnome can agree on certain desktop standards, if Compiz and Beryl can join forces again after a rocky episode, anything should be possible.

So it was a pleasure to find out that indeed we did find some small points to work on together. I’m hoping we’ll be able to let you in on the details real soon.

I do hope that both communities read this message and act accordingly in the same sharing, cooperating and dignified fashion.  In that regard I want to remind my readers that Pentaho Data Integration was only possible thanks to the incredible work put into dozens of fine open source libraries.  I’m pretty sure that the same goes for Talend.

Think about it for a minute… We’re both outpacing and outselling the traditional closed source ETL vendors, probably by a factor of 10 to 1.  That is not because of our differences, but because of our similarities.

Until next time,

Matt

(**) Driving home at 280km/h on the High Speed Train, I also came to the conclusion I really enjoyed that red wine during lunch ;-)

October 27th 2007

If you don’t know… you don’t know!

Dear friends,

We had a good time the last couple of days in Lyon. It was nice to meet people who actually use Kettle in their day-to-day jobs. It was great to hear all about the success stories and the future plans.
I also had the pleasure to meet Alain, project manager at BPM, all-around nice guy and philosophical genius. At a certain point in our conversation he dropped this almost Zen-like piece of wisdom on us:

If you don’t know, you don’t know!

Boy, it doesn’t get any deeper than that, does it?  In one simple sentence, Alain described the core problem of BI gap analyses AND the number one problem in software development. If you don’t know, you don’t know and you need to ask your customer, you need to get to know the requirements. All too often developers, data warehouse designers and ICT people in general decide for the end-users what is right for them. All too often it causes serious strain on the projects. In fact I’m absolutely sure it’s the #1 cause of project failures. (**)

So next time you decide what is the best thing for your customers and end-users, remember Alain’s wise words.

Until next time!

Matt

(**) I don’t have any statistics available on this subject of course. If I’m proven wrong later I’ll blame it on Alain “The Crazy Swiss” Debecker.

October 24th 2007

Kettle version 2.5.2 available

Dear Kettle fans,

A lot of things have happened in terms of development in the 3.x codebase.
However, that doesn’t mean we don’t maintain the 2.5 tree anymore. A lot of fixes were back-ported from the 3.x tree and we managed to add a few new job entries as well.
Click here for a full list of the changes versus 2.5.1. A version 2.5.3 source tree has been opened in which we will continue to fix problems, at least for a while longer. We will no longer add new features to this version.

Grab the goodies over here:

Binary: Kettle-2.5.2.zip
Source: Kettle-src-2.5.2.zip
Windows installer: PDI-2.5.2.exe

Instructions on how to install using the executable are over here. Just make sure to use a Microsoft Windows operating system to install on. :-)

Version 2.5.2, like the versions before it, runs on version 1.4 or higher of the Java Runtime Environment.

Enjoy the release!

Best regards,

Matt

October 23rd 2007

Ongoing spam assault

Hi friends,

There’s always a price to pay.  It seems that my blog has been getting more and more popular lately.  However, it has also become a spam magnet.  It’s not just that the number of spam comments has increased to +65000…

Apparently, there are some unfixed security holes in the WordPress software for this blog.  Those can be exploited by the spammers to drop all kinds of stupid links into the blog roll.  That’s the reason that section on the right is gone now.

I wasn’t really in need of or trying to sell you those items you may have seen on occasion ;-)  If that’s the case, my apologies.

If anyone knows a good solution for this problem, feel free to drop me a note!

Until next time,

Matt

October 19th 2007

Test case: fast parallel flat file reading

Version 3 of Pentaho Data Integration will feature 2 new steps to load flat files:

  • CSV Input: to read delimited text files
  • Fixed Input: to read fixed width flat files

Both steps were not designed to be as versatile as possible. We already have the regular “Text File Input” step for that. These steps were designed to be as fast as possible.

Here are the reasons why these steps are fast:

  • They use non-blocking I/O (NIO) to read data in big chunks at a time
  • They have greatly simplified algorithms without any bloat
  • They allow us to read data using lazy conversions

Besides these items we also greatly reduced the overhead of garbage collection in the Java virtual machine and of the metadata handling in our new version.
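
To make the “big chunks” and “lazy conversions” ideas a bit more concrete, here is a minimal sketch in Python. It is not the Kettle code (that is Java and uses NIO); the names LazyRow and read_fixed_rows are made up for this illustration, and it assumes every row has the same length and ends in a single newline byte.

```python
import io


class LazyRow:
    """Keeps the raw bytes of one fixed-width row; fields are decoded on demand."""

    def __init__(self, raw, widths):
        self._raw = raw
        self._widths = widths
        self._cache = {}

    def field(self, index):
        # Lazy conversion: bytes are only turned into a string when asked for.
        if index not in self._cache:
            start = sum(self._widths[:index])
            end = start + self._widths[index]
            self._cache[index] = self._raw[start:end].decode("ascii").strip()
        return self._cache[index]


def read_fixed_rows(stream, widths, rows_per_chunk=1000):
    """Yield LazyRow objects, fetching many rows from the stream per read call."""
    row_len = sum(widths) + 1  # assumption: one trailing newline byte per row
    while True:
        chunk = stream.read(row_len * rows_per_chunk)
        if not chunk:
            break
        # Slice the chunk into rows; a trailing partial row is dropped.
        for offset in range(0, len(chunk) - row_len + 1, row_len):
            yield LazyRow(chunk[offset:offset + row_len], widths)


if __name__ == "__main__":
    sample = io.BytesIO(b"abc  00042\nxyz  00007\n")
    for row in read_fixed_rows(sample, [5, 5]):
        print(row.field(1))
```

The point of the sketch is that a row that only gets passed along never pays for string or number conversion, and that the number of read calls drops by a factor of rows_per_chunk.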

So where does that leave us? How fast can we read data now? Well, let’s take a really big file and try it out for ourselves to see.

Generate a text file

A test file is a problem: for reference you want everyone to use the same text file, and yet it’s just not practical to post a 15GB text file on an FTP server somewhere.

So you need to generate one. I created a small C program to handle this: printFixedRows.tar.gz

A Linux (x86) binary is included in the archive, but you can compile the C program for your own platform with the following command:

cc -o printFixedRows printFixedRows.c

Then you can launch this command to get a 15GB test file:

./printFixedRows 10000000 150 N > bigfixedfile.txt

This generates 10 million rows with 1529 bytes on each row, for a total size of 15,290,001,529 bytes.
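
If you can’t compile the C program, a rough stand-in is easy to write. The sketch below is in Python and is NOT a port of printFixedRows.c: the function name write_fixed_file, the column layout and the header line are all assumptions made for illustration, and it does not reproduce the data types of the real file. It only shows the idea of writing rows of a fixed byte length. (As a side note, the quoted size equals 10,000,001 × 1529, which would be consistent with one extra line such as a header; that is an inference from the numbers, not something checked against the C source.)

```python
import os
import tempfile


def write_fixed_file(path, rows, fields, field_width, header=True):
    """Write `rows` lines of `fields` fixed-width columns; return bytes written."""
    written = 0
    with open(path, "wb") as out:
        if header:
            # Hypothetical header: column names padded to the field width.
            names = "".join(
                f"f{i}".ljust(field_width)[:field_width] for i in range(fields)
            )
            written += out.write(names.encode("ascii") + b"\n")
        for r in range(rows):
            # Hypothetical content: right-aligned numbers, cut to the field width.
            line = "".join(
                str(r + i).rjust(field_width)[-field_width:] for i in range(fields)
            )
            written += out.write(line.encode("ascii") + b"\n")
    return written


if __name__ == "__main__":
    target = os.path.join(tempfile.mkdtemp(), "smallfixed.txt")
    size = write_fixed_file(target, rows=1000, fields=4, field_width=10)
    print(target, size)
```

Every line, header included, is fields × field_width + 1 bytes long, so the file size is simply the number of lines times that row length.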

Reading a text file

Now that we’re certain that the system cache is not going to make too much of a difference in the results, we’re going to read this file.

Here is the transformation to read the file.

[Screenshot: parallel reading transformation]

Performance monitoring

If we adjust the transformation and point it to the correct filename, we can run it.

We can then see that the performance is around 12k rows/s or 12000×1529 bytes/s = 18MB/s.
Each row has 152 fields in it of 4 different data types (String, Integer, Date and Number). As such, this step generates 1.8 million data fields per second.
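
For those who want to redo the arithmetic, here it is as a small Python sketch. The function name throughput is made up, and it assumes 1MB = 1,000,000 bytes (with 1MB = 1,048,576 bytes the figure would come out a little lower).

```python
def throughput(rows_per_second, bytes_per_row, fields_per_row):
    """Return (megabytes per second, fields per second) for a given row rate."""
    mb_per_second = rows_per_second * bytes_per_row / 1_000_000
    fields_per_second = rows_per_second * fields_per_row
    return mb_per_second, fields_per_second


# Figures quoted in the post: 12k rows/s, 1529 bytes and 152 fields per row.
mb, fields = throughput(12_000, 1529, 152)
print(mb, fields)
```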

In fact if we look at the results of “iostat -k 5” we basically find the same results:

[Screenshot: iostat results]

We also note that the CPU usage is very low at around 22% and that the I/O Wait % is very high. In other words: our transformation is waiting for the disk to give back data. The NIO system really pays off here by reading large blocks at a time from disk, offering great performance.

Parallel reading

At first glance, parallel reading wouldn’t do us any good here since the disk can’t go any faster. Having 2 processes read the same file at the same time is not the solution.

So I copied the file over to another disk and made symbolic links in the /tmp folder:

  • /tmp/0 for step copy 0 with a sym-link to bigfixed.txt on disk 1
  • /tmp/1 for step copy 1 with a sym-link to bigfixed.txt on disk 2

In my case it’s this:

0/bigfixed.txt -> /bigone/bigfixed.txt
1/bigfixed.txt -> /parking/bigfixed.txt

Please forget for a minute the fact that copying the data to another disk is slower than reading it in the first place. I’m sure you all have fancy RAID drives that do this transparently; I’m just a poor blogger, remember? :-) If you have a fast disk, you can just fire up 2 or more copies of the step without the trick above to get the speedup.
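
The idea behind the trick can be sketched in a few lines of Python. This is an illustration only, not how Kettle implements step copies: the function names are made up, and the way the rows are divided (each copy gets one contiguous slice of the file) is an assumption. Each worker opens its own path of the form base directory/copy number/filename, so that with the symbolic links above every copy ends up on a different disk.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def row_range(copy_nr, copies, total_rows):
    """Return the [start, end) slice of rows assigned to one copy (assumed split)."""
    per_copy = total_rows // copies
    start = copy_nr * per_copy
    end = total_rows if copy_nr == copies - 1 else start + per_copy
    return start, end


def count_rows(base_dir, filename, row_len, copy_nr, copies):
    """Read only this copy's slice from <base_dir>/<copy_nr>/<filename>."""
    path = os.path.join(base_dir, str(copy_nr), filename)
    total_rows = os.path.getsize(path) // row_len
    start, end = row_range(copy_nr, copies, total_rows)
    seen = 0
    with open(path, "rb") as f:
        f.seek(start * row_len)  # fixed width means row offsets can be computed
        remaining = (end - start) * row_len
        while remaining > 0:
            chunk = f.read(min(remaining, 1 << 20))  # up to 1MB per read
            if not chunk:
                break
            remaining -= len(chunk)
            seen += len(chunk)
    return seen // row_len


def read_in_parallel(base_dir, filename, row_len, copies=2):
    """Start one reader per copy and return the total number of rows read."""
    with ThreadPoolExecutor(max_workers=copies) as pool:
        futures = [
            pool.submit(count_rows, base_dir, filename, row_len, n, copies)
            for n in range(copies)
        ]
        return sum(f.result() for f in futures)
```

With the layout above this would be called as read_in_parallel("/tmp", "bigfixed.txt", 1529). Threads are good enough for this sketch because the work is dominated by waiting on the disk.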

As you all figured out by now, the speed of the transformation just doubled simply by doubling the number of copies to start to 2 for the “Fixed Input” step. Throughput went from 18MB/s to around 40MB/s. The second disk is a bit faster as it’s an internal disk, not a USB drive, and we’re now reading data at around 26k rows/s. (4 million data fields per second)

[Screenshot: iostat results]

Possibilities and conclusions

Because of all the optimizations that we did, we can now process data faster than our limited I/O system can deliver it. That means that using faster disks or in general getting more “spindles” is a good idea if you want to read data faster.

Obviously, the same goes for the output part as well, but that’s a whole different story for another time. For now, we demonstrated that the new “Fixed Input” step can read a 15GB file in 400 seconds flat. That’s not a bad result for my laptop system, any way you look at it. The 2 CPUs are even 75% unused, leaving room for lots of interesting transformations you might want to do on the data.
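
As a sanity check on that number, here is the back-of-the-envelope calculation as a small Python sketch. The function name seconds_to_read is made up, the throughput figures are the rounded ones quoted above, and 1MB is taken as 1,000,000 bytes, so treat the outcome as an estimate and not a measurement.

```python
def seconds_to_read(file_bytes, mb_per_second):
    """Estimate how long it takes to read a file at a steady throughput."""
    return file_bytes / (mb_per_second * 1_000_000)


FILE_BYTES = 15_290_001_529  # size of the generated test file

single = seconds_to_read(FILE_BYTES, 18)  # one copy, one disk
dual = seconds_to_read(FILE_BYTES, 40)    # two copies, two disks
print(round(single), round(dual))
```

The two-disk estimate lands in the same ballpark as the 400 seconds mentioned above.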

Until next time,

Matt
