April 16th 2007 09:59 am
Handling 500M rows
We’ve been doing some tests with medium-sized data sets lately. We extracted around half a year of data (514M rows) from a warehouse where we’re running a database partitioning and clustering test.
Below is an example where we copy more than 500M rows from one database to another one that is partitioned (MS SQL Server to MySQL 5.1). This is done using the following transformation. Instead of using just one partitioned writer, we used three to speed up the process (this lowers latency).

Copying 500M rows is just as easy as copying a thousand, it just takes a little longer…
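
For readers who want a feel for what splitting the stream across three partitioned writers means, here is a minimal Python sketch. It is not the Kettle transformation itself: the hash-modulo rule, the `id` key field and the `write_partition` stand-in are all assumptions made for illustration, and a real writer would issue batched inserts against its own partition.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

WRITER_COUNT = 3  # the post uses three partitioned writers


def partition_for(key, partition_count=WRITER_COUNT):
    """Map a row key to a partition number with a stable hash (assumed rule)."""
    return zlib.crc32(str(key).encode("utf-8")) % partition_count


def split_rows(rows, key_field, partition_count=WRITER_COUNT):
    """Group rows into one bucket per partition."""
    buckets = [[] for _ in range(partition_count)]
    for row in rows:
        buckets[partition_for(row[key_field], partition_count)].append(row)
    return buckets


def write_partition(partition_id, rows):
    """Hypothetical stand-in for a batched INSERT into one partition."""
    return partition_id, len(rows)


def copy_partitioned(rows, key_field, partition_count=WRITER_COUNT):
    """Route rows to their partitions and run one writer per partition in parallel."""
    buckets = split_rows(rows, key_field, partition_count)
    with ThreadPoolExecutor(max_workers=partition_count) as pool:
        results = pool.map(write_partition, range(partition_count), buckets)
        return dict(results)  # {partition_id: rows_written}


if __name__ == "__main__":
    sample = [{"id": i, "amount": i * 1.5} for i in range(1000)]
    print(copy_partitioned(sample, "id"))
```

Because the same key always lands in the same partition, the writers never compete for the same rows, which is what lets them run side by side.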

It would have completed the task a lot faster if we hadn’t been copying to a single table on DB4 at the same time (yep, again 500M rows). This slowed the whole transformation down to the maximum speed of DB4. That being said, if you still had any doubt about Pentaho Data Integration being able to copy large volumes of data, this blog post should pretty much clear those doubts from your mind.
I’m posting these examples to boost your interest in my afternoon talk at the MySQL conference in Santa Clara. I’m going to present some query performance results on the partitioned database, showing near-linear scalability.
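
The DB4 effect can be put into a back-of-the-envelope sketch. It assumes every row is sent to all targets and that the buffers between steps are bounded, so the stream settles at the speed of the slowest target. The rows-per-second figures below are made up for illustration and are not measurements from this test.

```python
def effective_rate(target_rates):
    """Steady-state rows/second when every row must reach every target."""
    return min(target_rates.values())


def estimated_hours(row_count, rows_per_second):
    """Time needed to copy row_count rows at a constant rate."""
    return row_count / rows_per_second / 3600


if __name__ == "__main__":
    # Hypothetical rates, not taken from the post.
    rates = {"partitioned_writers": 30000, "db4_single_table": 10000}
    rate = effective_rate(rates)
    print(f"bottleneck rate: {rate} rows/s")
    print(f"estimated time for 514M rows: {estimated_hours(514_000_000, rate):.1f} hours")
```

Under these assumptions, speeding up the partitioned writers changes nothing until the single-table target gets faster or is taken out of the transformation.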
Until then!
Matt
5 Comments

Ryu on 16 Apr 2007 at 10:31 #
Can you give us an FTP link to download your process, please?
mcasters on 16 Apr 2007 at 10:40 #
You can download here:
http://www.javaforge.com/proj/doc.do?proj_id=318
or here:
http://s3.amazonaws.com/kettle/Kettle-2.4.0.zip
Ryu on 16 Apr 2007 at 15:23 #
Excuse me, I don’t want a link to download Kettle, I want a link to download your process (the .ktr, .kjb, .sql… all I need to test on my platform!)
MySQL Conference and Expo 2007, Day 2 at Xaprb on 17 Jun 2007 at 21:23 #
[…] you been reading Matt’s blog? Do you remember his understated post on processing a large volume of data in parallel with near-linear scalability? I’ve been eagerly reading his articles for a while and it was great to hear him speak and […]
MySQL 5.1 Article Recap · Internet Articles ? on 23 Feb 2008 at 17:26 #
[…] Starting off the coverage of partitioning, Peter Gulutzan wrote a lengthy article on the feature. Over the past year and a half, a number of people have written about partitioning, which seems to be the most popular new feature in MySQL 5.1. Robin Schumacher wrote a series of articles on partitioning, beginning with Improving Database Performance with Partitioning and then More On MySQL Partitioning. His final installment was about Partitioning With Dates in MySQL 5.1. Partha Dutta, over at RightMedia, wrote a couple blog posts that confirmed some fairly massive performance improvements with partitioning — MySQL 5.1 Partitioning (Part 3), MySQL 5.1 Partitioning (Part 4), and Matt Casters (Pentaho) wrote an article on using Kettle and MySQL 5.1 Partitioning for large data sets: Handling 500M Rows […]