Pretty Sick Slick

Last week I was under the weather, and a year ago that would have meant that development of Pentaho Data Integration (PDI) would pretty much stop. These days I’m happy to say that this is absolutely no longer the case. In fact, hundreds of commits were done in the last week.

MySQL bulk load

To pick one example, Samatar Hassan added a job entry that allows you to configure a MySQL bulk load:

MySQL Bulk load

This job entry loads data as fast as possible into a MySQL database by using the LOAD DATA SQL command. It’s not as flexible as the Text File Input step, but it sure is fast. In certain cases, it might actually be up to ten times as fast. In short: another great job by Samatar!
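For readers who have not seen it before, here is roughly what such a statement looks like when issued over JDBC. This is a minimal sketch with invented connection details, table and file names, not the actual code of the job entry:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class MySQLBulkLoadSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical connection details, for illustration only.
            try (Connection con = DriverManager.getConnection(
                    "jdbc:mysql://localhost:3306/test", "user", "password");
                 Statement stmt = con.createStatement()) {

                // LOAD DATA makes the server read the file and append its rows to
                // the target table in one go; the delimiters are just an example.
                String sql =
                    "LOAD DATA INFILE '/tmp/customers.txt' " +
                    "INTO TABLE customers " +
                    "FIELDS TERMINATED BY ';' ENCLOSED BY '\"' " +
                    "LINES TERMINATED BY '\\n' " +
                    "IGNORE 1 LINES";

                System.out.println("Loaded " + stmt.executeUpdate(sql) + " rows");
            }
        }
    }

Most of the speed gain comes from the server parsing the file directly instead of handling row-by-row INSERT statements.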

I’m being told that Samatar is also writing a bulk loader for Microsoft SQL Server and that Sven is working on an Oracle SQL*Loader wrapper.

Wait, there’s more…

In addition to that, I saw the following job entries appear in the last couple of weeks: File Compare, SFTP Put, Display Messagebox Info, Wait for, Zip File and last but not least: XSLT Transformation. We also added the Formula and Abort steps. I’ll get back to you on the Formula step later as it’s an interesting option, although far from complete.

Evil Voices

Evil voices among my readership might suggest that I get sick a bit more often. However, because of the highly modular nature of PDI, it is perfectly possible to develop code in parallel in a safe way. I can assure you all that it’s not that I’m now forced to allow other developers to contribute: everyone who has a great idea and wants to donate code to the PDI project is welcome to do so at any time. The latest avalanche of code is just more proof that open source works and that by opening up as a project you gain a lot in the long run.

Today there are around 48 people that have write access to the Subversion code repository, and around 5-15 people commit code in any given month.

Release management

That is all great, but it does make release management a bit more difficult. I think we should probably take into account a 2 to 3 week delay to get all the new stuff translated, documented and tested a bit more. Of course, you can help out with that as well. Or you can just let us know how you feel about all these new developments.

Another small problem is that, by adding all these new features, it’s almost ridiculous to do a (2.4.1) point release now.

Until next time,

Matt

10 comments

  • Hi Matt,

    I took a look at the bulk loader for MySQL, and it’s neat that you’re taking advantage of “LOAD DATA INFILE”.

    I wonder if you could also take advantage of the fact that, with “LOAD DATA LOCAL INFILE”, the bulk loader doesn’t need to be on the same host as the MySQL server (the client, in this case the JDBC driver, reads the file and streams it to the server), or of the fact that the JDBC driver lets you load from _URLs_ with that command?
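    To illustrate the LOCAL variant: the client (the JDBC driver) reads the file, so the file only has to exist on the machine running the job. A minimal sketch, with invented connection details and file name; depending on the Connector/J version, the allowLoadLocalInfile connection property may also need to be enabled:

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.Statement;

        public class LocalInfileSketch {
            public static void main(String[] args) throws Exception {
                // allowLoadLocalInfile lets Connector/J stream client-side files to the server.
                String url = "jdbc:mysql://dbhost:3306/test?allowLoadLocalInfile=true";
                try (Connection con = DriverManager.getConnection(url, "user", "password");
                     Statement stmt = con.createStatement()) {
                    // With LOCAL, the file path refers to the client machine, not the server.
                    stmt.executeUpdate(
                        "LOAD DATA LOCAL INFILE '/home/etl/customers.txt' " +
                        "INTO TABLE customers FIELDS TERMINATED BY ';'");
                }
            }
        }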

  • Hi Mark,

    As usual, these job entries and steps come about because someone had an itch to scratch. Each time new functionality is added we see feature requests to make extra parameters useful.

    In this case I have to wonder if the “local” checkbox is not enough to make it happen already. Perhaps Samatar can comment on that.

    All the best,

    Matt

  • Hi Matt,

    A second, less cursory look at the source code shows that the “LOCAL” functionality is there (I should have looked closer the first time around).

    I don’t know how it fits into your framework, but in MySQL-5.0 you can apply transformations to the LOAD DATA statement, similar in fashion to an UPDATE statement, which could be handy in some cases.
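    For example (invented table and column names, and assuming MySQL 5.0 or later), raw fields can be read into user variables and then transformed in a SET clause, much like the right-hand side of an UPDATE:

        import java.sql.Connection;
        import java.sql.SQLException;
        import java.sql.Statement;

        public class LoadDataWithSetSketch {
            // Loads a file and applies a per-row transformation while loading.
            static void load(Connection con) throws SQLException {
                try (Statement stmt = con.createStatement()) {
                    stmt.executeUpdate(
                        "LOAD DATA INFILE '/tmp/sales.txt' " +
                        "INTO TABLE sales " +
                        "FIELDS TERMINATED BY ';' " +
                        // Read the raw value into a user variable...
                        "(region, @amount_cents) " +
                        // ...then derive the stored column from it.
                        "SET amount = @amount_cents / 100");
                }
            }
        }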

    Once again, thanks for the support of MySQL as a source and target in the Kettle project!

    Regards,

    -Mark

  • Hey Mark,

    I noticed that as well in the MySQL docs. I think we can add support for the transformations relatively easily. Making it all user-friendly (browse to table, etc.) is another issue.
    However, I’m confident that we will get there eventually.

    Take care,

    Matt

  • Samatar

    Hi Matt, Mark,
    As you noticed, the “local” functionality is already available (via the Local checkbox).
    If you select Local, the file must be on your local host (PDI checks this) and the JDBC driver reads the file and streams it to the server.

    If Local is not specified, the file must be located on the server host (PDI does not check the file). In that case, the server tries to locate the file itself.

    According to the MySQL documentation, these are the rules:

    - If the filename is an absolute pathname, the server uses it as given.

    - If the filename is a relative pathname with one or more leading components, the server searches for the file relative to the server’s data directory.

    - If a filename with no leading components is given, the server looks for the file in the database directory of the default database.

    As Matt suggested, I think the next step should be to make it more user-friendly (browse, …).

    Other news: I finished the opposite bulk load (from MySQL to a file).

    See the tracker item (Change Request [#4912] MySQL Bulk Load to a file):

    http://www.javaforge.com/proj/tracker/itemDetails.do?task_id=4912&orgDitchnetTabPaneId=task-details-comments

    Take care

  • Pingback: Matt Casters on Data Integration » MySQL Bulk export to file

  • Wow nice timing on this! I was about to write my own code for this. It will save me some time, thank you.

    Quick question, if you don’t mind. I’m moving data from a transactional system into a new data warehouse staging area for a client. On a weekly basis, large quantities of transactional data from Oracle 10g are to be migrated into staging tables in MySQL. I don’t have the option of FastReader/FACT or the like, so for the moment I’m spooling customized extracts from tables into files that I was originally going to slurp up with some SQL (LOAD DATA INFILE) and stick into the MySQL staging area. It’s imperative that I get what I need from Oracle and disconnect as quickly as possible, so I chose not to go the route of heterogeneous services (a “dblink” between the two). The data volumes are sizable.

    This new job entry looks perfect for the import/MySQL part, thanks! My question is: do you think I can leverage Kettle for the Oracle exporting as well if speed is my biggest concern? Will Kettle extract from Oracle as quickly as custom spool SQL scripts? Kettle will run on the MySQL host. One of the advantages of this is that I can easily start the importing while the exporting is running, and not impact Oracle with the importing work. The second question is: do you know of a better way than what I’m doing?

    FYI, a recent job for a client included creating an array of Kettle jobs/transformations to push data into a warehouse. Because speed was a major consideration, we did the same work in Java (and with system calls like sort) so we could compare. It’s a very complex series of steps working on large data volumes (200 GB). Kettle was almost literally identical in speed to what we came up with in custom code, once we employed Kettle’s multithreading more wisely. Until I took the class on Kettle I wasn’t aware of all the multithreading abilities. Very nice work. In fact, the recent plan we had to leverage Amazon EC2 is no longer necessary with the optimizations we made in Kettle.

  • Hi Dan,

    This news was actually posted exactly one year ago :-)
    Usually I prefer to answer these questions on our forum, but since it’s BreadBoardBI I’m talking with here, I’ll make an exception. :-)

    As far as these exports are concerned, there are many ways to go about this depending on the situation.
    If the requirement is to get in and out as quickly as possible, you should time how long it takes to do an incremental export and how long it takes to do a full export of the table in question.

    Obviously, if you don’t have a changed date field in the table in question, you’re going to have to do a full export.
    Fortunately, these days, Kettle can export data at hundreds of thousands of records per second (if your source system can handle it).

    I’m guessing here, but I think that dumping the data to a text file would be the fastest.

    Picking up this data and doing a diff using a “Merge Rows” step should be a relatively easy exercise.

    YMMV, it depends on the situation, etc, etc.

    Cheers,
    Matt

  • Too funny. I guess I forgot how to read dates, let alone how to spell my own name. That’s great because it likely means this is already in the latest release. ;) Thank you Matt. I’ll keep any future questions to the forum. I think my search for kettle/’load data infile’ led me here.

    Right, the extraction will be date-field related. I’ve already got the exporting set up via a shell script and some SQL spool scripts. If I could get them to purchase FastReader I’d go that route, but it appears that’s not going to happen.

    I think I’ll keep that in place and then tie in Kettle for the importing and the later ETL work for the DW.

    Greatly appreciate your fast feedback!

  • Abid

    Hi,

    What is the correct syntax to write this in the source file name path:

    ${localdir}/ix.*${TODAYYYMMDD}rguro\.txt

    Thanks

    Abid