Handling errors

In the next milestone build of Pentaho Data Integration (2.4.1-M1, expected around February 19th) we will introduce advanced error handling features.
We looked hard for the easiest and most flexible way to implement this, and I think we have found a good solution.

Here is an example:

Error handling table output sample

The transformation above works as follows: it generates a sequence between -1000 and 1000.  The table is a MySQL table with a single “id” column defined as TINYINT.  As you all know, that data type only accepts values between -128 and 127.

So what this transformation does is insert the 256 rows whose values fit in that range (-128 through 127) into the table, and divert the other 1,745 rows to a text file, our “error bucket”.
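
To make the mechanics concrete, here is a rough plain-Java/JDBC sketch of the same insert-or-divert logic outside of Kettle. The connection details, table name, and file name are made up for illustration; the real work is of course done by the steps themselves:

    import java.io.FileWriter;
    import java.io.PrintWriter;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class ErrorBucketSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical connection details, for illustration only.
            Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/test", "user", "password");
            PreparedStatement insert =
                conn.prepareStatement("INSERT INTO tiny_table (id) VALUES (?)");
            PrintWriter errorBucket = new PrintWriter(new FileWriter("errors.txt"));

            // Generate the sequence -1000..1000: 2001 rows in total.
            for (int id = -1000; id <= 1000; id++) {
                try {
                    insert.setInt(1, id);
                    insert.executeUpdate(); // succeeds for the 256 values in -128..127
                } catch (SQLException e) {
                    // TINYINT overflow: divert the row to the "error bucket".
                    errorBucket.println(id + "\t" + e.getMessage());
                }
            }
            errorBucket.close();
            conn.close();
        }
    }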

How can we configure this new feature?  Simple: right-click on the step where you want the error handling to take place, in this case the “Table output” step.  If the step supports error handling, you will see this popup menu appear:
Error handling popup menu

Selecting the highlighted option will present you with a dialog that allows you to configure the error handling:

Error handling dialog

As you can see, you can not only specify the target step to which you want to direct the rows that caused an error, but also include extra information in the error rows so that you know exactly what went wrong.  In this particular case, these are the extra fields that will appear in the error rows:

  • nrErrors: 1
  • errorDescription:  be.ibridge.kettle.core.exception.KettleDatabaseException:
    Error inserting row
    Data truncation: Out of range value adjusted for column ‘id’ at row 1
  • errorField: empty because we can’t retrieve that information from the JDBC driver yet.
  • errorCode: TOP001 (placeholder, final value TBD)
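
Conceptually, each diverted row is simply the original row with those four fields appended to the end. A minimal hypothetical sketch of that layout (the field names mirror the dialog; Kettle's internal row handling is different):

    // Hypothetical helper, not Kettle's internal API: append the four
    // error fields described above to a row that is being diverted.
    public class ErrorRowSketch {
        static Object[] withErrorFields(Object[] row, long nrErrors,
                String errorDescription, String errorField, String errorCode) {
            Object[] out = new Object[row.length + 4];
            System.arraycopy(row, 0, out, 0, row.length);
            out[row.length]     = nrErrors;         // nrErrors
            out[row.length + 1] = errorDescription; // errorDescription
            out[row.length + 2] = errorField;       // errorField (null for now)
            out[row.length + 3] = errorCode;        // errorCode, e.g. TOP001
            return out;
        }
    }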

At the moment we have only equipped the “Script Values” step (where it is easy to cause errors) and the “Table Output” step with these new capabilities.  However, in the coming weeks, more steps will follow suit.

Until next time,

Matt

8 comments

  • Hi Matt,
    That’s a great new feature!!!
    I want it now! No more tricky procs to manage bad and rejected records (we have to manage 20M+ rows a day…).
    PDI is going BIG.

    Vincent Teyssier
    http://www.decisionsystems-studio.fr
    http://open-bi.blogspot.com

  • Hi Vincent,

    If you want it NOW, download the dev build of Kettle-2.4.1-M1.zip (link in blog post). After that, update with a recent kettle.jar.

    Please be careful: we have already discovered small issues with this. Apache Commons VFS support was also added in this version. More on that later, but it allows you to read directly from files like zip:file:///C:/testfiles/testfiles.zip with a wildcard (.*txt$), which will give you all the text files in the zip archive. I’ll blog about this in a few days, once we have stabilised it a bit.
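
    Roughly like this through the Commons VFS API (a quick sketch only, package names taken from VFS 1.x, details may still change):

        import org.apache.commons.vfs.FileObject;
        import org.apache.commons.vfs.FileSystemManager;
        import org.apache.commons.vfs.VFS;

        public class VfsZipSketch {
            public static void main(String[] args) throws Exception {
                FileSystemManager fsManager = VFS.getManager();
                // One virtual URL addresses the files inside the zip archive.
                FileObject zipRoot = fsManager.resolveFile(
                    "zip:file:///C:/testfiles/testfiles.zip");
                for (FileObject child : zipRoot.getChildren()) {
                    // Apply the wildcard to keep only the text files.
                    if (child.getName().getBaseName().matches(".*txt$")) {
                        System.out.println(child.getName().getURI());
                    }
                }
            }
        }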

    :-)

    Matt

  • Frank

    Hi Matt,

    great new feature!

    I would suggest enhancing the error handling with an abort condition, i.e. abort the transformation after n rows have been rejected. If you are processing millions of rows and a lot of them are bad (shit happens… ;-) ), you usually don’t want the data to be processed anyway, and as a bonus you save time and resources this way.

    Regards
    Frank

  • Thanks Matt. Already running it now … :)
    I agree with Frank on the abort condition.
    Moreover, a summary of the process would be a great thing for data quality management: simple stats on rejected/bad rows, just to get a quick look at the data source quality.
    Anyway, those are minor points, and this new feature is very welcome.

    Vincent Teyssier
    http://www.decisionsystems-studio.fr
    http://open-bi.blogspot.com

    Thanks for the suggestion, Frank. With things like this you can clearly see why we went with a completely new dialog instead of some new hop type (the original idea). It would be almost trivial to add an abort threshold, so I’ll add it later today. :-)
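
    Something along these lines, I imagine (a hypothetical sketch with made-up names, not the final code):

        // Hypothetical sketch: count diverted rows and abort the
        // transformation once a configurable threshold is passed.
        public class ErrorThresholdSketch {
            private final long maxErrors;
            private long nrErrors = 0;

            public ErrorThresholdSketch(long maxErrors) {
                this.maxErrors = maxErrors;
            }

            // Call once per diverted row; throws when the limit is exceeded.
            public void countRejected() {
                nrErrors++;
                if (maxErrors > 0 && nrErrors > maxErrors) {
                    throw new IllegalStateException("Aborting: " + nrErrors
                        + " rows rejected (limit " + maxErrors + ")");
                }
            }
        }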

    All the best,

    Matt

    Data quality: I’m seriously thinking of adding a column in the view to show the number of “diverted” rows. That way you have access to that data in the audit trails and in the job evaluations.

    We can add all sorts of cool stuff later, but for now, let’s make sure it’s stable first ;-)

    Matt

  • sathish

    It’s a fine feature… can you extend this feature to other important steps like Insert/Update and LookUp?

    Yes, those two will follow.