Re-Introducing UDJC

Dear Kettle fans,

Daniel & I had a lot of fun in Orlando last week. Among other things we worked on the User Defined Java Class (UDJC) step. If you have a bit of Java experience, this step allows you to quickly write your own plugin in a step. This step is available in recent builds of Pentaho Data Integration (Kettle) version 4.

Now, how does this work? Well, let’s take Roland Bouman’s example: the calculation of the date of Easter. In this blog post, Roland explains how to calculate Easter in MySQL and Kettle using JavaScript. OK, so what if you want this calculation to be really fast in Kettle? Well, then you can turn to pure Java to do the job…

import java.util.*;

private int yearIndex;
private Calendar calendar;

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
{
    Object[] r = getRow();
    if (r == null)
    {
        // No more input rows: signal that this step is done.
        setOutputDone();
        return false;
    }

    if (first) {
        // One-time setup on the first row: look up the year field named by the YEAR parameter.
        yearIndex = getInputRowMeta().indexOfValue(getParameter("YEAR"));
        if (yearIndex < 0) {
            throw new KettleException("Year field not found in the input row, check parameter 'YEAR'!");
        }

        // Reuse a single Calendar with all fields cleared, so the time of day stays at midnight.
        calendar = Calendar.getInstance();
        calendar.clear();

        first = false;
    }

    // Copy the input row into an array large enough to hold the extra output field.
    Object[] outputRowData = RowDataUtil.resizeArray(r, data.outputRowMeta.size());
    int outputIndex = getInputRowMeta().size();

    Long year = getInputRowMeta().getInteger(r, yearIndex);
    outputRowData[outputIndex++] = easterDate(year.intValue());

    putRow(data.outputRowMeta, outputRowData);

    return true;
}

private Date easterDate(int year) {
    int a = year % 19;
    int b = (int)Math.floor(year / 100);
    int c = year % 100;
    int d = (int)Math.floor(b / 4);
    int e = b % 4;
    int f = (int)Math.floor(( 8 + b ) / 25);
    int g = (int)Math.floor((b - f + 1) / 3);
    int h = (19 * a + b - d - g + 15) % 30;
    int i = (int)Math.floor(c / 4);
    int k = c % 4;
    int L = (32 + 2 * e + 2 * i - h - k) % 7;
    int m = (int)Math.floor((a + 11 * h + 22 * L) / 451);
    int n = h + L - 7 * m + 114;

    // n / 31 is the month (3 = March, 4 = April); Calendar months are zero-based, hence the - 1.
    calendar.set(year, (int)(Math.floor(n / 31) - 1), (int)((n % 31) + 1));
    return calendar.getTime();
}

All you then need to do is specify a return field in the Fields tab called “Easter” (a Date) and a parameter YEAR (the field to contain the year).

Screen shot of the UDJC step
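
If you want to check the arithmetic without starting Spoon, the same calculation can be run as a plain Java class. What follows is only a sketch under a few assumptions: the class name EasterSketch is made up for illustration, it returns a java.time.LocalDate (Java 8 or later) instead of the java.util.Date used in the step, and it relies on plain integer division, which truncates exactly like (int)Math.floor() does for the non-negative values involved here. It is not part of Kettle or of the UDJC step itself.

```java
import java.time.LocalDate;

// Hypothetical standalone class, for checking the Easter arithmetic outside Kettle.
class EasterSketch {

    // Same formula as easterDate() in the step, written with integer division.
    static LocalDate easterDate(int year) {
        int a = year % 19;
        int b = year / 100;
        int c = year % 100;
        int d = b / 4;
        int e = b % 4;
        int f = (8 + b) / 25;
        int g = (b - f + 1) / 3;
        int h = (19 * a + b - d - g + 15) % 30;
        int i = c / 4;
        int k = c % 4;
        int l = (32 + 2 * e + 2 * i - h - k) % 7;
        int m = (a + 11 * h + 22 * l) / 451;
        int n = h + l - 7 * m + 114;

        // n / 31 is the month (3 = March, 4 = April), n % 31 + 1 the day of the month.
        // LocalDate months are one-based, so no "- 1" is needed here.
        return LocalDate.of(year, n / 31, (n % 31) + 1);
    }

    public static void main(String[] args) {
        // Print the computed Easter date for a few sample years.
        for (int year : new int[] {2000, 2010, 2024}) {
            System.out.println(year + " -> " + easterDate(year));
        }
    }
}
```

Comparing a handful of years against a published list of Easter dates is a quick way to catch a mistyped constant before pasting the formula into the step.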

The performance on my machine (dual-core Core 2 Duo, 2.33 GHz) is 134,000 rows/s for the JavaScript version and 450,000 rows/s for the UDJC version. That’s over 3 times faster to do exactly the same thing.

Here is a link to the Kettle test transformation for those who want to give it a try. As you can see, the deployment issue of having a plugin around is completely gone, since you can now do anything you can do with a plugin from within the comfort of the UDJC step in Spoon.

The UDJC step uses the wonderful Janino library to compile the entered code to Java byte-code that gets executed at the same speed as everything else in Kettle.  This gives us pretty much optimal performance.

You can expect some tweaks to the UDJC step before 4.0 goes into feature freeze.  However, the bulk of the changes are in there and working great.  Thank you Daniel, for an outstanding job!

Until next time,

Matt

17 comments

  • Hi!

    thanks for linking, and great work on this example!
    I must say, I played around with Daniel’s step a while back, and it’s great to see you’re working together on it.

    I am a little bit surprised by the performance increase: 3 to 4 times is nice of course, but I expected to see about a 10x speedup, based on the findings described in my earlier JavaScript performance blog post (http://rpbouman.blogspot.com/2009/11/pentaho-data-integration-javascript.html) comparing Java user defined expressions and JavaScript, and of course Daniel’s video (http://people.mozilla.com/~deinspanjer/KettleJSPerformance.mov) where he introduced his UJC step.

  • There’s actually quite some room for improvement in the Java code.
    You don’t need the (int)Math.floor() calls – simply do integer division, and get decimal truncation for free – it’s safe in this case.
    (That’s exactly what I do in the MySQL stored procedure BTW – using DIV instead of / as division operator)

  • I don’t think that’s the reason. I actually trust the Java optimizer to take care of those pesky things.
    The issue at hand is that the more you burn CPU in a task, the less the performance difference is going to be.
    I’m sure the calculations + the Date handling are particularly costly to do.
    Java takes pretty much anything into account including leap seconds, leap years, etc, etc. As such, Date handling is notoriously S.L.O.W.

    Anyway, you can try to optimize, but I just couldn’t work myself up over it if you know what I mean :-)

  • Ok. Well I might try it myself in my copious free time ;)

    At any rate, thanks for the reply. BB!

  • Raj

    Hi, I am trying to use the UDJC plugin.

    I’ve downloaded the janino.jar and placed it under /plugins/steps/

    Should I be doing something more? I am using Kettle 4

    So, how do I add this plugin?

  • Raj

    Hi

    This is the error I am getting.

    Error reading object from XML file

    Unable to load step info from XML step node
    org.pentaho.di.core.exception.KettleStepLoaderException:
    Unable to load class for step/plugin with id [UserDefinedJavaClass]. Check if the plugin is available in the plugins subdirectory of the Kettle distribution.

    Unable to load class for step/plugin with id [UserDefinedJavaClass]. Check if the plugin is available in the plugins subdirectory of the Kettle distribution.

  • Raj, you don’t need to install anything, this is Kettle4 we’re talking about.
    The step is available in the experimental step category until we finish the re-design of it.
    All that’s left to do currently is the user interface. The rest is pretty much done.

  • Raj

    Matt

    Thanks for your prompt help!

    I’ve got the latest PDI from sourceforge.net, “4.0.0-M2”. So that’s not the right one? If so, sure, I’ll wait.

    Thanks a lot.

    Raj!

  • dhartford

    that is awesome! ’nuff said.

  • TM

    I need a clarification. Can the UDJC reference other Java classes? Is it possible to take a bunch of Java classes and package them up for use with UDJC?

  • Of course TM, all classes in your class path can be accessed.
    In practice, all you need to do is drop an extra JAR file in the libext/ folder and you’re good to go.

  • TM

    Thanks Matt for the quick response. But I want to be sure I understand this exactly right. I will have a class, let’s say JC, which references classes JC1 and JC2. JC1 references classes JC11 and JC12. JC2 references JC21 and JC22, and so on. The point is that only one class will be the UDJC, but that class would be referencing multiple classes. Your response indicates that there would not be any problem using this class, but I wanted to be sure. This is really exciting to me since it will make my life much easier. I was reluctantly planning to learn how to implement a plugin.

    Apparently this is in ‘incubation’. When is this likely to be released?

  • There is a 4.0.0 Preview Release out there, an RC1 around the end of this month and a few weeks later we will have a stable release.

  • Renato Back

    Hello Matt,

    Is there any chance you might get the same thing using PDI 3.2?
    I’m having some serious issues running Java classes inside PDI and this example seems like the perfect solution to be based upon.
    Thanks in advance,

    Renato Back

  • Renato, there is a plugin available for 3.2. It’s not as evolved but should do the basic trick.
    You can see the code over here:

    http://source.pentaho.org/svnkettleroot/plugins/UserDefinedJavaClass/branches/3.2x/

    No guarantees. If you feel like back-porting or working on it, let us know ;-)

  • Bradlee

    Hi Matt,
    I use PDI 4.0.1 and need to load data from Oracle into a flat file, but so far I can only get a maximum load rate of 3000/s.
    I work with large data, about 200 million records, and I want to use this technique.
    – What should I do first?
    – How can I implement this on a Linux system?

    thx.
    Bradlee

  • It’s a good question Bradlee, you should try asking it on the forum.