March 2nd 2010

Kettle log text capturing

Dear Kettle fans,

As you know, Kettle 4.0 received a new logging framework not so long ago.  It allows us to know exactly where a log-line comes from, even in complex ETL situations.

So when codek asked to know the cause of errors in a job, it was quite easy to implement this.

Here is a single screenshot that should explain it all:

Until next time!

Matt


January 27th 2010

Re-Introducing UDJC

Dear Kettle fans,

Daniel & I had a lot of fun in Orlando last week. Among other things we worked on the User Defined Java Class (UDJC) step.  If you have a bit of Java experience, this step allows you to quickly write your own plugin inside a step. It is available in recent builds of Pentaho Data Integration (Kettle) version 4.

Now, how does this work?  Well, let’s take Roland Bouman’s example: the calculation of the date of Easter.  In this blog post, Roland explains how to calculate Easter in MySQL and Kettle using JavaScript.  OK, so what if you want this calculation to be really fast in Kettle?  Well, then you can turn to pure Java to do the job…

import java.util.*;

private int yearIndex;
private Calendar calendar;

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
{
    Object[] r = getRow();
    if (r == null)
    {
        // No more input rows: signal that this step is done.
        setOutputDone();
        return false;
    }

    if (first) {
        // One-time setup: look up the year field named by the YEAR parameter.
        yearIndex = getInputRowMeta().indexOfValue(getParameter("YEAR"));
        if (yearIndex < 0) {
            throw new KettleException("Year field not found in the input row, check parameter 'YEAR'!");
        }

        calendar = Calendar.getInstance();
        calendar.clear();

        first = false;
    }

    // Make room for the extra output field and append the Easter date to the row.
    Object[] outputRowData = RowDataUtil.resizeArray(r, data.outputRowMeta.size());
    int outputIndex = getInputRowMeta().size();

    Long year = getInputRowMeta().getInteger(r, yearIndex);
    outputRowData[outputIndex++] = easterDate(year.intValue());

    putRow(data.outputRowMeta, outputRowData);

    return true;
}

private Date easterDate(int year) {
    int a = year % 19;
    int b = (int)Math.floor(year / 100);
    int c = year % 100;
    int d = (int)Math.floor(b / 4);
    int e = b % 4;
    int f = (int)Math.floor(( 8 + b ) / 25);
    int g = (int)Math.floor((b - f + 1) / 3);
    int h = (19 * a + b - d - g + 15) % 30;
    int i = (int)Math.floor(c / 4);
    int k = c % 4;
    int L = (32 + 2 * e + 2 * i - h - k) % 7;
    int m = (int)Math.floor((a + 11 * h + 22 * L) / 451);
    int n = h + L - 7 * m + 114;

    // n / 31 is the 1-based month; Calendar months are 0-based, hence the -1.
    calendar.set(year, (int)(Math.floor(n / 31) - 1), (int)((n % 31) + 1));
    return calendar.getTime();
}

All you then need to do is specify a return field in the Fields tab called “Easter” (a Date) and a parameter YEAR (the field to contain the year).
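
If you want to check the arithmetic outside of Kettle first, here is a minimal standalone sketch of the same calculation. The class name EasterSketch is made up for this illustration, and it uses java.time.LocalDate (Java 8 or later) instead of Calendar, which is an assumption on my part and not what the UDJC step itself uses. Since all values are positive ints, plain integer division does the same job as the Math.floor calls.

```java
import java.time.LocalDate;

// Hypothetical helper class, not part of Kettle: same arithmetic as easterDate() in the step.
public class EasterSketch {

    static LocalDate easterDate(int year) {
        int a = year % 19;
        int b = year / 100;          // integer division already rounds down for positive values
        int c = year % 100;
        int d = b / 4;
        int e = b % 4;
        int f = (8 + b) / 25;
        int g = (b - f + 1) / 3;
        int h = (19 * a + b - d - g + 15) % 30;
        int i = c / 4;
        int k = c % 4;
        int l = (32 + 2 * e + 2 * i - h - k) % 7;
        int m = (a + 11 * h + 22 * l) / 451;
        int n = h + l - 7 * m + 114;

        // n / 31 is the month (3 = March, 4 = April); LocalDate months are 1-based.
        return LocalDate.of(year, n / 31, (n % 31) + 1);
    }

    public static void main(String[] args) {
        System.out.println(easterDate(2010)); // prints 2010-04-04
    }
}
```

Running it for a handful of years you know the answer for is a quick sanity check before pasting the logic into the step.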

Screen shot of the UDJC step

The performance on my machine (Core 2 Duo, 2.33 GHz) is 134,000 rows/s for the JavaScript version and 450,000 rows/s for the UDJC version.  That’s over 3 times faster to do exactly the same thing.

Here is a link to the Kettle test transformation for those that want to give it a try.  As you can see, the deployment issue of having a plugin around is completely gone since now you can do anything you can do with a plugin from within the comfort of the UDJC step in Spoon.

The UDJC step uses the wonderful Janino library to compile the entered code to Java byte-code that gets executed at the same speed as everything else in Kettle.  This gives us pretty much optimal performance.

You can expect some tweaks to the UDJC step before 4.0 goes into feature freeze.  However, the bulk of the changes are in there and working great.  Thank you Daniel, for an outstanding job!

Until next time,

Matt


January 14th 2010

Job drill down & sniff testing

Dear Kettle fans,

Besides refactoring and cleaning up code, I fortunately can write some new code once in a while as well.

Today, I’m happy to demo job drill down and step sniff testing for you.

The first feature, job drill down, allows you to drill down into a running job entry, into sub-jobs, transformations and even mappings (sub-transformations). At every level, you’ll see the logging for that part of the root job as well as the usual metrics:

The second feature is also a lot of fun. It allows you to execute a transformation in Spoon and see the rows that are coming out of a step in real time. I called this feature a sniff test since that’s what it seems to be doing:

Hope you like these small usability features.  If you want to try it out yourself, get Hudson build 1598 or later.

Until next time,

Matt


December 8th 2009

Open Source BI : Pentaho Rules!

Dear Pentaho friends,

I just wanted to share some good news with you regarding a new study that got published today.

Mark Madsen, an independent business intelligence analyst, together with the BeyeNETWORK took an in-depth look at what is going on in companies with respect to open source BI.**

You can find the complete report over at the BeyeNETWORK, but here is the graph that particularly interested me:

My congratulations to the Pentaho community for pulling this off.

Until next time,
Matt

** This study was sponsored by JasperSoft, KickFire, Talend and Pentaho


October 30th 2009

Book Review : Pentaho Reporting 3.5 for Java Developers

Hi Pentaho fans,

These are exciting times for Pentaho for sure.  These are also extremely busy times.  However, that doesn’t mean we can’t look around once in a while.  Today we’ll take a quick look at a new book that arrived on my doorstep a few weeks ago.  It’s titled

Pentaho Reporting 3.5 for Java Developers

I’m very pleased to be able to review this book as it is written by one of the smartest but more importantly also one of the nicest people at Pentaho: Will Gorman.  Not only that, he apparently had help from KC (Kurtis Cruzada) and Jem (Matzan) completing the dream team for this book.

And what a great book it turned out to be.  It covers pretty much everything, from basic reporting, mobile reporting, calculations and formulas, sub-reporting, cross-tabs and charting down to the Java API.

Obviously, this book has been reviewed many times before by various people and websites. (Yes, it’s that popular.)  To me that means that I can’t just do a quick review; I’m going to have to actually read and use the book.  And that’s what we’ll do today for this review.

We’re going to create a report in the form of a PDF.  The data for the report comes from a Kettle transformation.  We’re going to do it with my favorite programming language (Java) and a complete stack of Open Source Software…

I began by creating a new Eclipse project called KettleBook. You can download the source over here.
To make sure I didn’t miss any library dependencies, I used the complete “lib” folder of Pentaho Report Designer 3.5 as my class path (not included in the download).

First, I went to Chapter 10 in the book and started reading the paragraph titled “Building a report using Pentaho Reporting’s API” as that seems to fit the bill. (page 266)

That part explains, plain and simple, how to create a new Master Report and how data sources work.  But wait, I don’t want a DefaultTableModel, I want to read from Kettle!  Well, a few page flips later we find ourselves on page 143 reading about the KettleDataFactory.  That got me quite far actually, as the sample is quite descriptive.

So then I created a small transformation to read from a sample customer file using Pentaho Data Integration 3.2.  This is it:

It reads 100 rows of sample customer data, filters out the people from California, Florida and New York state.  That gives us 91 records.  We’re going to read from the RESULT step placeholder.

The part on page 147 I needed was this block:

KettleTransFromFileProducer producer = new KettleTransFromFileProducer("Customer data", transFile, stepName, "", "", new String[0], new ParameterMapping[0]);
KettleDataFactory factory = new KettleDataFactory();
factory.setQuery("default", producer);

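This block registers a producer with the data factory: it tells the reporting engine which transformation file and which step to read the rows from.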

I then proceeded on page 269 and put a document header and footer on the report and an item band.  Then I put 4 columns on the page and the report was written.  This took me all of about 30 minutes. The nice folks at Pentaho Orlando will have to forgive me, reporting is not my specialty. Personally I was quite pleased that it was that easy to do.

So, with the report definition ready, I now wanted to create an actual PDF out of it.  More reading revealed that I needed a PDF output processor (to generate the actual file) and a pageable report processor to paginate and process the report definition.  This is how it looks in my case:

FileOutputStream fos = new FileOutputStream("files/output.pdf");
DefaultConfiguration configuration = new DefaultConfiguration();
PdfOutputProcessor processor = new PdfOutputProcessor(configuration, fos);
PageableReportProcessor reportProcessor = new PageableReportProcessor(report, processor);
reportProcessor.processReport();

5 lines of code to generate a PDF! Suffice it to say I was very happy.

In total I spent a little over an hour to produce this document:

It’s quite simple: if it weren’t for the book I would have had a really hard time figuring out where to begin.  I probably would have had to talk to Thomas Morgner, the brains behind Pentaho Reporting.  Nice fellow that he is, communicating with him is not for the faint-hearted. (Fortunately he recently moved to Ireland, so things will get better soon.)

All joking aside, if you are planning to create reports using the Java API, do yourself a favor and buy this book right away.  Even if you’re not going to use the API, Pentaho Reporting principles and concepts are explained in great detail.

Many thanks to Packt Publishing for sending me the book to review and congratulations to Will Gorman and the reviewers for an excellent job.  Congratulations to Thomas and his community too for making Pentaho Reporting 3.5 a smash hit.

Until next time,
Matt

P.S. Obviously, I’ll be covering more of this Java API sample at the upcoming Devoxx conference in Antwerp.
