May 8th 2012

Better Data for Better Analytics

Dear Kettle friends,

Thursday May 10th, in a few days, I’ll be joining my friend Kasper Sørensen (the founder and lead architect of DataCleaner, a Human Inference data profiling project) in our web seminar (webinar).  We’ll be going over a bit of history, our cooperation model as well as the architecture behind the new data quality features.

Register herehttp://www.pentaho.com/resources/events/20120510-better-data-for-better-analytics/

Kasper will also be doing 3 cool live demos on the subjects of data profiling and data quality.

I hope you’ll be able to join the crowd this Thursday May 10th, 10am PST (Los Angeles), 1pm EST (New York) or 7pm CET (Brussels).

We’ll be doing our best to answer your data quality questions simultaneously with the presentation.

See you there!

Cheers,
Matt

No Comments yet »

March 8th 2012

DM-Radio today

Dear Kettle fans,

Today I’ll be joining Eric Kavanagh and Jim Ericson on the DM Radio show for an episode titled “On the move: Why ETL is here to stay”.

If you’re interested in listening in, don’t forget to register at the landing page over here.

All the best,

Matt

No Comments yet »

January 30th 2012

Big Kettle News

Dear Kettle fans,

Today I’m really excited to be able to announce a few really important changes to the Pentaho Data Integration landscape. To me, the changes that are being announced today compare favorably to reaching Kettle version 1.0 some 9 years ago, or reaching version 2.0 with plugin support or even open sourcing Kettle itself…

First of all…

Pentaho is again open sourcing an important piece of software.  Today we’re bringing all big data related software to you as open source software.  This includes all currently available capabilities to access HDFS, MongoDB, Cassandra, HBase, the specific VFS drivers we created as well as the ability to execute work inside of Hadoop (MapReduce), Amazon EMR, Pig and so on.

This is important to you because it means that you can now use Kettle to integrate a multitude of technologies, ranging from files over relational databases to big data and NoSQL.  You can do this in other words without writing any code.  Take a look at how easy it is to program for Hadoop MapReduce:

In other words, this part of the big news of today allows you to use the best tool for the job, whatever that tool is. You can now combine the large set of steps and job entries with all the available data sources and use that to integrate everything. Especially for Hadoop the time it takes to implement a MapReduce job is really small taking the sting out of costly and long training and testing cycles.

But that’s not all…

Pentaho Data Integration as well as the new big data plugins are now available under the Apache License 2.0. This means that it’s now very easy to integrate Kettle or the plugins in 3rd party software. In fact, for Hadoop, all major distributions are already supported including: Amazon Elastic MapReduce, Apache Hadoop, Cloudera’s Distribution including Apache Hadoop (CDH), Cloudera Enterprise, EMC Greenplum HD, HortonWorks Data Platform powered by Apache Hadoop, and MapR’s M3 Free and M5 Edition.
The change of Kettle from LGPL to Apache License 2.0 was broadly supported by our community and acts as an open invitation for other projects (and companies) to integrate Kettle. I hope that more NoSQL, Big Data and Big Search communities will reach out to us to work together to even broaden our portfolio. The way I see it, the Kettle community just got a whole lot bigger!

Where are the goodies?

The main landing page for the Big Data community is placed on our wiki to emphasize our intention to closely work with the various communities to make Pentaho Big Data a success. You can find all information over there, including a set of videos, PDI 4.3.0 preview download (including Big Data plugins), Hadoop installation instructions, PRD configuration information and much more.

Thanks for your time reading this and thanks for using Pentaho software!

Matt

6 Comments »

November 3rd 2011

Data Modeling

Dear data integration fans,

I’m a big fan of “appropriate” data modeling prior to doing any data integration work.  For a number of folks out there that means the creation of an Enterprise Data Warehouse model in classical Bill Inmon style.  Others prefer to use modern modeling techniques like Data Vault, created by Dan Linstedt.  However, the largest group data warehouse architects use a technique called dimensional modeling championed by Ralph Kimball.

Using a modeling technique is very important since it brings structure to your data warehouse.  The techniques used, when applied correctly of-course, are helping you in a big way to avoid all sorts of pitfalls in the design of a data warehouse.

From my own experience and from what I see in my own Kettle community, dimensional modeling is by far the most popular technique used to create data warehouses.  For that reason (and the fact that I’m a huge fan of Kimball) I’ve always made sure to properly support the most complex part of technique: the slowly changing dimension.  For the better part that has made Kettle an excellent choice when it comes to easy translation of your dimensional model to ETL.

However, where these days you have open source tools like Quipu and RapidACE for data vault modelling I was sad to see that not too much exists for dimensional modeling in combination with Kettle.

So a few weeks ago I was doing some basic modeling for a new Pentaho logging data mart for PDI 4.3 EE.  This data mart will be responsible for the delivery of easy to digest reports, analyses views and dashboards on the subjects of monitoring and logging of Pentaho servers.  Initially I started doing this in a nice Eclipse plugin called UMLet which resulted in a data model like this:

While this result isn’t the worst diagram you can possibly imagine there are a number of problems with the approach:

  • The information about dimensions, attributes, relationships, … is not captured in a structured way.
  • Export of the metadata is not possible in any usable format except for PDF and images.
  • UMLet, like so many UML and modeling tools is a generic tool that also supports many other features that I’m not interested in when I’m doing dimensional modeling.  As a result, creating a model takes time and real effort.
  • The work needs to be used in your favorite ETL tool so it makes sense to be have it handy there, instead of having to use a third party tool.
So I thought: wouldn’t it be great if I had some sort of perspective in Spoon where I could do a bit dimensional modeling based on a logical Pentaho metadata model?

Wouldn’t it be great if I could create a new metadata domain to hold all the star models for a certain data mart?

Then wouldn’t it be great if you could edit your star models in there?

The graphics don’t have to be anything fancy, I thought.  It just needs to automatically position the fact table in the middle and the dimensions around it…

Obviously, I would like to be able to edit the name, description and type of the dimension …

and depending on the type of dimension I would like to insert a bunch of default attributes…

Using standard Kettle data grid I should be able to copy attributes and other metadata back and forth between dimension dialogs and a spreadsheet as well.

In the fact table definition it would be cool if we could not only specify the facts but also the relationships to the dimension…

Because that way we wouldn’t have to worry about how to draw the star model and we would know everything we would need to know.

If we would have a tool like that we would be able to generate the SQL to generate the physical tables against a certain target database…

Because if we would have all sorts of knowledge in metadata of the dimensions we could really nicely generate all the required data types, indexes and what not.

And then it would be cool to also generate a template transformation to update the dimension and fact tables in the models…

Well, I thought it would be nice to have that sort of functionality.

Perhaps we could also create physical Pentaho metadata domain (XMI) from the star domain as well as Mondrian schemas and a PDF with documentation.

OK, so this is coming to a PDI release near you in the short term.  I’ve only been working on it for a few weeks on and off but you can try an early version here.  Simply unzip it in the plugins folder of a PDI 4.3 build.  The plugin needs 4.3 since that version already includes a lot of libraries like Pentaho metadata and reporting and that way I don’t need to package all those libraries with it.  We can see later how we can deploy on 4.2 as well.

Please provide feedback here or in PDI-6890.

Until next time,

Matt

12 Comments »

August 12th 2011

Streaming XML content parsing with StAX

Today, one of our community members posted a deviously simply XML format on the forum that needed to be parsed.  The format looks like this:

<RESPONSE>
<EXPR>USD</EXPR>
  <EXCH>GBP</EXCH>
  <AMOUNT>1</AMOUNT>
  <NPRICES>1</NPRICES>
  <CONVERSION>
    <DATE>Fri, 01 Jun 2001 22:50:00 GMT</DATE>
    <ASK>1.4181</ASK>
    <BID>1.4177</BID>
  </CONVERSION>

  <EXPR>USD</EXPR>
  <EXCH>JPY</EXCH>
  <AMOUNT>1</AMOUNT>
  <NPRICES>1</NPRICES>
  <CONVERSION>
    <DATE>Fri, 01 Jun 2001 22:50:02 GMT</DATE>
    <ASK>0.008387</ASK>
    <BID>0.008382</BID>
  </CONVERSION>
  ...
</RESPONSE>

Typically we parse XML content with the “Get Data From XML” step which used XPath expressions to parse this content.  However, since the meaning of the XML content is determined by position instead of path, this becomes a problem.  To be specific, for each CONVERSION block you need to pick the last preceding EXPR and EXCH values.  You could solve it like this:

Unfortunately, this method requires a full parsing of your file 3 times and once extra for each additional preceding element.  The joining and all also slows things down considerably.

So this is another case where the new “XML Input Stream (StAX)” step comes to the rescue.  The solution using this step is the following:

Here’s how it works:

1) The output of the “positional element.xml” step flattens the content of the XML file so that you can see the output of each individual SAX event like “start of element”, “characters”, “end of element”.  Every time you get the path, parent path, element value and so forth.  As mentioned in the doc this step is very fast and can handle files with just about any size with a minimal footprint.  It will appear in PDI version 4.2.0GA.

2) With a bit of scripting we collect information from the various rows that we find interesting.

3) We filter out only the result lines (the end of the CONVERSION element).  What you get is the following desired output:

The usage of JavaScript in this example is not ideal but compared to the reading speed of the XML I’m sure it’s fine for most use-cases.

Both examples are up for download from the forum.

The “XML Input Stream (StAX)” step has also shown to work great with huge hierarchical XML structures, files of multiple GB in size.  The step was written by colleague Jens Bleuel and he documented a more complex example on his blog.

Have fun with it!

Matt

7 Comments »

Next »

Pentaho world image