November 3rd 2011 04:07 pm
Data Modeling
Dear data integration fans,
I’m a big fan of “appropriate” data modeling prior to doing any data integration work. For a number of folks out there that means the creation of an Enterprise Data Warehouse model in classical Bill Inmon style. Others prefer to use modern modeling techniques like Data Vault, created by Dan Linstedt. However, the largest group data warehouse architects use a technique called dimensional modeling championed by Ralph Kimball.
Using a modeling technique is very important since it brings structure to your data warehouse. The techniques used, when applied correctly of-course, are helping you in a big way to avoid all sorts of pitfalls in the design of a data warehouse.
From my own experience and from what I see in my own Kettle community, dimensional modeling is by far the most popular technique used to create data warehouses. For that reason (and the fact that I’m a huge fan of Kimball) I’ve always made sure to properly support the most complex part of technique: the slowly changing dimension. For the better part that has made Kettle an excellent choice when it comes to easy translation of your dimensional model to ETL.
However, where these days you have open source tools like Quipu and RapidACE for data vault modelling I was sad to see that not too much exists for dimensional modeling in combination with Kettle.
So a few weeks ago I was doing some basic modeling for a new Pentaho logging data mart for PDI 4.3 EE. This data mart will be responsible for the delivery of easy to digest reports, analyses views and dashboards on the subjects of monitoring and logging of Pentaho servers. Initially I started doing this in a nice Eclipse plugin called UMLet which resulted in a data model like this:

While this result isn’t the worst diagram you can possibly imagine there are a number of problems with the approach:
- The information about dimensions, attributes, relationships, … is not captured in a structured way.
- Export of the metadata is not possible in any usable format except for PDF and images.
- UMLet, like so many UML and modeling tools is a generic tool that also supports many other features that I’m not interested in when I’m doing dimensional modeling. As a result, creating a model takes time and real effort.
- The work needs to be used in your favorite ETL tool so it makes sense to be have it handy there, instead of having to use a third party tool.

Wouldn’t it be great if I could create a new metadata domain to hold all the star models for a certain data mart?

Then wouldn’t it be great if you could edit your star models in there?

The graphics don’t have to be anything fancy, I thought. It just needs to automatically position the fact table in the middle and the dimensions around it…

Obviously, I would like to be able to edit the name, description and type of the dimension …

and depending on the type of dimension I would like to insert a bunch of default attributes…

Using standard Kettle data grid I should be able to copy attributes and other metadata back and forth between dimension dialogs and a spreadsheet as well.
In the fact table definition it would be cool if we could not only specify the facts but also the relationships to the dimension…

Because that way we wouldn’t have to worry about how to draw the star model and we would know everything we would need to know.
If we would have a tool like that we would be able to generate the SQL to generate the physical tables against a certain target database…

Because if we would have all sorts of knowledge in metadata of the dimensions we could really nicely generate all the required data types, indexes and what not.
And then it would be cool to also generate a template transformation to update the dimension and fact tables in the models…

Well, I thought it would be nice to have that sort of functionality.
Perhaps we could also create physical Pentaho metadata domain (XMI) from the star domain as well as Mondrian schemas and a PDF with documentation.
OK, so this is coming to a PDI release near you in the short term. I’ve only been working on it for a few weeks on and off but you can try an early version here. Simply unzip it in the plugins folder of a PDI 4.3 build. The plugin needs 4.3 since that version already includes a lot of libraries like Pentaho metadata and reporting and that way I don’t need to package all those libraries with it. We can see later how we can deploy on 4.2 as well.
Please provide feedback here or in PDI-6890.
Until next time,
Matt
20 Comments »



Scott Rosenbaum on 03 Nov 2011 at 18:10 #
This is awesome. A good simple dimensional modeling tool integrated with Kettle? It’s like Christmas for Halloween.
Carlos on 03 Nov 2011 at 23:00 #
I agree with Scott! Awesome!
Log Buffer #244, A Carnival of the Vanities for DBAs | The Pythian Blog on 04 Nov 2011 at 8:09 #
[…] Mat Casters is a big fan of “appropriate” data modeling prior to doing any data integration work. […]
Jeroen on 07 Nov 2011 at 10:52 #
Hi Matt,
Great addition! Would really help out a lot of people.
Have you checked out our latest version 2.0 of Quipu? We’ve added a datamart creation assistant in our generator as a first step. Maybe we should join forces?
Matt Casters on 07 Nov 2011 at 11:51 #
Hi Jeroen,
Ideas friom the Quipu project are obviously welcome.
For now I want to keep this nice and simple however.
Kind regards,
Matt
Jens Bleuel about Kettle aka Pentaho Data Integration (PDI) & Pentaho BI » Blog Archive » First German Pentaho Customer Meeting in Munich on 07 Dec 2011 at 1:08 #
[…] As part of the Pentaho Product Management team, I presented the product road map and gave some quick insights in the Kettle Star Modeler. […]
Fabio on 02 Mar 2012 at 1:42 #
OMG!OMG!OMG!OMG!!!
AAAAAHHHH!!!! :DDDDDDDDDD
YER DA DUDE, MAN!
cedric on 07 Mar 2012 at 15:34 #
Sterke feature. Eigenlijk ben je goed op weg om het beste open source product aller tijden te maken, puur op basis van hoeveel value je creëert met PDI voor bedrijven die data moeten verwerken (ik maak de vergelijking simpel met hoeveel IBM Data Manager, SSIS, Datastage of Informatica kosten). Thanks!
Matt Casters on 07 Mar 2012 at 15:40 #
Bedankt voor de complimenten Cedric!
Bodays on 08 Mar 2012 at 12:48 #
Hi Matt,
This look awesome!!!
As I’m a beginner in BI, I’m currently trying to improve my skills by using PDI. I already started to touch and play with the tool and I would like now to modelize my dimension and fact table (I already made my analysis on the relational DB of an OpenERP installation).
My point is that I’m using the version 4.2 of PDI and based on your remarks, this plugin needs the version 4.3.
Within your post, you said “We can see later how we can deploy on 4.2 as well.” - Is it now available?
Thanks in advance.
Bodays
Matt Casters on 08 Mar 2012 at 13:54 #
It’s going to be a plugin Bodays but for now the star modeler is still in development. Pentaho is re-aligning it’s complete metadata architecture so it will take a few months to flesh out.
Bodays on 08 Mar 2012 at 15:31 #
Dank u voor uw antwoord en schade voor mij
Have a good day Matt
Vishal on 04 Jun 2012 at 12:20 #
Hi Matt,
I extracted the Star Modeler from http://kettle4.s3.amazonaws.com/starmodeler.zip and extracted it to the plugins folder post which I restarted Spoon, but I still dont see the Star Modeler perspective. Am i missing anything.
Thanks
Vishal S
VK on 11 Jun 2012 at 8:48 #
Hi Matt,
I agree with Vishal — extracted to plugin folder, restarted, can not see the perspective… I suspect its a conflict with AgileBI plugin?
Matt Casters on 11 Jun 2012 at 10:35 #
As I mentioned, this plugin is far from finished. If you want to try it out, simply download Kettle 4.4 from CI http://ci.pentaho.com/view/Data%20Integration/job/Kettle/ and it will be included by default. Then you can play with it a bit.
Giovanni Botta on 19 Oct 2012 at 15:46 #
Cool plugin! Great work!
Mohamed Ennahdi El Idrissi on 30 Oct 2012 at 18:02 #
Pentaho’s Kettle Data Integration Tool is great!
It’s true that not all the functionalities are obvious (especially for someone who is discovering the field of business intelligence, generally), but including a star modeling tool withing DIT is genuine initiative!
Good luck!
Aashish on 10 Dec 2012 at 1:49 #
Hi Matt,
is this plugin available for the latest stable community build of 4.4? it does not seem to be included.
Matt Casters on 10 Dec 2012 at 12:16 #
4.4 was renamed 5.0 in the mean time since we wanted to do another major release in between for the BA server 4.8 release.
Avinash on 02 Apr 2013 at 0:23 #
I hope this is part of open source version…