Archive for the 'metadata' Category

November 3rd 2011

Data Modeling

Dear data integration fans,

I’m a big fan of “appropriate” data modeling prior to doing any data integration work.  For a number of folks out there that means the creation of an Enterprise Data Warehouse model in classical Bill Inmon style.  Others prefer to use modern modeling techniques like Data Vault, created by Dan Linstedt.  However, the largest group data warehouse architects use a technique called dimensional modeling championed by Ralph Kimball.

Using a modeling technique is very important since it brings structure to your data warehouse.  The techniques used, when applied correctly of-course, are helping you in a big way to avoid all sorts of pitfalls in the design of a data warehouse.

From my own experience and from what I see in my own Kettle community, dimensional modeling is by far the most popular technique used to create data warehouses.  For that reason (and the fact that I’m a huge fan of Kimball) I’ve always made sure to properly support the most complex part of technique: the slowly changing dimension.  For the better part that has made Kettle an excellent choice when it comes to easy translation of your dimensional model to ETL.

However, where these days you have open source tools like Quipu and RapidACE for data vault modelling I was sad to see that not too much exists for dimensional modeling in combination with Kettle.

So a few weeks ago I was doing some basic modeling for a new Pentaho logging data mart for PDI 4.3 EE.  This data mart will be responsible for the delivery of easy to digest reports, analyses views and dashboards on the subjects of monitoring and logging of Pentaho servers.  Initially I started doing this in a nice Eclipse plugin called UMLet which resulted in a data model like this:

While this result isn’t the worst diagram you can possibly imagine there are a number of problems with the approach:

  • The information about dimensions, attributes, relationships, … is not captured in a structured way.
  • Export of the metadata is not possible in any usable format except for PDF and images.
  • UMLet, like so many UML and modeling tools is a generic tool that also supports many other features that I’m not interested in when I’m doing dimensional modeling.  As a result, creating a model takes time and real effort.
  • The work needs to be used in your favorite ETL tool so it makes sense to be have it handy there, instead of having to use a third party tool.
So I thought: wouldn’t it be great if I had some sort of perspective in Spoon where I could do a bit dimensional modeling based on a logical Pentaho metadata model?

Wouldn’t it be great if I could create a new metadata domain to hold all the star models for a certain data mart?

Then wouldn’t it be great if you could edit your star models in there?

The graphics don’t have to be anything fancy, I thought.  It just needs to automatically position the fact table in the middle and the dimensions around it…

Obviously, I would like to be able to edit the name, description and type of the dimension …

and depending on the type of dimension I would like to insert a bunch of default attributes…

Using standard Kettle data grid I should be able to copy attributes and other metadata back and forth between dimension dialogs and a spreadsheet as well.

In the fact table definition it would be cool if we could not only specify the facts but also the relationships to the dimension…

Because that way we wouldn’t have to worry about how to draw the star model and we would know everything we would need to know.

If we would have a tool like that we would be able to generate the SQL to generate the physical tables against a certain target database…

Because if we would have all sorts of knowledge in metadata of the dimensions we could really nicely generate all the required data types, indexes and what not.

And then it would be cool to also generate a template transformation to update the dimension and fact tables in the models…

Well, I thought it would be nice to have that sort of functionality.

Perhaps we could also create physical Pentaho metadata domain (XMI) from the star domain as well as Mondrian schemas and a PDF with documentation.

OK, so this is coming to a PDI release near you in the short term.  I’ve only been working on it for a few weeks on and off but you can try an early version here.  Simply unzip it in the plugins folder of a PDI 4.3 build.  The plugin needs 4.3 since that version already includes a lot of libraries like Pentaho metadata and reporting and that way I don’t need to package all those libraries with it.  We can see later how we can deploy on 4.2 as well.

Please provide feedback here or in PDI-6890.

Until next time,

Matt

20 Comments »

July 18th 2008

Pentaho changes

I’m back at my favorite spot at the Orlando airport:

This week has gone bye so fast it’s kinda scary.  I got dragged into one meeting after another design session after another knowledge transfer opportunity for 5 days in a row.  After our long working days, the discussions and talks just continued over dinner and beers.

It was great to meet everyone and as always we had a good time around the office and at the Ale House.  I even managed to stay sober this time around.  Well at least most of the time.

As always, the thing that struck me the most was how fast Pentaho changes.  It’s almost like visiting a different company every time I drop in.  Since I don’t see the day-to-day changes around the office, the difference between the first time I visited (15 people) and now (70+) is striking.  The office space occupied more than doubled for example.

Well, let me tell you, it’s an honour to be able to witness this great growth process in person.  I’m looking forward to things to come!

Until next time,
Matt

2 Comments »

July 13th 2008

Another hacking trip

Dear Pentaho friends,

So, here I am again at Brussels Airport, waiting for the plane to Frankfurt and then to Orlando.

I’ll probably be the only person on the plane that doesn’t go to Orlando for the love of Disney, Universal and Cape Canaveral.  And NO, I will not be wearing Mickey Mouse ears on the way back like most of the 10,000 hollering kids all around me.

On the contrary, we’ll be working on serious stuff at Pentaho, things like row level security in metadata, Management services for Kettle and hopefully work a bit on a new engine for transformations.

Besides that it’s going to be a nice opportunity to meet old friends and see new ones I never even met.  Pentaho keeps growing pretty fast and every time I walk into the building I see more and more unfamiliar faces.  It’s a good thing that from past digital contacts, I know their all good people, the lot of them.

For those in Europe that want to give me a call this week: please do this in the afternoon!  The last job recruiter (I never even heard of the man nor the company) that called my cell phone at 3:30AM Orlando time, had to get special ear surgery. (for which I refuse to apologize)  Let me spell it out : the time in Orlando is a six hours earlier.

Oh well, it’s a good thing my annual vacation starts right after this trip.  With the planned construction work and move out of our house with the whole family, that’s a Good Thing(tm).

Until next time,
Matt

4 Comments »

March 14th 2008

Describing outer joins in metadata

Dear Metadata fans,

It’s been a while that I blogged about Pentaho Metadata. This is undeservedly so because the last couple of months, a lot of things have been moving. Most of that is not really visible to the end-user. The GUI part of the metadata suite was attacked last year and doesn’t really need all that much work. What we have been doing is extending the underlaying architecture by making it more flexible, more robust and easier to program from an API viewpoint. Most of that work has been in the capable hands of Pentaho rock star Will Gorman. The work he did last year for example was building in support for libformula (Open Office formula) by Pentaho reporting wizard Thomas Morgner.

Lately, Will and I started on the next big thing: adding support for outer joins in Pentaho Metadata models. We knew going in that it was Pandora’s box that we were opening. It’s so much easier to say: people, just write a sane star schema, but in the end it can make sense to just throw a few models over an ODS or source system, for example for prototyping and evaluation reasons.

The problem with outer joins…

The problem with outer joins is that you need to know exactly what you are doing if you want to get good reliable results out of your metadata model. That is because the order at which you execute the various outer joins determines the eventual outcome of the data.

Take a look at the Wikipedia entry on joining and outer joining to know what we’re talking about. The situation with 2 tables can be explained quite easily: Table A has 5 records, you do a left outer join with Table B but only 3 records match. Well, you still have 5 records.

However, if you start to get 3, 4 or heaven forbid even more (outer) joins in the same query, the order in which you are going to execute the joins is going to have a strong impact on the actual result of the query. (See also: this scenario) Suppose you have a Table C with 2 records. B is left outer joined with C. If you do ( A Left Outer Join B ) and then Left outer Join that result to C, you have 5 records. If you do B left outer join with C (3 records) and then right outer join with A, you get 3 records.

In the distant past I’ve had to deal with those situations a few times in closed source BI applications and I swore I would never mess with outer joins again because of this join order problem. In an ad-hoc reporting situation, this is a real issue since the users don’t care and don’t want to know about this.

How did we solve it?

Well, if the join execution order determines the result, that order in itself has to become part of the metadata, and so that’s what we did add to the model.

We think that the order in outer join situations is as important as the relationships themselves and as such, the selected join order key that is entered is also displayed in the graphical model view.

You can enter this join order key in the relationship dialog:

As you can see, we also added the 0:0 relationship type that corresponds to the “Full Outer” join type.

What does the SQL look like?

If we were to select columns from all databases, apply a condition to the second table, add a condition to an aggregate and set a sort order, we get an SQL query like this:

SELECT

BT_TABLE1_TABLE1.PRIMARYKEY AS COL0
,BT_TABLE2_TABLE2.PRIMARYKEY AS COL1
,SUM(BT_TABLE1_TABLE1.PRIMARYKEY) AS COL2
,BT_TABLE4_TABLE4.PRIMARYKEY AS COL3

FROM TABLE4 BT_TABLE4_TABLE4 LEFT OUTER JOIN

(

TABLE1 BT_TABLE1_TABLE1 FULL OUTER JOIN TABLE2 BT_TABLE2_TABLE2
ON ( BT_TABLE1_TABLE1.PRIMARYKEY = BT_TABLE2_TABLE2.FOREIGNKEY AND ( BT_TABLE2_TABLE2.FOREIGNKEY = 2 ) )

)
ON ( BT_TABLE2_TABLE2.FOREIGNKEY = BT_TABLE4_TABLE4.PRIMARYKEY )

GROUP BY

BT_TABLE1_TABLE1.PRIMARYKEY
,BT_TABLE2_TABLE2.PRIMARYKEY
,BT_TABLE4_TABLE4.PRIMARYKEY

HAVING

(
SUM(BT_TABLE1_TABLE1.PRIMARYKEY) = 2
)

ORDER BY

COL0

Note that we use a nested join syntax to make sure that the requested join order is followed. What we also try to do is place the conditions as close to the join as possible for performance reasons.

What else did we change?

As you can see from the sample query above, we abandoned the aliasing of business tables with their name. Because most if not all names contain spaces, it uses quoting all over the place and it runs into some limitations if your names are too long on Oracle for example. As such, we replaced it by the business table ID. The end user never sees these queries anyway.

Another thing we modified is the way that formulas in business columns are processed. You can now include business columns from any columns in the business model in a single expression. This is obviously something that is also important when you do reporting on a 3NF model. Almost all objects you want to calculate with are going to be in a different business table. See this wiki page on Pentaho Metadata Formulas for more information.

What’s next?

The next things on the agenda are undoubtedly things like row level security and a special section for predefined conditions and calculated members on the business model level. There are also rumors that the long awaited Formula editor is nearing completion so we’ll be able to use that in the Pentaho Metadata Editor in the future as well. As always, you can play an important role in the evolution of Pentaho Metadata by letting us know what you love or hate about Pentaho Metadata. In the end, it is feedback from our customers and our community that drives all our software development!

Until next time,

Matt

1 Comment »

October 10th 2007

The lighter side of reporting

I’m sure you’ve all heard about the new release of the Pentaho Open Source BI suite!   Version 1.6 went GA yesterday and this is certainly exciting news.  I’ve been so busy with data integration lately that the new release almost came as a surprise to me.

The first thing I wanted to check out is how the Web based Ad-hoc Query and Reporting (WAQR) tool would look like.  Well, here it is:

WAQR in action

For the first ever there is a semantic (metadata) layer available as open source!  What this means for BI end-users is that reporting just became a lot easier. It was dead simple for me in fact.  I was able to create a report in a few drags, a few drops and a few clicks.

Make sure to check this out  yourself if you are interested.  I’m sure that this is the beginning of a new era and that the power of open source is going to expand these new features into endless new possibilities for the end-users out there.

Until next time!

Matt

2 Comments »

Next »

Pentaho world image