Archive for March, 2008

March 20th 2008

Organizing job entries

Hi Kettle friends,

One of the tasks that had been on my plate for a long time was the creation of categories for job entries, like the ones we have for steps:

This was done as part of this feature request: PDI-125.  I guess the main problem was that we were getting too many job entries and I needed to improve our support for job entry plugins anyway.  As such, this feature got into the 3.1.0-M1 version as well.

We’re going to do a number of other GUI improvements and reviews over the next couple of months. Feel free to chip in at any time.

Until next time,

Matt

March 14th 2008

Describing outer joins in metadata

Dear Metadata fans,

It’s been a while since I blogged about Pentaho Metadata. That is undeserved, because a lot has been moving over the last couple of months, even though most of it is not really visible to the end user. The GUI part of the metadata suite was tackled last year and doesn’t need all that much work. What we have been doing is extending the underlying architecture by making it more flexible, more robust and easier to program against from an API viewpoint. Most of that work has been in the capable hands of Pentaho rock star Will Gorman. Last year, for example, he built in support for libformula (Open Office formula), the library by Pentaho reporting wizard Thomas Morgner.

Lately, Will and I started on the next big thing: adding support for outer joins in Pentaho Metadata models. We knew going in that it was Pandora’s box that we were opening. It’s so much easier to say: people, just write a sane star schema, but in the end it can make sense to just throw a few models over an ODS or source system, for example for prototyping and evaluation reasons.

The problem with outer joins…

The problem with outer joins is that you need to know exactly what you are doing if you want to get good, reliable results out of your metadata model. That is because the order in which you execute the various outer joins determines the final result set.

Take a look at the Wikipedia entry on joining and outer joining to know what we’re talking about. The situation with 2 tables can be explained quite easily: Table A has 5 records, you do a left outer join with Table B but only 3 records match. Well, you still have 5 records.
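To make the two-table case concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (a, b, a_id) are made up for illustration and have nothing to do with Pentaho Metadata itself.

```python
import sqlite3

# Hypothetical tables: A has 5 records, B has 3 records that each match one A record.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY);
    CREATE TABLE b (id INTEGER PRIMARY KEY, a_id INTEGER);
    INSERT INTO a (id) VALUES (1), (2), (3), (4), (5);
    INSERT INTO b (id, a_id) VALUES (10, 1), (20, 2), (30, 3);
""")

# The left outer join keeps every record of A, matched or not.
left_rows = conn.execute(
    "SELECT a.id, b.id FROM a LEFT OUTER JOIN b ON b.a_id = a.id ORDER BY a.id"
).fetchall()

# The inner join keeps only the records of A that found a match in B.
inner_rows = conn.execute(
    "SELECT a.id, b.id FROM a INNER JOIN b ON b.a_id = a.id ORDER BY a.id"
).fetchall()

print(len(left_rows), len(inner_rows))
```

The unmatched records of A come back with NULL (None in Python) in the columns of B.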

However, if you start to get 3, 4 or heaven forbid even more (outer) joins in the same query, the order in which you execute the joins has a strong impact on the actual result of the query. (See also: this scenario) Suppose you have a Table C with 2 records, and B is left outer joined with C. If you do ( A Left Outer Join B ) and then left outer join that result to C, you have 5 records, because every record of A is preserved. If you instead do B left outer join C (3 records) and then bring in A with a right outer join that preserves the B and C side, that is A Right Outer Join ( B Left Outer Join C ), you get 3 records.
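The three-table case can be sketched the same way. Again the tables are made up: A has 5 records, B has 3 records pointing at A, and C has 2 records pointing at B. The second query starts from B and uses only left joins, so B is the preserved side; that should give the same rows as right outer joining A onto ( B Left Outer Join C ), and it avoids RIGHT OUTER JOIN, which older SQLite versions do not support.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY);
    CREATE TABLE b (id INTEGER PRIMARY KEY, a_id INTEGER);
    CREATE TABLE c (id INTEGER PRIMARY KEY, b_id INTEGER);
    INSERT INTO a (id) VALUES (1), (2), (3), (4), (5);
    INSERT INTO b (id, a_id) VALUES (10, 1), (20, 2), (30, 3);
    INSERT INTO c (id, b_id) VALUES (100, 10), (200, 20);
""")

# ( A LEFT OUTER JOIN B ) LEFT OUTER JOIN C: every record of A is preserved.
a_first = conn.execute("""
    SELECT a.id, b.id, c.id
    FROM a
    LEFT OUTER JOIN b ON b.a_id = a.id
    LEFT OUTER JOIN c ON c.b_id = b.id
""").fetchall()

# ( B LEFT OUTER JOIN C ) LEFT OUTER JOIN A: only the records of B are preserved.
b_first = conn.execute("""
    SELECT a.id, b.id, c.id
    FROM b
    LEFT OUTER JOIN c ON c.b_id = b.id
    LEFT OUTER JOIN a ON a.id = b.a_id
""").fetchall()

print(len(a_first), len(b_first))
```

Same tables, same join conditions, but a different starting point and therefore a different number of records.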

In the distant past I’ve had to deal with those situations a few times in closed source BI applications and I swore I would never mess with outer joins again because of this join order problem. In an ad-hoc reporting situation, this is a real issue since the users don’t care and don’t want to know about this.

How did we solve it?

Well, if the join execution order determines the result, that order itself has to become part of the metadata, and so that’s what we added to the model.

We think that the order in outer join situations is as important as the relationships themselves and as such, the selected join order key that is entered is also displayed in the graphical model view.
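As a rough sketch of the idea, and explicitly not the actual Pentaho Metadata code, here is how a join order key could drive the nesting of joins. The Relationship class, the build_from_clause function, the short aliases T1, T2 and T4, and the rule that the lowest key is joined first (and so ends up innermost) are all assumptions made for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Relationship:
    """One join between two table aliases (a hypothetical structure)."""
    left: str
    right: str
    join_type: str   # e.g. "LEFT OUTER JOIN" or "FULL OUTER JOIN"
    condition: str   # join condition as SQL text
    order_key: str   # hypothetical join order key; lowest key is joined first


def build_from_clause(relationships):
    """Nest the joins so that the lowest order key ends up innermost."""
    ordered = sorted(relationships, key=lambda rel: rel.order_key)
    first = ordered[0]
    clause = f"{first.left} {first.join_type} {first.right} ON ( {first.condition} )"
    joined = {first.left, first.right}
    for rel in ordered[1:]:
        if rel.left not in joined and rel.right in joined:
            # The new table sits on the left of the relationship.
            clause = f"{rel.left} {rel.join_type} ( {clause} ) ON ( {rel.condition} )"
            joined.add(rel.left)
        elif rel.left in joined and rel.right not in joined:
            # The new table sits on the right of the relationship.
            clause = f"( {clause} ) {rel.join_type} {rel.right} ON ( {rel.condition} )"
            joined.add(rel.right)
        else:
            raise ValueError(f"cannot attach {rel.left} / {rel.right} to the joins so far")
    return clause


relationships = [
    Relationship("T4", "T2", "LEFT OUTER JOIN", "T2.FOREIGNKEY = T4.PRIMARYKEY", "B"),
    Relationship("T1", "T2", "FULL OUTER JOIN", "T1.PRIMARYKEY = T2.FOREIGNKEY", "A"),
]
from_clause = build_from_clause(relationships)
print(from_clause)
```

Swapping the two keys would nest the left outer join inside the full outer join instead, which is exactly the kind of difference the join order key is meant to pin down.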

You can enter this join order key in the relationship dialog:

As you can see, we also added the 0:0 relationship type that corresponds to the “Full Outer” join type.

What does the SQL look like?

If we were to select columns from all tables, apply a condition to the second table, add a condition to an aggregate and set a sort order, we get an SQL query like this:

SELECT
     BT_TABLE1_TABLE1.PRIMARYKEY AS COL0
    ,BT_TABLE2_TABLE2.PRIMARYKEY AS COL1
    ,SUM(BT_TABLE1_TABLE1.PRIMARYKEY) AS COL2
    ,BT_TABLE4_TABLE4.PRIMARYKEY AS COL3
FROM TABLE4 BT_TABLE4_TABLE4 LEFT OUTER JOIN
    (
        TABLE1 BT_TABLE1_TABLE1 FULL OUTER JOIN TABLE2 BT_TABLE2_TABLE2
        ON ( BT_TABLE1_TABLE1.PRIMARYKEY = BT_TABLE2_TABLE2.FOREIGNKEY AND ( BT_TABLE2_TABLE2.FOREIGNKEY = 2 ) )
    )
    ON ( BT_TABLE2_TABLE2.FOREIGNKEY = BT_TABLE4_TABLE4.PRIMARYKEY )
GROUP BY
     BT_TABLE1_TABLE1.PRIMARYKEY
    ,BT_TABLE2_TABLE2.PRIMARYKEY
    ,BT_TABLE4_TABLE4.PRIMARYKEY
HAVING
    (
        SUM(BT_TABLE1_TABLE1.PRIMARYKEY) = 2
    )
ORDER BY
     COL0

Note that we use a nested join syntax to make sure that the requested join order is followed. What we also try to do is place the conditions as close to the join as possible for performance reasons.
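One general SQL point worth keeping in mind here: with outer joins, a condition placed in the ON clause and the same condition placed in a WHERE clause are not interchangeable, so where a condition lands affects the result and not only the speed. A small sketch with made-up tables and a made-up val column:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY);
    CREATE TABLE b (id INTEGER PRIMARY KEY, a_id INTEGER, val INTEGER);
    INSERT INTO a (id) VALUES (1), (2), (3), (4), (5);
    INSERT INTO b (id, a_id, val) VALUES (10, 1, 2), (20, 2, 7), (30, 3, 2);
""")

# Condition inside the join: all records of A survive, and B records
# that fail the condition simply show up as NULL.
cond_in_on = conn.execute("""
    SELECT a.id, b.val
    FROM a
    LEFT OUTER JOIN b ON ( b.a_id = a.id AND ( b.val = 2 ) )
    ORDER BY a.id
""").fetchall()

# Condition after the join: records where b.val is NULL or not 2 are
# filtered out again, so the outer join effectively becomes an inner join.
cond_in_where = conn.execute("""
    SELECT a.id, b.val
    FROM a
    LEFT OUTER JOIN b ON ( b.a_id = a.id )
    WHERE b.val = 2
    ORDER BY a.id
""").fetchall()

print(len(cond_in_on), len(cond_in_where))
```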

What else did we change?

As you can see from the sample query above, we abandoned the aliasing of business tables with their name. Because most, if not all, names contain spaces, that approach required quoting all over the place, and it ran into identifier length limits on Oracle, for example, when names were too long. As such, we replaced the name with the business table ID. The end user never sees these queries anyway.

Another thing we modified is the way that formulas in business columns are processed. You can now include business columns from any business table in the business model in a single expression. This is obviously important when you do reporting on a 3NF model, where almost all the objects you want to calculate with are going to be in different business tables. See this wiki page on Pentaho Metadata Formulas for more information.

What’s next?

The next things on the agenda are undoubtedly things like row level security and a special section for predefined conditions and calculated members on the business model level. There are also rumors that the long awaited Formula editor is nearing completion so we’ll be able to use that in the Pentaho Metadata Editor in the future as well. As always, you can play an important role in the evolution of Pentaho Metadata by letting us know what you love or hate about Pentaho Metadata. In the end, it is feedback from our customers and our community that drives all our software development!

Until next time,

Matt

March 7th 2008

Nihon Kettle : ja_JP

Dear friends,

I bring you more good news from the Kettle i18n efforts: Hiroyuki Kawaguch has contributed the first translations in Japanese to trunk:

After English, French, Italian, Spanish (Argentina), German, Spanish (Spain), Chinese, Dutch, Portuguese (Brazil & Portugal), this is the 11th translation effort. It’s impossible to thank our team of translators enough for all the effort they have put into translating the various parts of Kettle.

Until next time,

Matt

March 1st 2008

IRC ##pentaho

Of course there are the crazies, but usually we have a good time over on ##pentaho IRC.

Yesterday we had our very first community event when Doug “Spanky” Moran hosted a dial-in to talk about what was up in the community.

Today, I learned about the blog of channel regular andresF.

Internet Relay Chat is old technology that has existed for quite a while now, but to me it doesn’t lose its appeal. Over at FOSDEM I learned that there are companies like MySQL that have private channels to communicate “non-intrusively” with colleagues. “Maybe some developer can help me with this stupid problem. I’ll just drop a question on the channel.” It’s a good idea, and we should consider it for Pentaho too.

Until next time,

Matt
