Making the case for Kettle
Dear data integration fans,
Once in a while, there are discussions on various blogs (usually with me smack in the middle of them) debating the differences between code generation and model-based execution, how this impacts the way we approach databases, the open nature of it all, etc.
With this blog entry I want to push the notion that Pentaho Data Integration (Kettle) didn’t just evolve by chance into the state it is in today as a streaming, metadata-driven, model-based engine. I made some careful design choices early on…
Open as possible
The goal of Kettle from the beginning was to be as open as possible. My definition of “as open as possible” included:
- open source with an LGPL license (see this JBoss link [PDF] for a nice explanation)
- open, readable XML format
- open, readable relational repository format
- open API
- easy to set up (<2 minutes)
- open to all kinds of databases
- easy-to-use GUI
- easy to pass data around
- easy to convert data from/to any possible format
That list didn’t include any code generation. Although it was an option I seriously considered, I didn’t like the other code generators I had dealt with up to that point (mainly OWB and some home-brew stuff). The problem with those is that when things go wrong (and things always do go wrong) you have to become an expert in the generated language/code pretty soon.
That was most certainly the case in Oracle Warehouse Builder (OWB) at the time. Re-arranging operators in a certain order would yield inexplicably different results, and you would find yourself digging through megabytes of generated PL/SQL code. That was OK for simple exercises like loading a bit of data, but it became horrible for more complex transformations, including slowly changing dimensions.
As such, I considered this way of working not to be as open as possible, since the generated code is by definition not transparent and highly subject to change at re-generation. Specifically, re-generation invalidates the claim that you can change the code by hand “if you don’t like the generated code”.
All those considerations led to the decision to go for a stream-based, metadata-driven engine. That decision comes with many advantages and a few possible drawbacks. There are things to consider, implications for databases, etc. It’s only honest to say that there probably isn’t a perfect solution anyway.
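To make the contrast concrete, here is a minimal sketch of what “metadata-driven” means in practice. This is nothing like Kettle’s actual API — the step types and field names are made up for illustration — but it shows the core idea: the transformation is plain data that an engine interprets over a stream of rows, so there is no generated code to read, debug, or regenerate.

```python
# Toy metadata-driven engine (illustrative only, not Kettle's API):
# the pipeline is *data*, interpreted at runtime, never generated code.
STEP_TYPES = {
    "uppercase": lambda row, field: {**row, field: row[field].upper()},
    "add_length": lambda row, field: {**row, field + "_len": len(row[field])},
}

def run_transformation(model, rows):
    """Interpret a list of step definitions (the 'metadata') over a row stream."""
    for row in rows:
        for step in model:
            row = STEP_TYPES[step["type"]](row, step["field"])
        yield row

# The transformation is described, not coded; changing it means editing
# data (or XML, in Kettle's case), not regenerating and recompiling code.
model = [
    {"type": "uppercase", "field": "name"},
    {"type": "add_length", "field": "name"},
]

for out in run_transformation(model, [{"name": "kettle"}]):
    print(out)  # → {'name': 'KETTLE', 'name_len': 6}
```

When things go wrong in a model like this, you debug the (small, stable) engine and inspect the (readable) model — not a pile of freshly regenerated code.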
Our engine is not perfect either, so let’s take a look at a few specific problem areas regarding databases, and see how we handle those issues when they occur.
RDBMS / ELT
Even though splitting tasks up into different pieces allows you to run a lot of things in parallel, there is overhead because you need to pass data from one step to another. We also suffer because we don’t have random access to all the data like an RDBMS does. Because of that, it is not unheard of for us to recommend that you run joins or sorts on the database server where the data is coming from. A database can typically sort faster because it does have random access to all the data; it can use its indexes and doesn’t have to go through all the data.
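As a rough illustration of that recommendation (the query and table names here are invented, not from any real schema), “pushing the sort down” amounts to rewriting the extraction query so the database does the ordering, with an engine-side sort as the fallback:

```python
# Illustrative only: push the sort into the SQL sent to the database,
# so the RDBMS (with indexes and random access) orders the rows
# instead of the streaming engine.

def push_sort_to_db(base_query, sort_fields):
    """Rewrite the extraction query so the database sorts the rows."""
    return f"{base_query} ORDER BY {', '.join(sort_fields)}"

def engine_side_sort(rows, sort_fields):
    """Fallback: sort inside the integration engine, which means
    buffering rows, since a stream has no random access."""
    return sorted(rows, key=lambda r: tuple(r[f] for f in sort_fields))

query = push_sort_to_db("SELECT id, name FROM customers", ["name", "id"])
print(query)  # → SELECT id, name FROM customers ORDER BY name, id
```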
By allowing the Kettle user/developer to execute whatever SQL or procedure he or she wishes, we are again as open as possible in this scenario.
Now, there are people who take this to the extreme and point to the miracles of ELT. Well, let me give another view on that. Suppose you wrote a nice ELT program and everything is being processed on the database server. While that might be a little bit faster on occasion, chances are very high that at least one of these conditions is going to occur:
- the database server is not tuned for high CPU load (rather for I/O)
- the source system you are reading from is already too occupied
- parallelism is low or non-existent in a typical stored procedure
- stored procedure languages are typically interpreted byte-code languages and CPU-wise pretty slow/costly.
- debugging becomes very hard or impossible
- you hit some limit on the database (there are many, even on Oracle), even long after you went live with your solution simply because tables and indexes grow.
- deployment must be carefully managed or something goes wrong: stale or uncompiled procedures because tables changed, referenced functions changed, etc.
Personally I have faced all of these issues and a few more too. I had enough reasons to skip ELT completely at the time. Heck, I still don’t like it and personally I think for all the right reasons. This is going to sound harsh, but the reasoning I had and still have to some extent is that if you can’t write an SQL query, I don’t think you should be doing any heavy database lifting. Let the database do what it does best and do the rest outside of the database: pick the best tool for the job.
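To illustrate the parallelism point from the list above — a sketch, with a made-up per-row calculation standing in for real work — an engine running outside the database can trivially fan rows out over several workers, whereas a typical stored procedure grinds through them in one serial, interpreted loop:

```python
from concurrent.futures import ThreadPoolExecutor

def transform(row):
    # Stand-in for a per-row operation that is CPU- or latency-bound.
    return {**row, "total": row["qty"] * row["price"]}

def run_serial(rows):
    """What a typical stored-procedure loop amounts to: one row at a time."""
    return [transform(r) for r in rows]

def run_parallel(rows, copies=4):
    """Fan the rows out over several parallel 'copies' of the step."""
    with ThreadPoolExecutor(max_workers=copies) as pool:
        return list(pool.map(transform, rows))

rows = [{"qty": q, "price": 2.0} for q in range(8)]
assert run_serial(rows) == run_parallel(rows)  # same result, different concurrency
```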
One other possible problem regarding databases is latency. Since we don’t run inside the database engine, we can feel it at times. Sometimes it is not a problem and sometimes it is. Calls to the database have a certain turn-around time. We solve that for the better part by allowing you to launch multiple copies of a step in parallel; the “Database Lookup” step comes to mind here. This effectively opens multiple connections to the database and reduces overall latency. Of course, a bigger reduction in latency is obtained with the use of memory lookups and large caches in PDI. In a lot of cases it’s even faster to look up the data in an in-memory cache than to do a join on the database. Again, you should pick the right tool for the job.
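A minimal sketch of the caching idea — the fetch function and field names below are stand-ins for a real database read, not any Kettle API: one bulk query builds an in-memory map up front, after which each row’s lookup is a dictionary hit instead of a database round trip.

```python
def build_cache(fetch_all):
    """One bulk read up front instead of one query per row."""
    return {rec["id"]: rec["name"] for rec in fetch_all()}

def lookup_step(rows, cache, key="customer_id"):
    """Enrich each row from the in-memory cache; no per-row round trip."""
    for row in rows:
        yield {**row, "customer_name": cache.get(row[key])}

# Stand-in for the reference table on the database side.
reference = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
cache = build_cache(lambda: reference)
out = list(lookup_step([{"customer_id": 2}], cache))
print(out)  # → [{'customer_id': 2, 'customer_name': 'Bob'}]
```

The trade-off is memory for latency: one round trip and one hash map replace thousands of individual lookup queries.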
As we keep pushing our platform forward with many (many!) new things in the pipeline, we sometimes forget why certain choices were made and how we arrived where we are today. Well, I for one am happy to say that 6 years after the decision to go with the current Kettle architecture, I’m still very happy with it, and so should you 😉
P.S. This is a re-hash of some of my opinions in a post over on Marc Russel’s blog.