The problem with major releases…

In the next couple of days, probably tomorrow, we’re going to release Pentaho Data Integration patch version 3.0.1.  It’s been only 3-4 weeks since the 3.0.0 release and already we’ve fixed over 80 smaller and larger issues and implemented a few new things too.  The feedback on the v3 release has been absolutely fantastic, and I’m very glad we have been able to take into account most if not all of the feedback you gave us on the bug trackers and out on the forum.

You would think that after 3 months of code freeze and 3 release candidates we could have tracked down and fixed more problems, but we’re facing the same big problem as larger projects like KDE: only a small fraction of your target audience (a few hundred people) downloads and tests a milestone or release candidate.  The bulk of the bug reports are filed against GA versions.  By the way, that release is already at more than 10,000 downloads today!

So at a certain point you make a judgment call to release your software, in our case once we had fixed all bugs at the top severity levels: Blocker, Critical, Severe and Medium.  However, I knew in advance that a code overhaul as profound as v3 would give people headaches here and there.  There was just no way of telling where those would be.  That is the reason why you’ll see a quick 3.0.1 update release appear so shortly after the previous drop.  It is also why we’ve been dropping patches and test builds all over the forum to help people out; it’s important that an early adopter doesn’t become a victim.  And that is why you saw a lot of me on the forum and on IRC (the ##pentaho channel) lately.

On a side note, I must say I can only imagine what it must be like for the KDE 4 development team.  They want to implement some radically new things, a better architecture included.  However, they know in advance that an all-encompassing piece of software like a complete desktop environment will never be perfect without people downloading it, installing it, testing it and filing bug reports.  So at a certain point in time, they will have to release a 4.0.  They have been getting quite a bit of bad press for telling people in advance that there will be problems with their 4.0 release.  However, if it’s hard for an isolated piece of software like PDI to prevent bugs, it must be downright impossible for the KDE project.  I think it’s quite brave and honest of them to tell people that there might still be bugs in the software.

But anyway, I think that the biggest hurdle in Kettle history has been taken with a serious architecture change, and that we can now continue to build upon that new architecture without making any major changes to it anymore.  In fact, I just arrived in Orlando to join the rest of the Pentaho team to brainstorm about what cool things we can pull out of our sleeve next.  We’ve been getting great feedback from the community, from our customers as well as our partners, and there are a lot of change requests to take into account.  However, I can assure you that we will be doing a lot of new stuff in the first quarter of 2008.

So keep that feedback coming and enjoy PDI!

Until next time,



  • lars

    Regarding future features: how easily can PDI be used with other extraction frameworks
    such as Tika, Aperture, etc.? I’m sure there are others.

    Also, it is unclear to me why I would want to use Pentaho’s process language to describe an integration scenario when there is already e.g. jPDL through jBPM. Presuming it is a Linux host, can’t I just map jPDL graph nodes as objects that issue command-line calls to a worker node with access to standard Unix tools (e.g. grep, sort, and other local binaries specific to my domain) to provide functionality equivalent to most of the PDI screenshots I’ve seen? Just registering arbitrary command-line tools seems simpler, like Unix pipes with some additional semantics. So why should I use PDI? I need to run jobs on pools of worker machines (with SGE, I’ve used cluster-fork to batch the jobs), but now need a visual abstraction and perhaps will replace SGE as the job manager. I just found Kettle and haven’t played with it enough to know whether it will be useful for our project, but it seems interesting.


  • Lars, the main goal is to eliminate scripting and ugly hacks as much as possible as they cost too much over time in terms of maintenance and manpower.
    The goal of a typical ETL tool is also not to solve workflow problems like jBPM and others do as you propose. An ETL tool is mainly deployed in BI situations in a data warehousing context. As a former Unix admin and shell scripting guru I can guarantee you that piping data around is not going to get you anywhere in that world.

    Actually, this is really strange. I’m just wondering what on earth made you want to attack Kettle and all the other ETL tools without even knowing what you are talking about, and with such a display of ignorance. All from a single screenshot! Very strange.