Archive for the 'Rant' Category

July 20th 2009

The kindness of strangers

Dear Kettle fans,

There isn’t a week that goes by where I don’t find myself amazed by the number of contributions and help that the Pentaho Data Integration project receives in all kinds of forms.  There are people contributing anything from small patches to complete steps, folks helping out others on the forum, writing documentation, writing books, translating PDI, etc.  Without any question, this has been a truly amazing experience, not just for me but for the whole Kettle project.

It’s because of that overwhelmingly positive experience that I’ve always tried to be accessible and in contact with my community in all sorts of possible ways.  And because of that positive vibe I have refrained from commenting on the negative flip side to that story for the longest time.

The problem is really that lately things have been changing.  It’s probably caused in general by an increasing attention to open source and specifically by an increase in popularity of Kettle.  In any case, certain types of people do the following:

  • Send me personal email
  • IM me on skype/Yahoo!/MSN/AIM/…
  • Send me all sorts of messages and questions through the forums
  • Ask questions on this blog

Usually it’s a combination of any of the above.  Any time now I expect folks to be sending me direct twitter messages.  The questions are always the same:

I have an urgent Pentaho problem.  I am incapable of using the forum for some stupid reason and so you have to help me, preferably now or within the next 15 minutes!!!!

This way, the meaning of “The kindness of strangers” becomes more and more like the one from the Nick Cave song.

I’ve just finished reading Linus’ book “Just for fun” (Thanks again Domingo!) and his approach to the problem of staying in reach for people to contribute code and at the same time allowing yourself to have a life and a job is simple: if it ain’t fun, don’t do it.  Well, the barrage of this sort of question stopped being fun for me a long time ago.

As such, I’m going to try this approach: any question that could or should be asked on the forum is from now on silently ignored and deleted from my mailbox.  Any person that is not part of my “community” and that needlessly contacts me over IM gets blocked indefinitely.  And yes, that goes for twitter as well.  Off-topic questions on this blog go to the spam folder as well.  I will simply refuse to spend time on non-interesting topics.

I thought about creating a standard response e-mail, but any sort of reply is simply an encouragement to certain types of people and will only make matters worse. (been there, done that)

I’m sure everyone understands that this is the only way to free up time to work on the real problems at hand.  Thank you for your understanding in any case.

Until next time,

Matt


January 27th 2009

Gartner DI MQ

Dear Data Integration fans,

A few weeks ago, Yves de Montcheuil from Talend took a shot across the bow of Gartner for not including Talend in their Magic Quadrant (MQ) for data integration.  After that post, Andreas Bitterer from Gartner (rightfully) felt personally under assault and felt the need to set the record straight.

I think the discussion itself is very interesting, but it misses a very important point:

The Magic Quadrant contains companies, not trends, communities, people or software!

Think about it for a second.  In the early days of JBoss there were complaints from Marc Fleury about the fact that only a small percentage of the “JBoss the software” users paid anything to “JBoss the company”.  Numbers that floated around back then were 0.01% or 0.1%, can’t remember exactly.

Those numbers make sense, I’ve heard about similar figures from other commercial open source companies.  Anything in the range 0.01% to 1% is possible.

Let’s be “optimistic” here and claim that a company like Pentaho converts 1% of all users into customers. (trust me, that figure would be really great given the millions of users out there :-))  That would mean that we’re disturbing the market of our competitors for our turnover x 100.  So for every dollar of turnover Pentaho makes, we’re disturbing the closed source vendors for 100 dollars.
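To make that arithmetic concrete, here is a minimal sketch in Python.  It rests on the same simplifying assumption as the paragraph above: every non-paying user would otherwise have been worth as much to a closed source vendor as a paying customer is to the open source company.  The function name `disturbed_turnover` is made up for this illustration.

```python
def disturbed_turnover(own_turnover: float, conversion_rate: float) -> float:
    """Estimate the competitor turnover disturbed by an open source vendor.

    Assumption (from the text, not a measured fact): each user, paying or
    not, represents the same amount of business lost to closed source vendors.
    """
    if not 0 < conversion_rate <= 1:
        raise ValueError("conversion_rate must be in (0, 1]")
    return own_turnover / conversion_rate


if __name__ == "__main__":
    # The range of conversion rates mentioned above: 0.01% up to 1%.
    for rate in (0.0001, 0.001, 0.01):
        disturbed = disturbed_turnover(1.0, rate)
        print(f"{rate:.2%} conversion: 1 dollar of turnover "
              f"disturbs about {disturbed:,.0f} dollars")
```

The lower the conversion rate, the larger the multiplier, which is why even the “optimistic” 1% already gives a factor of 100.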

Pentaho and yes, indeed, Talend see that they are being a serious disturbance to the market dominance of the traditional DI vendors.  And that is why Yves feels a bit mistreated by Gartner.  However, since companies like Pentaho and Talend use a disruptive business model it is only normal that the Gartner MQ itself is also disrupted by our models.  You simply can’t be part of the system if you want to disrupt it I guess. (*)

All that being said, it’s only a matter of time before something has got to give: open source or the Gartner DI MQ.  Yves, Andreas, let it be noted I’m betting on the former to come out of this as a winner.

Until next time,

Matt

(*) This also partly explains why Kettle and TOS are not really competitors: we’re using the same business model and are not disrupting each other.  We offer 2 completely different choices to our users.


December 4th 2008

Pentaho Data Integration vs Talend (part 1)

Hello data integration fans,

In the course of the last year or so there have been a number of benchmarks on blogs here and there that claimed a certain “result” pertaining to performance of both Talend and Pentaho Data Integration (PDI a.k.a. Kettle).  Usually I tend to ignore these results a bit, but a recent benchmark got so far off track that I had to finally react.

Benchmarking itself is a very time-consuming and difficult process and in general I advise people to do their own.  That being said, let’s attack a first item that appears in most of these benchmarks: reading and copying a file.

Usually the reasoning goes like this: we want to see how fast a transformation can read a file and then how fast it can also write it back to another file.  I guess the idea behind it is to get a general sense of how long it takes to process a file.

Here is a PDF that describes the results I got when benchmarking PDI 3.1.0GA and TOS 3.0.2GA.  The specs of the test box etc. are also in there.

Reading a file

OK, so how fast can you read a file, how scalable is that process and what are the options?  In all situations we’re reading a 2.4GB test file with random customer data.  The download location is in the PDF on page 4 and elsewhere on this blog.

Remember that the architectures of PDI and Talend are vastly different so there are various options we can set, various configurations to try…

1) Simply reading with the “CSV File Input” step, lazy conversion enabled, 500k NIO buffer: 150.8 seconds on average for PDI.  Talend performs this in 112.2 seconds.

2) This test configuration is identical to 1) except that PDI now runs 2 copies of the input step.  Results: 94.2 seconds for PDI.  This test is not possible in Talend since the generated Java software is single threaded.

Reading a delimited file, time in seconds, lower is better

There is a certain scalability advantage in being able to read and process files on multiple CPUs and even multiple systems across a SAN.  Talend is seriously limited here since it can’t do that.  A 19% speed advantage for PDI is inconsequential for simple reads but brutal for more complex situations, very large files and/or lots of CPUs/systems involved.  For example, we have customers that read large web log files in parallel over a high speed SAN across a cluster of 3 or 4 machines.  Trust me, a SAN is typically faster than what any single box can process.
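For readers who wonder how several copies of a reader can share one delimited file, here is a rough sketch in Python.  It is not PDI’s code, just one common way to do it: give each copy a byte range and let it handle only the rows that start inside its range.  The names `chunk_ranges`, `count_rows` and `parallel_row_count` are invented for this example, and it only counts rows instead of parsing them.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def chunk_ranges(path, copies):
    """Split the file into `copies` contiguous byte ranges."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    size = os.path.getsize(path)
    step = size // copies
    bounds = [i * step for i in range(copies)] + [size]
    return list(zip(bounds[:-1], bounds[1:]))


def count_rows(path, start, end):
    """Count the rows whose first byte lies in [start, end)."""
    rows = 0
    with open(path, "rb") as f:
        if start > 0:
            # Step back one byte and read to the next newline: this skips
            # the row that began in the previous range (if any) and leaves
            # us on the first row that starts at or after `start`.
            f.seek(start - 1)
            f.readline()
        while f.tell() < end:
            if not f.readline():
                break
            rows += 1
    return rows


def parallel_row_count(path, copies=2):
    """Run one reader per byte range and add up the row counts."""
    ranges = chunk_ranges(path, copies)
    with ThreadPoolExecutor(max_workers=copies) as pool:
        counts = pool.map(lambda r: count_rows(path, r[0], r[1]), ranges)
        return sum(counts)
```

Every row is counted exactly once because a row belongs to the range that contains its first byte.  In CPython, threads mostly help with the I/O part of this; CPU-heavy parsing would need processes instead, so take this as an illustration of the partitioning idea only.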

Writing back the data

The test itself is kinda silly but since it is being carried around in the blogosphere, let’s set a reference: a plain copy command.  I simply copied the file and timed the duration.  That particular copy set a reference time of 122.2 seconds: a copy from my internal disk to an external USB 2.0 disk. (for the exact configurations see the PDF)
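If you want to set the same kind of reference on your own box, something like the following Python sketch will do.  The helper name `time_copy` is just for illustration, and the paths are whatever you pass on the command line.

```python
import shutil
import sys
import time


def time_copy(src: str, dst: str) -> float:
    """Copy src to dst and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    shutil.copyfile(src, dst)
    return time.perf_counter() - start


if __name__ == "__main__" and len(sys.argv) == 3:
    elapsed = time_copy(sys.argv[1], sys.argv[2])
    print(f"Copy took {elapsed:.1f} seconds")
```

Keep in mind that operating system caches can flatter a second run, so time a few copies and be critical of the first number you get.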

3) If reading in parallel is the fastest option for PDI, we retain that option.  Then we write the data back with a single target file.  PDI handles this in 196.2 seconds.  Talend can’t read in parallel so we don’t have any results there.

4) A lot of times, these newly generated text files are just temporary files for downstream processes.  As such it might (or might not) be possible to create multiple files as target.  This would increase the parallelism in both this challenge and the downstream tasks.  PDI handles this task in 149.3 seconds.  Again I didn’t find any parallelization options in TOS.

5) Since neither 3) nor 4) is possible in Talend I tried the single delimited reader / writer approach.  That one ran for 329.4 seconds.

Reading/writing a delimited file, time in seconds, lower is better

CPU utilisation

I also monitored the CPU utilisation of the various Talend jobs and Kettle transformations and came to the conclusion that Talend will never utilize more than 1 CPU while Kettle uses whatever it needs and can get its hands on.  For the single threaded scenario, the CPU utilization is on par with the delivered performance of both tools.  There doesn’t seem to be any large difference in efficiency.

Conclusion

Talend wins in the first test with their single threaded reading algorithm.  I think their overhead is lower because they don’t run in multiple threads. (Don’t worry, we’re working on it :-))  In all the other, more complex situations, where you can indeed run in multiple threads, there is a severe performance advantage to using Kettle.  In the file reading/writing department for example, PDI, running in 3 threads with lazy conversion, beats Talend by being more than twice as fast in the best case scenario and 65% faster in the worst case.
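The ratios behind this conclusion can be rechecked from the timings listed in tests 1) to 5).  Here is a small Python sketch that does so; the dictionary keys and the `speedup` helper are names chosen for this example, while the numbers are the measurements quoted above.

```python
# Measured run times in seconds, as reported in tests 1) to 5).
TIMINGS = {
    "pdi_read_1_copy": 150.8,
    "talend_read": 112.2,
    "pdi_read_2_copies": 94.2,
    "pdi_read_write_1_target": 196.2,
    "pdi_read_write_n_targets": 149.3,
    "talend_read_write": 329.4,
}


def speedup(slower: float, faster: float) -> float:
    """Return how many times faster the second run is than the first."""
    return slower / faster


if __name__ == "__main__":
    comparisons = [
        ("read: Talend vs PDI with 2 copies",
         "talend_read", "pdi_read_2_copies"),
        ("read/write: Talend vs PDI, single target",
         "talend_read_write", "pdi_read_write_1_target"),
        ("read/write: Talend vs PDI, multiple targets",
         "talend_read_write", "pdi_read_write_n_targets"),
    ]
    for label, slow, fast in comparisons:
        ratio = speedup(TIMINGS[slow], TIMINGS[fast])
        print(f"{label}: {ratio:.2f}x")
```

Swap in your own measurements to see how the picture changes on your hardware.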

Please remember that my laptop is not by any definition “high end” equipment and that dual and quad core CPUs are commonplace these days.  It’s important to be able to use them properly.

The source code please!

Now obviously, I absolutely hate it when people post claims and benchmarks without backing them up.  Here you can find the PDI transformations and TOS jobs I used.  With a little bit of modification I’m sure you all can run your own tests.  Just remember to be critical, even concerning these results!  Trust me when I say I’m not a TOS expert :-)  Who knows, perhaps I used a wrong setting in TOS here or there.  All I can say is that I tried various settings and that this seemed the fastest for TOS.

Remember also that if even a simple “file copy” can be approached with various scenarios, this certainly goes for more complex situations as well.  Even the other tools out there deserve that much credit.  Just because Talend can’t run in multiple threads, that doesn’t mean that Informatica, IBM, SAP and all are not capable of doing so.

If I find the time I’ll post a part 2 and 3 later on.  Feel free to propose your own scenarios to benchmark as well.  Whatever results come of it, it will lead to the betterment of both open source tools and communities.

Until next time,
Matt


December 2nd 2008

Flying high in economic storms

Next week I’ll be in Orlando for another week of brainstorming, planning, scheming, plotting for world domination and yes, even coding.

Q : “What are you going to do next week?”
A : “The same thing I do every time when I’m in Orlando - Try to take over the world!”

So I went to Kayak and entered my flight preferences: leave and return on Sunday, giving me a full week over there.  I was almost shocked to see that the same flight I took 2 months ago now costs less than a third of the price:

  • July 13th 2008 : BRU/MCO - MCO/BRU (over FRA) : 2,400 USD (summer time folks!)
  • October 12th 2008 : BRU/MCO - MCO/BRU (over IAD) : 2,000 USD
  • December 7th 2008 : BRU/MCO - MCO/BRU (over PHL) : 600 USD

Typically I don’t select the cheapest flight as that would put me on 18 hour layovers in Bangkok or something like that.  In the past I once spent 8 hours at Chicago airport and trust me, it’s not worth the 100 USD you can save.  You’ll spend it on Internet access, food, “beverages”, magazines, etc.

That being said, the December 7th flight is the cheapest flight with “only” 1 layover.

In the past I’ve noticed that the airlines added more and more options for me to fly to Orlando or at least across the Atlantic ocean.  Now that the economic downturn is upon us, perhaps there’s finally a bit of over-capacity.  After all, the last 5 flights I took from Brussels to the US were fully booked.  That’s right folks: listening to hollering kids with Mickey Mouse ears for 9 hours straight.  Even noise canceling headsets have a hard time with that kind of noise.

Electronic System for Travel Authorization

Another thing of interest for the geeks among you is that you are now encouraged to apply for authorization to enter the US well in advance, replacing the handwritten green “Visa Waiver” documents.  Nobody makes a fuss about it, but registration becomes mandatory for us Europeans from January 12th 2009 on, or so you can read.  I’m sure there are going to be freedom fighters here and there that are going to be up in arms over this sort of program, but personally I’m glad that we can finally fill in those green “waiver” documents electronically at home.  From the looks of it, there’s nothing on there that you don’t already fill in manually now. (it felt kinda familiar filling them in)

I’ll let you know how they perceive my eagerness to fill in these “hidden” electronic documents at the US border next week :-)

Until then,
Matt


October 28th 2008

Canonical: take my money

Dear Canonical,

You claim that there is little money in the desktop software business and more in services.  Well here is something I would pay money for:

Take the top selling business laptops from Dell, Acer, HP, Lenovo and offer customized distributions for them.

I would pay for that in an instant.  All too often people confuse open source with free of charge.  I’m perfectly capable of making that distinction.  In fact, I use my machines for my work and don’t want to spend days configuring all the devices on them.  As such, I would pay something like 50 USD for a customized (K)Ubuntu or perhaps 150-200 USD if it came with some sort of (e-mail) support contract for a year.

I don’t use Linux / Ubuntu because it costs less, I use it because I prefer it over Windows to do my job.  I would pay that kind of money because I would save time and money in the long run.

Until the major hardware vendors offer decent (worldwide) support for Linux on their machines (out of the box that is), I think this is an idea with potential and I hope at least someone picks it up.  Go ahead, let me spend money on it!

Until next time,
Matt

