A nice chat

Earlier today I had a nice IM chat with someone.  He or she is referred to below as Question and I’m Answer.  There were interesting questions and perhaps others will find the answers interesting as well.  It seemed a shame to let the information in the chat log go to waste, so I’m posting it here on my blog.
Question: I have a question for you about the possibility of creating custom transformations.
Answer: sure
Question: my company already has quite a bit of business logic that is coded in C and/or C++ and this logic then calls some Corba services.  Would it be possible for us to integrate that logic into Kettle?
Answer: Not directly.  However, it’s not that hard to write wrappers in JNI (Java Native Interface) and create a plugin for it.
Question: That was my idea.  I was just wondering if there were any other ways
Answer: What you need is some .dll or .so shared library to talk to and then you can go from there.
Answer: If these are standalone applications and you can talk to them over sockets, http, what not, that is possible as well.
Question: you’ve given me some good ideas.
Answer: In any case, you always need to write some Java code.  I would go with JNI as it’s the easiest to do.
Question: i agree
Answer: Then you can even call those wrapped methods from JavaScript.
Question: i totally agree
Question: when i was asked if it was possible, my first answer was JNI, and I was told to look for some alternatives (if there were any)
Answer: Well like i said, it highly depends.  If the only way for the application to communicate with the outside world is through C/C++ API, then you need JNI.
Question: I’ve used JNI to call C methods before.  It is pretty simple.
Answer: yeah
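To make the JNI idea a bit more concrete, here is a minimal sketch of what the Java side of such a wrapper might look like.  Everything in it is an assumption made for illustration: the library name businesslogic, the class BusinessLogicWrapper and the calculate method are hypothetical and not part of Kettle.  The C/C++ side, which would export a function following the JNI naming rules, is not shown.

```java
// Hypothetical JNI wrapper around an existing C/C++ shared library.
// All names (library, class, method) are illustrative assumptions.
final class BusinessLogicWrapper {

    // True only if the shared library could be found and loaded.
    private static final boolean AVAILABLE = tryLoad("businesslogic");

    private static boolean tryLoad(String libraryName) {
        try {
            // Resolves to businesslogic.dll on Windows or libbusinesslogic.so on Linux,
            // looked up on java.library.path.
            System.loadLibrary(libraryName);
            return true;
        } catch (UnsatisfiedLinkError | SecurityException e) {
            return false;
        }
    }

    static boolean isAvailable() {
        return AVAILABLE;
    }

    // Implemented in C/C++.  For a class in the default package the exported
    // C function would be named Java_BusinessLogicWrapper_nativeCalculate.
    private static native double nativeCalculate(String customerId, double amount);

    // Plain Java entry point that a plugin step could call.
    static double calculate(String customerId, double amount) {
        if (!AVAILABLE) {
            throw new IllegalStateException("native library 'businesslogic' is not loaded");
        }
        return nativeCalculate(customerId, amount);
    }

    public static void main(String[] args) {
        System.out.println("native library available: " + isAvailable());
    }
}
```

Guarding the load like this keeps the wrapper class usable (and testable) on machines where the .dll or .so is not installed, instead of failing when the class is first loaded.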
Question: I have some generic transformation questions for you, if you don’t mind my asking.
Answer: ok
Question: is there any limit on the number of steps and hops in a transformation?
Answer: no.  I’ve had reports of (crazy!) people that put 400 steps in a single transformation and they claimed it worked fine ;-)
Question: lol, that is impressive.  honestly, i can’t think of 400 things i would want to do to Data in a single transformation
Answer: yeah, I had some screenshots a while back, can’t find them anymore.
Answer: That’s right, it’s bad practice to say the least :-)
Question: i see too much room for error (human error, not kettle) in 400 steps
Answer: Yes, it’s better to separate the logic a bit.  We even have mappings (subtransformation) to solve that these days.  Then again…  Oh well.
Answer: I guess they thought it was cool or something.
Question: the data that comes out could look nothing like the data that went in
Answer: The thing was one giant case switch: 30 filters in a row all splitting off to different functionality.
Question: I see that kettle handles flat file and Database connections.  What is the largest flat file that you have used as input for a transformation?
Answer: Hard to say, probably a couple of hundred MB or so.
Question: what if I wanted to run a GB file?
Answer: Hey, we only read one line at a time, it will be OK ;-)
Question: 1 line at a time, do you mean 1 record at a time?
Answer: Yes.
Question: does that 1 record pass all the way through the transformation before the next record is read?
Answer: no, all steps work as separate threads at the same time in parallel.
Question: ok
Answer: The text file input step just puts rows of data (records if you like) in a buffer.  The next step(s) read from that.
Question: ok.  So the next step reads rows from a buffer.  Does it read them 1 at a time from the buffer or does it grab several rows at a time?
Answer: It depends on the step, but usually it’s one row at a time. (that’s how I prefer it in any case to ensure data isolation)
Question: you say it depends, does that mean that it is possible to configure steps to read more than 1 at a time, or that some steps are defined to read more than 1 at a time?
Answer: It is possible for a step to own more than one row at a time, yes.  For example the sort step reads all available rows before it starts to write to the next step.  However, the API provides a method to read a single row.
Question: let’s say i have a reading step, some other step, and a writing step.  The reading step reads a row and places it in the buffer.  How does the next step know that there is a record in the buffer?
Answer: It blocks when the input buffer is empty or the output buffer is full.  A row is returned to the step when there is one available.
Answer: This is all synchronized of course.
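The idea of steps running as separate threads with a buffer in between can be sketched in a few lines of Java.  This is not Kettle's code: it swaps in java.util.concurrent.ArrayBlockingQueue as the buffer, and the class name StepPipelineSketch, the three toy steps and the end-of-data marker are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustration only: three "steps" (read, uppercase, write), each in its own
// thread, connected by bounded buffers.
final class StepPipelineSketch {

    // Unique marker object meaning "no more rows"; compared by identity.
    private static final String END = new String("<end>");

    static List<String> run(List<String> inputLines, int bufferSize) {
        BlockingQueue<String> readerToUpper = new ArrayBlockingQueue<>(bufferSize);
        BlockingQueue<String> upperToWriter = new ArrayBlockingQueue<>(bufferSize);
        List<String> written = new ArrayList<>();

        Thread reader = new Thread(() -> {
            try {
                for (String line : inputLines) {
                    readerToUpper.put(line);          // blocks while the buffer is full
                }
                readerToUpper.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread upper = new Thread(() -> {
            try {
                // take() blocks while the buffer is empty
                for (String row = readerToUpper.take(); row != END; row = readerToUpper.take()) {
                    upperToWriter.put(row.toUpperCase(Locale.ROOT));
                }
                upperToWriter.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread writer = new Thread(() -> {
            try {
                for (String row = upperToWriter.take(); row != END; row = upperToWriter.take()) {
                    written.add(row);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        reader.start();
        upper.start();
        writer.start();
        try {
            reader.join();
            upper.join();
            writer.join();                            // join makes 'written' safely visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return written;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a", "b", "c"), 2));
    }
}
```

With a buffer of only two rows, the reader is regularly forced to wait for the next step, yet only a handful of rows are ever in memory at once.  That is why the size of the input file matters so little.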
Question: that’s good
Question: so the reading step puts rows into the buffer.  When that buffer is full, the next step begins to read records from the buffer.  is that correct?
Answer: No, not really, even if the buffer is half full (or half empty if you are a pessimist) the next step(s) read from it.
Question: ah
Answer: All steps work completely independent from each other, there is no interaction or communication between the steps at all.  It is very important.
Question: does that mean that the step is constantly checking the buffer for rows?
Answer: Actually yes it is.  Every couple of nano-seconds it sees if there is anything in there.  The sleep time between checks does become larger if the buffer stays empty for longer periods of time.
Question: ok
Question: that is the information i was looking for.
Answer: It’s very lightweight though.  It’s just checking the size of a list and whether that list has been flagged as “done” by the previous step.
Question:  that would have been my next question.  When the reader reads the last row, how does it tell the next steps that there are no more records.  You say it flags the buffer as “done” ?
Answer: yeah, just a boolean: class RowSet (setDone())
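The polling described above (check the list size, check the “done” flag, sleep a little longer each time the buffer stays empty) could look roughly like the sketch below.  It is a simplified stand-in, not the real RowSet class: SimpleRowSet, PollingStepSketch, the method names other than setDone() and the sleep times are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.concurrent.locks.LockSupport;

// Simplified stand-in for a row buffer between two steps (illustration only).
final class SimpleRowSet {
    private final LinkedList<Object[]> rows = new LinkedList<>();
    private final int maxSize;
    private boolean done;

    SimpleRowSet(int maxSize) { this.maxSize = maxSize; }

    // Returns false instead of blocking when the buffer is full.
    synchronized boolean tryPut(Object[] row) {
        if (rows.size() >= maxSize) return false;
        rows.addLast(row);
        return true;
    }

    // Returns null instead of blocking when the buffer is empty.
    synchronized Object[] tryGet() { return rows.pollFirst(); }

    // Called by the previous step after it has written its last row.
    synchronized void setDone() { done = true; }

    synchronized boolean isFinished() { return done && rows.isEmpty(); }
}

final class PollingStepSketch {
    private static final long MIN_SLEEP_NANOS = 1_000;      // assumed starting value
    private static final long MAX_SLEEP_NANOS = 1_000_000;  // assumed upper bound

    static void putRow(SimpleRowSet out, Object[] row) {
        long sleep = MIN_SLEEP_NANOS;
        while (!out.tryPut(row)) {                 // output buffer full: wait and retry
            LockSupport.parkNanos(sleep);
            sleep = Math.min(sleep * 2, MAX_SLEEP_NANOS);
        }
    }

    // Returns the next row, or null once the buffer is empty and flagged as done.
    static Object[] getRow(SimpleRowSet in) {
        long sleep = MIN_SLEEP_NANOS;
        while (true) {
            Object[] row = in.tryGet();
            if (row != null) return row;
            if (in.isFinished()) return null;
            LockSupport.parkNanos(sleep);          // input buffer empty: back off
            sleep = Math.min(sleep * 2, MAX_SLEEP_NANOS);
        }
    }

    // One producer thread writes rowCount rows; the caller's thread reads them back.
    static List<Integer> demo(int rowCount, int bufferSize) {
        SimpleRowSet rowSet = new SimpleRowSet(bufferSize);
        Thread producer = new Thread(() -> {
            for (int i = 0; i < rowCount; i++) {
                putRow(rowSet, new Object[] { i });
            }
            rowSet.setDone();
        });
        producer.start();

        List<Integer> received = new ArrayList<>();
        for (Object[] row = getRow(rowSet); row != null; row = getRow(rowSet)) {
            received.add((Integer) row[0]);
        }
        return received;
    }

    public static void main(String[] args) {
        System.out.println(demo(10, 4).size() + " rows received");
    }
}
```

Note that the reader tests for “done” only after finding the buffer empty, and isFinished() re-checks emptiness under the same lock, so a last row written just before setDone() is never lost.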
Question: i love this architecture.  The more i hear about kettle, the more i like it.
Answer:  Listen, can I post this conversation on the web somewhere, I’m sure other people would love this information as well.  They were good questions.
Question: yeah, that is fine with me
Answer: I’ll clean it up, remove the names of the involved parties ;-)
Question: i would appreciate that
Answer: np
Answer: The thing about the architecture is that it’s kept extremely simple.  It works by letting go of control.
Answer: That however, is also the biggest drawback.
Answer: It’s the reason why we will go to a different architecture in version 3.0.

I hope you found this informative or interesting.  If so, let me know and I’ll post more on ETL internals in the future.

Until next time,
Matt

4 comments

  • Frank

    Matt,

    I would love to hear more about the internals/concepts of Kettle/PDI in the future!
    I guess it would also be a good idea to include this information in the documentation. This knowledge will help Kettle users to develop well designed transformations.

    Keep on the good work and all the best
    Frank

  • Jens Bleuel

    A good tool to build JNI wrappers is jacoZoom. I used it in a Kettle plugin and it works very well.
    “jacoZoom contains a typelibrary parser, which produces java wrapper classes directly corresponding to the COM-objects and interfaces.”
    see http://infozoom.de/en_jacoZoom.shtml
    Happy wrapping,
    Jens

  • For one of my transformations, I have to use an input file of 140GB and it works just fine for the 21 steps. However, it takes up to 8 hours to finish the task. I am trying to improve the performance. No luck yet :(

  • Mradul, why don’t you post your problem on the Kettle forum with more details and we’ll give some comments.