Pentaho Kettle Solutions Overview

Dear Kettle friends,

As mentioned in my previous blog post, copies of our new book Pentaho Kettle Solutions are finally shipping.  Roland, Jos, and I worked really hard on it and, as you can probably imagine, we were really happy when we finally got the physical version of the book in our hands.

Book front

So let’s take a look at what’s in this book, what the concept behind it was, and get an overview of the content…

The concept

Given that Maria’s book, called Pentaho Data Integration 3.2, was due when we started, we knew that a beginner’s guide would be available by the time this book was ready.  As such, we opted to look at what the data warehouse professional might need when he or she starts to work with Kettle.  Fortunately, there is already a good and well-known checklist out there to see if you have covered everything ETL-related: the 34 subsystems of ETL, a concept by Ralph Kimball that was first featured in his book The Data Warehouse Lifecycle Toolkit.  And so we asked Mr. Kimball’s permission to use his list, which he kindly granted.  He was also gracious enough to review the related chapter of our book.

By using this approach, readers can flip to a specific chapter in our book and directly get the information they need for the problem they are facing at that time. For example, Change Data Capture (subsystem 2, a.k.a. CDC) is handled in Chapter 6: Data Extraction.

In other words: we did not start with the capabilities of Kettle. We did not take every step or feature of Kettle as a starting point.  In fact, there are plenty of steps we did not cover in this book.  However, wherever a step or feature needed to be explained while covering all the subsystems, we did so as clearly as we could.  Rest assured though: since this book handles just about every topic related to data integration, all of the basic and 99% of the advanced features of Kettle are indeed covered in this book ;-)

The content

After a gentle introduction into how ETL tools came about and, more importantly, how and why Kettle came into existence, the book consists of five main parts:

1. Getting started

This part starts with a primer that explains the need for data integration and takes you by the hand into the wonderful world of ETL.
Then all the various building blocks of Kettle are explained.  This is especially interesting for folks with prior data integration experience, perhaps with other tools, as they can read all about the design principles and concepts behind Kettle.
After that, the installation and configuration of Kettle are covered. Since the installation itself is a simple unzip, this chapter focuses on a detailed description of all the available tools and configuration files.
Finally, you’ll get hands-on experience in the last chapter of the first part titled “An example ETL Solution – Sakila”.  This chapter explains in great detail how a small but complex data warehouse can be created using Kettle.
2. ETL
In this part you’ll first encounter a detailed overview of the 34 subsystems of ETL, after which the art of data extraction is covered in detail.  That includes extracting information from all sorts of file types and databases, working with ERP and CRM systems, data profiling, and CDC.
This is followed by chapter 7, “Cleansing and Conforming”, in which the various data cleansing and validation steps are covered, as well as error handling, auditing, deduplication and, last but not least, scripting and regular expressions.
Finally, this second part of the book covers everything related to star schemas, including the handling of dimension tables (chapter 8), the loading of fact tables (chapter 9), and working with OLAP data (chapter 10).
3. Management and deployment
The third main part of the book deals with everything related to the management and deployment of your data integration solution.  First you’ll read all about the ETL development lifecycle (chapter 11), scheduling and monitoring (chapter 12), versioning and migration (chapter 13), and lineage and auditing (chapter 14).  As you can guess from the chapter titles, a lot of best practices and do’s and don’ts are covered in this part.
4. Performance and scalability
The fourth part of our book really dives into the often highly technical topics surrounding performance tuning (chapter 15), parallelization, clustering and partitioning (chapter 16), dynamic clustering in the cloud (chapter 17), and real-time data integration (chapter 18).
I personally hope that the book will lead to more performance-related JIRA cases, since chapter 15 explains how you can detect bottlenecks :-)
5. Advanced topics
The last part, conveniently titled “Advanced topics”, deals with things we thought would be interesting to a data warehouse engineer or ETL developer who is faced with concepts like Data Vault management (chapter 19), handling complex data formats (chapter 20), or web services (chapter 21).  Indispensable in case you want to embed Kettle into your own software is chapter 22: Kettle integration.  It contains many Java code samples that explain how you can execute jobs and transformations or even assemble them dynamically.
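To give you a quick taste of what those samples look like, here is a minimal sketch of embedding Kettle in your own Java code. It assumes the Kettle 4.x API, the PDI libraries on the classpath, and a transformation file called sample.ktr (a placeholder name); take it as an illustration of the general pattern rather than a copy of the book’s examples.

import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

public class EmbeddedKettleExample {
    public static void main(String[] args) throws Exception {
        // Initialize the Kettle environment (plugin registry, logging, ...)
        KettleEnvironment.init();

        // Load the transformation definition from a .ktr file;
        // "sample.ktr" is just a placeholder file name.
        TransMeta transMeta = new TransMeta("sample.ktr");

        // Create an executable transformation from the metadata and run it
        Trans trans = new Trans(transMeta);
        trans.execute(null);        // no extra command-line arguments
        trans.waitUntilFinished();  // block until all steps have finished

        if (trans.getErrors() > 0) {
            System.err.println("The transformation finished with errors.");
        }
    }
}

Assembling a transformation dynamically follows the same idea, except that you build the TransMeta object in code instead of loading it from a file.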
Last but certainly not least, and probably one of the most interesting chapters for a Java developer, is chapter 23: Extending Kettle.  This chapter explains in great detail how you can develop step, job-entry, partitioning, or database type plugins for Kettle, so that you can get started with your own components in no time.
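Just to hint at what such a plugin involves, below is a small, hedged sketch of the runtime part of a step plugin against the Kettle 4.x API: a class extending BaseStep whose processRow() reads a row and passes it on.  The class name is made up for this sketch, and a complete plugin also needs the accompanying meta, data, and dialog classes that the chapter walks you through.

import org.pentaho.di.core.exception.KettleException;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;
import org.pentaho.di.trans.step.BaseStep;
import org.pentaho.di.trans.step.StepDataInterface;
import org.pentaho.di.trans.step.StepInterface;
import org.pentaho.di.trans.step.StepMeta;
import org.pentaho.di.trans.step.StepMetaInterface;

// Hypothetical pass-through step: it forwards every incoming row unchanged.
public class PassThroughStep extends BaseStep implements StepInterface {

    public PassThroughStep(StepMeta stepMeta, StepDataInterface stepDataInterface,
                           int copyNr, TransMeta transMeta, Trans trans) {
        super(stepMeta, stepDataInterface, copyNr, transMeta, trans);
    }

    public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException {
        Object[] row = getRow();        // read a row from the incoming stream(s)
        if (row == null) {              // no more input: signal that we are done
            setOutputDone();
            return false;
        }
        // A real step would inspect or modify the row here.
        putRow(getInputRowMeta(), row); // hand the row to the next step(s)
        return true;                    // ask to be called again for the next row
    }
}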

I hope that this overview of our new brainchild gives you an idea of what you might be buying into. Since all books are essentially a compromise between page count, time, and money, I’m sure there will be the occasional typo or lack of precision, but rest assured that we did our utmost on this one.  After all, we each spent over six months on it…

Feel free to ask about specific topics you might be interested in to see if they are covered ;-)

Until next time,

Matt

12 comments

  • This is an excellent book. Thank you for getting this out. It was on the Amazon pre-order list for a while, but I finally have it. I’ve read through most of it and it is going to be a reference book for me when I get stuck and need to figure out a solution fast. This beats going through forum postings. Everything is nicely explained in one place.

    Between your book and Maria’s book, I do not have to search through hundreds of forum posts to find an example.

    Excellent job.

  • Ron

    This appears to be an excellent book for experienced Java programmers who want to prototype an ETL application quickly. We are total beginners with Pentaho, but we have deep experience with XML technologies, open source tools, open standards, and Java EE.

    My specific question is whether the text describes how to integrate developer-supplied rules engines. I have read elsewhere that it does. I’m hoping that your book at least touches on this.

    Thanks for your labor of love in writing this book. (Done that myself for another technology.)

  • Sunil George

    Dear Matt,

    I am looking forward to buying this book, but it is not available in India at the moment. I searched for it on http://wileyindia.com/ but could not find it. I can see the book on the Wiley international edition site. Is there any chance of getting this book published in India?

    Regards,
    Sunil George.

  • I’m sorry Sunil, but I have no idea about shipping to India. I would suggest trying to order the book at Amazon or any of the other online book stores.
    A good technical book store (if there are any in your neighbourhood) should be able to get you Pentaho Kettle Solutions too.
    Good luck,
    Matt

  • Sunil George

    Thanks Matt for your quick reply. I have contacted a Wiley Bangalore (India) sales person and he gave me a contact person near my place. They told me that they have ordered the book from their Singapore store, so I guess I need to wait for one month for the shipment. I had gone through the PDI 3.2 Beginner’s Guide earlier, I am very much interested in knowing and learning more about Kettle, and I am sure that the new book will have enough practical examples related to DW for us.

    Does the book cover how to design a dimensional model from an ER model, and how Kettle steps can be used to create a transformation that loads the FACT and DIMENSION tables?

    I recently got a question from someone regarding Kettle: "If we have a job that consists of a lot of transformation calls and for some reason one of the transformations gets aborted, we need to re-run the job and start from the aborted transformation this time, skipping the ones that were executed successfully. How would you do it in PDI?". My reply was that we would have to log the details after each successful execution of a transformation, and that this log has to be read first before deciding whether a transformation within the job has to be executed or not. Is this correct?

    What I meant is that we create a folder that holds a log file (say, named after the transformation) after each successful execution of a transformation. This happens for every transformation. After the last transformation has executed successfully, we clear the log files in this folder. If any transformation aborts in between and we have to restart the job, then before each transformation is called we put a condition in the job to check whether a log file with that transformation's name exists in the log folder. If it exists, we skip the execution of that transformation and move on to the next one, and so on. Is my logic correct, or is there a more straightforward option for this?

    Regards,
    Sunil George.

  • Hi Sunil,

    Questions regarding PDI itself are best placed on the Kettle forum. As a short reply: I’m working on providing better support for restartability in version 4.3. The logic you mention is precisely what we had in mind, since we now have log tables for job entries.

    Regards,

    Matt

  • I mean if we translate the mantle:org/pentaho/mantle/public/messages/messages and save it, we also want to put it directly in the corresponding directory in the server so we can use it right away (and of course share it with the community.)

  • Sunil George

    Dear Matt,

    I managed to get the book shipped from the Singapore Wiley shop. The book is nicely written and it covers many of the parts required for Pentaho ETL beginners. I have a few queries for you. I would really appreciate your comments and suggestions on them.

    1. Where can we find more documentation regarding the Pentaho Enterprise Repository? Is this the best option for a real project where multiple people are working on it? I heard that it also manages referential integrity, so if we change the name of a transformation that is referenced by a job, will the repository handle this referential integrity automatically? Do people have access to this repository's file path, or is it managed only by the Data Integration server? For lineage analysis of a transformation, is it possible for a File input step to connect to this Enterprise Repository and read the .ktr files, as we do using a normal file system repository path, or is exporting the transformations from the repository to XML the only option?

    2. We have a requirement to create a txt file output extracting data from a number of tables. Each transformation has just 3 steps: a Table input step, a JavaScript step to remove any newline and carriage return characters present in the fields, and a Text file output step. Say we have 16 tables; we could do this as 16 transformations. The source SQL query is not just a SELECT *, as it needs to do some CASE and type cast changes due to differences in data types between source and target. What we have finalised is to have one generic transformation instead of 16. What we did is create a transformation T1 which queries the source db for the supplied table name and takes the table's column names and data types. The SELECT fields are then built as a string in T1, with the required CASE and type casts applied, and passed to T2 as a variable. T2 gets this variable and uses it in the SQL query of the Table input step. We have a JavaScript step where I have managed to do the cleaning process with the help of the getInputRowMeta() method: I created a for loop, and for each string-type field it does the cleaning and modifies the existing value like row[i] = …; I wanted to know whether this is a correct approach, as we are not specifying the input field names in the JavaScript step and are not using the var and Get output Variables options inside the JavaScript step, since we need to generalize it.

    3. Finally, for the above project I can have a master job with a transformation T0 that reads an SOR table containing the 16 source table names. So the output of T0 is 16 rows, which we store with a Copy rows to result step. Then we connect this to another sub-job which contains T1 and T2. The sub-job will be set to run for each row output by T0. I hope this will do the job for us. The query I have here is: is this job, looping over 16 rows, a sequential thing, or is it possible for me to parallelise it? The sub-job will receive a table name, create the SQL, and output the txt file. Since it is a single transformation that is going to run for all 16 table extractions, how can I parallelise it?

    Many thanks in advance. I look forward to your valuable comments.

    Regards,
    Sunil George

  • Hi Sunil,

    It’s great to hear that you managed to buy our book, thanks for your support.
    As far as free Kettle support for your questions is concerned, it’s best that you try the Kettle forums:

    http://forums.pentaho.com/forumdisplay.php?135

    Best of luck with your projects!

    Matt

  • Sunil George

    Dear Matt,

    Thank you so much for the guidance. I have posted my queries on the forum as you suggested. We do have an Enterprise Edition, so we will post the same to the Pentaho support team as well.

    I would appreciate your attention to these queries as well, if possible; even a quick glance would be helpful for us. I am sharing the forum post links here:
    http://forums.pentaho.com/showthread.php?82354-PDI-Clustering
    http://forums.pentaho.com/showthread.php?82355-Generic-Transformation-for-loading-all-one-to-one-table-loading
    http://forums.pentaho.com/showthread.php?82356-Parallel-execution-of-looped-Transformation-in-a-PDI-Job
    http://forums.pentaho.com/showthread.php?82358-Pentaho-Enterprise-Repository

    By the way, I have found printing mistakes in quite a few places in the Kettle Solutions book. If you want, I can share those page numbers with you so that they can be corrected for the next edition or printing. Please let me know.

    Many thanks in advance.

    Regards,
    Sunil George.

  • Ndash

    Matt,

    Thanks for the excellent book. Though I’m a newbie to this tool, I managed to learn a lot from your book as it was very lucidly written.

    I’m currently working on CDC. In chapter 6, you have described it. While I’m able to carry out everything that is mentioned in the book, I got stuck on the following.

    1. After Merge Rows (diff), how do we get the column names and the values that were changed/altered?

    Apologies for posting here, as I can’t access the forum.

    Thanks a zillion for your help.

    Best Regards,
    NDash

  • NDash, try again on the forum. Thank you for your understanding.