Kettle data in a browser

Dear Kettle fans,

As you can tell from the Kettle JDBC driver project and also from the Talend job execution job entry (if you’re still wondering, that was NOT a joke) we announced a few weeks ago, we’re constantly looking for new and better ways to integrate Kettle into the wide world.

Today I’m blogging to spread the word about an exciting new possibility: exposing Kettle data over a web service.

Here’s the situation: you have a 3rd party vendor that wants to read data from you.  However, you don’t want to spend a lot of time coding web services and what not to deliver the data in a language neutral format.  Anybody can read data from a simple web service.  You can use any programming language and it’s easy to test in a browser.

The way that it works arguably couldn’t be easier…  Let’s take a look.

We start with a simple transformation that reads some customer data.  We take the “Read customer data” example from the samples:

samples/transformations/CSV Input – Reading customer data.ktr

Next, we’re going to replace the dummy step with a “Text File Output” step (or “XML Output” if you prefer that):

Save the transformation in the same samples folder under the filename “Servlet Data Example.ktr”.  In my case the full filename (as shown in the transformation settings dialog) is:

/home/matt/svn/kettle/trunk/samples/transformations/Servlet Data Example.ktr

Now you can use an instance of Carte that runs on your local host on port 8282 (execute carte.sh 127.0.0.1 8282) to get the data in your browser.  The URL is easy to construct.

For file (XML) based transformations:

http://username:password@hostname:port/kettle/executeTrans/?trans=PathToFile

For transformations stored in a repository:

http://username:password@hostname:port/kettle/executeTrans/?rep=RepositoryName&user=RepositoryUser&pass=RepositoryPassword&trans=PathToTransformationInRepository

Please note that you need to replace forward slashes with %2F in your browser.  That leads to this URL in our example:

http://cluster:cluster@127.0.0.1:8282/kettle/executeTrans/?trans=%2Fhome%2Fmatt%2Fsvn%2Fkettle%2Ftrunk%2Fsamples%2Ftransformations%2FServlet Data Example.ktr

The result can be tested in your browser.
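
Of course you don’t need a browser at all: any HTTP client in any programming language can consume the stream.  Below is a minimal sketch in plain Java (the Carte host, credentials and file path are simply the ones from the example above; adjust them to your setup) that fetches the rows and prints them as they arrive:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.Authenticator;
import java.net.HttpURLConnection;
import java.net.PasswordAuthentication;
import java.net.URL;
import java.net.URLEncoder;

public class ReadKettleServletData {
  public static void main(String[] args) throws Exception {
    // Basic authentication for Carte (cluster/cluster, as in the URL above)
    Authenticator.setDefault(new Authenticator() {
      protected PasswordAuthentication getPasswordAuthentication() {
        return new PasswordAuthentication("cluster", "cluster".toCharArray());
      }
    });

    // URL-encode the path to the transformation; this takes care of the %2F slashes
    String trans = URLEncoder.encode(
        "/home/matt/svn/kettle/trunk/samples/transformations/Servlet Data Example.ktr", "UTF-8");
    URL url = new URL("http://127.0.0.1:8282/kettle/executeTrans/?trans=" + trans);

    HttpURLConnection connection = (HttpURLConnection) url.openConnection();
    BufferedReader reader = new BufferedReader(
        new InputStreamReader(connection.getInputStream(), "UTF-8"));

    // Rows are streamed as the transformation produces them, so we can process line by line
    String line;
    while ((line = reader.readLine()) != null) {
      System.out.println(line);
    }
    reader.close();
  }
}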

For the script kiddies among you it is possible to get a hold of the servlet print writer in JavaScript too:

var out = _step_.getTrans().getServletPrintWriter();
out.println("<H1>Hello, world!\n</H1>");
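
(The snippet above would typically go into a “Modified Java Script Value” step somewhere in the transformation; the _step_ variable is what gives the script access to the running transformation and its servlet print writer.)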

Well, there you have it. I hope you like this idea.  You can try it out yourself if you download a new build of Pentaho Data Integration from our Jenkins CI build server.

Now excuse me while I put the same button in the “JSON Output” step…

Until next time,

Matt

UPDATE Apr 27: “JSON Output” now also supports the new “Servlet output” option.  Furthermore it is now also possible to pass parameters and variables through the URL.  Simply add &PARAM=value and the appropriate parameter or variable in the transformation will be set prior to execution.
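
For example, assuming the transformation defines a parameter called START_DATE (a made-up name, just for illustration), you could set it by appending it to the example URL from above:

http://cluster:cluster@127.0.0.1:8282/kettle/executeTrans/?trans=%2Fhome%2Fmatt%2Fsvn%2Fkettle%2Ftrunk%2Fsamples%2Ftransformations%2FServlet Data Example.ktr&START_DATE=2011-04-27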

36 comments

  • Wow, that is an interesting spin on transformation execution. What is the expected behavior if you try to run the transformation in a context other than Carte, such as Spoon/Kitchen/Pan/BI Server?

  • Well Daniel, since there is no output to write to I would expect a big fat null pointer exception 🙂
    I guess we could handle it more gracefully and write to stdout or something like that.

  • Sean

    Matt,

    I’m a regular reader who has converted from INFA and Talend to PDI. Have your book to prove it! Just this week, I used PDI and saved hours by doing a complex mapping reading from Excel.

    This is really nice. So I can have a transformation that does a bunch of stuff yet gives the simple output as a json/text web service to be consumed by anyone on the internet. This will be really useful in so many ways (I think Talend has that but I have not used it). I think we (PDI) are getting way ahead of the other ETL tools.

    Q: Can I write a job that does the same, or does it have to be a transformation?

    Regards,

  • Hi Sean,

    Thanks for the kind remarks. Adding the same functionality for a job would indeed make sense; however, it doesn’t work yet.
    If you feel like you have a use-case for it, please file a feature request in JIRA and I’ll certainly look at it 🙂

    Cheers,
    Matt

  • Sean

    Matt,

    I used your process above and it worked very nicely. Per my suggestion, if we can execute a job that outputs a file, that will be even better. The case is simple:
    1. Customer uploads a file to a webserver
    2. Then executes a URL which really executes a job (instead of just a trans).
    3. The job does a bunch of stuff! And the last trans posts the file back to the webpage.

    So the only difference will be for Carte to execute a job but return the transformation file, just like in your example.

    Does this make sense at all? If so, I can make a feature request. Also, I don’t know how the password information in the URL can be hidden.

  • Hi Sean,

    That request totally makes sense and TBH I half expected it to come. Please create a feature request for it at http://jira.pentaho.com and I’ll try to do it as soon as I have time for it.

    Thanks for the feedback!

    Matt

  • John

    Hi Matt,

    I think it is a very good idea and it works fine. But what do you think about the possibility of joining the “Text File Output” step with a project called “Apache Pivot” (also a very good project, like Kettle)?

    http://pivot.apache.org/demos/kitchen-sink.html

    I have already tried to test the example http://pivot.apache.org/demos/large-data.html – this application reads CSV files containing from 10 to 1,000,000 rows and reports the amount of time it takes to load each file. It loads the rows on a background thread using a streaming API such that rows can be presented to the user as they are read.
    In this example the CSV files are static, but if the CSV can be dynamically generated from a Kettle trans or job, it will be very cool!!! I am not a Java programmer and I like the Kettle tool very much. I think Kettle together with “Apache Pivot” can be a very interesting app for a BI solution or another online web app connected to a SQL database via Kettle trans or jobs.
    Matt, please, can you try to connect the large-data.html example to some CSV file on a local HDD and think about this collaboration (Kettle & Apache Pivot)? My programming knowledge is not good enough for it.

    Thank you very much in advance.

    John

  • Hi John,

    I’m not the sort of person that is going to object to combining different open source projects. Sometimes it feels like that’s all I’m doing all day long. However, this particular exercise I’m going to leave up to some other script kid to figure out.
    Please understand that Pentaho delivers a complete BI stack that goes far beyond displaying a CSV file in a browser.
    You might as well ask that we enable this option in the “Pentaho Reporting Output” step so that we can drop a PDF straight into a browser window. It’s not that you couldn’t do it if you wanted to but really I feel like we shouldn’t do it since all that logic is already written on the Pentaho BI Server.

    Have fun!

    Matt

  • This is fantastic functionality. A superbly easy way to make data available on demand. I’m no expert on Carte, but something that would be nice is to be able to set different permissions for different transformations. You might want different users to have the right to run different transformations. Maybe this is possible today; I have to read up on the Carte server.

    I love to work with Kettle/PDI and this new functionality is making me so happy. Thanks for the great work!

    /Björn

  • Lou

    Great stuff.
    I am a total Kettle fresher, so apologies if I’m asking obvious things.
    Is it possible to expose Kettle as a web service? I seem to remember that you can do that in BODI.
    An application could call Kettle with a web service and a (small) data set to process; or it could call one or a range of jobs via a web service…

    thanks for the help
    Lou

  • Hi Lou,

    That’s not exactly what this blog post was about. It’s already trivial to execute a transformation on the Pentaho BI server using a simple action sequence (xaction).
    You can execute those with a simple web service call, pass parameters, read data, pass it to a report and so on. However, the data is passed in-memory in a non-streaming fashion there.

    In this case we’re using the web service to pass high volumes of data in a streaming fashion.

    HTH,
    Matt

  • Wow – another awesome, spectacular example of power, flexibility and creativity.

  • Lou

    Hi Matt

    Thanks for the reply. My question is really: can Kettle configure transformations to be executed as web-services, while data and parameters are passed dynamically as part of the web-service call?
    Would this be considered a valid approach for real time data processing (push mode)?

    (I’ve just ordered your book but I haven’t got it yet – I’m sure everything will be clearer once I read it!)

    Also, on which Pentaho forum should I ask questions about the ability of the BI server to process a continuous stream of data in a report / scorecard?

    Thanks Lou

  • Sure, if you have, say, a bit of JavaScript in a browser you can use the transformation to stream data in real time. Take the Twitter example from the book. You can continue to stream data with this option.

    Even so, questions like this are usually asked by analysts who don’t know what they’re talking about in the sense that they NEVER have an actual use case 🙂

  • Lou

    Hi Matt

    The book has arrived and it is great (possibly one of the best I’ve seen!). Interestingly for this discussion, though, the “push model” is not even mentioned, not even in the real-time chapter.

    I take your challenge positively and will mention a few use cases, one from the past and one from the present.

    A petrol company with offshore platforms needed to collect daily statistics on oil output, water quality etc. from the platforms. This was collected offshore by various systems at any moment of the day, then beamed onshore and consolidated in a transactional DB on SQL Server. The company wanted to move the data to an Oracle DW as soon as it arrived and have a report generated from the DW on water quality. We used triggers to ultimately call BODI via a web service.

    A logistics company wanted to offer statistics to the drivers on their driving skills at the end of each shift. Data is streamed by a black box on the truck and beamed to a central location in an OLTP-type DB. As soon as the driver’s shift finishes the shift totals need to be pushed to a DW and a report needs to be available for the driver.

    An application needs to check the spelling of addresses in real time, e.g. when the user keys in the address. Each time, an ETL tool is called via a web service, sending the address as part of the call. This is checked against a reference list of identical and similar addresses (fuzzy logic). The result is shown in a pop-up where the operator confirms whether the address is really as typed or one of the similar ones stored as reference (there is a SAP demo somewhere showing this, involving CRM, Data Services and master data – I evaluated a similar design for a project a while ago but in the end it wasn’t implemented).

    Technically you could resolve the first two cases using pull methods (e.g. looping over a table every x minutes), but if a push method is available it generally has my preference.

    According to the book I should be able to call a transformation using the Java API. This is great but it is nice to have options…

    thanks

    Lou

  • Hi Lou,

    First, thanks for the feedback & the book purchase!

    The reason why the “push model” is not mentioned is the same reason as why I was reluctant to even go into the discussion: outside of a (too vocal) analyst crowd, it’s rarely if ever requested by our users. We already had too much content for our book so we had to cut smaller subjects left and right. Besides, calling a web-service in a PL/SQL trigger that executes a job or transformation has been possible for years as mentioned before. You don’t really need the new feature described in this blog post for that, although you indeed very much could do just that.

    The problem with these kinds of solutions is that the more plumbing is involved, the harder it becomes to set it up and to manage it. It’s like real-time data warehousing: everybody needs it until they see the work and cost involved in the solution compared to doing it near real-time or in micro-batches.

    By the way, fuzzy logic matching is a lousy way to clean addresses. We recently had Melissa Data present their plugin for Pentaho Data Integration that can use reference data and proper algorithms to do the job. There are a number of other partners we’re working with in various parts of the world to provide solid data quality solutions.

    So in the end we have the following options to call a transformation:
    – Using Java
    – Web service on a DI/BI server (action sequence, pretty simple)
    – Web service to stream text data (this blog post)
    – Remote execution using Carte/DI server
    – Using JDBC, see http://code.google.com/p/jdbckettle/

    I’m sure something can be built with those options 😉
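
    To make the first option a little more concrete, here is a minimal sketch of running a file-based transformation through the Kettle Java API (4.x; the .ktr path is just a placeholder):

    import org.pentaho.di.core.KettleEnvironment;
    import org.pentaho.di.trans.Trans;
    import org.pentaho.di.trans.TransMeta;

    KettleEnvironment.init();                        // bootstrap plugins, logging, ...
    TransMeta transMeta = new TransMeta("/path/to/some transformation.ktr");
    Trans trans = new Trans(transMeta);
    trans.execute(null);                             // no command line arguments
    trans.waitUntilFinished();
    if (trans.getErrors() > 0) {
      throw new RuntimeException("There were errors during transformation execution.");
    }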

    Take care,
    Matt

  • Christian

    Hi Matt,

    I’m a big Kettle fan and I’m using it on a daily basis. I have to admit that I’m not even doing a lot of BI stuff with it. It’s just a great tool that can solve a lot of problems. Redirecting the output to the Carte servlet is just another step towards an even more flexible Kettle.

    I’m also highly interested in the Kettle JDBC driver. Do you know if there is still a lot of ongoing development? The Google Code website says that it’s still something like a milestone build. Is it worth playing around with it now or do you think it makes sense to wait a few months? (I’m not sure how much you are involved in development but I’ve seen your name on the list of committers :-))

    thanks
    Christian

  • Hi Christian,

    I updated the JDBC driver to the 4.0 API a few months ago. However, to further improve adoption and speed up improvements I’m considering moving the JDBC driver into the Kettle project itself. I tried the driver a couple of times and for sure the basics work as advertised.

    As always, if there are any specific issues, let us know!

    Best of luck,

    Matt

  • Peter Mengaziol

    This is a really elegant solution! It would be great to see it mainstreamed into Kettle itself…

  • Hi Peter, it’s been mainstreamed into version 4.2.0.

  • Gian Luca Agus

    Here is my URL to get the transformation Scorte_simulazione_prova running.
    On the server side Carte gets stuck and I can’t see any result.

    172.19.10.134:8082/kettle/executeTrans/?rep=sviluppo&user=admin&pass=admin&trans=Scorte_simulazione_prova&PARAM="data_ini=2012-02-28"

    I think this is the wrong way to use PARAM.
    I would like to set the parameter (put in the transformation) data_in to the value “2012-02-08”.

    I couldn’t find the syntax for it.

    Where can I find that information ?

  • Macin

    Dear Matt,

    I hope that this comment still gets through to you. (I could not find a contact mail or something similar)
    I am a Belgian geo-ICT developer/analyst and will soon be starting on a big project, and we chose GeoKettle as the ETL tool.
    But it appears that the functionality you describe in this blog post (passing parameters is something we really need) is not part of GeoKettle.
    I am aware that you are the developer of Kettle and not GeoKettle, but is there any chance that you recall where I would need to start to implement the same functionality in GeoKettle? A starting point would be great.

    Cheers,
    Macin

  • Hi Macin,

    The changes done for this feature are fairly simple but spread over the following areas:
    – A new Carte servlet
    – A place in the Trans.java class to hold the servlet print writer (set/getServletPrintWriter())
    – A set of options in the various steps that support this.
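
    In a step that supports it, the pattern boils down to something like this (a rough sketch, not the exact production code; outputRowText stands for whatever text the step has already formatted):

    // inside the step's processRow(): grab the print writer that Carte attached to the transformation
    java.io.PrintWriter out = getTrans().getServletPrintWriter();
    if (out != null) {
      out.println(outputRowText);   // write the formatted row straight into the HTTP response
    }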

    Obviously it would be better if GeoKettle were following the developments of Kettle a bit more quickly so you wouldn’t have to go through this pain.
    From my side I have offered GeoKettle help on more than one occasion with the conversion of their code to a set of plugins. I’m still hoping we can make that happen for PDI 5.0.

    Your alternative now is to simply use 2 software packages: GeoKettle and Kettle.

    All the best,

    Matt

  • Macin

    Hi Matt,

    I thank you very much for this quick response! I am going through the source code right now and hopefully I’ll be able to implement the changes in GeoKettle.
    I wish the GeoKettle team had implemented the spatial data in a more modular fashion, or at least followed Kettle more closely than they do now. It would be a great step forward if their code were a set of plugins, so I wonder why they did not take you up on it.

    We really need the spatial capabilities of GeoKettle (the ETL actions only happen on spatial data), so how would you approach that using the two software packages?

    Thank you for your help and hard work, Kettle is an amazing ETL tool.

    Cheers,
    Macin

  • tarkan

    Hi,
    I would really like to integrate SugarCRM with Openbravo using Pentaho Data Integration. Is that possible? Thanks in advance 🙂

  • tarkan

    Hi,
    I would really like to integrate SugarCRM with Openbravo using Pentaho Data Integration. Is that possible? Please help me. Thanks in advance 🙂

  • Tarkan, please post your question on the forum and we’ll have a look.

  • Hi Matt!

    This looks really awesome and I have a perfect use case for it. Unfortunately, every time I try it, I get redirected to the login page. Is there a different/newer way to pass the credentials? I’ve tried all of the following formats with a variety of cluster/cluster, joe/password, and our actual admin username/password and none of them are working. We’re using the enterprise edition of PDI (version 4.2 I believe).

    http://cluster:cluster@mydomain.com:9080/pentaho-di/kettle/startTrans?name=Transformation+1
    http://mydomain.com:9080/pentaho-di/kettle/startTrans?name=Transformation+1&user=cluster&password=cluster
    http://mydomain.com:9080/pentaho-di/kettle/startTrans?name=Transformation+1&j_username=cluster&j_password=cluster

    Thanks very much, and thanks for all of your hard work in making Pentaho/Kettle awesome!

  • I should have mentioned: I can actually log in with the proper username/password and start the transformation after I log in using the URLs above; it’s just passing the credentials in the URL that doesn’t seem to work 🙂

    Thanks again!

  • Hi Topher,

    The service is probably not available for DI server 4.2. It’s best that you route this through your support portal so that we can figure it out for you.

    Thanks,

    Matt

  • BALAJI RANGAN

    Hi Matt, is there an example ktr which shows how to send an XML file to a web service via HTTP client or HTTP post?

  • Thomas

    Hi everybody,

    For several weeks I’ve been looking for a way to execute a job remotely without using a slave/master combination; the goal is to give users the possibility to launch a job with a simple click.

    I found a solution that could help some people here:

    A) First, use the “carte” program on the server side (example: SERVER1): ./carte.sh 0.0.0.0 8080
    On this server you have to define a transformation “/opt/remote_exemple1.ktr” with a “Rhino javascript” step. Include in this script the following command: execProcess("/opt/data-integration/kitchen.sh -rep=REPO -job=ORDERS_IN")

    B) On the client side, you simply have to use an HTML link like this: http://cluster:cluster@SERVER1:8080/kettle/executeTrans/?trans=/opt/remote_exemple1.ktr

    Then your job “ORDERS_IN” will be executed on the server.

    Thanks for Kettle which is a great tool.
    Thomas

  • Please note that version 5 of Kettle will add 2 new methods for executing transformations and jobs stored in a repository:

    http://wiki.pentaho.com/display/EAI/Carte+Configuration

  • ketkat

    Hi Matt, everybody,
    I read the article and the comments above, but I still don’t get (maybe my English is too bad) how to execute a job from a browser.

    Let’s say I’m using PDI 4.4 standalone on Windows.
    I created a main.kjb which contains two other jobs which run 2 or 3 transformations inside.
    All files are linked through relative paths and some parameters. The job runs without any errors in Kettle (everything is file-based, no repository!)

    I’ve got Carte.bat up and running and I’m able to execute a single transformation via:
    http://username:password@hostname:port/kettle/executeTrans/?trans=PathToFile

    But it does not work with Jobs.

    Is it actually possible? What’s the syntax for it?

    All I want is for a customer to enter a URL in a browser which runs a job and sends him an output email.

    Thank you

  • Rubel

    Hi,

    It seems that when we use “pass data to servlet” we do not get UTF-8 chars. Instead we see some question marks.

    Can you suggest a solution?

    Note: we are using version 4.4.x

    thanks
    Rubel