Resource exporter

Dear Kettle fans,

One of the things that’s been on my TODO list for a while was the creation of a resource exporter.

Resource exporter?

It’s called “Resource exporter” and not “Job exporter” or “Transformation exporter” because it is intended to export more than just a single job or transformation.  It exports all linked resources of a job or transformation.

This means that if you have a job with 5 transformation job entries, you will be exporting 6 resources (1 job and 5 transformations).  If those transformations use 3 sub-transformations (mappings), you will export 9 resources in total.

The whole idea behind this exercise is to be able to create a package (for example to send to someone) that has all needed resources contained in a single zip file.

Let’s look at an example!  We have a job to load/update a complete data warehouse.  It loads source files and updates dimensions and a fact table: 31 transformations and jobs in total.

The top level job we want to export is the “Load data warehouse.kjb” (can be in a repository too!).  Thanks to the very recently added “export” option in Kitchen, we can run this:

sh kitchen.sh -file='/parking/TDWI/PDI/Load data warehouse.kjb' -export=/tmp/foo.zip

This generates the file “/tmp/foo.zip” that contains all the used resources.  Please note you can also do this in Spoon under the “File” menu.
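
Since the top level job can also live in a repository, the same export option should work in combination with Kitchen’s repository arguments instead of -file.  A sketch of what that would look like (the repository name, credentials and directory below are made up):

sh kitchen.sh -rep=MyRepo -user=admin -pass=admin -dir='/TDWI' -job='Load data warehouse' -export=/tmp/foo.zip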

What about job and transformation filenames?

If you look in the ZIP file with “unzip -l” you will notice entries like this one:

    33107  03-10-09 18:31   Update_Customer_Dimension_023.ktr
Originating file : /parking/TDWI/PDI/Update Customer Dimension.ktr (file:///parking/TDWI/PDI/Update Customer Dimension.ktr)

This resource gets called in the “Update dimensions” job, so let’s look inside the generated XML to see how this is solved.  We see that the <filename> entries have been replaced by the correct link:

<filename>${Internal.Job.Filename.Directory}/Update_Customer_Dimension_023.ktr</filename>
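
For comparison, before the export that entry presumably contained the absolute path of the transformation (reconstructed here from the originating file shown in the listing above):

<filename>/parking/TDWI/PDI/Update Customer Dimension.ktr</filename>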

This is interesting, because the originating transformation could have been located anywhere.  Once it’s exported to the zip file, it’s referenced with a relative path (using PDI internal variables).  That in turn means you can put the zip file anywhere, even on a remote web server, and it would still be executable.  In fact, Kitchen gives us advice on how to run the “Load data warehouse” job in the ZIP file:

This resource can be executed inside the exported ZIP file without extraction.
You can do this by executing the following command:                           

sh kitchen.sh -file='zip:file:///tmp/foo.zip!Load_data_warehouse_001.kjb'
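
As a side note: since Kettle accesses files through Apache VFS, pointing Kitchen at a ZIP file sitting on a web server should in principle work the same way.  A hypothetical (untested) example, with a made-up URL:

sh kitchen.sh -file='zip:http://www.example.com/downloads/foo.zip!Load_data_warehouse_001.kjb'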

What about input file names?

Obviously, you can’t go around zipping up input files, which can sometimes be quite large.  So we opted to create a set of named parameters that you can use to define the location of the input files.

In our example, we have a set of files read with “CSV Input” and “Text File Input” steps that are located in 2 folders:  “/parking/TDWI/” and “/parking/TDWI/Source Data”.  During the export, the step metadata will be changed to read:

${DATA_PATH_x}/<filename>

In this specific case we then create 2 parameters in the job, sub-jobs and sub-transformations:

DATA_PATH_1, default=/parking/TDWI/Source Data
DATA_PATH_2, default=/parking/TDWI
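
To make that concrete, here is roughly how two input files would be rewritten during the export (the file names are made up for illustration):

/parking/TDWI/Source Data/customers.csv   becomes   ${DATA_PATH_1}/customers.csv
/parking/TDWI/orders.txt                  becomes   ${DATA_PATH_2}/orders.txt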

These named parameters can then be used during execution with Kitchen.  If you send the “foo.zip” file to someone else along with the data in “/bar” and “/bar/Source Data”, you can execute the job as follows:

sh kitchen.sh -file='zip:file:///bar/foo.zip!Load_data_warehouse_001.kjb' \
  -param:DATA_PATH_1="/bar/Source Data" \
  -param:DATA_PATH_2="/bar"

The subject of named parameters is worthy of a complete article all by itself.  It’s the brainchild of Kettle star Sven Boden.  It would take us too far to explain the details here, but you can see what parameters are defined for the job like this:

sh kitchen.sh -file='zip:file:///bar/foo.zip!Load_data_warehouse_001.kjb' -listparam

Parameter: DATA_PATH_1=, default=/parking/TDWI/Source Data : Data file path discovered during export
Parameter: DATA_PATH_2=, default=/parking/TDWI : Data file path discovered during export            

Because the default values are set, you can in fact test the job before you send it over.

What’s next?

Next on the agenda (after the 3.2 release) is to make this function available in the execution dialog so that we can do remote execution more easily.  Another interesting option is to store the generated zip files in a folder or even in a database so that we can always see exactly what was executed at a given time.

Until next time,
Matt

5 comments

  • Hi!

    Sounds like a great addition!

    Congrats ;)

  • Congratulations, Matt!

  • Stefan Meyer

    Hi Matt,

    I did post this sort of export functionality a couple of months ago and Sven had a look at it. I also implemented and provided an import functionality, so that the resources can be managed with a source code repository. Is that function available?

    Greetings,
    Stefan

  • Hi Stefan,

    Sorry, I have no idea what you’re talking about. I didn’t come across a JIRA case like that. There are a lot of things going on behind the scenes with respect to life cycle management and versioning. However, those are things for a future release :-)

    All the best,
    Matt

  • Wow. I’ve been seeing commit messages flit by mentioning the “resource exporter”, but I hadn’t yet taken the time to learn what it was exactly.

    This is really great because it is quite similar to what we do to deploy jobs to production machines. It will be very interesting to play with it and see how it interacts with the way we build file lists and use command line arguments.

    Great work!