Dear Kettle friends,
For a while now we’ve been getting requests from users to support a system called “Data Federation” a.k.a. a “Virtual Database”. Even though it has been possible for a while to create reports on top of a Kettle transformation, this system could hardly be considered a virtual anything since the Pentaho reporting engine runs the transformation on the spot to get to the data.
The problem? A real virtual database would have to understand SQL and a data transformation engine typically doesn’t. It’s usually great at generating it, parsing it not so.
So after a lot of consideration and hesitation (you don’t really want to spend too much time in the neighborhood of SQL/JDBC code unless you want to go insane) we decided to build this anyway, mainly because folks kept asking about it and because it’s a nice challenge.
The ultimate goal is to create a virtual database that is clever enough to understand the SQL that the Mondrian ROLAP engine generates.
Here is the architecture we’re in need of:

In other words, here’s what the user should be able to do:
- He/she should be able to create any kind of transformation that generates rows of data, coming from any sort of database.
- It should be possible to use any kind of software that understands the JDBC and SQL standards
- It should have a minimal set of dependencies as far as libraries are concerned
- Data should be streamed to allow for massive amounts of data to be passed from server to client
- The SQL should be able to understand basic SQL including advanced WHERE, GROUP BY, ORDER BY, HAVING clauses. (anything that an OLAP engine needs)
Not for the first time, I though to myself (and the patient ##pentaho community on IRC) : “This can’t be that hard!!”. After all, you only need to parse SQL that gets data from a single (virtual) database table since joining and so on can be done in the service transformation.
So I started pounding on my keyboard for a few weeks (rudely interrupted by a week of vacation in France) and a solution is now more or less ready for more testing…
You can read all details about it on the following wiki page:
The cool thing about Kettle data federation is that anyone can test this in half an hour time following the next few simple steps:
- Download a recent 5.0-M1 development build from our CI system (any left failed unit tests are harmless but an indication that you are in fact dealing with non-stable software in development)
- Create a simple transformation (in .ktr file format) reading from a spreadsheet or some other nice and simple data source
- Create a Carte configuration file as described in the Server Configuration chapter on the driver page specifying
- The name of the service (for example “Service”)
- the transformation file name
- the name of the step that will deliver the data
- Then start Carte
- Then configure your client as indicated on the driver page.
For example, I created a transformation to test with that delivered some simple static data:
I have been testing with Mondrian on the EE BI Server 4.1.0-GA, and as indicated on the driver page, simply replaced all the kettle jar files in the server/biserver-ee/tomcat/webapps/pentaho/WEB-INF/lib/ folder.
Then you can do everything from inside the user interface.
Create the data source database connection:
Follow the data source wizard, select “Reporting and Analyses” at the bottom:
Select one table only and specify that table as the fact table:
Then you are about ready to start the reporting & analyses action. Simply keep the default model (you
can customize it later)…
You are now ready to create interactive reports…
… and analyzer views:
So get started on this and make sure to give us a lot of feedback, your success stories and failures as well. You can comment on the driver page or in the corresponding
JIRA case PDI-8231
The future plans are:
- Offer easy integration with the unified repository for our EE users so that they won’t have to enter XML or have to restart a server when they want to add or change the services list. (arguably an important requisite for anyone seriously considering this to be run in production)
- Implement service and SQL data caching on the server.
- Allow writable services and “insert into” statements on the JDBC client
Enjoy!
Matt