Dear Kettle friends,
For a while now we’ve been getting requests from users to support a system called “Data Federation” a.k.a. a “Virtual Database”. Even though it has been possible for a while to create reports on top of a Kettle transformation, this system could hardly be considered a virtual anything since the Pentaho reporting engine runs the transformation on the spot to get to the data.
The problem? A real virtual database would have to understand SQL and a data transformation engine typically doesn’t. It’s usually great at generating it, parsing it not so.
So after a lot of consideration and hesitation (you don’t really want to spend too much time in the neighborhood of SQL/JDBC code unless you want to go insane) we decided to build this anyway, mainly because folks kept asking about it and because it’s a nice challenge.
The ultimate goal is to create a virtual database that is clever enough to understand the SQL that the Mondrian ROLAP engine generates.
Here is the architecture we’re in need of:
In other words, here’s what the user should be able to do:
- He/she should be able to create any kind of transformation that generates rows of data, coming from any sort of database.
- It should be possible to use any kind of software that understands the JDBC and SQL standards
- It should have a minimal set of dependencies as far as libraries are concerned
- Data should be streamed to allow for massive amounts of data to be passed from server to client
- The SQL should be able to understand basic SQL including advanced WHERE, GROUP BY, ORDER BY, HAVING clauses. (anything that an OLAP engine needs)
- Download a recent 5.0-M1 development build from our CI system (any left failed unit tests are harmless but an indication that you are in fact dealing with non-stable software in development)
- Create a simple transformation (in .ktr file format) reading from a spreadsheet or some other nice and simple data source
- Create a Carte configuration file as described in the Server Configuration chapter on the driver page specifying
- The name of the service (for example “Service”)
- the transformation file name
- the name of the step that will deliver the data
- Then start Carte
- Then configure your client as indicated on the driver page.
- Offer easy integration with the unified repository for our EE users so that they won’t have to enter XML or have to restart a server when they want to add or change the services list. (arguably an important requisite for anyone seriously considering this to be run in production)
- Implement service and SQL data caching on the server.
- Allow writable services and “insert into” statements on the JDBC client