Data federation

Dear Kettle friends,

For a while now we’ve been getting requests from users to support a system called “Data Federation” a.k.a. a “Virtual Database”.   Even though it has been possible for a while to create reports on top of a Kettle transformation, this system could hardly be considered a virtual anything since the Pentaho reporting engine runs the transformation on the spot to get to the data.

The problem? A real virtual database has to understand SQL, and a data transformation engine typically doesn’t. It’s usually great at generating SQL, but not so great at parsing it.

So after a lot of consideration and hesitation (you don’t really want to spend too much time in the neighborhood of SQL/JDBC code unless you want to go insane) we decided to build this anyway, mainly because folks kept asking about it and because it’s a nice challenge.

The ultimate goal is to create a virtual database that is clever enough to understand the SQL that the Mondrian ROLAP engine generates.

Here is the architecture we need:

In other words, here’s what the user should be able to do:

  • He/she should be able to create any kind of transformation that generates rows of data, coming from any sort of database.
  • It should be possible to use any kind of software that understands the JDBC and SQL standards.
  • It should have a minimal set of dependencies as far as libraries are concerned.
  • Data should be streamed, to allow massive amounts of data to be passed from server to client.
  • The service should understand basic SQL, including WHERE, GROUP BY, ORDER BY and HAVING clauses (anything that an OLAP engine needs); a small client-side sketch follows this list.
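To make that goal a bit more concrete, here is a minimal sketch of what a plain JDBC client talking to such a service could look like. The driver class name, URL format, credentials, service name and column names are assumptions for the sake of illustration; the driver page has the authoritative details.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class KettleServiceClient {
    public static void main(String[] args) throws Exception {
        // Hypothetical thin-driver class and URL: the host and port point at a
        // running Carte server, and "Service" is the data service name defined there.
        Class.forName("org.pentaho.di.core.jdbc.ThinDriver");
        String url = "jdbc:pdi://localhost:8080/kettle";

        try (Connection con = DriverManager.getConnection(url, "cluster", "cluster");
             Statement stmt = con.createStatement()) {

            // A single-table query with WHERE, GROUP BY, HAVING and ORDER BY:
            // the SQL subset an OLAP engine such as Mondrian generates.
            // The columns (country, sales, year) are made up for this example.
            ResultSet rs = stmt.executeQuery(
                "SELECT country, SUM(sales) AS total_sales "
                + "FROM Service "
                + "WHERE year = 2012 "
                + "GROUP BY country "
                + "HAVING SUM(sales) > 1000 "
                + "ORDER BY total_sales DESC");

            while (rs.next()) {
                System.out.println(rs.getString("country") + " : " + rs.getLong("total_sales"));
            }
        }
    }
}
```

Note that the shape of that query, a single table with WHERE, GROUP BY, HAVING and ORDER BY, is exactly what the virtual database needs to handle for an OLAP engine.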
Not for the first time, I thought to myself (and to the patient ##pentaho community on IRC): “This can’t be that hard!!”  After all, you only need to parse SQL that gets data from a single (virtual) database table, since joining and so on can be done in the service transformation.

So I started pounding on my keyboard for a few weeks (rudely interrupted by a week of vacation in France), and a solution is now more or less ready for more testing…

You can read all the details about it on the following wiki page:

The cool thing about Kettle data federation is that anyone can test this in half an hour’s time by following these few simple steps:
  • Download a recent 5.0-M1 development build from our CI system (any remaining failing unit tests are harmless, but they are an indication that you are in fact dealing with non-stable software still in development)
  • Create a simple transformation (in .ktr file format) reading from a spreadsheet or some other nice and simple data source
  • Create a Carte configuration file as described in the Server Configuration chapter on the driver page, specifying (a rough sketch follows this list):
    • the name of the service (for example “Service”)
    • the transformation file name
    • the name of the step that will deliver the data
  • Then start Carte
  • Then configure your client as indicated on the driver page.
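To give an idea of what the Carte configuration step above involves, a service definition might look roughly like the sketch below. The element names, port and paths here are illustrative assumptions written from memory; the Server Configuration chapter on the driver page is the authoritative reference.

```xml
<slave_config>

  <!-- Standard Carte slave server definition; hostname and port are examples. -->
  <slaveserver>
    <name>slave1</name>
    <hostname>localhost</hostname>
    <port>8080</port>
  </slaveserver>

  <!-- One data service: the virtual table name that JDBC clients will query,
       the transformation that produces the rows, and the step whose output rows
       are served to the client. Element names are assumptions; check the driver page. -->
  <services>
    <service>
      <name>Service</name>
      <filename>/path/to/simple-service.ktr</filename>
      <service_step>Output</service_step>
    </service>
  </services>

</slave_config>
```

With a file like this in place and Carte started with it, “Service” becomes a virtual table name that clients can query.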
For example, I created a test transformation that delivers some simple static data:
I have been testing with Mondrian on the EE BI Server 4.1.0-GA and, as indicated on the driver page, simply replaced all the Kettle jar files in the server/biserver-ee/tomcat/webapps/pentaho/WEB-INF/lib/ folder.
Then you can do everything from inside the user interface.
Create the data source database connection:
Follow the data source wizard, select “Reporting and Analyses” at the bottom:
Select one table only and specify that table as the fact table:
Then you are about ready to start the reporting & analysis action.  Simply keep the default model (you can customize it later)…
You are now ready to create interactive reports…
… and analyzer views:
So get started on this and make sure to give us a lot of feedback: your success stories as well as your failures.  You can comment on the driver page or in the corresponding JIRA case PDI-8231.
The future plans are:
  • Offer easy integration with the unified repository for our EE users so that they won’t have to enter XML or restart a server when they want to add or change the services list (arguably an important prerequisite for anyone seriously considering running this in production).
  • Implement service and SQL data caching on the server.
  • Allow writable services and “insert into” statements on the JDBC client.
Enjoy!
Matt

5 comments

  • Nick Baker

    Wow Matt! That’s some ambitious and impressive work.

  • Robert Mack

    The “big problem” with data federation is that it depends upon joining disparate databases that were not designed to be joined. Any data set commonality between disparate databases results from accidental data commonality. There is no intentional referential integrity between disparate databases! Until now….

    Data integration by design methods are used to design integrated databases or to augment disparate databases to form integrated databases. Integrated databases share data access paths between databases with the enforcement of referential integrity. Please check it out @ http://www.strins.com/

  • It sounds good in theory, Robert. In practice you obviously run into a serious ROI problem with putting excessive metadata and metamodels in a central place. That’s obviously MHO, but since this is my blog … :-)

  • Robert Mack

    Greetings;

    Thank you for your response. You are thinking that one data model needs to contain the entire set of integrated data models. However, this is not the case at all.

    You missed a very important aspect of integrated data models. Each integrated data model shares a common set of data entities as their boundary. These shared data entities anchor the integrated data model relative to any other integrated data model. Now each integrated data model may exist as an individual encapsulated data model that may be related to any other individual encapsulated data model via the common set of data entities. Since commonality relationships are peer to peer without foreign key inheritance, as many integrated data models as desired may be related without violating the encapsulation of any integrated data model. This is a very object oriented approach to data modeling that you may wish to learn.

    Since you are interested in return on investment (ROI), you will be happy to understand that integrated data architectures, based upon integrated databases, are far, far superior to the disparate data architectures based upon disparate databases. Integrated data architectures provide an agility, efficiency, and effectiveness that is not possible with the prior art disparate data architectures.

    Please stop by my website (http://www.strins.com) and request the data integration by design whitepaper. I guarantee you will learn things about data architecture and data modeling that even you don’t know.

    Thanks again for the blog comment.

  • Robert, I’m sorry if you felt like I was implying that you need to have a single data model; I was not. However, now you’re implying that you do need to have all data models available, which is IMO only slightly less labor-intensive. It’s still a single meta-model, in this case containing multiple data models.

    That being said, I’m a big fan of model-driven data integration so where it makes sense Pentaho will be pushing in the direction of integrating data models as well.

    Good luck!

    Matt