April 18th 2013

The Pentaho Big Data Forum

Dear friends,

If you’re in the Washington DC area next Tuesday, April 23rd, why not drop in on our complimentary Big Data Forum:

http://events.pentaho.com/Big-Data-Forum-Registration.html

Come and listen to us and our partners Cloudera, 10gen and Unisys and see what we can do for you in the Big Data space.

See you soon in DC!

Matt

March 8th 2013

Celebrating 10 Years of Kettle Coding

Dear Kettle friends,

The other week Jens and I were wondering how long it had been since I first started coding the current version of Kettle.  So I started a thorough computer forensics investigation, which led to the discovery of a backup of the first ever version of Kettle.

The date that comes up from that backup is March 4th, 2003, just about 10 years ago.  The development of Kettle started earlier with analysis documents (most probably lost, but nothing much was actually lost, if you know what I mean) and even a version written in C, as that was the main programming language I used back then to get things done.

Java was at mainstream versions 1.3 and 1.4 but lots of “Applets” still ran at 1.1 or 1.2, generics didn’t exist, computers generally had 1 CPU and 512MB RAM… and I had a book called something like “Java in 21 days” to teach me how to get going.  From there it took another 2 and a half years, lots of refactoring and lots of help to get to the open sourcing of version 2.2 in December 2005.

While going back to the beginning of Kettle’s history it’s easy to understate the importance of Pentaho. After all, of those 10 years of the current code-base, over 7 have been spent working with the rest of the Pentaho team to build the best data integration tool on the planet.  Programming alone is fine but in general you get more things done in a team.  It’s absolutely fantastic to see the whole team chip in alongside the community on things like bug fixing, builds, continuous integration, UI, design, plugins, website, forums, JIRA triage, product management, marketing, events, sales, …

Thank you all for making Kettle the awesome tool it is today and the incredible tool that Kettle5 will be.

Cheers,

Matt

10 Years of Kettle Pie

August 3rd 2012

Data federation

Dear Kettle friends,

For a while now we’ve been getting requests from users to support a system called “Data Federation” a.k.a. a “Virtual Database”.   Even though it has been possible for a while to create reports on top of a Kettle transformation, this system could hardly be considered a virtual anything since the Pentaho reporting engine runs the transformation on the spot to get to the data.

The problem?  A real virtual database would have to understand SQL and a data transformation engine typically doesn’t.  It’s usually great at generating SQL, but not so great at parsing it.

So after a lot of consideration and hesitation (you don’t really want to spend too much time in the neighborhood of SQL/JDBC code unless you want to go insane) we decided to build this anyway, mainly because folks kept asking about it and because it’s a nice challenge.

The ultimate goal is to create a virtual database that is clever enough to understand the SQL that the Mondrian ROLAP engine generates.

Here is the architecture we’re in need of:

In other words, here’s what the user should be able to do:

  • He/she should be able to create any kind of transformation that generates rows of data, coming from any sort of database.
  • It should be possible to use any kind of software that understands the JDBC and SQL standards
  • It should have a minimal set of dependencies as far as libraries are concerned
  • Data should be streamed to allow for massive amounts of data to be passed from server to client
  • The service should understand basic SQL, including advanced WHERE, GROUP BY, ORDER BY and HAVING clauses (anything that an OLAP engine needs).
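
To make the last two requirements a bit more concrete, here is a minimal Python sketch of what answering a single-table query over a stream of rows amounts to. This is not how Kettle implements it; the row data, the `service_rows` generator and the helper names are all hypothetical and only illustrate the idea of filtering and aggregating rows as they stream out of a service step.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Row = Dict[str, object]


def service_rows() -> Iterator[Row]:
    """Stand-in for the rows delivered by the service step (made-up data)."""
    yield {"country": "BE", "product": "kettle", "sales": 10}
    yield {"country": "FR", "product": "kettle", "sales": 5}
    yield {"country": "BE", "product": "teapot", "sales": 7}
    yield {"country": "US", "product": "kettle", "sales": 1}


def where(rows: Iterable[Row], predicate: Callable[[Row], bool]) -> Iterator[Row]:
    """WHERE: test each row lazily as it streams past."""
    return (row for row in rows if predicate(row))


def group_sum(rows: Iterable[Row], key: str, value: str) -> Dict[object, int]:
    """GROUP BY key with SUM(value): only the running totals stay in memory."""
    totals: Dict[object, int] = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
    return totals


def query(rows: Iterable[Row]) -> List[Tuple[object, int]]:
    # Roughly equivalent to:
    #   SELECT country, SUM(sales) FROM Service
    #   WHERE product = 'kettle'
    #   GROUP BY country HAVING SUM(sales) > 2
    #   ORDER BY country
    filtered = where(rows, lambda r: r["product"] == "kettle")
    totals = group_sum(filtered, "country", "sales")
    having = [(country, total) for country, total in totals.items() if total > 2]
    return sorted(having)


print(query(service_rows()))
```

Because the input is a generator, the rows are never materialised as a whole; only the per-group totals are kept, which is the property that makes streaming large volumes feasible.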
Not for the first time, I thought to myself (and the patient ##pentaho community on IRC): “This can’t be that hard!!”  After all, you only need to parse SQL that gets data from a single (virtual) database table, since joining and so on can be done in the service transformation.

So I started pounding on my keyboard for a few weeks (rudely interrupted by a week of vacation in France) and a solution is now more or less ready for more testing…

You can read all details about it on the following wiki page:

The cool thing about Kettle data federation is that anyone can test this in half an hour by following these few simple steps:
  • Download a recent 5.0-M1 development build from our CI system (any remaining failed unit tests are harmless, but they are an indication that you are in fact dealing with non-stable software in development)
  • Create a simple transformation (in .ktr file format) reading from a spreadsheet or some other nice and simple data source
  • Create a Carte configuration file as described in the Server Configuration chapter on the driver page specifying
    • The name of the service (for example “Service”)
    • the transformation file name
    • the name of the step that will deliver the data
  • Then start Carte
  • Then configure your client as indicated on the driver page.
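
As a rough illustration of the Carte configuration step, the sketch below generates a small XML fragment with the three pieces of information listed above. The element names (`slave_config`, `services`, `service`, `name`, `filename`, `service_step`), the file path and the step name “Output” are assumptions written from memory, so please check them against the Server Configuration chapter on the driver page before relying on them.

```python
import xml.etree.ElementTree as ET


def carte_service_config(service_name: str, ktr_path: str, step_name: str) -> str:
    """Build a services section for a Carte config file.

    NOTE: the element names used here are assumptions, not a verified schema.
    """
    root = ET.Element("slave_config")
    services = ET.SubElement(root, "services")
    service = ET.SubElement(services, "service")
    ET.SubElement(service, "name").text = service_name          # name of the service
    ET.SubElement(service, "filename").text = ktr_path          # transformation file name
    ET.SubElement(service, "service_step").text = step_name     # step delivering the data
    return ET.tostring(root, encoding="unicode")


# Hypothetical values: adjust the path and step name to your own transformation.
config_xml = carte_service_config("Service", "/path/to/service.ktr", "Output")
print(config_xml)
```

A real configuration file will typically also need the server's own settings (host name, port and so on), which are left out here.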
For example, I created a transformation to test with that delivered some simple static data:

I have been testing with Mondrian on the EE BI Server 4.1.0-GA and, as indicated on the driver page, simply replaced all the kettle jar files in the server/biserver-ee/tomcat/webapps/pentaho/WEB-INF/lib/ folder.

Then you can do everything from inside the user interface.

Create the data source database connection:

Follow the data source wizard, select “Reporting and Analyses” at the bottom:

Select one table only and specify that table as the fact table:

Then you are about ready to start the reporting & analysis action.  Simply keep the default model (you can customize it later)…

You are now ready to create interactive reports…

… and analyzer views:

So get started on this and make sure to give us a lot of feedback, your success stories and failures as well.  You can comment on the driver page or in the corresponding JIRA case, PDI-8231.

The future plans are:
  • Offer easy integration with the unified repository for our EE users so that they won’t have to enter XML or have to restart a server when they want to add or change the services list. (arguably an important requisite for anyone seriously considering this to be run in production)
  • Implement service and SQL data caching on the server.
  • Allow writable services and “insert into” statements on the JDBC client
Enjoy!
Matt

May 8th 2012

Better Data for Better Analytics

Dear Kettle friends,

Thursday May 10th, in a few days, I’ll be joining my friend Kasper Sørensen (the founder and lead architect of DataCleaner, a Human Inference data profiling project) in our web seminar (webinar).  We’ll be going over a bit of history, our cooperation model as well as the architecture behind the new data quality features.

Register here: http://www.pentaho.com/resources/events/20120510-better-data-for-better-analytics/

Kasper will also be doing 3 cool live demos on the subjects of data profiling and data quality.

I hope you’ll be able to join the crowd this Thursday May 10th, 10am PST (Los Angeles), 1pm EST (New York) or 7pm CET (Brussels).

We’ll be doing our best to answer your data quality questions simultaneously with the presentation.

See you there!

Cheers,
Matt

March 8th 2012

DM-Radio today

Dear Kettle fans,

Today I’ll be joining Eric Kavanagh and Jim Ericson on the DM Radio show for an episode titled “On the move: Why ETL is here to stay”.

If you’re interested in listening in, don’t forget to register at the landing page over here.

All the best,

Matt
