<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>

<channel>
	<title>Matt Casters on Data Integration</title>
	<atom:link href="http://www.ibridge.be/?feed=rss2" rel="self" type="application/rss+xml" />
	<link>http://www.ibridge.be</link>
	<description>Venting steam after a long day of writing code...</description>
	<pubDate>Thu, 18 Apr 2013 21:22:06 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.5</generator>
	<language>en</language>
			<item>
		<title>The Pentaho Big Data Forum</title>
		<link>http://www.ibridge.be/?p=212</link>
		<comments>http://www.ibridge.be/?p=212#comments</comments>
		<pubDate>Thu, 18 Apr 2013 21:22:06 +0000</pubDate>
		<dc:creator>Matt Casters</dc:creator>
		
		<category><![CDATA[Data Integration]]></category>

		<category><![CDATA[Big Data]]></category>

		<category><![CDATA[Cloudera]]></category>

		<category><![CDATA[Hadoop]]></category>

		<category><![CDATA[Kettle]]></category>

		<category><![CDATA[MongoDB]]></category>

		<category><![CDATA[PDI]]></category>

		<category><![CDATA[pentaho]]></category>

		<guid isPermaLink="false">http://www.ibridge.be/?p=212</guid>
		<description><![CDATA[Dear friends,
If you&#8217;re in the Washington DC area next Tuesday, April 23rd, why not drop in on our complimentary Big Data Forum:
http://events.pentaho.com/Big-Data-Forum-Registration.html
Come and listen to us and our partners Cloudera, 10gen and Unisys and see what we can do for you in the Big Data space.
See you soon in DC!
Matt
]]></description>
			<content:encoded><![CDATA[<p>Dear friends,</p>
<p>If you&#8217;re in the Washington DC area next Tuesday, April 23rd, why not drop in on our complimentary Big Data Forum:</p>
<p><a href="http://events.pentaho.com/Big-Data-Forum-Registration.html">http://events.pentaho.com/Big-Data-Forum-Registration.html</a></p>
<p>Come and listen to us and our partners Cloudera, 10gen and Unisys and see what we can do for you in the Big Data space.</p>
<p>See you soon in DC!</p>
<p>Matt</p>
]]></content:encoded>
			<wfw:commentRss>http://www.ibridge.be/?feed=rss2&amp;p=212</wfw:commentRss>
		</item>
		<item>
		<title>Celebrating 10 Years of Kettle Coding</title>
		<link>http://www.ibridge.be/?p=211</link>
		<comments>http://www.ibridge.be/?p=211#comments</comments>
		<pubDate>Fri, 08 Mar 2013 14:09:54 +0000</pubDate>
		<dc:creator>Matt Casters</dc:creator>
		
		<category><![CDATA[Data Integration]]></category>

		<category><![CDATA[10 Year]]></category>

		<category><![CDATA[Java]]></category>

		<category><![CDATA[Kettle]]></category>

		<category><![CDATA[pentaho]]></category>

		<guid isPermaLink="false">http://www.ibridge.be/?p=211</guid>
		<description><![CDATA[Dear Kettle friends,
The other week Jens and I were wondering how long it had been since I first started coding the current version of Kettle.  So I started a thorough computing forensics investigation leading to the discovery of a  backup of the first ever version of Kettle.
The date that comes up from that backup is March 4th, [...]]]></description>
			<content:encoded><![CDATA[<p>Dear Kettle friends,</p>
<p>The other week <a href="http://kettle.bleuel.com/">Jens</a> and I were wondering how long it had been since I first started coding the current version of Kettle.  So I started a thorough <a href="http://en.wikipedia.org/wiki/Computer_forensics">computing forensics</a> investigation leading to the discovery of a  backup of the <a href="https://s3.amazonaws.com/kettle/FirstVersion.zip">first ever version</a> of Kettle.</p>
<p>The date that comes up from that backup is March 4th, 2003, just about 10 years ago.  The development of Kettle started earlier with analysis documents (most probably lost but nothing much was actually lost if you know what I mean) and even a version written in C as that was the main programming language I used back then to get things done.</p>
<p>Java was at mainstream version 1.3 and 1.4 but lots of &#8220;Applets&#8221; still ran at 1.1 or 1.2, generics didn&#8217;t exist, computers had in general 1 CPU, 512MB RAM&#8230; and I had a book called something like &#8220;<em>Java in 21 days</em>&#8221; to teach me how to get going.  From there it took another 2 and a half years, lots of re-factoring and lots of help to get to the open sourcing of version 2.2 in December 2005.</p>
<p>While going back to the beginning of Kettle&#8217;s history it&#8217;s easy to understate the importance of Pentaho. After all, of those 10 years of the current code-base, over 7 have been spent working with the rest of the Pentaho team to build the best data integration tool on the planet.  Programming alone is fine but in general you get more things done in a team.  It&#8217;s absolutely fantastic to see the whole team chip in alongside the community on things like bug fixing, builds, continuous integration, UI, design, plugins, website, forums, JIRA triage, product management, marketing, events, sales, &#8230;</p>
<p>Thank you all for making Kettle the awesome tool it is today and the incredible tool that Kettle5 will be.</p>
<p>Cheers,</p>
<p>Matt</p>
<p><img class="alignleft" src="http://www.ibridge.be/images/Kettle10YearsPieCutout.jpg" alt="10 Years of Kettle Pie" width="300" height="250" /></p>
]]></content:encoded>
			<wfw:commentRss>http://www.ibridge.be/?feed=rss2&amp;p=211</wfw:commentRss>
		</item>
		<item>
		<title>Data federation</title>
		<link>http://www.ibridge.be/?p=210</link>
		<comments>http://www.ibridge.be/?p=210#comments</comments>
		<pubDate>Thu, 02 Aug 2012 22:17:45 +0000</pubDate>
		<dc:creator>Matt Casters</dc:creator>
		
		<category><![CDATA[Data Integration]]></category>

		<guid isPermaLink="false">http://www.ibridge.be/?p=210</guid>
		<description><![CDATA[Dear Kettle friends,
For a while now we&#8217;ve been getting requests from users to support a system called &#8220;Data Federation&#8221; a.k.a. a &#8220;Virtual Database&#8221;.   Even though it has been possible for a while to create reports on top of a Kettle transformation, this system could hardly be considered a virtual anything since the Pentaho reporting engine runs [...]]]></description>
			<content:encoded><![CDATA[<p>Dear Kettle friends,</p>
<p>For a while now we&#8217;ve been getting requests from users to support a system called &#8220;Data Federation&#8221; a.k.a. a &#8220;Virtual Database&#8221;.   Even though it has been possible for a while to create reports on top of a Kettle transformation, this system could hardly be considered a virtual anything since the Pentaho reporting engine runs the transformation on the spot to get to the data.</p>
<p>The problem?  A real virtual database would have to understand SQL and a data transformation engine typically doesn&#8217;t.  It&#8217;s usually great at generating SQL, but not so great at parsing it.</p>
<p>So after a lot of consideration and hesitation (you don&#8217;t really want to spend too much time in the neighborhood of SQL/JDBC code unless you want to go insane) we decided to build this anyway, mainly because folks kept asking about it and because it&#8217;s a nice challenge.</p>
<p>The ultimate goal is to create a virtual database that is clever enough to understand the SQL that the Mondrian ROLAP engine generates.</p>
<p>Here is the architecture we&#8217;re in need of:</p>
<p><img src="http://wiki.pentaho.com/download/attachments/23536492/thin-kettle-jdbc-driver-architecture.png?version=1&amp;modificationDate=1343414178000" alt="" width="652" height="350" /></p>
<p>In other words, here&#8217;s what the user should be able to do:</p>
<ul>
<li>He/she should be able to create any kind of transformation that generates rows of data, coming from any sort of database.</li>
<li>It should be possible to use any kind of software that understands the JDBC and SQL standards</li>
<li>It should have a minimal set of dependencies as far as libraries are concerned</li>
<li>Data should be streamed to allow for massive amounts of data to be passed from server to client</li>
<li>The virtual database should be able to understand basic SQL, including advanced WHERE, GROUP BY, ORDER BY and HAVING clauses (anything that an OLAP engine needs)</li>
</ul>
<div>Not for the first time, I thought to myself (and said to the patient ##pentaho community on IRC): &#8220;This can&#8217;t be that hard!!&#8221;.  After all, you only need to parse SQL that gets data from a single (virtual) database table since joining and so on can be done in the service transformation.</div>
<div>So I started pounding on my keyboard for a few weeks (rudely interrupted by a week of vacation in France) and a solution is now more or less ready for more testing&#8230;</div>
<div>You can read all details about it on the following wiki page:</div>
<blockquote>
<div><a href="http://wiki.pentaho.com/display/EAI/The+Thin+Kettle+JDBC+driver">http://wiki.pentaho.com/display/EAI/The+Thin+Kettle+JDBC+driver</a></div>
</blockquote>
<div>The cool thing about Kettle data federation is that anyone can test this in half an hour by following these few simple steps:</div>
<div>
<ul>
<li>Download a recent 5.0-M1 development build <a href="http://ci.pentaho.com/job/Kettle/">from our CI system</a> (any remaining failed unit tests are harmless, but they do indicate that you are dealing with unstable software in development)</li>
<li>Create a simple transformation (in .ktr file format) reading from a spreadsheet or some other nice and simple data source</li>
<li>Create a Carte configuration file as described in the Server Configuration chapter on <a href="http://wiki.pentaho.com/display/EAI/The+Thin+Kettle+JDBC+driver">the driver page</a> specifying
<ul>
<li>The name of the service (for example &#8220;Service&#8221;)</li>
<li>the transformation file name</li>
<li>the name of the step that will deliver the data</li>
</ul>
</li>
<li>Then start Carte</li>
<li>Then configure your client as indicated on the driver page.</li>
</ul>
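<div>To give an idea of what the client side of that last step can look like, here is a minimal Java sketch of a plain JDBC client.  It is only an illustration under a few assumptions: the driver class name and the URL layout are the ones I expect to find on the driver page (please check there for the exact values), the host, port and credentials are hypothetical, and the class name ThinKettleClient is made up.  The table name is simply the example service name from the list above.</div>
<pre>import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ThinKettleClient {

    // ASSUMPTION: driver class name, to be verified against the driver page.
    static final String DRIVER_CLASS = "org.pentaho.di.core.jdbc.ThinDriver";

    // ASSUMPTION: URL layout jdbc:pdi://host:port/kettle, pointing at the Carte server.
    public static String buildUrl(String host, int port) {
        return "jdbc:pdi://" + host + ":" + port + "/kettle";
    }

    public static void main(String[] args) throws Exception {
        try {
            Class.forName(DRIVER_CLASS);
        } catch (ClassNotFoundException e) {
            System.err.println("Kettle jar files are not on the classpath: " + e.getMessage());
            return;
        }
        // Hypothetical host, port and credentials; "Service" is the example service name.
        String url = buildUrl("localhost", 8080);
        try (Connection connection = DriverManager.getConnection(url, "cluster", "cluster");
             Statement statement = connection.createStatement();
             ResultSet rows = statement.executeQuery("SELECT * FROM Service")) {
            while (rows.next()) {
                // Print the first column of every row delivered by the service step.
                System.out.println(rows.getString(1));
            }
        }
    }
}</pre>
<div>Any other tool that speaks JDBC would be configured with the same kind of driver class, URL and credentials.</div>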
<div>For example, I created a transformation to test with that delivered some simple static data:</div>
<div><img src="http://www.ibridge.be/images/thin-jdbc/spoon-test-data-kettle-jdbc.png" alt="" width="535" height="463" /></div>
<div>I have been testing with Mondrian on the EE BI Server 4.1.0-GA, and as indicated on the driver page, simply replaced all the kettle jar files in the server/biserver-ee<span>/tomcat/webapps/pentaho/WEB-INF/lib/ folder.</span></div>
<div>Then you can do everything from inside the user interface.</div>
</div>
<div>Create the data source database connection:</div>
<div><img src="http://www.ibridge.be/images/thin-jdbc/bi-dbdialog-kettle-jdbc.png" alt="" /></div>
<div>Follow the data source wizard, select &#8220;Reporting and Analyses&#8221; at the bottom:</div>
<div><img src="http://www.ibridge.be/images/thin-jdbc/bi-wizard1-kettle-jdbc.png" alt="" width="821" height="577" /></div>
<div>Select one table only and specify that table as the fact table:</div>
<div><img src="http://www.ibridge.be/images/thin-jdbc/bi-wizard2-kettle-jdbc.png" alt="" width="821" height="580" /></div>
<div>Then you are about ready to start the reporting &amp; analysis action.  Simply keep the default model (you can customize it later)&#8230;</div>
<div><img src="http://www.ibridge.be/images/thin-jdbc/bi-wizard3-kettle-jdbc.png" alt="" width="495" height="276" /></div>
<div>You are now ready to create interactive reports&#8230;</div>
<div><img src="http://wiki.pentaho.com/download/attachments/23536492/pentaho-interactive-reporting-sql-kettle-jdbc1.png?version=1&amp;modificationDate=1343732011888" alt="" width="1126" height="696" /></div>
<div>&#8230; and analyzer views:</div>
<div><img src="http://www.ibridge.be/images/thin-jdbc/pentaho-analyzer-sql-kettle-jdbc1.png" alt="" /></div>
<div>So get started on this and make sure to give us a lot of feedback, your success stories and failures as well.  You can comment on the driver page or in the corresponding <a href="http://jira.pentaho.com/browse/PDI-8231">JIRA case PDI-8231</a></div>
<div>The future plans are:</div>
<div>
<ul>
<li>Offer easy integration with the unified repository for our EE users so that they won&#8217;t have to enter XML or have to restart a server when they want to add or change the services list. (arguably an important requisite for anyone seriously considering this to be run in production)</li>
<li>Implement service and SQL data caching on the server.</li>
<li>Allow writable services and &#8220;insert into&#8221; statements on the JDBC client</li>
</ul>
</div>
<div>Enjoy!</div>
<div>Matt</div>
]]></content:encoded>
			<wfw:commentRss>http://www.ibridge.be/?feed=rss2&amp;p=210</wfw:commentRss>
		</item>
		<item>
		<title>Better Data for Better Analytics</title>
		<link>http://www.ibridge.be/?p=209</link>
		<comments>http://www.ibridge.be/?p=209#comments</comments>
		<pubDate>Tue, 08 May 2012 19:14:58 +0000</pubDate>
		<dc:creator>Matt Casters</dc:creator>
		
		<category><![CDATA[Data Integration]]></category>

		<category><![CDATA[Data Cleaner]]></category>

		<category><![CDATA[Data Quality]]></category>

		<category><![CDATA[Human Inference]]></category>

		<category><![CDATA[Kettle]]></category>

		<category><![CDATA[pentaho]]></category>

		<guid isPermaLink="false">http://www.ibridge.be/?p=209</guid>
		<description><![CDATA[
Dear Kettle friends,
Thursday May 10th, in a few days, I&#8217;ll be joining my friend Kasper Sørensen (the founder and lead architect of DataCleaner, a Human Inference data profiling project) in our web seminar (webinar).  We&#8217;ll be going over a bit of history, our cooperation model as well as the architecture behind the new data quality features.
Register here: http://www.pentaho.com/resources/events/20120510-better-data-for-better-analytics/
Kasper [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft" style="border-style: initial; border-color: initial; float: left; margin-left: 20px; margin-right: 20px;" src="http://www.ibridge.be/images/app-icon-hires.png" alt="" width="120" height="150" /></p>
<p>Dear Kettle friends,</p>
<p>Thursday May 10th, in a few days, I&#8217;ll be joining my friend Kasper <span>Sørensen (</span>the founder and lead architect of <a href="http://datacleaner.eobjects.org">DataCleaner</a>, a <a href="http://www.humaninference.com">Human Inference</a> data profiling project) in our web seminar (webinar).  We&#8217;ll be going over a bit of history, our cooperation model as well as the architecture behind the new data quality features.</p>
<p><strong><span style="text-decoration: underline;">Register here</span></strong>: <a href="http://www.pentaho.com/resources/events/20120510-better-data-for-better-analytics/">http://www.pentaho.com/resources/events/20120510-better-data-for-better-analytics/</a></p>
<p>Kasper will also be doing 3 cool live demos on the subjects of data profiling and data quality.</p>
<p>I hope you&#8217;ll be able to <a href="http://www.pentaho.com/resources/events/20120510-better-data-for-better-analytics/">join the crowd</a> this <strong>Thursday May 10th, 10am PST (Los Angeles), 1pm EST (New York) or 7pm CET (Brussels).</strong></p>
<p>We&#8217;ll be doing our best to answer your data quality questions simultaneously with the presentation.</p>
<p>See you there!</p>
<p>Cheers,<br />
Matt</p>
]]></content:encoded>
			<wfw:commentRss>http://www.ibridge.be/?feed=rss2&amp;p=209</wfw:commentRss>
		</item>
		<item>
		<title>DM-Radio today</title>
		<link>http://www.ibridge.be/?p=208</link>
		<comments>http://www.ibridge.be/?p=208#comments</comments>
		<pubDate>Thu, 08 Mar 2012 13:10:04 +0000</pubDate>
		<dc:creator>Matt Casters</dc:creator>
		
		<category><![CDATA[Data Integration]]></category>

		<category><![CDATA[ETL]]></category>

		<category><![CDATA[Kettle]]></category>

		<category><![CDATA[Marketing]]></category>

		<category><![CDATA[pentaho]]></category>

		<category><![CDATA[Radio]]></category>

		<category><![CDATA[Show]]></category>

		<guid isPermaLink="false">http://www.ibridge.be/?p=208</guid>
		<description><![CDATA[Dear Kettle fans,
Today I&#8217;ll be joining Eric Kavanagh and Jim Ericson on the DM Radio show for an episode titled &#8220;On the move: Why ETL is here to stay&#8221;.
If you&#8217;re interested in listening in, don&#8217;t forget to register at the landing page over here.
All the best,
Matt
]]></description>
			<content:encoded><![CDATA[<p>Dear Kettle fans,</p>
<p>Today I&#8217;ll be joining Eric Kavanagh and Jim Ericson on the DM Radio show for an episode titled &#8220;On the move: Why ETL is here to stay&#8221;.</p>
<p>If you&#8217;re interested in listening in, don&#8217;t forget to <a href="http://www.information-management.com/dmradio/-10022068-1.html">register at the landing page over here</a>.</p>
<p>All the best,</p>
<p>Matt</p>
]]></content:encoded>
			<wfw:commentRss>http://www.ibridge.be/?feed=rss2&amp;p=208</wfw:commentRss>
		</item>
		<item>
		<title>Big Kettle News</title>
		<link>http://www.ibridge.be/?p=207</link>
		<comments>http://www.ibridge.be/?p=207#comments</comments>
		<pubDate>Mon, 30 Jan 2012 14:57:26 +0000</pubDate>
		<dc:creator>Matt Casters</dc:creator>
		
		<category><![CDATA[Data Integration]]></category>

		<category><![CDATA[Open Source]]></category>

		<category><![CDATA[Big Data]]></category>

		<category><![CDATA[NoSQL]]></category>

		<guid isPermaLink="false">http://www.ibridge.be/?p=207</guid>
		<description><![CDATA[Dear Kettle fans,
Today I&#8217;m really excited to be able to announce a few really important changes to the Pentaho Data Integration landscape.  To me, the changes that are being announced today compare favorably to reaching Kettle version 1.0 some 9 years ago, or reaching version 2.0 with plugin support or even open sourcing Kettle [...]]]></description>
			<content:encoded><![CDATA[<p>Dear Kettle fans,</p>
<p>Today I&#8217;m really excited to be able to <a href="http://www.marketwire.com/press-release/Pentaho-Open-Sources-Big-Data-Capabilities-to-Further-Fuel-Widespread-Adoption-1612600.htm">announce</a> a few really important changes to the Pentaho Data Integration landscape.  To me, the changes that are being announced today compare favorably to reaching Kettle version 1.0 some 9 years ago, or reaching version 2.0 with plugin support or even open sourcing Kettle itself&#8230;</p>
<p><strong><span style="text-decoration: underline;">First of all&#8230;</span></strong></p>
<p>Pentaho is again open sourcing an important piece of software.  Today we&#8217;re bringing all big data related software to you as open source software.  This includes all currently available capabilities to access HDFS, MongoDB, Cassandra, HBase, the specific VFS drivers we created as well as the ability to execute work inside of Hadoop (MapReduce), Amazon EMR, Pig and so on.</p>
<p>This is important to you because it means that you can now use Kettle to integrate a multitude of technologies, ranging from files and relational databases to big data and NoSQL.  In other words, you can do this without writing any code.  Take a look at how easy it is to program for Hadoop MapReduce:</p>
<p><object width="800" height="600"><param name="movie" value="http://www.youtube.com/v/KZe1UugxXcs&amp;hl=en&amp;fs=1&amp;rel=0&amp;border=1"><param name="allowFullScreen" value="true"><param name="allowscriptaccess" value="always"><embed src="http://www.youtube.com/v/KZe1UugxXcs&amp;hl=en&amp;fs=1&amp;rel=0&amp;border=1&amp;hd=1" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="800" height="600"></object></p>
<p>In other words, this part of the big news of today allows you to use the best tool for the job, whatever that tool is.  You can now combine the large set of steps and job entries with all the available data sources and use that to integrate everything.  Especially for Hadoop, the time it takes to implement a MapReduce job is really small, which takes the sting out of costly and long training and testing cycles.</p>
<p><strong><span style="text-decoration: underline;">But that&#8217;s not all&#8230;</span></strong></p>
<p>Pentaho Data Integration as well as the new big data plugins are now available under the Apache License 2.0.  This means that it&#8217;s now very easy to integrate Kettle or the plugins in 3rd party software.  In fact, for Hadoop, all major distributions are already supported including: Amazon Elastic MapReduce, Apache Hadoop, Cloudera&#8217;s Distribution including Apache Hadoop (CDH), Cloudera Enterprise, EMC Greenplum HD, HortonWorks Data Platform powered by Apache Hadoop, and MapR&#8217;s M3 Free and M5 Edition.<br />
The change of Kettle from LGPL to Apache License 2.0 was broadly supported by our community and acts as an open invitation for other projects (and companies) to integrate Kettle.  I hope that more NoSQL, Big Data and Big Search communities will reach out to us to work together to broaden our portfolio even further.  The way I see it, the Kettle community just got a whole lot bigger!</p>
<p><strong><span style="text-decoration: underline;">Where are the goodies?</span></strong></p>
<p>The main <a href="http://wiki.pentaho.com/display/BAD/Pentaho+Big+Data+Community+Home">landing page for the Big Data community</a> is placed on our wiki to emphasize our intention to closely work with the various communities to make Pentaho Big Data a success.  You can find all information over there, including a set of videos, PDI 4.3.0 preview download (including Big Data plugins), Hadoop installation instructions, PRD configuration information and much more.</p>
<p>Thanks for your time reading this and thanks for using Pentaho software!</p>
<p>Matt</p>
]]></content:encoded>
			<wfw:commentRss>http://www.ibridge.be/?feed=rss2&amp;p=207</wfw:commentRss>
		</item>
		<item>
		<title>Data Modeling</title>
		<link>http://www.ibridge.be/?p=206</link>
		<comments>http://www.ibridge.be/?p=206#comments</comments>
		<pubDate>Thu, 03 Nov 2011 14:07:44 +0000</pubDate>
		<dc:creator>Matt Casters</dc:creator>
		
		<category><![CDATA[Data Integration]]></category>

		<category><![CDATA[Open Source]]></category>

		<category><![CDATA[metadata]]></category>

		<category><![CDATA[Kettle]]></category>

		<category><![CDATA[Kimball]]></category>

		<category><![CDATA[Multidimensional modeling]]></category>

		<category><![CDATA[PDI]]></category>

		<guid isPermaLink="false">http://www.ibridge.be/?p=206</guid>
		<description><![CDATA[Dear data integration fans,
I&#8217;m a big fan of &#8220;appropriate&#8221; data modeling prior to doing any data integration work.  For a number of folks out there that means the creation of an Enterprise Data Warehouse model in classical Bill Inmon style.  Others prefer to use modern modeling techniques like Data Vault, created by Dan Linstedt.  However, the [...]]]></description>
			<content:encoded><![CDATA[<p>Dear data integration fans,</p>
<p>I&#8217;m a big fan of &#8220;appropriate&#8221; data modeling prior to doing any data integration work.  For a number of folks out there that means the creation of an Enterprise Data Warehouse model in classical <a href="http://en.wikipedia.org/wiki/Bill_Inmon">Bill Inmon</a> style.  Others prefer to use modern modeling techniques like <a href="http://en.wikipedia.org/wiki/Data_Vault_Modeling">Data Vault</a>, created by <a href="http://danlinstedt.com/">Dan Linstedt</a>.  However, the largest group of data warehouse architects use a technique called <a href="http://en.wikipedia.org/wiki/Dimensional_modeling">dimensional modeling</a> championed by <a href="http://en.wikipedia.org/wiki/Ralph_Kimball">Ralph Kimball</a>.</p>
<p>Using a modeling technique is very important since it brings structure to your data warehouse.  The techniques used, when applied correctly of course, help you in a big way to avoid all sorts of pitfalls in the design of a data warehouse.</p>
<p>From my own experience and from what I see in my own Kettle community, dimensional modeling is by far the most popular technique used to create data warehouses.  For that reason (and the fact that I&#8217;m a huge fan of Kimball) I&#8217;ve always made sure to properly support the most complex part of the technique: the <a href="http://en.wikipedia.org/wiki/Slowly_changing_dimension">slowly changing dimension</a>.  For the better part that has made Kettle an excellent choice when it comes to easy translation of your dimensional model to ETL.</p>
<p>However, while these days you have open source tools like <a href="http://www.datawarehousemanagement.org/">Quipu</a> and <a href="http://rapidace.com/">RapidACE</a> for data vault modeling, I was sad to see that not much exists for dimensional modeling in combination with Kettle.</p>
<p>So a few weeks ago I was doing some basic modeling for a new Pentaho <a href="http://jira.pentaho.com/browse/PDI-6665">logging</a> data mart for PDI 4.3 EE.  This data mart will be responsible for the delivery of easy to digest reports, analysis views and dashboards on the subjects of monitoring and logging of Pentaho servers.  Initially I started doing this in a nice Eclipse plugin called <a href="http://www.umlet.com/">UMLet</a> which resulted in a data model like this:</p>
<p><img class="aligncenter" style="vertical-align: middle; border: 1px solid black;" src="http://www.ibridge.be/images/ExecutionResult.png" alt="" width="800" height="400" /></p>
<p>While this result isn&#8217;t the worst diagram you can possibly imagine there are a number of problems with the approach:</p>
<ul>
<li>The information about dimensions, attributes, relationships, &#8230; is not captured in a structured way.</li>
<li>Export of the metadata is not possible in any usable format except for PDF and images.</li>
<li>UMLet, like so many UML and modeling tools is a generic tool that also supports many other features that I&#8217;m not interested in when I&#8217;m doing dimensional modeling.  As a result, creating a model takes time and real effort.</li>
<li>The work needs to be used in your favorite ETL tool so it makes sense to have it handy there, instead of having to use a third party tool.</li>
</ul>
<div>So I thought: wouldn&#8217;t it be great if I had some sort of perspective in Spoon where I could do a bit of dimensional modeling based on a logical Pentaho metadata model?</div>
<div><img src="http://www.ibridge.be/images/starmodeler-perspective.png" alt="" /></div>
<p>Wouldn&#8217;t it be great if I could create a new metadata domain to hold all the star models for a certain data mart?</p>
<p><img src="http://www.ibridge.be/images/starmodeler-create-domain.png" alt="" /></p>
<p>Then wouldn&#8217;t it be great if you could edit your star models in there?</p>
<p><img src="http://www.ibridge.be/images/starmodeler-starmodels.png" alt="" /></p>
<p>The graphics don&#8217;t have to be anything fancy, I thought.  It just needs to automatically position the fact table in the middle and the dimensions around it&#8230;</p>
<p><img src="http://www.ibridge.be/images/starmodeler-starmodel-info.png" alt="" /></p>
<p>Obviously, I would like to be able to edit the name, description and type of the dimension &#8230;</p>
<p><img src="http://www.ibridge.be/images/starmodeler-dimension-type.png" alt="" /></p>
<p>and depending on the type of dimension I would like to insert a bunch of default attributes&#8230;</p>
<p><img src="http://www.ibridge.be/images/starmodeler-dim-attributes.png" alt="" /></p>
<p>Using the standard Kettle data grid, I should be able to copy attributes and other metadata back and forth between dimension dialogs and a spreadsheet as well.</p>
<p>In the fact table definition it would be cool if we could not only specify the facts but also the relationships to the dimension&#8230;</p>
<p><img src="http://www.ibridge.be/images/starmodeler-fact-attributes.png" alt="" /></p>
<p>Because that way we wouldn&#8217;t have to worry about how to draw the star model and we would know everything we would need to know.</p>
<p>If we would have a tool like that we would be able to generate the SQL to generate the physical tables against a certain target database&#8230;</p>
<p><img src="http://www.ibridge.be/images/starmodeler-sql.png" alt="" /></p>
<p>Because if we would have all sorts of knowledge in metadata of the dimensions we could really nicely generate all the required data types, indexes and what not.</p>
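<p>To make that idea a bit more concrete, here is a minimal Java sketch of turning dimension metadata into a CREATE TABLE statement.  This is not the star modeler code: the class name StarModelDdl, the column names and the generic SQL types are all hypothetical, and a real generator would adapt the data types and indexes to the target database.</p>
<pre>public class StarModelDdl {

    // Each column is a {name, sqlType} pair; the first column is used as the primary key.
    public static String createTable(String table, String[][] columns) {
        StringBuilder sql = new StringBuilder("CREATE TABLE " + table + " (\n");
        for (String[] column : columns) {
            sql.append("  ").append(column[0]).append(' ').append(column[1]).append(",\n");
        }
        sql.append("  PRIMARY KEY (").append(columns[0][0]).append(")\n);");
        return sql.toString();
    }

    public static void main(String[] args) {
        // Hypothetical default attributes for a slowly changing dimension.
        String[][] dimCustomer = {
            {"customer_tk", "BIGINT NOT NULL"},
            {"version", "INTEGER"},
            {"date_from", "TIMESTAMP"},
            {"date_to", "TIMESTAMP"},
            {"customer_name", "VARCHAR(100)"}
        };
        System.out.println(createTable("dim_customer", dimCustomer));
    }
}</pre>
<p>Keeping the attributes as plain metadata is what makes this cheap: the same list of columns could feed the SQL, the documentation and the ETL template.</p>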
<p>And then it would be cool to also generate a template transformation to update the dimension and fact tables in the models&#8230;</p>
<p><img src="http://www.ibridge.be/images/starmodeler-etl.png" alt="" /></p>
<p>Well, I thought it would be nice to have that sort of functionality.</p>
<p>Perhaps we could also create a physical Pentaho metadata domain (XMI) from the star domain as well as Mondrian schemas and a PDF with documentation.</p>
<p>OK, so this is coming to a PDI release near you in the short term.  I&#8217;ve only been working on it for a few weeks on and off but you can try <a href="http://kettle4.s3.amazonaws.com/starmodeler.zip">an early version here</a>.  Simply unzip it in the plugins folder of a <a href="http://ci.pentaho.com/job/Kettle/"><strong>PDI 4.3</strong> build</a>.  The plugin needs 4.3 since that version already includes a lot of libraries like Pentaho metadata and reporting and that way I don&#8217;t need to package all those libraries with it.  We can see later how we can deploy on 4.2 as well.</p>
<p>Please provide feedback here or in <a href="http://jira.pentaho.com/browse/PDI-6890">PDI-6890</a>.</p>
<p>Until next time,</p>
<p>Matt</p>
]]></content:encoded>
			<wfw:commentRss>http://www.ibridge.be/?feed=rss2&amp;p=206</wfw:commentRss>
		</item>
		<item>
		<title>Streaming XML content parsing with StAX</title>
		<link>http://www.ibridge.be/?p=205</link>
		<comments>http://www.ibridge.be/?p=205#comments</comments>
		<pubDate>Fri, 12 Aug 2011 16:27:59 +0000</pubDate>
		<dc:creator>Matt Casters</dc:creator>
		
		<category><![CDATA[Data Integration]]></category>

		<category><![CDATA[Open Source]]></category>

		<category><![CDATA[Kettle]]></category>

		<category><![CDATA[parsing]]></category>

		<category><![CDATA[PDI]]></category>

		<category><![CDATA[pentaho]]></category>

		<category><![CDATA[Pentaho Data Integration]]></category>

		<category><![CDATA[XML]]></category>

		<guid isPermaLink="false">http://www.ibridge.be/?p=205</guid>
		<description><![CDATA[Today, one of our community members posted a deviously simple XML format on the forum that needed to be parsed.  The format looks like this:
&#60;RESPONSE&#62;
&#60;EXPR&#62;USD&#60;/EXPR&#62;
  &#60;EXCH&#62;GBP&#60;/EXCH&#62;
  &#60;AMOUNT&#62;1&#60;/AMOUNT&#62;
  &#60;NPRICES&#62;1&#60;/NPRICES&#62;
  &#60;CONVERSION&#62;
    &#60;DATE&#62;Fri, 01 Jun 2001 22:50:00 GMT&#60;/DATE&#62;
    &#60;ASK&#62;1.4181&#60;/ASK&#62;
    &#60;BID&#62;1.4177&#60;/BID&#62;
  &#60;/CONVERSION&#62;

  &#60;EXPR&#62;USD&#60;/EXPR&#62;
  [...]]]></description>
			<content:encoded><![CDATA[<p>Today, one of our community members posted a deviously simple XML format <a href="http://forums.pentaho.com/showthread.php?83480-XPath-in-Get-data-from-XML-tool">on the forum</a> that needed to be parsed.  The format looks like this:</p>
<pre>&lt;RESPONSE&gt;
  &lt;EXPR&gt;USD&lt;/EXPR&gt;
  &lt;EXCH&gt;GBP&lt;/EXCH&gt;
  &lt;AMOUNT&gt;1&lt;/AMOUNT&gt;
  &lt;NPRICES&gt;1&lt;/NPRICES&gt;
  &lt;CONVERSION&gt;
    &lt;DATE&gt;Fri, 01 Jun 2001 22:50:00 GMT&lt;/DATE&gt;
    &lt;ASK&gt;1.4181&lt;/ASK&gt;
    &lt;BID&gt;1.4177&lt;/BID&gt;
  &lt;/CONVERSION&gt;

  &lt;EXPR&gt;USD&lt;/EXPR&gt;
  &lt;EXCH&gt;JPY&lt;/EXCH&gt;
  &lt;AMOUNT&gt;1&lt;/AMOUNT&gt;
  &lt;NPRICES&gt;1&lt;/NPRICES&gt;
  &lt;CONVERSION&gt;
    &lt;DATE&gt;Fri, 01 Jun 2001 22:50:02 GMT&lt;/DATE&gt;
    &lt;ASK&gt;0.008387&lt;/ASK&gt;
    &lt;BID&gt;0.008382&lt;/BID&gt;
  &lt;/CONVERSION&gt;
  ...
&lt;/RESPONSE&gt;</pre>
<p>Typically we parse XML content with the &#8220;<a href="http://wiki.pentaho.com/display/EAI/Get+Data+From+XML">Get Data From XML</a>&#8221; step, which uses XPath expressions to parse this content.  However, since the meaning of the XML content here is determined by position instead of path, this becomes a problem.  To be specific, for each CONVERSION block you need to pick up the last preceding EXPR and EXCH values.  You could solve it like this:</p>
<p><img src="http://www.ibridge.be/images/positional-xml-long.png" alt="" /></p>
<p>Unfortunately, this method parses your file 3 times in full, plus once more for each additional preceding element you need.  The joins also slow things down considerably.</p>
<p>So this is another case where the new &#8220;<a href="http://wiki.pentaho.com/display/EAI/XML+Input+Stream+%28StAX%29">XML Input Stream (StAX)</a>&#8221; step comes to the rescue.  The solution using this step is the following:</p>
<p><img src="http://www.ibridge.be/images/positional-xml-stax.png" alt="" /></p>
<p>Here&#8217;s how it works:</p>
<p>1) The &#8220;positional element.xml&#8221; step flattens the content of the XML file so that each individual StAX event, like &#8220;start of element&#8221;, &#8220;characters&#8221; or &#8220;end of element&#8221;, comes out as a separate row.  For every event you get the path, parent path, element value and so forth.  As mentioned in <a href="http://wiki.pentaho.com/display/EAI/XML+Input+Stream+%28StAX%29">the doc</a> this step is very fast and can handle files of just about any size with a minimal memory footprint.  It will appear in PDI version 4.2.0GA.</p>
<p>2) With a bit of scripting we collect information from the various rows that we find interesting.</p>
<p>3) We filter out only the result lines (the end of the CONVERSION element).  What you get is the following desired output:</p>
<p><img src="http://www.ibridge.be/images/positional-xml-result.png" alt="" /></p>
<p>The usage of JavaScript in this example is not ideal but compared to the reading speed of the XML I&#8217;m sure it&#8217;s fine for most use-cases.</p>
<p>Both examples are up for download <a href="http://forums.pentaho.com/showthread.php?83480-XPath-in-Get-data-from-XML-tool&amp;p=261230#post261230">from the forum</a>.</p>
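<p>For those who prefer to see the idea in code, here is a minimal sketch in plain Java of the same carry-forward logic, using the StAX API that ships with the JDK (javax.xml.stream) outside of PDI.  It is not what the step or the downloadable transformation does internally: the class name PositionalXml, the method parse and the semicolon-separated output are made up for illustration.  It remembers the last EXPR and EXCH values seen and emits one row at the end of each CONVERSION element.</p>
<blockquote>
<pre>import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class PositionalXml {

  // Returns one "EXPR;EXCH;DATE;ASK;BID" line per CONVERSION block (illustrative format).
  static List&lt;String&gt; parse(String xml) {
    List&lt;String&gt; rows = new ArrayList&lt;String&gt;();
    String expr = null, exch = null, date = null, ask = null, bid = null;
    StringBuilder text = new StringBuilder();
    try {
      XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
      while (reader.hasNext()) {
        int event = reader.next();
        if (event == XMLStreamConstants.START_ELEMENT) {
          text.setLength(0);                     // start collecting the text of a new element
          if ("CONVERSION".equals(reader.getLocalName())) {
            date = null; ask = null; bid = null; // values of the previous block must not leak
          }
        } else if (event == XMLStreamConstants.CHARACTERS) {
          text.append(reader.getText());         // text can arrive in several chunks
        } else if (event == XMLStreamConstants.END_ELEMENT) {
          String name = reader.getLocalName();
          String value = text.toString().trim();
          if ("EXPR".equals(name)) expr = value;       // remembered until overwritten
          else if ("EXCH".equals(name)) exch = value;  // remembered until overwritten
          else if ("DATE".equals(name)) date = value;
          else if ("ASK".equals(name)) ask = value;
          else if ("BID".equals(name)) bid = value;
          else if ("CONVERSION".equals(name)) {
            rows.add(expr + ";" + exch + ";" + date + ";" + ask + ";" + bid);
          }
          text.setLength(0);
        }
      }
      reader.close();
    } catch (XMLStreamException e) {
      throw new IllegalArgumentException("Unable to parse XML", e);
    }
    return rows;
  }

  public static void main(String[] args) {
    String xml = "&lt;RESPONSE&gt;&lt;EXPR&gt;USD&lt;/EXPR&gt;&lt;EXCH&gt;GBP&lt;/EXCH&gt;"
        + "&lt;CONVERSION&gt;&lt;DATE&gt;Fri, 01 Jun 2001 22:50:00 GMT&lt;/DATE&gt;"
        + "&lt;ASK&gt;1.4181&lt;/ASK&gt;&lt;BID&gt;1.4177&lt;/BID&gt;&lt;/CONVERSION&gt;&lt;/RESPONSE&gt;";
    for (String row : parse(xml)) {
      System.out.println(row);
    }
  }
}</pre>
</blockquote>
<p>Because the reader only ever holds the current event and a handful of remembered values, memory usage stays flat regardless of the file size, which is the same property that makes the StAX step attractive.</p>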
<p>The &#8220;XML Input Stream (StAX)&#8221; step has also been shown to work well with huge hierarchical XML structures, in files of multiple GB in size.  The step was written by my colleague <a href="http://kettle.bleuel.com/">Jens Bleuel</a> and he <a href="http://kettle.bleuel.com/2011/06/24/the-new-xml-input-stream-stax-step-in-pdi-42/">documented a more complex example on his blog</a>.</p>
<p>Have fun with it!</p>
<p>Matt</p>
]]></content:encoded>
			<wfw:commentRss>http://www.ibridge.be/?feed=rss2&amp;p=205</wfw:commentRss>
		</item>
		<item>
		<title>Real-time streaming data aggregation</title>
		<link>http://www.ibridge.be/?p=204</link>
		<comments>http://www.ibridge.be/?p=204#comments</comments>
		<pubDate>Thu, 28 Jul 2011 08:54:03 +0000</pubDate>
		<dc:creator>Matt Casters</dc:creator>
		
		<category><![CDATA[Data Integration]]></category>

		<category><![CDATA[Open Source]]></category>

		<category><![CDATA[Kettle]]></category>

		<category><![CDATA[PDI]]></category>

		<category><![CDATA[pentaho]]></category>

		<category><![CDATA[Pentaho Data Integration]]></category>

		<category><![CDATA[Real-time]]></category>

		<category><![CDATA[streaming]]></category>

		<category><![CDATA[twitter]]></category>

		<guid isPermaLink="false">http://www.ibridge.be/?p=204</guid>
		<description><![CDATA[Dear Kettle users,
Most of you usually use a data integration engine to process data in a batch-oriented way.  Pentaho Data Integration (Kettle) is typically deployed to run monthly, nightly, hourly workloads.  Sometimes folks run micro-batches of work every minute or so.  However, it&#8217;s lesser known that our beloved transformation engine can also be used to [...]]]></description>
			<content:encoded><![CDATA[<p>Dear Kettle users,</p>
<p>Most of you usually use a data integration engine to process data in a batch-oriented way.  Pentaho Data Integration (Kettle) is typically deployed to run monthly, nightly or hourly workloads.  Sometimes folks run micro-batches of work every minute or so.  However, it&#8217;s less well known that our beloved transformation engine can also be used to stream data indefinitely (never ending) from a source to a target.  This sort of data integration is sometimes referred to as being &#8220;<em>streaming</em>&#8221;, &#8220;<em>real-time</em>&#8221;, &#8220;<em>near real-time</em>&#8221;, &#8220;<em>continuous</em>&#8221; and so on.  Typical examples of situations where you have a never-ending supply of data that needs to be processed the instant it becomes available are JMS (<a href="http://en.wikipedia.org/wiki/Java_Message_Service">Java Message Service</a>), RDBMS log sniffing, on-line fraud analyses, web or application log sniffing or of course &#8230; <a href="http://www.twitter.com">Twitter</a>!  Since Twitter is easily accessed it&#8217;s common for examples to pop up regarding its usage, and in this blog post too we will use this service to demo the Pentaho Data Integration capabilities with respect to processing streaming data.</p>
<p>Here&#8217;s what we want to do:</p>
<ol>
<li>Continuously read all the tweets that are being sent on Twitter.</li>
<li>Extract all the hash-tags used</li>
<li>Count the number of hash-tags used in a one-minute time-window</li>
<li>Report on all the tags that are being used more than once</li>
<li>Put the output in a browser window, continuously update every minute.</li>
</ol>
<p>This is a very generic example but the logic of this can be applied to different fields like JMS, HL7, log sniffing and so on.  It differs from the excellent work that <a href="http://open-bi.blogspot.com/2011/07/query-twitter-with-talend-to-see-what.html">Vincent from Open-BI described earlier this week on his blog</a> in two ways: his Talend job finishes whereas ours never ends, and ours does time-based aggregation rather than aggregation over a finite data set.</p>
<p>Also note that in order for Kettle to fully support multiple streaming data sources we would have to implement support for &#8220;windowed&#8221; (time-based) joins and other nifty things.  We&#8217;ve seen very little demand for this sort of requirement in the past, perhaps because people don&#8217;t know it&#8217;s possible with Kettle.  In any case, if you currently are in need of full streaming data support, have a look at <a href="http://www.sqlstream.com/">SQLStream</a>, they can help you. SQLStream is co-founded by Pentaho&#8217;s <a href="http://julianhyde.blogspot.com/">Julian Hyde</a> of <a href="http://mondrian.pentaho.com/">Mondrian</a> fame.</p>
<p>OK, let&#8217;s see how we can solve our little problem with Kettle instead&#8230;</p>
<p><span style="text-decoration: underline;"><strong>1. Continuously read all the tweets that are being sent on Twitter.</strong></span></p>
<p>For this we are going to use one of the public Twitter web services, one that delivers a never-ending stream of JSON messages:</p>
<blockquote><p>http://stream.twitter.com/1/statuses/sample.json?delimited=length</p></blockquote>
<p>Since the format of the output is never-ending and specific in nature I wrote a small &#8220;User Defined Java Class&#8221; script:</p>
<blockquote>
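<pre>// Imports to place at the top of the UDJC class code (commons-httpclient 3.x)
import java.io.BufferedInputStream;
import java.io.InputStream;
import org.apache.commons.httpclient.Credentials;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpMethod;
import org.apache.commons.httpclient.UsernamePasswordCredentials;
import org.apache.commons.httpclient.auth.AuthScope;
import org.apache.commons.httpclient.methods.PostMethod;
import org.pentaho.di.cluster.SlaveConnectionManager;

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
{
  HttpClient client = SlaveConnectionManager.getInstance().createHttpClient();
  client.setTimeout(10000);
  client.setConnectionTimeout(10000);

  // Basic authentication with the credentials passed in as transformation parameters
  //
  Credentials creds = new UsernamePasswordCredentials(getParameter("USERNAME"), getParameter("PASSWORD"));
  client.getState().setCredentials(AuthScope.ANY, creds);
  client.getParams().setAuthenticationPreemptive(true);

  HttpMethod method = new PostMethod("http://stream.twitter.com/1/statuses/sample.json?delimited=length");

  // Execute request
  //
  InputStream inputStream=null;
  BufferedInputStream bufferedInputStream=null;
  try {
    int result = client.executeMethod(method);

    // the response
    //
    inputStream = method.getResponseBodyAsStream();
    bufferedInputStream = new BufferedInputStream(inputStream, 1000);

    StringBuffer bodyBuffer = new StringBuffer();
    int opened=0;
    int c;
    while ( (c=bufferedInputStream.read())!=-1  &amp;&amp; !isStopped()) {
      char ch = (char)c;
      bodyBuffer.append(ch);

      // Count braces to find the end of one JSON object
      //
      if (ch=='{') opened++; else if (ch=='}') opened--;
      if (ch=='}' &amp;&amp; opened==0) {
        // one JSON block, pass it on!
        //
        Object[] r = createOutputRow(new Object[0], data.outputRowMeta.size());
        String jsonString = bodyBuffer.toString();

        // Skip the length prefix in front of the JSON object
        //
        int startIndex = jsonString.indexOf("{");
        if (startIndex&lt;0) startIndex=0;

        r[0] = jsonString.substring(startIndex);
        putRow(data.outputRowMeta, r);

        bodyBuffer.setLength(0);
      }
    }
  } catch(Exception e) {
    throw new KettleException("Unable to get tweets", e);
  } finally {
    // Abort instead of draining the never-ending response, then close the stream if it was
    // ever opened.  No reset() here: no mark was set, so reset() would always throw.
    //
    method.abort();
    if (bufferedInputStream!=null) {
      try { bufferedInputStream.close(); } catch(Exception e) { /* nothing left to clean up */ }
    }
  }

  setOutputDone();
  return false;
}</pre>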
</blockquote>
<p>As the experienced <a href="http://wiki.pentaho.com/display/EAI/User+Defined+Java+Class">UDJC</a> writers among you will notice: this step never ends as long as the Twitter service keeps on sending more data.  Depending on the stability and popularity of Twitter that can be &#8220;<em>a very long time</em>&#8221;.</p>
<p>You could improve the code even further to re-connect to the service every time it drops away.  Personally I would not do this.  I would rather have the transformation terminate with an error (as it is implemented now), send an alert (e-mail, database, SNMP) and re-start the transformation in a loop in a job.  That way you have a trace in case twitter dies for a few hours.</p>
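<p>One caveat of the simple brace counting in the class above: a &#8220;{&#8221; or &#8220;}&#8221; inside a tweet&#8217;s text would throw the count off.  The following is a minimal sketch, not part of the downloadable example, of a framer that ignores braces inside JSON strings and skips whatever sits between two objects (such as the length prefix).  The class name JsonFramer and its methods feed and split are hypothetical.</p>
<blockquote>
<pre>import java.util.ArrayList;
import java.util.List;

class JsonFramer {
  private final StringBuilder buffer = new StringBuilder();
  private int depth = 0;             // number of currently open braces
  private boolean inString = false;  // are we inside a JSON string literal?
  private boolean escaped = false;   // was the previous character a backslash inside a string?

  // Feeds one character; returns a complete JSON object when one closes, otherwise null.
  String feed(char ch) {
    if (depth == 0 &amp;&amp; ch != '{') {
      return null;                   // between objects: length prefix, newlines, ...
    }
    buffer.append(ch);
    if (inString) {
      if (escaped) escaped = false;
      else if (ch == '\\') escaped = true;
      else if (ch == '"') inString = false;
    } else if (ch == '"') {
      inString = true;
    } else if (ch == '{') {
      depth++;
    } else if (ch == '}') {
      depth--;
      if (depth == 0) {
        String json = buffer.toString();
        buffer.setLength(0);
        return json;
      }
    }
    return null;
  }

  // Convenience helper: splits a complete string into its top-level JSON objects.
  static List&lt;String&gt; split(String stream) {
    JsonFramer framer = new JsonFramer();
    List&lt;String&gt; objects = new ArrayList&lt;String&gt;();
    for (int i = 0; i &lt; stream.length(); i++) {
      String json = framer.feed(stream.charAt(i));
      if (json != null) objects.add(json);
    }
    return objects;
  }

  public static void main(String[] args) {
    for (String json : split("17\r\n{\"text\":\"a } b {\"}\r\n9\r\n{\"n\":1}\r\n")) {
      System.out.println(json);
    }
  }
}</pre>
</blockquote>
<p>In the read loop you would then call feed for every character read and pass on a row whenever it returns something other than null.</p>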
<p><span style="text-decoration: underline;"><strong>2. Extract all the hash-tags used</strong></span></p>
<p>First we&#8217;ll parse the JSON returned by the twitter service, extract the first 5 hash-tags from the message, split this up into 5 rows and count the tags&#8230;</p>
<p><img src="http://www.ibridge.be/images/streaming-time-based-aggregation-hashtags.png" alt="" width="880" height="519" /></p>
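<p>The transformation does this with standard steps.  Purely as an analogy, here is a small Java sketch of the &#8220;first 5 hash-tags&#8221; part, assuming you already have the tweet text as a string and that a hash-tag is a &#8220;#&#8221; followed by letters, digits or underscores.  The class HashTags and the method firstTags are hypothetical names.</p>
<blockquote>
<pre>import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HashTags {
  // Assumption: a tag is '#' followed by one or more letters, digits or underscores
  private static final Pattern TAG = Pattern.compile("#([\\p{L}\\p{N}_]+)");

  // Returns at most max tags, without the leading '#', in order of appearance.
  static List&lt;String&gt; firstTags(String text, int max) {
    List&lt;String&gt; tags = new ArrayList&lt;String&gt;();
    Matcher matcher = TAG.matcher(text);
    while (tags.size() &lt; max &amp;&amp; matcher.find()) {
      tags.add(matcher.group(1));
    }
    return tags;
  }

  public static void main(String[] args) {
    // Each returned tag would become one row in the stream
    for (String tag : firstTags("Loving #ska and #swag tonight", 5)) {
      System.out.println(tag);
    }
  }
}</pre>
</blockquote>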
<p><span style="text-decoration: underline;"><strong>3. Count the number of hash-tags used in a one-minute time-window</strong></span></p>
<p>The counting is easy as you can simply use a &#8220;Group by&#8221;  step.  However, how can we aggregate in a time-based fashion without too much tinkering?   Well, we now have the &#8220;Single Threader&#8221; step which has the option to aggregate in a time-based manner so we might as well use this option:</p>
<p><img src="http://www.ibridge.be/images/streaming-time-based-aggregation-single-threader.png" alt="" /></p>
<p>This step simply accumulates all records in memory until 60 seconds have passed and then performs one iteration of the single threaded execution of the specified transformation.  This is a special execution method that doesn&#8217;t use the typical parallel engine.  Another cool thing about this engine is that the records that go into the engine in the time-window can be grouped and sorted without the transformation being restarted every minute.</p>
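<p>To make the idea of a time window concrete, here is a minimal Java sketch of a one-minute &#8220;tumbling&#8221; counter.  It is not how the &#8220;Single Threader&#8221; step is implemented: the class WindowCounter is hypothetical, and unlike the step it only closes a window when the next row arrives after the window has expired.</p>
<blockquote>
<pre>import java.util.Map;
import java.util.TreeMap;

class WindowCounter {
  private final long windowMillis;
  private final Map&lt;String, Integer&gt; counts = new TreeMap&lt;String, Integer&gt;();
  private boolean started = false;
  private long windowStart;

  WindowCounter(long windowMillis) {
    this.windowMillis = windowMillis;
  }

  // Counts one tag seen at the given time.  If the current window has expired, the
  // counts of that window are returned and a new window starts; otherwise null.
  Map&lt;String, Integer&gt; add(String tag, long nowMillis) {
    Map&lt;String, Integer&gt; flushed = null;
    if (!started) {
      started = true;
      windowStart = nowMillis;
    }
    if (nowMillis - windowStart &gt;= windowMillis) {
      flushed = new TreeMap&lt;String, Integer&gt;(counts);  // hand over a copy of the old window
      counts.clear();
      windowStart = nowMillis;
    }
    Integer old = counts.get(tag);
    counts.put(tag, old == null ? 1 : old + 1);
    return flushed;
  }

  public static void main(String[] args) {
    WindowCounter counter = new WindowCounter(60000L);
    counter.add("ska", 0L);
    counter.add("ska", 1000L);
    System.out.println(counter.add("swag", 60000L));  // the first window is handed over here
  }
}</pre>
</blockquote>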
<p><span style="text-decoration: underline;"><strong>4. Report on all the tags that are being used more than once</strong></span></p>
<p>The filtering is done with a simple &#8220;Filter Rows&#8221; step.  However, thanks to the magic of the &#8220;Single Threader&#8221; step we can sort the rows descending by the tag occurrence count in that one-minute time-window.  It&#8217;s also interesting to note that if you have huge amounts of data, you can easily parallelize your work by starting multiple copies of the single threader step and/or with some clever data partitioning.  In our case we could partition by hash-tag or re-aggregate the aggregated data.</p>
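<p>Expressed in Java, the filter and sort on the counts of one window could look like the sketch below.  The class TopTags and the method report are hypothetical, and breaking ties alphabetically (ignoring case) is an assumption of mine, not something the transformation guarantees.</p>
<blockquote>
<pre>import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

class TopTags {

  // Keeps the tags used more than once, sorts them by count (descending) and
  // returns "nr;hashtag;count" lines.
  static List&lt;String&gt; report(Map&lt;String, Integer&gt; counts) {
    List&lt;Map.Entry&lt;String, Integer&gt;&gt; rows = new ArrayList&lt;Map.Entry&lt;String, Integer&gt;&gt;();
    for (Map.Entry&lt;String, Integer&gt; entry : counts.entrySet()) {
      if (entry.getValue() &gt; 1) {
        rows.add(entry);
      }
    }
    Collections.sort(rows, new Comparator&lt;Map.Entry&lt;String, Integer&gt;&gt;() {
      public int compare(Map.Entry&lt;String, Integer&gt; a, Map.Entry&lt;String, Integer&gt; b) {
        int byCount = b.getValue().compareTo(a.getValue());  // highest count first
        return byCount != 0 ? byCount : a.getKey().compareToIgnoreCase(b.getKey());
      }
    });
    List&lt;String&gt; lines = new ArrayList&lt;String&gt;();
    int nr = 1;
    for (Map.Entry&lt;String, Integer&gt; row : rows) {
      lines.add(nr++ + ";" + row.getKey() + ";" + row.getValue());
    }
    return lines;
  }

  public static void main(String[] args) {
    Map&lt;String, Integer&gt; counts = new java.util.HashMap&lt;String, Integer&gt;();
    counts.put("ska", 5);
    counts.put("swag", 2);
    counts.put("once", 1);
    for (String line : report(counts)) {
      System.out.println(line);
    }
  }
}</pre>
</blockquote>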
<p><span style="text-decoration: underline;"><strong>5. Put the output in a browser window, continuously update every minute.</strong></span></p>
<p>As shown in <a href="http://www.ibridge.be/?p=199">an earlier blog post</a>, we can do this quite easily with a &#8220;Text File Output&#8221; step.  However, we also want to put a small header and a separator between the data from every minute so we end up with a transformation that looks like this:</p>
<p><img src="http://www.ibridge.be/images/streaming-time-based-aggregation-parent.png" alt="" /></p>
<p>The script to print the header looks like this:</p>
<blockquote>
<pre>var out;
if (out==null) {
out = _step_.getTrans().getServletPrintWriter();
out.println("'Real-time' twitter hashtag report, minute based");
out.flush();
}</pre>
</blockquote>
<p>The separator between each minute is simple too:</p>
<blockquote>
<pre>if (nr==1) {
  var out = _step_.getTrans().getServletPrintWriter();
  out.println("============================================");
  out.println();
  out.flush();
}</pre>
</blockquote>
<p>You can execute this transformation on a Carte instance (4.2.0) and see the following output:</p>
<pre>'Real-time' twitter hashtag report, minute based
=================================================

nr;hashtag;count;from;to
1;tatilmayonezi;5;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
2;AUGUST6THBUZZNIGHTCLUB;3;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
3;teamfollowback;3;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
4;ayamzaman;2;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
5;dnd;2;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
6;follow;2;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
7;malhação;2;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
8;rappernames;2;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
9;thingswelearnedontwitter;2;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
=================================================

1;ska;5;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
2;followplanetjedward;4;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
3;chistede3pesos;3;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
4;NP;3;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
5;rappernames;3;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
6;tatilmayonezi;3;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
7;teamfollowback;3;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
8;AvrilBeatsVolcano;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
9;CM6;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
10;followme;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
11;Leão;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
12;NewArtists;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
13;OOMF;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
14;RETWEET;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
15;sougofollow;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
16;swag;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
17;thingswelearnedontwitter;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000

...</pre>
<p>For reference, I used the following URL to start the streaming report:</p>
<pre>http://cluster:cluster@127.0.0.1:8282/kettle/executeTrans/?trans=%2Fhome%2Fmatt%2Ftest-stuff%2FTwitter Stream%2FRead a twitter stream.ktr&amp;USERNAME=MyTwitterAccount&amp;PASSWORD=MyPassword</pre>
<p>I placed the complete example <a href="http://www.ibridge.be/files/twitter-stream.zip">over here</a> in case you want to try this yourself on PDI/Kettle version 4.2.0-RC1 or later.  One thing you can add to make it even cooler is to have this transformation send an e-mail every time a certain hash-tag gets used more than 15 times in a given minute.  That sort of alerting gives you, for example, easy access to emerging trends, events and memes.</p>
<p>For reference, take a look at <a href="http://www.ibridge.be/?p=202">this earlier blog post of mine</a> where I describe the internal cleanup mechanisms inside of Kettle that prevent our transformation from ever running out of memory or resources.</p>
<p>Until next time,</p>
<p>Matt</p>
]]></content:encoded>
			<wfw:commentRss>http://www.ibridge.be/?feed=rss2&amp;p=204</wfw:commentRss>
		</item>
		<item>
		<title>What&#8217;s new in 4.2.0</title>
		<link>http://www.ibridge.be/?p=203</link>
		<comments>http://www.ibridge.be/?p=203#comments</comments>
		<pubDate>Fri, 17 Jun 2011 14:11:50 +0000</pubDate>
		<dc:creator>Matt Casters</dc:creator>
		
		<category><![CDATA[Data Integration]]></category>

		<category><![CDATA[4.2]]></category>

		<category><![CDATA[ETL]]></category>

		<category><![CDATA[Kettle]]></category>

		<category><![CDATA[pentaho]]></category>

		<category><![CDATA[Pentaho Data Integration]]></category>

		<guid isPermaLink="false">http://www.ibridge.be/?p=203</guid>
		<description><![CDATA[I took the time out to build a high level overview of all the new big ticket items that are going to be in the upcoming version 4.2 of Kettle (Pentaho Data Integration).]]></description>
			<content:encoded><![CDATA[<p>Dear Kettle fans,</p>
<p>Instead of pointing to the impressive <a href="http://jira.pentaho.com/secure/ReleaseNote.jspa?projectId=10062&amp;version=11114">list of changes in JIRA</a> I took the time out to build a high level overview of all the new big ticket items that are going to be in the upcoming version 4.2 of Kettle (Pentaho Data Integration).  Allow me to share it with you&#8230;:</p>
<ul>
<li>The <strong>Excel Writer</strong> step offers advanced Excel output functionality to control the look and feel of your spreadsheets.</li>
<li>Graphical performance and progress feedback for transformations</li>
<li>The <strong>Google Analytics</strong> step allows download of statistics from your Google analytics account</li>
<li>The <strong>Pentaho Reporting Output</strong> step makes it possible for you to run your (parameterized) Pentaho reports in a transformation. It allows for easy report bursting of personalized reports.</li>
<li>The <strong>Automatic Documentation</strong> step generates (simple) documentation of your transformations and jobs using the Pentaho Reporting API.</li>
<li>The <strong>Get repository names</strong> step retrieves job and transformation information from your repositories.</li>
<li>The <strong>LDAP Writer</strong> step</li>
<li>The <strong>Ingres VectorWise (streaming) bulk loader</strong> step</li>
<li>The <strong>Greenplum (streaming) bulk loader</strong> step (for gpload)</li>
<li>The <strong>Talend Job Execution</strong> job entry</li>
<li>Healthcare Level 7 : <strong>HL7 Input</strong> step, <strong>HL7 MLLP Input</strong> and <strong>HL7 MLLP Acknowledge</strong> job entries</li>
<li>The <strong>PGP File Encryption,</strong> <strong>Decryption &amp; validation</strong> job entries facilitate encryption and decryption of files using PGP.</li>
<li>The <strong>Single Threader</strong> step for parallel performance tuning of large transformations</li>
<li>Allow a job to be started at a job entry of your choice (continue after fixing an error)</li>
<li>The <strong>MongoDB Input</strong> step (including authentication)</li>
<li>The <strong>ElasticSearch bulk loader</strong></li>
<li>The <strong>XML Input Stream (StAX)</strong> step to read huge XML files at optimal  performance and flat memory usage by flattening the structure of the  data.</li>
<li>The <strong>Get ID from Slave Server</strong> step allows multi-host or clustered transformations to get globally unique integer IDs from a slave server: <a href="http://wiki.pentaho.com/display/EAI/Get+ID+from+Slave+Server" target="_blank">http://wiki.pentaho.com/display/EAI/Get+ID+from+Slave+Server</a></li>
<li>Carte improvements:
<ol>
<li>reserve next value range from a slave sequence service</li>
<li>allow parallel (simultaneous) runs of clustered transformations</li>
<li>list (reserved and free) socket reservations service</li>
<li>new options in XML for configuring slave sequences</li>
<li>allow time-out of stale objects using environment variable <strong style="font-weight: normal;">KETTLE_CARTE_OBJECT_TIMEOUT_MINUTES</strong></li>
</ol>
</li>
<li>Memory tuning of logging back-end with: <strong style="font-weight: normal;">KETTLE_MAX_LOGGING_REGISTRY_SIZE</strong>, <strong style="font-weight: normal;">KETTLE_MAX_JOB_ENTRIES_LOGGED</strong>, <strong style="font-weight: normal;">KETTLE_MAX_JOB_TRACKER_SIZE</strong> allowing for flat memory usage for never ending ETL in general and jobs specifically.</li>
<li>Repository Import/Export
<ol>
<li>Export at the repository folder level</li>
<li>Export and Import with optional rule-based validations</li>
<li>The <strong>Import command line utility</strong> allows for (optionally rule-based) import of lists of transformations, jobs and repository export files: <a href="http://wiki.pentaho.com/display/EAI/Import+User+Documentation">http://wiki.pentaho.com/display/EAI/Import+User+Documentation</a></li>
</ol>
</li>
<li>ETL Metadata Injection:
<ol>
<li>Retrieval of rows of data from a step to the &#8220;metadata injection&#8221; step</li>
<li>Support for injection into the &#8220;Excel Input&#8221; step</li>
<li>Support for injection into the &#8220;Row normaliser&#8221; step</li>
<li>Support for injection into the &#8220;Row Denormaliser&#8221; step</li>
</ol>
</li>
<li>The <strong>Multiway Merge Join</strong> step (experimental) allows for any number of data sources to be joined using one or more keys using an inner or a full outer join algorithm.</li>
</ul>
<p>Beyond this list there is, as mentioned, a long list of bug fixes and small improvements to the various steps and job entries.  It&#8217;s impossible to thank everyone in the community individually for all the contributions they&#8217;ve made to make this release a smashing success.  If you think it feels more like a 5.0 version please remember that we&#8217;re pretty conservative about version numbering.  As long as we don&#8217;t break our own Java API we won&#8217;t go to another major version.</p>
<p>Also remember you can try out all these new features right now by using <a href="http://ci.pentaho.com/job/Kettle/">a CI build</a> or once the RC1 build is posted on <a href="http://sourceforge.net/projects/pentaho/files/">SourceForge</a> later on.  Please help our QA team by posting any issues you might find in <a href="http://jira.pentaho.com/browse/PDI">JIRA</a>.</p>
<p>Last but certainly not least let&#8217;s not forget to mention the upcoming exciting features of the new <strong>Pentaho BI Server version 4</strong>.  I won&#8217;t spoil the surprise for you but I can tell you that certain things in that new release are looking really (really!) nice.  Next Thursday (Europe – 13:00 GMT/UTC, 9:00am EST, Americas – 1:00pm EST, 10:00am PST) you can join us for a web conference with live demo.  Please <a href="http://www.pentaho.com/events/pentaho-bi-4/">register here</a> if you are interested.</p>
<p>Have fun with the new Pentaho software releases!</p>
<p>Regards,<br />
Matt</p>
]]></content:encoded>
			<wfw:commentRss>http://www.ibridge.be/?feed=rss2&amp;p=203</wfw:commentRss>
		</item>
	</channel>
</rss>
