November 3rd 2011

Data Modeling

Dear data integration fans,

I’m a big fan of “appropriate” data modeling prior to doing any data integration work.  For a number of folks out there that means the creation of an Enterprise Data Warehouse model in classical Bill Inmon style.  Others prefer to use modern modeling techniques like Data Vault, created by Dan Linstedt.  However, the largest group data warehouse architects use a technique called dimensional modeling championed by Ralph Kimball.

Using a modeling technique is very important since it brings structure to your data warehouse.  The techniques used, when applied correctly of-course, are helping you in a big way to avoid all sorts of pitfalls in the design of a data warehouse.

From my own experience and from what I see in my own Kettle community, dimensional modeling is by far the most popular technique used to create data warehouses.  For that reason (and the fact that I’m a huge fan of Kimball) I’ve always made sure to properly support the most complex part of technique: the slowly changing dimension.  For the better part that has made Kettle an excellent choice when it comes to easy translation of your dimensional model to ETL.

However, where these days you have open source tools like Quipu and RapidACE for data vault modelling I was sad to see that not too much exists for dimensional modeling in combination with Kettle.

So a few weeks ago I was doing some basic modeling for a new Pentaho logging data mart for PDI 4.3 EE.  This data mart will be responsible for the delivery of easy to digest reports, analyses views and dashboards on the subjects of monitoring and logging of Pentaho servers.  Initially I started doing this in a nice Eclipse plugin called UMLet which resulted in a data model like this:

While this result isn’t the worst diagram you can possibly imagine there are a number of problems with the approach:

  • The information about dimensions, attributes, relationships, … is not captured in a structured way.
  • Export of the metadata is not possible in any usable format except for PDF and images.
  • UMLet, like so many UML and modeling tools is a generic tool that also supports many other features that I’m not interested in when I’m doing dimensional modeling.  As a result, creating a model takes time and real effort.
  • The work needs to be used in your favorite ETL tool so it makes sense to be have it handy there, instead of having to use a third party tool.
So I thought: wouldn’t it be great if I had some sort of perspective in Spoon where I could do a bit dimensional modeling based on a logical Pentaho metadata model?

Wouldn’t it be great if I could create a new metadata domain to hold all the star models for a certain data mart?

Then wouldn’t it be great if you could edit your star models in there?

The graphics don’t have to be anything fancy, I thought.  It just needs to automatically position the fact table in the middle and the dimensions around it…

Obviously, I would like to be able to edit the name, description and type of the dimension …

and depending on the type of dimension I would like to insert a bunch of default attributes…

Using standard Kettle data grid I should be able to copy attributes and other metadata back and forth between dimension dialogs and a spreadsheet as well.

In the fact table definition it would be cool if we could not only specify the facts but also the relationships to the dimension…

Because that way we wouldn’t have to worry about how to draw the star model and we would know everything we would need to know.

If we would have a tool like that we would be able to generate the SQL to generate the physical tables against a certain target database…

Because if we would have all sorts of knowledge in metadata of the dimensions we could really nicely generate all the required data types, indexes and what not.

And then it would be cool to also generate a template transformation to update the dimension and fact tables in the models…

Well, I thought it would be nice to have that sort of functionality.

Perhaps we could also create physical Pentaho metadata domain (XMI) from the star domain as well as Mondrian schemas and a PDF with documentation.

OK, so this is coming to a PDI release near you in the short term.  I’ve only been working on it for a few weeks on and off but you can try an early version here.  Simply unzip it in the plugins folder of a PDI 4.3 build.  The plugin needs 4.3 since that version already includes a lot of libraries like Pentaho metadata and reporting and that way I don’t need to package all those libraries with it.  We can see later how we can deploy on 4.2 as well.

Please provide feedback here or in PDI-6890.

Until next time,

Matt

6 Comments »

August 12th 2011

Streaming XML content parsing with StAX

Today, one of our community members posted a deviously simply XML format on the forum that needed to be parsed.  The format looks like this:

<RESPONSE>
<EXPR>USD</EXPR>
  <EXCH>GBP</EXCH>
  <AMOUNT>1</AMOUNT>
  <NPRICES>1</NPRICES>
  <CONVERSION>
    <DATE>Fri, 01 Jun 2001 22:50:00 GMT</DATE>
    <ASK>1.4181</ASK>
    <BID>1.4177</BID>
  </CONVERSION>

  <EXPR>USD</EXPR>
  <EXCH>JPY</EXCH>
  <AMOUNT>1</AMOUNT>
  <NPRICES>1</NPRICES>
  <CONVERSION>
    <DATE>Fri, 01 Jun 2001 22:50:02 GMT</DATE>
    <ASK>0.008387</ASK>
    <BID>0.008382</BID>
  </CONVERSION>
  ...
</RESPONSE>

Typically we parse XML content with the “Get Data From XML” step which used XPath expressions to parse this content.  However, since the meaning of the XML content is determined by position instead of path, this becomes a problem.  To be specific, for each CONVERSION block you need to pick the last preceding EXPR and EXCH values.  You could solve it like this:

Unfortunately, this method requires a full parsing of your file 3 times and once extra for each additional preceding element.  The joining and all also slows things down considerably.

So this is another case where the new “XML Input Stream (StAX)” step comes to the rescue.  The solution using this step is the following:

Here’s how it works:

1) The output of the “positional element.xml” step flattens the content of the XML file so that you can see the output of each individual SAX event like “start of element”, “characters”, “end of element”.  Every time you get the path, parent path, element value and so forth.  As mentioned in the doc this step is very fast and can handle files with just about any size with a minimal footprint.  It will appear in PDI version 4.2.0GA.

2) With a bit of scripting we collect information from the various rows that we find interesting.

3) We filter out only the result lines (the end of the CONVERSION element).  What you get is the following desired output:

The usage of JavaScript in this example is not ideal but compared to the reading speed of the XML I’m sure it’s fine for most use-cases.

Both examples are up for download from the forum.

The “XML Input Stream (StAX)” step has also shown to work great with huge hierarchical XML structures, files of multiple GB in size.  The step was written by colleague Jens Bleuel and he documented a more complex example on his blog.

Have fun with it!

Matt

7 Comments »

July 28th 2011

Real-time streaming data aggregation

Dear Kettle users,

Most of you usually use a data integration engine to process data in a batch-oriented way.  Pentaho Data Integration (Kettle) is typically deployed to run monthly, nightly, hourly workloads.  Sometimes folks run micro-batches of work every minute or so.  However, it’s lesser known that our beloved transformation engine can also be used to stream data indefinitely (never ending) from a source to a target.  This sort of data integration is sometimes referred to as being “streaming“, “real-time“, “near real-time“, “continuous” and so on.  Typical examples of situations where you have a never-ending supply of data that needs to be processed the instance it becomes available are JMS (Java Message Service), RDBMS log sniffing, on-line fraud analyses, web or application log sniffing or of-course … Twitter!  Since Twitter is easily accessed it’s common for examples to pop up regarding it’s usage and in this blog post too we will use this service to demo the Pentaho Data Integration capabilities wrt to processing streaming data.

Here’s what we want to do:

  1. Continuously read all the tweets that are being sent on Twitter.
  2. Extract all the hash-tags used
  3. Count the number of hash-tags used in a one-minute time-window
  4. Report on all the tags that are being used more than once
  5. Put the output in a browser window, continuously update every minute.

This is a very generic example but the logic of this can be applied to different fields like JMS, HL7, log sniffing and so on.  It differs from the excellent work that Vincent from Open-BI described earlier this week on his blog in the sense that his Talend job finishes where ours will never end and where ours will do time-based aggregation in contrast to aggregation over a finite data set.

Also note that in order for Kettle to fully support multiple streaming data sources we would have to implement support for “windowed” (time-based) joins and other nifty things.  We’ve seen very little demand for this sort of requirement in the past, perhaps because people don’t know it’s possible with Kettle.  In any case, if you currently are in need of full streaming data support, have a look at SQLStream, they can help you. SQLStream is co-founded by Pentaho’s Julian Hyde of Mondrian fame.

OK, let’s see how we can solve our little problem with Kettle instead…

1. Continuously read all the tweets that are being sent on Twitter.

For this we are going to use one of the public Twitter web services, one that delivers a never-ending stream of JSON messages:

http://stream.twitter.com/1/statuses/sample.json?delimited=length

Since the format of the output is never-ending and specific in nature I wrote a small “User Defined Java Class” script:

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
{
HttpClient client = SlaveConnectionManager.getInstance().createHttpClient();
client.setTimeout(10000);
client.setConnectionTimeout(10000);

Credentials creds = new UsernamePasswordCredentials(getParameter("USERNAME"), getParameter("PASSWORD"));
client.getState().setCredentials(AuthScope.ANY, creds);
client.getParams().setAuthenticationPreemptive(true);

HttpMethod method = new PostMethod("http://stream.twitter.com/1/statuses/sample.json?delimited=length");

// Execute request
//
InputStream inputStream=null;
BufferedInputStream bufferedInputStream=null;
try {
int result = client.executeMethod(method);

// the response
//
inputStream = method.getResponseBodyAsStream();
bufferedInputStream = new BufferedInputStream(inputStream, 1000);

StringBuffer bodyBuffer = new StringBuffer();
int opened=0;
int c;
while ( (c=bufferedInputStream.read())!=-1  && !isStopped()) {
char ch = (char)c;
bodyBuffer.append(ch);
if (ch=='{') opened++; else if (ch=='}') opened--;
if (ch=='}' && opened==0) {
// one JSON block, pass it on!
//
Object[] r = createOutputRow(new Object[0], data.outputRowMeta.size());
String jsonString = bodyBuffer.toString();

int startIndex = jsonString.indexOf("{");
if (startIndex<0) startIndex=0;

// System.out.print("index="+startIndex+" json="+jsonString.substring(startIndex));

r[0] = jsonString.substring(startIndex);
putRow(data.outputRowMeta, r);

bodyBuffer.setLength(0);
}
}
} catch(Exception e) {
throw new KettleException("Unable to get tweets", e);
} finally {
bufferedInputStream.reset();
bufferedInputStream.close();
}

setOutputDone();
return false;
}

As the experienced UDJC writers among you will notice: this step never ends as long as the twitter service keeps on sending more data.  Depending on the stability and popularity of twitter that can be “a very long time“.

You could improve the code even further to re-connect to the service every time it drops away.  Personally I would not do this.  I would rather have the transformation terminate with an error (as it is implemented now), send an alert (e-mail, database, SNMP) and re-start the transformation in a loop in a job.  That way you have a trace in case twitter dies for a few hours.

2. Extract all the hash-tags used

First we’ll parse the JSON returned by the twitter service, extract the first 5 hash-tags from the message, split this up into 5 rows and count the tags…

3. Count the number of hash-tags used in a one-minute time-window

The counting is easy as you can simply use a “Group by”  step.  However, how can we aggregate in a time-based fashion without too much tinkering?   Well, we now have the “Single Threader” step which has the option to aggregate in a time-based manner so we might as well use this option:

This step simply accumulates all records in memory until 60 seconds have passed and then performs one iteration of the single threaded execution of the specified transformation.  This is a special execution method that doesn’t use the typical parallel engine.  Another cool thing about this engine is that the records that go into the engine in the time-window can be grouped and sorted without the transformation being restarted every minute.

4. Report on all the tags that are being used more than once

The filtering is done with a simple “Filter Rows” step.  However, thanks to the magic of the “Single Threader” step we can sort the rows descending by the tag occurrence count in that one-minute time-window.  It’s also interesting to note that if you have huge amounts of data, that you can easily parallelize your work by starting multiple copies of the single threader step and/or with some clever data partitioning.  In our case we could partition by hash-tag or re-aggregate the aggregated data.

5. Put the output in a browser window, continuously update every minute.

As shown in an earlier blog post, we can do this quite easily with a “Text File Output” step.  However, we also want to put a small header and a separator between the data from every minute so we end up with a transformation that looks like this:

The script to print the header looks like this:

var out;
if (out==null) {
out = _step_.getTrans().getServletPrintWriter();
out.println("'Real-time' twitter hashtag report, minute based");
out.flush();
}

The separator between each minute is simple too:

if (nr==1) {
var out = _step_.getTrans().getServletPrintWriter();
  out.println("============================================");
out.println();
  out.flush();
}

You can execute this transformation on a Carte instance (4.2.0) and see the following output:

'Real-time' twitter hashtag report, minute based
=================================================

nr;hashtag;count;from;to
1;tatilmayonezi;5;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
2;AUGUST6THBUZZNIGHTCLUB;3;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
3;teamfollowback;3;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
4;ayamzaman;2;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
5;dnd;2;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
6;follow;2;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
7;malhação;2;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
8;rappernames;2;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
9;thingswelearnedontwitter;2;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
=================================================

1;ska;5;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
2;followplanetjedward;4;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
3;chistede3pesos;3;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
4;NP;3;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
5;rappernames;3;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
6;tatilmayonezi;3;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
7;teamfollowback;3;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
8;AvrilBeatsVolcano;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
9;CM6;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
10;followme;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
11;Leão;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
12;NewArtists;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
13;OOMF;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
14;RETWEET;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
15;sougofollow;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
16;swag;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
17;thingswelearnedontwitter;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000

...

For reference, I used the following URL to start the streaming report:

http://cluster:cluster@127.0.0.1:8282/kettle/executeTrans/?trans=%2Fhome%2Fmatt%2Ftest-stuff%2FTwitter Stream%2FRead a twitter stream.ktr&USERNAME=MyTwitterAccount&PASSWORD=MyPassword

I placed the complete example over here in case you want to try this yourself on PDI/Kettle version 4.2.0-RC1 or later. Things you can add to make it even cooler is to have this transformation send an e-mail every time a certain hash-tag gets used more than 15 times in a given minute.  That sort of alerting support for example gives you easy access to emerging new trends, events and memes.

For reference, take a look at this earlier blog post of mine where I describe the internal cleanup mechanisms inside of Kettle that prevent our transformation from ever running out of memory or resources.

Until next time,

Matt

10 Comments »

June 17th 2011

What’s new in 4.2.0

Dear Kettle fans,

Instead of pointing to the impressive list of changes in JIRA I took the time out to build a high level overview of all the new big ticket items that are going to be in the upcoming version 4.2 of Kettle (Pentaho Data Integration).  Allow me to share it with you…:

  • The Excel Writer step offers advanced Excel output functionality to control the look and feel of your spreadsheets.
  • Graphical performance and progress feedback for transformations
  • The Google Analytics step allows download of statistics from your Google analytics account
  • The Pentaho Reporting Output step makes it possible for you to run your (parameterized) Pentaho reports in a transformation. It allows for easy report bursting of personalized reports.
  • The Automatic Documentation step generates (simple) documentation of your transformations and jobs using the Pentaho Reporting API.
  • The Get repository names step retrieves job and transformation information from your repositories.
  • The LDAP Writer step
  • The Ingres VectorWise (streaming) bulk loader step
  • The Greenplumb (streaming) bulk loader step (for gpload)
  • The Talend Job Execution job entry
  • Healthcare Level 7 : HL7 Input step, HL7 MLLP Input and HL7 MLLP Acknowledge job entries
  • The PGP File Encryption, Decryption & validation job entries facilitate encryption and decryption of files using PGP.
  • The Single Threader step for parallel performance tuning of large transformations
  • Allow a job to be started at a job entry of your choice (continue after fixing an error)
  • The MongoDB Input step (including authentication)
  • The ElasticSearch bulk loader
  • The XML Input Stream (StAX) step to read huge XML files at optimal performance and flat memory usage by flattening the structure of the data.
  • The Get ID from Slave Server step allows multi-host or clustered transformations to get globally unique integer IDs from a slave server: http://wiki.pentaho.com/display/EAI/Get+ID+from+Slave+Server
  • Carte improvements:
    1. reserve next value range from a slave sequence service
    2. allow parallel (simultaneous) runs of clustered transformations
    3. list (reserved and free) socket reservations service
    4. new options in XML for configuring slave sequences
    5. allow time-out of stale objects using environment variable KETTLE_CARTE_OBJECT_TIMEOUT_MINUTES
  • Memory tuning of logging back-end with: KETTLE_MAX_LOGGING_REGISTRY_SIZE, KETTLE_MAX_JOB_ENTRIES_LOGGED, KETTLE_MAX_JOB_TRACKER_SIZE allowing for flat memory usage for never ending ETL in general and jobs specifically.
  • Repository Import/Export
    1. Export at the repository folder level
    2. Export and Import with optional rule-based validations
    3. Import command line utility allow for rule-based (optional) import of lists of transformations, jobs and repository export files: http://wiki.pentaho.com/display/EAI/Import+User+Documentation
  • ETL Metadata Injection:
    1. Retrieval of rows of data from a step to the “metadata injection” step
    2. Support for injection into the “Excel Input” step
    3. Support for injection into the “Row normaliser” step
    4. Support for injection into the “Row Denormaliser” step
  • The Multiway Merge Join step (experimental) allows for any number of data sources to be joined using one or more keys using an inner or a full outer join algorithm.

Beyond this list there’s as mentioned a long list of bug fixes and small improvements to the various steps and job entries.  It’s impossible to thank the complete community for all the contributions they’ve made to make this release a smashing success.  If you think it feels more like a 5.0 version please remember that we’re pretty conservative about version numbering.  As long as we don’t break our own Java API we won’t go to another major version.

Also remember you can try out all these new features right now by using a CI build or once the RC1 build is posted on SourceForge later on.  Please help our QA team by posting any issues you might find in JIRA.

Last but certainly not least let’s not forget to mention the upcoming exciting features of the new Pentaho BI Server version 4.  I won’t spoil the surprise for you but I can tell you that certain things in that new release are looking really (really!) nice.  Next Thursday (Europe – 13:00 GMT/UTC, 9:00am EST, Americas – 1:00pm EST, 10:00am PST) you can join us for a web conference with live demo.  Please register here if you are interested.

Have fun with the new Pentaho software releases!

Regards,
Matt

3 Comments »

May 31st 2011

Memory tuning fast paced ETL

Dear Kettle friends,

on occasion we need to support environments where not only a lot of data needs to be processed but also in frequent batches.  For example, a new data file with hundreds of thousands of rows arrives in a folder every few seconds.

In this setting we want to use clustering to use “commodity” computing resources in parallel.  In this blog post I’ll detail how the general architecture would look like and how to tune memory usage in this environment.

Clustering was first created around the end of 2006.  Back then it looked like this.

The master

This is the most important part of our cluster.  It takes care of administrating network configuration and topology.  It also keeps track of the state of dynamically added slave servers.

The master is started just like any other slave server below.  It’s just a logical part of a cluster schema but without it we can’t do any clustering.

The slaves

Any Kettle cluster needs one or more slaves servers to do the heavy lifting.  The slave servers can be manually added to the cluster schema using the Spoon browser, or they can be dynamically configured as detailed here.

All slave servers are started by running the Carte server.  You can start it by simply running the following command:

sh carte.sh myhostname 8282

By executing this, we started a small embedded web server that we can actually access on the following URL:  http://myhostname:8282/

An alternative way of running Carte is by passing it an XML configuration file.  You can find a few examples in the pwd/ folder of your Pentaho Data Integration (Kettle) distribution package.

Security

The user names and passwords of the carte instances are defined in the pwd/kettle.pwd file.  The defaults are user: “cluster” and password “cluster”.

A clustered transformation

Our transformation reads the incoming data files in parallel from a network attached storage device.  This is the fastest way to read the file as it eliminates all possible CPU bottlenecks incurred by character code-page conversion and data conversion logic.  The data is then split into pieces with regular expression logic and some complex scripts are applied (add more steps, rinse and repeat to your liking).  Finally the data is landed in an output file or in a database.

To get some idea on the master about how much data was processed we count the row and calculate a total for the complete cluster on the master.  Please note that all steps that are configured to run clustered will be executed on the slave servers (in our case 2) and the others will be executed on the master slave server.

How does this work?

The executing job (or Spoon) takes the definition of your transformation and creates one master transformation and a set of Slave transformations.  This splitting of the original transformation is made possible because Pentaho Data Integration is 100% metadata driven.  So Kettles takes your Kettle metadata, splits it and converts it to make it suitable for execution on the various servers.   If you enable the “Show transformations” option in the transformation execution dialog in Spoon, you can see the generated master and slave transformations.

Sometimes (like in our example) data needs to be sent from the master to the slaves or vice-versa.  We don’t send this data over web services as it would be too slow.  For this we use TCP/IP sockets.  The port numbers are allocated in advance and administered by the master server.  You can list all the allocated sockets for a certain host by using the following call in your favorite browser:

http://masterhost:8282/kettle/listSocket/?host=slavehost

If you only want to see the open (or used) sockets, you can add  “&onlyOpen=Y” to the URL.

Then the transformations are sent to the master and slave servers using the web services (simple servlets actually) that are exposed on the Carte instances.   They arrive on the slave servers and are put in a “Waiting” state.  Then all the transformations are initialized (or prepared) with another web service call.  At this point any server sockets are opened for those steps that needs to send data to a remote next step (from a slave to a master or vice versa).  When all that completed successfully we start the transformations.

After the clustered execution of the master and slave transformations is competed we run a clean-up on the various slaves where sockets are closed and where server ports on the master are deallocated for re-use by another transformation.

Parallel clustering

To further help out with fast-paced environments, version 4.2.0-RC (due real soon) will support parallel execution of the same or different clustered transformations on the same cluster.  This allows you to devise a strategy where a clustered execution is started for each incoming file without regards to the state of the previous execution.  To keep the various executions apart we’ve introduced a new internal Kettle variable called ${Internal.Cluster.Run.ID} which you can use on the master and the slaves to write to different output files for example.

A never ending job

If you check the “Repeat” option in the dialog of the “Start” job entry, you get a never ending job unless you execute an “Abort job” job entry.  This is very convenient if you want to have logic that keeps looking for new files or more things to do until nothing is left.  We can also use this to grab another value from a message queue, a web service or a directory full of files and near-real time data integration in general.

In the past this option somewhat became discredited because memory management in versions prior to the upcoming 4.1 was not as good as it is now.  Let’s take a look at what those options in version 4.2 are…

Time out stale carte objects

Every time you execute a clustered transformation you will see a new master and slave transformation appear in a “Waiting” state on slave server.  If you execute every couple of minutes or even faster, you will get a long list of logged transformations on the slave servers.  By default (in 4.2) they will be automatically purged but only after one day.  Setting a faster time-out period is important in this case.  You can do this either by setting the <object_timeout_minutes> option in your carte configuration file.  Another option is to set the KETTLE_CARTE_OBJECT_TIMEOUT_MINUTES variable in the kettle.properties files on the various servers.

Configure the logging back-end

Since version 4, the logging back-end of Kettle was completely re-written to be as flexible as possible.  This logging back-end also provides us with valuable tracing and lineage information.  Every executable component in Kettle, every database connection, transformation, job, step or job entry has its own unique logging channel ID.  Upon creation, the component registers itself in a central logging registry where we keep track of the name and type of the component (and so on) but also which parent it has.  With this we know the complete execution lineage of a job for example.
During execution, log lines are kept in a central log buffer so that we can easily retrieve the lines incrementally (in Spoon) or store them in a database table. (transformation or job logging for example).  Each line references the logging registry so we know at all times where it came from.

All this information is being managed as detailed on this wiki page but in high-paced environments it is still wise to set limits as to how much memory can be consumed by the logging back-end.  For this there are a few parameters you can set:

  • KETTLE_MAX_LOGGING_REGISTRY_SIZE : Make sure to consider this parameter in fast paced environments where a job never ends and the registry is not cleaned automatically because of this.  The default of 1000 should be enough to provide accurate logging.  If you have complex jobs you might want to increase this number.
  • KETTLE_MAX_JOB_ENTRIES_LOGGED : For never ending jobs this makes a big difference.  Please note that you can enable interval logging on the job entry log table.  Make sure you keep enough entries in memory until the next time you write them out to the database table.  The default is also set to a reasonably low 1000 entries.
  • KETTLE_MAX_JOB_TRACKER_SIZE: Again this parameter makes a difference in never ending jobs as it allows another possible memory leak to be cleaned up automatically beyond a certain size.  The job tracker keeps track of the results of job entries.  In a never ending job you rarely need more than the default, 1000 again.
  • KETTLE_MAX_LOG_SIZE_IN_LINES: If you accidentally execute a transformation in say “Row level” logging mode, an enormous amount of very detailed logging will be produced.  In the past, before version 4, this was usually a common cause for running out of memory and crashing your whole cluster.  By setting this value to a fair maximum (default is 5000 in 4.2) you will prevent this situation.  You can also specify this parameter in your Carte XML configuration file with <max_log_lines> parameter.
  • KETTLE_MAX_LOG_TIMEOUT_IN_MINUTES: If you prefer to let the records time out after a while, then that is possible.  You can specify a maximum age with this parameter.  The default maximum age is 1440 (one day).    You can also specify this parameter in your Carte XML configuration file with <max_log_timeout_minutes> parameter.

As someone who had the pleasure of testing the various settings for the past couple of weeks I can say that it all works nicely.  Once a job (executing 3 clustered transformations in parallel processing hundreds of millions of rows) runs for thousands of iterations and for days on end without consuming more than a few hundred MB you can be sure that memory management is under control.

Enjoy!

Matt

4 Comments »

Next »

Pentaho world image