Figuring out the layout of a delimited file

Hey Kettle friends,

Recently I was asked whether PDI would be able to automatically figure out, at runtime, what the layout of a delimited text file is, simply by looking at the header of the file and by doing various scans of the file to determine the delimiter, the data types of the columns and so on.

Now obviously you could also do without a header row, but then the columns would have to be called something like “field1” to “fieldN”.
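
For that headerless case, here is a minimal sketch of how such default names could be generated in a “Modified Java Script Value” step; fieldCount is just an assumption for the sake of illustration, in practice it would come from counting the delimiters on the first line:

// Sketch: generate "field1" .. "fieldN" default column names
var fieldCount = 5;   // assumption: number of columns found on the first line
var defaultNames = [];
var i;
for (i=0; i<fieldCount; i++) {
  defaultNames[i] = "field" + (i+1);
}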

First, we want to see which delimiter, from a list of possible delimiters, is used in the file.  We do this by reading every line of the file in question and doing a character-by-character analysis of each line.  We count the number of occurrences of every possible delimiter, pick the most frequently used one at the end, and set it as a variable.

Detecting the delimiter

The hardest part of this is doing the counting in JavaScript.  We’re using an associative array:

// A simple algorithm: count every ',' ';' '/' and tab, see which one is used most often
//
var delimiters = [ ',', ';', '/', '\t' ];
var counters = new Array();
var i, d;

// Count the occurrences of every candidate delimiter on this line
for (i=0; i<line.length; i++) {
  var c = line.charAt(i);
  for (d=0; d<delimiters.length; d++) {
    if (c==delimiters[d]) {
      if (counters[c]==null) {
        counters[c]=1;
      } else {
        counters[c]++;
      }
    }
  }
}

// Pick the delimiter with the highest count
var maxC=0;
var max=0;
for (i=0; i<delimiters.length; i++) {
  if (counters[delimiters[i]]>max) {
    maxC=i;
    max=counters[delimiters[i]];
  }
}

var delimiter=""+delimiters[maxC];
var count = counters[delimiters[maxC]];

So the above script populates an array of counters which is then evaluated to see which delimiter wins.  Obviously it should be easy to modify the list of possible participants in this delimiters game.
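
The winning delimiter still has to be made available to the rest of the job; in the example it ends up in a DELIMITER variable, which is read back with getVariable() further down.  A minimal sketch of how that could be done straight from a JavaScript step (the actual example may well use a “Set Variables” step instead) looks like this:

// Sketch: push the detected delimiter into a variable for the next transformation.
// "r" sets the variable with root-job scope.
setVariable("DELIMITER", delimiter, "r");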

Now that we found the delimiter of the file, we can go ahead and determine the field names from the header of the file:

Detect field names

Again, we’re using a bit of JavaScript in the step above:

// Split the header line on the detected delimiter and
// write one output row per field name
var fields = line.split(getVariable("DELIMITER",""));
var i;
for (i=0; i<fields.length; i++) {
  var newRow = createRowCopy(getOutputRowMeta().size());
  var rowIndex = getInputRowMeta().size();
  newRow[rowIndex++] = fields[i];
  putRow(newRow);
}

// leave the "field" output value empty on the original (header) row itself
var field=null;

As you can see, this script converts the header line into separate rows, one per field name.

Previewing the field names detection

OK, but this still doesn’t solve the complete puzzle, as we still need to figure out the hardest part: the data types, lengths, precisions and conversion masks of the various fields.  To do this we pass the column names we detected into another transformation:

metadata-detector-parser

So the “CSV file input” step is configured automatically by the parent transformation to read all fields as Strings.  Then there is another piece of JavaScript that lets us re-use the same functionality that the “Get Fields” buttons of the “Text File Input” and “CSV file input” step dialogs use to automatically detect data types.  In this case the work is done by a class called “StringEvaluator”, which we’ll re-use here:

// Feed every (still String) value of the current row to the evaluator of its column
var i;
for (i=0; i<getInputRowMeta().size(); i++) {
  var stringValue = getInputRowMeta().getString(row, i);
  var evaluator = evaluators.get(i);
  evaluator.evaluateString(stringValue);
}
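
Note that the evaluators list used above has to be created once, before the first row is processed.  That initialization isn’t shown in the screenshots; a minimal sketch of what it could look like, for instance in the step’s start script, is this (one StringEvaluator per input field, using the same constructor that comes up in the comments below this post):

// Sketch (not the exact code of the example): create one StringEvaluator per input column
var evaluators = new java.util.ArrayList();
var i;
for (i=0; i<getInputRowMeta().size(); i++) {
  evaluators.add(new org.pentaho.di.core.util.StringEvaluator(true));
}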

At the end of the step’s life, when the script ends, we have another block of code:

// write the results as new rows

var i;
for (i=0; i<getInputRowMeta().size(); i++) {
  // ask the evaluator of this column what it thinks the best matching data type is
  var evaluator = evaluators.get(i);
  var evalResult = evaluator.getAdvicedResult();
  var valueMeta = evalResult.getConversionMeta();

  // write one result row per column: name, type, length, precision, mask, symbols
  var newRow = createRowCopy(getOutputRowMeta().size());
  var rowIndex = getInputRowMeta().size();

  newRow[rowIndex++] = getInputRowMeta().getValueMeta(i).getName();
  newRow[rowIndex++] = valueMeta.getTypeDesc();
  newRow[rowIndex++] = java.lang.Integer.toString(valueMeta.getLength());
  newRow[rowIndex++] = java.lang.Integer.toString(valueMeta.getPrecision());
  newRow[rowIndex++] = valueMeta.getConversionMask();
  newRow[rowIndex++] = valueMeta.getDecimalSymbol();
  newRow[rowIndex++] = valueMeta.getGroupingSymbol();

  putRow(newRow);
}

After some cleanup we finally have the data types and everything we need to properly read in the file.
You can find the complete example over here.  You can use this on Kettle version 4.4.0 or above.

The output of the example is an XML file called layout.xml.  Obviously you can store it in different formats and/or inject it into another transformation.  You can then, for example, do complete normalization of every column value and store it in a database table.  Not that I would recommend such a generic structure, but it’s possible.

Enjoy!

Matt

10 comments

  • Sathish

    Matt,

    Consider a scenario like this. The flat file which I need to decode for the delimiter and columns with data types resides on a remote machine. So I am using FTP to get the file; if the file is huge, the scanning time is hindering the business. How can I avoid this? Since we are trying this option with a user interface which calls this PDI job to find out the delimiter and columns with data types, the processing/waiting time is more (say 75 seconds), which is not acceptable.

    To be more specific, below are the factors which take the most time:
    1. Time taken to transfer file from remote location to target location.
    2. Scanning through the file to find the delimiter.

  • Hi Sathish, you obviously don’t have to scan the complete file. One would think that after, say, a few thousand rows you would get a sufficiently good sample set to determine the data types of the various fields.
    In the worst case scenario you first do a “Text File Input” to “Text File Output” with a limit of the first 10000 rows, producing a smaller file just for scanning.

  • Sathish

    Yes Matt. I am exactly trying that option only. The only thing I am worried about is that the files exist in a remote location, so using FTP to get the file takes time :( . Thinking it would be better if I am able to access only 100 lines of the file from the remote location using any step (I am trying with a UNIX shell script even), it will be of more help.

  • The text file input step can read directly from the FTP location, but a slow network link is not a problem anyone can solve for you.

  • Sathish

    yes . Thanks

  • Sathish

    Matt,

    Can the above process be modified for a Microsoft Excel input file? If so, what are the challenges, and can you guide me how to do it…

  • I’m pretty sure you can modify the example. If you get stuck, post a few questions on the Kettle forum or contact Pentaho services :-)
    Good luck!

  • Deepthi

    Hi Matt,

    I am stuck in Pentaho Data Integration.

    I am trying to integrate data from two different JSON tables with an Id. I am trying to do it with a Merge join, but when I use a Left Outer join I am able to see only the columns with data from the left table; I am not able to see data from the right table. Can you suggest the best way? Thank you

  • Sathish

    Matt,

    I am getting an error in PDI 4.2.1:
    Java constructor for “org.pentaho.di.core.util.StringEvaluator” with arguments “” not found. (trans_Start#13)

  • Sathish

    Cleared the error by declaring

    var evaluator = new org.pentaho.di.core.util.StringEvaluator(true);