Streaming XML content parsing with StAX
Today, one of our community members posted a deviously simply XML format on the forum that needed to be parsed. The format looks like this:
<RESPONSE> <EXPR>USD</EXPR> <EXCH>GBP</EXCH> <AMOUNT>1</AMOUNT> <NPRICES>1</NPRICES> <CONVERSION> <DATE>Fri, 01 Jun 2001 22:50:00 GMT</DATE> <ASK>1.4181</ASK> <BID>1.4177</BID> </CONVERSION> <EXPR>USD</EXPR> <EXCH>JPY</EXCH> <AMOUNT>1</AMOUNT> <NPRICES>1</NPRICES> <CONVERSION> <DATE>Fri, 01 Jun 2001 22:50:02 GMT</DATE> <ASK>0.008387</ASK> <BID>0.008382</BID> </CONVERSION>
Typically we parse XML content with the “Get Data From XML” step which used XPath expressions to parse this content. However, since the meaning of the XML content is determined by position instead of path, this becomes a problem. To be specific, for each CONVERSION block you need to pick the last preceding EXPR and EXCH values. You could solve it like this:
Unfortunately, this method requires a full parsing of your file 3 times and once extra for each additional preceding element. The joining and all also slows things down considerably.
So this is another case where the new “XML Input Stream (StAX)” step comes to the rescue. The solution using this step is the following:
Here’s how it works:
1) The output of the “positional element.xml” step flattens the content of the XML file so that you can see the output of each individual SAX event like “start of element”, “characters”, “end of element”. Every time you get the path, parent path, element value and so forth. As mentioned in the doc this step is very fast and can handle files with just about any size with a minimal footprint. It will appear in PDI version 4.2.0GA.
2) With a bit of scripting we collect information from the various rows that we find interesting.
3) We filter out only the result lines (the end of the CONVERSION element). What you get is the following desired output:
Both examples are up for download from the forum.
The “XML Input Stream (StAX)” step has also shown to work great with huge hierarchical XML structures, files of multiple GB in size. The step was written by colleague Jens Bleuel and he documented a more complex example on his blog.
Have fun with it!