Streaming XML content parsing with StAX

Today, one of our community members posted a deviously simple XML format on the forum that needed to be parsed.  The format looks like this:

<RESPONSE>
  <EXPR>USD</EXPR>
  <EXCH>GBP</EXCH>
  <AMOUNT>1</AMOUNT>
  <NPRICES>1</NPRICES>
  <CONVERSION>
    <DATE>Fri, 01 Jun 2001 22:50:00 GMT</DATE>
    <ASK>1.4181</ASK>
    <BID>1.4177</BID>
  </CONVERSION>

  <EXPR>USD</EXPR>
  <EXCH>JPY</EXCH>
  <AMOUNT>1</AMOUNT>
  <NPRICES>1</NPRICES>
  <CONVERSION>
    <DATE>Fri, 01 Jun 2001 22:50:02 GMT</DATE>
    <ASK>0.008387</ASK>
    <BID>0.008382</BID>
  </CONVERSION>
  ...
</RESPONSE>

Typically we parse XML content with the “Get Data From XML” step, which uses XPath expressions to parse this content.  However, since the meaning of this XML content is determined by position instead of path, that becomes a problem.  To be specific, for each CONVERSION block you need to pick up the last preceding EXPR and EXCH values.  You could solve it like this:

Unfortunately, this method requires parsing the full file three times, plus once more for each additional preceding element.  All the joining also slows things down considerably.
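For reference, the “last preceding element” lookup can be expressed in XPath 1.0 with the reverse `preceding-sibling` axis, where `[1]` means “nearest”.  Here’s a minimal sketch in plain Java (the class and method names are mine, just for illustration) — note that it builds a full DOM in memory, which is exactly the scaling problem discussed below:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class PositionalXPath {

    // Parse the XML into a DOM document (loads the whole file into memory).
    public static Document load(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // For the i-th CONVERSION block (1-based), look up the nearest
    // preceding EXPR and EXCH siblings.  On the reverse axis
    // preceding-sibling, the predicate [1] selects the closest node.
    public static String[] pairFor(Document doc, int i) throws Exception {
        XPath xp = XPathFactory.newInstance().newXPath();
        String base = "/RESPONSE/CONVERSION[" + i + "]/";
        return new String[] {
            xp.evaluate(base + "preceding-sibling::EXPR[1]", doc),
            xp.evaluate(base + "preceding-sibling::EXCH[1]", doc)
        };
    }
}
```

This works, but every `evaluate` call walks the tree again, so the cost grows with both file size and the number of preceding elements you need.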

So this is another case where the new “XML Input Stream (StAX)” step comes to the rescue.  The solution using this step is the following:

Here’s how it works:

1) The output of the “positional element.xml” step flattens the content of the XML file so that you can see the output of each individual parsing event, such as “start of element”, “characters”, and “end of element”.  For every event you get the path, parent path, element value and so forth.  As mentioned in the documentation, this step is very fast and can handle files of just about any size with a minimal memory footprint.  It will appear in PDI version 4.2.0GA.

2) With a bit of scripting we collect information from the various rows that we find interesting.

3) We keep only the result lines (the end of each CONVERSION element).  What you get is the following desired output:

Using JavaScript in this example is not ideal performance-wise, but compared to the XML reading speed I’m sure it’s fine for most use-cases.
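The three steps above boil down to one streaming pass: remember the last EXPR/EXCH you saw, and emit a result row every time a CONVERSION element ends.  Here’s a minimal sketch of that idea using the raw StAX API (`javax.xml.stream`) directly — the class name and the returned row layout are made up for illustration, this is not the PDI step’s actual code:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class PositionalStax {

    // Single streaming pass over the positional RESPONSE format:
    // carry the last EXPR/EXCH values along, and emit one row
    // [expr, exch, ask, bid] at the end of each CONVERSION block.
    public static List<String[]> parse(String xml) throws Exception {
        XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        List<String[]> rows = new ArrayList<>();
        String expr = null, exch = null, ask = null, bid = null;
        StringBuilder text = new StringBuilder();
        while (r.hasNext()) {
            switch (r.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    text.setLength(0);          // new element, reset text buffer
                    break;
                case XMLStreamConstants.CHARACTERS:
                    text.append(r.getText());   // collect character data
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    String name = r.getLocalName();
                    String value = text.toString().trim();
                    if ("EXPR".equals(name))      expr = value;
                    else if ("EXCH".equals(name)) exch = value;
                    else if ("ASK".equals(name))  ask = value;
                    else if ("BID".equals(name))  bid = value;
                    else if ("CONVERSION".equals(name))
                        rows.add(new String[] { expr, exch, ask, bid });
                    break;
            }
        }
        return rows;
    }
}
```

Because nothing but a handful of fields is kept in memory between events, the same loop handles a multi-GB file just as happily as the small sample above — which is precisely what makes the streaming approach attractive here.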

Both examples are up for download from the forum.

The “XML Input Stream (StAX)” step has also been shown to work great with huge hierarchical XML structures, files of multiple GB in size.  The step was written by my colleague Jens Bleuel, and he documented a more complex example on his blog.

Have fun with it!

Matt

7 comments

  • Hi everyone,

    I don’t know how to put this question on the forum. I have been a follower but never actually posted anything on it.

    There’s one problem I am looking to solve. It goes as follows…

    There’s a live XML feed URL which gets updated every few seconds, say once every 30 seconds on average. Some feeds might come in even more frequently than that.

    My question here is simple: can I use the streaming XML input step to process the incoming records? What I am doing currently is using a normal XML input step and storing the already-processed records in an in-memory table per class per tag, and updating that in-memory DB table for every record processed.

    Does the streaming XML input keep a record of which row it has processed and continue with the following incoming records, or does it go through the whole XML to parse it for every record?

    And I also want to do this for numerous XML feeds at a time. Can one streaming XML input step do this job for me? I pass the URLs of the feeds dynamically through a variable from the previous step.

    Summing up: I want to record events from multiple XML feeds which are being updated live (in current time). Moreover, I want to do all this one record at a time for each feed.

    I hope I was clear in putting forward the question.

    I wanted to confirm before I actually jump in to put the effort on developing the code.

    Please suggest.

    -Shravan

  • It’s a streaming parser, as it says in the title, so obviously it does not go back to the beginning of the file to read the next record.
    So no, you are not completely clear. Feel free to post your question and a sample on the forum if you’d like to get more help!

    Matt

  • Thanks for the reply Matt..

    I will try my utmost this time.

    Joyce
    Asker

    Fossum
    Ingebretsen
    Mjøndalen

    This is how my XML feed looks, and the data in this feed is continuously updated (in the sense that there are rows flowing into the file every few seconds, say).

    Now there are N files like this. I have to process all the feeds coming in simultaneously, but at the same time, per feed, I have to process row by row. It’s like parsing different XML files in parallel, but the records of each feed should be processed one row at a time.

    This means that if there are 10 such feeds, all 10 feeds should be read, but no two records of the same feed should be processed further in the job at the same time. If there are records being processed simultaneously, they should be from different feeds.

    So how do I do it?

    Please give me a yell if you did not get what I said… 🙂

    thanks in advance.
    Shravan

  • Shravan

    Sorry for the spam… I was trying to post the exact format of the XML.

    Here it is:

    <!--
    Joyce
    Asker

    Fossum
    Ingebretsen
    Mjøndalen
    -->

  • Hi all,

    I think this will work, but unfortunately I am getting the following error in the JavaScript code:
    ReferenceError: “xml_data_type_description” is not defined. (script#4)

    We must define xml_data_type_description somewhere in our code, or some configuration must be missing.

    An answer please…

  • Please don’t turn this blog into a support forum. Post your more complex problems on our forum.

    Thanks!!!
    Matt