Using XSLT for Very Large Files (October 20th, 2008)

While working on one of my projects recently, I noticed a curious problem: the server was running out of memory during an XSLT transform. That was strange, because the transform in question was rather simple and the server (an EC2 instance) had plenty of memory. After further investigation, it turned out that the culprit was the size of the input XML (over 300 MB), which was clogging up the memory. It seems that most XSLT processors, including libxslt, which I was using, load the entire input XML into memory before doing the transform. A better alternative is a SAX-like process in which the input XML is read and transformed incrementally, something that is called “streaming”. There are several solutions:

1. The Saxon XSLT processor supports a “streaming mode” that allows processing of files up to 20 GB. BUT this feature is only available in the commercial version.

2. An alternative to XSLT is something called STX, or “Streaming Transformations for XML”, which is specifically designed to address this issue. HOWEVER, unlike XSLT, it is not a standard of any sort, and there are only two implementations.

3. There is a streaming XSLT processor released by a team at a national laboratory, but I have misplaced the link.

4. The Apache Xalan XSLT processor in incremental mode (note that this is NOT true streaming, since the entire original file is eventually loaded into memory).
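To make the "SAX-like" idea concrete, here is a minimal sketch in Python (my choice for illustration; it is not one of the processors listed above). It only counts elements, so it is not a transform, but it shows the key property: the handler sees one parse event at a time and no tree of the whole document is ever built. The class and function names are made up for this example.

```python
import io
import xml.sax


class ElementCounter(xml.sax.ContentHandler):
    """Counts elements with a given tag as parse events arrive.

    Hypothetical helper for illustration; it never builds a document tree.
    """

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.count = 0

    def startElement(self, name, attrs):
        # Called once per opening tag, in document order.
        if name == self.tag:
            self.count += 1


def count_elements(source, tag):
    """Stream `source` (a file name or binary file object) through SAX."""
    handler = ElementCounter(tag)
    xml.sax.parse(source, handler)
    return handler.count


if __name__ == "__main__":
    # A tiny in-memory stand-in for a very large input file.
    sample = io.BytesIO(
        b"<records><record id='1'/><record id='2'/><record id='3'/></records>"
    )
    print(count_elements(sample, "record"))
```

Because the handler keeps only a counter, memory use should stay roughly constant no matter how large the input is. A real transform would have to emit output from inside the handler callbacks, which is where the convenience of XSLT is lost.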

For my project I chose #4, Apache Xalan, because (a) I wanted an open source solution and (b) I wanted to stick to the XSLT standard as opposed to STX. I might look into STX in the future to reduce the original XML file in size, and then process the smaller file further using standard XSLT tools.
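The "reduce first, then transform" idea can be sketched without STX. The following is a rough Python stand-in, not STX and not what I actually ran: it streams through the input with `iterparse`, keeps only the records that pass a filter, and writes a smaller file that an ordinary XSLT processor could then handle. It assumes the records are direct children of the root element; the function name, the `order` tag, and the `status` attribute are all invented for the example.

```python
import io
import xml.etree.ElementTree as ET


def reduce_xml(source, sink, record_tag, keep):
    """Copy only the records for which keep(elem) is true.

    Assumes records are direct children of the root element.
    `source` is a binary file object, `sink` a text file object.
    Root attributes are not copied in this simplified sketch.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    sink.write(f"<{root.tag}>")
    kept = 0
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            if keep(elem):
                elem.tail = None  # do not carry inter-record whitespace over
                sink.write(ET.tostring(elem, encoding="unicode"))
                kept += 1
            # Drop finished records so they do not accumulate under the root.
            root.clear()
    sink.write(f"</{root.tag}>")
    return kept


if __name__ == "__main__":
    sample = io.BytesIO(
        b"<orders>"
        b"<order status='open'><id>1</id></order>"
        b"<order status='closed'><id>2</id></order>"
        b"<order status='open'><id>3</id></order>"
        b"</orders>"
    )
    out = io.StringIO()
    print(reduce_xml(sample, out, "order", lambda e: e.get("status") == "open"))
    print(out.getvalue())
```

The call to `root.clear()` is what keeps memory flat: without it, every parsed record would stay attached to the root and the whole file would end up in memory anyway, which is the same caveat as with Xalan's incremental mode.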

Now the good news is that the next step in the XSLT standardization process at W3C is streaming, with something called XSLT2++. Take a look at this O’Reilly news article and a blog post from Michael Kay (the editor of the XSLT WG at W3C). The bad news is that the standards process will take at least 18 months, and who knows how long the actual implementations will take.

Comments?
