“it could be bunnies”

Splitting large XML files with ColdFusion and StAX

In a recent discussion on cf-talk the question was asked how to improve the performance of ColdFusion when working with very large XML documents. One of the solutions proposed was to use StAX, and that got me thinking. StAX is a stream processor that works very differently from what you may be used to from other XML processors. Instead of viewing an XML document as a whole, with elements in the context of their parents, children and siblings, it just treats the whole document as a sequence of items. Each of these items can be of type elementstart, elementend, comment, entity etc. You work with this by iterating through all the items in your document and processing them one by one. Working that way is sufficiently different that you would have to rewrite all your processing from scratch to switch from the built-in processor to StAX, which makes it a not so attractive solution.
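To make the item-by-item model concrete, here is a minimal Java sketch using the standard javax.xml.stream API. It is not taken from the code discussed in this post; the class name StaxEventWalk and the method trace are made up for illustration. It simply walks a small document and records which kind of item StAX reports at each step.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class StaxEventWalk {

    // Returns a readable trace of the items StAX reports for the given XML text.
    // (Hypothetical helper, for illustration only.)
    public static List<String> trace(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Ask the parser to deliver adjacent text as a single CHARACTERS item.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        List<String> items = new ArrayList<String>();
        try {
            // There is no tree: we only ever see the current item.
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        items.add("start:" + reader.getLocalName());
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        items.add("end:" + reader.getLocalName());
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        // Skip whitespace-only text between elements.
                        if (!reader.isWhiteSpace()) {
                            items.add("text:" + reader.getText());
                        }
                        break;
                    case XMLStreamConstants.COMMENT:
                        items.add("comment:" + reader.getText());
                        break;
                    default:
                        // Other item types are ignored in this sketch.
                        break;
                }
            }
        } finally {
            reader.close();
        }
        return items;
    }

    public static void main(String[] args) throws XMLStreamException {
        for (String item : trace("<a><b id=\"1\">hi</b></a>")) {
            System.out.println(item);
        }
    }
}
```

Note that nothing in the loop knows about parents or siblings. Any context you need (such as how deeply nested you are) has to be tracked by your own code, which is why existing tree-based processing cannot simply be reused.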

But what if we combine a preprocessing step in StAX, which splits the large XML document into smaller pieces, with the regular processing in ColdFusion? StAX is Java, so it is easy to integrate into ColdFusion, and I wrote a sample implementation to test whether this would help. It has some limitations, such as only handling elements, element text and attributes, but it seems to work just fine (and the code is open for improvement). With this I benchmarked some XML files I downloaded from the internet, with the following results:

Source file	Source size	Split on	Records	Time
http://www.ins.cwi.nl/projects/xmark/Assets/standard.gz	111 MB	regions	1	24274 ms
http://www.ins.cwi.nl/projects/xmark/Assets/standard.gz	111 MB	mailbox	21750	146999 ms
ftp://ftp.nlm.nih.gov/nlmdata/sample/medline/medsamp2011h.xml.zip	164 MB	30000	30000	472043 ms

As you can see, how you split a document has a significant impact. I presume this is mostly due to the impact the write operations have on my laptop with a slow 5400 rpm hard disk. On the other hand, in the best-case scenario the parsing speed is over 4 MB per second (111 MB in 24274 ms). Memory consumption stayed under 200 MB for the whole server, so it looks like there are some scenarios where this might be useful.

Code for xmlSplitter.cfc, tested on CF 9.0.1, 64-bit with StAX 1.2.0 and Java 1.6u24 64-bit.
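The CFC itself is not reproduced here. As a rough idea of what the splitting step can look like, below is a Java sketch written under the same limitations mentioned above (elements, element text and attributes only; namespaces, comments and processing instructions are dropped). It is an assumption about the general approach, not the actual xmlSplitter.cfc code: the names XmlSplitSketch, RecordSink and split are invented for this example.

```java
import java.io.Reader;
import java.io.StringWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

public class XmlSplitSketch {

    /** Hypothetical callback that receives one serialized record at a time. */
    public interface RecordSink {
        void accept(String recordXml);
    }

    /**
     * Streams through the source and hands every outermost element named
     * splitOn to the sink as a small standalone XML string.
     * Returns the number of records found.
     */
    public static int split(Reader source, String splitOn, RecordSink sink)
            throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(source);
        XMLOutputFactory outputFactory = XMLOutputFactory.newInstance();
        int records = 0;
        int depth = 0;                 // 0 = outside a record, >0 = nesting level inside one
        StringWriter buffer = null;
        XMLStreamWriter writer = null;
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    if (depth == 0 && reader.getLocalName().equals(splitOn)) {
                        // A new record starts: open a fresh output buffer.
                        buffer = new StringWriter();
                        writer = outputFactory.createXMLStreamWriter(buffer);
                        depth = 1;
                    } else if (depth > 0) {
                        depth++;
                    }
                    if (depth > 0) {
                        writer.writeStartElement(reader.getLocalName());
                        for (int i = 0; i < reader.getAttributeCount(); i++) {
                            writer.writeAttribute(reader.getAttributeLocalName(i),
                                                  reader.getAttributeValue(i));
                        }
                    }
                } else if ((event == XMLStreamConstants.CHARACTERS
                        || event == XMLStreamConstants.CDATA) && depth > 0) {
                    writer.writeCharacters(reader.getText());
                } else if (event == XMLStreamConstants.END_ELEMENT && depth > 0) {
                    writer.writeEndElement();
                    depth--;
                    if (depth == 0) {
                        // The record is complete: hand it over and forget it.
                        writer.close();
                        sink.accept(buffer.toString());
                        records++;
                    }
                }
            }
        } finally {
            reader.close();
        }
        return records;
    }
}
```

Only one record is held in memory at a time, so memory use depends on the size of the largest record rather than on the size of the whole document. In this sketch the sink receives strings; a sink that writes each record to its own file would be closer to the benchmark setup, and would also make the cost of many small write operations visible.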

This entry was posted by Jochem on 2011/02/17 at 01:47 under Uncategorized. Tagged ColdFusion, Prisma IT, StAX, XML.

One Comment

Matthew Lesko says:

Think you need to use Buffered input and output streams in your CFC to avoid OOM exceptions. Elsewise the whole file will be read into memory. This is an example of how I’ve done it: http://stackoverflow.com/questions/4995238/looping-over-a-large-xml-file/4995560#4995560
2011/02/21, 13:53
