In a recent discussion on cf-talk, someone asked how to improve ColdFusion's performance when working with very large XML documents. One of the proposed solutions was to use StAX, and that got me thinking. StAX is a stream processor that works very differently from what you may be used to from other XML processors. Instead of viewing an XML document as a whole, with elements in the context of their parents, children and siblings, it treats the whole document as a sequence of items. Each of these items has a type, such as element start, element end, comment or entity. You work with it by iterating through all the items in your document and processing them one by one. That is sufficiently different that you would have to rewrite all your processing from scratch to switch from the built-in processor to StAX, which makes it a less attractive solution.
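To make the "sequence of items" idea concrete, here is a minimal Java sketch that walks a small document with the standard `javax.xml.stream` cursor API and records a label for each item it meets. The class name `StaxEventWalk`, the method `describe` and the label format are made up for this illustration; it is not taken from any existing code.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration only.
class StaxEventWalk {

    // Returns one short label per item the parser reports, in document order.
    static List<String> describe(String xml) {
        List<String> items = new ArrayList<>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Ask the parser to report adjacent text as a single CHARACTERS item.
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        // Attributes are not separate items; they are read from the start item.
                        items.add("start:" + reader.getLocalName());
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        items.add("end:" + reader.getLocalName());
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        if (!reader.isWhiteSpace()) {
                            items.add("text:" + reader.getText());
                        }
                        break;
                    case XMLStreamConstants.COMMENT:
                        items.add("comment:" + reader.getText());
                        break;
                    default:
                        // Other item types (processing instructions, DTD, ...) are skipped here.
                        break;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return items;
    }

    public static void main(String[] args) {
        for (String item : describe("<a><!--note--><b id=\"1\">hi</b></a>")) {
            System.out.println(item);
        }
    }
}
```

Note that the loop never sees a tree: if you need to know where you are in the document, you have to track that yourself, which is exactly why existing DOM-style processing cannot simply be reused.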
But what if we combine a preprocessing step in StAX, which splits the large XML document into smaller pieces, with the regular processing in ColdFusion? StAX is Java, so it is easy to integrate into ColdFusion, and I wrote a sample implementation to test whether this would help. It has some limitations, such as only handling elements, element text and attributes, but it seems to work just fine (and the code is open for improvement). With it I benchmarked some XML files I downloaded from the internet, with the following results:
Source file | Source size | Split on | Records | Time |
---|---|---|---|---|
http://www.ins.cwi.nl/projects/xmark/Assets/standard.gz | 111 MB | regions | 1 | 24274 ms |
http://www.ins.cwi.nl/projects/xmark/Assets/standard.gz | 111 MB | mailbox | 21750 | 146999 ms |
ftp://ftp.nlm.nih.gov/nlmdata/sample/medline/medsamp2011h.xml.zip | 164 MB | 30000 | 30000 | 472043 ms |
As you can see, how you split a document has a significant impact. I presume this is mostly due to the write operations on my laptop's slow 5400 rpm hard disk. On the other hand, in the best-case scenario the parsing speed is over 4 MB per second (111 MB in about 24 seconds). Memory consumption stayed under 200 MB for the whole server, so it looks like there are some scenarios where this might be useful.
Code for xmlSplitter.cfc, tested on CF 9.01, 64-bit with StAX 1.2.0 and Java 1.6u24 64-bit.
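For readers who want to see the general idea without opening the CFC, the splitting step could look roughly like the Java sketch below. This is not the xmlSplitter.cfc code: the class `XmlSplitSketch`, the method `split` and the `Consumer` sink are assumptions made for this illustration. Like the implementation described above, it only copies elements, element text and attributes. It hands each record to a callback as a string; a real splitter would more likely write each record to its own file.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical sketch of a StAX-based splitter; names are illustrative.
class XmlSplitSketch {

    // Streams through the input and passes every <splitOn> element, including
    // its children, to the sink as a small XML string. Returns the record count.
    static int split(InputStream in, String splitOn, Consumer<String> sink) {
        int records = 0;
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
            XMLOutputFactory outputFactory = XMLOutputFactory.newInstance();
            StringWriter buffer = null;
            XMLStreamWriter writer = null; // non-null only while inside a record
            int depth = 0;                 // nesting depth inside the current record
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    if (writer == null && reader.getLocalName().equals(splitOn)) {
                        buffer = new StringWriter();
                        writer = outputFactory.createXMLStreamWriter(buffer);
                    }
                    if (writer != null) {
                        depth++;
                        writer.writeStartElement(reader.getLocalName());
                        for (int i = 0; i < reader.getAttributeCount(); i++) {
                            writer.writeAttribute(reader.getAttributeLocalName(i),
                                    reader.getAttributeValue(i));
                        }
                    }
                } else if (writer != null && (event == XMLStreamConstants.CHARACTERS
                        || event == XMLStreamConstants.CDATA)) {
                    writer.writeCharacters(reader.getText());
                } else if (writer != null && event == XMLStreamConstants.END_ELEMENT) {
                    writer.writeEndElement();
                    if (--depth == 0) {
                        // The record is complete: flush it and hand it over.
                        writer.flush();
                        writer.close();
                        sink.accept(buffer.toString());
                        records++;
                        writer = null;
                    }
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not split XML", e);
        }
        return records;
    }

    public static void main(String[] args) {
        String xml = "<root><item id=\"1\">a</item><item id=\"2\"><n>b</n></item></root>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        int count = split(in, "item", System.out::println);
        System.out.println(count + " records");
    }
}
```

Because only one record is held in memory at a time, memory use depends on the size of the largest record rather than on the size of the whole document. That also fits the benchmark above, where the choice of split element made a large difference.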
Matthew Lesko says:
Think you need to use Buffered input and output streams in your CFC to avoid OOM exceptions. Elsewise the whole file will be read into memory. This is an example of how I’ve done it: http://stackoverflow.com/questions/4995238/looping-over-a-large-xml-file/4995560#4995560
2011/02/21, 13:53