Processing Large XML Files
Recently, a team member experienced performance degradation in an application that was processing a large XML file. This post examines the potential causes of, and solutions to, the problems that come with processing large XML files.
In general, significant performance problems can arise when parsing large XML files (that is, files approaching 1 MB or greater). A common need is to extract values from specific nodes in the hierarchy and do something with the result. The most obvious choice is to load the XML document into the DOM and use XPath queries to extract the data. That is not a bad choice when dealing with small data sets. With large data sets, however, it can take an enormous amount of time and resources. If the file is very large, the application could in theory run out of memory. There are also general application design issues to consider, such as how the app will behave while the lengthy process is running. Will the application just “hang” until it is finished, or will it provide meaningful feedback?
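To make the "load everything, then query" approach concrete, here is a minimal sketch. It is written in Python rather than .NET purely for illustration, and the `orders`/`order`/`total` element names and the `totals_via_dom` helper are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the element names are assumptions.
SAMPLE = """<orders>
  <order id="1"><total>10.50</total></order>
  <order id="2"><total>4.25</total></order>
</orders>"""

def totals_via_dom(xml_text):
    # fromstring builds the complete element tree in memory
    # before any query can run against it.
    root = ET.fromstring(xml_text)
    # ElementTree supports a limited XPath subset; this path
    # selects the <total> child of every <order> under the root.
    return [float(node.text) for node in root.findall("./order/total")]

if __name__ == "__main__":
    print(totals_via_dom(SAMPLE))
```

This is convenient for small inputs, but the whole tree has to exist in memory before the first value comes back, which is exactly the cost that grows with file size.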
XmlTextReader and XPathReader Technologies
The goal is to minimize the amount of XML that must be loaded into memory at any one time. If the XML is small, this consideration is moot. To overcome the memory consumption issue when processing large XML files, some form of streaming technology can be used, such as an XmlTextReader and/or XPathReader. These readers have limitations, though: because they read forward through the document one node at a time, extracting data is not as clean as querying a fully loaded DOM.
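The streaming idea can be sketched as follows. This is not XmlTextReader itself but a rough Python analogue using the standard library's incremental parser; the `order` and `total` element names and the `stream_totals` helper are assumptions made for the example.

```python
import io
import xml.etree.ElementTree as ET

def stream_totals(source):
    """Yield each order total while reading the document incrementally."""
    # iterparse reads the input piece by piece; an "end" event fires
    # once an element and all of its children have been parsed.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "order":
            yield float(elem.findtext("total"))
            # Drop the children and text of the processed record.
            # An empty <order> shell stays attached to the root, so
            # memory still grows slightly with the record count.
            elem.clear()

if __name__ == "__main__":
    data = io.BytesIO(
        b"<orders><order><total>1.5</total></order>"
        b"<order><total>2</total></order></orders>"
    )
    for value in stream_totals(data):
        print(value)
```

The trade-off mentioned above shows up here too: the code only ever sees one record at a time, so anything that needs to look backwards or across records has to be tracked by hand.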
Even if these techniques are used, processing extremely large XML files (on the order of gigabytes) may still take an unacceptably long time. One way to minimize the impact of that processing time on a Windows Forms application is to run the work asynchronously. .NET 4.0 comes with a handy out-of-the-box component called BackgroundWorker (it has been part of the framework since version 2.0). It is basically a simple way of running work on a separate thread so that the UI continues to respond while the process finishes.
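The pattern behind BackgroundWorker (do the work on another thread, report progress back to the UI) can be sketched in a language-neutral way. The Python below is only an analogy, not the .NET API: `process_in_background` is a hypothetical helper, and the queue stands in for the progress events a UI thread would receive.

```python
import queue
import threading

def process_in_background(items, progress):
    """Start a worker thread that handles items and posts progress messages."""
    def work():
        total = len(items)
        for done, item in enumerate(items, start=1):
            _ = item.upper()  # stand-in for the real per-record work
            # Report percentage complete to whoever is listening.
            progress.put(("progress", done * 100 // total))
        progress.put(("done", total))

    worker = threading.Thread(target=work, daemon=True)
    worker.start()
    return worker

if __name__ == "__main__":
    updates = queue.Queue()
    worker = process_in_background(["a", "b", "c", "d"], updates)
    # The main thread stays free; here it simply prints each update.
    while True:
        kind, value = updates.get()
        print(kind, value)
        if kind == "done":
            break
    worker.join()
```

In a real desktop application the loop at the bottom would be replaced by the UI framework's own event handling, which is the part BackgroundWorker takes care of.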
When dealing with a web application, a simple AJAX pattern can be used to download the file in the background and process it on the client. The footprint of the data can be reduced by using the JSON data format instead of XML. Depending on the data, the size saving can approach 50%. You lose the XML parsing ability, but the size improvement makes this choice compelling.
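The size difference is easy to measure for your own data. The sketch below serializes the same made-up records both ways and prints the byte counts; the record shape and the `to_xml`/`to_json` helpers are assumptions, and the actual ratio depends heavily on tag names and nesting.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical records used only to compare serialized sizes.
records = [{"id": i, "name": f"item{i}", "price": i * 1.5} for i in range(100)]

def to_xml(rows):
    # One <item> per record, one child element per field.
    root = ET.Element("items")
    for row in rows:
        item = ET.SubElement(root, "item")
        for key, value in row.items():
            ET.SubElement(item, key).text = str(value)
    return ET.tostring(root, encoding="unicode")

def to_json(rows):
    # Compact separators avoid padding the JSON with spaces.
    return json.dumps(rows, separators=(",", ":"))

xml_size = len(to_xml(records).encode("utf-8"))
json_size = len(to_json(records).encode("utf-8"))

if __name__ == "__main__":
    print(f"XML: {xml_size} bytes, JSON: {json_size} bytes, "
          f"ratio: {json_size / xml_size:.2f}")
```

XML repeats every element name in both an opening and a closing tag, which is where most of the difference comes from.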
Specifying the schema for an XML file can improve performance as well. BizTalk solutions often need to process large XML files, which can cause performance problems in much the same way as in the traditional applications mentioned above.
When using the BizTalk mapper to perform transformations, the entire document needs to be loaded into memory. Since mapping transformations use the same underlying XML technology in .NET, this makes sense. A BizTalk solution should take this into consideration and avoid performing transformations on large files. Similarly, using XPath queries against a document to extract a value that drives program logic should also be avoided for large documents. Consider using promoted or distinguished fields to perform program logic against instead. Another common technique is to “shred” the large document into small chunks using a pipeline in front of the process and feed the smaller chunks to separate orchestration instances. This also gives you the built-in parallel processing and load balancing capabilities inherent in BizTalk.
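The shredding idea itself is independent of BizTalk. The sketch below is not BizTalk pipeline code; it is a Python illustration of splitting one large document of repeating records into small standalone documents. The `shred` helper, the `order` record name, and the `batch` wrapper element are all made up for the example.

```python
import io
import xml.etree.ElementTree as ET

def shred(source, record_tag, chunk_size):
    """Yield small standalone XML documents, each with up to chunk_size records."""
    batch = []
    # Read the large document incrementally, one finished element at a time.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == record_tag:
            batch.append(ET.tostring(elem, encoding="unicode"))
            elem.clear()  # release the record once it has been copied out
            if len(batch) == chunk_size:
                yield "<batch>" + "".join(batch) + "</batch>"
                batch = []
    # Emit whatever is left over as a final, smaller chunk.
    if batch:
        yield "<batch>" + "".join(batch) + "</batch>"

if __name__ == "__main__":
    src = io.BytesIO(
        b"<orders><order>1</order><order>2</order><order>3</order></orders>"
    )
    for chunk in shred(src, "order", 2):
        print(chunk)
```

Each chunk is a complete document in its own right, so it can be handed to a separate worker (or, in BizTalk terms, a separate orchestration instance) without reference to the others.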
Finally, if the values within the document are not needed for processing and the application is simply routing the document to some destination, consider zipping the file in a pipeline component up front in the receive port. This will drastically reduce the size of the document sent to the MessageBox for routing. The document can then be unpacked on the outbound send port and sent to the final destination.
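The pack-on-receive, unpack-on-send round trip can be sketched as below. This is a Python illustration using gzip rather than an actual BizTalk pipeline component; the `pack` and `unpack` helpers and the sample payload are made up for the example.

```python
import gzip

def pack(xml_bytes):
    # Compress the document before it is stored or routed.
    return gzip.compress(xml_bytes)

def unpack(packed):
    # Restore the original bytes just before final delivery.
    return gzip.decompress(packed)

# Hypothetical, highly repetitive payload; real documents will
# compress less dramatically than this.
payload = b"<orders>" + b"<order><total>10.50</total></order>" * 1000 + b"</orders>"
packed = pack(payload)

if __name__ == "__main__":
    print(f"original: {len(payload)} bytes, packed: {len(packed)} bytes")
```

XML tends to compress well because element names repeat so often, but the document is opaque while it is packed, which is why this only suits pure routing scenarios.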
Have anything to add to this discussion of the issues surrounding processing large XML files within an application? Please share it with me in the comments below.