Fragmented XML text nodes in Java

August 29, 2009

Java’s default SAX/StAX parsers may return seemingly continuous text in multiple parts (especially when special characters are involved). Turn on coalescing to receive it as a single text block.

XML is a versatile data format, and the files can get quite large. For example, it is not unusual for the Wikipedia XML dumps to reach a few GB. There are two strategies to process such files: either you read the whole document into memory (DOM), or you stream through it (SAX/StAX). For large files, streaming might be the only option. However, streaming parsers are designed to work within a bounded amount of memory, so they may deliver a continuous text block in multiple pieces.

And the SAX parser is exactly like that: it does not load the entire XML, it just reads the stream of bytes and uses callback methods to notify the caller about the content. This keeps memory consumption low (little more than the stream buffers) and allows fast XML processing, although the callback API is a bit unfriendly sometimes.

The StAX parser provides a friendlier, pull-based API (the caller asks for the next event instead of being called back), while still reading only part of the stream at a time.

What if the XML text node is much larger than the available memory buffer?

The SAX parser will notify the caller through the callback interface with the text chunks available, and then proceed with the stream. That means you will receive only text fragments, possibly several inside the same element, and will not see the text as a whole.
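A common way to cope with this in a SAX handler is to append every chunk to a buffer instead of treating each callback as the complete text. The following is a minimal sketch of that idea; the class name SaxTextCollector and its collect helper are made up for illustration, and it simply joins all character data of the document.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of any library.
class SaxTextCollector extends DefaultHandler {
  private final StringBuilder text = new StringBuilder();

  @Override
  public void characters(char[] ch, int start, int length) {
    // The parser may call this several times for one text node,
    // so append the chunk instead of overwriting the previous one.
    text.append(ch, start, length);
  }

  // Parses the XML string and returns all character data, joined together.
  static String collect(String xml) {
    try {
      SaxTextCollector handler = new SaxTextCollector();
      SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
      parser.parse(new InputSource(new StringReader(xml)), handler);
      return handler.text.toString();
    } catch (Exception e) {
      throw new IllegalStateException("XML parsing failed", e);
    }
  }

  public static void main(String[] args) {
    System.out.println(collect("<?xml version=\"1.0\" ?><test>Q&amp;A</test>"));
  }
}
```

In a real handler you would probably reset the buffer in startElement and consume it in endElement, so that each element gets its own complete text.").equals("Q&A")) {
  throw new AssertionError("expected joined text Q&A");
}
if (!SaxTextCollector.collect("<a>one<b>two</b>three</a>").equals("onetwothree")) {
  throw new AssertionError("expected all character data joined");
}
</test>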

This is not the only case when fragments occur: it might also happen at the boundaries of special or escaped characters. For example, if you have the text Q&A, which in XML is escaped as Q&amp;A, you might end up reading first the string Q, then the &, and finally the A, instead of reading it as a whole string. Check it yourself with the following code:

import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;

public class TestTextNode {
  public static void main(String[] args) throws Exception {
    String xml = "<?xml version=\"1.0\" ?><test>Q&amp;A</test>";
    XMLInputFactory factory = XMLInputFactory.newInstance();
    XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
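    reader.next(); // START_DOCUMENT -> START_ELEMENT (<test>)
    reader.next(); // START_ELEMENT -> first CHARACTERS event
    System.out.println(reader.getText()); // only the first text fragment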
  }
}

On Sun’s Java 6 JVM you will receive only Q in the output. If you continue the processing, you will eventually receive the other characters, but to a first-time observer it might look strange.
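To see the remaining pieces, you can keep pulling events until the end of the document and record every CHARACTERS event. This is a small sketch along the lines of the example above; the class name ListTextFragments and the fragments helper are illustrative only, and how the text is split depends on the parser implementation.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, not part of any library.
class ListTextFragments {
  // Returns every text fragment the parser reports, in document order.
  static List<String> fragments(String xml) {
    List<String> result = new ArrayList<>();
    try {
      XMLStreamReader reader =
          XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
      while (reader.hasNext()) {
        // Each CHARACTERS event carries one fragment of the text.
        if (reader.next() == XMLStreamConstants.CHARACTERS) {
          result.add(reader.getText());
        }
      }
      reader.close();
    } catch (XMLStreamException e) {
      throw new IllegalStateException("XML parsing failed", e);
    }
    return result;
  }

  public static void main(String[] args) {
    String xml = "<?xml version=\"1.0\" ?><test>Q&amp;A</test>";
    for (String fragment : fragments(xml)) {
      System.out.println("[" + fragment + "]");
    }
  }
}
```

Whatever the split is, joining the fragments in order gives back the original text.");
if (parts.isEmpty()) {
  throw new AssertionError("expected at least one text fragment");
}
if (!String.join("", parts).equals("Q&A")) {
  throw new AssertionError("joined fragments should be Q&A");
}
</test>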

As in the example above, while you stream through the XML, you will receive a sequence of text nodes. The boundaries are usually one of the following:

  • closing tag of the actual element
  • opening tag of a new child element
  • buffer size of the reader (if the buffer becomes full, the callback will receive the text so far)
  • special escape characters (as above, the escaped & created a fragment)

While the first two may be trivial, the third and fourth are lesser-known internals of the XML parsers.

How can the parser join the consecutive text nodes?

It depends on the parser, but if you are using Java’s default StAX implementation, set the following property right after creating the factory:

XMLInputFactory factory = XMLInputFactory.newInstance();
factory.setProperty(XMLInputFactory.IS_COALESCING, true);
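Put together with the earlier test, a complete version might look like the sketch below. The class name CoalescedTextNode and the firstText helper are illustrative; the only setting that matters is the IS_COALESCING property on the factory.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, not part of any library.
class CoalescedTextNode {
  // Returns the text of the first CHARACTERS event, or null if there is none.
  static String firstText(String xml, boolean coalescing) {
    try {
      XMLInputFactory factory = XMLInputFactory.newInstance();
      // Ask the parser to merge adjacent character data into one event.
      factory.setProperty(XMLInputFactory.IS_COALESCING, coalescing);
      XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
      while (reader.hasNext()) {
        if (reader.next() == XMLStreamConstants.CHARACTERS) {
          return reader.getText();
        }
      }
      return null;
    } catch (XMLStreamException e) {
      throw new IllegalStateException("XML parsing failed", e);
    }
  }

  public static void main(String[] args) {
    String xml = "<?xml version=\"1.0\" ?><test>Q&amp;A</test>";
    System.out.println(firstText(xml, true));
  }
}
```

With coalescing enabled, the first text event should already contain the whole Q&A string.", true);
if (!"Q&A".equals(coalesced)) {
  throw new AssertionError("expected Q&A with coalescing, got " + coalesced);
}
if (CoalescedTextNode.firstText("<test/>", true) != null) {
  throw new AssertionError("expected null when there is no text");
}
</test>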

Or if you are using DOM parsing:

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setCoalescing(true);
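For DOM, the documented effect of setCoalescing(true) is that CDATA nodes are converted to text nodes and appended to the adjacent text node, if there is one. The sketch below illustrates this with a CDATA section in the middle of the text; the class name DomCoalescing and the childValues helper are invented for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
class DomCoalescing {
  // Returns the node values of the root element's direct children.
  static List<String> childValues(String xml, boolean coalescing) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      // Convert CDATA nodes to text and merge them with adjacent text nodes.
      factory.setCoalescing(coalescing);
      Element root = factory.newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)))
          .getDocumentElement();
      NodeList children = root.getChildNodes();
      List<String> values = new ArrayList<>();
      for (int i = 0; i < children.getLength(); i++) {
        values.add(children.item(i).getNodeValue());
      }
      return values;
    } catch (Exception e) {
      throw new IllegalStateException("XML parsing failed", e);
    }
  }

  public static void main(String[] args) {
    String xml = "<test>Q<![CDATA[&]]>A</test>";
    System.out.println(childValues(xml, true));
  }
}
```

With coalescing turned on, the element should end up with a single text child holding Q&A, instead of separate text and CDATA nodes.", true);
if (values.size() != 1) {
  throw new AssertionError("expected one coalesced child, got " + values.size());
}
if (!"Q&A".equals(values.get(0))) {
  throw new AssertionError("expected Q&A, got " + values.get(0));
}
</test>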

The tradeoff in this case: with coalescing the parser has to buffer each text node in full, so you pay with memory consumption for easier text processing.

I originally published this short article on oktech in 2009.

Last updated: August 29, 2014