64KB String limit in Java data streams

September 26, 2009

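Java’s DataOutputStream.writeUTF() cannot write a String whose encoded form is larger than 64KB, and ObjectOutputStream.writeUTF() shares the same limit (writeObject() is not affected, as it switches to a separate long-string form). Let’s try to write a really long String into a data stream: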

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;

public class Demo {
    public static void main(String[] args) throws Exception {
        // generate a 100,000-character string, well over the 64KB limit
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10000; i++)
            sb.append("1234567890");
        String s = sb.toString();

        // write the string into the stream
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(baos);
        dos.writeUTF(s); // throws UTFDataFormatException
        dos.close();
    }
}

If you run the code above, you will get something like this:

Exception in thread "main" java.io.UTFDataFormatException: encoded string too long: 100000 bytes
    at java.io.DataOutputStream.writeUTF(DataOutputStream.java:347)
    at java.io.DataOutputStream.writeUTF(DataOutputStream.java:306)
    at com.example.Demo.main(Demo.java:28)

What just happened? The Javadoc comes to the rescue:

First, two bytes are written to out as if by the writeShort method giving the number of bytes to follow.

A two-byte length prefix caps the encoded String at 65535 bytes, just under 64KB. Digging into the JDK sources, we find an explicit check for it:

if (utflen > 65535)
    throw new UTFDataFormatException(
        "encoded string too long: " + utflen + " bytes");
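Note that the limit applies to the encoded byte count, not to the character count. The sketch below estimates that count up front, assuming the modified UTF-8 rules described in the DataOutput Javadoc (1 byte for U+0001..U+007F, 2 bytes for U+0000 and U+0080..U+07FF, 3 bytes otherwise). The class and method names are made up for illustration and are not part of the JDK.

```java
// Hypothetical helper, not a JDK class: predicts whether writeUTF() will accept a String.
final class WriteUtfCheck {
    // The two-byte length prefix is unsigned, so this is the largest encodable size.
    static final int MAX_ENCODED_BYTES = 65535;

    // Number of bytes writeUTF() would emit for s, excluding the two-byte prefix.
    static long encodedLength(String s) {
        long bytes = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x0001 && c <= 0x007F) {
                bytes += 1; // ASCII except NUL
            } else if (c <= 0x07FF) {
                bytes += 2; // NUL and U+0080..U+07FF
            } else {
                bytes += 3; // everything else, including each surrogate half
            }
        }
        return bytes;
    }

    static boolean fitsWriteUTF(String s) {
        return encodedLength(s) <= MAX_ENCODED_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(encodedLength("abc"));    // prints 3
        System.out.println(encodedLength("\u20ac")); // prints 3 (one char, three bytes)
        System.out.println(fitsWriteUTF("abc"));     // prints true
    }
}
```

A String of 30,000 characters can therefore pass or fail depending on its content: pure ASCII needs 30,000 bytes, while 30,000 characters above U+07FF need 90,000.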

What can we do about it?

If you are using a 3rd-party library and have no means to access the source, then you are at its mercy, and you can only hope that you won’t have such long Strings.

If you are able to access the source code, your chances are better: you can define or modify the binary format of your data. Of course there are cases when this is not really possible, but for now, let us suppose you have created your binary format in an extensible way (with version bits or some other tracking), because that allows us to focus only on the writeUTF() method:

(1) Use byte[] arrays

You can manually transform the String to byte[] (e.g. with s.getBytes("utf-8")). Write the buffer's length as a 4-byte int prefix right before the bytes, and reading won’t be a problem either.
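A minimal sketch of this approach might look like the following. The class and method names are hypothetical, and it assumes Java 7 or later for StandardCharsets (on older versions, "utf-8" as a charset name works the same way).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: a String stored as a 4-byte length followed by UTF-8 bytes.
final class LongStringBytes {
    static void writeLongString(DataOutput out, String s) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length); // 4-byte prefix instead of writeUTF's 2 bytes
        out.write(bytes);
    }

    static String readLongString(DataInput in) throws IOException {
        int length = in.readInt();
        if (length < 0) {
            throw new IOException("negative string length: " + length);
        }
        byte[] bytes = new byte[length];
        in.readFully(bytes); // blocks until all bytes are read, or throws EOFException
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10000; i++)
            sb.append("1234567890");
        String s = sb.toString();

        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        writeLongString(new DataOutputStream(baos), s);
        DataInputStream dis = new DataInputStream(new ByteArrayInputStream(baos.toByteArray()));
        System.out.println(readLongString(dis).equals(s)); // prints true
    }
}
```

One difference to keep in mind: this writes standard UTF-8, not the modified UTF-8 of writeUTF(), so the bytes are not interchangeable with readUTF().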

(2) Split your String into smaller chunks

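You might split the String into chunks of ~16K characters and call writeUTF for each of them. A character takes at most three bytes in the encoding writeUTF uses, so such a chunk never exceeds 48KB. The reader has to know how many chunks to expect, so write the chunk count first. Easy, and no need to mess with manual byte[] transforms.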
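The chunking could be sketched as follows. The names are hypothetical, and the int chunk count written before the chunks is an assumption of this sketch, used to tell the reader when to stop.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical helper: a long String stored as a chunk count plus several writeUTF() records.
final class ChunkedUtf {
    // 16K chars encode to at most 48KB (3 bytes per char), safely below 65535 bytes.
    static final int CHUNK_CHARS = 16 * 1024;

    static void writeChunked(DataOutput out, String s) throws IOException {
        int chunks = (int) (((long) s.length() + CHUNK_CHARS - 1) / CHUNK_CHARS);
        out.writeInt(chunks); // assumption of this sketch: count first, then the chunks
        for (int start = 0; start < s.length(); start += CHUNK_CHARS) {
            int end = Math.min(start + CHUNK_CHARS, s.length());
            out.writeUTF(s.substring(start, end));
        }
    }

    static String readChunked(DataInput in) throws IOException {
        int chunks = in.readInt();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < chunks; i++) {
            sb.append(in.readUTF());
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10000; i++)
            sb.append("1234567890");
        String s = sb.toString();

        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        writeChunked(new DataOutputStream(baos), s);
        DataInputStream dis = new DataInputStream(new ByteArrayInputStream(baos.toByteArray()));
        System.out.println(readChunked(dis).equals(s)); // prints true
    }
}
```

Splitting at a fixed char index can cut a surrogate pair in half, but writeUTF() and readUTF() encode each char on its own, so the two halves are rejoined when the chunks are concatenated.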

(3) Use a custom length prefix

If you are lucky, you might even be able to do an incremental upgrade of your binary format. As writeUTF() fails on null values, you may wrap the writes in blocks like the following:

  if (s == null) {
    // mark that we had a null value
    dos.writeByte(0);
    // no string to write
  } else {
    // mark the non-null reference
    dos.writeByte(1);
    // write the string
    dos.writeUTF(s);
  }

This code uses a single-byte prefix to mark whether the following String is null. One can easily extend the code to check the String length and perform different writes, like this:

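  if (s == null) {
    // marker 0: null reference
    dos.writeByte(0);
  } else if (s.length() < 16*1024) {
    // marker 1: short string, at most 48KB encoded, safe for writeUTF
    dos.writeByte(1);
    dos.writeUTF(s);
  } else {
    // marker 2: long string; a distinct marker lets the reader pick the right format
    dos.writeByte(2);
    // here comes the simple workaround
    byte[] b = s.getBytes("utf-8");
    dos.writeInt(b.length);
    dos.write(b);
  }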
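For the reader to know which encoding follows, each write path needs its own prefix value. Here is a sketch of a matching writer and reader pair, assuming the markers 0 (null), 1 (writeUTF form) and 2 (int-prefixed bytes); the class name and the marker values are illustrative choices, not a fixed convention.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical codec: a one-byte marker selects how the String that follows is encoded.
final class MarkedStringCodec {
    static final int MARK_NULL = 0;  // no payload
    static final int MARK_SHORT = 1; // payload written by writeUTF()
    static final int MARK_LONG = 2;  // payload is a 4-byte length plus UTF-8 bytes

    static void write(DataOutput out, String s) throws IOException {
        if (s == null) {
            out.writeByte(MARK_NULL);
        } else if (s.length() < 16 * 1024) {
            out.writeByte(MARK_SHORT);
            out.writeUTF(s);
        } else {
            out.writeByte(MARK_LONG);
            byte[] b = s.getBytes(StandardCharsets.UTF_8);
            out.writeInt(b.length);
            out.write(b);
        }
    }

    static String read(DataInput in) throws IOException {
        int marker = in.readUnsignedByte();
        switch (marker) {
            case MARK_NULL:
                return null;
            case MARK_SHORT:
                return in.readUTF();
            case MARK_LONG:
                byte[] b = new byte[in.readInt()];
                in.readFully(b);
                return new String(b, StandardCharsets.UTF_8);
            default:
                throw new IOException("unknown string marker: " + marker);
        }
    }

    public static void main(String[] args) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10000; i++)
            sb.append("1234567890");

        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(baos);
        write(dos, null);
        write(dos, "short");
        write(dos, sb.toString());

        DataInputStream dis = new DataInputStream(new ByteArrayInputStream(baos.toByteArray()));
        System.out.println(read(dis));          // prints null
        System.out.println(read(dis));          // prints short
        System.out.println(read(dis).length()); // prints 100000
    }
}
```

Data written with only the markers 0 and 1 stays readable by this reader, which is what makes the upgrade incremental. Old readers, however, will not understand marker 2.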

There are no silver bullets for this problem, and in the end, workarounds contain some level of "hacks".

I originally published this short article on oktech in 2009.

Last updated: August 29, 2014

István Soós

software engineer, business advisor

Advocates for the maker-movement, self-directed learning and agile methods. His regular topics include: machine intelligence, data and risk analysis, distributed systems and knowledge management.
