Java's 64KB String limit in a few IO operations

Although I wrote about it about a year and a half ago, I have recently run into the same problem again: the writeUTF() method of java.io.DataOutputStream (and of java.io.ObjectOutputStream too, for that matter) is not able to serialize strings larger than 64KB. In this entry I will show you the underlying details of this problem and outline a few workaround options you might use.

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;

public class Demo {
    public static void main(String[] args) throws Exception {
        // generate a 100,000-character string, well over the 64KB limit
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10000; i++) {
            sb.append("1234567890");
        }
        String s = sb.toString();

        // write the string into the stream; writeUTF() throws here
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(baos);
        dos.writeUTF(s);
        dos.close();
    }
}

If you run the code above, you will receive something like this:

Exception in thread "main" java.io.UTFDataFormatException: encoded string too long: 100000 bytes
    at java.io.DataOutputStream.writeUTF(DataOutputStream.java:347)
    at java.io.DataOutputStream.writeUTF(DataOutputStream.java:306)
    at com.example.Demo.main(Demo.java:28)

Shocking, yet again. I got lazy and assumed that Java just works, but it does not. On the other hand, the Javadoc does contain this information implicitly:

First, two bytes are written to out as if by the writeShort
method giving the number of bytes to follow.

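Yeah: a two-byte length prefix can express at most 65535, and that is where the 64KB limit comes from. Note that the limit applies to the number of encoded bytes, not to the number of characters. If you still doubt it, the following is the related code from the JDK sources: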

if (utflen > 65535)
    throw new UTFDataFormatException(
        "encoded string too long: " + utflen + " bytes");
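Since the check is on encoded bytes, a string with far fewer than 65535 characters can still fail if it contains non-ASCII characters. The following is a minimal sketch of how one might estimate the encoded size up front, based on the modified UTF-8 rules described in the DataOutput Javadoc. The class and method names (ModifiedUtf8Length, encodedLength, fitsInWriteUTF) are made up for this illustration and are not part of the JDK.

```java
class ModifiedUtf8Length {

    // Number of bytes writeUTF() would emit for the string body, not counting
    // the two-byte length prefix. Hypothetical helper, not a JDK method.
    static long encodedLength(String s) {
        long len = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x0001 && c <= 0x007F) {
                len += 1; // plain ASCII, except NUL
            } else if (c <= 0x07FF) {
                len += 2; // NUL and U+0080..U+07FF
            } else {
                len += 3; // everything else, each surrogate half included
            }
        }
        return len;
    }

    // True if the string body fits into the two-byte length prefix.
    static boolean fitsInWriteUTF(String s) {
        return encodedLength(s) <= 65535;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10000; i++) {
            sb.append("1234567890");
        }
        String s = sb.toString();
        System.out.println("encoded length: " + encodedLength(s));
        System.out.println("fits in writeUTF: " + fitsInWriteUTF(s));
    }
}
```

A check like this only tells you in advance that writeUTF() would throw; it does not solve the problem by itself.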

What is the workaround?

If you are unlucky and some third-party library or application calls writeUTF() on your behalf, all you can do is hope that no string ever gets that long. Not very reassuring.

If you have access to the source code, your chances are better: you should be able to redesign or modify the binary format of your data exchange. Of course there are cases when this is not really possible, but for now, let us suppose you have created your binary format in an extensible way (with version bits or some other tracking), because that allows us to focus only on the writeUTF() method:

  • You can manually transform the String to byte[] (with e.g. s.getBytes("utf-8") ). Write the array length as a 4-byte int before the bytes, and reading it back won't be a problem either.

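  • You might split the String into chunks of ~16K characters, store the number of chunks and call writeUTF() on each of them. A character takes at most three bytes in the encoded form, so such a chunk stays safely below the limit. Pretty easy and does not mess with manual byte[] transforms.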

  • If you are lucky enough, you might even be able to make an incremental upgrade to the binary format. As writeUTF() fails on null values, people usually wrap the writes in blocks like the following:

    if (s == null) {
        // mark that we had a null value
        dos.writeByte(0);
        // no string to write
    } else {
        // mark the non-null reference
        dos.writeByte(1);
        // write the string
        dos.writeUTF(s);
    }
    

    As you might have noticed, this code uses a single byte prefix to mark the null/non-null value of the following String. One can easily extend the code to check for String length and perform different writes, like this:

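    if (s == null) {
        dos.writeByte(0);
    } else if (s.length() < 16*1024) {
        // short string: the original format, marker 1
        dos.writeByte(1);
        dos.writeUTF(s);
    } else {
        // here comes the simple workaround: a new marker value (2) tells
        // the reader that an int length and raw UTF-8 bytes follow
        byte[] b = s.getBytes("utf-8");
        dos.writeByte(2);
        dos.writeInt(b.length);
        dos.write(b);
    }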
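To make the options above a bit more concrete, here is a rough sketch of both the writing and the reading side. It assumes a third marker value (2) for the long form so that the reader can tell the two encodings apart, and it also shows the chunked variant. All names (LongStringIO, writeString, readString, writeChunked, readChunked) and the marker values are assumptions made for this illustration, not an existing API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

class LongStringIO {
    // Marker values are an assumption of this sketch.
    static final int NULL_MARKER = 0;
    static final int SHORT_MARKER = 1;
    static final int LONG_MARKER = 2;
    // 16K chars need at most 48KB encoded, safely below the 64KB limit.
    static final int LIMIT_CHARS = 16 * 1024;

    // Marker byte, then either writeUTF() or an int length plus UTF-8 bytes.
    static void writeString(DataOutputStream dos, String s) throws IOException {
        if (s == null) {
            dos.writeByte(NULL_MARKER);
        } else if (s.length() < LIMIT_CHARS) {
            dos.writeByte(SHORT_MARKER);
            dos.writeUTF(s);
        } else {
            byte[] b = s.getBytes(StandardCharsets.UTF_8);
            dos.writeByte(LONG_MARKER);
            dos.writeInt(b.length);
            dos.write(b);
        }
    }

    static String readString(DataInputStream dis) throws IOException {
        int marker = dis.readUnsignedByte();
        switch (marker) {
            case NULL_MARKER:
                return null;
            case SHORT_MARKER:
                return dis.readUTF();
            case LONG_MARKER:
                byte[] b = new byte[dis.readInt()];
                dis.readFully(b);
                return new String(b, StandardCharsets.UTF_8);
            default:
                throw new IOException("unknown string marker: " + marker);
        }
    }

    // Chunked variant: chunk count first, then one writeUTF() per chunk.
    // Does not handle null; combine it with a marker byte if needed.
    static void writeChunked(DataOutputStream dos, String s) throws IOException {
        int chunks = (s.length() + LIMIT_CHARS - 1) / LIMIT_CHARS;
        dos.writeInt(chunks);
        for (int i = 0; i < chunks; i++) {
            int start = i * LIMIT_CHARS;
            int end = Math.min(s.length(), start + LIMIT_CHARS);
            dos.writeUTF(s.substring(start, end));
        }
    }

    static String readChunked(DataInputStream dis) throws IOException {
        int chunks = dis.readInt();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < chunks; i++) {
            sb.append(dis.readUTF());
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10000; i++) {
            sb.append("1234567890");
        }
        String s = sb.toString();

        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(baos);
        writeString(dos, s);
        writeChunked(dos, s);
        dos.close();

        DataInputStream dis = new DataInputStream(
                new ByteArrayInputStream(baos.toByteArray()));
        System.out.println("marker round trip ok: " + s.equals(readString(dis)));
        System.out.println("chunked round trip ok: " + s.equals(readChunked(dis)));
    }
}
```

One difference to keep in mind: the long branch uses standard UTF-8, so an unpaired surrogate in the String would be replaced during encoding, while writeUTF() encodes every char on its own and the chunked variant keeps such strings intact.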
    

Of course there is no silver bullet for this problem, and in the end, most workarounds will contain the same level of "hacks". Just be prepared for it and think about it before it is too late!

The original entry was published by me at oktech

Timestamp: 2011-02-18 22:37
Author
István Soós
technology expert, trainer, business consultant and agile coach