64KB String limit in Java data streams
The writeUTF() methods of Java’s DataOutputStream and ObjectOutputStream cannot serialize Strings whose encoded form is larger than 65,535 bytes, just under 64KB. Let’s try to write a really long String into a data stream:
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;

public class Demo {
    public static void main(String[] args) throws Exception {
        // generate a 100,000-character string, well over the 64KB limit
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10000; i++) {
            sb.append("1234567890");
        }
        String s = sb.toString();
        // write the string into the stream
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(baos);
        dos.writeUTF(s); // throws UTFDataFormatException
        dos.close();
    }
}
If you run the code above, you will get something like this:
Exception in thread "main" java.io.UTFDataFormatException: encoded string too long: 100000 bytes
    at java.io.DataOutputStream.writeUTF(DataOutputStream.java:347)
    at java.io.DataOutputStream.writeUTF(DataOutputStream.java:306)
    at com.example.Demo.main(Demo.java:28)
What just happened? The Javadoc comes to the rescue:
First, two bytes are written to out as if by the writeShort method giving the number of bytes to follow.
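Note that the "number of bytes" in that sentence is the length of the modified UTF-8 encoding, not s.length(): per the DataOutput.writeUTF Javadoc, each char takes one, two, or three bytes. Below is a minimal sketch of that computation. The class UtfLength and its methods are hypothetical helpers for illustration, not JDK API.

```java
/** Hypothetical helper: estimates how many bytes writeUTF needs for a String. */
class UtfLength {
    /** Largest value an unsigned two-byte length prefix can hold. */
    static final int MAX_UTF_BYTES = 65535;

    /** Encoded length in modified UTF-8, following the DataOutput.writeUTF Javadoc. */
    static long modifiedUtf8Length(String s) {
        long bytes = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x0001 && c <= 0x007F) {
                bytes += 1; // ASCII except NUL
            } else if (c <= 0x07FF) {
                bytes += 2; // NUL and U+0080..U+07FF
            } else {
                bytes += 3; // U+0800..U+FFFF, including each surrogate half
            }
        }
        return bytes;
    }

    /** True if the encoded form fits behind the two-byte prefix. */
    static boolean fitsInWriteUTF(String s) {
        return modifiedUtf8Length(s) <= MAX_UTF_BYTES;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10000; i++) {
            sb.append("1234567890");
        }
        String s = sb.toString();
        System.out.println("encoded bytes: " + modifiedUtf8Length(s));
        System.out.println("fits in writeUTF: " + fitsInWriteUTF(s));
    }
}
```

In other words, a pure-ASCII String hits the limit at 65,536 chars, while a String made only of three-byte chars (most CJK text, for instance) already hits it at 21,846 chars.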
A two-byte length prefix caps the encoded length at 65,535 bytes, just under 64KB. Digging into the JDK sources, we find an explicit check for it:
if (utflen > 65535)
    throw new UTFDataFormatException(
        "encoded string too long: " + utflen + " bytes");
What can we do about it?
If you are using a third-party library and have no means to access the source, then you are at their mercy, and you can only hope that you won’t have such long Strings.
If you are able to access the source code, you have better chances: you
can define or modify the binary format of your data. Of course there are cases
when this is not really possible, but for now, let us suppose you have created
your binary format in an extensible way (with version bits or some other
tracking), because that allows us to focus only on the writeUTF()
method:
(1) Use byte[] arrays
You can manually transform the String to a byte[] (with e.g. s.getBytes("utf-8")). Write the buffer’s length as a 4-byte int prefix in front of the bytes, and reading won’t be a problem either.
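A minimal sketch of this approach, including the read side, could look like the following. The class name LongStringBytes and the method names are assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: Strings as a 4-byte length prefix plus UTF-8 bytes. */
class LongStringBytes {
    static void writeLongString(DataOutputStream out, String s) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length); // 4-byte prefix instead of writeUTF's 2 bytes
        out.write(bytes);
    }

    static String readLongString(DataInputStream in) throws IOException {
        int length = in.readInt();
        if (length < 0) {
            throw new IOException("negative string length: " + length);
        }
        byte[] bytes = new byte[length];
        in.readFully(bytes); // blocks until all bytes are read or throws EOFException
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10000; i++) {
            sb.append("1234567890");
        }
        String s = sb.toString();

        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(baos);
        writeLongString(dos, s);
        dos.close();

        DataInputStream dis = new DataInputStream(new ByteArrayInputStream(baos.toByteArray()));
        System.out.println("round trip ok: " + s.equals(readLongString(dis)));
    }
}
```

Keep in mind that this writes standard UTF-8, not the modified UTF-8 of writeUTF, so such data cannot be read back with readUTF.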
(2) Split your String into smaller chunks
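You might split the String into chunks of ~16K chars and call writeUTF for each of them. A chunk of that size encodes to at most 48KB, since each char takes at most three bytes. Write the number of chunks first so the reader knows how many to expect. Easy, and no need to mess with manual byte[] transforms.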
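The chunking idea can be sketched as below, assuming the chunk count is written as an int before the chunks. The class ChunkedUtf, its method names, and the count prefix are illustrative choices, not a fixed format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

/** Hypothetical helper: splits a String into chunks that each fit into writeUTF. */
class ChunkedUtf {
    /** 16K chars encode to at most 48KB (3 bytes per char), below the 65,535-byte cap. */
    static final int CHUNK_CHARS = 16 * 1024;

    static void writeChunked(DataOutputStream out, String s) throws IOException {
        int chunks = s.length() / CHUNK_CHARS + (s.length() % CHUNK_CHARS == 0 ? 0 : 1);
        out.writeInt(chunks); // assumption: the reader learns the chunk count up front
        for (int start = 0; start < s.length(); start += CHUNK_CHARS) {
            int end = Math.min(s.length(), start + CHUNK_CHARS);
            out.writeUTF(s.substring(start, end));
        }
    }

    static String readChunked(DataInputStream in) throws IOException {
        int chunks = in.readInt();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < chunks; i++) {
            sb.append(in.readUTF());
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10000; i++) {
            sb.append("1234567890");
        }
        String s = sb.toString();

        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(baos);
        writeChunked(dos, s);
        dos.close();

        DataInputStream dis = new DataInputStream(new ByteArrayInputStream(baos.toByteArray()));
        System.out.println("round trip ok: " + s.equals(readChunked(dis)));
    }
}
```

Splitting on a fixed char index may cut a surrogate pair in half, but writeUTF encodes every char on its own, so the two halves are joined again when the chunks are concatenated on the read side.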
(3) Use a custom length prefix
If you are lucky, you might even be able to do an incremental upgrade of
your binary format. As writeUTF() fails on null values, you may
wrap the writes in blocks like the following:
if (s == null) {
    // mark that we had a null value
    dos.writeByte(0);
    // no string to write
} else {
    // mark the non-null reference
    dos.writeByte(1);
    // write the string
    dos.writeUTF(s);
}
This code uses a single byte prefix to mark the null/non-null value of the following String. One can easily extend the code to check for String length and perform different writes, like this:
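if (s == null) {
    dos.writeByte(0);
} else if (s.length() < 16 * 1024) {
    // short string: at most 3 bytes per char, so it always fits into writeUTF
    dos.writeByte(1);
    dos.writeUTF(s);
} else {
    // here comes the simple workaround; it needs its own marker value,
    // otherwise the reader cannot tell which encoding follows
    dos.writeByte(2);
    byte[] b = s.getBytes("utf-8");
    dos.writeInt(b.length);
    dos.write(b);
}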
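On the read side, the marker byte has to tell the reader which encoding follows, so each non-null form needs its own value. Here is a minimal round-trip sketch under that assumption. The marker value 2 for the long form, the class TaggedString, and the constant names are illustrative choices, not an established format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: one marker byte selects the encoding of the String that follows. */
class TaggedString {
    static final int TAG_NULL = 0;  // nothing follows
    static final int TAG_UTF = 1;   // writeUTF data follows
    static final int TAG_BYTES = 2; // assumption: int length + UTF-8 bytes follow
    static final int SHORT_LIMIT = 16 * 1024;

    static void write(DataOutputStream out, String s) throws IOException {
        if (s == null) {
            out.writeByte(TAG_NULL);
        } else if (s.length() < SHORT_LIMIT) {
            out.writeByte(TAG_UTF);
            out.writeUTF(s);
        } else {
            byte[] b = s.getBytes(StandardCharsets.UTF_8);
            out.writeByte(TAG_BYTES);
            out.writeInt(b.length);
            out.write(b);
        }
    }

    static String read(DataInputStream in) throws IOException {
        int tag = in.readUnsignedByte();
        switch (tag) {
            case TAG_NULL:
                return null;
            case TAG_UTF:
                return in.readUTF();
            case TAG_BYTES: {
                int length = in.readInt();
                if (length < 0) {
                    throw new IOException("negative string length: " + length);
                }
                byte[] b = new byte[length];
                in.readFully(b);
                return new String(b, StandardCharsets.UTF_8);
            }
            default:
                throw new IOException("unknown string tag: " + tag);
        }
    }

    public static void main(String[] args) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10000; i++) {
            sb.append("1234567890");
        }
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(baos);
        write(dos, null);
        write(dos, "short");
        write(dos, sb.toString());
        dos.close();

        DataInputStream dis = new DataInputStream(new ByteArrayInputStream(baos.toByteArray()));
        System.out.println(read(dis));          // the null value
        System.out.println(read(dis));          // the short String
        System.out.println(read(dis).length()); // length of the long String
    }
}
```

Readers written for the older format only know the markers 0 and 1, so they fail on the new marker instead of silently misreading the data.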
There are no silver bullets for this problem, and in the end, workarounds contain some level of "hacks".