64KB String limit in Java data streams

István Soós, September 26, 2009

The writeUTF() method of Java’s DataOutputStream and ObjectOutputStream cannot serialize Strings whose encoded form is larger than 64KB. Let’s try to write a really long String into a data stream:

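import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;

public static void main(String[] args) throws Exception {
    // generate string longer than 64KB
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 10000; i++) {
        sb.append("0123456789"); // 10 characters per iteration, 100000 in total
    }
    String s = sb.toString();

    // write the string into the stream
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    DataOutputStream dos = new DataOutputStream(baos);
    dos.writeUTF(s); // throws UTFDataFormatException
}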

If you run the code above, you will get something like this:

Exception in thread "main" java.io.UTFDataFormatException: encoded string too long: 100000 bytes
    at java.io.DataOutputStream.writeUTF(DataOutputStream.java:347)
    at java.io.DataOutputStream.writeUTF(DataOutputStream.java:306)
    at com.example.Demo.main(Demo.java:28)

What just happened? The Javadoc comes to the rescue:

First, two bytes are written to out as if by the writeShort method giving the number of bytes to follow.

A two-byte length prefix caps the encoded length at 65535 bytes, which is the 64KB limit. Digging into the JDK sources, there is an explicit check for it:

if (utflen > 65535)
    throw new UTFDataFormatException(
        "encoded string too long: " + utflen + " bytes");

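To see the prefix for yourself, here is a minimal sketch that dumps what writeUTF() produces. The class and helper names (Utf8PrefixDemo, encode, prefixLength) are made up for this illustration and are not part of the JDK:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

class Utf8PrefixDemo {
    // Returns the raw bytes that writeUTF() produces for the given string.
    static byte[] encode(String s) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(baos);
        dos.writeUTF(s);
        dos.flush();
        return baos.toByteArray();
    }

    // Decodes the unsigned 16-bit, big-endian length prefix from the first two bytes.
    static int prefixLength(byte[] encoded) {
        return ((encoded[0] & 0xFF) << 8) | (encoded[1] & 0xFF);
    }

    public static void main(String[] args) throws IOException {
        byte[] encoded = encode("abc");
        // 2 prefix bytes + 3 payload bytes
        System.out.println("total bytes: " + encoded.length);            // prints "total bytes: 5"
        System.out.println("length prefix: " + prefixLength(encoded));   // prints "length prefix: 3"
    }
}
```

Since the prefix is an unsigned 16-bit value, 65535 is the largest length it can express, and that is exactly the number the check above compares against.
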
What can we do about it?

If you are using a third-party library and have no means to access its source, then you are at its mercy, and you can only hope that you won’t have such long Strings.

If you are able to access the source code, your chances are better: you can define or modify the binary format of your data. Of course there are cases when this is not really possible, but for now, let us suppose you have created your binary format in an extensible way (with version bits or some other form of tracking), so that we can focus only on the writeUTF() call:

(1) Use byte[] arrays

You can manually transform the String to a byte[] (e.g. with s.getBytes("utf-8")). Write the array length as a 4-byte int in front of the bytes, and reading won’t be a problem either: read the int, then read exactly that many bytes.
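
A minimal sketch of this approach could look like the following. The class and method names are hypothetical, and I use StandardCharsets.UTF_8 (Java 7 and later) in place of the "utf-8" charset name, which saves handling UnsupportedEncodingException:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

class ByteArrayStrings {
    // Writes a 4-byte length followed by the UTF-8 bytes of the string.
    static void writeString(DataOutputStream out, String s) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    // Reads the 4-byte length, then exactly that many bytes, and decodes them.
    static String readString(DataInputStream in) throws IOException {
        int length = in.readInt();
        byte[] bytes = new byte[length];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10000; i++) {
            sb.append("0123456789");
        }
        String original = sb.toString();

        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        writeString(new DataOutputStream(baos), original);

        DataInputStream in = new DataInputStream(new ByteArrayInputStream(baos.toByteArray()));
        System.out.println("round trip ok: " + original.equals(readString(in)));
    }
}
```

Keep in mind that this is standard UTF-8, while writeUTF() uses Java's modified UTF-8, so data written one way cannot be read back with the other method.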

(2) Split your String into smaller chunks

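You might split the String into chunks of ~16K characters, store the number of chunks and call writeUTF on each of them. A char takes at most 3 bytes in the encoded form, so a chunk of this size always stays below the 64KB limit. Pretty easy, and it does not mess with manual byte[] transforms.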
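
The chunking could be sketched like this. The names (ChunkedStrings, writeChunked, readChunked) and the choice of a 4-byte int for the chunk count are my assumptions; any count format works as long as the reader agrees with the writer:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

class ChunkedStrings {
    // 16K chars encode to at most 48KB, safely below the 65535-byte limit.
    static final int CHUNK_CHARS = 16 * 1024;

    // Writes the number of chunks, then each chunk with writeUTF().
    static void writeChunked(DataOutputStream out, String s) throws IOException {
        int chunks = (s.length() + CHUNK_CHARS - 1) / CHUNK_CHARS;
        out.writeInt(chunks);
        for (int i = 0; i < chunks; i++) {
            int start = i * CHUNK_CHARS;
            int end = Math.min(start + CHUNK_CHARS, s.length());
            out.writeUTF(s.substring(start, end));
        }
    }

    // Reads the chunk count and concatenates the chunks.
    static String readChunked(DataInputStream in) throws IOException {
        int chunks = in.readInt();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < chunks; i++) {
            sb.append(in.readUTF());
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10000; i++) {
            sb.append("0123456789");
        }
        String original = sb.toString();

        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        writeChunked(new DataOutputStream(baos), original);

        DataInputStream in = new DataInputStream(new ByteArrayInputStream(baos.toByteArray()));
        System.out.println("round trip ok: " + original.equals(readChunked(in)));
    }
}
```

Splitting at an arbitrary char index is fine here, because writeUTF() encodes every char on its own, including each half of a surrogate pair.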

(3) Use a custom length prefix

If you are lucky enough, you might even be able to upgrade your binary format incrementally. As writeUTF() fails on null values, people usually wrap the writes in blocks like the following:

  if (s == null) {
    // mark that we had a null value
    dos.writeByte(0);
    // no string to write
  } else {
    // mark the non-null reference
    dos.writeByte(1);
    // write the string
    dos.writeUTF(s);
  }

As you might have noticed, this code uses a single byte prefix to mark the null/non-null value of the following String. One can easily extend the code to check for String length and perform different writes, like this:

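  if (s == null) {
    dos.writeByte(0);           // null marker
  } else if (s.length() < 16*1024) {
    dos.writeByte(1);           // short string, written as before
    dos.writeUTF(s);
  } else {
    // here comes the simple workaround
    dos.writeByte(2);           // new marker: long string as byte[]
    byte[] b = s.getBytes("utf-8");
    dos.writeInt(b.length);
    dos.write(b);
  }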
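
The three-way write needs a matching reader that dispatches on the prefix byte. Here is a rough sketch of both sides; the marker values 0, 1 and 2 and all class, constant and method names are my own assumptions for illustration, so use whatever values your existing format already assigns to null and non-null:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

class MarkedStrings {
    // Hypothetical marker values; an existing format may already define the first two.
    static final int NULL_VALUE = 0;
    static final int SHORT_UTF = 1;
    static final int LONG_BYTES = 2;
    static final int SHORT_LIMIT = 16 * 1024;

    static void writeString(DataOutputStream out, String s) throws IOException {
        if (s == null) {
            out.writeByte(NULL_VALUE);
        } else if (s.length() < SHORT_LIMIT) {
            out.writeByte(SHORT_UTF);
            out.writeUTF(s);
        } else {
            out.writeByte(LONG_BYTES);
            byte[] b = s.getBytes(StandardCharsets.UTF_8);
            out.writeInt(b.length);
            out.write(b);
        }
    }

    static String readString(DataInputStream in) throws IOException {
        int marker = in.readUnsignedByte();
        switch (marker) {
            case NULL_VALUE:
                return null;
            case SHORT_UTF:
                return in.readUTF();
            case LONG_BYTES: {
                byte[] b = new byte[in.readInt()];
                in.readFully(b);
                return new String(b, StandardCharsets.UTF_8);
            }
            default:
                throw new IOException("unknown string marker: " + marker);
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(baos);
        writeString(out, null);
        writeString(out, "short");
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(baos.toByteArray()));
        System.out.println(readString(in)); // prints "null"
        System.out.println(readString(in)); // prints "short"
    }
}
```

Old data only ever contains the first two markers, so a new reader still understands it. An old reader, however, will not know what to do with the third one.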

There is no silver bullet for this problem, and in the end, most workarounds will contain the same level of "hacks".

I originally published this short article on oktech in 2009.

updated: August 29, 2014