Thursday, June 24, 2010

InputStreamReader and invalid UTF-8

While writing unit tests for code that reads UTF-8 encoded data, I was surprised to find that Java's InputStreamReader class does not throw an exception on bad data. Consider the following example program:
import java.io.*;

public class IConv {
    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: iconv <from> <to>");
            System.exit(1);
        }

        Reader input = new InputStreamReader(System.in, args[0]);
        Writer output = new OutputStreamWriter(System.out, args[1]);
        char[] buffer = new char[4096];
        int length;

        while ((length = input.read(buffer)) != -1) {
            output.write(buffer, 0, length);
        }

        output.flush();
    }
}
For correct UTF-8 input it does what you would expect:
$ printf "\xE2\x82\xAC\n" | iconv -f utf-8 -t utf-8 | od -t x1
0000000 e2 82 ac 0a
0000004
$ printf "\xE2\x82\xAC\n" | java IConv utf-8 utf-8 | od -t x1
0000000 e2 82 ac 0a
0000004
However, if you feed it bad data:
$ printf "\xE2\x82\xFC\n" | iconv -f utf-8 -t utf-8 | od -t x1
iconv: illegal input sequence at position 0
0000000

$ printf "\xE2\x82\xFC\n" | java IConv utf-8 utf-8 | od -t x1
0000000 ef bf bd ef bf bd
0000006
The Unix iconv tool gives an error, as expected. The Java version instead prints the Unicode replacement character (U+FFFD, which is ef bf bd in UTF-8) twice. According to the javadocs for CharsetDecoder, the default behavior is supposed to be to report the error:
How a decoding error is handled depends upon the action requested for that type of error, which is described by an instance of the CodingErrorAction class. The possible error actions are to ignore the erroneous input, report the error to the invoker via the returned CoderResult object, or replace the erroneous input with the current value of the replacement string. The replacement has the initial value "\uFFFD"; its value may be changed via the replaceWith method.

The default action for malformed-input and unmappable-character errors is to report them. The malformed-input error action may be changed via the onMalformedInput method; the unmappable-character action may be changed via the onUnmappableCharacter method.
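The documented default can be checked against a freshly created decoder. The following is only a minimal sketch: the class name DecoderDefaults and the helper rejects are made up for illustration, and it uses the CharsetDecoder.decode(ByteBuffer) convenience method rather than a stream.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;

public class DecoderDefaults {

    // Hypothetical helper: returns true if a decoder left at its default
    // error actions refuses to decode the given bytes.
    static boolean rejects(String encoding, byte[] bytes) {
        CharsetDecoder decoder = Charset.forName(encoding).newDecoder();
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return false;
        } catch (CharacterCodingException e) {
            // Thrown because the decoder reports errors instead of replacing.
            return true;
        }
    }

    public static void main(String[] args) {
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();

        // Print the error actions a new decoder starts out with.
        System.out.println(decoder.malformedInputAction());
        System.out.println(decoder.unmappableCharacterAction());

        // The same bad bytes that are fed to the IConv example.
        byte[] bad = {(byte) 0xE2, (byte) 0x82, (byte) 0xFC, 0x0A};
        System.out.println(rejects("UTF-8", bad));
    }
}
```

If the javadoc holds, the bad byte sequence is rejected here, which suggests the replacement behavior comes from how InputStreamReader sets up its decoder rather than from the decoder's own defaults.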
Looking at the javadocs for InputStreamReader, I didn't see any indication that it deviates from the default behavior. But sure enough, the InputStreamReader source code creates its decoder using CodingErrorAction.REPLACE. One caveat: I used Sun/Oracle's Java 6 VM for testing, but the source code I looked at was for the Apache Harmony VM because it conveniently came up in the search results. Now that we know why it is happening, we can make it throw an exception by explicitly passing in a CharsetDecoder configured to use CodingErrorAction.REPORT:
import java.io.*;
import java.nio.charset.*;

public class IConv {

    private static CharsetDecoder decoder(String encoding) {
        return Charset.forName(encoding).newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
    }

    private static CharsetEncoder encoder(String encoding) {
        return Charset.forName(encoding).newEncoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: iconv <from> <to>");
            System.exit(1);
        }

        Reader input = new InputStreamReader(System.in, decoder(args[0]));
        Writer output = new OutputStreamWriter(System.out, encoder(args[1]));
        char[] buffer = new char[4096];
        int length;

        while ((length = input.read(buffer)) != -1) {
            output.write(buffer, 0, length);
        }

        output.flush();
    }
}
Now, if the conversion is not possible it will fail loudly with an exception:
$ printf "\xE2\x82\xFC\n" | java IConv utf-8 utf-8 | od -t x1
Exception in thread "main" java.nio.charset.MalformedInputException: Input length = 2
        at java.nio.charset.CoderResult.throwException(CoderResult.java:260)
        at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:319)
        at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:158)
        at java.io.InputStreamReader.read(InputStreamReader.java:167)
        at java.io.Reader.read(Reader.java:123)
        at IConv.main(IConv.java:29)
0000000

$ printf "\xE2\x82\xAC\n" | java IConv utf-8 iso-8859-1 | od -t x1
Exception in thread "main" java.nio.charset.UnmappableCharacterException: Input length = 1
        at java.nio.charset.CoderResult.throwException(CoderResult.java:261)
        at sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:266)
        at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:106)
        at java.io.OutputStreamWriter.write(OutputStreamWriter.java:190)
        at IConv.main(IConv.java:30)
0000000
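Both exceptions above are subclasses of java.nio.charset.CharacterCodingException, so a caller that prefers a short message over a stack trace can catch that one type. The following is a rough sketch only: the class name StrictConvert, the method convert, and the wording of the error message are made up for illustration, and the conversion loop is the same as in the program above.

```java
import java.io.*;
import java.nio.charset.*;

public class StrictConvert {

    // Hypothetical helper: copies in to out, converting between the two
    // encodings and reporting (not replacing) any coding error.
    static void convert(InputStream in, String from, OutputStream out, String to)
            throws IOException {
        CharsetDecoder decoder = Charset.forName(from).newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        CharsetEncoder encoder = Charset.forName(to).newEncoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);

        Reader input = new InputStreamReader(in, decoder);
        Writer output = new OutputStreamWriter(out, encoder);
        char[] buffer = new char[4096];
        int length;

        while ((length = input.read(buffer)) != -1) {
            output.write(buffer, 0, length);
        }

        output.flush();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: iconv <from> <to>");
            System.exit(1);
        }

        try {
            convert(System.in, args[0], System.out, args[1]);
        } catch (CharacterCodingException e) {
            // Covers MalformedInputException and UnmappableCharacterException.
            System.err.println("conversion failed: " + e);
            System.exit(1);
        }
    }
}
```

Other I/O failures still propagate as a plain IOException, so only encoding problems get the short message.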
