I rather often abuse UTF-8 to encode binary data so I can pass it through a text-based API.
Today, after years (!!), was the first time I got caught by a non-reversible UTF-8 encoding.
    byte[] original = ....;
    String encoded = new String(original, "UTF-8");
    byte[] decoded = encoded.getBytes("UTF-8");

    Arrays.equals(original, decoded);
Gotta rewrite some stuff...
Shame on me!
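For anyone who wants to reproduce this, here is a minimal sketch of the round trip, assuming Java 8 or later (for `StandardCharsets` and `java.util.Base64`). The class name `RoundTripDemo`, its helper methods, and the sample bytes are made up for illustration; they are not from the code above. It compares the UTF-8 round trip with two alternatives that map every byte value to text and back: ISO-8859-1 and Base64.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical demo class; names are illustrative only.
class RoundTripDemo {

    // Decodes the bytes as UTF-8 and re-encodes them. Byte sequences that are
    // not well-formed UTF-8 are replaced during decoding, so they are lost.
    static boolean survivesUtf8(byte[] original) {
        String encoded = new String(original, StandardCharsets.UTF_8);
        return Arrays.equals(original, encoded.getBytes(StandardCharsets.UTF_8));
    }

    // ISO-8859-1 maps each byte 0x00..0xFF to exactly one char and back.
    static boolean survivesLatin1(byte[] original) {
        String encoded = new String(original, StandardCharsets.ISO_8859_1);
        return Arrays.equals(original, encoded.getBytes(StandardCharsets.ISO_8859_1));
    }

    // Base64 produces plain ASCII text, which suits most text-based APIs.
    static boolean survivesBase64(byte[] original) {
        String encoded = Base64.getEncoder().encodeToString(original);
        return Arrays.equals(original, Base64.getDecoder().decode(encoded));
    }

    public static void main(String[] args) {
        // Assumed sample data: 0xFF and a lone 0xC0 are not valid UTF-8.
        byte[] binary = {(byte) 0xFF, 0x00, (byte) 0xC0};

        System.out.println("UTF-8:      " + survivesUtf8(binary));
        System.out.println("ISO-8859-1: " + survivesLatin1(binary));
        System.out.println("Base64:     " + survivesBase64(binary));
    }
}
```

ISO-8859-1 keeps the string the same length as the data but can still produce control characters that a text-based API may reject, so Base64 is probably the safer choice when the receiving side is picky.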
Presumably the cause of your problem is that 'byte[] original' contains a string encoded using modified UTF-8 rather than standard UTF-8? (Caused by improper use of dos.writeUTF elsewhere in your app.)
Though if that's the case, I'm surprised you hadn't encountered a problem sooner; it's unusual for binary data to contain no zeros!
Though perhaps the UTF-8 decoder used by the String constructor is silently accepting an overlong encoding for zero, and you've only been caught out now because your data contains one of the UTF-16 surrogate values (which modified UTF-8 also encodes non-standardly: each surrogate of a pair is written as its own three-byte sequence instead of the pair becoming a single four-byte sequence).
If that's the case, the UTF-8 decoder used by Java is being very naughty, since accepting overlong encodings would mean it fails to meet the current Unicode conformance requirements!
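To make the difference concrete, here is a rough sketch comparing what `DataOutputStream.writeUTF` emits with what `String.getBytes` emits for the two cases mentioned above: U+0000 and a character outside the BMP. The class `ModifiedUtf8Demo` and its helpers are hypothetical names for illustration, and the test character U+1F600 is an arbitrary choice. `writeUTF` also prepends a two-byte length, which the helper strips.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; names are illustrative only.
class ModifiedUtf8Demo {

    // Encodes s with writeUTF and drops the 2-byte length prefix,
    // leaving only the modified UTF-8 payload.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream dos = new DataOutputStream(buf)) {
            dos.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // True if the bytes come back unchanged from a standard UTF-8 decode/encode.
    static boolean roundTrips(byte[] bytes) {
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        return Arrays.equals(bytes, decoded.getBytes(StandardCharsets.UTF_8));
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String nul = "\u0000";             // U+0000
        String astral = "\uD83D\uDE00";    // U+1F600 as a surrogate pair

        System.out.println("U+0000  modified: " + hex(modifiedUtf8(nul)));
        System.out.println("U+0000  standard: " + hex(standardUtf8(nul)));
        System.out.println("U+1F600 modified: " + hex(modifiedUtf8(astral)));
        System.out.println("U+1F600 standard: " + hex(standardUtf8(astral)));

        // Whether the standard decoder accepts the modified forms depends on
        // the JDK; running this shows what your runtime actually does.
        System.out.println("modified U+0000 round-trips:  " + roundTrips(modifiedUtf8(nul)));
        System.out.println("modified U+1F600 round-trips: " + roundTrips(modifiedUtf8(astral)));
    }
}
```

If the last two lines print false, the decoder is rejecting the non-standard forms and substituting replacement characters, which would also explain a failed `Arrays.equals` on data written with `writeUTF`.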