| The Sun "ByteToCharUTF8" converter gives neither control over nor
feedback from the UTF-8 conversion process.
The following URLs:
http://www-106.ibm.com/developerworks/library/utfencodingforms/
http://www.cl.cam.ac.uk/~mgk25/unicode.html
are invaluable in understanding what is going on here.
Note that these two actually disagree: IBM agrees with this file and says
that a UTF-8 sequence is at most 4 bytes long, whereas Kuhn goes up to 6.
RFC 2781 says that values above U+10FFFF (which includes everything needing
more than 4 UTF-8 bytes) are simply not convertible to UTF-16, which is
presumably what is used by Java, so the difference is academic.
It would appear that characters at or above U+10000 were not assigned until
Unicode 3.1, a standard dated 23/03/01.
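The UTF-16 limit discussed above can be illustrated with a minimal sketch. It assumes a Java 5 or later runtime, where the standard `Character` class has code point support; the class name `SupplementaryDemo` and its methods are hypothetical and are not part of this file.

```java
public final class SupplementaryDemo {
    private SupplementaryDemo() {}

    /**
     * Returns the number of UTF-16 code units (Java chars) needed for a
     * code point: 1 below U+10000, 2 (a surrogate pair) from U+10000 up.
     */
    public static int utf16Length(int codePoint) {
        // Anything above Character.MAX_CODE_POINT (U+10FFFF) has no UTF-16 form.
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException(
                "Not representable in UTF-16: 0x" + Integer.toHexString(codePoint));
        }
        return Character.charCount(codePoint);
    }

    /** Converts a code point to its UTF-16 code units. */
    public static char[] toUtf16(int codePoint) {
        // Character.toChars itself rejects invalid code points.
        return Character.toChars(codePoint);
    }
}
```

Under these assumptions, U+10000 becomes the surrogate pair D800 DC00, and any value above U+10FFFF is rejected rather than converted.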
The Sun implementation insists on throwing MalformedInputException on all
possible occasions, which is not an acceptable strategy for getting the user
to correct invalid characters. It should be able to throw
UnknownCharacterException, but this decoder never does.
Note that this code obeys the recommendation against accepting overlong
encodings of the same character.
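As a rough illustration of what rejecting an overlong encoding involves, here is a minimal sketch. It is not the check used in this file; the class `Utf8Overlong` and its methods are hypothetical names chosen for the example, and sequences are assumed to be limited to 4 bytes.

```java
public final class Utf8Overlong {
    private Utf8Overlong() {}

    // Smallest code point that genuinely needs a sequence of the given
    // length; the array index is the byte count (index 0 is unused).
    private static final int[] MIN_CODE_POINT = {0, 0x0, 0x80, 0x800, 0x10000};

    /**
     * Returns true if a code point decoded from a sequence of byteCount
     * bytes could have been encoded in fewer bytes, i.e. is overlong.
     */
    public static boolean isOverlong(int codePoint, int byteCount) {
        if (byteCount < 1 || byteCount > 4) {
            throw new IllegalArgumentException("byteCount must be 1..4");
        }
        return codePoint < MIN_CODE_POINT[byteCount];
    }

    /** Extracts the payload bits of a 2-byte sequence, without validation. */
    public static int decodeTwoBytes(int b0, int b1) {
        return ((b0 & 0x1F) << 6) | (b1 & 0x3F);
    }
}
```

For instance, the bytes C0 AF carry the payload 0x2F ('/'), which fits in a single byte, so the sketch flags that sequence as overlong; C3 A9 carries 0xE9 and is accepted.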
|