Character encoding in Java

Thomas Banderas :

I opened a file in Notepad, typed "ą", then saved and closed it.

I tried to read this file in two ways.

First:

        InputStream inputStream = Files.newInputStream(Paths.get("file.txt"));
        int result = inputStream.read();
        System.out.println(result);
        System.out.println((char) result);

196 Ä

Second:

        InputStream inputStream = Files.newInputStream(Paths.get("file.txt"));
        Reader reader = new InputStreamReader(inputStream);
        int result = reader.read();
        System.out.println(result);
        System.out.println((char) result);

261 ą

Questions: 1) When read as raw bytes, why is this letter stored as 196 and not as 261? 2) In which encoding is this letter stored as 196?

I am trying to understand why the two results differ.

Pshemo :

UTF-8 encodes code points in the range U+0080 to U+07FF as two bytes of the form 110xxxxx 10xxxxxx (more at wiki). So only the 11 bits marked x (xxxxx xxxxxx) are available for the value.

ą has the code point U+0105, where 0105 is a hexadecimal value (261 in decimal). In binary it can be represented as

      01       05    (hex)
00000001 00000101    (bin)
     xxx xxxxxxxx <- values for U+0080 - U+07FF range encode only those bits

     001 00000101 <- which means `x` will be replaced by only this part

So UTF-8 encoding applies the 110xxxxx 10xxxxxx mask, which means it combines

110xxxxx 10xxxxxx
   00100   000101

into (two bytes):

11000100 10000101
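The bit arithmetic above can be reproduced in a few lines of Java. This is only a sketch for the two-byte range; the class name `Utf8TwoByteSketch` and the helper `encodeTwoByte` are made up for illustration and are not part of any library.

```java
public class Utf8TwoByteSketch {

    // Hypothetical helper: encodes a code point from U+0080..U+07FF
    // into its two UTF-8 bytes, returned as unsigned int values.
    static int[] encodeTwoByte(int codePoint) {
        if (codePoint < 0x80 || codePoint > 0x7FF) {
            throw new IllegalArgumentException("not in the two-byte range");
        }
        int first = 0b11000000 | (codePoint >> 6);        // 110xxxxx: upper 5 bits
        int second = 0b10000000 | (codePoint & 0b111111); // 10xxxxxx: lower 6 bits
        return new int[] { first, second };
    }

    public static void main(String[] args) {
        int[] bytes = encodeTwoByte(0x0105); // U+0105 is ą
        System.out.println(Integer.toBinaryString(bytes[0]) + " "
                + Integer.toBinaryString(bytes[1])); // 11000100 10000101
        System.out.println(bytes[0] + " " + bytes[1]); // 196 133
    }
}
```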

Now, InputStream reads data as raw bytes. So when you call inputStream.read() the first time, you get 11000100, which is 196 in decimal. Casting 196 to char gives U+00C4, which is Ä. Calling inputStream.read() a second time would return 10000101, which is 133 in decimal.

Readers were introduced in Java 1.1 so that we could avoid this kind of mess in our code. Instead, we can specify which encoding the Reader should use (or let it use the default one) to get properly decoded values, in this case 00000001 00000101 (without the mask), which is 0105 in hexadecimal and 261 in decimal.
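To see both raw bytes without depending on file.txt, here is a small sketch that feeds the UTF-8 bytes of ą through an in-memory stream. The class name `RawBytesSketch` and the helper `rawBytes` are hypothetical, and the ByteArrayInputStream only stands in for the real file, assuming the file was saved as UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class RawBytesSketch {

    // Hypothetical helper: reads the content byte by byte through an
    // InputStream and returns the unsigned values separated by spaces.
    static String rawBytes(byte[] content) {
        StringBuilder sb = new StringBuilder();
        try (InputStream in = new ByteArrayInputStream(content)) {
            int b;
            while ((b = in.read()) != -1) { // read() returns 0..255, or -1 at end
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(b);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "\u0105" is ą; the escape avoids depending on the source file's encoding.
        byte[] fileContent = "\u0105".getBytes(StandardCharsets.UTF_8);
        System.out.println(rawBytes(fileContent)); // 196 133
    }
}
```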

In short

use Readers (with properly specified encoding) if you want to read data as text,
use Streams if you want to read data as raw bytes.
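As a rough illustration of why the charset passed to the Reader matters, the sketch below decodes the same two bytes once as UTF-8 and once as ISO-8859-1. The class name `ReaderCharsetSketch` and the helper `firstChar` are made up for this example, and the byte array stands in for the file content, assuming it was saved as UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReaderCharsetSketch {

    // Hypothetical helper: decodes the bytes with the given charset
    // and returns the first decoded char as an int.
    static int firstChar(byte[] content, Charset charset) {
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(content), charset)) {
            return reader.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] fileContent = { (byte) 0xC4, (byte) 0x85 }; // UTF-8 bytes of ą

        // Decoded as UTF-8, both bytes form one char: 261 (ą).
        System.out.println(firstChar(fileContent, StandardCharsets.UTF_8));

        // Decoded as ISO-8859-1, each byte is its own char: 196 (Ä) comes first.
        System.out.println(firstChar(fileContent, StandardCharsets.ISO_8859_1));
    }
}
```

For the real file, the same idea would be passing StandardCharsets.UTF_8 as the second argument to new InputStreamReader(inputStream, ...), or using Files.newBufferedReader(Paths.get("file.txt"), StandardCharsets.UTF_8), rather than relying on the platform default.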
