I open a file with Notepad, write "ą" in it, then save and close it.
I try to read this file in two ways.
First:
InputStream inputStream = Files.newInputStream(Paths.get("file.txt"));
int result = inputStream.read();
System.out.println(result);
System.out.println((char) result);
196 Ä
Second:
InputStream inputStream = Files.newInputStream(Paths.get("file.txt"));
Reader reader = new InputStreamReader(inputStream);
int result = reader.read();
System.out.println(result);
System.out.println((char) result);
261 ą
Questions: 1) When the file is read as raw bytes, why is this letter 196 and not 261? 2) In which encoding is this letter stored as 196?
I am trying to understand why the results differ.
UTF-8 encodes values in the range U+0080 - U+07FF as two bytes of the form 110xxxxx 10xxxxxx (more at wiki). So only the xxxxx xxxxxx positions, 11 bits in total, are available for the value.
ą has the code point U+0105, where 0105 is a hexadecimal value (261 in decimal). In binary it can be represented as
      01       05 (hex)
00000001 00000101 (bin)
     xxx xxxxxxxx <- values in the U+0080 - U+07FF range encode only these bits
     001 00000101 <- which means each `x` will be replaced by only this part
So UTF-8 encoding adds the 110xxxxx 10xxxxxx mask, which means it combines
110xxxxx 10xxxxxx
   00100   000101
into (two bytes):
11000100 10000101
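The bit arithmetic above can be reproduced in a few lines of Java. This is only a sketch: the class name Utf8TwoByteSketch and the helper encodeTwoByte are made up for illustration, and the helper handles nothing outside the two-byte range U+0080 - U+07FF. The result is compared with what String.getBytes produces for the same character.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8TwoByteSketch {
    // Hypothetical helper: encodes one code point from U+0080..U+07FF by hand.
    static byte[] encodeTwoByte(int codePoint) {
        if (codePoint < 0x80 || codePoint > 0x7FF) {
            throw new IllegalArgumentException("outside the two-byte range");
        }
        int high = 0b1100_0000 | (codePoint >> 6);       // 110xxxxx: upper 5 bits
        int low = 0b1000_0000 | (codePoint & 0b11_1111); // 10xxxxxx: lower 6 bits
        return new byte[] {(byte) high, (byte) low};
    }

    public static void main(String[] args) {
        byte[] manual = encodeTwoByte(0x0105); // U+0105 is ą
        byte[] library = "\u0105".getBytes(StandardCharsets.UTF_8);
        System.out.println(manual[0] & 0xFF);               // 196
        System.out.println(manual[1] & 0xFF);               // 133
        System.out.println(Arrays.equals(manual, library)); // true
    }
}
```

The `& 0xFF` is needed only for printing, because Java bytes are signed and 11000100 would otherwise show up as -60.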
Now, InputStream reads data as raw bytes. So the first call to inputStream.read() returns 11000100, which is 196 in decimal. A second call to inputStream.read() would return 10000101, which is 133 in decimal. Casting 196 to char gives U+00C4, which is Ä, hence the output in the first case.
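To see both bytes without depending on a file on disk, the same read can be done against an in-memory stream. This is a sketch under that assumption: ByteArrayInputStream stands in for the file, and the class name RawByteReadSketch and the helper readFirstThree are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class RawByteReadSketch {
    // Hypothetical helper: returns the results of three consecutive read() calls.
    static int[] readFirstThree(byte[] data) throws IOException {
        try (InputStream in = new ByteArrayInputStream(data)) {
            return new int[] {in.read(), in.read(), in.read()};
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the content of file.txt saved as UTF-8.
        byte[] fileContent = "\u0105".getBytes(StandardCharsets.UTF_8);
        int[] values = readFirstThree(fileContent);
        System.out.println(values[0]); // 196
        System.out.println(values[1]); // 133
        System.out.println(values[2]); // -1, end of stream
    }
}
```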
Readers were introduced in Java 1.1 so that we could avoid this kind of mess in our code. Instead, we can specify which encoding the Reader should use (or let it use the default one) to get properly decoded values, in this case 00000001 00000101 (without the mask), which is 0105 in hexadecimal and 261 in decimal.
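A small sketch of what specifying the encoding looks like, again using an in-memory stream instead of the file. The class name ReaderDecodeSketch and the helper readFirstChar are hypothetical. Decoding the same two bytes as ISO-8859-1 is included only to show that the chosen charset changes the result.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ReaderDecodeSketch {
    // Hypothetical helper: decodes the data with the given charset
    // and returns the first character as an int.
    static int readFirstChar(byte[] data, Charset charset) throws IOException {
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(data), charset)) {
            return reader.read();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] utf8Bytes = {(byte) 0xC4, (byte) 0x85}; // 11000100 10000101

        // Both bytes are consumed and decoded into one character, ą.
        System.out.println(readFirstChar(utf8Bytes, StandardCharsets.UTF_8)); // 261

        // With a single-byte charset, the first byte alone becomes a character, Ä.
        System.out.println(readFirstChar(utf8Bytes, StandardCharsets.ISO_8859_1)); // 196
    }
}
```

Passing the charset explicitly avoids relying on the platform default, which may differ between machines.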
In short:
- use Readers (with a properly specified encoding) if you want to read data as text,
- use Streams if you want to read data as raw bytes.