UTF-8 string to ordinal value: Java equivalent for Python output

Kevin Cruijssen :

I have the feeling this is most likely a duplicate, but I'm unable to find it.

NOTE: My Python knowledge is very limited, so I'm not 100% sure how strings, bytes, and encodings work in Python. My knowledge of encodings in general is also not great.

Let's say we have the string "Aä$$€h". It contains three different ordinary ASCII characters (A$h), and two non-ASCII characters (ä€). In Python we have the following code:

# coding: utf-8
input = u'Aä$$€h'
print [ord(c) for c in input.encode('utf-8')]
# Grouped per character:
print [[ord(x) for x in c.encode('utf-8')] for c in input]

Which will output:

[65, 195, 164, 36, 36, 226, 130, 172, 104]
[[65], [195, 164], [36], [36], [226, 130, 172], [104]]

Now I'm looking for a Java equivalent that gives this same integer array. I know that Strings in Java are UTF-16 internally, and that only byte arrays can have an actual encoding. I thought the following code would give the result I expected:

String input = "Aä$$€h";
byte[] byteArray = input.getBytes(java.nio.charset.StandardCharsets.UTF_8);
System.out.println(java.util.Arrays.toString(byteArray));

But unfortunately it gives the following result instead:

[65, -61, -92, 36, 36, -30, -126, -84, 104]

I'm not sure where these negative values are coming from.

So my question is mostly this:

Given a String in Java containing non-ASCII characters (e.g. "Aä$$€h"), output its UTF-8 byte values as integers, similar to what Python's ord function gives for each UTF-8 encoded byte. That we already have a Java String is a precondition for this question.

Jorn Vernee :

Java's byte type is signed, which is where the negative numbers come from. Bit-wise the values are the same in both languages; only the way they are represented differs (for example, 195 and -61 are both the bit pattern 0xC3). You can get the same representation as in Python by using Byte.toUnsignedInt():

String input = "Aä$$€h";
byte[] byteArray = input.getBytes(java.nio.charset.StandardCharsets.UTF_8);
int[] ints = new int[byteArray.length];
for(int i = 0; i < ints.length; i++) {
    ints[i] = Byte.toUnsignedInt(byteArray[i]);
}
System.out.println(java.util.Arrays.toString(ints));

Which prints:

[65, 195, 164, 36, 36, 226, 130, 172, 104]
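
For the "grouped per character" output from the question, a minimal sketch along the same lines could look like the following. The class and method names (Utf8Ordinals, flat, grouped) are hypothetical and only for illustration. It assumes Java 8 or later (for String.codePoints()), and it masks with 0xFF, which yields the same values as Byte.toUnsignedInt().

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class; the names are illustrative, not from the answer above.
final class Utf8Ordinals {

    // Flat array of unsigned UTF-8 byte values, like Python 2's
    // [ord(c) for c in s.encode('utf-8')].
    static int[] flat(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        int[] result = new int[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            // Masking with 0xFF keeps the low 8 bits, mapping -128..-1 to 128..255.
            result[i] = bytes[i] & 0xFF;
        }
        return result;
    }

    // One int[] per Unicode code point, like the "grouped per character" line.
    static List<int[]> grouped(String s) {
        List<int[]> groups = new ArrayList<>();
        // Iterate by code point rather than by char so surrogate pairs stay together.
        s.codePoints().forEach(cp ->
            groups.add(flat(new String(Character.toChars(cp)))));
        return groups;
    }

    public static void main(String[] args) {
        // "Aä$$€h", written with escapes to avoid source-file encoding issues.
        String input = "A\u00e4$$\u20ach";
        System.out.println(Arrays.toString(flat(input)));
        System.out.println(Arrays.deepToString(grouped(input).toArray()));
    }
}
```

Run as-is, this should print the same two lines as the Python code in the question: [65, 195, 164, 36, 36, 226, 130, 172, 104] followed by [[65], [195, 164], [36], [36], [226, 130, 172], [104]].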
