What's so special about Unicode character "" that it breaks the parser logic based on curly braces?

DVK :

I am trying to debug a weird issue, hoping a Unicode expert here would be able to help.

  • I have a (Perl based) sender program, which takes some data structure
  • it encodes the data structure into a proprietary serialized format which uses curly braces for encoding the data. Here's an example serialized string: {{9}{{8}{{skip_association}{{0}{}}}{{data}{{9}{{1}{{exceptions}{{9}{{1}{{-472926}{{9}{{1}{{AAAAAAYQ2}
  • it then sends that serialized string to a Java server
  • Java server tries to de-serialize the string back into a data structure.
  • The encoding itself does not really matter too much (imho), other than that it embeds the field length as part of the encoded data; e.g. {{id}{{7}9{Z928D2AA2}}} means a field named "id", of type "string" (7), string length 9, value Z928D2AA2 (see the sketch after this list).
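
To make that layout concrete, here is a rough sketch of how such a length-prefixed string field could be produced; the encode_field helper is hypothetical (the actual proprietary encoder is not shown), and it assumes the length is counted in characters of the value:

use strict;
use warnings;

# Hypothetical illustration of the layout described above:
# {{name}{{type}length{value}}}, with type 7 = string and length = character
# count of the value (an assumption; the real format's rule isn't specified).
sub encode_field {
    my ($name, $value) = @_;
    return sprintf '{{%s}{{7}%d{%s}}}', $name, length($value), $value;
}

print encode_field('id', 'Z928D2AA2'), "\n";   # {{id}{{7}9{Z928D2AA2}}}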

Problem: When the data structure being serialized contains some specific Unicode character(s), the de-serialization fails.

Specifically, this character: "" (which various online decoders display as %82 or 0x82) causes the issue.

I'm trying to understand why this would be an issue and what's so special about this character - there are other Unicode characters that do not break the de-serializer.

Is there something special about this character (aka 0x82) that would interfere with parsing a serialized string that relies on curly braces as separators and on field lengths being known?

Unfortunately, I am unable to debug the decoding library, so I only get a generic error message that decoding failed, without any indication of what actually failed.

P.P.S. Double extra curious: when I used that character in the title of this SO question, it showed in the preview but got deleted when the question was posted! And when I copy/pasted the strings into the editor, their measured length was correct compared to the encoded string length.

P.S. The Perl code doing the serialization is, as far as I know, fully Unicode compliant:

use open      qw(:std :utf8);    # undeclared streams in UTF-8
use charnames qw(:full :short);  # unneeded in v5.16
use Encode qw(decode);
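
For what it's worth, here is a minimal snippet (hypothetical data, not from the actual sender) showing how the character count and the UTF-8 byte count of a string containing this character differ; whether that matters here depends on what the Java de-serializer actually counts when it reads the embedded lengths:

use strict;
use warnings;
use Encode qw(encode);

my $s = "Z928D2AA2\x{82}";               # the problem character appended
print length($s), "\n";                   # 10 characters
print length(encode('UTF-8', $s)), "\n";  # 11 bytes: U+0082 is 2 bytes in UTF-8
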
ysth :

You can see information about characters in the Unicode Character Database; a text dump of it can be found at https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt, where it shows:

0082;<control>;Cc;0;BN;;;;;N;BREAK PERMITTED HERE;;;;

The meanings of the fields can be found at http://www.unicode.org/reports/tr44/#UnicodeData.txt (though that seems to omit the first field, which is the codepoint).

So it is an "other" class control character, with Bidirectional Category "Boundary Neutral" (which is normal for a Cc or Cf class character). There isn't anything else special about it.

But being a control character, it doesn't surprise me that something expecting text data has a problem with it.
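
If you want to confirm those properties programmatically, the core Unicode::UCD module can pull the same fields; a small sketch (the commented values are simply what the UnicodeData.txt line above lists for U+0082):

use strict;
use warnings;
use Unicode::UCD qw(charinfo);

my $info = charinfo(0x82);
print "$info->{category}\n";   # Cc  (Other, control)
print "$info->{bidi}\n";       # BN  (Boundary Neutral)
print "$info->{unicode10}\n";  # BREAK PERMITTED HERE (old Unicode 1.0 name)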
