Encoding, part 2: escapes

Text editor: utf8
"一".encode("gbk")

On disk: "一" saved as utf8 is the bytes \xe4\xb8\x80; here \x marks the hex value of a utf8 byte
In memory: \xe4\xb8\x80 decoded as utf8 is "一", that is \u4e00; \u marks a Unicode code point
Saved again: "一" (\u4e00) encoded as gbk is \xd2\xbb; here \x marks the hex value of a gbk byte
So utf8 bytes and gbk bytes are both written with \x, while Unicode code points are written with \u. Both are just binary values written in hex, but they index different code tables: \u denotes a single Unicode code point, and \x denotes a byte in any other encoding

"一".encode("gbk").decode("shift_jis")
Here the encoding that produced the bytes is ignored. \xd2\xbb is treated as plain hex and looked up in the Japanese code table, which gives the two half-width katakana "ﾒｻ" (checking the Shift_JIS table confirms that \xd2 corresponds to "ﾒ")
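A minimal sketch of the steps above in Python 3, assuming the character under discussion is "一" (U+4E00), which is what the byte values in the text correspond to. The variable names are only illustrative.

```python
ch = "一"                              # U+4E00, held in memory as Unicode

utf8_bytes = ch.encode("utf8")         # what a utf8 editor writes to disk
gbk_bytes = ch.encode("gbk")           # what a gbk save writes to disk

print(hex(ord(ch)))                    # 0x4e00
print(utf8_bytes)                      # b'\xe4\xb8\x80'
print(gbk_bytes)                       # b'\xd2\xbb'

# Decode the gbk bytes with the wrong table: each byte is looked up on its
# own in Shift_JIS and lands on a half-width katakana.
wrong = gbk_bytes.decode("shift_jis")
print([hex(ord(c)) for c in wrong])    # ['0xff92', '0xff7b']
```

The same two bytes give different characters depending only on which table is named in decode().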


So, in essence:
encode: 1. take the character's Unicode code point; 2. use that code point to find the corresponding bytes in the target encoding. Unicode is the transfer station.
decode: take the hex bytes as they are (ignoring which encoding produced them) and look them up in the table of the specified encoding. No Unicode transit is needed, just one lookup in the specified table
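One way to watch Unicode act as the transfer station is to turn gbk bytes into utf8 bytes. This is only a sketch: the byte values are the ones for U+4E00 used earlier, and the names are made up for the example.

```python
gbk_bytes = b"\xd2\xbb"

# Step 1: look the bytes up in the gbk table; the result is a Python str,
# which holds Unicode code points.
text = gbk_bytes.decode("gbk")
print(hex(ord(text)))            # 0x4e00

# Step 2: from the code point, find the bytes in the target encoding.
utf8_bytes = text.encode("utf8")
print(utf8_bytes)                # b'\xe4\xb8\x80'
```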


Python command line: pressing Enter versus print
The command line is itself a kind of text editor, and it handles Enter itself


Escape: putting "\" in front of a letter represents those common ASCII characters that cannot be displayed
That is, no new symbol is added to the keyboard; a character with no visible form is represented by a combination of existing ones, namely the backslash "\" plus a character
"\" stands for nothing by itself. It exists only to escape other characters that already exist, so that more can be expressed
Examples: \a, \n, \t
Each one stands for a single character in a string; \a and "\x07" are the same string

That is, "\x07" represents the bell, but that is too long to remember, so it is easier to write \a
\n is the same idea: nobody wants to remember its numeric form
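A quick check that the short escapes and the numeric escapes name the same characters (a sketch for Python 3; nothing here rings the bell, it only compares strings):

```python
bell = "\a"

print(bell == "\x07")        # True
print(len(bell))             # 1
print(ord(bell))             # 7

# \n and \t are shorthands in the same way.
print("\n" == "\x0a")        # True
print("\t" == "\x09")        # True
```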


\0yy, \xyy and \uyyyy are special: they need more characters after them, whereas \a and \t stand for a character directly
In \xyy and \uyyyy the digits are hexadecimal (in \0yy they are octal); convert the number to decimal and look up the character at that position in the code table



First, distinguish input from output: what is typed is always characters, never a character code (unless a method is used)
>>> "A"
>>> "\x07"
These are no different. Both are strings; "\x07" only looks like a hexadecimal number, it is really a one-character string

print(string) is itself a function: it takes a string, processes it, and writes the result to the output
Its most significant feature is how it treats escapes: an escape character such as \a is written out as the character itself instead of being shown as code. In essence it translates code information into information for humans, so the string '\a' comes out as a beep
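The difference between what the prompt echoes and what print writes can be sketched with repr() and a captured print. The string s is just an example value; redirect_stdout is used only so the printed text can be inspected.

```python
import io
from contextlib import redirect_stdout

s = "a\tb\nc"

# The interactive prompt echoes repr(s): escapes stay visible as code.
echoed = repr(s)
print(echoed)                    # 'a\tb\nc'

# print writes the characters themselves: a real tab and a real newline.
buf = io.StringIO()
with redirect_stdout(buf):
    print(s)
printed = buf.getvalue()
print(printed.count("\n"))       # 2 (one from s, one added by print)
```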


On input, b"" means that encoding is performed; on output, b"" indicates that the value is the result of encoding


Encoding
1. bytes: encodes a string according to the specified encoding, equivalent to .encode("encoding")
bytes("我", "utf8") returns b'\xe6\x88\x91'

2. b: a special case of bytes that only supports ASCII, so there is no need for the function form
b"ab" == "ab".encode("ascii") returns True
Output: ASCII bytes are not shown as hex escapes but in short form, written directly as b"a". It is still a byte stream; the byte is the character's ASCII number in binary
>>> "我".encode() returns b'\xe6\x88\x91'
>>> "a".encode() returns b'a'
The short form is used only for bytes in the ASCII range
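The two ways of encoding and the short display form can be checked like this (a sketch, assuming the non-ASCII example character is "我", U+6211, as the bytes in the text suggest):

```python
by_function = bytes("我", "utf8")
by_method = "我".encode("utf8")

print(by_function)                      # b'\xe6\x88\x91'
print(by_function == by_method)         # True

# The b"" literal form covers ASCII only and equals an ascii encode.
print(b"ab" == "ab".encode("ascii"))    # True

# An ASCII byte is displayed as the character, but it is still a number.
short = "a".encode()
print(short)                            # b'a'
print(list(short))                      # [97]
```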

Decoding

There is no function that mirrors bytes(); the corresponding method is .decode(), which decodes according to the specified encoding

u: only supports unicode-escape decoding
.decode("unicode-escape")
u"\u6211" returns "我", because \u6211 in the Unicode table corresponds to the character "我"
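A small sketch of unicode-escape as an ordinary codec in Python 3. The bytes literal below is six plain ASCII bytes (backslash, u, 6, 2, 1, 1), not one character:

```python
escaped = b"\\u6211"
print(len(escaped))                                    # 6

decoded = escaped.decode("unicode-escape")
print(decoded == "\u6211")                             # True
print(decoded.encode("utf8"))                          # b'\xe6\x88\x91'

# Encoding with the same codec goes back to the six ASCII bytes.
print(decoded.encode("unicode-escape") == escaped)     # True
```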



Character set: ordinary characters (IWC character set) plus the escape characters
Character encoding: the escape characters are encoded too. For \a and \b this is visible on the command line (they show as \x07 and \x08). \n and \t, whether entered directly or passed through .encode(), still show as b"\n" and so on. The hex is just not displayed; in essence they do have a hex encoding

print("\a") is a beep; \a is one character
print("\z") prints \z; \z is two characters, "\" and "z"
However, Python does not allow "\" to appear alone, so the command line shows "\z" as '\\z', which indicates that it is two characters rather than one
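The one-character versus two-character distinction can be checked with len() and repr(). In this sketch the second string is written as "\\z" rather than "\z", since newer Python versions warn about unknown escapes in literals; the two spellings produce the same string.

```python
bell = "\a"
not_an_escape = "\\z"        # a backslash followed by z

print(len(bell))             # 1
print(len(not_an_escape))    # 2

# repr is what the command line echoes.
print(repr(bell))            # '\x07'
print(repr(not_an_escape))   # '\\z'
```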

Each special character is added to the character set as a single escape character; escaping is a step in both encoding and decoding

The command line display is deliberately inconsistent (e.g. \a is shown as \x07, but \z is shown as \\z) so that the characters are easier to tell apart


unicode-escape is no different in kind from utf8 or gbk: it is a character encoding, but the bytes it produces spell out the Unicode code point

Special escape 1: \x and \0. \0 and \x are equivalent, e.g. "\x07" and "\007"
\a escapes the one character after the backslash
\x and \0 escape the two characters after them
For example, in "\xab" the a and b are read together as the hex number of a position in the code table, which is '«', so "\xab".encode() is '«'.encode()
"\253" is the octal equivalent

Therefore "\a" and "\x07" are the same thing; \x07 just shows directly the position of \a in the code table: 7th

In this way the character set is expanded: "\x07", "\xab", "\007" and "\253" are each one character

But inside a file a \ is really \\, so \x07 in a file is actually \\x07
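A sketch of the numeric escapes in Python 3. The octal form of \xab is \253 (2*64 + 5*8 + 3 = 171 = 0xab); the name on_disk is only illustrative and stands for text read from a file.

```python
# Hex, octal and letter forms of the same character.
print("\x07" == "\007" == "\a")      # True
print("\xab" == "\253" == "«")       # True
print("\xab".encode())               # b'\xc2\xab'

# Text that arrives from a file as the four characters \ x 0 7 is not
# the bell; written as a Python literal it needs a doubled backslash.
on_disk = r"\x07"
print(len(on_disk))                  # 4
print(on_disk == "\\x07")            # True
print(on_disk == "\x07")             # False
```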

Special escape 2: \u440e, the Unicode form. It escapes the four characters after it, and those four characters must be hex digits that form a code in the Unicode table

\xyy: yy denotes one byte, and one byte covers the ASCII codes, so there is never a case where the corresponding character does not exist
It amounts to giving the ASCII characters aliases

\uyyyy states that the character must be looked up in the Unicode table, so there is no longer any question of which table to use
So if the characters after \u do not form a four-digit Unicode code, an error is raised
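The error case can be reproduced with the unicode-escape codec. This is a sketch with made-up byte strings: one complete escape and one cut off after two hex digits.

```python
complete = b"\\u6211"
truncated = b"\\u62"          # only two of the four hex digits

print(complete.decode("unicode-escape") == "\u6211")   # True

try:
    truncated.decode("unicode-escape")
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    print("decode failed:", exc.reason)

print(failed)                 # True
```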


Binary write and binary read have no encoding problem: bytes are written, bytes are read


Question:
A response was saved directly in binary mode. The binary data contains \u with incomplete content after it, so reading it back raises an error

File reading, question 1:
The \u problem: utf8 and unicode-escape handle ASCII characters the same way, but other characters differently
If a byte stream produced by unicode-escape is decoded as utf8, punctuation and English are fine, but the Unicode codes cannot be resolved by utf8
A text editor opens files as utf8 by default, so the Unicode codes are not translated through the Unicode table and are displayed in the raw form \u6211\u9978
Decoding with utf8 in code gives the same thing: printing shows \u6211\u9978

So the root cause is still that encoding and decoding differ. If there is an intermediate step, the file read back as utf8 will contain the form \u6211\u9978
Intermediate step:
Data encoded with unicode-escape is decoded as utf8 and then saved as utf8. What gets saved is really "\\u6211\\u9978".encode(), i.e. the ASCII bytes b'\\u6211\\u9978', and not b'\xe6\x88\x91\xe9\xa5\xb8'
Decoding that with utf8 naturally gives back the string "\u6211\u9978" as literal text
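The mismatch can be reproduced, and undone, in a few lines. This is a sketch assuming the data is the two characters U+6211 and U+9978 from the example; the recovery step works here because the misread text is pure ASCII, and data that mixes escapes with real non-ASCII characters may need more care.

```python
text = "\u6211\u9978"

stored = text.encode("unicode-escape")
print(stored)                            # b'\\u6211\\u9978'

# Reading those bytes as utf8 succeeds, but gives 12 literal characters.
misread = stored.decode("utf8")
print(len(misread))                      # 12
print(misread == text)                   # False

# Recovery: go back to bytes and decode with the codec that produced them.
recovered = misread.encode("utf8").decode("unicode-escape")
print(recovered == text)                 # True
```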


File reading, question 2:
Apart from question 1:
writing with wb and reading with utf8 gives the \uxxxx problem
Open the file in PyCharm and check the encoding shown in the lower right corner, then decode with that encoding

Also:
What is saved in a file is always encoded bytes; characters are never stored directly. When the file is opened with an encoding specified, the encoding is applied automatically and no manual .encode() is needed
wb writes bytes and rb reads bytes; if a character takes several bytes and is split during reading, an error occurs
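The split-character error can be sketched with a temporary file. The file name demo.bin and the one-byte read are choices made only for this example.

```python
import os
import tempfile

payload = "我".encode("utf8")            # three bytes for one character

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.bin")
    with open(path, "wb") as f:          # bytes in, no encoding involved
        f.write(payload)
    with open(path, "rb") as f:          # bytes out, no decoding involved
        first = f.read(1)
        rest = f.read()

# Decoding only part of a multi-byte character fails.
try:
    first.decode("utf8")
    partial_ok = True
except UnicodeDecodeError:
    partial_ok = False

print(partial_ok)                             # False
print((first + rest).decode("utf8") == "我")  # True
```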

Origin www.cnblogs.com/justaman/p/11514697.html