Text editor encoding: utf8
"一".encode("gbk")
Saved on disk: "一" + utf8 is stored as the hexadecimal bytes \xe4\xb8\x80; here \x marks utf8 byte values
In memory: \xe4\xb8\x80 + utf8 gives "一", that is \u4e00; here \u marks a Unicode code point
After encoding: "一", that is \u4e00, becomes \xd2\xbb; here \x marks gbk byte values
So utf8 and gbk bytes are both written with \x, while Unicode code points are written with \u. Both are binary underneath; they only refer to different code tables: \u always means a Unicode code point, and \x stands for a byte in any other encoding
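The three representations above can be checked directly in Python 3. This is a minimal sketch, assuming the character under discussion is "一" (U+4E00), which is what the quoted byte values correspond to; the variable names are only illustrative.

```python
ch = "\u4e00"  # the character 一, written by its Unicode code point

utf8_bytes = ch.encode("utf8")  # what a utf8 text editor stores on disk
gbk_bytes = ch.encode("gbk")    # the same character in the gbk table

print(hex(ord(ch)))  # 0x4e00
print(utf8_bytes)    # b'\xe4\xb8\x80'
print(gbk_bytes)     # b'\xd2\xbb'
```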
"一".encode("gbk").decode("shift_jis")
At this point the original encoding is ignored: the bytes \xd2\xbb are treated purely as hexadecimal values and looked up in the Japanese Shift_JIS code table, giving "ﾒｻ" (checking the Shift_JIS table, \xd2 does correspond to "ﾒ")
Therefore, essentially:
encode: 1. find the character's Unicode code point; 2. using that code point and the target encoding, find the corresponding hexadecimal bytes in that encoding's table. Unicode is the transfer station.
decode: take the hexadecimal bytes as they are (ignoring which encoding produced them) and look them up in the table of the specified encoding. No transfer through another encoding is needed: the specified table is consulted directly, in one step
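The encode and decode summary can be illustrated with a short sketch, again assuming the character is "一" (U+4E00). The same two bytes are looked up in two different tables; the names `raw`, `as_japanese` and `back` are illustrative.

```python
raw = "\u4e00".encode("gbk")           # code point -> gbk bytes
as_japanese = raw.decode("shift_jis")  # same bytes, looked up in the Shift_JIS table
back = raw.decode("gbk")               # same bytes, looked up in the gbk table

print(raw)                 # b'\xd2\xbb'
print(ascii(as_japanese))  # two half-width katakana, shown as escapes
print(back == "\u4e00")    # True
```

Decoding never asks where the bytes came from, which is why a mismatched table produces unrelated characters instead of an error here.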
Python command line: pressing Enter versus print
The command line is itself a kind of text editor, so it handles the Enter key itself
Escape: putting "\" in front of a letter represents ASCII characters that ordinary characters cannot display
That is, without adding anything to the existing characters, a new character is represented by a combination of existing ones: the backslash "\" plus a character
\ stands for nothing by itself; it exists only to escape other existing characters, so that more can be expressed
For example \a, \n, \t
Each of these stands for one character; \a and "\x07" are the same string
That is, "\x07" represents the buzzer, but that is too long to remember, so \a is easier to use
\n works the same way: there is no need to remember the code string it stands for
\0yy, \xyy and \uyyyy are rather special: they need further characters after them, whereas \a and \t stand for a character directly
yy is read as a hexadecimal number (octal in the \0yy form), converted to decimal, and then looked up in the character code table
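A small check of the escape forms described above. This is only a sketch; the name `bell` is illustrative.

```python
bell = "\a"

# Shorthand, hexadecimal, and octal spellings of the same character.
print(bell == "\x07" == "\007")  # True
print(len(bell), ord(bell))      # 1 7

# \n and \t are shorthands in the same way.
print("\n" == "\x0a", "\t" == "\x09")  # True True
```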
First, distinguish input from output: what is typed in is always characters; a character code cannot be entered directly (without calling a method)
>>> "A"
>>> "\x07"
There is no difference between the two: both are strings. "\x07" only looks like hexadecimal; it is really a one-character string, not a hexadecimal number
print(string) is itself a function: it takes a string, processes it, and writes out the result
The most visible feature of print is how it treats escape characters such as \a: it outputs the character itself rather than its escaped spelling; in essence it translates code information into human-readable information, so the string '\a' comes out as a beep
On input, b"" means the text is to be treated as encoded bytes; in output, b"" indicates that the value is the result of encoding
Encoding
1. bytes: encodes a string according to the specified encoding; equivalent to .encode("encoding")
bytes("我", "utf8") returns b'\xe6\x88\x91'
2. b: a special case of bytes that only supports ASCII, so the function form is not needed
b"ab" == "ab".encode("ascii") returns True
Output: ASCII bytes are not shown as hexadecimal escapes but in the short form b"a"; this is still a byte stream, and the byte is the binary form of the character's ASCII code
>>> "我".encode() returns b'\xe6\x88\x91'
>>> "a".encode() returns b"a"
Only ASCII bytes are displayed in the short form
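The equivalences listed under Encoding can be verified as follows. A minimal sketch, assuming the non-ASCII example character is "我" (U+6211), which matches the bytes quoted above.

```python
ch = "\u6211"  # 我

print(bytes(ch, "utf8") == ch.encode("utf8"))  # True
print(ch.encode())                             # b'\xe6\x88\x91'

# The b"" literal only accepts ASCII and equals the encoded ASCII string.
print(b"ab" == "ab".encode("ascii"))           # True

# The short display form and the \x form are the same byte.
print("a".encode(), b"a" == b"\x61")           # b'a' True
```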
Decoding
Unlike bytes for encoding, decoding is done with the .decode method of a bytes object, according to the specified encoding
u: only supports unicode-escape decoding
.decode("unicode-escape")
u"\u6211" returns "我"; in the Unicode table, \u6211 is the character "我"
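To make the unicode-escape step concrete, here is a short sketch that decodes the six ASCII bytes spelling out \u6211 and encodes the result back. The names `literal` and `decoded` are illustrative.

```python
literal = b"\\u6211"  # six ASCII bytes: backslash, u, 6, 2, 1, 1

decoded = literal.decode("unicode-escape")  # one character, U+6211

print(len(literal), len(decoded))         # 6 1
print(ascii(decoded))                     # '\u6211'
print(decoded.encode("unicode-escape"))   # b'\\u6211'
```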
Character set: ordinary characters (IWC character set) plus the escape characters
Character encoding: escape characters are encoded too. For \a and \b the hexadecimal code is visible at the command line, but for \n and \t, whether entered directly at the command line or passed through .encode(), it is not: the result shows as b"\n". It is only displayed that way; underneath there is still a hexadecimal code
print("\a") sounds the buzzer; \a is one character
print("\z") prints \z; \z is two characters, "\" and "z"
However, Python does not allow "\" to stand alone in a string, so the command line echoes "\z" as '\\z', which shows that it is two characters rather than one
Special characters are incorporated into the character set as single escape characters and take part in encoding and decoding
The command-line display is deliberately inconsistent (e.g. \a is echoed as \x07, but \z is echoed as \\z) so that it is easy to see what the characters actually are
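The difference between a one-character escape and a two-character sequence can be seen with len and repr (repr is what the interactive prompt echoes). In this sketch the backslash-z string is written as "\\z", an assumption made to avoid the invalid-escape warning that newer Python versions emit for "\z".

```python
one = "\a"   # a recognised escape: a single character
two = "\\z"  # backslash followed by z: two characters

print(len(one), repr(one))  # 1 '\x07'
print(len(two), repr(two))  # 2 '\\z'
print(one.encode(), two.encode())
```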
unicode-escape is no different from utf8 or gbk: it is a character encoding format, only one that spells out Unicode code points
Special escape 1: \x and \0. \0 (octal) and \x (hexadecimal) are equivalent forms, e.g. "\x07" and "\007"
\a escapes the single character after the backslash
\x and \0 escape the two characters after them
For example, in "\xab" the a and b together form a hexadecimal number that locates a character in the code table, namely "«", so "\xab".encode() is the same as "«".encode()
The equivalent octal form is "\253" ("\0AB" is not equivalent, because A and B are not octal digits)
Therefore "\a" and "\x07" are the same thing; \x07 simply shows directly where \a sits in the code table: position 7
With this, the character set is expanded: "\x07", "\xab", and "\007" are each a single character
But in a file the backslash is really \\, so \x07 in a file is actually \\x07
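A sketch of the \x and octal forms, and of how the same four characters look when they come from a file as literal text. The name `file_text` is a hypothetical stand-in for text read from a file.

```python
ch = "\xab"  # hexadecimal escape for U+00AB, the character «

print(ch == "\u00ab" == "\253")  # True: hex, Unicode, and octal spellings agree
print(len(ch))                   # 1
print(ch.encode())               # b'\xc2\xab' (utf8 is the default)

# In a file, \x07 is four literal characters, not one escaped character.
file_text = r"\x07"
print(len(file_text), file_text == "\\x07")  # 4 True
```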
Special escape 2: \u440e, a Unicode code. The four characters after \u are escaped together, and those four characters must form a valid code in the Unicode table;
\xyy: yy denotes one byte, and a byte is an ASCII-style code, so the case of a code with no corresponding character does not arise;
it amounts to giving ASCII characters an alias
\uyyyy says the character must be looked up in the Unicode table, so the question of whether it exists does arise
So if the characters after \u do not form a valid four-digit hexadecimal code, an error is raised
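The failure case can be reproduced with the unicode-escape codec. A minimal sketch: the truncated input b"\\u44" is an assumed example of \u followed by too few hexadecimal digits.

```python
good = b"\\u440e".decode("unicode-escape")  # four hex digits: decodes to one character

try:
    b"\\u44".decode("unicode-escape")       # only two hex digits follow \u
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    print(exc.reason)

print(len(good), failed)  # 1 True
```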
Binary write and binary read involve no encoding problem: bytes are written and bytes are read
Problem:
The response was saved directly in binary, and the binary data contains \u with no valid content after it, so reading it raised an error
File reading problem 1:
The \u problem: utf8 and unicode-escape handle ASCII characters identically, but non-ASCII characters differently;
when a byte stream produced by unicode-escape encoding is decoded with utf8, punctuation and English text are fine, but the Unicode codes cannot be resolved by utf8,
and a text editor opens files as utf8 by default, so the Unicode codes are not translated through the Unicode table and are displayed in the raw form \u6211\u9978
Decoding with utf8 in code gives the same result: what prints is \u6211\u9978
So the root cause is still that encoding and decoding do not match; if there is an intermediate step, the document read back with utf8 will contain text of the form \u6211\u9978
Intermediate step:
data encoded with unicode-escape is decoded with utf8 and then saved with utf8 encoding; what is saved is actually "\\u6211\\u9978".encode(), the literal text, rather than "\u6211\u9978".encode(), i.e. b'\xe6\x88\x91\xe9\xa5\xb8'
so decoding what we get with utf8 naturally yields the string \u6211\u9978
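The intermediate step can be replayed in memory. This is a sketch under the assumption that the data was pure ASCII after unicode-escape encoding; decoding the saved bytes with unicode-escape is shown as one possible repair, not necessarily the fix the original situation called for.

```python
text = "\u6211\u9978"

escaped = text.encode("unicode-escape")  # b'\\u6211\\u9978', pure ASCII bytes
wrong = escaped.decode("utf8")           # mismatched decode: literal backslash-u text
saved = wrong.encode("utf8")             # what ends up in the file
read_back = saved.decode("utf8")         # still the literal text

print(read_back)                         # \u6211\u9978
print(saved == escaped)                  # True

fixed = saved.decode("unicode-escape")   # match the decode to the original encode
print(fixed == text)                     # True
```

The repair works here only because every byte is ASCII; bytes that already contain utf8-encoded characters would not survive a unicode-escape decode intact.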
File reading problem 2:
In addition to problem 1:
writing with wb and reading with utf8 produces the same \uxxxx problem
Open the file in PyCharm and check the lower-right corner to see which encoding it is in, then fix accordingly
Also:
A file always stores encoded bytes; characters are never stored directly. But when the encoding is specified at open time, there is no need to call .encode() manually;
wb writes bytes, and reading parses byte by byte; if a character spans several bytes and is decoded one byte at a time, an error occurs
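The multi-byte failure can be shown without touching a file. A sketch, assuming the character "我" (U+6211, three utf8 bytes); using the standard library's incremental decoder is one possible way to feed bytes piece by piece, offered here as an illustration rather than the method the notes had in mind.

```python
import codecs

data = "\u6211".encode("utf8")  # three bytes for one character

try:
    data[:1].decode("utf8")     # decoding a single byte in isolation
    split_ok = True
except UnicodeDecodeError:
    split_ok = False

# An incremental decoder buffers incomplete sequences between calls.
decoder = codecs.getincrementaldecoder("utf8")()
pieces = [decoder.decode(bytes([b])) for b in data]  # feed one byte at a time
result = "".join(pieces) + decoder.decode(b"", final=True)

print(len(data), split_ok, result == "\u6211")  # 3 False True
```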