Python学习笔记（三十六）Unicode 和 byte 字符串（Unicode and Byte Strings）

1. Python 3.X中字符串对象类型的名称（names）和角色（roles）是什么？
答: Python 3.X有三种字符串类型：str（用于Unicode文本，包括ASCII），bytes（用于具有绝对字节值【 absolute byte values】的二进制数据）和 bytearray（可变字节的形式）。 str类型通常表示存储在文本文件中的内容，而其他两种类型通常表示存储在二进制文件中的内容。

2. Python 2.X中字符串对象类型的名称和角色是什么？
答: Python 2.X有两种主要的字符串类型：str（用于8位文本和二进制数据）和 unicode（用于可能更宽的字符Unicode文本）。 str类型用于文本和二进制文件内容; unicode用于通常比8位字符更复杂的文本文件内容。 Python 2.6（但不是更早）也有3.X的bytearray类型，但它主要是一个后端口【back-port】，并没有表现出它在3.X中的明显文本/二进制区别。

3. 2.X和3.X字符串类型之间的映射关系是什么？
答:从2.X到3.X字符串类型的映射不是直接的，因为2.X的str等于3.X中的str和bytes，而3.X的str等于2.X中的str和unicode。 3.X中bytearray的可变性也是唯一的。通常，虽然Unicode文本由3.X str 和2.X unicode处理，基于字节的数据由3.X bytes和2.X str处理，而3.X bytes和2.X str都可以处理一些更简单的文本类型。

4. Python 3.X的字符串类型在操作方面有何不同？
答:Python 3.X的字符串类型几乎共享所有相同的操作：方法调用（method calls），序列操作（sequence operations），甚至模式匹配（pattern matching）等更大的工具也以相同的方式工作。另一方面，只有str支持字符串格式化操作，而bytearray具有执行就地更改（in-place changes）的另外的操作集。 str和bytes类型还分别具有用于编码和解码文本的方法。

5.如何在3.X中的字符串中编码（encode）非ASCII的Unicode字符？
答:非ASCII Unicode字符可以用hex（\ xNN）和Unicode（\ uNNNN，\ UNNNNNNNN）转义字符串来编码。在某些机器上，某些非ASCII字符（例如某些Latin-1字符）也可以直接键入或粘贴到代码中，并根据UTF-8默认值或源代码编码指令注释进行解释（directive comment）。

6. Python 3.X中的文本和二进制模式文件之间的主要区别是什么？
答:在3.X中，文本模式文件假定其文件内容是Unicode文本（即使它是全ASCII），并在写入时自动解码和编码。对于二进制模式文件，字节将不加改变地传入和传出文件。文本模式文件的内容通常在脚本中表示为str对象，二进制文件的内容表示为bytes（或bytearray）对象。文本模式的文件还处理某些编码类型的BOM【 byte order marker】，并自动将输入和输出上的单个\ n字符的行尾序列转换为单个\n字符，除非明确禁用；二进制模式文件不执行任何这些步骤。 Python 2.X将codecs.open用于Unicode文件，其编码和解码类似; 2.X的open仅在文本模式下只在文本模式下翻译行结束（line ends）。

7.您如何读取包含与平台默认编码不同的文本的Unicode文本文件？
答:要读取以与平台默认编码不同的编码而编码之后的文件，只需将文件编码的名称传递给3.X中的open内置函数（2.X中的codecs.open()）;从文件中读取数据时，将按指定的编码对数据进行解码。您还可以读取二进制模式并通过提供编码名称手动将字节解码为字符串，但这涉及额外的工作，并且在多字节字符方面有些容易出错（您可能会意外地读取部分字符序列）。

8.如何以特定的编码格式创建Unicode文本文件？
答:要以特定的编码格式创建Unicode文本文件，请传递所需的编码名称到3.X的 open （2.X中的codecs.open()）；在将字符串写入文件时，将根据所需的编码对字符串进行编码。您还可以手动将字符串编码为字节并以二进制模式写入，但这通常是额外的工作。

9.为什么ASCII文本被认为是Unicode文本的一种？
答: ASCII文本被认为是一种Unicode文本，因为它的7位值范围是大多数Unicode编码的子集。例如，有效的ASCII文本也是有效的Latin-1文本（Latin-1只是将剩余的可能值以8位字节分配给其他字符）和有效的UTF-8文本（UTF-8定义了一个表示更多字符的可变字节格式，但ASCII字符仍然用相同的代码以单个字节表示。）。这使得Unicode向后兼容世界上大量的ASCII文本数据（尽管它也可能限制了它的选项 - 例如，自我识别的文本可能很难识别(尽管BOMs扮演着同样的角色)。

10. Python 3.X的字符串类型的更改对您的代码有多大影响？
答:Python 3.X的字符串类型更改的影响取决于您使用的字符串类型。对于在具有ASCII兼容的默认编码的平台上使用简单ASCII文本的脚本，影响可能很小：在这种情况下，str字符串类型在2.X和3.X中的工作方式相同。此外，尽管标准库中的字符串相关工具（如re，struct，pickle和xml）在技术上可能在3.X中使用的和在2.X中使用的是不同类型的，但这些更改在很大程度上与大多数程序无关，因为3.X的str和bytes和2.X的str支持几乎相同的接口。如果您处理Unicode数据，您需要的工具集只需从2.X的unicode和codecs.open()移动到3.X的str和open。如果处理二进制数据文件，则需要将内容作为bytes对象处理；因为它们具有与2.X字符串类似的接口，但影响应该再次最小化。尽管如此，针对3.X的Programming Python的更新遇到了许多情况，其中Unicode在3.X中的强制状态意味着标准库API，从网络和GUI到数据库和电子邮件的变化。总体而言，Unicode最终可能会影响大多数3.X用户。

一些其它的东西：

1.Python2.X的str和Unicode变异为3.X的bytes和str类型

2.编码：根据所需的编码将字符串转化为其原始字节形式的过程

3.在2.X中，使用str处理简单的文本和二进制数据，unicode处理无法映射到8位字节的高级文本；在3.X中，使用str处理任意文本，bytes和bytearray处理二进制数据

4.Python 3.X中，str和bytes不会自动转换

（1）str.encode() bytes(S, encoding)

（2）bytes.decode() str(B, encoding)

str -> bytes：把文本编码为原始字节

bytes -> str：把原始字节解码为文本

5.Python 2.X：str -> uni, uni -> str

Python 3.X：bytes -> str，str -> bytes

6.Unicode cod point（unicode码点）可以看作为 character（字符）

7.集合的使用

# Attribute in str but not bytes
>>>set(dir('abc')) - set(dir(b'abc'))

# Attribute in bytes but not str
>>>set(dir(b'abc')) - set(dir('abc'))

8.bytes特殊用法

>>>B = bytes([97, 98, 99])
>>>B
'abc'

9.注意：
1. 使用str处理文本数据

2. 使用bytes处理二进制数据

3. 使用bytearray处理希望原地（in-place）改变的二进制数据

注：转载《Learning Python 5th Edition》[奥莱理]

1. What are the names and roles of string object types in Python 3.X?
2. What are the names and roles of string object types in Python 2.X?
3. What is the mapping between 2.X and 3.X string types?
4. How do Python 3.X's string types differ in terms of operations?
5. How can you code non-ASCII Unicode characters in a string in 3.X?
6. What are the main differences between text- and binary-mode files in Python 3.X?
7. How would you read a Unicode text file that contains text in a different encoding than the default for your platform?
8. How can you create a Unicode text file in a specific encoding format?
9. Why is ASCII text considered to be a kind of Unicode text?
10. How large an impact does Python 3.X's string types change have on your code?

1. Python 3.X has three string types: str (for Unicode text, including ASCII), bytes(for binary data with absolute byte values), and bytearray (a mutable flavor of bytes). The str type usually represents content stored on a text file, and the other two types generally represent content stored on binary files.

2. Python 2.X has two main string types: str (for 8-bit text and binary data) and unicode (for possibly wider character Unicode text). The str type is used for both text and binary file content; unicode is used for text file content that is generally more complex than 8-bit characters. Python 2.6 (but not earlier) also has 3.X's bytearray type, but it's mostly a back-port and doesn't exhibit the sharp text/binary distinction that it does in 3.X.

3. The mapping from 2.X to 3.X string types is not direct, because 2.X's str equates to both str and bytes in 3.X, and 3.X's str equates to both str and unicode in 2.X. The mutability of bytearray in 3.X is also unique. In general, though: Unicode text is handled by 3.X str and 2.X unicode, byte-based data is handled by 3.X bytes and 2.X str, and 3.X bytes and 2.X str can both handle some simpler types of text.

4. Python 3.X's string types share almost all the same operations: method calls, sequence operations, and even larger tools like pattern matching work the same way. On the other hand, only str supports string formatting operations, and bytear ray has an additional set of operations that perform in-place changes. The str and bytes types also have methods for encoding and decoding text, respectively.

5. Non-ASCII Unicode characters can be coded in a string with both hex (\xNN) and Unicode (\uNNNN, \UNNNNNNNN) escapes. On some machines, some non-ASCII characters—certain Latin-1 characters, for example—can also be typed or pasted directly into code, and are interpreted per the UTF-8 default or a source code encoding directive comment.

6. In 3.X, text-mode files assume their file content is Unicode text (even if it's all ASCII) and automatically decode when reading and encode when writing. With binary-mode files, bytes are transferred to and from the file unchanged. The contents of text-mode files are usually represented as str objects in your script, and the contents of binary files are represented as bytes (or bytearray) objects. Textmode files also handle the BOM for certain encoding types andautomatically translate end-of-line sequences to and from the single \n character on input and output unless this is explicitly disabled; binary-mode files do not perform either of these steps. Python 2.X uses codecs.open for Unicode files, which encodes and decodes similarly; 2.X's open only translates line ends in text mode.

7. To read files encoded in a different encoding than the default for your platform, simply pass the name of the file's encoding to the open built-in in 3.X (codecs.open() in 2.X); data will be decoded per the specified encoding when it is read from the file. You can also read in binary mode and manually decode the bytes to a string by giving an encoding name, but this involves extra work and is somewhat error-prone for multibyte characters (you may accidentally read a partial character sequence).

8. To create a Unicode text file in a specific encoding format, pass the desired encoding name to open in 3.X (codecs.open() in 2.X); strings will be encoded per the desired encoding when they are written to the file. You can also manually encode a string to bytes and write it in binary mode, but this is usually extra work.

9. ASCII text is considered to be a kind of Unicode text, because its 7-bit range of values is a subset of most Unicode encodings. For example, valid ASCII text is also valid Latin-1 text (Latin-1 simply assigns the remaining possible values in an 8-bit byte to additional characters) and valid UTF-8 text (UTF-8 defines a variable-byte scheme for representing more characters, but ASCII characters are still represented with the same codes, in a single byte). This makes Unicode backward-compatible with the mass of ASCII text data in the world (though it also may have limited its options—self-identifying text, for instance, may have been difficult (though BOMs serve much the same role).

10. The impact of Python 3.X's string types change depends upon the types of strings you use. For scripts that use simple ASCII text on platforms with ASCII-compatible default encodings, the impact is probably minor: the str string type works the same
in 2.X and 3.X in this case. Moreover, although string-related tools in the standard library such as re, struct, pickle, and xml may technically use different types in 3.X than in 2.X, the changes are largely irrelevant to most programs because 3.X's str and bytes and 2.X's str support almost identical interfaces. If you process Unicode data, the toolset you need has simply moved from 2.X's unicode and codecs.open() to 3.X's str and open. If you deal with binary data files, you'll need to deal with content as bytes objects; since they have a similar interface to 2.X strings, though, the impact should again be minimal. That said, the update of the book Programming Python for 3.X ran across numerous cases where Unicode's mandatory status in 3.X implied changes in standard library APIs—from networking and GUIs, to databases and email. In general, Unicode will probably impact most 3.X users eventually.

Python学习笔记（三十六）Unicode 和 byte 字符串（Unicode and Byte Strings）

猜你喜欢