Python Standard Library: codecs for Encoding and Decoding Strings

Reprinted from: http://www.pinlue.com/article/2020/03/3100/5210101088002.html

1. codecs for Encoding and Decoding Strings

The codecs module provides stream and file interfaces for translating between different representations of text data. It is most commonly used to work with Unicode text, but other encodings are also available for other purposes.

1.1 Unicode Primer

CPython 3.x distinguishes between text (str) and byte (bytes) strings. bytes instances hold a sequence of 8-bit byte values. In contrast, str strings are managed internally as a sequence of Unicode code points. Each code point value is stored using 2 or 4 bytes, depending on the options given when Python was compiled.

When str values are written out, they are encoded using one of several standard schemes so that the sequence of bytes can later be reconstructed as the same string of text. The bytes of the encoded value are not necessarily identical to the code point values; the encoding simply defines a way to translate between the two sets of values. Reading Unicode data also requires knowing the encoding, so that the incoming bytes can be converted to the internal representation used by the unicode class.

The most common encodings for Western languages are UTF-8 and UTF-16, which use sequences of one- and two-byte values, respectively, to represent each code point. Other encodings can be more efficient for storing languages in which most of the characters are represented by code points that do not fit into two bytes.
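To make the size difference concrete, here is a minimal sketch (an addition, not one of the original listings) comparing the encoded lengths of a short French string; the 'utf-16-le' variant is used so the two-byte BOM prefix does not skew the count.

```python
text = 'français'  # 8 code points, one of them outside ASCII

utf8 = text.encode('utf-8')
utf16 = text.encode('utf-16-le')  # '-le' variant omits the BOM

print(len(text))   # 8 code points
print(len(utf8))   # 9 bytes: 'ç' takes two bytes in UTF-8
print(len(utf16))  # 16 bytes: every character here takes two bytes
```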

The best way to understand encodings is to encode the same string using different schemes and look at the different byte sequences produced. The examples below use this function to format the byte strings so they are easier to read.

import binascii


def to_hex(t, nbytes):
    """Format text t as a sequence of nbyte long values
    separated by spaces.
    """
    chars_per_item = nbytes * 2
    hex_version = binascii.hexlify(t)
    return b' '.join(
        hex_version[start:start + chars_per_item]
        for start in range(0, len(hex_version), chars_per_item)
    )


if __name__ == '__main__':
    print(to_hex(b'abcdef', 1))
    print(to_hex(b'abcdef', 2))

This function uses binascii to get a hexadecimal representation of the input byte string, then inserts a space between every nbytes bytes before returning the value.

The first encoding example begins by printing the text 'français' using the raw representation of the unicode class, followed by the name of each character from the Unicode database. The next two lines encode the string as UTF-8 and UTF-16, respectively, and show the hexadecimal values resulting from the encoding.

import unicodedata
import binascii


def to_hex(t, nbytes):
    """Format text t as a sequence of nbyte long values
    separated by spaces.
    """
    chars_per_item = nbytes * 2
    hex_version = binascii.hexlify(t)
    return b' '.join(
        hex_version[start:start + chars_per_item]
        for start in range(0, len(hex_version), chars_per_item)
    )


text = 'français'

print('Raw : {!r}'.format(text))
for c in text:
    print(' {!r}: {}'.format(c, unicodedata.name(c, c)))
print('UTF-8 : {!r}'.format(to_hex(text.encode('utf-8'), 1)))
print('UTF-16: {!r}'.format(to_hex(text.encode('utf-16'), 2)))

The result of encoding a str is a bytes object.

Given a sequence of encoded bytes (as a bytes instance), the decode() method translates them to code points and returns the sequence as a str instance.

import binascii


def to_hex(t, nbytes):
    """Format text t as a sequence of nbyte long values
    separated by spaces.
    """
    chars_per_item = nbytes * 2
    hex_version = binascii.hexlify(t)
    return b' '.join(
        hex_version[start:start + chars_per_item]
        for start in range(0, len(hex_version), chars_per_item)
    )


text = 'français'

encoded = text.encode('utf-8')
decoded = encoded.decode('utf-8')

print('Original :', repr(text))
print('Encoded :', to_hex(encoded, 1), type(encoded))
print('Decoded :', repr(decoded), type(decoded))

The choice of encoding used does not change the output type.
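A quick round-trip sketch (a supplementary example, not one of the original listings) shows that as long as the same codec is used for both steps, the decoded value is an identical str regardless of which encoding carried the bytes:

```python
text = 'français'

for encoding in ['utf-8', 'utf-16', 'utf-32']:
    decoded = text.encode(encoding).decode(encoding)
    print(encoding, type(decoded).__name__, decoded == text)
```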

1.2 Working with Files

Encoding and decoding strings is especially important when dealing with I/O operations. Whether writing to a file, a socket, or another stream, the data must use the proper encoding. In general, all text data needs to be decoded from its byte representation as it is read, and encoded from the internal values to a specific representation as it is written. A program can explicitly encode and decode data, but depending on the encoding used, it can be non-trivial to determine whether enough bytes have been read to fully decode the data. codecs provides classes that manage the encoding and decoding of data, so applications no longer have to do that work themselves.

The simplest interface provided by codecs is an alternative to the built-in open() function. The new version works much like the built-in one, but adds two arguments to specify the encoding and the desired error-handling technique.

import binascii
import codecs


def to_hex(t, nbytes):
    """Format text t as a sequence of nbyte long values
    separated by spaces.
    """
    chars_per_item = nbytes * 2
    hex_version = binascii.hexlify(t)
    return b' '.join(
        hex_version[start:start + chars_per_item]
        for start in range(0, len(hex_version), chars_per_item)
    )


encodings = ['utf-8', 'utf-16', 'utf-32']

for encoding in encodings:
    filename = encoding + '.txt'

    print('Writing to', filename)
    with codecs.open(filename, mode='w', encoding=encoding) as f:
        f.write('français')

    # Determine the byte grouping to use for to_hex()
    nbytes = {
        'utf-8': 1,
        'utf-16': 2,
        'utf-32': 4,
    }.get(encoding, 1)

    # Show the raw bytes in the file
    print('File contents:')
    with open(filename, mode='rb') as f:
        print(to_hex(f.read(), nbytes))

This example starts with a unicode string containing 'ç' and saves the text to a file using the specified encoding.

Reading the data with open() is straightforward, with one catch: the correct encoding must be known in advance in order to set up the decoder. Although some data formats, such as XML, specify the encoding as part of the file, usually it is up to the application to manage. codecs simply takes the encoding as an argument and assumes it is correct.

import binascii
import codecs


def to_hex(t, nbytes):
    """Format text t as a sequence of nbyte long values
    separated by spaces.
    """
    chars_per_item = nbytes * 2
    hex_version = binascii.hexlify(t)
    return b' '.join(
        hex_version[start:start + chars_per_item]
        for start in range(0, len(hex_version), chars_per_item)
    )


encodings = ['utf-8', 'utf-16', 'utf-32']

for encoding in encodings:
    filename = encoding + '.txt'

    print('Reading from', filename)
    with codecs.open(filename, mode='r', encoding=encoding) as f:
        print(repr(f.read()))

This example reads the files created by the previous program and prints the representation of the resulting unicode objects to the console.

1.3 Byte Order

Multibyte encodings such as UTF-16 and UTF-32 pose a problem when transferring data between different computer systems, whether by copying a file directly or by communicating over a network. Different systems use different ordering of the high- and low-order bytes. This characteristic of the data, known as its endianness, depends on factors such as the hardware architecture, the operating system, and choices made by the application developer. There is not always a way to know in advance which byte order a given set of data uses, so the multibyte encodings include a byte-order marker (BOM) in the first few bytes of encoded output. For example, UTF-16 defines 0xFFFE and 0xFEFF as not being valid characters, so they can be used to indicate the byte order. codecs defines constants for the byte-order markers used by UTF-16 and UTF-32.

import codecs
import binascii


def to_hex(t, nbytes):
    """Format text t as a sequence of nbyte long values
    separated by spaces.
    """
    chars_per_item = nbytes * 2
    hex_version = binascii.hexlify(t)
    return b' '.join(
        hex_version[start:start + chars_per_item]
        for start in range(0, len(hex_version), chars_per_item)
    )


BOM_TYPES = [
    'BOM', 'BOM_BE', 'BOM_LE',
    'BOM_UTF8',
    'BOM_UTF16', 'BOM_UTF16_BE', 'BOM_UTF16_LE',
    'BOM_UTF32', 'BOM_UTF32_BE', 'BOM_UTF32_LE',
]

for name in BOM_TYPES:
    print('{:12} : {}'.format(
        name, to_hex(getattr(codecs, name), 2)))

BOM, BOM_UTF16, and BOM_UTF32 are automatically set to the appropriate big-endian or little-endian values, depending on the native byte order of the current system.

Byte ordering is detected and handled automatically by the decoders in codecs, but an explicit ordering can also be specified when encoding.

import codecs
import binascii


def to_hex(t, nbytes):
    """Format text t as a sequence of nbyte long values
    separated by spaces.
    """
    chars_per_item = nbytes * 2
    hex_version = binascii.hexlify(t)
    return b' '.join(
        hex_version[start:start + chars_per_item]
        for start in range(0, len(hex_version), chars_per_item)
    )


# Pick the non-native version of UTF-16 encoding
if codecs.BOM_UTF16 == codecs.BOM_UTF16_BE:
    bom = codecs.BOM_UTF16_LE
    encoding = 'utf_16_le'
else:
    bom = codecs.BOM_UTF16_BE
    encoding = 'utf_16_be'

print('Native order :', to_hex(codecs.BOM_UTF16, 2))
print('Selected order:', to_hex(bom, 2))

# Encode the text.
encoded_text = 'français'.encode(encoding)
print('{:14}: {}'.format(encoding, to_hex(encoded_text, 2)))

with open('nonnative-encoded.txt', mode='wb') as f:
    # Write the selected byte-order marker. It is not included
    # in the encoded text because the byte order was given
    # explicitly when selecting the encoding.
    f.write(bom)

    # Write the byte string for the encoded text.
    f.write(encoded_text)

This program figures out the native byte ordering, then explicitly uses the alternate form, so the next example can demonstrate auto-detection while reading.

The program below does not specify a byte order when opening the file, so the decoder uses the BOM value in the first two bytes of the file to determine it.

import codecs
import binascii


def to_hex(t, nbytes):
    """Format text t as a sequence of nbyte long values
    separated by spaces.
    """
    chars_per_item = nbytes * 2
    hex_version = binascii.hexlify(t)
    return b' '.join(
        hex_version[start:start + chars_per_item]
        for start in range(0, len(hex_version), chars_per_item)
    )


# Look at the raw data
with open('nonnative-encoded.txt', mode='rb') as f:
    raw_bytes = f.read()

print('Raw :', to_hex(raw_bytes, 2))

# Re-open the file and let codecs detect the BOM
with codecs.open('nonnative-encoded.txt',
                 mode='r',
                 encoding='utf-16',
                 ) as f:
    decoded_text = f.read()

print('Decoded:', repr(decoded_text))

Since the first two bytes of the file are used for byte-order detection, they are not included in the data returned by read().

1.4 Error Handling

The previous sections pointed out the need to know the encoding being used when reading and writing Unicode files. Setting the encoding correctly is important for two reasons. First, if the encoding is configured incorrectly while reading a file, the data will be interpreted wrong and may be corrupted or fail to decode entirely. Second, not all Unicode characters can be represented in all encodings, so writing with the wrong encoding may generate an error or silently lose data.

codecs uses the same five error-handling options that are provided by the encode() method of str and the decode() method of bytes.

Error mode          Description
strict              Raises an exception if the data cannot be converted.
replace             Substitutes a special marker character for data that cannot be encoded.
ignore              Skips the data.
xmlcharrefreplace   Replaces the character with an XML character reference (encoding only).
backslashreplace    Replaces the character with a backslash escape sequence (encoding only).
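The table can be demonstrated directly with str.encode(); this is a supplementary sketch, not one of the original listings.

```python
text = 'français'  # 'ç' has no ASCII representation

print(text.encode('ascii', 'replace'))            # b'fran?ais'
print(text.encode('ascii', 'ignore'))             # b'franais'
print(text.encode('ascii', 'xmlcharrefreplace'))  # b'fran&#231;ais'
print(text.encode('ascii', 'backslashreplace'))   # b'fran\\xe7ais'

try:
    text.encode('ascii', 'strict')
except UnicodeEncodeError as err:
    print('strict raised:', err)
```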

1.4.1 Encoding Errors

The most common error condition is receiving a UnicodeEncodeError when writing Unicode data to an ASCII output stream, such as a regular file or sys.stdout.

import codecs

error_handlings = ['strict', 'replace', 'ignore',
                   'xmlcharrefreplace', 'backslashreplace']

text = 'français'

for error_handling in error_handlings:
    try:
        # Save the data, encoded as ASCII, using the
        # specified error handling mode.
        with codecs.open('encode_error.txt', 'w',
                         encoding='ascii',
                         errors=error_handling) as f:
            f.write(text)
    except UnicodeEncodeError as err:
        print('ERROR:', err)
    else:
        # If there was no error writing to the file,
        # show what it contains.
        with open('encode_error.txt', 'rb') as f:
            print('File contents: {!r}'.format(f.read()))

The first option, strict, ensures that the application explicitly sets the correct encoding for all I/O operations. It is the safest choice, but an exception in this mode can cause the program to crash.

The second option, replace, ensures that no error is raised, at the expense of possibly losing data that cannot be converted to the requested encoding. The character 'ç' still cannot be encoded in ASCII, but with this error-handling mode, instead of an exception being raised, the character is replaced with ? in the output.

The third option, ignore, simply discards any data that cannot be encoded.

The fourth option replaces the character with an alternate representation defined by a standard separate from the encoding. xmlcharrefreplace uses an XML character reference as the substitute.

The fifth option, like the fourth, replaces the character with an alternate representation defined outside the encoding. It produces an output format similar to the value returned when repr() of a unicode object is printed: Unicode characters are replaced with \u escape sequences containing the hexadecimal value of the code point.

1.4.2 Decoding Errors

It is also possible to see errors when decoding data, especially if the wrong encoding is used.

import codecs
import binascii


def to_hex(t, nbytes):
    """Format text t as a sequence of nbyte long values
    separated by spaces.
    """
    chars_per_item = nbytes * 2
    hex_version = binascii.hexlify(t)
    return b' '.join(
        hex_version[start:start + chars_per_item]
        for start in range(0, len(hex_version), chars_per_item)
    )


error_handlings = ['strict', 'ignore', 'replace']

text = 'français'

for error_handling in error_handlings:
    print('Original :', repr(text))

    # Save the data with one encoding
    with codecs.open('decode_error.txt', 'w',
                     encoding='utf-16') as f:
        f.write(text)

    # Dump the bytes from the file
    with open('decode_error.txt', 'rb') as f:
        print('File contents:', to_hex(f.read(), 1))

    # Try to read the data with the wrong encoding
    with codecs.open('decode_error.txt', 'r',
                     encoding='utf-8',
                     errors=error_handling) as f:
        try:
            data = f.read()
        except UnicodeDecodeError as err:
            print('ERROR:', err)
        else:
            print('Read :', repr(data))

As with encoding, the strict error-handling mode raises an exception if the byte stream cannot be properly decoded. Here, the UnicodeDecodeError results from trying to convert part of the UTF-16 BOM to a character with the UTF-8 decoder.

Switching to ignore causes the decoder to skip over the invalid bytes. The result is still not quite what was expected, though, because it includes embedded null bytes.

In replace mode, invalid bytes are replaced with \uFFFD, the official Unicode replacement character, which looks like a diamond with a black background containing a white question mark.
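The same handlers can be exercised directly with bytes.decode(); the byte string below is Latin-1-encoded text deliberately decoded as UTF-8 (a supplementary sketch, not an original listing).

```python
bad_utf8 = b'fran\xe7ais'  # 0xE7 is 'ç' in Latin-1 but is not valid UTF-8 here

print(bad_utf8.decode('utf-8', 'replace'))  # 'fran\ufffdais'
print(bad_utf8.decode('utf-8', 'ignore'))   # 'franais'

try:
    bad_utf8.decode('utf-8', 'strict')
except UnicodeDecodeError as err:
    print('strict raised:', err)
```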

1.5 Encoding Translation

Although most applications work with str data internally, decoding or encoding it as part of an I/O operation, there are times when changing a file's encoding without holding that intermediate data format is useful. EncodedFile() takes an open file handle that uses one encoding and wraps it in a class that translates the data to another encoding as the I/O occurs.

import binascii
import codecs
import io


def to_hex(t, nbytes):
    """Format text t as a sequence of nbyte long values
    separated by spaces.
    """
    chars_per_item = nbytes * 2
    hex_version = binascii.hexlify(t)
    return b' '.join(
        hex_version[start:start + chars_per_item]
        for start in range(0, len(hex_version), chars_per_item)
    )


# Raw version of the original data.
data = 'français'

# Manually encode it as UTF-8.
utf8 = data.encode('utf-8')
print('Start as UTF-8 :', to_hex(utf8, 1))

# Set up an output buffer, then wrap it as an EncodedFile.
output = io.BytesIO()
encoded_file = codecs.EncodedFile(output, data_encoding='utf-8',
                                  file_encoding='utf-16')
encoded_file.write(utf8)

# Fetch the buffer contents as a UTF-16 encoded byte string
utf16 = output.getvalue()
print('Encoded to UTF-16:', to_hex(utf16, 2))

# Set up another buffer with the UTF-16 data for reading,
# and wrap it with another EncodedFile.
buffer = io.BytesIO(utf16)
encoded_file = codecs.EncodedFile(buffer, data_encoding='utf-8',
                                  file_encoding='utf-16')

# Read the UTF-8 encoded version of the data.
recoded = encoded_file.read()
print('Back to UTF-8 :', to_hex(recoded, 1))

This example shows reading from and writing to separate handles returned by EncodedFile(). Whether the handle is used for reading or writing, file_encoding always refers to the encoding used by the open file handle passed as the first argument, and data_encoding refers to the encoding used by the data passing through the read() and write() calls.

1.6 Non-Unicode Encodings

Although most of the earlier examples use Unicode encodings, codecs can be used for many other data translations. For example, Python includes codecs for working with base-64, bzip2, ROT-13, ZIP, and other data formats.

import codecs
import io

buffer = io.StringIO()
stream = codecs.getwriter('rot_13')(buffer)

text = 'abcdefghijklmnopqrstuvwxyz'

stream.write(text)
stream.flush()

print('Original:', text)
print('ROT-13 :', buffer.getvalue())

Any transformation that can be expressed as a function taking a single input argument and returning a byte or Unicode string can be registered as a codec. For the 'rot_13' codec, the input should be a Unicode string, and the output is also a Unicode string.

Using codecs to wrap a data stream provides a simpler interface than working directly with zlib.

import codecs
import io

buffer = io.BytesIO()
stream = codecs.getwriter('zlib')(buffer)

text = b'abcdefghijklmnopqrstuvwxyz\n' * 50

stream.write(text)
stream.flush()

print('Original length :', len(text))
compressed_data = buffer.getvalue()
print('ZIP compressed :', len(compressed_data))

buffer = io.BytesIO(compressed_data)
stream = codecs.getreader('zlib')(buffer)
first_line = stream.readline()
print('Read first line :', repr(first_line))

uncompressed_data = first_line + stream.read()
print('Uncompressed :', len(uncompressed_data))
print('Same :', text == uncompressed_data)

Not all compression or encoding systems support reading a portion of the data through the stream interface using readline() or read(), because they need to find the end of a compressed segment in order to expand it. If a program cannot hold the entire uncompressed data set in memory, use the incremental access features of the compression library instead of codecs.
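As a hedged illustration of that alternative, zlib's decompressobj() can expand a stream in fixed-size chunks without ever holding the entire result at once (this sketch is an addition, not one of the original listings).

```python
import zlib

original = b'abcdefghijklmnopqrstuvwxyz\n' * 50
compressed = zlib.compress(original)

# Feed the compressed bytes to the decompressor in small pieces.
decompressor = zlib.decompressobj()
chunks = []
for start in range(0, len(compressed), 16):
    chunks.append(decompressor.decompress(compressed[start:start + 16]))
chunks.append(decompressor.flush())

print(b''.join(chunks) == original)  # True
```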

1.7 Incremental Encoding

Some of the encodings provided, especially zlib and bz2, may dramatically change the length of the data stream as they work on it. For large data sets, these encodings operate better incrementally, working on one small chunk of data at a time. The IncrementalEncoder/IncrementalDecoder API is designed for this purpose.

import codecs
import sys

text = b'abcdefghijklmnopqrstuvwxyz\n'
repetitions = 50

print('Text length :', len(text))
print('Repetitions :', repetitions)
print('Expected len:', len(text) * repetitions)

# Encode the text several times to build up a
# large amount of data
encoder = codecs.getincrementalencoder('bz2')()
encoded = []

print()
print('Encoding:', end=' ')
last = repetitions - 1
for i in range(repetitions):
    en_c = encoder.encode(text, final=(i == last))
    if en_c:
        print('\nEncoded : {} bytes'.format(len(en_c)))
        encoded.append(en_c)
    else:
        sys.stdout.write('.')

all_encoded = b''.join(encoded)
print()
print('Total encoded length:', len(all_encoded))
print()

# Decode the byte string one byte at a time
decoder = codecs.getincrementaldecoder('bz2')()
decoded = []

print('Decoding:', end=' ')
for i, b in enumerate(all_encoded):
    final = (i + 1) == len(all_encoded)
    c = decoder.decode(bytes([b]), final)
    if c:
        print('\nDecoded : {} characters'.format(len(c)))
        print('Decoding:', end=' ')
        decoded.append(c)
    else:
        sys.stdout.write('.')
print()

restored = b''.join(decoded)

print()
print('Total uncompressed length:', len(restored))

Each time data is passed to the encoder or decoder, its internal state is updated. When the state is consistent (as defined by the codec), data is returned and the state resets. Until that point, calls to encode() or decode() do not return any data. When the last bit of data is passed in, the final argument should be set to True, so the codec knows to flush any remaining buffered data.
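A smaller sketch of the same buffering behavior (a supplementary example, not one of the original listings): with the 'bz2' incremental encoder, intermediate encode() calls typically return empty byte strings until final=True flushes the compressor.

```python
import codecs

encoder = codecs.getincrementalencoder('bz2')()

first = encoder.encode(b'abcdef', final=False)  # typically b'' -- still buffered
second = encoder.encode(b'ghijkl', final=True)  # flushes the remaining data

# The concatenated output decodes back to the full input.
round_trip = codecs.decode(first + second, 'bz2')
print(round_trip)  # b'abcdefghijkl'
```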

1.8 Defining a Custom Encoding

Because Python already provides a large number of standard codecs, it is unlikely that an application will need to define a custom encoder or decoder. When it is necessary, though, several base classes in codecs make the process easier.

The first step is to understand the nature of the transformation described by the encoding. The examples in this section use an "invertcaps" encoding, which converts uppercase letters to lowercase and lowercase letters to uppercase. Here is a simple definition of an encoding function that performs this transformation on an input string.

import string


def invertcaps(text):
    """Return new string with the case of all letters switched.
    """
    return ''.join(
        c.upper() if c in string.ascii_lowercase
        else c.lower() if c in string.ascii_uppercase
        else c
        for c in text
    )


if __name__ == '__main__':
    print(invertcaps('ABCdef'))
    print(invertcaps('abcDEF'))

In this case, the encoder and decoder are the same function (as is also true of ROT-13).
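Since the module-level convenience functions accept any registered codec name, the ROT-13 transformation can also be applied without a stream wrapper; this small sketch is an addition, not one of the original listings.

```python
import codecs

scrambled = codecs.encode('abcdef', 'rot_13')
print(scrambled)                           # 'nopqrs'
print(codecs.encode(scrambled, 'rot_13'))  # 'abcdef' -- applying it twice restores the text
```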

Although it is easy to understand, this implementation is not efficient, especially for very large text strings. Fortunately, codecs includes some helper functions for creating character map based codecs such as invertcaps. A character map encoding is made up of two dictionaries. The encoding map converts character values from the input string to byte values in the output, and the decoding map goes the other way. Create the decoding map first, then use make_encoding_map() to convert it to an encoding map. The C functions charmap_encode() and charmap_decode() use the maps to convert their input data efficiently.

import codecs
import string

# Map every character to itself
decoding_map = codecs.make_identity_dict(range(256))

# Make a list of pairs of ordinal values for the lower
# and uppercase letters
pairs = list(zip(
    [ord(c) for c in string.ascii_lowercase],
    [ord(c) for c in string.ascii_uppercase],
))

# Modify the mapping to convert upper to lower and
# lower to upper.
decoding_map.update({
    upper: lower
    for (lower, upper)
    in pairs
})
decoding_map.update({
    lower: upper
    for (lower, upper)
    in pairs
})

# Create a separate encoding map.
encoding_map = codecs.make_encoding_map(decoding_map)

if __name__ == '__main__':
    print(codecs.charmap_encode('abcDEF', 'strict',
                                encoding_map))
    print(codecs.charmap_decode(b'abcDEF', 'strict',
                                decoding_map))
    print(encoding_map == decoding_map)

Although the encoding and decoding maps for invertcaps are the same, that is not always the case. make_encoding_map() detects situations where more than one input character is encoded to the same output byte and replaces the encoding value with None, marking the encoding as undefined.

The character map encoder and decoder support all of the standard error-handling methods described earlier, so no extra work is needed to comply with that part of the API.

import codecs
import string

# Map every character to itself
decoding_map = codecs.make_identity_dict(range(256))

# Make a list of pairs of ordinal values for the lower
# and uppercase letters
pairs = list(zip(
    [ord(c) for c in string.ascii_lowercase],
    [ord(c) for c in string.ascii_uppercase],
))

# Modify the mapping to convert upper to lower and
# lower to upper.
decoding_map.update({
    upper: lower
    for (lower, upper)
    in pairs
})
decoding_map.update({
    lower: upper
    for (lower, upper)
    in pairs
})

# Create a separate encoding map.
encoding_map = codecs.make_encoding_map(decoding_map)

text = 'pi: \u03c0'

for error in ['ignore', 'replace', 'strict']:
    try:
        encoded = codecs.charmap_encode(
            text, error, encoding_map)
    except UnicodeEncodeError as err:
        encoded = str(err)
    print('{:7}: {}'.format(error, encoded))

Because the Unicode code point for π is not in the encoding map, the strict error-handling mode raises an exception.

After the encoding and decoding maps are defined, a few additional classes need to be set up, and the encoding should be registered. register() adds a search function to the registry so that when a user wants to use the encoding, codecs can locate it. The search function must take a single string argument containing the encoding name, and return a CodecInfo object if it knows the encoding, or None if it does not.

import codecs


def search1(encoding):
    print('search1: Searching for:', encoding)
    return None


def search2(encoding):
    print('search2: Searching for:', encoding)
    return None


codecs.register(search1)
codecs.register(search2)

utf8 = codecs.lookup('utf-8')
print('UTF-8:', utf8)

try:
    unknown = codecs.lookup('no-such-encoding')
except LookupError as err:
    print('ERROR:', err)

Multiple search functions can be registered, and each will be called in turn until one returns a CodecInfo or they are all exhausted. The internal search function registered by codecs knows how to load the standard codecs, such as UTF-8, from the encodings package, so those names are never passed to custom search functions.

The CodecInfo instance returned by the search function tells codecs how to encode and decode using all of the supported mechanisms: stateless, incremental, and stream. codecs includes base classes that help with setting up a character map encoding. The following example puts all of the pieces together: it registers a search function that returns a CodecInfo instance configured for the invertcaps codec.

import codecs
import string

# Map every character to itself
decoding_map = codecs.make_identity_dict(range(256))

# Make a list of pairs of ordinal values for the lower
# and uppercase letters
pairs = list(zip(
    [ord(c) for c in string.ascii_lowercase],
    [ord(c) for c in string.ascii_uppercase],
))

# Modify the mapping to convert upper to lower and
# lower to upper.
decoding_map.update({
    upper: lower
    for (lower, upper)
    in pairs
})
decoding_map.update({
    lower: upper
    for (lower, upper)
    in pairs
})

# Create a separate encoding map.
encoding_map = codecs.make_encoding_map(decoding_map)


class InvertCapsCodec(codecs.Codec):
    "Stateless encoder/decoder"

    def encode(self, input, errors='strict'):
        return codecs.charmap_encode(input, errors, encoding_map)

    def decode(self, input, errors='strict'):
        return codecs.charmap_decode(input, errors, decoding_map)


class InvertCapsIncrementalEncoder(codecs.IncrementalEncoder):

    def encode(self, input, final=False):
        data, nbytes = codecs.charmap_encode(input,
                                             self.errors,
                                             encoding_map)
        return data


class InvertCapsIncrementalDecoder(codecs.IncrementalDecoder):

    def decode(self, input, final=False):
        data, nbytes = codecs.charmap_decode(input,
                                             self.errors,
                                             decoding_map)
        return data


class InvertCapsStreamReader(InvertCapsCodec,
                             codecs.StreamReader):
    pass


class InvertCapsStreamWriter(InvertCapsCodec,
                             codecs.StreamWriter):
    pass


def find_invertcaps(encoding):
    """Return the codec for 'invertcaps'.
    """
    if encoding == 'invertcaps':
        return codecs.CodecInfo(
            name='invertcaps',
            encode=InvertCapsCodec().encode,
            decode=InvertCapsCodec().decode,
            incrementalencoder=InvertCapsIncrementalEncoder,
            incrementaldecoder=InvertCapsIncrementalDecoder,
            streamreader=InvertCapsStreamReader,
            streamwriter=InvertCapsStreamWriter,
        )
    return None


codecs.register(find_invertcaps)

if __name__ == '__main__':

    # Stateless encoder/decoder
    encoder = codecs.getencoder('invertcaps')
    text = 'abcDEF'
    encoded_text, consumed = encoder(text)
    print('Encoded "{}" to "{}", consuming {} characters'.format(
        text, encoded_text, consumed))

    # Stream writer
    import io
    buffer = io.BytesIO()
    writer = codecs.getwriter('invertcaps')(buffer)
    print('StreamWriter for io buffer: ')
    print(' writing "abcDEF"')
    writer.write('abcDEF')
    print(' buffer contents: ', buffer.getvalue())

    # Incremental decoder
    decoder_factory = codecs.getincrementaldecoder('invertcaps')
    decoder = decoder_factory()
    decoded_text_parts = []
    for c in encoded_text:
        decoded_text_parts.append(
            decoder.decode(bytes([c]), final=False)
        )
    decoded_text_parts.append(decoder.decode(b'', final=True))
    decoded_text = ''.join(decoded_text_parts)
    print('IncrementalDecoder converted {!r} to {!r}'.format(
        encoded_text, decoded_text))

The stateless encoder/decoder base class is Codec, with encode() and decode() overridden with the new implementation (here, calling charmap_encode() and charmap_decode() respectively). Each method must return a tuple containing the transformed data and the number of input bytes or characters consumed. Conveniently, charmap_encode() and charmap_decode() already return that information.

IncrementalEncoder and IncrementalDecoder serve as base classes for the incremental interfaces. The encode() and decode() methods of the incremental classes are defined so that they only return the actual transformed data; any information about buffering is maintained as internal state. The invertcaps encoding does not need to buffer data (it uses a one-to-one mapping). For encodings that produce a different amount of output depending on the data being processed, such as compression algorithms, BufferedIncrementalEncoder and BufferedIncrementalDecoder are more appropriate base classes, because they manage the unprocessed portion of the input.

StreamReader and StreamWriter need encode() and decode() methods, too, and since they are expected to return the same values as the corresponding methods of Codec, multiple inheritance can be used for the implementation.

 


Origin: blog.csdn.net/yihuliunian/article/details/105293297