Python: converting a text file's encoding

While putting together weekly reports recently, I needed to extract data from CSV text files to build tables and generate charts.

To read a CSV file I normally use with open(filename, encoding='utf-8') as f:, but in practice some of the CSV files turned out not to be in UTF-8 format, which caused an error while the program was running. Each time I had to manually convert the file's encoding to UTF-8 and run the program again. So the goal is to detect the text encoding directly in the program and convert it when needed.
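
A minimal reproduction of the problem (the file name and contents here are made up for illustration):

# Write a small CSV in GBK, then try to read it back as UTF-8
with open('report.csv', 'w', encoding='gbk') as f:
    f.write('名称,数量\n')

with open('report.csv', encoding='utf-8') as f:
    f.read()   # raises UnicodeDecodeError: invalid continuation byte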

The basic idea: determine whether the file is UTF-8 encoded; if not, convert it to UTF-8, then process it.

Python's chardet library can report a file's encoding information:

Its detect function takes exactly one argument, a non-Unicode string (i.e. raw bytes), and returns a dictionary such as {'encoding': 'utf-8', 'confidence': 0.99}, containing the detected encoding format and the confidence of that detection.

import chardet

def get_encode_info(file):
    # Read the raw bytes and let chardet guess the encoding
    with open(file, 'rb') as f:
        return chardet.detect(f.read())['encoding']
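
If you also want to see the confidence, print the whole dictionary that detect returns (a quick standalone check; the output shown is only an example):

import chardet

with open('test.csv', 'rb') as f:
    print(chardet.detect(f.read()))
# e.g. {'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}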

This approach performs fine on small files, but with more text it becomes very slow; my local CSV file is nearly 200 KB, and the slowdown is clearly noticeable. For this case chardet provides the UniversalDetector object: create a UniversalDetector, then call its feed method repeatedly with each block of text. When the detector reaches a minimum confidence threshold, it sets detector.done to True. Once you run out of source text, call detector.close(), which performs some final calculations in case the detector never reached the minimum confidence threshold. The result is a dictionary containing the detected encoding and the confidence (the same as what chardet.detect returns).

from chardet.universaldetector import UniversalDetector

def get_encode_info(file):
    with open(file, 'rb') as f:
        detector = UniversalDetector()
        # Feed the detector line by line and stop as soon as it is
        # confident, instead of reading the whole file into memory first
        for line in f:
            detector.feed(line)
            if detector.done:
                break
        detector.close()
        return detector.result['encoding']
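
If many files need to be checked, chardet also allows reusing one detector across files by calling its reset() method between them. A sketch, with hypothetical file names:

from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
for name in ['a.csv', 'b.csv']:   # hypothetical file names
    detector.reset()
    with open(name, 'rb') as f:
        for line in f:
            detector.feed(line)
            if detector.done:
                break
    detector.close()
    print(name, detector.result['encoding'])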


I did run into a problem while transcoding: UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 178365: character maps to <undefined>

def read_file(file):
    with open(file, 'rb') as f:
        return f.read()

def write_file(content, file):
    with open(file, 'wb') as f:
        f.write(content)

def convert_encode2utf8(file, original_encode, des_encode):
    file_content = read_file(file)
    file_decode = file_content.decode(original_encode)   # --> the problem is here
    file_encode = file_decode.encode(des_encode)
    write_file(file_encode, file)


This happens because some bytes cannot be decoded with that character set; the fix is to pass the additional errors parameter. The official documentation says:

bytearray.decode(encoding="utf-8", errors="strict")

Return a string decoded from the given bytes. Default encoding is 'utf-8'. errors may be given to set a different error handling scheme. The default for errors is 'strict', meaning that encoding errors raise a UnicodeError. Other possible values are 'ignore', 'replace' and any other name registered via codecs.register_error(), see section Error Handlers. For a list of possible encodings, see section Standard Encodings.

In other words, when decoding a byte array into a string, the error handling scheme can be changed from the default 'strict', which may raise a UnicodeError, to 'ignore' or 'replace', either of which avoids the exception.

So changing the line file_decode = file_content.decode(original_encode) to file_decode = file_content.decode(original_encode, 'ignore') fixes it.
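
A standalone illustration of the three error handlers (the byte string is made up for the demo):

data = b'caf\xe9'   # not valid UTF-8: 0xe9 starts an incomplete sequence

# data.decode('utf-8')                  # 'strict' (default): raises UnicodeDecodeError
print(data.decode('utf-8', 'ignore'))   # 'caf'  -- the bad byte is dropped
print(data.decode('utf-8', 'replace'))  # 'caf\ufffd' -- replaced with U+FFFD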

Complete code:

from chardet.universaldetector import UniversalDetector

def get_encode_info(file):
    with open(file, 'rb') as f:
        detector = UniversalDetector()
        for line in f:
            detector.feed(line)
            if detector.done:
                break
        detector.close()
        return detector.result['encoding']

def read_file(file):
    with open(file, 'rb') as f:
        return f.read()

def write_file(content, file):
    with open(file, 'wb') as f:
        f.write(content)

def convert_encode2utf8(file, original_encode, des_encode):
    file_content = read_file(file)
    file_decode = file_content.decode(original_encode, 'ignore')
    file_encode = file_decode.encode(des_encode)
    write_file(file_encode, file)

if __name__ == "__main__":
    filename = r'C:\Users\danvy\Desktop\Automation\testdata\test.csv'
    encode_info = get_encode_info(filename)
    if encode_info != 'utf-8':
        convert_encode2utf8(filename, encode_info, 'utf-8')
    encode_info = get_encode_info(filename)
    print(encode_info)
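
Once the file is UTF-8, the CSV can be read as originally intended. A small sketch, assuming the filename variable from the block above:

import csv

with open(filename, encoding='utf-8') as f:
    for row in csv.reader(f):
        print(row)   # each row is a list of column values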


Origin www.cnblogs.com/danvy/p/11417782.html