After Python reads a text file, two strings look identical but have different lengths (\ufeff); reading and writing JSON files with Python

1. Python reads text documents

When Python reads a txt file, a string read from it can look identical to the expected string and still have a different length. This happens when the file was saved with a BOM and is opened with the usual encoding="utf-8": the first string read from the file is then preceded by the character \ufeff, which has a length of 1. As a result, that string is always 1 longer than the actual text, and comparing it with the expected string fails. strip() cannot remove it, because \ufeff is not treated as whitespace.
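
The effect is easy to reproduce. The following is a minimal sketch, not the original script: the file name bom_sample.txt and the text "hello" are made up for illustration, and the file is deliberately written with a BOM by using the "utf-8-sig" codec.

```python
import os
import tempfile

# Hypothetical sample file, deliberately written WITH a BOM ("utf-8-sig" adds it).
path = os.path.join(tempfile.mkdtemp(), "bom_sample.txt")
with open(path, "w", encoding="utf-8-sig") as f:
    f.write("hello\n")

# Reading with plain "utf-8" keeps the BOM as the first character of the text.
with open(path, "r", encoding="utf-8") as f:
    line = f.readline().strip()  # strip() removes the newline, not the BOM

print(repr(line))        # repr() makes the invisible '\ufeff' visible
print(len(line))         # one more than len("hello")
print(line == "hello")   # False, although both look the same when printed
```

Printing with repr() is a quick way to spot the invisible character when two strings that look equal refuse to compare equal.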

After searching on the Internet, I found that a BOM had been included when the text was saved. The BOM (byte order mark) appears at the very beginning of a text file and is used in the Unicode standard to identify the file's encoding. When saving a txt file, be careful not to include a BOM. If the file already contains one, you can use the Notepad++ editor to convert it to UTF-8 without a BOM.
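
If re-saving the file in an editor is not convenient, the BOM can also be handled from Python. The sketch below shows three options under assumed names: the sample file and the helper strip_bom_in_place are hypothetical and only serve as illustration. It relies on the standard "utf-8-sig" codec, which skips a leading BOM when decoding.

```python
import os
import tempfile

def strip_bom_in_place(path):
    """Rewrite a UTF-8 text file without a leading BOM (no change if none)."""
    # "utf-8-sig" drops a leading BOM on read, if one is present.
    with open(path, "r", encoding="utf-8-sig") as f:
        content = f.read()
    # Plain "utf-8" never writes a BOM.
    with open(path, "w", encoding="utf-8") as f:
        f.write(content)

# Hypothetical sample file saved with a BOM.
path = os.path.join(tempfile.mkdtemp(), "bom_sample.txt")
with open(path, "w", encoding="utf-8-sig") as f:
    f.write("hello\n")

# Option 1: decode with "utf-8-sig" so the BOM never reaches the string.
with open(path, "r", encoding="utf-8-sig") as f:
    clean = f.readline().strip()

# Option 2: the text was decoded as "utf-8"; remove the mark by hand.
with open(path, "r", encoding="utf-8") as f:
    manual = f.readline().lstrip("\ufeff").strip()

# Option 3: rewrite the file itself so that it no longer starts with a BOM.
strip_bom_in_place(path)
with open(path, "rb") as f:
    raw = f.read()

print(clean == "hello", manual == "hello", raw[:3] != b"\xef\xbb\xbf")
```

Option 1 is usually the least intrusive, because "utf-8-sig" also reads files that have no BOM at all.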

2. Python reads and writes JSON files

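import json

# global_newest_pool is the path to a file that holds one JSON object per line.
# "utf-8-sig" skips a leading BOM, if there is one, before the text is parsed.
with open(global_newest_pool, "r", encoding="utf-8-sig") as text:
    for line in text:
        # Parse the JSON data on this line and load it into memory.
        jsonOne = json.loads(line)
        # print(jsonOne['_id'])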
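
To see why "utf-8-sig" matters here, the following sketch reads the same file with both encodings. The file pool.jsonl, its contents, and the helper load_json_lines are assumptions made up for this example; only the "_id" key is borrowed from the snippet above.

```python
import json
import os
import tempfile

# Hypothetical JSON Lines file saved with a BOM.
path = os.path.join(tempfile.mkdtemp(), "pool.jsonl")
with open(path, "w", encoding="utf-8-sig") as f:
    f.write('{"_id": 1}\n{"_id": 2}\n')

def load_json_lines(path, encoding):
    """Parse one JSON value per non-empty line."""
    with open(path, "r", encoding=encoding) as f:
        return [json.loads(line) for line in f if line.strip()]

try:
    load_json_lines(path, "utf-8")
    bom_rejected = False
except json.JSONDecodeError:
    # With plain "utf-8" the first line starts with '\ufeff', which is not valid JSON.
    bom_rejected = True

# With "utf-8-sig" the BOM is skipped and every line parses.
records = load_json_lines(path, "utf-8-sig")
print(bom_rejected, [r["_id"] for r in records])
```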


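# Store the whole list in a JSON file; the entire list occupies one line.
# json.dump() writes valid JSON, whereas str() would write a Python repr.
with open(global_recall_result_one, "w", encoding="utf-8") as f:
    json.dump(fina_recall_list, f, ensure_ascii=False)
print("Finished writing the single-list file")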


import os

# If the JSON file already exists, delete it first: the file is opened in
# append mode below, so each complete list is appended as a new line.
if os.path.isfile(global_recall_result_many):
    os.remove(global_recall_result_many)
with open(global_recall_result_many, "a", encoding="utf-8") as f:
    f.write(json.dumps(recallPaperList, ensure_ascii=False))
    f.write("\n")
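
The append pattern can be checked end to end with a small round trip. This is a sketch under assumptions: the path recall_many.jsonl and the sample lists are invented, and each list is serialized with json.dumps so that every line can be parsed back with json.loads. A plain str() of a list that contains strings uses single quotes and is not valid JSON, which the last step illustrates.

```python
import json
import os
import tempfile

# Hypothetical output path and data; the real lists come from the author's own code.
path = os.path.join(tempfile.mkdtemp(), "recall_many.jsonl")
batches = [["paper-1", "paper-2"], ["résumé-3"]]

# Start from a clean file, because the writes below append.
if os.path.isfile(path):
    os.remove(path)

for batch in batches:
    with open(path, "a", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-ASCII text readable in the file.
        f.write(json.dumps(batch, ensure_ascii=False) + "\n")

# Read the lists back, one per line.
with open(path, "r", encoding="utf-8-sig") as f:
    restored = [json.loads(line) for line in f]

# The str() form of the first list is a Python repr and does not parse as JSON.
try:
    json.loads(str(batches[0]))
    repr_is_json = True
except json.JSONDecodeError:
    repr_is_json = False

print(restored == batches, repr_is_json)
```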

Origin blog.csdn.net/baidu_41810561/article/details/115950335