Notes on reading and writing Chinese characters in JSON with Python 3

Some notes

  • Why am I writing this article?

      I recently took the web crawler course taught by Song Tian on China University MOOC, and it was quite rewarding. To practice, I decided to scrape a university news website. Along the way I ran into the notorious problem of handling Chinese characters, so I am recording the solution here.

  • What is this article about?

      This article shows how to convert a Python dict into a JSON object and a Python list into a JSON array when the data contains Chinese text, how to store that data in a file correctly, and how to read it back into memory for reuse. The key point is the handling of Chinese characters. This was a troublesome problem in Python 2 and is much improved in Python 3, but a few settings are still needed when reading and writing. The sections below describe exactly where those settings go.
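To make the core issue concrete, here is a minimal sketch of what `json.dumps` does with Chinese text. The sample record is made up for illustration; only the `ensure_ascii` parameter comes from the json module itself.

```python
import json

# Hypothetical sample record containing Chinese text.
news = [{"tag": "要闻", "title": "校园新闻", "date": "2018-04-01"}]

# Default (ensure_ascii=True): non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(news)

# ensure_ascii=False: Chinese characters are kept as-is in the output string.
readable = json.dumps(news, ensure_ascii=False)

print(escaped)
print(readable)

# Both strings decode to the same Python data.
print(json.loads(escaped) == json.loads(readable))
```

Both forms are valid JSON and load back to identical data. The difference is whether the file stays readable to a person opening it in a text editor.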

How to do it

With the problem stated, there is no need to say more. Here is the code.

# -*- coding: utf-8 -*-
import json
import requests
from bs4 import BeautifulSoup
import re
import traceback
import codecs

newsList = []
r = requests.get('http://news.hunnu.edu.cn/sdxw.htm')
r.encoding = r.apparent_encoding
soup = BeautifulSoup(r.text, 'lxml')
newsInfoTags = soup.find_all(id=re.compile('lineu'))
for item in newsInfoTags:
    elem = item.contents
    # Create a fresh dict for every item. Reusing one dict would make
    # every entry in newsList refer to the same (last) record.
    newsDict = {}
    # Note: <class 'bs4.element.NavigableString'> is a subclass of str,
    # so wrapping the results in str() explicitly is optional.
    newsDict['tag'] = elem[3].string
    newsDict['title'] = elem[5].string
    newsDict['date'] = elem[7].string
    newsList.append(newsDict)

# Write the data to a JSON file. Note that 'utf8' is required.
with codecs.open('newsList.json', 'w', 'utf8') as f:
    # ensure_ascii=False writes Chinese characters as-is
    # instead of \uXXXX escape sequences.
    f.write(json.dumps(newsList, ensure_ascii=False))

# Read the data from the JSON file back into memory. Note that 'utf8' is required.
with codecs.open('newsList.json', 'r', 'utf8') as f:
    objs = json.loads(f.read())
    print(len(objs))

Two modules from the Python standard library do most of the work here: codecs and json. Their documentation:

  1. codecs official documentation
  2. json official documentation
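In Python 3 the built-in `open()` also accepts an `encoding` argument, so the same round trip can be written without codecs. The following is a minimal sketch of that alternative, not the approach used above; the sample data and the temporary directory are assumptions made so the example runs on its own.

```python
import json
import os
import tempfile

# Hypothetical sample data standing in for the scraped news list.
records = [{"title": "中文标题", "date": "2018-04-01"}]

# Use a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "newsList.json")

    # Set encoding explicitly: the platform default is not always UTF-8.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)

    with open(path, "r", encoding="utf-8") as f:
        loaded = json.load(f)

print(loaded == records)
```

`json.dump` and `json.load` work directly on file objects, which saves the intermediate `f.write(...)` and `f.read()` calls.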

See the comments in the code for the points that need attention; that is about all there is to it. The json library is particularly handy: when writing, it maps each Python type to its JSON counterpart (dict to object, list to array), and when reading, it maps them back. Rich libraries like these are why Python deserves to be called a "super language".
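The type mapping can be seen in a short sketch. The data below is invented for illustration. Note that the mapping is not perfectly symmetric: a tuple is written as a JSON array and comes back as a list.

```python
import json

# Hypothetical data mixing several Python types with Chinese text.
data = {
    "name": "示例大学",
    "tags": ("新闻", "公告"),  # tuple: written as a JSON array
    "count": 3,
    "ok": True,               # written as true
    "extra": None,            # written as null
}

text = json.dumps(data, ensure_ascii=False)
back = json.loads(text)

print(text)
print(type(back["tags"]).__name__)  # the tuple comes back as a list
```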
