Writing a JSON file line by line with Python: each line is a standard JSON object, but the file as a whole is not a JSON object

Today's article is a small, application-oriented exercise. Why am I writing it? I have to go back to 2017. At that time I was working on a project, and one of my tasks was data processing and analysis. When I received the data set, something always felt odd: each line was a dictionary object, but the file as a whole was not a JSON object. As a result, reading it directly with the json module's load raised an error, as follows:
 

json.decoder.JSONDecodeError: Extra data: line 2 column 1 (char 2642)

Clearly, the file does not conform to the standard JSON format.
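A minimal sketch of how this error arises (the file name and contents here are illustrative, not from the original data set): json.load expects a single top-level JSON value, so a file with one JSON object per line triggers exactly this kind of JSONDecodeError.

import json

# build an illustrative file with one JSON object per line
with open("demo.json", "w", encoding="utf-8") as f:
    f.write('{"content": "line one"}\n')
    f.write('{"content": "line two"}\n')

with open("demo.json", encoding="utf-8") as f:
    try:
        json.load(f)
    except json.JSONDecodeError as e:
        print(e)  # Extra data: line 2 column 1 ...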

The indirect approach I used at the time was to read each line as a string and then convert it into a dict object. It was precisely because of this technical route that I came across a rather magical Python function, eval(). In plain English, this function simply turns the content of a string back into the object that the string describes. I found it novel and wrote a blog post about it, which by now has 70,000+ reads. If you are interested, you can take a look:

"Eval() learning of python magic function"

The main purpose today is not to review what was done before, but today's task is related to it. Back then the data set served as an example for loading, parsing and reading; today the focus is on writing JSON object data line by line.

The first method I tried was to convert the original data into a string object and then write it to the file, but such a file is problematic: when loading the data set for analysis, the keys of the corresponding JSON objects could not be found, so this approach was ruled out.
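Presumably (the original post does not show that failed attempt) the issue is that converting a dict to a string with str() produces Python repr text with single quotes, which is not valid JSON, so a JSON parser cannot recover the keys:

import json

d = {"content": "你好", "summary": "一个示例"}
line = str(d)               # "{'content': '你好', 'summary': '一个示例'}" - single quotes, not JSON
# json.loads(line)          # would raise json.decoder.JSONDecodeError
line_ok = json.dumps(d, ensure_ascii=False)   # valid JSON text for one line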

Next, I operated on the JSON file in append mode ("a"). This method eventually proved effective. Here is the corresponding code implementation:

import os
import json

# one sample record (a dict, despite the variable name)
string={
        "content": "我国现存有哪些环境污染问题?", 
        "summary": "1、大气污染,我国大气中主要污染物是氨氮,二氧化硫,氮氧化物这三类物质。2、水体污染,目前我国水资源污染还是比较严重的,主要有以下几种:工业生产废水直接排入水体,导致水体污染;农业污染,农业生产中使用大量的农药,如有机磷农药,有机氯农药等;农作物上的农药残留在降水的作用下,渗入地下水体中;生活用水污染,在使用这些水资源的过程中会产生很多生活污水,也称中水,比如洗涤用水,医疗废水等。3、土壤污染(1)、化学污染物:包括无机污染物和有机污染物。如汞、镉、铅、砷等重金属,过量的氮、磷植物营养元素以及氧化物和硫化物,各种化学农药、石油及其裂解产物,以及其他各类有机合成产物等。(2)、物理污染物:来自工厂、矿山的固体废弃物如尾矿、废石、粉煤灰和工业垃圾等。(3)、生物污染物:带有各种病菌的城市垃圾和由卫生设施(包括医院)排出的废水、废物以及厩肥等。(4)、放射性污染物:主要存在于核原料开采和大气层核爆炸地区,以锶和铯等在土壤中生存期长的放射性元素为主。"
        }



saveDir="environment/"
if not os.path.exists(saveDir):
    os.makedirs(saveDir)




with open(saveDir+"data.json","a",encoding="utf-8") as f:
    for i in range(100):
        f.write(json.dumps(string)+"\n")





# a second sample record, appended to the same data.json
string={
        "content": "你好,系统学习python哪个公众号最靠谱?", 
        "summary": "PythonAI之路"
        }



saveDir="environment/"
if not os.path.exists(saveDir):
    os.makedirs(saveDir)




with open(saveDir+"data.json","a",encoding="utf-8") as f:
    for i in range(100):
        f.write(json.dumps(string)+"\n")
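One side note, assuming readable Chinese in the output file is wanted (the original code does not do this): json.dumps escapes non-ASCII characters to \uXXXX sequences by default, and passing ensure_ascii=False keeps the original characters:

import json

record = {"content": "你好", "summary": "一个示例"}
print(json.dumps(record))                      # {"content": "\u4f60\u597d", "summary": ...}
print(json.dumps(record, ensure_ascii=False))  # {"content": "你好", "summary": "一个示例"}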

In addition, there is another, more convenient way: using the jsonlines module to write line by line directly. Note that jsonlines is a third-party module (installed with pip install jsonlines) rather than part of the standard library, and it is very easy to use; its documentation and official repository are available online.

I won't introduce it at length, since the way to use it is very simple.

The core code implementation is as follows:

import jsonlines

string={
        "content": "当前的知识星球有很多,有没有什么以交流学习成长进步为主题的可以推荐给我?", 
        "summary": "有的,主打的就是系统学习实践,共建共享共进步:AZX_cx"
        }
# mode='w' overwrites the file; each writer.write() emits one JSON object per line
with jsonlines.open(saveDir+'data2.json', mode='w') as writer:
    for i in range(100):
        writer.write(string)

The resulting file contains one JSON object per line. Still very convenient.

At this point, do you see why I looked back at that 2017 project at the beginning of the article? Yes, the goal here is to read and process this data line by line, based on the jsonlines module.

The core code implementation is still very simple, as follows:

count=0
with jsonlines.open(saveDir+'data2.json') as reader:
    for obj in reader:
        count+=1
        if count<=10:   # first 10 rows
            print(obj)
            print(type(obj))

I printed the first 10 rows of data and the type of each row; every row comes back as a dict object. Perfectly solved.

Here I want to reproduce the 2017 approach. The code implementation is as follows:

# read the file back as raw text lines
with open(saveDir+'data2.json',encoding="utf-8") as f:
    data_list=f.readlines()
for one_line in data_list[:10]:
    print(one_line)
    print(type(one_line))

Here I also print the content of the first 10 lines and the type of each line, as follows:

It can be seen that each line read this way is a string, which is not what we want. This is where the eval function is needed. Look at the code implementation:

with open(saveDir+'data2.json',encoding="utf-8") as f:
    data_list=f.readlines()
for one_line in data_list[:10]:
    print(one_line.strip())
    print(type(one_line))        # <class 'str'>
    print(type(eval(one_line)))  # <class 'dict'> after eval converts the string

For an intuitive comparison, I did not delete the original type output; I just added one more type output on the last line. The result is as follows:

An intuitive comparison makes the role of the eval function immediately clear, so I won't go into more detail.
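As a side note that is not in the original post: json.loads does the same per-line conversion and is generally a safer choice than eval, since eval will execute whatever expression the string contains. A minimal sketch:

import json

with open(saveDir+'data2.json',encoding="utf-8") as f:
    for one_line in f.readlines()[:10]:
        obj = json.loads(one_line)   # parse one JSON object per line
        print(type(obj))             # <class 'dict'>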

As I said at the very beginning, this is a very small practice, but there are often more interesting discoveries behind a small practice. That has been my biggest takeaway over roughly ten years of development. If you like to communicate, share and build together, you are welcome to join in, work hard together, and learn and progress together!


Origin blog.csdn.net/Together_CZ/article/details/130963028