How to convert the official Yelp review-site dataset from JSON to CSV [latest] [very detailed]

Yelp is a US website similar to Dianping. The dataset released on its official website covers businesses, users, reviews, check-ins and more, and it is detailed and rich. The dataset can be used to study recommender systems!

However, the data downloaded from Yelp's official website is in JSON format, and the official Python conversion script was written in Python 2 five years ago. Run as-is, it reports error after error and has to be fixed bit by bit. After spending half a day downloading the 10 GB file, I could not use it! I searched the Internet for a long time without finding a good solution, which was infuriating. So here is up-to-date Python 3 code that converts the data to CSV, for the benefit of those who come after.

This code is based on the code by Zhizhi_, but that article only parses the Business data; the script I give below can parse all of the files.

(1) Create a new Python file named json_to_csv_business.py with the following content:

import csv
import json
import pandas as pd

# Here the .py file and the data are in the same directory. If yours are not, edit the paths below (write separators as / or \\, not a single \)

json_file_path='yelp_academic_dataset_review.json'
csv_file_path='yelp_academic_dataset_review.csv'

# Open the JSON file and take the column names from its first line
with open(json_file_path,'r',encoding='utf-8') as fin:
    for line in fin:
        line_contents = json.loads(line)
        headers=line_contents.keys()
        break
    print(headers)
    
# Read each JSON line into a dict: the keys become the CSV column names, then the values are written to the CSV file line by line
with open(csv_file_path, 'w', newline='',encoding='utf-8') as fout:
    writer=csv.DictWriter(fout, headers)
    writer.writeheader()
    with open(json_file_path, 'r', encoding='utf-8') as fin:
        for line in fin:
            line_contents = json.loads(line)
            #if 'Phoenix' in line_contents.values():
            writer.writerow(line_contents)
            
# Drop the columns you do not need, then save.
# Choose the columns for the file you are converting. The three compliment_* columns below belong to the user file;
# errors='ignore' skips any listed column that the current file does not have.
df_bus=pd.read_csv(csv_file_path)
df_reduced=df_bus.drop(['compliment_hot','compliment_more','compliment_profile'],axis=1,errors='ignore')
# Drop rows that contain missing values
df_cleaned=df_reduced.dropna()
df_cleaned.to_csv(csv_file_path,index=False)

The code above parses the review file. To parse the business or user files, just change json_file_path and csv_file_path, and adjust the list of columns to drop so that it matches the file.
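The same steps can be wrapped in a reusable function so that each file needs only one call. The following is a minimal sketch, not the original author's code: the name json_lines_to_csv is hypothetical, and storing nested values as JSON text and ignoring keys missing from the first record are my assumptions about what is convenient, not something the official files require.

```python
import csv
import json


def json_lines_to_csv(json_path, csv_path):
    """Convert a file with one JSON object per line to CSV.

    Hypothetical helper for illustration; returns the number of rows written.
    """
    # Take the column names from the first record, as the script above does.
    with open(json_path, 'r', encoding='utf-8') as fin:
        headers = list(json.loads(fin.readline()).keys())

    rows_written = 0
    with open(json_path, 'r', encoding='utf-8') as fin, \
         open(csv_path, 'w', newline='', encoding='utf-8') as fout:
        # extrasaction='ignore' skips keys that the first record did not have
        # instead of raising ValueError.
        writer = csv.DictWriter(fout, fieldnames=headers, extrasaction='ignore')
        writer.writeheader()
        for line in fin:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            # Assumption: nested objects and lists are stored as JSON text
            # so that they can be parsed back later with json.loads.
            for key, value in record.items():
                if isinstance(value, (dict, list)):
                    record[key] = json.dumps(value, ensure_ascii=False)
            writer.writerow(record)
            rows_written += 1
    return rows_written


# Example call, using the review file names from the script above:
# json_lines_to_csv('yelp_academic_dataset_review.json',
#                   'yelp_academic_dataset_review.csv')
```

Because the column list comes from the first record only, a field that appears for the first time in a later record is silently left out; if that matters for a given file, a first pass that collects all keys would be needed.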

(2) Open a command line where Python is available. I use Anaconda here, so I open the Anaconda PowerShell Prompt.

Enter cd <path> to change to the directory that contains the Python file you just wrote.
Then enter the following on the command line:

python json_to_csv_business.py

You can now see the parsed CSV file in the folder!

I have uploaded the parsed user, tip (short reviews) and checkin (check-in) data here. The review file is too large to upload.

Link: https://pan.baidu.com/s/1HreQ1JaMdavS70nkUWo3NA Extraction code: 23d8

Note that if you open the file in Excel, some of the data may be displayed incorrectly. Check which format Excel uses when it opens the file: CSV (comma delimited) or CSV UTF-8 (comma delimited). The correct one here is the latter. Reading the file directly with Python shows the parsed result correctly. If the Python file and the data are in the same directory:

import pandas as pd
review = pd.read_csv('yelp_academic_dataset_review.csv')
review.head()
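For the Excel display problem mentioned above, one common workaround is to save a copy of the CSV that starts with a UTF-8 byte-order mark, which Excel uses to recognise the encoding. The sketch below uses only the standard library; the function name add_utf8_bom and the output file name are hypothetical, and this is one possible approach rather than the only one.

```python
import shutil


def add_utf8_bom(src_csv, dst_csv):
    """Copy a UTF-8 CSV to a new file that starts with a UTF-8 BOM.

    Hypothetical helper for illustration.
    """
    # Reading with 'utf-8-sig' drops a BOM if the source already has one;
    # writing with 'utf-8-sig' puts exactly one BOM at the start of the copy.
    # newline='' leaves the line endings inside the CSV untouched.
    with open(src_csv, 'r', encoding='utf-8-sig', newline='') as fin, \
         open(dst_csv, 'w', encoding='utf-8-sig', newline='') as fout:
        shutil.copyfileobj(fin, fout)


# Example call (the output name is only an illustration):
# add_utf8_bom('yelp_academic_dataset_review.csv',
#              'yelp_academic_dataset_review_excel.csv')
```

Keep in mind that an Excel worksheet holds at most 1,048,576 rows, so a very large file will not load completely in Excel whatever its encoding.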

It took a long time, but it is finally done!


Origin blog.csdn.net/weixin_43846562/article/details/112343164