Python COVID-19 case study: 1. Merging data files

Note: this article uses the merging of COVID-19 data files as its example.
If you need the data, see: "2020-2022 COVID-19 epidemic data"


1. Data merging under a single directory


Merge all files under 2020 into one file:

import glob

# Collect every CSV file under the 2020 directory.
csv_list = glob.glob(r'D:\Python\03DataAcquisition\COVID-19\2020\*.csv')
print("Total number of data files: %s" % len(csv_list))

with open('../output/covid19temp0314.csv', 'ab') as out:
    for idx, name in enumerate(csv_list):
        with open(name, 'rb') as src:
            data = src.read()
        if idx > 0:  # keep the header row only from the first file
            data = data.split(b'\n', 1)[1]
        out.write(data)
print('Merge complete!')
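As an alternative sketch, each file can be read into a DataFrame and combined with `pd.concat`, which keeps a single header automatically; the in-memory CSV texts below are toy stand-ins for the real files:

```python
import io
import pandas as pd

# Toy in-memory CSVs standing in for the files under 2020/.
csv_texts = [
    "Province,Confirmed\nAnhui,1\n",
    "Province,Confirmed\nHubei,444\n",
]

# Each file carries its own header; pd.concat merges the rows so the
# result has one header when written back out.
frames = [pd.read_csv(io.StringIO(text)) for text in csv_texts]
merged = pd.concat(frames, ignore_index=True)
print(len(merged))  # 2
```

This avoids the header-duplication pitfall of raw byte concatenation, at the cost of parsing every file.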


Combined data:

2. Use functions to merge data

## 02 Merging the data with a function
import os
import pandas as pd

# Recursive function: walk the directory tree and collect non-empty
# CSV files that sit two levels deep (root/<year>/<file>.csv).
def mergeFile(parent, path="", pathdeep=0, filelist=None):
    if filelist is None:  # avoid a shared mutable default argument
        filelist = []
    fileAbsPath = os.path.join(parent, path)
    if os.path.isdir(fileAbsPath):
        if pathdeep != 0 and '.ipynb_checkpoints' not in str(fileAbsPath):
            print('--' + path)
        for filename2 in os.listdir(fileAbsPath):
            mergeFile(fileAbsPath, filename2, pathdeep=pathdeep + 1,
                      filelist=filelist)
    else:
        if pathdeep == 2 and path.endswith(".csv") and os.path.getsize(fileAbsPath) > 0:
            filelist.append(fileAbsPath)
    return filelist

# Example: D:\Python\03DataAcquisition\COVID-19
path = input("Enter the directory that holds the data files: ")
filelist = mergeFile(path)

filelist
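For comparison, here is an iterative sketch of the same file collection using `os.walk`. It keeps the same rules (skip `.ipynb_checkpoints`, skip empty files), though unlike `mergeFile` it does not restrict matches to exactly two directory levels; the throwaway directory tree is only for the demo:

```python
import os
import tempfile

# Iterative alternative to the recursive mergeFile.
def list_csv_files(root):
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune checkpoint directories before descending into them.
        dirnames[:] = [d for d in dirnames if d != '.ipynb_checkpoints']
        for name in filenames:
            full = os.path.join(dirpath, name)
            if name.endswith('.csv') and os.path.getsize(full) > 0:
                found.append(full)
    return found

# Tiny demo on a throwaway directory tree: one real file, one empty one.
with tempfile.TemporaryDirectory() as root:
    year_dir = os.path.join(root, '2020')
    os.makedirs(year_dir)
    with open(os.path.join(year_dir, 'a.csv'), 'w') as f:
        f.write('x\n1\n')
    open(os.path.join(year_dir, 'empty.csv'), 'w').close()
    found = list_csv_files(root)
print(found)
```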

# Read every CSV and concatenate. Collecting the frames in a list and
# calling pd.concat once is much faster than appending blocks one at a
# time (DataFrame.append is also removed in pandas 2.x).
frames = [pd.read_csv(m, encoding='utf-8-sig') for m in filelist]
csvdatadf = pd.concat(frames, ignore_index=True)
# The 2023 data is not available yet, so it is not merged.


Note: expect this step to take a while; there are more than 1.9 million rows of data.

Save the merged data:

csvdatadf.to_csv("covid190314.csv",index=None,encoding='utf-8-sig')
csvdatadf=pd.read_csv("covid190314.csv",encoding='utf-8-sig')
csvdatadf.info()


Read the COVID-19 data from before 2020/11/11:

beforedf=pd.read_csv(r'D:\Python\03DataAcquisition\COVID-19\before20201111.csv',encoding='utf-8-sig')
beforedf.info()


Combine the two sets of data:

tempalldf = pd.concat([beforedf, csvdatadf], ignore_index=True)
tempalldf.head()
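A quick sketch on toy data shows why the header unification in section 4 matters: concatenating frames whose column names differ produces the union of the columns, padded with NaN:

```python
import pandas as pd

# Toy frames with mismatched column names, like the pre- and
# post-2020/11 datasets ('Country/Region' vs 'Country_Region').
a = pd.DataFrame({'Country/Region': ['China'], 'Confirmed': [1]})
b = pd.DataFrame({'Country_Region': ['China'], 'Confirmed': [2]})
both = pd.concat([a, b], ignore_index=True)

# The result has three columns, with NaN where a frame lacked one.
print(sorted(both.columns))
```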


3. Processing data from Hong Kong, Macao and Taiwan

As shown in the figure, rows whose Country_Region is Hong Kong should have the country changed to China, with Hong Kong moved into the province column. The same applies to Macau and Taiwan:

Find data about Taiwan:

# Locate the Taiwan rows (exact match, then substring match)
beforedf.loc[beforedf['Country/Region']=='Taiwan']
beforedf.loc[beforedf['Country/Region'].str.contains('Taiwan')]
# Move 'Taiwan' into Province/State and set the country to China
beforedf.loc[beforedf['Country/Region'].str.contains('Taiwan'),'Province/State']='Taiwan'
beforedf.loc[beforedf['Province/State']=='Taiwan','Country/Region']='China'
# Verify the change
beforedf.loc[beforedf['Province/State']=='Taiwan']

Data processing in Hong Kong:

beforedf.loc[beforedf['Country/Region'].str.contains('Hong Kong'),'Province/State']='Hong Kong'
beforedf.loc[beforedf['Province/State']=='Hong Kong','Country/Region']='China'
afterdf.loc[afterdf['Country_Region'].str.contains('Hong Kong'),'Province_State']='Hong Kong'
afterdf.loc[afterdf['Province_State']=='Hong Kong','Country_Region']='China'

Data processing in Macau:

beforedf.loc[beforedf['Country/Region'].str.contains('Macau'),'Province/State']='Macau'
beforedf.loc[beforedf['Province/State']=='Macau','Country/Region']='China'
afterdf.loc[afterdf['Country_Region'].str.contains('Macau'),'Province_State']='Macau'
afterdf.loc[afterdf['Province_State']=='Macau','Country_Region']='China'
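The three regions repeat the same two-line pattern, so it can be factored into a small helper; this is a sketch, with the column names passed in because the two datasets spell them differently ('Country/Region' vs 'Country_Region'):

```python
import pandas as pd

# Move `region` into the province column and set the country to China.
def reassign_to_china(df, region, country_col, province_col):
    mask = df[country_col].astype(str).str.contains(region)
    df.loc[mask, province_col] = region
    df.loc[df[province_col] == region, country_col] = 'China'
    return df

# Toy data standing in for beforedf.
demo = pd.DataFrame({
    'Country/Region': ['Macau', 'US'],
    'Province/State': [None, 'Ohio'],
})
demo = reassign_to_china(demo, 'Macau', 'Country/Region', 'Province/State')
```

The `astype(str)` guards `str.contains` against NaN values in the country column.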

Finally, save the cleaned-up data:

beforedf.to_csv("beforedf0314.csv",index=None,encoding='utf-8-sig')
afterdf.to_csv("afterdf0314.csv",index=None,encoding='utf-8-sig')


4. Header modification + removal of null values

Merge the two generated files, beforedf0314.csv and afterdf0314.csv.

import pandas as pd 
beforedf=pd.read_csv("beforedf0314.csv")
afterdf=pd.read_csv("afterdf0314.csv")

The header and first data row of beforedf:

Country/Region Province/State Latitude Longitude Confirmed Recovered Deaths Date
China Anhui 31.825700 117.226400 1.0 NaN NaN 2020/1/22

The header and first data row of afterdf:

FIPS Admin2 Province_State Country_Region Last_Update Lat Long_ Confirmed Deaths Recovered Active Combined_Key Incident_Rate Case_Fatality_Ratio
NaN NaN NaN Afghanistan 2020-11-12 05:25:55 33.939110 67.709953 42609 1581 34967.0 6061.0 Afghanistan 109.454960

The two headers are not identical, so before merging they must be unified:

# Unify the column names of the two datasets
beforedfv2 = beforedf.rename(columns={
    'Country/Region': 'Country_Region',
    'Province/State': 'Province_State',
    'Latitude': 'Lat',
    'Longitude': 'Longi',
})

afterdfv2 = afterdf.rename(columns={
    'Last_Update': 'Date',
    'Long_': 'Longi',
})

# Keep only the columns shared by both datasets, in a fixed order
afterdfv2 = afterdfv2[['Province_State', 'Country_Region', 'Date', 'Lat',
                       'Longi', 'Confirmed', 'Deaths', 'Recovered']]

At this point the headers correspond. The next step is to handle rows where Province_State is null:

# Check whether Province_State has any null values
beforedfv2.loc[beforedfv2['Province_State'].isnull()]

Good: beforedfv2 has no null values.

afterdfv2.loc[afterdfv2['Province_State'].isnull()]

afterdfv2 has 7 null values:
Processing: on the rows where Province_State is null, Country_Region is not null, so copy the country value into the province column:

afterdfv2.loc[afterdfv2['Province_State'].isnull(),'Province_State']=afterdfv2['Country_Region']
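The fill step can be verified on toy data; `.loc` aligns the right-hand Series on the row index, so only the masked rows are assigned:

```python
import pandas as pd

# Toy data: one row with a null province, one without.
df = pd.DataFrame({
    'Province_State': [None, 'Anhui'],
    'Country_Region': ['Denmark', 'China'],
})

# Where the province is null, copy the country value across.
mask = df['Province_State'].isnull()
df.loc[mask, 'Province_State'] = df['Country_Region']
```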

Check again after processing: no null values remain.
Save the cleaned data to disk:

beforedfv2.to_csv('beforedfv2.csv',encoding='utf-8-sig',index=None)
afterdfv2.to_csv('afterdfv2.csv',encoding='utf-8-sig',index=None)

5. Combined data

alldf1 = pd.concat([beforedfv2, afterdfv2], ignore_index=True)

The first five rows of the merged dataset, then the last five rows: in total there are more than 1.9 million rows of data.

6. Cleaning the merged data

# Drop duplicate rows
alldfnodup = alldf1.drop_duplicates()
alldfnodup.to_csv("alldfnodup.csv",index=None,encoding='utf-8-sig')

Strip the hours, minutes, and seconds from Date, keeping only the date:

# Convert Date to datetime and drop the time-of-day component
alldfnodup['Date'] = pd.to_datetime(alldfnodup['Date']).dt.normalize()
# Fill null counts with 0
alldfnodup['Recovered'].fillna(0, inplace=True)
alldfnodup['Deaths'].fillna(0, inplace=True)
# Cast the counts to int64
alldfnodup['Recovered'] = alldfnodup['Recovered'].astype('int64')
alldfnodup['Deaths'] = alldfnodup['Deaths'].astype('int64')
alldfnodup['Confirmed'] = alldfnodup['Confirmed'].astype('int64')
# Sort by country, province, and date
alldfsort = alldfnodup.sort_values(['Country_Region', 'Province_State', 'Date'])
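What `dt.normalize()` does can be seen on a tiny example: the date is kept and the time-of-day is reset to midnight, so timestamps from the same day compare (and deduplicate) as equal:

```python
import pandas as pd

# Two timestamps on the same day, different times.
s = pd.to_datetime(pd.Series(['2020-11-12 05:25:55', '2020-11-12 23:59:59']))
normalized = s.dt.normalize()
print(normalized.nunique())  # 1
```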

After cleaning, there are more than 2.8 million rows in total.

alldfsort[alldfsort['Country_Region']=='China']

These are the rows for China:

Origin: blog.csdn.net/wxfighting/article/details/123590669