Still struggling to get hold of official case data?
The WHO publishes per-country case data here:
https://experience.arcgis.com/experience/685d0ace521648f8a5beeeee1b9125cd
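Our goal is to scrape the data behind the chart on that page:
Inspect Element
First, click on any country to open its outbreak details:
Taking China as an example, open it and find the request URL: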
https://services.arcgis.com/5T5nSi527N4F7luB/arcgis/rest/services/Historic_adm0_v3/FeatureServer/0/query?f=json&where=ADM0_NAME%3D%27CHINA%27&returnGeometry=false&spatialRel=esriSpatialRelIntersects&outFields=OBJECTID%2Ccum_conf%2CDateOfDataEntry&orderByFields=DateOfDataEntry%20asc&resultOffset=0&resultRecordCount=2000&cacheHint=true
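In the Preview tab you can see:
This is exactly the data we want, but the time format is unfamiliar. Taking the difference between consecutive values reveals the pattern:
Two consecutive dates differ by 86,400,000, which is the number of milliseconds in one day, so DateOfDataEntry is a Unix timestamp in milliseconds.
The URL above returns cumulative confirmed cases (cum_conf); the URL for new cases (NewCase) is: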
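As a quick check of that reading of the timestamps, here is a minimal sketch using only the standard library. The helper name ms_to_date and the two sample values are made up for illustration; they are not taken from the real response.

```python
from datetime import datetime, timezone

MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000 milliseconds in one day


def ms_to_date(ms):
    """Convert milliseconds since the Unix epoch to a UTC calendar date."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc).date()


# Hypothetical sample values, chosen to be exactly one day apart
first = 1579651200000  # 2020-01-22 00:00:00 UTC
second = first + MS_PER_DAY

print(second - first)      # difference between two consecutive entries
print(ms_to_date(first))   # 2020-01-22
print(ms_to_date(second))  # 2020-01-23
```

If the real values are not aligned to midnight UTC, the date may shift by one depending on the time zone you convert to, so it is worth comparing a few converted values against the chart.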
https://services.arcgis.com/5T5nSi527N4F7luB/arcgis/rest/services/Historic_adm0_v3/FeatureServer/0/query?f=json&where=ADM0_NAME%3D%27CHINA%27&returnGeometry=false&spatialRel=esriSpatialRelIntersects&outFields=OBJECTID%2CNewCase%2CDateOfDataEntry&orderByFields=DateOfDataEntry%20asc&resultOffset=0&resultRecordCount=2000&cacheHint=true
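The two URLs differ only in the field listed in outFields, so the query string can be assembled from its parts instead of being pasted in by hand. The sketch below is one possible way to do that; build_query_url is a hypothetical helper name, and the parameter values are simply copied from the URLs above. Passing quote_via=quote makes spaces come out as %20, which also covers country names with more than one word.

```python
from urllib.parse import quote, urlencode

BASE = ('https://services.arcgis.com/5T5nSi527N4F7luB/arcgis/rest/services/'
        'Historic_adm0_v3/FeatureServer/0/query')


def build_query_url(country, field='cum_conf'):
    """Build the query URL for one country and one field (cum_conf or NewCase)."""
    params = {
        'f': 'json',
        'where': f"ADM0_NAME='{country}'",
        'returnGeometry': 'false',
        'spatialRel': 'esriSpatialRelIntersects',
        'outFields': f'OBJECTID,{field},DateOfDataEntry',
        'orderByFields': 'DateOfDataEntry asc',
        'resultOffset': 0,
        'resultRecordCount': 2000,
        'cacheHint': 'true',
    }
    # quote (rather than the default quote_plus) encodes spaces as %20
    return BASE + '?' + urlencode(params, quote_via=quote)


print(build_query_url('CHINA'))
print(build_query_url('United States of America', field='NewCase'))
```

For 'CHINA' and 'cum_conf' this reproduces the first URL above character for character.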
Taking a few countries as examples, the code is as follows (for now it only covers countries whose names are a single word):
# coding: utf-8
import json
import urllib.request

import pandas as pd

# Accumulates one column per country, indexed by the raw DateOfDataEntry value
res = pd.DataFrame()


def Open(url):
    # Fetch the URL with a browser-like User-Agent and return the body as text
    heads = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'}
    req = urllib.request.Request(url, headers=heads)
    # Pass req (not url) so that the headers are actually sent
    response = urllib.request.urlopen(req)
    html = response.read()
    return html.decode('utf-8')


def conserve(html, name):
    # Extract (timestamp, cumulative count) pairs and add them to res as a new column
    global res
    time, confirm = [], []
    temp = pd.DataFrame(columns=['time', name])
    for i in html['features']:
        time.append(i['attributes']['DateOfDataEntry'])
        confirm.append(i['attributes']['cum_conf'])
    temp['time'] = time
    temp[name] = confirm
    temp = temp.set_index('time')
    res = pd.concat([res, temp], axis=1)


def main():
    global res
    for name in ['China', 'Italy', 'Spain', 'France', 'Germany', 'Switzerland', 'Netherlands', 'Norway', 'Belgium', 'Sweden', 'Australia', 'Brazil', 'Egypt']:
        print(name)
        url = 'https://services.arcgis.com/5T5nSi527N4F7luB/arcgis/rest/services/Historic_adm0_v3/FeatureServer/0/query?f=json&where=ADM0_NAME%3D%27' + name + '%27&returnGeometry=false&spatialRel=esriSpatialRelIntersects&outFields=OBJECTID%2Ccum_conf%2CDateOfDataEntry&orderByFields=DateOfDataEntry%20asc&resultOffset=0&resultRecordCount=2000&cacheHint=true'
        html = json.loads(Open(url))
        conserve(html, name)
        print('--------------------------------------------------------------------------')
    # America is handled separately because its name in the data contains spaces
    name = 'America'
    url = 'https://services.arcgis.com/5T5nSi527N4F7luB/arcgis/rest/services/Historic_adm0_v3/FeatureServer/0/query?f=json&where=ADM0_NAME%3D%27United%20States%20of%20America%27&returnGeometry=false&spatialRel=esriSpatialRelIntersects&outFields=OBJECTID%2Ccum_conf%2CDateOfDataEntry&orderByFields=DateOfDataEntry%20asc&resultOffset=0&resultRecordCount=2000&cacheHint=true'
    html = json.loads(Open(url))
    conserve(html, name)
    # Hard-coded date range: its length must match the number of rows in res
    res['Datetime'] = pd.date_range(start='20200122', end='20200316')
    res.to_csv('conform.csv', encoding='utf_8_sig')


main()
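To try the parsing step without hitting the network, the sketch below runs the same extraction on a small hand-written payload. The payload is hypothetical: only its shape (features, then attributes, then the requested fields) is taken from what the script reads, and the numbers are placeholders. features_to_series is an illustrative helper name, not part of the script.

```python
import json

import pandas as pd

# Hypothetical payload in the shape the script expects; the values are placeholders
sample = json.loads('''
{"features": [
  {"attributes": {"OBJECTID": 1, "cum_conf": 10, "DateOfDataEntry": 1579651200000}},
  {"attributes": {"OBJECTID": 2, "cum_conf": 20, "DateOfDataEntry": 1579737600000}}
]}
''')


def features_to_series(payload, name, field='cum_conf'):
    """Return one column of counts indexed by the raw DateOfDataEntry value."""
    rows = [feature['attributes'] for feature in payload['features']]
    return pd.Series(
        [row[field] for row in rows],
        index=[row['DateOfDataEntry'] for row in rows],
        name=name,
    )


series = features_to_series(sample, 'Example')
print(series)
```

Returning a Series per country and concatenating the results at the end would also remove the need for the global res variable, though that is a design preference rather than a required change.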
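After some simple data processing, the result looks like this:
Note: if the line res['Datetime'] = pd.date_range(start='20200122', end=...) raises an error, the cause is the hard-coded end date. I wrote this on March 17, so the number of dates in that range no longer matches the number of rows returned; change the end date to today's date.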
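One way to avoid editing the end date by hand is to derive the dates from the timestamps already stored in the index. This is a sketch under the assumption that the index holds DateOfDataEntry values in epoch milliseconds, as in the script; the small DataFrame below is a made-up stand-in for res.

```python
import pandas as pd

# Hypothetical stand-in for res: placeholder counts indexed by epoch milliseconds
res = pd.DataFrame(
    {'Example': [10, 20, 30]},
    index=pd.Index([1579651200000, 1579737600000, 1579824000000], name='time'),
)

# Convert each row's own timestamp instead of assuming a fixed date range
res['Datetime'] = pd.to_datetime(res.index, unit='ms')
print(res)
```

Because every row gets the date that belongs to its own timestamp, this keeps working as new days are added, and it does not depend on every country starting on the same date.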