Obtaining epidemic information (real-time data for the current day) with a Python crawler

Actually, I had wanted to do this for quite a while, but my skills were limited and I never got it done. After a period of study I tried again and finally managed to crawl the data successfully. The code is posted below together with my explanation.

Crawl date: 2020-03-27
Crawl difficulty: ★★★★☆☆
Request URL: https://wp.m.163.com/163/page/news/virus_report/index.html?_nw_=1&_anw_=1
Goal: crawl the day's epidemic figures (confirmed, cured, deaths, etc.) for China and for countries around the world from this site and save them as CSV files
Topics covered: requests, json, time, CSV storage, pandas, etc.

First, select a data source

The outbreak of novel coronavirus pneumonia has had a huge impact on people's lives, and the number of infections is still changing every day. Each day the National Health Commission and major news media publish epidemic figures, including the cumulative number of confirmed cases and the current number of confirmed cases.
With this in mind, I picked out several candidate sites to crawl, but found that some of them present the data as plain text and others as images, which makes the data hard to collect. In the end I chose NetEase's real-time epidemic broadcast platform as the data source.
The URL is: https://wp.m.163.com/163/page/news/virus_report/index.html?_nw_=1&_anw_=1
When we open it, we see the following:
[Screenshot: NetEase real-time epidemic report page]
[Screenshot: epidemic figures shown on the page]
Opening the page, we find that it is rendered dynamically, so the data we want can be found under the F12 > Network tab.
[Screenshot: the Network tab showing the data request]

Second, a preliminary understanding of the data

[Screenshot: the JSON data located in the Network tab]
In the page above we can see where the data is located, and also that it is of type JSON.
[Screenshot: the request details]
The figure above shows the request URL and the status code.
[Screenshot: the request headers]
The figure above shows the User-Agent.
With all of the above understood, we are ready to begin.
First we import the packages and set the request headers:

import requests
import pandas as pd
import time

pd.set_option('display.max_rows', 500)
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0'}
url = 'https://c.m.163.com/ug/api/wuhan/app/data/list-total'   # the API endpoint to request
r = requests.get(url, headers=headers)                         # send the request with requests
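Before parsing the response, it is worth making sure the request actually succeeded. A minimal sketch using the r and url defined above:

r.raise_for_status()                        # raise an error if the server did not return success
print('Status code:', r.status_code)       # expect 200
print('Response length:', len(r.text))     # the body is one long JSON string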

This is what the request returns:
[Screenshot: the raw response content]
From the figure we can see that the response is a single string hundreds of thousands of characters long. A raw string is not convenient to analyse, and the page preview shows the data in a dictionary-like JSON format, so we convert it to JSON.

import json
data_json = json.loads(r.text)
data_json.keys()
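As a side note, requests can decode JSON for us: Response.json() is equivalent to json.loads(r.text) here, so the same result can be obtained without importing json:

data_json = r.json()            # equivalent to json.loads(r.text)
print(type(data_json))          # dict
print(list(data_json.keys()))   # same keys as above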

[Screenshot: data_json.keys() output]
[Screenshot: preview of data_json]
We can see that the data we need is stored under the data key, so let's extract it.

data = data_json['data']
data.keys()

[Screenshot: the keys inside data]
data contains four keys, each storing different content (a quick sketch right after this list confirms them):

Key name and content:

  • chinaTotal: nationwide data for the current day
  • chinaDayList: nationwide historical data
  • lastUpdateTime: time of the last update
  • areaTree: real-time data for regions around the world
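Here is a quick sketch that simply inspects the data dictionary extracted above and peeks at what each key holds; nothing new is introduced:

print(data['lastUpdateTime'])                        # time of the latest update
for key in ['chinaTotal', 'chinaDayList', 'areaTree']:
    print(key, type(data[key]))                      # see what kind of object each key stores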

Next, let's start extracting the real-time data.

Third, real-time data crawling

3.1 Crawling real-time data for China's provinces

First we crawl the real-time data of China's provinces. The areaTree key stores real-time data from around the world: areaTree is a list in which each element is one country's data, and each element's children holds the data for that country's provinces. We first locate the real-time data for China's provinces, as shown below:
[Screenshot: China's entry inside areaTree]
Now let's pull out the data for China's provinces:

data_province = data['areaTree'][2]['children']   # index 2 is China in this response
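Note that index 2 is simply where China happened to appear in this particular response. If the ordering ever changes, a more robust (hypothetical) variant looks the country up by name, assuming the API reports country names in Chinese such as '中国':

china = next(item for item in data['areaTree'] if item['name'] == '中国')   # find China by name
data_province = china['children']                                           # its provincial data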

Here let's look at the key names in each province record (one full record is also pretty-printed after the list below):

data_province[0].keys()   # inspect the key names for one province

[Screenshot: the keys of a province record]
Key name and content:

  • today: the province's data for the current day
  • total: the province's cumulative data to date
  • extData: no data
  • name: the province's name
  • id: the province's administrative division code
  • lastUpdateTime: time of the last update
  • children: data for the cities within the province
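As promised, a minimal sketch that pretty-prints the first province's record; note the nested today and total dictionaries, which matter later:

from pprint import pprint

pprint(data_province[0])   # full record for the first province; 'today' and 'total' are nested dicts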

Let's first iterate over the provinces and print each one's name and update time:

for i in range(len(data_province)):
    print(data_province[i]['name'],data_province[i]['lastUpdateTime'])
    if i == 5:
        break
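The same loop can be written slightly more idiomatically with a slice; a minor equivalent variation:

for province in data_province[:6]:                          # first six provinces only
    print(province['name'], province['lastUpdateTime'])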

[Screenshot: province names and update times]
Next, let's build a DataFrame directly and look at the result:

pd.DataFrame(data_province).head()

[Screenshot: DataFrame built directly from data_province]
Looking at the generated table, we can clearly see a gap between the data we want and what we got: only the id, lastUpdateTime and name columns display properly. Why is that?
The DataFrame cannot be built cleanly because some values are themselves nested dictionaries. Take Hubei's data as an example: the fields marked with the red line are nested dictionaries, while the ones in the red box are not.
[Screenshot: Hubei's raw record with the nested fields marked]
Analysis of the figure:
Since we only need province-level data, children is not collected, and extData is null so it is not collected either. What we do need are today and total, but because both are nested dictionaries they cannot be read directly. So we take id, lastUpdateTime and name as one DataFrame, today as a second, total as a third, and finally merge the three into one.
Let's do that step by step:


info = pd.DataFrame(data_province)[['id','lastUpdateTime','name']]
info.head()

[Screenshot: info.head() output]

today_data = pd.DataFrame([province['today'] for province in data_province ])
today_data.head()

[Screenshot: today_data.head() output]
Because the keys inside today and total share the same names, we need to rename the columns to tell them apart:

# rename the columns extracted from 'today'
today_data.columns = ['today_'+i for i in today_data.columns]
today_data.head()
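pandas also has a built-in DataFrame.add_prefix that does this renaming in one call; an equivalent sketch:

# equivalent to the column rename above, using add_prefix
today_data = pd.DataFrame([province['today'] for province in data_province]).add_prefix('today_')
today_data.head()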

[Screenshot: today_data.head() after renaming]

# extract the 'total' data and rename its columns
total_data = pd.DataFrame([province['total'] for province in data_province])
total_data.columns = ['total_'+i for i in total_data.columns]
total_data.head()

[Screenshot: total_data.head() output]
Now merge the three DataFrames:

pd.concat([info,total_data,today_data],axis=1).head()

[Screenshot: the merged DataFrame]
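As an aside, the whole split-and-merge can also be done in one step with pandas.json_normalize (pandas 1.0+), which flattens nested dictionaries into columns joined with dots instead of the underscores used above; a minimal sketch:

flat = pd.json_normalize(data_province)           # flatten nested 'today'/'total' dicts in one call
print(flat.columns.tolist()[:10])                 # nested fields appear as 'today.<field>' / 'total.<field>'
flat[['id', 'lastUpdateTime', 'name']].head()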
Finally, save the result as a CSV file:
[Screenshot: saving the provincial data to CSV]
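For reference, a minimal sketch of that save step (matching the save_data helper in the full code further down; the file name pattern is just one choice):

# merge the three DataFrames and save them as a dated CSV file
today_province = pd.concat([info, total_data, today_data], axis=1)
file_name = 'today_province_' + time.strftime('%Y_%m_%d') + '.csv'
today_province.to_csv(file_name, index=None, encoding='utf_8_sig')
print(file_name + ' saved successfully!')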

3.2 Crawling real-time data for countries around the world

As we learned earlier, areaTree in the JSON data is a list in which each element holds one country's real-time data, and each element's children holds the data for that country's provinces. Now let's extract the real-time data for countries around the world.
[Screenshot: areaTree in the response]
Let's look at its structure:
[Screenshot: structure of a country entry in areaTree]
We can see that it is consistent with the provincial real-time data we just crawled, so we can crawl it in the same way, again in three steps:

areaTree = data['areaTree']                                              # real-time data for every country/region

info_world = pd.DataFrame(areaTree)[['id','lastUpdateTime','name']]     # basic info columns

today_data = pd.DataFrame([country['today'] for country in areaTree])   # extract the 'today' data
today_data.columns = ['today_'+i for i in today_data.columns]

total_data = pd.DataFrame([country['total'] for country in areaTree])   # extract the 'total' data
total_data.columns = ['total_'+i for i in total_data.columns]

pd.concat([info_world, total_data, today_data], axis=1).head()          # merge the three DataFrames

The final result is as follows:
[Screenshot: the merged world data]

Fourth, the overall code implementation

# =============================================
# --*-- coding: utf-8 --*--
# @Time    : 2020-03-27
# @Author  : 不温卜火
# @CSDN    : https://blog.csdn.net/qq_16146103
# @FileName: Real-time epidemic.py
# @Software: PyCharm
# =============================================
import requests
import pandas as pd
import json
import time

pd.set_option('display.max_rows', 500)

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0'}

url = 'https://c.m.163.com/ug/api/wuhan/app/data/list-total'   # the API endpoint to request
r = requests.get(url, headers=headers)                         # send the request with requests

data_json = json.loads(r.text)
data = data_json['data']
data_province = data['areaTree'][2]['children']   # China's provinces (index 2 is China in this response)
areaTree = data['areaTree']                       # real-time data for every country/region

class spider_yiqing(object):

    # wrap the data-extraction logic in a reusable function
    @staticmethod
    def get_data(data, info_list):
        info = pd.DataFrame(data)[info_list]                    # basic info columns

        today_data = pd.DataFrame([i['today'] for i in data])  # extract the 'today' data
        today_data.columns = ['today_' + i for i in today_data.columns]

        total_data = pd.DataFrame([i['total'] for i in data])  # extract the 'total' data
        total_data.columns = ['total_' + i for i in total_data.columns]

        return pd.concat([info, total_data, today_data], axis=1)

    # save a DataFrame as a dated CSV file
    @staticmethod
    def save_data(data, name):
        file_name = name + '_' + time.strftime('%Y_%m_%d', time.localtime(time.time())) + '.csv'
        data.to_csv(file_name, index=None, encoding='utf_8_sig')
        print(file_name + ' saved successfully!')


if __name__ == '__main__':
    today_province = spider_yiqing.get_data(data_province, ['id', 'lastUpdateTime', 'name'])
    today_world = spider_yiqing.get_data(areaTree, ['id', 'lastUpdateTime', 'name'])
    spider_yiqing.save_data(today_province, 'today_province')
    spider_yiqing.save_data(today_world, 'today_world')

Fifth, screenshots of a successful run

[Screenshot: console output of a successful run]
China's provinces:
[Screenshot: the saved today_province CSV]
The world:
[Screenshot: the saved today_world CSV]

Sixth, summary

The code in this program is a bit messy and not well layered. There may also be more efficient ways to crawl the data.
