[Data Mining] Using Python to analyze public data (1/10)

1. Description

        This article discusses how to analyze official COVID-19 case data in Python using the Pandas library. You'll see how to glean insights from real data sets, uncovering information that might not be obvious at first glance. In particular, the examples in this article show how to determine how quickly the disease spread in different countries.

2. Prepare your work environment

        To proceed, you need to have the Pandas library installed in your Python environment. If you don't have it already, you can install it with the pip command:

pip install pandas 

        Then, you need to choose an actual dataset to use. For the examples in this article, I need a dataset that contains the total number of confirmed COVID-19 cases by country and date. Such a dataset can be downloaded as a CSV file from the Novel Coronavirus (COVID-19) Cases Data page on the Humanitarian Data Exchange: time_series_covid19_confirmed_global_narrow.csv

3. Load the data and prepare it for analysis

        Before reading the downloaded CSV file into a pandas dataframe, I manually removed the unnecessary second line:

#adm1+name,#country+name,#geo+lat,#geo+lon,#date,#affected+infected+value+num 
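Alternatively, that second line (a row of machine-readable tags) can be skipped at load time instead of being edited out by hand. A minimal self-contained sketch, using an in-memory stand-in for the downloaded file:

```python
import io
import pandas as pd

# A miniature stand-in for the downloaded file: row 0 is the real header,
# row 1 is the tag row we want to drop.
csv_text = (
    "Province/State,Country/Region,Lat,Long,Date,Value\n"
    "#adm1+name,#country+name,#geo+lat,#geo+lon,#date,#affected+infected+value+num\n"
    ",Afghanistan,33.0,65.0,2020-04-01,237\n"
)

# skiprows=[1] tells read_csv to skip the second physical line of the file,
# so no manual editing is needed
df = pd.read_csv(io.StringIO(csv_text), skiprows=[1])
print(df)
```

Passing `skiprows=[1]` directly to `pd.read_csv` on the real file path works the same way.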

        Then I read it into a pandas dataframe:

>>> import pandas as pd
>>> df = pd.read_csv("/home/usr/dataset/time_series_covid19_confirmed_global_narrow.csv")

Let’s now take a closer look at the file structure. The simplest way to do it is with the head method of the dataframe object:

>>> df.head()
  Province/State Country/Region   Lat  Long        Date  Value
0            NaN    Afghanistan  33.0  65.0  2020-04-01    237
1            NaN    Afghanistan  33.0  65.0  2020-03-31    174
2            NaN    Afghanistan  33.0  65.0  2020-03-30    170
3            NaN    Afghanistan  33.0  65.0  2020-03-29    120
4            NaN    Afghanistan  33.0  65.0  2020-03-28    110

        Since we are not going to perform a complex analysis that takes into account how close the affected countries are to each other geographically, we can safely remove the geographic latitude and geographic longitude columns from the dataset. This can be done as follows:

>>> df.drop("Lat", axis=1, inplace=True)
>>> df.drop("Long", axis=1, inplace=True)
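Both columns can also be dropped in a single call by passing a list to the `columns` parameter. A small self-contained sketch on a toy frame mirroring the dataset's layout:

```python
import pandas as pd

# Toy frame with the same column layout as the dataset
df = pd.DataFrame({
    "Province/State": [None],
    "Country/Region": ["Afghanistan"],
    "Lat": [33.0],
    "Long": [65.0],
    "Date": ["2020-04-01"],
    "Value": [237],
})

# Equivalent to the two drop() calls, in one operation
df = df.drop(columns=["Lat", "Long"])
print(list(df.columns))  # → ['Province/State', 'Country/Region', 'Date', 'Value']
```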

        Our content should now look like this:

>>> df.head()
  Province/State Country/Region        Date  Value
0            NaN    Afghanistan  2020-04-01    237
1            NaN    Afghanistan  2020-03-31    174
2            NaN    Afghanistan  2020-03-30    170
3            NaN    Afghanistan  2020-03-29    120
4            NaN    Afghanistan  2020-03-28    110

It would also be interesting to know how many rows are in the dataset before we start removing unnecessary rows:

>>> df.shape
(18176, 4)

4. Compress the dataset

        As you browse the rows in the dataset, you may notice that some country information is detailed by region (for example, China). But what you need is consolidated data for the entire country. To accomplish this merging step, you can apply a groupby operation to the dataset as follows:

>>> df = df.groupby(['Country/Region','Date']).sum(numeric_only=True).reset_index()

This operation should reduce the number of rows in the dataset, eliminating the province/state column:

>>> df.shape
(12780, 3)
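To see what this consolidation step does, here is a minimal sketch with invented numbers in which two provincial rows for the same country and date collapse into one national row:

```python
import pandas as pd

# Invented example: two provinces reported separately on the same date
df = pd.DataFrame({
    "Country/Region": ["China", "China", "Italy"],
    "Date": ["2020-03-01", "2020-03-01", "2020-03-01"],
    "Value": [100, 50, 30],
})

# One consolidated row per (country, date); provincial values are summed
df = df.groupby(["Country/Region", "Date"]).sum().reset_index()
print(df)  # China's single row now holds 150 (= 100 + 50)
```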

5. Perform the analysis

        Suppose you need to determine the rate at which the disease spread in different countries at the initial stage. Say, for example, you want to know how many days it took each country's confirmed case count to grow from about 100 cases to 1,500 cases.

        First, you need to filter out countries that are less affected and have not yet reached a large number of confirmed cases. This can be done as follows:

>>> df = df.groupby(['Country/Region'])
>>> df = df.filter(lambda x: x['Value'].mean() > 1000) 

You can then retrieve only those rows that meet the specified criteria:

>>> df = df.loc[(df['Value'] > 100) & (df['Value'] < 1500)] 
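The two-step filtering can be sketched end to end on invented numbers: the groupby filter removes lightly affected countries, and the loc mask keeps only the rows inside the 100–1500 window. The country names here are made up for illustration:

```python
import pandas as pd

# Invented numbers: "Bigland" is heavily affected, "Smallland" is not
df = pd.DataFrame({
    "Country/Region": ["Bigland"] * 3 + ["Smallland"] * 3,
    "Date": ["2020-03-01", "2020-03-02", "2020-03-03"] * 2,
    "Value": [200, 900, 2100, 5, 10, 20],
})

# Keep only countries whose mean case count exceeds 1000
df = df.groupby("Country/Region").filter(lambda x: x["Value"].mean() > 1000)

# Keep only rows inside the 100-1500 window
df = df.loc[(df["Value"] > 100) & (df["Value"] < 1500)]
print(df)  # only Bigland's rows with values 200 and 900 remain
```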

        After these operations, the number of rows should be significantly reduced.

>>> df
       Country/Region        Date  Value
685           Austria  2020-03-08    104
686           Austria  2020-03-09    131
687           Austria  2020-03-10    182
688           Austria  2020-03-11    246
689           Austria  2020-03-12    302
...               ...         ...    ...
12261  United Kingdom  2020-03-11    459
12262  United Kingdom  2020-03-12    459
12263  United Kingdom  2020-03-13    802
12264  United Kingdom  2020-03-14   1144
12265  United Kingdom  2020-03-15   1145

[118 rows x 3 columns]

        At this point, you may want to review the entire dataset. This can be done with the following line of code:

>>> print(df.to_string())
     Country/Region        Date  Value
685         Austria  2020-03-08    104
686         Austria  2020-03-09    131
687         Austria  2020-03-10    182
688         Austria  2020-03-11    246
689         Austria  2020-03-12    302
690         Austria  2020-03-13    504
691         Austria  2020-03-14    655
692         Austria  2020-03-15    860
693         Austria  2020-03-16   1018
694         Austria  2020-03-17   1332
1180        Belgium  2020-03-06    109
1181        Belgium  2020-03-07    169
…

        All that's left is to count the number of rows for each country.

>>> df.groupby(['Country/Region']).size()
Country/Region
Austria        10
Belgium        13
China          4
France         9
Germany        10
Iran           5
Italy          7
Korea, South   7
Netherlands    11
Spain          8
Switzerland    10
Turkey         4
US             9
United Kingdom 11 

        The list above answers the question of how many days it took each country's confirmed case count to grow from roughly 100 to 1,500.
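Putting the pieces together, here is a self-contained sketch on invented data (a single made-up country, "Exampleland") showing why the size() count equals the number of days spent between the two thresholds: the data has one row per day, so counting rows counts days:

```python
import pandas as pd

# Invented daily totals for one country crossing from ~100 to 1500 cases
df = pd.DataFrame({
    "Country/Region": ["Exampleland"] * 6,
    "Date": pd.date_range("2020-03-01", periods=6).astype(str),
    "Value": [80, 120, 400, 900, 1400, 1600],
})

# Rows strictly between the two thresholds
window = df.loc[(df["Value"] > 100) & (df["Value"] < 1500)]

# One row per day, so the group size is the number of days in the window
days = window.groupby("Country/Region").size()
print(days["Exampleland"])  # 4 days: values 120, 400, 900, 1400
```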

6. Postscript

        This article is the first in the series; subsequent installments will continue to describe the data analysis process in more depth.

 Yuli Vasiliev – Medium

Origin blog.csdn.net/gongdiwudu/article/details/132333000