Dataset: Gross Domestic Product of World Countries from 1960 to 2020
Data format: CSV
Data source: World Bank
Experimental environment: Jupyter Notebook
Network disk link: Baidu network disk - GDP data set
1.1 Dependency preparation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
First, import the dependencies: pandas and numpy for data processing, and matplotlib.pyplot for data visualization.
1.2 Data preparation
Read the data. For this line of code to work, the CSV file must be placed in the same directory as the ipynb file.
df = pd.read_csv('GDP.csv',encoding = 'utf-8')
Once the read completes, the data is held in the environment as a pandas DataFrame.
1.3 Data observation
(1) Observe the data shape
df.shape
(266, 65)
(2) Observe the first five rows of the data
df.head()
Country Name | Country Code | Indicator Name | Indicator Code | 1960 | 1961 | 1962 | 1963 |
---|---|---|---|---|---|---|---|
Aruba | ABW | GDP (current US dollars) | NY.GDP.MKTP.CD | NaN | NaN | NaN | NaN |
NaN | AFE | GDP (current US dollars) | NY.GDP.MKTP.CD | 1.93E+10 | 1.97E+10 | 2.15E+10 | 2.57E+10 |
Afghanistan | AFG | GDP (current US dollars) | NY.GDP.MKTP.CD | 5.38E+08 | 5.49E+08 | 5.47E+08 | 7.51E+08 |
NaN | AFW | GDP (current US dollars) | NY.GDP.MKTP.CD | 1.04E+10 | 1.11E+10 | 1.19E+10 | 1.27E+10 |
Angola | AGO | GDP (current US dollars) | NY.GDP.MKTP.CD | NaN | NaN | NaN | NaN |
(3) Observe the list of column names
df.columns
Index(['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code',
'1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968',
'1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977',
'1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986',
'1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995',
'1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004',
'2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013',
'2014', '2015', '2016', '2017', '2018', '2019', '2020'],
dtype='object')
(4) Observe the data types of each column
df.dtypes
Country Name object
Country Code object
Indicator Name object
Indicator Code object
1960 float64
...
2016 float64
2017 float64
2018 float64
2019 float64
2020 float64
(5) Observation results
Combining the code above with its output, we observe that:
- The two-dimensional data has 266 rows and 65 columns in total
- The first four columns describe the country and the indicator
- All subsequent columns are year columns holding floating-point numbers (float64)
- Each row represents one country or region entity
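The split between descriptive columns and year columns can also be checked programmatically. The following is a minimal sketch, assuming a made-up miniature frame `demo` that only imitates the layout of GDP.csv (the values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for GDP.csv; all values are made up
demo = pd.DataFrame({
    "Country Name": ["A", "B", "C"],
    "Country Code": ["AAA", "BBB", "CCC"],
    "Indicator Name": ["GDP (current US dollars)"] * 3,
    "Indicator Code": ["NY.GDP.MKTP.CD"] * 3,
    "1960": [np.nan, 1.0e9, 2.0e9],
    "1961": [5.0e8, 1.1e9, np.nan],
})

# Numeric columns are the year columns; everything else is descriptive text
year_cols = demo.select_dtypes(include="number").columns.tolist()
text_cols = demo.select_dtypes(exclude="number").columns.tolist()

print("rows, columns:", demo.shape)
print("descriptive columns:", text_cols)
print("year columns:", year_cols)
```

Separating the columns by dtype this way avoids hard-coding the position of the first year column.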
1.4 Data cleaning
Having observed the data, we now have an overall impression of the data set and can move on to data cleaning.
(1) Remove useless fields
Use the loc indexer to slice out the two indicator columns from the dataset
useless_column = df.loc[:,['Indicator Name','Indicator Code']]
Look at its first three and last three rows
index | Indicator Name | Indicator Code |
---|---|---|
0 | GDP (current US dollars) | NY.GDP.MKTP.CD |
1 | GDP (current US dollars) | NY.GDP.MKTP.CD |
2 | GDP (current US dollars) | NY.GDP.MKTP.CD |
263 | GDP (current US dollars) | NY.GDP.MKTP.CD |
264 | GDP (current US dollars) | NY.GDP.MKTP.CD |
265 | GDP (current US dollars) | NY.GDP.MKTP.CD |
Every entity has the same value in these columns, so they carry no distinguishing or valuable information for data mining. We therefore delete both columns entirely.
print("Number of columns before deletion: "+str(df.shape[1]))
df.drop(labels = 'Indicator Name',axis = 1,inplace = True)
df.drop(labels = 'Indicator Code',axis = 1,inplace = True)
print("Number of columns after deletion: "+str(df.shape[1]))
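Whether a column really holds a single repeated value can be tested rather than judged by eye. The sketch below is one possible way to do it, assuming a made-up frame `demo`; it is not the original author's code:

```python
import pandas as pd

# Hypothetical miniature frame; values are made up for illustration
demo = pd.DataFrame({
    "Country Code": ["AAA", "BBB", "CCC"],
    "Indicator Name": ["GDP (current US dollars)"] * 3,
    "Indicator Code": ["NY.GDP.MKTP.CD"] * 3,
    "2020": [1.0, 2.0, 3.0],
})

# A column with exactly one distinct value cannot distinguish between rows
constant_cols = [c for c in demo.columns if demo[c].nunique(dropna=False) == 1]

# Drop all constant columns in one call; drop() returns a new DataFrame here
cleaned = demo.drop(columns=constant_cols)

print("constant columns:", constant_cols)
print("remaining columns:", cleaned.columns.tolist())
```

Passing a list to `drop(columns=...)` removes several columns at once, which avoids repeating one `drop` call per column.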
(2) Identify missing values
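Use the isnull method together with sum to count the null values in each column (the output below still lists the two indicator columns, so it was evidently produced before they were dropped)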
df.isnull().sum()
Country Name 2
Country Code 0
Indicator Name 0
Indicator Code 0
1960 138
...
2016 10
2017 10
2018 10
2019 13
2020 24
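Plot the counts to see the number of missing values more intuitively
x = np.arange(0, df.shape[1])  # x axis: column positions
y = list(df.isnull().sum())  # y axis: number of missing values per column
plt.figure(figsize=(16,7))  # set up the canvas
plt.subplot(1, 2, 1)
# left panel: line plot of the missing-value counts
plt.plot(x,y)
plt.title('Number of missing values per column')
# plt.savefig('gen_pics/缺失值曲线.png')
plt.xlabel('Column index')  # x-axis label
plt.subplot(1, 2, 2)
x = np.arange(0, df.shape[1])  # x axis: column positions
y = list(i/df.shape[0] for i in df.isnull().sum())  # y axis: proportion of missing values per column
# right panel: bar chart of the missing-value proportions
plt.bar(x,y)
plt.xlabel('Column index')  # x-axis label
plt.title('Proportion of missing values per column')
plt.show()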
Almost every column has missing values, and for the 1960-1970 columns the proportion of missing GDP values even exceeds 40%.
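The same counts and proportions can be collected into one table instead of being read off a plot. This is a minimal sketch on a made-up frame `demo`; it relies on the fact that the mean of a boolean mask equals the fraction of True values:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; values are made up for illustration
demo = pd.DataFrame({
    "Country Code": ["AAA", "BBB", "CCC", "DDD"],
    "1960": [np.nan, np.nan, np.nan, 1.0],
    "1990": [1.0, np.nan, 3.0, 4.0],
    "2020": [1.0, 2.0, 3.0, 4.0],
})

missing_count = demo.isnull().sum()   # number of NaN values per column
missing_ratio = demo.isnull().mean()  # fraction of NaN values per column

# One row per column of demo, worst columns first
summary = pd.DataFrame({"missing": missing_count, "ratio": missing_ratio})
worst = summary.sort_values("ratio", ascending=False)
print(worst)
```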
(3) Remove rows and columns with too many missing values
We delete outright every column whose proportion of missing values is higher than 20%.
This threshold may delete too much here; it is used mainly to demonstrate the method.
for name in df.columns:
    if (df[name].isnull().sum()/df.shape[0])>0.2:
        df.drop(labels = name,axis = 1,inplace = True)
The dataset now keeps the GDP data of each country from 1990 to 2020, and the proportion of missing values in every remaining column does not exceed 20%.
Checking the missing values per column again shows that the situation is much better than before.
Apply the same procedure to the rows to remove countries with too many missing values.
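The row-wise step is described above without code. A minimal sketch, assuming the same 20% threshold is reused for rows and using a made-up frame `demo`, could look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; values are made up for illustration
demo = pd.DataFrame({
    "Country Code": ["AAA", "BBB", "CCC"],
    "1990": [1.0, np.nan, 3.0],
    "1991": [1.1, np.nan, 3.1],
    "1992": [1.2, np.nan, np.nan],
    "1993": [1.3, 2.3, 3.3],
    "1994": [1.4, 2.4, 3.4],
    "1995": [1.5, 2.5, 3.5],
})
threshold = 0.2  # assumption: same cut-off as for the columns

# Measure missingness over the year columns only, one ratio per row
year_cols = demo.select_dtypes(include="number").columns
row_missing_ratio = demo[year_cols].isnull().mean(axis=1)

# Keep the rows (countries) at or below the threshold
kept = demo.loc[row_missing_ratio <= threshold]
print(kept["Country Code"].tolist())
```

Building a boolean mask and filtering once avoids dropping rows from a frame while iterating over it.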
(4) Fill missing values
The key code of linear interpolation is as follows:
# forward linear interpolation down each column (axis=0)
df = df.interpolate(method='linear', axis=0,inplace=False,limit_direction='forward')
# backward linear interpolation down each column (axis=0)
df = df.interpolate(method='linear', axis=0,inplace=False,limit_direction='backward')
Note that interpolation only works on numeric data, so convert the column data types first where necessary.
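With axis=0, a gap is filled from the neighbouring rows of the same year column, that is, from other countries. An alternative worth considering is to interpolate along each country's own time series with axis=1. The sketch below illustrates that variant on a made-up frame `demo`, restricted to the numeric year columns; it is an assumption about what may suit this data, not the original author's method:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; values are made up for illustration
demo = pd.DataFrame({
    "Country Code": ["AAA", "BBB"],
    "1990": [1.0, np.nan],
    "1991": [np.nan, 10.0],
    "1992": [3.0, np.nan],
    "1993": [np.nan, 30.0],
})

# Interpolate only the numeric year columns, leaving the text column untouched
year_cols = demo.select_dtypes(include="number").columns
filled = demo.copy()

# axis=1 walks along each row, i.e. along one country's years;
# limit_direction="both" also fills gaps at the start and end of a row
filled[year_cols] = demo[year_cols].interpolate(
    method="linear", axis=1, limit_direction="both"
)
print(filled)
```

Interior gaps receive the linear midpoint of their neighbours, while gaps at either end of a row are filled with the nearest available value.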
(5) Check outliers
According to the 3σ rule, if the data are approximately normally distributed, almost all values (about 99.7%) fall within the interval (μ-3σ, μ+3σ), so values outside this interval can be treated as abnormal data.
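for name in df.columns:
    if not pd.api.types.is_numeric_dtype(df[name]):
        continue  # skip the text columns (country name and code)
    min_GDP = df[name] < (df[name].mean() - 3*df[name].std())
    max_GDP = df[name] > (df[name].mean() + 3*df[name].std())
    GDP_fit = min_GDP | max_GDP
    print(df.loc[GDP_fit,name])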
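The 3σ check can be wrapped in a small helper and tried on data whose outlier is known in advance. This is a minimal sketch; the function name `three_sigma_outliers` and the sample values are hypothetical:

```python
import numpy as np
import pandas as pd

def three_sigma_outliers(values: pd.Series) -> pd.Series:
    """Return the entries lying outside (mean - 3*std, mean + 3*std)."""
    mean, std = values.mean(), values.std()
    mask = (values < mean - 3 * std) | (values > mean + 3 * std)
    return values[mask]

# 200 made-up "typical" values between 90 and 110, plus one extreme value
typical = np.linspace(90.0, 110.0, 200)
gdp_2020 = pd.Series(np.append(typical, 1000.0), name="2020")

outliers = three_sigma_outliers(gdp_2020)
print(outliers)
```

The rule needs a reasonably large sample: with only a handful of values, no single point can lie more than 3 sample standard deviations from the mean, so nothing would ever be flagged.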
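A box plot shows outliers more intuitively
label = ['South Africa','Argentina','Zimbabwe']  # box labels
# b is not defined in the original; it is assumed here to be the transposed year columns,
# so that b[263] is the GDP series of the row with index 263
b = df.select_dtypes(include='number').T
gdp = (list(b[263]),list(b[9]),list(b[265]))
plt.figure(figsize=(6,4))
plt.boxplot(gdp,labels = label)
plt.title('GDP box plot')
plt.show()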
If outliers are found, they are removed or smoothed.
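Removal and smoothing can both be expressed with the same fences that a box plot draws by default (1.5 times the interquartile range beyond the quartiles). The following is a minimal sketch on made-up values; clipping to the fences is only one possible way of smoothing:

```python
import pandas as pd

# Made-up values with one obvious outlier
values = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 100.0])

# Quartiles and interquartile range
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1

# Fences used by the usual box plot rule
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

removed = values[~is_outlier]                      # option 1: drop the outliers
smoothed = values.clip(lower=lower, upper=upper)   # option 2: pull them back to the fence

print("fences:", lower, upper)
print("removed:", removed.tolist())
print("smoothed:", smoothed.tolist())
```

Clipping keeps the number of observations unchanged, which matters when every country should remain in the dataset.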
(6) Remaining steps
This dataset is of fairly high quality. In general, a complete data cleaning should also include the remaining steps:
- Remove unreasonable values, such as a country's GDP value that exceeds the global total
- Remove symbol errors, such as text filled in the GDP field
- Remove repeated rows and columns, for example, the GDP of a year is counted twice
- Correlation test: calculate the correlation between fields
- etc.
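Several of these remaining steps map onto standard pandas calls. The sketch below runs them on a made-up frame `demo` that contains a text entry in a GDP field and a duplicated row; it only illustrates the idea and is not a complete cleaning routine:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; values are made up for illustration
demo = pd.DataFrame({
    "Country Code": ["AAA", "BBB", "BBB", "CCC"],
    "2019": ["1.0", "2.0", "2.0", "unknown"],  # text slipped into a GDP field
    "2020": [1.5, 2.5, 2.5, 3.5],
})

# Symbol errors: anything that cannot be parsed as a number becomes NaN
demo["2019"] = pd.to_numeric(demo["2019"], errors="coerce")

# Repeated rows: keep the first occurrence of each identical row
deduped = demo.drop_duplicates().reset_index(drop=True)

# Correlation test: pairwise Pearson correlation between the year columns
corr = deduped[["2019", "2020"]].corr()

print(deduped)
print(corr)
```

Values turned into NaN by `errors="coerce"` can then go through the same missing-value handling as in step (4).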