Pandas: A data-cleaning case study on world GDP data

Dataset: Gross Domestic Product of World Countries from 1960 to 2020

Data format: CSV

Data source: World Bank

Experimental environment: Jupyter Notebook

Network disk link: Baidu network disk - GDP data set


1.1 Dependency preparation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

First import the required dependencies: pandas/numpy for data processing and matplotlib.pyplot for data visualization.

1.2 Data preparation

Read the data. This line of code assumes the CSV file is in the same directory as the notebook (.ipynb) file.

df = pd.read_csv('GDP.csv', encoding='utf-8')

Once read, the data is held in memory as a DataFrame.

1.3 Data Observation
(1) Observe the data shape
df.shape
(266, 65)
(2) Observe the first five rows of the data
df.head()
Country Name  Country Code  Indicator Name     Indicator Code  1960      1961      1962      1963
Aruba         ABW           GDP (current US$)  NY.GDP.MKTP.CD  NaN       NaN       NaN       NaN
NaN           AFE           GDP (current US$)  NY.GDP.MKTP.CD  1.93E+10  1.97E+10  2.15E+10  2.57E+10
Afghanistan   AFG           GDP (current US$)  NY.GDP.MKTP.CD  5.38E+08  5.49E+08  5.47E+08  7.51E+08
NaN           AFW           GDP (current US$)  NY.GDP.MKTP.CD  1.04E+10  1.11E+10  1.19E+10  1.27E+10
Angola        AGO           GDP (current US$)  NY.GDP.MKTP.CD  NaN       NaN       NaN       NaN
(3) Observe the list of column names
df.columns
Index(['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code',
       '1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968',
       '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977',
       '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986',
       '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995',
       '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004',
       '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013',
       '2014', '2015', '2016', '2017', '2018', '2019', '2020'],
      dtype='object')
(4) Observe the data types of each column
df.dtypes
Country Name       object
Country Code       object
Indicator Name     object
Indicator Code     object
1960              float64
                   ...   
2016              float64
2017              float64
2018              float64
2019              float64
2020              float64
(5) Observation results

Combining the above code and running results, we observe that

  • The two-dimensional data has 266 rows and 65 columns
  • The first four columns describe the country and the indicator
  • The remaining columns hold yearly data stored as floating-point numbers (float64)
  • Each row represents one country entity
1.4 Data cleaning

Having formed an overall impression of the dataset during observation, we now move on to data cleaning.

(1) Remove useless fields

Use the loc indexer to slice out the two indicator columns from the dataset

useless_column = df.loc[:,['Indicator Name','Indicator Code']]

Inspect its first and last three rows:

index  Indicator Name     Indicator Code
0      GDP (current US$)  NY.GDP.MKTP.CD
1      GDP (current US$)  NY.GDP.MKTP.CD
2      GDP (current US$)  NY.GDP.MKTP.CD
263    GDP (current US$)  NY.GDP.MKTP.CD
264    GDP (current US$)  NY.GDP.MKTP.CD
265    GDP (current US$)  NY.GDP.MKTP.CD

Every row repeats the same value, which provides no differentiating information for data mining, so we delete both columns outright.

print("Number of columns before deletion: " + str(df.shape[1]))
df.drop(labels='Indicator Name', axis=1, inplace=True)
df.drop(labels='Indicator Code', axis=1, inplace=True)
print("Number of columns after deletion: " + str(df.shape[1]))
(2) Identify missing values

Use the isnull function to count the null values in each column

df.isnull().sum()
Country Name        2
Country Code        0
1960              138
                 ... 
2016               10
2017               10
2018               10
2019               13
2020               24

Plot the counts to observe the missing values more intuitively

x = np.arange(0, df.shape[1])  # x-axis: column indices
y = list(df.isnull().sum())    # y-axis: missing-value counts
plt.figure(figsize=(16, 7))    # set up the canvas
plt.subplot(1, 2, 1)
# Left panel: missing-value counts per column
plt.plot(x, y)
plt.title('Missing values per column')
# plt.savefig('gen_pics/missing_values_curve.png')
plt.xlabel('Column index')

plt.subplot(1, 2, 2)
# Right panel: missing-value proportions per column
y = list(i / df.shape[0] for i in df.isnull().sum())
plt.bar(x, y)
plt.xlabel('Column index')
plt.title('Missing-value proportion per column')
plt.show()

[Figure: per-column missing-value counts (left) and proportions (right)]

Almost every column has missing values; for the years 1960-1970, the proportion of missing GDP values even exceeds 40%.

(3) Remove too many missing rows and columns

We delete outright any column whose missing-value proportion exceeds 20%.

(This threshold may delete more than strictly necessary; the point here is to demonstrate the method.)

for name in df.columns:
    if (df[name].isnull().sum() / df.shape[0]) > 0.2:
        df.drop(labels=name, axis=1, inplace=True)

The dataset now retains only the years 1990 to 2020, each with a missing-value proportion below 20%.

[Figure: missing-value proportions for the retained 1990-2020 columns]

Observe the per-column missing values again:

[Figure: per-column missing-value counts after removal]

The missing-value situation is clearly much improved.

Apply the same logic to rows to remove countries with too many missing values.
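A minimal sketch of that row-wise step, using toy data and the same 20% threshold as the column-wise step above (the country names and year columns here are illustrative):

```python
import numpy as np
import pandas as pd

# Toy data: country 'B' is almost entirely missing, 'C' is partly missing
df = pd.DataFrame({
    'Country Name': ['A', 'B', 'C'],
    '1990': [1.0, np.nan, 3.0],
    '1991': [2.0, np.nan, np.nan],
    '1992': [3.0, np.nan, 5.0],
})

# Share of missing values in each row
row_missing_ratio = df.isnull().sum(axis=1) / df.shape[1]

# Keep only rows at or below the 20% threshold
df = df.loc[row_missing_ratio <= 0.2].reset_index(drop=True)
print(df['Country Name'].tolist())  # → ['A']
```

Note that axis=1 in isnull().sum(axis=1) sums across each row, mirroring the per-column sums used earlier.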

(4) Fill missing values

The key code of linear interpolation is as follows:

# Forward linear interpolation down each column
df = df.interpolate(method='linear', axis=0, inplace=False, limit_direction='forward')

# Backward linear interpolation down each column
df = df.interpolate(method='linear', axis=0, inplace=False, limit_direction='backward')

Note the data types: only numeric columns can be interpolated, so convert any string columns first.
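A minimal sketch of that conversion, assuming a year column was read in as strings; pd.to_numeric with errors='coerce' turns non-numeric entries into NaN so interpolation can proceed:

```python
import pandas as pd

# Toy year column read as strings, with one missing entry
df = pd.DataFrame({'1990': ['1.0', '2.0', None, '4.0']})

# Convert to numeric; unparseable entries become NaN
df['1990'] = pd.to_numeric(df['1990'], errors='coerce')

# Linear interpolation now works on the numeric column
df = df.interpolate(method='linear', axis=0, limit_direction='forward')
print(df['1990'].tolist())  # → [1.0, 2.0, 3.0, 4.0]
```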

(5) Check outliers

For approximately normal data, values are almost entirely concentrated in the interval (μ-3σ, μ+3σ); values beyond 3σ can be treated as outliers.

# Only numeric columns have a mean and standard deviation
for name in df.select_dtypes(include='number').columns:
    min_GDP = df[name] < (df[name].mean() - 3 * df[name].std())
    max_GDP = df[name] > (df[name].mean() + 3 * df[name].std())

    GDP_fit = min_GDP | max_GDP
    print(df.loc[GDP_fit, name])

Box plots give a more intuitive view of outliers:

label = ['South Africa', 'Argentina', 'Zimbabwe']  # box labels
# Rows 263, 9 and 265 hold South Africa, Argentina and Zimbabwe;
# take only the year columns (everything after the two name columns)
years = df.columns[2:]
gdp = (list(df.loc[263, years]), list(df.loc[9, years]), list(df.loc[265, years]))
plt.figure(figsize=(6, 4))
plt.boxplot(gdp, labels=label)
plt.title('GDP box plot')
plt.show()

[Figure: GDP box plots for South Africa, Argentina and Zimbabwe]

If outliers are found, they are removed or smoothed.
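One common smoothing choice is to clip values to the 3-sigma boundaries rather than delete them; a sketch with toy data:

```python
import pandas as pd

# Toy series: twenty normal values and one extreme outlier
s = pd.Series([100.0] * 20 + [1000.0])

mean, std = s.mean(), s.std()
lower, upper = mean - 3 * std, mean + 3 * std

# Clip pulls any value outside (mean - 3*std, mean + 3*std) back to the boundary
s_clipped = s.clip(lower=lower, upper=upper)
print(s_clipped.max() < 1000.0)  # the outlier has been pulled in
```

Clipping preserves the row (useful when each row is a country-year you cannot simply drop), at the cost of replacing the extreme value with a boundary value.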

(6) Remaining steps

This dataset is of relatively high quality. A complete data-cleaning pipeline would also include the remaining steps:

  • Remove unreasonable values, such as a country's GDP exceeding the global total
  • Remove type errors, such as text entered in a GDP field
  • Remove duplicated rows and columns, for example a year's GDP counted twice
  • Correlation testing: compute the correlation between fields
  • etc.
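The duplicate-row check from the list above can be sketched as follows (toy data with hypothetical GDP figures):

```python
import pandas as pd

# Toy data with one fully duplicated row
df = pd.DataFrame({
    'Country Name': ['Argentina', 'Argentina', 'Brazil'],
    '1990': [1.4e11, 1.4e11, 4.6e11],
})

n_dups = df.duplicated().sum()  # count exact duplicate rows
df = df.drop_duplicates().reset_index(drop=True)
print(n_dups, len(df))  # → 1 2
```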


Origin blog.csdn.net/yt266666/article/details/127306966