A beginner Python project: data cleaning


Foreword

This article is a brief introductory exercise in data cleaning.
The data source used in this article is a data set provided by the Boya Reading Club.


1. What is data cleaning?

Data cleaning refers to the preprocessing operations applied to raw data, before analysis or mining, to ensure the data are of high quality and accuracy. Its purpose is to identify and then correct or delete inaccurate, incomplete, duplicate, erroneous or invalid records in the data set, improving the efficiency and accuracy of subsequent analysis and modeling.

Data cleaning typically covers the following situations:

1. Missing value processing: fill in or remove missing data so that the data set contains no missing values.

2. Outlier processing: detect and handle outliers in the data set so they do not distort subsequent analysis.

3. Duplicate value processing: delete duplicate records in the data set to avoid redundancy.

4. Data type conversion: convert strings and other types to numeric types so that more statistical analysis can be performed.

5. Data normalization: standardize data measured on different scales to avoid analysis errors caused by differences in units.

Through data cleaning we remove noise and redundant information from the raw data, improving data quality so that subsequent analysis and modeling tasks can be completed more reliably.
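The five operations above can be sketched in a few lines of pandas. This is a minimal illustration on a made-up toy DataFrame (the column names and values here are invented for the sketch; they are not from the article's data set):

import pandas as pd

# Toy data illustrating the five cleaning operations above.
df = pd.DataFrame({
    "price": ["10", "12", None, "12"],
    "score": [3.5, 4.0, 4.0, 4.0],
})

df = df.drop_duplicates()                 # 3. remove duplicate rows
df = df.dropna(subset=["price"])          # 1. drop rows with a missing price
df["price"] = pd.to_numeric(df["price"])  # 4. convert strings to numbers
df = df[df["score"].between(0, 5)]        # 2. filter out-of-range scores
# 5. standardize score to zero mean / unit variance
df["score_norm"] = (df["score"] - df["score"].mean()) / df["score"].std()
print(df)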

2. Duplicate value processing

import pandas as pd
raw = pd.read_excel("shops_nm.xlsx")

# Check whether any rows are duplicated
duplicate_raw = raw[raw.duplicated()]
if len(duplicate_raw) == 0:
    print("No duplicate rows.")
else:
    print(duplicate_raw)

# Manufacture a duplicate row; iloc[] indexes rows first, then columns
#print(raw.iloc[0, :])
raw.iloc[1, :] = raw.iloc[0, :]  # copy the first row into the second row
duplicate_raw = raw[raw.duplicated()]
if len(duplicate_raw) == 0:
    print("No duplicate rows.")
else:
    print(duplicate_raw)

# Check the 店名 (shop name) column for duplicates
duplicate_shop = raw['店名'][raw['店名'].duplicated()]
if len(duplicate_shop) == 0:
    print("No duplicate shop names.")
else:
    print(duplicate_shop)

# Remove duplicates with drop_duplicates()
drop_duplicates_shops = raw.drop_duplicates(subset=['店名'])
print(drop_duplicates_shops.head())

The result of the program is shown in the figure:
At the start there are no duplicate rows. I then assign the first row to the second row to artificially create a duplicate, find that duplicated row, and also check the 店名 (shop name) column for duplicates. The last step removes the duplicate rows with drop_duplicates().
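Note that drop_duplicates() keeps the first occurrence by default; the keep parameter controls this. A small sketch on toy data (these values are made up and are not from shops_nm.xlsx):

import pandas as pd

# Toy data with a duplicated shop name.
df = pd.DataFrame({"店名": ["A", "A", "B"], "评分": [4.5, 4.8, 4.0]})

print(df.drop_duplicates(subset=["店名"]))               # keeps the first "A"
print(df.drop_duplicates(subset=["店名"], keep="last"))  # keeps the last "A"
print(df.drop_duplicates(subset=["店名"], keep=False))   # drops every duplicated "A"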

3. Missing value processing

A missing value does not necessarily mean that nothing was entered at that position; it can also mean that whatever was entered there is unusable. Irregularly filled-in entries, for example, also end up as missing values.

import pandas as pd
raw = pd.read_excel("shops_nm.xlsx")
print(raw.shape)
# Find rows whose 评价数 (review count) is missing
null_raw = raw[raw['评价数'].isnull()]
print(null_raw)

# Drop those rows
raw1 = raw[raw['评价数'].notnull()]
print(raw1.shape)

The output is shown below; after removing the rows with missing values, the data set has 4 fewer rows.
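Besides boolean filtering, pandas also offers dropna() and fillna() for the same job. A sketch on toy data (the values below are invented; only the column names imitate the article's data):

import pandas as pd

# Toy data with one missing review count.
df = pd.DataFrame({"店名": ["A", "B", "C"], "评价数": [120.0, None, 85.0]})

dropped = df.dropna(subset=["评价数"])  # remove rows missing 评价数
filled = df.fillna({"评价数": 0})       # or fill them with a default instead
print(dropped.shape, filled.shape)

Whether to drop or fill depends on the analysis: dropping loses the row entirely, while filling keeps it at the cost of injecting an artificial value.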

4. Data type conversion

What does data type conversion mean, and why is it needed? A simple example:
In the 人均 (per-capita spend) Series shown in the figure, some merchants have filled in irregular values: the column should be of float type, but some entries contain Chinese characters. This is where data type conversion comes in; without it, later processing of this column would run into serious problems.

import pandas as pd
raw = pd.read_excel("shops_nm.xlsx")
print(len(raw))
# Method 1: indexing + for loop + if condition
filter_word = ["人均:", "人均", "大概", "左右", "差不多"]
for i in range(len(raw)):
    value = raw.loc[i, '人均']
    if isinstance(value, (float, int)):
        continue
    for j in filter_word:
        if j in value:
            raw.loc[i, '人均'] = raw.loc[i, '人均'].replace(j, '')
print(raw.head()['人均'])

The result is shown in the figure:
The stray characters are cleaned out, but this approach is not recommended. A better way to achieve the same result is to wrap the for loop and if condition in a function and pass it to apply().

# Method 2: apply() + for + if
def clean_price(x):
    filter_word = ["人均:", "人均", "大概", "左右", "差不多"]
    if isinstance(x, (float, int)):
        return x
    for j in filter_word:
        if j in x:
            x = x.replace(j, '')  # replace() returns a new string, so reassign it
    return x

raw['人均'] = raw['人均'].apply(clean_price)
print(raw.head()['人均'])

The output looks the same, but apply() is typically much faster than an explicit row-by-row for loop; with a large data set the gap becomes obvious.
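Faster still is to skip apply() altogether and use pandas' vectorized string methods, which operate on the whole column at once. A sketch on toy data (these values imitate the messy 人均 field but are invented for the example):

import pandas as pd

# Toy column imitating the irregular 人均 entries.
s = pd.Series(["人均:35", "大概40左右", 50, "差不多60"])

# Same filter words as above, joined into one regex alternation;
# "人均:" is listed before "人均" so the longer match wins.
pattern = "人均:|人均|大概|左右|差不多"
cleaned = pd.to_numeric(s.astype(str).str.replace(pattern, "", regex=True))
print(cleaned)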

Origin blog.csdn.net/CSDN_Yangk/article/details/130194225