Data scientists reportedly spend about 80% of their time on data cleaning tasks.
Data cleansing rules can be summarized in four key points: completeness, comprehensiveness, legitimacy, and uniqueness.
- Completeness: does any record contain null values, and is every statistical field present?
- Comprehensiveness: look at all the values observed for a column. In an Excel sheet, for example, select a column and inspect its mean, maximum, and minimum, then use common sense to judge whether the column has problems with its data definition, unit of measure, or the values themselves.
- Legitimacy: are the types, contents, and ranges of the data valid? Examples of violations: non-ASCII characters in the data, an unknown gender, an age over 150 years.
- Uniqueness: are there duplicate records? Because data is usually aggregated from different sources, duplication is common. Both rows and columns must be unique: a person should not be recorded multiple times, and a person's weight should not appear repeatedly as multiple column indicators.
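The four principles above can each be checked with one line of pandas. A minimal sketch on a hypothetical toy table (the column names and values are made up for illustration):

```python
import pandas as pd

# Hypothetical data with one issue per principle: a null value,
# an out-of-range age, and a fully duplicated row.
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Bob', 'Carol'],
    'age': [34, 200, 200, None],
    'weight': ['62kgs', '180lbs', '180lbs', '55kgs'],
})

# Completeness: count null values per column
print(df.isnull().sum())
# Comprehensiveness: eyeball the mean, min, and max of a numeric column
print(df['age'].describe())
# Legitimacy: flag ages outside a plausible range
print(df[(df['age'] < 0) | (df['age'] > 150)])
# Uniqueness: count fully duplicated rows
print(df.duplicated().sum())
```

Each check only surfaces the suspicious rows; the sections below show how to actually fix them.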
1 Completeness
Question 1: missing values
Some age and weight values are missing because they were never collected. There are typically three ways to handle this:
- Delete: drop the records that have missing values
- Mean: fill with the mean of the current column
- Frequency: fill with the most frequent value in the current column
To fill the missing values in df['Age'] with the mean age:
df['Age'].fillna(df['Age'].mean(), inplace=True)
To fill with the most frequent value instead, first obtain the most frequent age age_maxf via value_counts, then fill the missing Age fields with it:
age_maxf = train_features['Age'].value_counts().index[0]
train_features['Age'].fillna(age_maxf, inplace=True)
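The first option in the list, deleting the incomplete records, is just as short. A sketch on a hypothetical two-column frame standing in for the df used in the text:

```python
import pandas as pd

# Hypothetical data: one Age and one Weight value are missing
df = pd.DataFrame({'Age': [22.0, None, 35.0],
                   'Weight': [60.0, 70.0, None]})

# Drop only the records whose Age is missing
cleaned = df.dropna(subset=['Age'])
print(cleaned)
```

The `subset` argument limits the check to the named columns, so the row with a missing Weight but a valid Age survives.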
Question 2: blank rows
After reading the data, call dropna() to remove the rows that are entirely empty:
# Drop rows where every value is missing
df.dropna(how='all', inplace=True)
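The `how='all'` argument matters: it drops only fully blank rows, while the default `how='any'` would also delete partially filled records. A small sketch with made-up values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.0, 2.0],
                   [np.nan, np.nan],   # fully blank row
                   [3.0, np.nan]],     # partially filled row
                  columns=['a', 'b'])

# how='all' removes only the fully blank row ...
all_blank_removed = df.dropna(how='all')
# ... while how='any' (the default) also removes the partial row
any_blank_removed = df.dropna(how='any')
print(len(all_blank_removed), len(any_blank_removed))
```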
2 Comprehensiveness
Problem: the units in a column are not uniform
Using kg as the unit of measure, convert the values recorded in lbs to kgs:
# Select the rows in the weight column whose unit is lbs
rows_with_lbs = df['weight'].str.contains('lbs').fillna(False)
print(df[rows_with_lbs])
# Convert lbs to kgs at 2.2 lbs = 1 kg
for i, lbs_row in df[rows_with_lbs].iterrows():
    # Strip the trailing 'lbs' (the last three characters) to get the number
    weight = int(float(lbs_row['weight'][:-3]) / 2.2)
    df.at[i, 'weight'] = '{}kgs'.format(weight)
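The row-by-row loop above can also be written vectorized with `str.extract`. A sketch assuming the same "NNNlbs"/"NNNkgs" string format, on a hypothetical weight column:

```python
import pandas as pd

# Hypothetical weight column mixing lbs and kgs strings
df = pd.DataFrame({'weight': ['150lbs', '70kgs', '200lbs']})

is_lbs = df['weight'].str.contains('lbs').fillna(False)
# Pull out the numeric part, convert at 2.2 lbs per kg, re-attach the unit
pounds = df.loc[is_lbs, 'weight'].str.extract(r'(\d+\.?\d*)')[0].astype(float)
df.loc[is_lbs, 'weight'] = (pounds / 2.2).astype(int).astype(str) + 'kgs'
print(df)
```

The `astype(int)` truncation matches the `int(float(...) / 2.2)` in the loop version, so both produce identical strings.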
3 Legitimacy
Problem: non-ASCII characters
Solve the problem by deleting or replacing the non-ASCII characters:
# Remove non-ASCII characters
df['first_name'].replace({r'[^\x00-\x7F]+':''}, regex=True, inplace=True)
df['last_name'].replace({r'[^\x00-\x7F]+':''}, regex=True, inplace=True)
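The regex `[^\x00-\x7F]+` matches any run of characters outside the ASCII range. A quick demonstration on hypothetical names containing accented characters:

```python
import pandas as pd

# Hypothetical names with accented (non-ASCII) characters
df = pd.DataFrame({'first_name': ['Zoë', 'José', 'Anna']})

# Replace any run of non-ASCII characters with an empty string
df['first_name'] = df['first_name'].replace({r'[^\x00-\x7F]+': ''}, regex=True)
print(df['first_name'].tolist())
```

Note that deletion is lossy ('Zoë' becomes 'Zo'); if the accented characters are legitimate, transliteration may be the better "replace" strategy.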
4 Uniqueness
Question 1: one column holds multiple values
The Name column contains both a first name and a last name. Split it into two fields with str.split(expand=True), which expands the split list into new columns, then delete the original Name column:
# Split the name and drop the source column
df[['first_name','last_name']] = df['name'].str.split(expand=True)
df.drop('name', axis=1, inplace=True)
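With no arguments, `str.split(expand=True)` splits on whitespace and spreads the pieces across new columns. A self-contained run on a hypothetical name column:

```python
import pandas as pd

# Hypothetical Name column holding "first last" in a single field
df = pd.DataFrame({'name': ['Mike Swift', 'Jed Mosely']})

df[['first_name', 'last_name']] = df['name'].str.split(expand=True)
df.drop('name', axis=1, inplace=True)
print(df)
```

If some names contain more than one space (e.g. middle names), pass `n=1` to keep everything after the first space in the second column.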
Question 2: duplicate records
When duplicate records exist, use drop_duplicates() to remove them:
# Drop duplicate rows
df.drop_duplicates(['first_name', 'last_name'], inplace=True)
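Passing a column subset treats two rows as duplicates when only those columns match, even if other columns differ; by default the first occurrence is kept. A sketch on hypothetical records where the same person appears twice with different weights:

```python
import pandas as pd

# Hypothetical records: Mike Swift appears twice with different weights
df = pd.DataFrame({
    'first_name': ['Mike', 'Mike', 'Jed'],
    'last_name': ['Swift', 'Swift', 'Mosely'],
    'weight': ['68kgs', '70kgs', '81kgs'],
})

# keep='first' (the default) retains the first occurrence per key
deduped = df.drop_duplicates(['first_name', 'last_name'])
print(deduped)
```

When the conflicting values matter (as in the food exercise below), average or otherwise reconcile them before dropping the duplicates.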
Exercise:
Clean the following food data:
food | ounces | animal |
---|---|---|
bacon | 4.0 | pig |
pulled pork | 3.0 | pig |
bacon | NaN | pig |
Pastrami | 6.0 | cow |
corned beef | 7.5 | cow |
Bacon | 8.0 | pig |
pastrami | -3.0 | cow |
honey ham | 5.0 | pig |
nova lox | 6.0 | salmon |
import pandas as pd
"""Clean the food data with Pandas"""
# Read the csv file
df = pd.read_csv("c.csv")
df['food'] = df['food'].str.lower()  # normalize names to lowercase
df.dropna(inplace=True)  # drop records with missing values
df['ounces'] = df['ounces'].abs()  # negative values are invalid; take the absolute value
# Find the records with duplicated food values and average ounces per group
d_rows = df[df['food'].duplicated(keep=False)]
g_items = d_rows.groupby('food', as_index=False)['ounces'].mean()
print(g_items)
# Write the per-food averages back into df
for i, row in g_items.iterrows():
    df.loc[df.food == row.food, 'ounces'] = row.ounces
df.drop_duplicates(inplace=True)  # drop duplicate records
df.index = range(len(df))  # reset the index
print(df)
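The find-duplicates/average/write-back loop can be collapsed into a single `groupby(...).transform('mean')`, which replaces every ounces value with its food-group mean in one step. A sketch with the exercise table inlined instead of read from CSV:

```python
import pandas as pd

# Same food table as the exercise, inlined for a self-contained run
df = pd.DataFrame({
    'food': ['bacon', 'pulled pork', 'bacon', 'Pastrami', 'corned beef',
             'Bacon', 'pastrami', 'honey ham', 'nova lox'],
    'ounces': [4.0, 3.0, None, 6.0, 7.5, 8.0, -3.0, 5.0, 6.0],
    'animal': ['pig', 'pig', 'pig', 'cow', 'cow', 'pig', 'cow', 'pig',
               'salmon'],
})

df['food'] = df['food'].str.lower()
df = df.dropna()
df['ounces'] = df['ounces'].abs()
# Replace each ounces value by the mean over its food group in one step
df['ounces'] = df.groupby('food')['ounces'].transform('mean')
df = df.drop_duplicates().reset_index(drop=True)
print(df)
```

Unlike the loop version, `transform` touches non-duplicated rows too, but averaging a single value leaves it unchanged, so the result is the same.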
Alternatively, in a Jupyter notebook with Python 3:
import pandas as pd
df = pd.read_csv(r"D://Data_for_sci/food.csv")
df.index
df
# Normalize food names to lowercase
df['food'] = df['food'].str.lower()
df
# Delete rows with NaN
df = df.dropna()
df.index = range(len(df))  # reset index
df
# Set the first bacon row to the bacon mean, then delete the second bacon row
df.loc[0, 'ounces'] = df[df['food'] == 'bacon']['ounces'].mean()
df.drop(df.index[4], inplace=True)
df.index = range(len(df))  # reset index
df
# Set the first pastrami row to the pastrami mean, then delete the second pastrami row
df.loc[2, 'ounces'] = df[df['food'] == 'pastrami']['ounces'].mean()
df.drop(df.index[4], inplace=True)
df.index = range(len(df))  # reset index
df