Do data scientists really spend 80% of their time on data cleaning?

Data cleaning rules can be summarized in four key points: completeness, comprehensiveness, legitimacy, and uniqueness.

  • Completeness: does a field contain null values, and are the statistics for that field complete?
  • Comprehensiveness: look at all the values observed for a column. In Excel, for instance, selecting a column shows its average, maximum, and minimum; common sense then tells you whether the column is problematic in its data definition, its unit labels, or the values themselves.
  • Legitimacy: are the type, content, and range of the data legitimate? Examples of violations: non-ASCII characters in name data, an unknown gender, an age over 150 years.
  • Uniqueness: are records duplicated? Because data is usually aggregated from different sources, repetition is common. Both rows and columns need to be unique: one person should not appear in multiple records, and one person's weight should not be recorded in multiple columns.
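
As a rough illustration of the four checks (my own sketch, not from the original post; the column name Age and the 150-year threshold are assumptions), three of them can be bundled into a small audit helper, while comprehensiveness still needs a human eye:

```python
import pandas as pd

def audit(df, max_age=150):
    """Report completeness, legitimacy, and uniqueness issues in df.
    (Comprehensiveness -- sane units and definitions -- needs manual review.)"""
    report = {
        'null_counts': df.isnull().sum().to_dict(),   # completeness: nulls per column
        'duplicate_rows': int(df.duplicated().sum()), # uniqueness: fully repeated records
    }
    if 'Age' in df.columns:                           # legitimacy: impossible ages
        report['impossible_ages'] = int((df['Age'] > max_age).sum())
    return report

df = pd.DataFrame({'Age': [30, None, 200], 'Name': ['a', 'a', 'b']})
print(audit(df))
```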

1 Completeness

Problem 1: missing values

Some Age and Weight values are missing because they were never collected. Missing values are typically handled in one of three ways:

  • Delete: drop the records that contain missing data
  • Mean: fill with the mean of the current column
  • Frequency: fill with the most frequent value in the current column

To fill the missing values in df['Age'] with the column's mean age:

df['Age'].fillna(df['Age'].mean(), inplace=True)

To fill with the most frequent value instead, take the top entry of value_counts() on the Age field as age_maxf, then fill the missing Age values with it:

age_maxf = train_features['Age'].value_counts().index[0]
train_features['Age'].fillna(age_maxf, inplace=True)

Problem 2: blank rows

After reading in the data, call dropna() to remove completely blank rows:

# Drop rows where every value is null
df.dropna(how='all', inplace=True)
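
A tiny sketch (toy data of my own) of the difference between the dropna() default and how='all':

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, np.nan, np.nan], 'b': [2, 5, np.nan]})
# Default how='any' keeps only fully filled rows
print(len(df.dropna()))           # -> 1
# how='all' drops only the row where every value is missing
print(len(df.dropna(how='all')))  # -> 2
```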

2 Comprehensiveness

Problem: units in a column are not uniform

Using kgs as the standard unit of measure, convert the weights recorded in lbs to kgs:

# Select the rows in the weight column whose unit is lbs
rows_with_lbs = df['weight'].str.contains('lbs').fillna(False)
print(df[rows_with_lbs])
# Convert lbs to kgs (2.2 lbs = 1 kg)
for i, lbs_row in df[rows_with_lbs].iterrows():
  # Slice off the last three characters, i.e. strip the 'lbs' suffix
  weight = int(float(lbs_row['weight'][:-3]) / 2.2)
  df.at[i, 'weight'] = '{}kgs'.format(weight)
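
The same conversion can be done without an explicit loop. This vectorized variant is my own sketch, assuming weight strings look like '140lbs' or '60kgs':

```python
import pandas as pd

df = pd.DataFrame({'weight': ['140lbs', '60kgs', None]})
# Mark the rows measured in lbs (missing values count as not-lbs)
is_lbs = df['weight'].str.contains('lbs').fillna(False)
# Strip the 'lbs' suffix, convert the number, and write back with a 'kgs' suffix
pounds = df.loc[is_lbs, 'weight'].str[:-3].astype(float)
df.loc[is_lbs, 'weight'] = (pounds / 2.2).astype(int).astype(str) + 'kgs'
print(df['weight'].tolist())  # -> ['63kgs', '60kgs', None]
```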

3 Legitimacy

Problem: non-ASCII characters

Non-ASCII characters can be handled by deleting or replacing them:

# Remove non-ASCII characters
df['first_name'] = df['first_name'].replace({r'[^\x00-\x7F]+': ''}, regex=True)
df['last_name'] = df['last_name'].replace({r'[^\x00-\x7F]+': ''}, regex=True)
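
A quick check of what the regex does (the example names are my own):

```python
import pandas as pd

df = pd.DataFrame({'first_name': ['José', 'Anna£']})
# The character class [^\x00-\x7F] matches any non-ASCII character
df['first_name'] = df['first_name'].replace({r'[^\x00-\x7F]+': ''}, regex=True)
print(df['first_name'].tolist())  # -> ['Jos', 'Anna']
```

Note that deletion silently mangles legitimate accented names ('José' becomes 'Jos'); replacement with a transliteration may be preferable depending on the data.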

4 Uniqueness

Problem 1: one column contains multiple values

The Name column contains two values, first name and last name. Split it into two fields with str.split(expand=True), which expands the split result into new columns, then delete the original Name column:

# Split the name, then drop the source column
df[['first_name','last_name']] = df['name'].str.split(expand=True)
df.drop('name', axis=1, inplace=True)
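
A minimal demonstration on sample names of my own:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Ada Lovelace', 'Alan Turing']})
# expand=True turns the split lists into separate DataFrame columns
df[['first_name', 'last_name']] = df['name'].str.split(expand=True)
df.drop('name', axis=1, inplace=True)
print(df.columns.tolist())  # -> ['first_name', 'last_name']
```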

Problem 2: duplicate records

When duplicate records exist, use drop_duplicates() to remove them:

# Drop duplicate rows
df.drop_duplicates(['first_name', 'last_name'], inplace=True)
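
Passing a column subset means rows that agree on those columns count as duplicates even if they differ elsewhere; only the first occurrence is kept. A toy example (data of my own):

```python
import pandas as pd

df = pd.DataFrame({'first_name': ['Ada', 'Ada'],
                   'last_name': ['Lovelace', 'Lovelace'],
                   'city': ['London', 'Leeds']})
# Duplicates are judged on the name columns only; the first row wins
df.drop_duplicates(['first_name', 'last_name'], inplace=True)
print(df['city'].tolist())  # -> ['London']
```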

Exercise:

Clean the following food data:

food         ounces  animal
bacon        4.0     pig
pulled pork  3.0     pig
bacon        NaN     pig
Pastrami     6.0     cow
corned beef  7.5     cow
Bacon        8.0     pig
pastrami     -3.0    cow
honey ham    5.0     pig
nova lox     6.0     salmon

import pandas as pd
"""Clean the food data with Pandas"""

# Read the csv file
df = pd.read_csv("c.csv")

df['food'] = df['food'].str.lower()  # normalize names to lowercase
df.dropna(inplace=True)  # drop records with missing values
df['ounces'] = df['ounces'].apply(lambda a: abs(a))  # negative values are invalid, take the absolute value

# Find records with a duplicated food name and average their ounces per group
d_rows = df[df['food'].duplicated(keep=False)]
g_items = d_rows.groupby('food', as_index=False)['ounces'].mean()
print(g_items)

# Write each group's mean back into df, then drop the duplicates
for i, row in g_items.iterrows():
    df.loc[df.food == row.food, 'ounces'] = row.ounces
df.drop_duplicates(inplace=True)  # drop duplicate records

df.index = range(len(df))  # reset the index
print(df)

Or, step by step in a Jupyter notebook (Python 3):

import pandas as pd
df = pd.read_csv(r"D://Data_for_sci/food.csv")
df.index

df

# Normalize food names to lowercase
df['food'] = df['food'].str.lower()
df

# Drop NaN records
df = df.dropna()
df.index = range(len(df))  # reset index
df

# Set bacon's ounces to the group mean, then delete the duplicate row
df.loc[0, 'ounces'] = df.loc[df['food'] == 'bacon', 'ounces'].mean()
df.drop(df.index[4], inplace=True)
df.index = range(len(df))  # reset index
df

# Set pastrami's ounces to the group mean, then delete the duplicate row
df.loc[2, 'ounces'] = df.loc[df['food'] == 'pastrami', 'ounces'].mean()
df.drop(df.index[4], inplace=True)
df.index = range(len(df))  # reset index
df

Origin blog.csdn.net/ywangjiyl/article/details/104770358