Today I'd like to share a Python example of data cleaning (handling missing values and outliers). It should be a useful reference, and I hope it helps. Let's walk through it together.
1. Import the SQL file into a local MySQL database
This article loads the data into the taob table of the python database, using the MySQL source command:
source [local file]
The table has 9,616 rows in total and four columns: title, link, price, and comment.
2. Connect to the database and read the data with Python
View summary statistics of the data:
# -*- coding: utf-8 -*-
# author: M10
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import mysql.connector

# Connect to the local database
conn = mysql.connector.connect(host='localhost',
                               user='root',
                               passwd='123456',
                               db='python')
sql = 'select * from taob'  # SQL statement
data = pd.read_sql(sql, conn)  # fetch the data
print(data.describe())
The summary confirms that the data was imported correctly, but a quick look shows that things are not so simple: the mean comment count of 562 is far above the median of 58, a maximum of 454,037 comments is almost certainly an error, and a price of 0 is not plausible.
            price        comment
count  9616.00000    9616.000000
mean     64.49324     562.239601
std     176.10901    6078.909643
min       0.00000       0.000000
25%      20.00000      16.000000
50%      36.00000      58.000000
75%      66.00000     205.000000
max    7940.00000  454037.000000
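The checks described above can also be done in code rather than by eye. Here is a minimal sketch; the `sample` DataFrame and the `summarize_suspects` helper are made up for illustration and are not part of the original article, where the real data comes from the taob table.

```python
import pandas as pd

# Hypothetical stand-in for the taob table; the values are for illustration only.
sample = pd.DataFrame({
    "price": [0, 20, 36, 66, 7940],
    "comment": [16, 58, 205, 0, 454037],
})

def summarize_suspects(df):
    """Return simple indicators of suspicious values in price and comment."""
    return {
        "zero_prices": int((df["price"] == 0).sum()),        # prices that cannot be real
        "comment_mean": float(df["comment"].mean()),         # pulled up by extreme values
        "comment_median": float(df["comment"].median()),     # robust to extreme values
        "comment_max": int(df["comment"].max()),
    }

report = summarize_suspects(sample)
print(report)
```

A mean far above the median, as in this toy data, is a hint that a few extreme values dominate the column.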
3. Missing values
Treat a price of 0 as missing and replace it with the median, 36.
# -*- coding: utf-8 -*-
# author: M10
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import mysql.connector

# Connect to the local database
conn = mysql.connector.connect(host='localhost',
                               user='root',
                               passwd='123456',
                               db='python')
sql = 'select * from taob'  # SQL statement
data = pd.read_sql(sql, conn)  # fetch the data

# Mark a price of 0 as missing (.loc avoids chained assignment)
data.loc[data['price'] == 0, 'price'] = np.nan
x = 0  # number of cells that were filled
for i in data.columns:
    nulls = data[i].isnull()
    for j in range(len(data)):
        if nulls[j]:
            data.loc[j, i] = 36  # fill with the median price, as a number
            x += 1
print(x)
# 44
The output shows that 44 values were modified.
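For larger tables, the same idea can be written without explicit loops. The sketch below is one possible variant, not the article's method: the `fill_zero_prices` helper and the `sample` data are hypothetical, and it computes the median from the non-zero prices instead of hard-coding 36, which may give a slightly different value on the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; in the article the data comes from the taob table.
sample = pd.DataFrame({"price": [0.0, 20.0, 36.0, 0.0, 66.0]})

def fill_zero_prices(df, column="price"):
    """Treat zeros in `column` as missing and fill them with the median of the other values."""
    out = df.copy()                                  # leave the input untouched
    out[column] = out[column].astype(float).replace(0, np.nan)
    n_missing = int(out[column].isna().sum())        # how many cells will be filled
    median = out[column].median()                    # NaN values are skipped by default
    out[column] = out[column].fillna(median)
    return out, n_missing

cleaned, n_filled = fill_zero_prices(sample)
print(n_filled)
print(cleaned["price"].tolist())
```

Working on a copy keeps the original frame available for comparison after cleaning.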
4. Outlier handling
# -*- coding: utf-8 -*-
# author: M10
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import mysql.connector

# Connect to the local database
conn = mysql.connector.connect(host='localhost',
                               user='root',
                               passwd='123456',
                               db='python')
sql = 'select * from taob'  # SQL statement
data = pd.read_sql(sql, conn)  # fetch the data

# Missing-value handling
data.loc[data['price'] == 0, 'price'] = np.nan
x = 0
for i in data.columns:
    nulls = data[i].isnull()
    for j in range(len(data)):
        if nulls[j]:
            data.loc[j, i] = 36
            x += 1
print(x)

# Outlier handling
# Draw a scatter plot with price on the x-axis
price = data['price']
comment = data['comment']
plt.plot(price, comment, 'o')
plt.show()
The resulting scatter plot suggests that points with very large comment counts at prices near 0 are very likely outliers, and that points with 0 comments at the highest prices may be outliers as well.
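Before settling on a cutoff, it can help to count how many rows each candidate threshold would affect. This is only a rough sketch: the `count_above` helper, the candidate thresholds, and the `sample` data are assumptions for illustration and do not come from the original article.

```python
import pandas as pd

# Hypothetical sample shaped like the price/comment columns of the taob table.
sample = pd.DataFrame({
    "price": [5, 20, 36, 66, 7940],
    "comment": [454037, 58, 205, 16, 0],
})

def count_above(df, column, thresholds):
    """Map each candidate threshold to the number of rows whose value exceeds it."""
    return {t: int((df[column] > t).sum()) for t in thresholds}

counts = count_above(sample, "comment", [100, 1000, 200000])
print(counts)
```

If a threshold flags only a handful of rows, replacing them changes little of the data while removing the most extreme points.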
Next, handle the outliers in the comment counts, assuming an outlier threshold of 200,000 (20w). Values above it are replaced with the median, 58:
# -*- coding: utf-8 -*-
# author: M10
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import mysql.connector

# Connect to the local database
conn = mysql.connector.connect(host='localhost',
                               user='root',
                               passwd='123456',
                               db='python')
sql = 'select * from taob'  # SQL statement
data = pd.read_sql(sql, conn)  # fetch the data

# Missing-value handling
data.loc[data['price'] == 0, 'price'] = np.nan
x = 0
for i in data.columns:
    nulls = data[i].isnull()
    for j in range(len(data)):
        if nulls[j]:
            data.loc[j, i] = 36
            x += 1
print(x)

# Outlier handling: set comment counts above 200000 to the median, 58
# (as a number, not the string '58', so the column stays numeric)
data.loc[data['comment'] > 200000, 'comment'] = 58

# Draw a scatter plot with price on the x-axis
plt.plot(data['price'], data['comment'], 'o')
plt.xlabel('price')
plt.ylabel('comments')
plt.show()
After this treatment, the scatter plot no longer contains comment counts above 200,000.
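The threshold replacement can be wrapped in a small function so the result is checked in code as well as in the plot. This is a minimal sketch under assumptions: `replace_above` and the `sample` data are hypothetical names introduced here, and the threshold 200000 and replacement 58 are the values used in the article.

```python
import pandas as pd

# Hypothetical sample of comment counts; the real values come from the taob table.
sample = pd.DataFrame({"comment": [16, 58, 205, 454037, 0]})

def replace_above(df, column, threshold, replacement):
    """Replace values in `column` that exceed `threshold`; return the new frame and the count."""
    out = df.copy()                          # keep the original for comparison
    mask = out[column] > threshold           # rows treated as outliers
    out.loc[mask, column] = replacement
    return out, int(mask.sum())

cleaned, n_replaced = replace_above(sample, "comment", 200000, 58)
print(n_replaced)
print(cleaned["comment"].max())
```

Returning the number of replaced rows makes it easy to notice when a threshold is too aggressive.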