Data cleaning in Python (handling outliers and missing values)

This post shares a Python example of data cleaning (handling missing values and outliers) that may serve as a useful reference. Let's walk through it together.
1. Import the SQL file into a local MySQL database

The SQL file is loaded into the taob table of a local database named python:
source [local file]
The table has 9,616 rows and four columns: title, link, price, and comment.

2. Connect to the database and read the data with Python

First, view summary statistics for the data.

# -*- coding: utf-8 -*-
# author: M10
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import mysql.connector

# Connect to the local database
conn = mysql.connector.connect(host='localhost',
            user='root',
            passwd='123456',
            db='python')
sql = 'select * from taob'  # SQL query
data = pd.read_sql(sql, conn)  # load the query result into a DataFrame
print(data.describe())

The summary shows that the data was imported correctly. A quick look also reveals some problems: the mean comment count of 562 is far above the median of 58, the maximum comment count of 454,037 is probably an error, and a price of 0 is not plausible.

            price        comment
count  9616.00000    9616.000000
mean     64.49324     562.239601
std     176.10901    6078.909643
min       0.00000       0.000000
25%      20.00000      16.000000
50%      36.00000      58.000000
75%      66.00000     205.000000
max    7940.00000  454037.000000
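
The checks described above can be sketched on a small stand-in DataFrame. The toy values below are hypothetical (they are not rows from the taob table); the point is only to show how a large gap between mean and median, and a count of zero prices, can be read off programmatically.

```python
import pandas as pd

# Hypothetical toy data standing in for the taob table
toy = pd.DataFrame({
    "price": [0, 20, 36, 66, 7940],
    "comment": [0, 16, 58, 205, 454037],
})

summary = toy.describe()
# A mean far above the median (the "50%" row) hints at a few extreme values
mean_to_median = summary.loc["mean"] / summary.loc["50%"]
# Number of rows with an implausible price of 0
zero_prices = int((toy["price"] == 0).sum())

print(mean_to_median)
print(zero_prices)
```

A ratio close to 1 suggests a roughly symmetric column; a ratio in the tens or hundreds is a reason to look at the largest values more closely.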

3. Handling missing values

Treat a price of 0 as a missing value and replace it with the median price, 36.

# -*- coding: utf-8 -*-
# author: M10
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import mysql.connector

# Connect to the local database
conn = mysql.connector.connect(host='localhost',
            user='root',
            passwd='123456',
            db='python')
sql = 'select * from taob'  # SQL query
data = pd.read_sql(sql, conn)  # load the query result into a DataFrame

# Treat a price of 0 as missing
data['price'] = data['price'].mask(data['price'] == 0)
# Count the missing values in every column, then fill them with 36, the median price
x = int(data.isnull().sum().sum())
data = data.fillna(36)
print(x)
# 44

The output shows that 44 values were modified.
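
As a variation on the step above, the median can be computed from the data instead of being typed in by hand. This is a minimal sketch on hypothetical toy prices, not the article's table. Note that the median here is taken after the zeros are marked as missing, so on real data it can differ slightly from the 36 used above, which came from a summary that still included the zeros.

```python
import numpy as np
import pandas as pd

# Hypothetical toy prices; 0 stands for "price not recorded"
toy = pd.DataFrame({"price": [0.0, 20.0, 36.0, 66.0, 0.0, 50.0]})

# Mark zero prices as missing
toy["price"] = toy["price"].replace(0, np.nan)
n_missing = int(toy["price"].isna().sum())

# Series.median() skips NaN by default
median_price = toy["price"].median()
toy["price"] = toy["price"].fillna(median_price)

print(n_missing, median_price)
```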

4. Outlier handling

# -*- coding: utf-8 -*-
# author: M10
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import mysql.connector

# Connect to the local database
conn = mysql.connector.connect(host='localhost',
            user='root',
            passwd='123456',
            db='python')
sql = 'select * from taob'  # SQL query
data = pd.read_sql(sql, conn)  # load the query result into a DataFrame

# Handle missing values: treat a price of 0 as missing, then fill with the median, 36
data['price'] = data['price'].mask(data['price'] == 0)
x = int(data.isnull().sum().sum())
data = data.fillna(36)
print(x)

# Inspect outliers: scatter plot with price on the horizontal axis
price = data['price']
comment = data['comment']
plt.plot(price, comment, 'o')
plt.show()

The result is shown in the figure below. Very high comment counts at prices near 0 are likely outliers, and so are the highest prices where the comment count is 0.
[Figure: scatter plot of comment count against price]
Next, handle the outliers in the comment count, taking 200,000 (20w) as the threshold and setting larger values to the median comment count, 58.

# -*- coding: utf-8 -*-
# author: M10
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import mysql.connector

# Connect to the local database
conn = mysql.connector.connect(host='localhost',
            user='root',
            passwd='123456',
            db='python')
sql = 'select * from taob'  # SQL query
data = pd.read_sql(sql, conn)  # load the query result into a DataFrame

# Handle missing values: treat a price of 0 as missing, then fill with the median, 36
data['price'] = data['price'].mask(data['price'] == 0)
x = int(data.isnull().sum().sum())
data = data.fillna(36)
print(x)

# Handle outliers: set comment counts above 200000 to the median, 58
data.loc[data['comment'] > 200000, 'comment'] = 58

# Scatter plot with price on the horizontal axis
plt.plot(data['price'], data['comment'], 'o')
plt.xlabel('price')
plt.ylabel('comments')
plt.show()

The result after processing:
[Figure: scatter plot of comment count against price after outlier handling]
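
The 200,000 threshold above was chosen by eye. One common alternative, which is not the method used in this article, is to derive an upper fence from the interquartile range (Tukey's rule). The sketch below uses hypothetical toy values, and the helper name iqr_upper_fence is made up for illustration.

```python
import pandas as pd


def iqr_upper_fence(values: pd.Series, k: float = 1.5) -> float:
    """Return Q3 + k * IQR (hypothetical helper, Tukey's upper fence)."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    return float(q3 + k * (q3 - q1))


# Hypothetical comment counts with one extreme value
comments = pd.Series([0, 16, 58, 205, 120, 454037])

fence = iqr_upper_fence(comments)
median = comments.median()
# Keep values at or below the fence, replace the rest with the median
cleaned = comments.where(comments <= fence, median)

print(fence)
print(cleaned.tolist())
```

Applied to the summary in section 2 (25% = 16, 75% = 205), this rule would give a fence of 488.5, which is far stricter than 200,000 and would flag many more rows, so the choice of rule depends on how aggressive the cleaning should be.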

Origin blog.csdn.net/haoxun11/article/details/105082087