Python-based big data analysis - data processing (hands-on code)

Continuing from the previous article: after data is acquired, it usually cannot be analyzed directly because it contains a lot of invalid or junk data, so it must be processed first. Data processing mainly covers data cleaning, data extraction, data exchange, and data calculation.

Data cleaning

Data cleaning is the most critical step in the data value chain. Even the best analysis will produce false results from garbage data, and can be seriously misleading.

Data cleaning means handling missing data and removing meaningless or irrelevant information, for example deleting duplicate data from the original data set, smoothing noisy data, and filtering out data unrelated to the analysis.

Handling duplicate values

Proceed as follows:

Method 1: use the DataFrame method duplicated, which returns a Boolean Series indicating whether each row is a duplicate. Rows that are not duplicates show False; duplicated rows show True from the second occurrence onward.

Method 2: use the DataFrame method drop_duplicates, which returns a DataFrame with the duplicate rows removed.

duplicated format:

duplicated(subset=None, keep='first')

The arguments in parentheses are optional; if none are given, all columns are used to judge duplicates.

subset: a column label or sequence of column labels used to identify duplicates; by default all columns are used.

keep: 'first' keeps the first occurrence and marks the remaining identical rows as duplicates; 'last' keeps the last occurrence and marks the rest as duplicates; False marks all identical rows as duplicates.
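
For illustration, a minimal sketch of how the keep argument changes which rows are marked (the sample data is the same as in the example further down):

from pandas import DataFrame

# two identical rows in positions 2 and 3
df = DataFrame({'age': [26, 85, 85], 'name': ['xiaoqiang1', 'xiaoqiang2', 'xiaoqiang2']})

# keep='first' (default): only the later occurrence is marked True
df.duplicated(keep='first')

# keep='last': only the earlier occurrence is marked True
df.duplicated(keep='last')

# keep=False: every row that has a duplicate is marked True
df.duplicated(keep=False)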

drop_duplicates format:

drop_duplicates()

To judge duplicates on specific columns only, pass the column name(s) inside the parentheses, as shown after the example below.

from pandas import DataFrame
from pandas import Series

# make sample data: the last two rows are identical
df = DataFrame({'age': Series([26, 85, 85]), 'name': Series(['xiaoqiang1', 'xiaoqiang2', 'xiaoqiang2'])})
df

# check which rows are duplicates
df.duplicated()

# remove duplicate rows
df.drop_duplicates()
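
As noted above, duplicates can also be judged on selected columns only. A minimal sketch using the same sample frame (the 'age' column is just the sample column from above):

# mark and remove duplicates judged on the 'age' column only
df.duplicated(subset='age')
df.drop_duplicates(subset='age', keep='first')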

Handling missing values

Handling missing values generally involves two steps: identifying the missing data and then processing it.

Identifying missing data

pandas uses the floating-point value NaN to represent missing data in both floating-point and non-floating-point arrays, and provides the isnull and notnull functions to detect missing values.


# identify missing data
from pandas import DataFrame
from pandas import read_excel

# data containing missing values
df = read_excel(r'D:\python_workspace\anaconda\rz.xlsx', sheetname='Sheet2')
df

# identify missing data: NaN shows as True. The notnull function does the opposite
df.isnull()



The contents of rz.xlsx are as follows (shown as a screenshot in the original post).
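
If the spreadsheet is not at hand, the same check can be reproduced on a small in-memory frame; a minimal sketch (the column names are borrowed from the fillna example further down and only stand in for the real file):

from pandas import DataFrame
from numpy import nan

# hypothetical stand-in for rz.xlsx, with one missing value per column
df2 = DataFrame({'数分': [66, nan, 80], '高代': [75, 90, nan]})

df2.isnull()    # True where a value is NaN
df2.notnull()   # the opposite of isnull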

Processing missing data

Missing data can be handled by filling in a value, deleting the corresponding rows, or leaving it untouched. The code below goes through each approach, with comments explaining each step.

# continuing from the code above, process the data
# drop the rows that contain null values
newdf=df.dropna()
newdf

# replace NaN with another value
newdf2=df.fillna('--')
newdf2

# replace NaN with the previous value in the column
newdf3=df.fillna(method='pad')
newdf3

# replace NaN with the next value in the column
newdf4=df.fillna(method='bfill')
newdf4

# pass in a dict to fill different columns with different values
newdf5=df.fillna({'数分':100,'高代':99})
newdf5

# replace NaN with the mean; the mean of each column containing NaN is computed automatically
newdf6=df.fillna(df.mean())
newdf6

# strip() can also be used to remove specified characters from both ends of the data; that is basic Python, so it is only sketched briefly below
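
Since strip() is only mentioned above, here is a minimal sketch (with hypothetical string data) of trimming characters from both ends of the values in a Series:

from pandas import Series

# hypothetical string data with stray spaces
s = Series(['  xiaoqiang1 ', ' xiaoqiang2  '])

# remove whitespace from both ends of each value
s.str.strip()

# remove the specified characters from both ends instead
s.str.strip('12 ')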



Origin: blog.51cto.com/xqtesting/2411252