This continues from the previous article. Data that has just been acquired usually cannot be analyzed directly, because it contains a lot of invalid junk data; it has to be processed first. Data processing mainly consists of data cleaning, data extraction, data exchange and data calculation.
Data cleaning
Data cleaning is the most critical step in the data value chain. Even the best analysis will produce wrong results when it is fed garbage data, and those results can be seriously misleading.
Data cleaning means handling missing data and removing meaningless information: deleting irrelevant and duplicate data from the original data set, smoothing noisy data, filtering out data unrelated to the topic of the analysis, and so on.
Handling duplicate values
Proceed as follows:
1. Use the duplicated method of DataFrame, which returns a Boolean Series showing whether each row is a duplicate. Rows that are not duplicates show False; when rows repeat, the second and later occurrences show True.
2. Use the drop_duplicates method of DataFrame, which returns a DataFrame with the duplicate rows removed.
Format of duplicated:
duplicated(subset=None, keep='first')
The arguments in parentheses are optional. If they are omitted, all columns are compared by default.
subset: the column label, or sequence of column labels, used to identify duplicates. The default is all columns.
keep: 'first' leaves the first occurrence unmarked and marks the remaining identical rows as duplicates; 'last' leaves the last occurrence unmarked and marks the remaining identical rows as duplicates; False marks all identical rows as duplicates.
Format of drop_duplicates:
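The effect of keep and subset is easier to see on a tiny example. The following is a minimal sketch; the four-row DataFrame is made up for illustration and is not the author's data.

```python
import pandas as pd

# Illustrative data: rows 1 and 2 are identical, row 3 shares only the age.
df = pd.DataFrame({
    "age": [26, 85, 85, 85],
    "name": ["xiaoqiang1", "xiaoqiang2", "xiaoqiang2", "xiaoqiang3"],
})

# Default keep='first': the first occurrence is not marked.
first = df.duplicated()
print(first.tolist())       # [False, False, True, False]

# keep='last': the last occurrence is not marked.
last = df.duplicated(keep="last")
print(last.tolist())        # [False, True, False, False]

# keep=False: every row that has a twin is marked.
all_marked = df.duplicated(keep=False)
print(all_marked.tolist())  # [False, True, True, False]

# subset: compare only the 'age' column.
by_age = df.duplicated(subset=["age"])
print(by_age.tolist())      # [False, False, True, True]
```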
drop_duplicates()
To remove duplicates based on specific columns only, pass the column names inside the parentheses.
from pandas import DataFrame
from pandas import Series

# Build the sample data
df = DataFrame({'age': Series([26, 85, 85]), 'name': Series(['xiaoqiang1', 'xiaoqiang2', 'xiaoqiang2'])})
df
# Check whether each row is a duplicate
df.duplicated()
# Remove duplicate rows
df.drop_duplicates()
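As a rough sketch of passing column names to drop_duplicates, the example below uses a made-up four-row DataFrame (not the author's data). Note that drop_duplicates returns a new DataFrame and leaves the original unchanged.

```python
import pandas as pd

# Illustrative data: rows 1 and 2 are identical, row 3 shares only the age.
df = pd.DataFrame({
    "age": [26, 85, 85, 85],
    "name": ["xiaoqiang1", "xiaoqiang2", "xiaoqiang2", "xiaoqiang3"],
})

# Compare all columns: only the fully identical row 2 is dropped.
deduped = df.drop_duplicates()

# Compare only 'age': keeps the first row for each distinct age.
deduped_by_age = df.drop_duplicates(subset=["age"])

# Same comparison, but keep the last row for each age and renumber the index.
deduped_keep_last = (
    df.drop_duplicates(subset=["age"], keep="last").reset_index(drop=True)
)

print(deduped)
print(deduped_by_age)
print(deduped_keep_last)
```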
Handle missing values
Handling missing values generally involves two steps: identifying the missing data, and then treating it.
Identifying missing data
pandas uses the floating-point value NaN to represent missing data in both floating-point and non-floating-point arrays, and provides the isnull and notnull functions to check whether a value is missing.
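For readers without an Excel file at hand, here is a minimal in-memory sketch of isnull and notnull. The DataFrame and its column names are invented for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing value in each column.
scores = pd.DataFrame({
    "name": ["a", "b", None],
    "score": [90.0, np.nan, 75.0],
})

# True where a value is missing.
missing_mask = scores.isnull()

# The exact opposite: True where a value is present.
present_mask = scores.notnull()

# Count missing values per column.
missing_per_column = scores.isnull().sum()

# Select the rows that contain at least one missing value.
rows_with_missing = scores[scores.isnull().any(axis=1)]

print(missing_mask)
print(missing_per_column)
print(rows_with_missing)
```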
# Identifying missing data
from pandas import DataFrame
from pandas import read_excel

# Data with missing values (sheet_name replaces the older sheetname argument)
df = read_excel(r'D:\python_workspace\anaconda\rz.xlsx', sheet_name='Sheet2')
df
# Identify missing data: NaN shows True. The notnull function does the opposite
df.isnull()
rz.xlsx reads as follows
Treatment of missing data
Missing data can be handled by filling in values, by deleting the corresponding rows, or by leaving it untreated. The code below shows each approach, with comments explaining it.
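# Continuing from above: treat the missing data
# Drop the rows that contain empty values
newdf = df.dropna()
newdf
# Replace NaN with another value
newdf2 = df.fillna('--')
newdf2
# Replace NaN with the previous value (same as the older fillna(method='pad'))
newdf3 = df.ffill()
newdf3
# Replace NaN with the next value (same as the older fillna(method='bfill'))
newdf4 = df.bfill()
newdf4
# Pass a dictionary to fill different columns with different values
# ('数分' and '高代' are column names in rz.xlsx, roughly "mathematical analysis" and "advanced algebra")
newdf5 = df.fillna({'数分': 100, '高代': 99})
newdf5
# Replace NaN with the mean; the mean of each numeric column that has NaN is computed automatically
newdf6 = df.fillna(df.mean(numeric_only=True))
newdf6
# You can also use strip() to remove specified characters from both ends of the data; this is basic Python, so it is not demonstrated here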
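Since rz.xlsx is not available to the reader, the same treatments can be tried on a small in-memory table. This is only a sketch: the DataFrame, its column names (student, analysis, algebra) and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data: one NaN per score column, stray spaces in the names.
grades = pd.DataFrame({
    "student": [" s1 ", "s2 ", "s3", "s4"],
    "analysis": [80.0, np.nan, 90.0, 70.0],
    "algebra": [60.0, 70.0, np.nan, 80.0],
})

# Delete rows that contain any NaN (rows 1 and 2 are dropped).
dropped = grades.dropna()

# Fill NaN with the previous / next value in the same column.
forward = grades.ffill()
backward = grades.bfill()

# Fill each column with its own constant.
by_column = grades.fillna({"analysis": 100, "algebra": 99})

# Fill each numeric column with its own mean.
# The means here are analysis = 80.0 and algebra = 70.0.
by_mean = grades.fillna(grades.mean(numeric_only=True))

# Strip surrounding whitespace from a string column.
clean_names = grades["student"].str.strip()

print(dropped)
print(by_mean)
print(clean_names.tolist())
```

The right choice depends on the data: dropping rows is simplest but loses information, while filling with a mean or a neighboring value keeps the row at the cost of inventing a plausible value.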