Crawler data processing with pandas

pandas data processing

Use the duplicated() function to detect duplicate rows. It returns a Series of Boolean elements, one element per row; if a row is not the first occurrence of its values, the corresponding element is True.

- keep parameter: specifies which of the duplicated rows to keep ('first', 'last', or False)
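A minimal sketch of duplicated() and the keep parameter, using a small made-up DataFrame (the column names and values are just for illustration):

```python
import pandas as pd

# Small made-up DataFrame with one repeated row (for illustration only)
df = pd.DataFrame({'a': [1, 2, 2, 3], 'b': [4, 5, 5, 6]})

# duplicated() returns a Boolean Series, one element per row;
# a row that repeats an earlier row is marked True (keep='first' is the default)
mask = df.duplicated()
print(mask)                        # False, False, True, False

# keep='last' marks everything except the last occurrence;
# keep=False marks every member of a duplicated group
print(df.duplicated(keep='last'))
print(df.duplicated(keep=False))

# Use the mask as a filter to keep only the non-duplicated rows
print(df.loc[~mask])
```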


 Mapping

 

1) replace() function: replace elements

 

DataFrame replacement operations

  • Single-value replacement
    • Ordinary replacement: replace every element that matches: to_replace=15, value='e'
    • Replacement in a specified single column: to_replace={column label: value to replace}, value='new value'
  • Multi-value replacement
    • List replacement: to_replace=[], value=[]
    • Dictionary replacement (recommended): to_replace={old value: new value, old value: new value}
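A minimal sketch of the four replacement forms listed above, on a made-up DataFrame (column names and values are assumptions for illustration):

```python
import pandas as pd

# Made-up DataFrame (for illustration only)
df = pd.DataFrame({'A': [15, 20, 30], 'B': [15, 30, 40]})

# Ordinary replacement: every element equal to 15 becomes 'e'
df.replace(to_replace=15, value='e')

# Single-column replacement: replace 15 only in column 'B'
df.replace(to_replace={'B': 15}, value='e')

# List replacement: 20 -> 200 and 30 -> 300
df.replace(to_replace=[20, 30], value=[200, 300])

# Dictionary replacement (recommended): {old value: new value}
df.replace(to_replace={15: 'e', 40: 'f'})
```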

 

map() function: maps values to new ones; note that map is not a DataFrame method but a Series method

 
  • map() can map data to new values
  • map() can take a lambda expression
  • map() can take a method, including a custom method

    eg: map({to_replace: value})

  • Note: functions like sum and constructs like for loops cannot be used inside map()

Note: not every kind of function can be used as the argument of map(). Only a function that takes a single argument and returns a value can serve as the argument of map().
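A minimal sketch of the three ways to use Series.map described above; the Series contents and the tax-style lambda are made up for illustration:

```python
import pandas as pd

# Made-up Series (for illustration only)
names = pd.Series(['zhangsan', 'lisi', 'zhangsan'])
salary = pd.Series([10000, 20000, 2500])

# 1) Mapping with a dict, i.e. map({to_replace: value})
print(names.map({'zhangsan': 'Tom', 'lisi': 'Jerry'}))

# 2) Mapping with a lambda expression (one argument, one return value)
print(salary.map(lambda x: x - (x - 3000) * 0.5 if x > 3000 else x))

# 3) Mapping with a custom function (also one argument, one return value)
def after_tax(x):
    return x - (x - 3000) * 0.5 if x > 3000 else x

print(salary.map(after_tax))
```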



Aggregation operations on data and outlier detection/filtering

 

The df.std() function computes the standard deviation of each column of a DataFrame.
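A minimal sketch of using std() for outlier filtering; the data, the column name 'C', and the "2 × std" threshold are assumptions chosen for illustration:

```python
import numpy as np
import pandas as pd

# Made-up data: 1000 rows of random values (for illustration only)
df = pd.DataFrame(np.random.random(size=(1000, 3)), columns=['A', 'B', 'C'])

# Per-column standard deviations
print(df.std())

# Assumed rule: values in column 'C' greater than twice its std are outliers
twice_std = df['C'].std() * 2
mask = df['C'] > twice_std         # Boolean Series marking the outlier rows

# Use the Boolean result as the filter condition to drop those rows
clean_df = df.loc[~mask]
```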


 

Data cleaning
  • Cleaning null values
    • dropna fillna isnull notnull any all
  • Cleaning duplicate values
    • drop_duplicates(keep)
  • Cleaning outliers
    • Use the result of outlier detection (a Boolean mask) as the filter condition for cleaning
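A minimal sketch tying the null and duplicate items of this checklist together on a made-up DataFrame (values are for illustration only):

```python
import numpy as np
import pandas as pd

# Made-up DataFrame containing a NaN and a duplicated row (for illustration only)
df = pd.DataFrame({'A': [1, np.nan, 3, 3], 'B': [4, 5, 6, 6]})

# Null cleaning: detect with isnull()/notnull() plus any()/all(), then drop or fill
print(df.isnull().any(axis=1))     # True for rows containing at least one NaN
print(df.dropna(axis=0))           # drop rows that contain NaN
print(df.fillna(value=0))          # or fill NaN with a chosen value

# Duplicate cleaning: keep only the first occurrence of repeated rows
print(df.drop_duplicates(keep='first'))
```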

 

 

Random sampling

When the DataFrame is large enough, you can use the np.random.permutation(x) function together with the take() function to achieve random sampling.
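A minimal sketch of permutation-based sampling; the DataFrame shape and the sample size of 10 are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Made-up 500-row DataFrame (for illustration only)
df = pd.DataFrame(np.random.randint(0, 100, size=(500, 3)), columns=['A', 'B', 'C'])

# np.random.permutation(500) gives the row positions 0..499 in random order;
# take() reorders the rows according to those positions
shuffled = df.take(np.random.permutation(500))

# The first 10 shuffled rows form a random sample of 10 rows
sample = shuffled[:10]
print(sample)
```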

 

Grouped data processing [key point]

 

Data aggregation is the last step of data processing; it usually turns each array into a single value.

Grouped data processing steps:

  • Split: first divide the data into groups
  • Apply: apply a different function to each group to transform its data
  • Combine: merge the results obtained from the different groups

The core of grouped data processing:

 - the groupby() function
 - the groups attribute, to inspect the grouping
 - eg: df.groupby(by='item').groups
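A minimal sketch of the split/apply/combine steps with groupby; the item/price table is made up for illustration:

```python
import pandas as pd

# Made-up price table (for illustration only)
df = pd.DataFrame({
    'item':  ['Apple', 'Banana', 'Apple', 'Banana'],
    'price': [4.0, 3.0, 5.0, 2.5],
})

# Split: group the rows by the 'item' column
grouped = df.groupby(by='item')

# The groups attribute shows which row labels ended up in which group
print(grouped.groups)              # {'Apple': [0, 2], 'Banana': [1, 3]}

# Apply + combine: one mean price per group
print(grouped['price'].mean())
```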

 

Advanced data aggregation

 

After grouping with groupby, you can also use transform and apply with custom functions to perform additional computations.

  • df.groupby('item')['price'].sum() <==> df.groupby('item')['price'].apply(sum)
  • Both transform and apply carry out the computation; simply pass a function into transform or apply
  • transform and apply can also take a lambda expression
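A minimal sketch of apply and transform after a groupby, reusing the made-up item/price table from above; the 0.8 discount in the lambda is just an example:

```python
import pandas as pd

df = pd.DataFrame({
    'item':  ['Apple', 'Banana', 'Apple', 'Banana'],
    'price': [4.0, 3.0, 5.0, 2.5],
})

g = df.groupby('item')['price']

# sum() and apply(sum) produce the same per-group result
print(g.sum())
print(g.apply(sum))

# apply can take a lambda as well (one value per group)
print(g.apply(lambda x: x.mean() * 0.8))

# transform returns a result aligned with the original rows,
# which makes it convenient for adding a derived column
df['mean_price'] = g.transform('mean')
print(df)
```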

 


Origin www.cnblogs.com/XLHIT/p/11347436.html