pandas data processing
The duplicated() function detects duplicate rows. It returns a Series of Boolean elements, one per row; an element is True if its row is not the first occurrence of those values.
- The keep parameter specifies which of the duplicate rows to keep
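A minimal sketch of duplicated() and drop_duplicates(keep=...), using a small hypothetical frame:

```python
import pandas as pd

# Hypothetical sample frame: rows 0 and 1 are identical.
df = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 3, 4]})

# duplicated(): True for a row that repeats an earlier row.
mask = df.duplicated()                    # [False, True, False]

# keep='first' (default) keeps the first occurrence of each duplicate
# group; keep='last' keeps the last; keep=False drops them all.
deduped = df.drop_duplicates(keep='first')
```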
Mapping
1) replace() function: replaces elements
DataFrame replacement operations
- Single-value replacement
  - Ordinary replacement, replacing every element that matches: to_replace=15, value='e'
  - Replacement restricted to a specified column: to_replace={column label: value to replace}, value='replacement value'
- Multi-value replacement
  - List replacement: to_replace=[], value=[]
  - Dictionary replacement (recommended): to_replace={to_replace: value, to_replace: value}
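The four replacement forms above can be sketched on a small hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [15, 2], 'b': [3, 15]})

# Single-value replacement: every 15 becomes 'e'.
r1 = df.replace(to_replace=15, value='e')

# Column-restricted replacement: only 15s in column 'a' are replaced.
r2 = df.replace(to_replace={'a': 15}, value='e')

# List replacement: the two lists are paired element-wise (2->20, 3->30).
r3 = df.replace(to_replace=[2, 3], value=[20, 30])

# Dictionary replacement (recommended): old value -> new value.
r4 = df.replace(to_replace={2: 20, 3: 30})
```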
map() function: produces new data; map is not a DataFrame method but a Series method
- map() can map data to new values
- map() can take a lambda expression
- map() can take a built-in method or a custom method
- eg: map({to_replace: value})
- Note: map() cannot take a function like sum, nor a for loop
Note: not every kind of function can serve as a parameter of map(). Only a function that takes one argument and returns one value can be passed to map().
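The three usable forms (dict, lambda, custom one-argument function) can be sketched on a hypothetical Series:

```python
import pandas as pd

s = pd.Series(['apple', 'banana', 'apple'])

# Dict mapping, map({to_replace: value}); keys not in the dict map to NaN.
mapped = s.map({'apple': 'fruit_a', 'banana': 'fruit_b'})

# Lambda mapping: any one-argument expression works.
lengths = s.map(lambda x: len(x))

# Custom function: takes one argument, returns one value.
def shout(x):
    return x.upper()

upper = s.map(shout)
```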
Aggregation operations on data, and outlier detection and filtering
The df.std() function computes the standard deviation of each column of a DataFrame.
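A common pattern built on df.std() is to flag values far from the mean as outliers; the 2-standard-deviation threshold below is an illustrative assumption, not part of the source:

```python
import numpy as np
import pandas as pd

# Hypothetical data; flag values beyond 2 standard deviations
# from the column mean as outliers.
np.random.seed(0)
df = pd.DataFrame({'x': np.random.randn(100)})

std = df['x'].std()
mean = df['x'].mean()
outlier_mask = (df['x'] - mean).abs() > 2 * std   # Boolean Series
```

The Boolean mask is what the "Cleaning outliers" step below uses as a filter condition.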
Data cleaning
- Cleaning null values
  - dropna, fillna, isnull, notnull, any, all
- Cleaning duplicate values
  - drop_duplicates(keep)
- Cleaning outliers
  - Use the Boolean result of outlier detection as the filter condition for cleaning
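A minimal sketch of the null-cleaning tools listed above, on a hypothetical frame with missing values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, np.nan]})

# dropna: remove rows containing NaN; fillna: fill them instead.
dropped = df.dropna()
filled = df.fillna(value=0)

# notnull + all build a Boolean row filter: keep fully non-null rows.
keep_rows = df.notnull().all(axis=1)
cleaned = df.loc[keep_rows]
```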
Random sampling
When the DataFrame is large enough, the np.random.permutation(x) function can be used together with the take() function to implement random sampling.
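The permutation + take combination can be sketched as follows (the frame and sample size are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(10, 2), columns=['a', 'b'])

# permutation(n) returns a shuffled array of row positions;
# take() selects rows by position, so slicing the permutation
# yields a random sample without replacement.
np.random.seed(1)
order = np.random.permutation(len(df))
sample = df.take(order[:3])       # 3 randomly chosen rows
```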
Grouped data processing [key point]
Data aggregation is the last step of data processing; it usually produces a single value from each group.
The grouped processing workflow:
- Split: first divide the data into groups
- Apply: apply (possibly different) functions to each group to transform the data
- Combine: merge the per-group results
The core of grouped data processing:
- the groupby() function
- the groups attribute shows how the data were grouped
- eg: df.groupby(by='item').groups
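The split-apply-combine steps above can be sketched with a hypothetical item/price frame:

```python
import pandas as pd

df = pd.DataFrame({
    'item': ['apple', 'banana', 'apple', 'banana'],
    'price': [3, 2, 4, 1],
})

# Split: group rows by the 'item' column.
g = df.groupby(by='item')

# The groups attribute maps each group label to its row index.
print(g.groups)

# Apply + combine: aggregate each group's prices into one value.
totals = g['price'].sum()
```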
Advanced data aggregation
After grouping with groupby, transform and apply can also take custom functions to perform further computations:
- df.groupby('item')['price'].sum() <==> df.groupby('item')['price'].apply(sum)
- Both transform and apply perform the computation; just pass a function into transform or apply
- transform and apply can also take a lambda expression
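A minimal sketch of the apply/transform distinction, with the key difference that transform broadcasts the group result back to the original row shape:

```python
import pandas as pd

df = pd.DataFrame({
    'item': ['apple', 'banana', 'apple'],
    'price': [3, 2, 4],
})

grouped = df.groupby('item')['price']

# apply(sum) collapses each group to one value, equivalent to .sum().
by_apply = grouped.apply(sum)

# transform(sum) returns one value per original row.
by_transform = grouped.transform(sum)

# Both also accept a lambda, e.g. centering each price on its group mean.
mean_gap = grouped.transform(lambda s: s - s.mean())
```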