1. Missing values
1.1 Analyzing missing values
1. info()
2. isnull(), which can be combined with any() and all()
3. notnull()
Import data:
Use info() to view information about each column, including how many non-null values each column has.
Use isnull() to detect missing values; the check is made column by column.
Use notnull() to detect non-missing values; it is also made column by column.
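The three checks above can be sketched as follows. The DataFrame here is a small hypothetical one (the column names and values are assumptions made for illustration, not the dataset used in this post).

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the post's own dataset is not reproduced here.
df = pd.DataFrame({
    "name": ["a", "b", None, "d"],
    "score": [1.0, np.nan, 3.0, np.nan],
    "age": [20, 21, 22, 23],
})

# Prints the dtype and the non-null count of every column.
df.info()

# isnull() returns a Boolean DataFrame; reduce it per column.
missing_per_column = df.isnull().sum()   # number of missing values per column
has_missing = df.isnull().any()          # True if the column has any missing value
all_missing = df.isnull().all()          # True if the whole column is missing

# notnull() is the inverse check.
present_per_column = df.notnull().sum()  # number of non-null values per column
```

Summing a Boolean result counts its True entries, which is why sum() gives a per-column count of missing (or present) values.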
1.2 Discarding missing values
Missing values can be discarded with dropna().
Parameters:
- how: specifies when data is discarded. The default is any (a row is deleted if it contains any missing value); all deletes a row only when every value in it is missing.
- axis: specifies whether rows or columns are discarded. The default is axis=0, which discards rows.
- thresh: the minimum number of non-null values required for a row or column to be retained.
- inplace: whether to modify the object in place.
Import data:
Use dropna() to delete the rows that contain missing values. The data originally had 1396 rows; after the rows with missing values are deleted, 1098 rows remain.
Setting how
Setting axis=1: any column that contains a missing value is removed.
Setting thresh: a row is retained if its number of non-null values is greater than or equal to the set value.
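A minimal sketch of the dropna() options described above, using a small hypothetical DataFrame (the data is an assumption chosen so that each option gives a different result; it is not the 1396-row dataset mentioned in this section).

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 is complete, row 3 is entirely missing.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, 2.0, np.nan, np.nan],
    "c": [1.0, 2.0, np.nan, np.nan],
    "d": [1.0, 2.0, 3.0, np.nan],
})

rows_any = df.dropna()              # default how="any": drop rows with any NaN
rows_all = df.dropna(how="all")     # drop only rows where every value is NaN
rows_thresh = df.dropna(thresh=3)   # keep rows with at least 3 non-null values

# axis=1 drops columns instead of rows. The all-NaN row is removed first,
# otherwise every column would contain a NaN and be dropped.
cols_any = df.dropna(how="all").dropna(axis=1)

# inplace=True modifies the object itself and returns None.
in_place = df.copy()
in_place.dropna(inplace=True)
```

With this data, rows_any keeps only row 0, rows_all keeps rows 0 to 2, rows_thresh keeps rows 0 and 1, and cols_any keeps only column d.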
1.3 Filling in missing values
Missing values can also be filled with fillna().
Parameters:
- value: the value to fill with. A dictionary can be passed to give different columns (keys) different fill values (values).
- method: the fill direction. pad / ffill fills each gap with the value before it; backfill / bfill fills each gap with the value after it.
- limit: the maximum number of consecutive NaN values to fill. If not specified, all NaN values are filled.
- inplace: whether to modify the object in place.
Import data:
Fill with a fixed value
Fill according to a dictionary
Fill forward with ffill
Fill backward with bfill
Use limit to cap the number of consecutive NaN values that are filled
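The fill variants listed above can be sketched like this. The DataFrame is hypothetical. One assumption to flag: recent pandas releases deprecate passing method= to fillna(), so this sketch uses the equivalent ffill() and bfill() methods instead; on older versions fillna(method="ffill") behaves the same way.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps of different lengths.
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [np.nan, 2.0, np.nan, np.nan],
})

fixed = df.fillna(0)                             # one fixed value everywhere
by_column = df.fillna({"a": -1.0, "b": 99.0})    # a different value per column

forward = df.ffill()    # each gap takes the value before it
backward = df.bfill()   # each gap takes the value after it

# limit=1 fills at most one NaN in each consecutive run.
forward_limited = df.ffill(limit=1)
```

Note that a forward fill cannot fill a leading NaN and a backward fill cannot fill a trailing one, because there is no neighboring value to copy from.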
1.4 describe()
describe() displays different information for numeric columns and non-numeric columns.
Import data:
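As a rough illustration of that difference, the sketch below calls describe() on a hypothetical two-column DataFrame (the column names and values are assumptions).

```python
import pandas as pd

# Hypothetical data: one text column and one numeric column.
df = pd.DataFrame({
    "city": ["x", "y", "x", "x"],
    "price": [10.0, 20.0, 30.0, 40.0],
})

# By default only numeric columns are summarized:
# count, mean, std, min, quartiles, max.
numeric_summary = df.describe()

# A non-numeric column reports count, unique, top (most frequent value)
# and freq (how often the top value occurs).
text_summary = df["city"].describe()

# include="all" puts both kinds of column into one table;
# statistics that do not apply to a column are shown as NaN.
full_summary = df.describe(include="all")
```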
2. Handling duplicate values
2.1 Finding duplicate values
duplicated() finds duplicate values. It returns a Series of Boolean type.
Parameters:
subset: the columns used to decide whether rows are duplicates. The default is all columns, that is, rows are considered duplicates only when all of their values are identical.
keep: the rule for marking duplicate records. The default is first.
- first: the first occurrence is not marked; every later occurrence is marked True
- last: the last occurrence is not marked; every earlier occurrence is marked True
- False: all duplicated records are marked True
For example, if rows 1, 2, and 3 are identical:
first: False True True
last: True True False
False: True True True
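The effect of keep can be checked with a short sketch. The DataFrame is hypothetical: three identical rows followed by one distinct row.

```python
import pandas as pd

# Hypothetical data: rows 0, 1 and 2 are identical, row 3 is unique.
df = pd.DataFrame({
    "name": ["ann", "ann", "ann", "bob"],
    "city": ["x", "x", "x", "y"],
})

first = df.duplicated()              # keep="first" is the default
last = df.duplicated(keep="last")
every = df.duplicated(keep=False)

print(first.tolist())   # [False, True, True, False]
print(last.tolist())    # [True, True, False, False]
print(every.tolist())   # [True, True, True, False]
```

The occurrence that is left unmarked (False) is the one that a later de-duplication step would keep.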
Import data:
View the duplicates
With subset, rows are considered duplicates as long as the specified columns match.
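A minimal sketch of viewing duplicates and of the subset parameter, using hypothetical data in which some rows share a name but not a city.

```python
import pandas as pd

# Hypothetical data: rows 0 and 3 are fully identical;
# rows 0, 1 and 3 share the same name.
df = pd.DataFrame({
    "name": ["ann", "ann", "bob", "ann"],
    "city": ["x", "y", "x", "x"],
})

# Compare whole rows (the default): only row 3 repeats an earlier row.
full_row = df.duplicated()

# Compare only the "name" column: rows 1 and 3 repeat the name in row 0.
by_name = df.duplicated(subset=["name"])

# Use the Boolean Series as a mask to view the duplicated rows themselves.
duplicate_rows = df[by_name]
```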
2.2 Deleting duplicate values
drop_duplicates() removes duplicate values.
Parameters:
subset: the columns used to decide whether rows are duplicates.
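A short sketch of drop_duplicates() on hypothetical data. The last call also passes keep, which drop_duplicates() accepts with the same meaning as in duplicated(); it goes beyond the parameter listed above and is shown only for comparison.

```python
import pandas as pd

# Hypothetical data: rows 0 and 3 are fully identical;
# rows 0, 1 and 3 share the same name.
df = pd.DataFrame({
    "name": ["ann", "ann", "bob", "ann"],
    "city": ["x", "y", "x", "x"],
})

# Whole-row comparison: row 3 is removed.
deduped = df.drop_duplicates()

# Compare only "name": the first row for each name is kept.
by_name = df.drop_duplicates(subset=["name"])

# Same comparison, but keep the last row for each name instead.
by_name_last = df.drop_duplicates(subset=["name"], keep="last")
```

The original index labels are preserved, so the result can be traced back to the source rows; call reset_index(drop=True) afterwards if a fresh 0..n-1 index is wanted.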