Data cleaning with pandas


1. Missing values

1.1 Analyzing missing values

1. info()
2. isnull(), which can be combined with any() and all()
3. notnull()

Import the data first.
Use info() to view a summary of every column, including how many non-null values each column has.
Use isnull() to test for null values. The test is element-wise; chaining any() or all() summarizes it per column.
Use notnull() to test for non-null values in the same way.
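The original screenshots are not available, so the following is only a minimal sketch of the three calls above. The DataFrame and its column names are hypothetical stand-ins for the article's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the article's own dataset is not shown.
df = pd.DataFrame({
    "name": ["Ann", "Bob", None, "Dan"],
    "age": [28.0, np.nan, 35.0, 41.0],
    "city": ["Oslo", "Lima", "Rome", "Kyiv"],
})

df.info()  # prints the dtype and non-null count of each column

null_mask = df.isnull()        # element-wise: True where a value is missing
any_null = null_mask.any()     # per column: does it contain any null?
all_null = null_mask.all()     # per column: is every value null?
null_counts = null_mask.sum()  # per column: number of nulls

not_null_mask = df.notnull()   # element-wise inverse of isnull()
```

Summing the Boolean mask is a common way to get a per-column count of missing values, which complements the non-null counts printed by info().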

1.2 Discarding missing values

Missing values can be discarded with dropna().
Parameters:

  • how: when to discard. The default 'any' drops a row or column if it contains any missing value; 'all' drops it only if every value is missing.
  • axis: whether to discard rows or columns. The default axis=0 discards rows; axis=1 discards columns.
  • thresh: the minimum number of non-null values a row or column needs in order to be kept.
  • inplace: whether to modify the object in place.

Import the data first.
Use dropna() to delete rows that contain missing values. The data originally had 1396 rows; after deleting rows with missing values, 1098 rows remain.
Set how to choose between dropping on any missing value ('any') and dropping only when every value is missing ('all').
Set axis=1 to drop every column that contains a missing value.
Set thresh to keep every row whose number of non-null values is greater than or equal to the set value.
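As a rough illustration of the dropna() options described above, here is a small sketch on a hypothetical four-row frame (not the article's 1396-row dataset). Note that recent pandas versions do not allow how and thresh to be passed in the same call, so they are shown separately.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 is entirely missing, row 2 is complete.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, 6.0, np.nan],
    "c": [np.nan, np.nan, 9.0, 10.0],
})

rows_any = df.dropna()             # default how="any": drop rows with any NaN
rows_all = df.dropna(how="all")    # drop only rows where every value is NaN
rows_thresh = df.dropna(thresh=2)  # keep rows with at least 2 non-null values

# After removing the empty row, drop columns that still contain a NaN.
cols_any = rows_all.dropna(axis=1)

print(len(df), len(rows_any), len(rows_all), len(rows_thresh))
```

Each call returns a new DataFrame; pass inplace=True only if the original frame should be modified directly.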

1.3 Filling missing values

Missing values can also be filled with fillna().
Parameters:

  • value: the value to fill with. A dictionary can be passed to give different columns (keys) different fill values (values).
  • method: pad / ffill fills forward, using the previous valid value; backfill / bfill fills backward, using the next valid value.
  • limit: the maximum number of consecutive NaN values to fill. If not specified, all NaN values are filled.
  • inplace: whether to modify the object in place.

Import the data first.
Fill with a fixed value.
Fill according to a dictionary.
Fill forward with ffill.
Fill backward with bfill.
Use limit to cap the number of consecutive NaN values that are filled.
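The fill variants listed above might look like the following sketch, again on hypothetical data. One assumption to flag: recent pandas releases deprecate the method argument of fillna(), so this sketch uses the equivalent ffill() and bfill() methods rather than fillna(method=...).

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps of different lengths.
df = pd.DataFrame({
    "x": [1.0, np.nan, np.nan, 4.0],
    "y": [np.nan, 2.0, np.nan, np.nan],
})

fixed = df.fillna(0)                            # one fixed value everywhere
by_column = df.fillna({"x": -1.0, "y": 99.0})   # a different value per column
forward = df.ffill()                            # reuse the previous valid value
backward = df.bfill()                           # reuse the next valid value
limited = df.ffill(limit=1)                     # fill at most 1 NaN in each run
```

Forward fill leaves a leading NaN untouched (there is no earlier value to copy), and backward fill does the same for a trailing NaN, so a fixed-value fill is sometimes applied afterwards.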

1.4 describe()

describe() shows different information for numeric and non-numeric columns: numeric columns get count, mean, std, min, quartiles and max, while non-numeric columns get count, unique, top and freq.

Import the data first, then call describe() on it.
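A short sketch of the difference, assuming a hypothetical frame with one numeric and one text column:

```python
import pandas as pd

# Hypothetical data: one numeric column, one non-numeric column.
df = pd.DataFrame({
    "price": [10.0, 12.5, 9.0, 12.5],
    "color": ["red", "blue", "red", "red"],
})

numeric_summary = df.describe()          # numeric columns only by default
color_summary = df["color"].describe()   # count, unique, top, freq
full_summary = df.describe(include="all")  # both kinds; inapplicable cells are NaN

print(numeric_summary)
print(color_summary)
```

By default describe() on a mixed DataFrame reports only the numeric columns, so the text column has to be selected explicitly or requested with include.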

2. Handling duplicate values

2.1 Finding duplicate values

duplicated() finds duplicate rows. It returns a Series of Boolean type.
Parameters:
subset: which columns to compare when checking for duplicates. The default is all columns, so rows count as duplicates only when all of their values are identical.
keep: the rule for marking duplicate records. The default is 'first'.

  • first: every occurrence except the first is marked True
  • last: every occurrence except the last is marked True
  • False: all occurrences are marked True
    For example, if rows 1, 2 and 3 are identical:
    first: False True True
    last: True True False
    False: True True True

Import the data first.
Call duplicated() to see which rows are duplicates.
With subset, rows are considered duplicates as long as the listed columns match.
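The keep rules and the subset parameter can be checked on a small hypothetical frame in which the first three rows are identical and the fourth shares only the name column:

```python
import pandas as pd

# Hypothetical data: rows 0-2 are identical; row 3 matches on "name" only.
df = pd.DataFrame({
    "name": ["Ann", "Ann", "Ann", "Ann"],
    "score": [90, 90, 90, 75],
})

dup_first = df.duplicated()              # keep="first": first occurrence is False
dup_last = df.duplicated(keep="last")    # last occurrence is False
dup_all = df.duplicated(keep=False)      # every repeated row is True
dup_name = df.duplicated(subset=["name"])  # compare the "name" column only

print(dup_first.tolist())  # [False, True, True, False]
print(dup_last.tolist())   # [True, True, False, False]
print(dup_all.tolist())    # [True, True, True, False]
print(dup_name.tolist())   # [False, True, True, True]
```

The returned Boolean Series can be used directly as a mask, for example df[dup_all] to inspect every row involved in a duplication.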

2.2 Deleting duplicate values

drop_duplicates() removes duplicate rows.
Parameters:
subset: which columns to compare when deciding whether rows are duplicates.
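A minimal sketch of drop_duplicates(), using hypothetical data in which the first three rows are identical. The keep argument shown on the last line is the same rule that duplicated() accepts.

```python
import pandas as pd

# Hypothetical data: rows 0-2 are identical; row 3 matches on "name" only.
df = pd.DataFrame({
    "name": ["Ann", "Ann", "Ann", "Ann"],
    "score": [90, 90, 90, 75],
})

deduped = df.drop_duplicates()                  # compare all columns, keep first
by_name = df.drop_duplicates(subset=["name"])   # compare "name" only
keep_last = df.drop_duplicates(keep="last")     # keep the last occurrence instead

print(deduped.index.tolist())    # [0, 3]
print(by_name.index.tolist())    # [0]
print(keep_last.index.tolist())  # [2, 3]
```

The surviving rows keep their original index labels; chain reset_index(drop=True) if a fresh 0..n-1 index is wanted.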


Origin blog.csdn.net/MicoOu/article/details/103960560