Data preprocessing with pandas
- Dirty data
- Null handling
- Duplicate value handling
- Outliers
- Data type conversion
- Structural problems
- Index settings
*****************************************************************************************************************
- Null handling
* View: df.isnull() returns a boolean mask that is True wherever a value is missing
* Remove: df.dropna() by default deletes every row that contains at least one null value; to delete a row only when all of its values are null, use df.dropna(how="all")
* Fill: df.fillna(value) fills every null with a single value; df.fillna({"Gender": "M", "age": "30"}) fills different columns with different values
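A minimal sketch of the three null-handling steps above. The sample data is hypothetical; only the column names "Gender" and "age" come from the notes. The fill value for "age" is the number 30 here rather than the string "30", on the assumption that age is a numeric column.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: row 1 has one null, row 3 is entirely null.
df = pd.DataFrame({
    "Gender": ["F", None, "M", None],
    "age": [25.0, 33.0, 41.0, np.nan],
})

# View: boolean mask of missing values, and a per-column count of them.
null_mask = df.isnull()
null_counts = df.isnull().sum()

# Remove: default drops rows with any null; how="all" drops only all-null rows.
dropped_any = df.dropna()
dropped_all = df.dropna(how="all")

# Fill: a dict maps each column name to its own fill value.
filled = df.fillna({"Gender": "M", "age": 30})

print(null_counts)
print(filled)
```

None of these calls modify df in place; each returns a new DataFrame, so the result has to be assigned to a name to be kept.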
- Duplicate value handling
* Remove: df.drop_duplicates() keeps the first occurrence of each duplicated row by default; keep="last" keeps the last occurrence; keep=False deletes all duplicates
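The three keep options can be compared side by side. This is a small sketch on hypothetical data (the column names order_id and amount are made up for illustration); df.duplicated() is not mentioned in the notes but is the matching pandas call for inspecting duplicates before removing them.

```python
import pandas as pd

# Hypothetical data: rows 1 and 2 are exact duplicates of each other.
df = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "amount": [10, 20, 20, 30],
})

# True for every repeat after the first occurrence.
dup_mask = df.duplicated()

keep_first = df.drop_duplicates()             # default: keep="first"
keep_last = df.drop_duplicates(keep="last")   # keep the last occurrence
drop_all = df.drop_duplicates(keep=False)     # remove every duplicated row

print(keep_first.index.tolist())
print(keep_last.index.tolist())
print(drop_all.index.tolist())
```

The original row labels are preserved in each result, which makes it easy to see which copy was kept.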
- Outlier detection and treatment
* Detection: values that are far above or below the normal data (outside a specified normal range; points beyond the upper and lower edges of a box plot; values that deviate from the mean by more than 3σ under a normal distribution)
* Handling: delete, fill, or keep as special values to study separately; in Python, by filtering, replace(), etc.
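The detection rules and two of the handling options above might look like this in pandas. This is a sketch under stated assumptions: the data and the column name amount are hypothetical, the box-plot edges are taken to be the conventional 1.5 × IQR whiskers (the notes do not give a multiplier), and the median is used as the fill value purely for illustration.

```python
import pandas as pd

# Hypothetical data: twenty ordinary values plus one extreme value (100).
df = pd.DataFrame({"amount": [10, 11, 12, 13] * 5 + [100]})
col = df["amount"]

# 3-sigma rule: flag values more than 3 standard deviations from the mean.
z_scores = (col - col.mean()) / col.std()
outlier_3sigma = z_scores.abs() > 3

# Box-plot rule (assumed 1.5 * IQR whiskers): flag values beyond the whiskers.
q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1
outlier_iqr = (col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)

# Handling option 1: delete the outlier rows by filtering.
removed = df[~outlier_iqr]

# Handling option 2: fill the outliers with the column median.
replaced = df.assign(amount=col.mask(outlier_iqr, col.median()))

print(col[outlier_iqr].tolist())
print(len(removed))
```

On small samples the 3σ rule can miss outliers because the extreme value inflates the standard deviation itself, so the box-plot rule is often the more robust of the two.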
- Data type conversion
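* View: df.dtypes shows the data type of every column (Series.dtype for a single column); both are attributes, so they take no parentheses
* Convert: .astype("float64") converts the data type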
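A short sketch of viewing and converting types. The column names price and qty are hypothetical, and the example assumes every string in price is a valid number; astype raises an error otherwise.

```python
import pandas as pd

# Hypothetical data: prices were read in as strings, quantities as integers.
df = pd.DataFrame({"price": ["1.5", "2", "3.25"], "qty": [1, 2, 3]})

# View: dtypes is an attribute that lists the type of every column.
dtypes_before = df.dtypes

# Convert: astype returns a new Series, so assign it back to the column.
df["price"] = df["price"].astype("float64")
df["qty"] = df["qty"].astype("float64")

print(dtypes_before)
print(df.dtypes)
print(df["price"].sum())
```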
- Index Settings
* Add index: df.index = [1,2,3,4,5]
* Set index: df.set_index("Order Number") uses the "Order Number" column as the new index; for a hierarchical index, pass two or more column names to set_index() as a list
* Rename index: df.rename(index={1: "one", 2: "two", 3: "three"}, columns={"Order Number": "New Order ID"}) renames index labels and column names in one call
* Reset index: for a hierarchical index, df.reset_index() by default converts all index levels into columns; df.reset_index(level=0) converts only index level 0 into a column; df.reset_index(level=1) converts only index level 1 into a column
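The index operations above can be tried together on a small frame. This is a sketch on hypothetical data: "Order Number" comes from the notes, while the Region and Amount columns and all values are made up for illustration.

```python
import pandas as pd

# Hypothetical order data.
df = pd.DataFrame({
    "Order Number": ["A1", "A2", "A3"],
    "Region": ["East", "East", "West"],
    "Amount": [100, 200, 300],
})

# Add index: the list length must equal the number of rows.
df.index = [1, 2, 3]

# Rename index labels and a column in one call (returns a new DataFrame).
renamed = df.rename(
    index={1: "one", 2: "two", 3: "three"},
    columns={"Order Number": "New Order ID"},
)

# Set index: one column, or a list of columns for a hierarchical index.
single = df.set_index("Order Number")
multi = df.set_index(["Region", "Order Number"])

# Reset index: all levels by default, or one level selected by position.
all_levels = multi.reset_index()
level0_only = multi.reset_index(level=0)   # "Region" becomes a column
level1_only = multi.reset_index(level=1)   # "Order Number" becomes a column

print(multi)
print(level0_only)
```

After reset_index(level=0), the remaining level ("Order Number") stays as the index, and the reverse holds for level=1.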