Pandas data cleaning and feature processing


Missing value handling

Missing value identification

  • Missing value types
    • None, np.nan (alias np.NaN), pd.NA — see the short check after this list
  • Missing value detection
    • df.isna()
    • df.notna()
  • Missing value statistics
    • df.isna().sum() # how many missing values per column
    • df.isna().sum(axis=1) # how many missing values per row
    • df.isna().sum().sum() # how many missing values in total
  • Missing value filtering
    • df.loc[df.isna().any(axis=1)] # rows with missing values
    • df.loc[:, df.isna().any()] # columns with missing values
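A quick check (a minimal sketch, independent of the DataFrame below) that all of these sentinels are treated as missing:

import numpy as np
import pandas as pd

pd.isna(None)    # True
pd.isna(np.nan)  # True (np.NaN is the same object)
pd.isna(pd.NA)   # True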
>>> import pandas as pd
>>> df = pd.DataFrame({
...     'A':['a1', 'a1', 'a2', 'a2'],
...     'B':['b1', 'b2', None, 'b2'],
...     'C':[1, 2, 3, 4],
...     'D':[5, 6, None, 8],
...     'E':[5, None, 7, 8]
... })
>>> df
    A     B  C    D    E
0  a1    b1  1  5.0  5.0
1  a1    b2  2  6.0  NaN
2  a2  None  3  NaN  7.0
3  a2    b2  4  8.0  8.0
>>> df.isna()
       A      B      C      D      E
0  False  False  False  False  False
1  False  False  False  False   True
2  False   True  False   True  False
3  False  False  False  False  False
>>> df.isna().sum()
A    0
B    1
C    0
D    1
E    1
dtype: int64
>>> df.isna().sum(axis=1)
0    0
1    1
2    2
3    0
dtype: int64
>>> df.isna().sum().sum()
3
>>> df.loc[df.isna().any(axis=1)]
    A     B  C    D    E
1  a1    b2  2  6.0  NaN
2  a2  None  3  NaN  7.0
>>> df.loc[:, df.isna().any()]
      B    D    E
0    b1  5.0  5.0
1    b2  6.0  NaN
2  None  NaN  7.0
3    b2  8.0  8.0
>>> df.loc[~(df.isna().any(axis=1))]
    A   B  C    D    E
0  a1  b1  1  5.0  5.0
3  a2  b2  4  8.0  8.0
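Rows with missing values can also be dropped directly with dropna(); a minimal sketch based on the example df above, equivalent to the boolean filter just shown:

df.dropna()              # drop rows that contain any missing value
df.dropna(how='all')     # drop a row only if every value in it is missing
df.dropna(subset=['D'])  # drop rows where column D is missing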

Missing value filling

>>> df.fillna(0)
    A   B  C    D    E
0  a1  b1  1  5.0  5.0
1  a1  b2  2  6.0  0.0
2  a2   0  3  0.0  7.0
3  a2  b2  4  8.0  8.0
>>> df.replace({pd.NA: 0})
    A   B  C    D    E
0  a1  b1  1  5.0  5.0
1  a1  b2  2  6.0  NaN
2  a2   0  3  NaN  7.0
3  a2  b2  4  8.0  8.0
>>> import numpy as np
>>> df.replace({np.nan: 0})
    A   B  C    D    E
0  a1  b1  1  5.0  5.0
1  a1  b2  2  6.0  0.0
2  a2   0  3  0.0  7.0
3  a2  b2  4  8.0  8.0
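fillna() also accepts a dict of per-column fill values, and ffill()/bfill() propagate neighbouring values; a short sketch using the columns of the example df above:

df.fillna({'B': 'missing', 'D': df['D'].mean()})  # a different fill value per column
df.ffill()  # forward fill: copy the previous valid value downward
df.bfill()  # backward fill: copy the next valid value upward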

Interpolation fill

>>> df.interpolate()
    A     B  C    D    E
0  a1    b1  1  5.0  5.0
1  a1    b2  2  6.0  6.0
2  a2  None  3  7.0  7.0
3  a2    b2  4  8.0  8.0
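interpolate() only fills the numeric columns; newer pandas versions may require selecting them explicitly, as in this minimal sketch:

df[['C', 'D', 'E']].interpolate(method='linear')  # linear interpolation on the numeric columns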

Duplicate value handling

Duplicate value identification and removal

  • Duplicate value identification
    • df.duplicated() : default keep='first'
  • Duplicate value removal
    • df.drop_duplicates()
>>> df = pd.DataFrame({
...     'A': ['x', 'x', 'z'],
...     'B': ['x', 'x', 'x'],
...     'C': [1, 1, 2]
... })
>>> df
   A  B  C
0  x  x  1
1  x  x  1
2  z  x  2
>>> df.duplicated()
0    False
1     True
2    False
dtype: bool
>>> df[df.duplicated(keep = 'last')]
   A  B  C
0  x  x  1
>>> df.drop_duplicates()
   A  B  C
0  x  x  1
2  z  x  2
>>> df.drop([0,2])
   A  B  C
1  x  x  1
>>> df.drop(['A','C'], axis = 1)
   B
0  x
1  x
2  x
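drop_duplicates() can also compare only a subset of columns and control which occurrence is kept; a short sketch based on the df above:

df.drop_duplicates(subset=['B'])                    # deduplicate on column B only, keep the first occurrence
df.drop_duplicates(subset=['A', 'B'], keep='last')  # keep the last occurrence instead
df.drop_duplicates(keep=False)                      # drop every row that has a duplicate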

Continuous numerical feature and text feature processing

Continuous numerical feature processing

Data binning (also known as discretization or bucketing) is a data preprocessing technique that divides the original values into a number of intervals, called bins; it is a form of quantization.

It smooths the input data and can also reduce overfitting on small data sets.

Pandas mainly provides two functions for discretizing continuous data: pd.cut() (equal-width or user-defined bins) and pd.qcut() (equal-frequency bins).

df.Age.max(), df.Age.min()
(80.0, 0.0)
df['Age_num'] = pd.cut(df['Age'],bins = 5, labels = [1,2,3,4,5])
df.head()
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked Age_num
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S       2
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C       3
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S       2
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S       3
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S       3
df.Age.groupby(pd.cut(df.Age, bins =4)).count()
Age
(-0.08, 20.0]    356
(20.0, 40.0]     385
(40.0, 60.0]     128
(60.0, 80.0]      22
Name: Age, dtype: int64
df.Age.groupby(pd.qcut(df.Age, 4)).count()
Age
(-0.001, 6.0]    224
(6.0, 24.0]      230
(24.0, 35.0]     220
(35.0, 80.0]     217
Name: Age, dtype: int64
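The Titanic examples above assume the dataset has already been loaded into df. A self-contained sketch of the difference between the two functions, on made-up values:

import pandas as pd

ages = pd.Series([2, 15, 22, 28, 35, 47, 63, 80])
pd.cut(ages, bins=4)   # equal-width bins over the value range
pd.qcut(ages, q=4)     # equal-frequency bins: roughly the same number of samples per bin
pd.cut(ages, bins=[0, 18, 35, 60, 100], labels=['child', 'young', 'middle', 'senior'])  # custom bin edges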

Text variable feature processing

View variable types

  • value_counts()
  • unique(), nunique()
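A short sketch of these methods on the Sex column (assuming df is the Titanic DataFrame used below):

df['Sex'].value_counts()  # frequency of each category
df['Sex'].unique()        # array of distinct values
df['Sex'].nunique()       # number of distinct values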

Convert text to numeric variables

  • replace()
  • map()
  • sklearn.preprocessing.LabelEncoder
df.Sex.unique()
array(['male', 'female', 0], dtype=object)
# Using the replace function
df['Sex'].replace(['male','female'],[1,2]).unique()
array([1, 2, 0], dtype=int64)
# Using the map function
df['Sex'].map({'male': 1, 'female': 2}).unique()
array([ 1.,  2., nan])
df.Cabin.nunique(), df.Ticket.nunique()
(135, 543)
# Handling multi-category text features
from sklearn.preprocessing import LabelEncoder
for feat in ['Cabin', 'Ticket']:
    lbl = LabelEncoder()
    # Approach 1: build a value-to-code dict by hand and map it
    # (note: unique() includes NaN while nunique() does not, so a column with missing values ends up one code short)
    label_dict = dict(zip(df[feat].unique(), range(df[feat].nunique())))
    df[feat + "_labelEncode"] = df[feat].map(label_dict)
    # Approach 2: let LabelEncoder assign the codes; this overwrites the result of approach 1
    df[feat + "_labelEncode"] = lbl.fit_transform(df[feat].astype(str))
    
df.Cabin_labelEncode.unique()
array([135,  74,  50,   0, 119, 133,  45, 101,  11,  57,  92,  20,  18,
        73, 131, 129, 112,  10,  83,  89,  47,  33, 106,  97,  41, 130,
        55,  15,  12,  62, 132,  25,  39,   7,  94,  85,  80,  71,  93,
        76,  37, 124,  42,  52,  81,  49, 103,  28,  82,  56,  67, 115,
        65,  32,  69, 114,  59,  14,  51,  78, 117, 134, 113,  95,  21,
       121,  72,  43, 105,  64, 118,   8,  46,  48,  79, 116, 107, 123,
        22,  58,  88,  38, 111,  96,  36,  23,  24,  17,  75,  70,   2,
        44,  68,   1, 125,  26,   3,  87, 100, 104,   4,  30,   6,  98,
       122,  35,  31,  99,  29,  16, 128,  66, 110,  77,  53,  60, 127,
        13,  61,  91, 108,  84, 126,  19, 102,  40,  86,   9,  34, 120,
        90, 109,   5,  63,  27,  54])

Convert text to one-hot encoding

A dummy variable (also called an indicator or nominal variable) is an artificial variable used to represent a qualitative attribute. It is a quantified variable that usually takes the value 0 or 1, and it is commonly used for one-hot feature extraction.

# Convert to one-hot encoding: bin Age first, then convert
pd.get_dummies(pd.cut(df['Age'],4, labels = [0,1,2,3]), prefix = 'Age').tail()
     Age_0  Age_1  Age_2  Age_3
886      0      1      0      0
887      1      0      0      0
888      1      0      0      0
889      0      1      0      0
890      0      1      0      0
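get_dummies() can also be applied to a text column directly, and drop_first=True drops one redundant level to avoid collinearity; a hedged sketch assuming the Titanic Embarked column:

pd.get_dummies(df['Embarked'], prefix='Embarked')                   # one 0/1 column per port
pd.get_dummies(df['Embarked'], prefix='Embarked', drop_first=True)  # drop one level to avoid collinearity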

Extract a Title feature from the plain-text Name column

.str.extract() can use a regular expression to pull matching data out of text and place it in a separate column.

# Extract the title (the word before the ".") from Name
df['Title'] = df.Name.str.extract(r'([A-Za-z]+)\.', expand=False)
df.Title.value_counts()
Mr          398
Miss        146
Mrs         108
Master       36
Rev           6
Dr            6
Mlle          2
Col           2
Major         2
Sir           1
Jonkheer      1
Mme           1
Countess      1
Don           1
Lady          1
Ms            1
Capt          1
Name: Title, dtype: int64

Source: blog.csdn.net/qq_38869560/article/details/128731029