Cris's Python Data Analysis Notes 06: Common Data Preprocessing with Pandas

Copyright notice: please credit the source when reposting (and give the blogger a pat on the head): https://blog.csdn.net/cris_zz/article/details/84336138

1. Sorting by a specified column in Pandas

import pandas as pd

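'''
    sort_values sorts by the specified column. With inplace=True the original DataFrame is sorted in place;
    otherwise a new, sorted DataFrame is returned. The default order is ascending, and the ascending
    parameter changes it. NaN marks a missing value; as the output shows, NaN rows end up last in both directions.
'''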
data = pd.read_csv('food_info.csv')
print(data['Sodium_(mg)'])
data.sort_values('Sodium_(mg)',inplace=True)
print(data['Sodium_(mg)'])
data.sort_values('Sodium_(mg)',inplace=True,ascending=False)
print(data['Sodium_(mg)'])
0        643.0
1        659.0
2          2.0
3       1146.0
4        560.0
5        629.0
6        842.0
7        690.0
8        644.0
9        700.0
10       604.0
11       364.0
12       344.0
13       372.0
14       308.0
15       406.0
16       365.0
17       812.0
18       917.0
19       800.0
20       600.0
21       819.0
22       714.0
23       800.0
24       600.0
25       627.0
26       710.0
27       619.0
28       682.0
29       628.0
         ...  
8588       2.0
8589       2.0
8590       7.0
8591     564.0
8592     464.0
8593     490.0
8594       1.0
8595     199.0
8596     297.0
8597      16.0
8598     486.0
8599       0.0
8600       2.0
8601    1297.0
8602    1435.0
8603    2838.0
8604      10.0
8605       2.0
8606      12.0
8607       0.0
8608    3326.0
8609    1765.0
8610    3750.0
8611      29.0
8612      58.0
8613    4450.0
8614     667.0
8615      58.0
8616      70.0
8617      68.0
Name: Sodium_(mg), Length: 8618, dtype: float64
760     0.0
758     0.0
405     0.0
761     0.0
2269    0.0
763     0.0
764     0.0
770     0.0
774     0.0
396     0.0
395     0.0
6827    0.0
394     0.0
393     0.0
391     0.0
390     0.0
787     0.0
788     0.0
2270    0.0
2231    0.0
407     0.0
748     0.0
409     0.0
747     0.0
702     0.0
703     0.0
704     0.0
705     0.0
706     0.0
707     0.0
       ... 
8153    NaN
8155    NaN
8156    NaN
8157    NaN
8158    NaN
8159    NaN
8160    NaN
8161    NaN
8163    NaN
8164    NaN
8165    NaN
8167    NaN
8169    NaN
8170    NaN
8172    NaN
8173    NaN
8174    NaN
8175    NaN
8176    NaN
8177    NaN
8178    NaN
8179    NaN
8180    NaN
8181    NaN
8183    NaN
8184    NaN
8185    NaN
8195    NaN
8251    NaN
8267    NaN
Name: Sodium_(mg), Length: 8618, dtype: float64
276     38758.0
5814    27360.0
6192    26050.0
1242    26000.0
1245    24000.0
1243    24000.0
1244    23875.0
292     17000.0
1254    11588.0
5811    10600.0
8575     9690.0
291      8068.0
1249     8031.0
5812     7893.0
1292     7851.0
293      7203.0
4472     7027.0
4836     6820.0
1261     6580.0
3747     6008.0
1266     5730.0
4835     5586.0
4834     5493.0
1263     5356.0
1553     5203.0
1552     5053.0
1251     4957.0
1257     4843.0
294      4616.0
8613     4450.0
         ...   
8153        NaN
8155        NaN
8156        NaN
8157        NaN
8158        NaN
8159        NaN
8160        NaN
8161        NaN
8163        NaN
8164        NaN
8165        NaN
8167        NaN
8169        NaN
8170        NaN
8172        NaN
8173        NaN
8174        NaN
8175        NaN
8176        NaN
8177        NaN
8178        NaN
8179        NaN
8180        NaN
8181        NaN
8183        NaN
8184        NaN
8185        NaN
8195        NaN
8251        NaN
8267        NaN
Name: Sodium_(mg), Length: 8618, dtype: float64
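
The sorting behaviour above can be checked on a tiny frame. This is a minimal sketch with invented values and a hypothetical `sodium` column, not data from food_info.csv. It shows that `sort_values` returns a new DataFrame when `inplace` is left at its default, and that NaN rows go last in both directions unless `na_position='first'` is passed.

```python
import pandas as pd

# Toy data; the values and the column name are invented for illustration.
df = pd.DataFrame({'sodium': [643.0, None, 2.0, 1146.0, None]})

# Default: ascending order; a new DataFrame is returned and df is unchanged.
asc = df.sort_values('sodium')

# Descending order: the NaN rows still go last.
desc = df.sort_values('sodium', ascending=False)

# na_position='first' moves the NaN rows to the top instead.
nan_first = df.sort_values('sodium', na_position='first')

print(asc['sodium'].tolist())
print(desc['sodium'].tolist())
print(nan_first['sodium'].tolist())
```

Leaving `inplace` at its default keeps the original frame untouched, which makes it easy to compare the result with the input.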

2. The classic Titanic starter example

import numpy as np

'''
    isnull checks a column for missing values: it returns True for NaN and False for normal values
'''
titanic_survival = pd.read_csv('titanic_train.csv')
titanic_survival.head()

age = titanic_survival['Age']
age_top_10 = (age[0:10])
age_is_null = pd.isnull(age_top_10)
print(age_is_null)

# Use the boolean mask as an index to keep only the entries with missing values
age_null = age_top_10[age_is_null]
print(age_null)
age_null_count = len(age_null)
print(age_null_count)
0    False
1    False
2    False
3    False
4    False
5     True
6    False
7    False
8    False
9    False
Name: Age, dtype: bool
5   NaN
Name: Age, dtype: float64
1
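
The same count can be read straight off the mask, because True counts as 1 when summed. A small sketch on an invented stand-in for the Age column (the values are not from titanic_train.csv):

```python
import pandas as pd

# Toy stand-in for a few rows of the Age column; values are invented.
age = pd.Series([22.0, 38.0, None, 35.0, None], name='Age')

mask = age.isnull()            # boolean Series, True where the value is missing
missing = age[mask]            # boolean indexing keeps only the True rows
n_missing = int(mask.sum())    # True counts as 1, so the sum is the count

print(missing.index.tolist())
print(n_missing)
```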

3. Commonly used Pandas data preprocessing functions

3.1 Handling missing values

'''
    If the NaN values are not handled first, the computed result is itself nan
'''
average_age = sum(titanic_survival['Age'])/len(titanic_survival['Age'])
print(average_age)

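'''
    A very handy way to deal with missing values: index with a boolean expression to keep only the non-NaN data
'''
# First use isnull to build a boolean mask for the column: True where the value is missing, False otherwise
all_age_null = pd.isnull(titanic_survival['Age'])
print(all_age_null)
# Then index with the inverted mask (~ flips True and False) to keep only the valid values
good_ages = titanic_survival['Age'][~all_age_null]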
print(good_ages)
age_average = sum(good_ages)/len(good_ages)
# 29.69911764705882
print(age_average)
nan
0      False
1      False
2      False
3      False
4      False
5       True
6      False
7      False
8      False
9      False
10     False
11     False
12     False
13     False
14     False
15     False
16     False
17      True
18     False
19      True
20     False
21     False
22     False
23     False
24     False
25     False
26      True
27     False
28      True
29      True
       ...  
861    False
862    False
863     True
864    False
865    False
866    False
867    False
868     True
869    False
870    False
871    False
872    False
873    False
874    False
875    False
876    False
877    False
878     True
879    False
880    False
881    False
882    False
883    False
884    False
885    False
886    False
887    False
888     True
889    False
890    False
Name: Age, Length: 891, dtype: bool
0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
6      54.0
7       2.0
8      27.0
9      14.0
10      4.0
11     58.0
12     20.0
13     39.0
14     14.0
15     55.0
16      2.0
18     31.0
20     35.0
21     34.0
22     15.0
23     28.0
24      8.0
25     38.0
27     19.0
30     40.0
33     66.0
34     28.0
35     42.0
37     21.0
38     18.0
       ... 
856    45.0
857    51.0
858    24.0
860    41.0
861    21.0
862    48.0
864    24.0
865    42.0
866    27.0
867    31.0
869     4.0
870    26.0
871    47.0
872    33.0
873    47.0
874    28.0
875    15.0
876    20.0
877    19.0
879    56.0
880    25.0
881    33.0
882    22.0
883    28.0
884    25.0
885    39.0
886    27.0
887    19.0
889    26.0
890    32.0
Name: Age, Length: 714, dtype: float64
29.69911764705882
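
Why the first average comes out as nan: Python's built-in `sum` simply adds the values, and any arithmetic involving NaN yields NaN. The sketch below reproduces this with the standard library only, on invented values.

```python
import math

ages = [22.0, 38.0, float('nan'), 35.0]   # invented values

# Any arithmetic involving NaN yields NaN, so the plain average is NaN.
naive_mean = sum(ages) / len(ages)

# Filter out NaN first, then average over the remaining values only.
valid = [a for a in ages if not math.isnan(a)]
clean_mean = sum(valid) / len(valid)

print(naive_mean)
print(clean_mean)
```

Note that the denominator changes as well: the clean average divides by the number of valid values, not by the original length.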

3.2 Pandas preprocessing functions filter out missing values automatically

# missing data is so common that many pandas methods automatically filter for it
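# Although Pandas provides functions that filter out missing values, relying on them is still not really recommended,
#  because data should not be discarded lightly; the usual practice is to fill the gaps with a computed default value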
mean_age = titanic_survival['Age'].mean()
print(mean_age)
29.69911764705882
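
One way to follow that advice is to fill the gaps with the column mean instead of dropping the rows. This is a minimal sketch on invented values; whether the mean is a sensible default depends on the data and is an assumption here.

```python
import pandas as pd

age = pd.Series([20.0, None, 40.0, None], name='Age')   # invented values

mean_age = age.mean()            # NaN is skipped by default
filled = age.fillna(mean_age)    # returns a new Series; age itself is unchanged

print(mean_age)
print(filled.tolist())
```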

3.3 Computing the average fare for each passenger class by hand

Pclass = [1,2,3]
Pclass_avg_price = {}
for this_pclass in Pclass:
    
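    # First select the rows (samples) whose column value matches the condition, then sum the chosen
    # column (feature) of those rows and divide by the row count; the result is the mean of that feature
    prices = titanic_survival[titanic_survival['Pclass'] == this_pclass]
#     Pclass_avg_price[this_pclass] = sum(prices['Fare'])/len(prices)
    # The mean can also be computed with the built-in Pandas method shown in section 3.2!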
    Pclass_avg_price[this_pclass] = prices['Fare'].mean()
    
print(Pclass_avg_price)
{1: 84.15468749999992, 2: 20.66218315217391, 3: 13.675550101832997}
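
The loop can also be written as a single `groupby` call. The sketch below uses a toy frame that borrows the `Pclass` and `Fare` column names; the values are invented.

```python
import pandas as pd

# Toy frame with the same column names as the Titanic data; values invented.
df = pd.DataFrame({
    'Pclass': [1, 1, 2, 3, 3, 3],
    'Fare':   [80.0, 90.0, 20.0, 10.0, None, 14.0],
})

# Group the rows by Pclass, then average Fare within each group.
# Missing fares are skipped, as with Series.mean().
avg_fare = df.groupby('Pclass')['Fare'].mean()

print(avg_fare.to_dict())
```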

3.4 Simplifying the section 3.3 calculation with a built-in Pandas function

'''
    index tells the method which column to group by
    values is the column that we want to apply the calculation to 
    aggfunc specifies the calculation we want to perform 
'''
passenger_survival = titanic_survival.pivot_table(index='Pclass', values='Survived', aggfunc=np.mean)
print(passenger_survival)

# Note: if the aggfunc parameter is omitted, it defaults to the mean
avg_age = titanic_survival.pivot_table(index='Pclass', values='Age')
print(avg_age)
age = titanic_survival.pivot_table(index='Pclass', values='Age', aggfunc=np.mean)
print(age)
        Survived
Pclass          
1       0.629630
2       0.472826
3       0.242363
              Age
Pclass           
1       38.233441
2       29.877630
3       25.140620
              Age
Pclass           
1       38.233441
2       29.877630
3       25.140620
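
Two details of the output above are worth checking on a small example: the mean of a 0/1 column such as `Survived` is the share of ones (the survival rate), and omitting `aggfunc` gives the same table as asking for the mean explicitly. A sketch with invented values:

```python
import pandas as pd

df = pd.DataFrame({
    'Pclass':   [1, 1, 2, 3, 3],
    'Survived': [1, 0, 1, 0, 0],
    'Age':      [40.0, None, 30.0, 20.0, 24.0],
})

# Mean of a 0/1 column per group = fraction of ones in that group.
rate = df.pivot_table(index='Pclass', values='Survived')

# Default aggregation versus an explicit mean.
default_agg = df.pivot_table(index='Pclass', values='Age')
explicit_agg = df.pivot_table(index='Pclass', values='Age', aggfunc='mean')

print(rate['Survived'].tolist())
print(default_agg['Age'].tolist())
print(default_agg.equals(explicit_agg))
```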

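3.5 Grouped calculations relating specified columns

# Group by port of embarkation, then compute the total fare and the total number of survivors for each group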
Fare_survived = titanic_survival.pivot_table(index='Embarked', values=['Fare', 'Survived'], aggfunc=np.sum)
print(Fare_survived)
                Fare  Survived
Embarked                      
C         10072.2962        93
Q          1022.2543        30
S         17439.3988       217
# specifying axis = 1 or axis = 'columns' will drop any columns that have null values
drop_col = titanic_survival.dropna(axis=1)
print(drop_col.head())

# Drop a row (sample) if its Age or Sex value is missing
new_data = titanic_survival.dropna(axis=0, subset=['Age','Sex'])
print(new_data.head())

# The corresponding fillna function fills null values instead of dropping them
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex  SibSp  Parch  \
0                            Braund, Mr. Owen Harris    male      1      0   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female      1      0   
2                             Heikkinen, Miss. Laina  female      0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female      1      0   
4                           Allen, Mr. William Henry    male      0      0   

             Ticket     Fare  
0         A/5 21171   7.2500  
1          PC 17599  71.2833  
2  STON/O2. 3101282   7.9250  
3            113803  53.1000  
4            373450   8.0500  
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  
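
Since `fillna` is only mentioned above, here is a small side-by-side sketch of `dropna` and `fillna`. The toy frame reuses a few Titanic column names, but the values and the fill choices (column mean for Age, the string 'Unknown' for Cabin) are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Pclass': [3, 1, 3],
    'Age':    [22.0, None, 26.0],
    'Sex':    ['male', 'female', None],
    'Cabin':  [None, 'C85', None],
})

# Drop rows where Age or Sex is missing; Cabin is not considered.
rows_kept = df.dropna(axis=0, subset=['Age', 'Sex'])

# Drop every column that contains at least one missing value.
cols_kept = df.dropna(axis=1)

# fillna replaces missing values instead of discarding them;
# a dict gives one fill value per column, and unlisted columns are left alone.
filled = df.fillna({'Age': df['Age'].mean(), 'Cabin': 'Unknown'})

print(rows_kept.index.tolist())
print(cols_kept.columns.tolist())
print(filled['Age'].tolist(), filled['Cabin'].tolist())
```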

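3.6 Locating data

# Pandas locates a specific value by row index label and column name (with the default index, the label equals the row number)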
print(titanic_survival.loc[12,'Age'])
print(titanic_survival.loc[342,'Pclass'])
20.0
2
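
The difference between a label and a position only shows up once the rows have been reordered. A minimal sketch on invented values: `loc` looks up the index label, while `iloc` looks up the position.

```python
import pandas as pd

df = pd.DataFrame({'Age': [22.0, 38.0, 26.0]})    # index labels 0, 1, 2
by_age = df.sort_values('Age', ascending=False)   # labels are now ordered 1, 2, 0

label_lookup = by_age.loc[0, 'Age']        # the row whose label is 0
position_lookup = by_age['Age'].iloc[0]    # the first row by position

print(label_lookup)
print(position_lookup)
```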

3.7 Resetting the index after sorting

new_data = titanic_survival.sort_values('Age', ascending=False)
# Discard the old index and renumber the sorted rows from 0; inplace=True modifies the data directly
new_data.reset_index(drop=True,inplace=True)
print(new_data.head())
   PassengerId  Survived  Pclass                                  Name   Sex  \
0          631         1       1  Barkworth, Mr. Algernon Henry Wilson  male   
1          852         0       3                   Svensson, Mr. Johan  male   
2          494         0       1               Artagaveytia, Mr. Ramon  male   
3           97         0       1             Goldschmidt, Mr. George B  male   
4          117         0       3                  Connors, Mr. Patrick  male   

    Age  SibSp  Parch    Ticket     Fare Cabin Embarked  
0  80.0      0      0     27042  30.0000   A23        S  
1  74.0      0      0    347060   7.7750   NaN        S  
2  71.0      0      0  PC 17609  49.5042   NaN        C  
3  71.0      0      0  PC 17754  34.6542    A5        C  
4  70.5      0      0    370369   7.7500   NaN        Q  
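
The effect of `drop=True` is easiest to see next to the default. In this sketch (invented values), the default keeps the old labels in a new column named `index`, while `drop=True` throws them away.

```python
import pandas as pd

df = pd.DataFrame({'Age': [22.0, 80.0, 35.0]})
by_age = df.sort_values('Age', ascending=False)   # labels are now ordered 1, 2, 0

# drop=True discards the old labels and numbers the rows 0..n-1.
renumbered = by_age.reset_index(drop=True)

# Without drop=True the old labels are preserved in a new 'index' column.
kept = by_age.reset_index()

print(renumbered.index.tolist(), renumbered['Age'].tolist())
print(kept.columns.tolist(), kept['index'].tolist())
```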

3.8 Custom functions

# Define a function that returns the hundredth row's value (index label 99) of a column
def hundredth_data(column):
    data = column.loc[99]
    return data
data = titanic_survival.apply(hundredth_data)
print(data)

# Count the samples with a missing value in each column
def null_count(column):
    col_null = pd.isnull(column)
    null = column[col_null]
    return len(null)

count = titanic_survival.apply(null_count)
print('----------')
print(count)
print(help(pd.isnull))
PassengerId                  100
Survived                       0
Pclass                         2
Name           Kantor, Mr. Sinai
Sex                         male
Age                           34
SibSp                          1
Parch                          0
Ticket                    244367
Fare                          26
Cabin                        NaN
Embarked                       S
dtype: object
----------
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
Help on function isna in module pandas.core.dtypes.missing:

isna(obj)
    Detect missing values for an array-like object.
    
    This function takes a scalar or array-like object and indictates
    whether values are missing (``NaN`` in numeric arrays, ``None`` or ``NaN``
    in object arrays, ``NaT`` in datetimelike).
    
    Parameters
    ----------
    obj : scalar or array-like
        Object to check for null or missing values.
    
    Returns
    -------
    bool or array-like of bool
        For scalar input, returns a scalar boolean.
        For array input, returns an array of boolean indicating whether each
        corresponding element is missing.
    
    See Also
    --------
    notna : boolean inverse of pandas.isna.
    Series.isna : Detetct missing values in a Series.
    DataFrame.isna : Detect missing values in a DataFrame.
    Index.isna : Detect missing values in an Index.
    
    Examples
    --------
    Scalar arguments (including strings) result in a scalar boolean.
    
    >>> pd.isna('dog')
    False
    
    >>> pd.isna(np.nan)
    True
    
    ndarrays result in an ndarray of booleans.
    
    >>> array = np.array([[1, np.nan, 3], [4, 5, np.nan]])
    >>> array
    array([[ 1., nan,  3.],
           [ 4.,  5., nan]])
    >>> pd.isna(array)
    array([[False,  True, False],
           [False, False,  True]])
    
    For indexes, an ndarray of booleans is returned.
    
    >>> index = pd.DatetimeIndex(["2017-07-05", "2017-07-06", None,
    ...                           "2017-07-08"])
    >>> index
    DatetimeIndex(['2017-07-05', '2017-07-06', 'NaT', '2017-07-08'],
                  dtype='datetime64[ns]', freq=None)
    >>> pd.isna(index)
    array([False, False,  True, False])
    
    For Series and DataFrame, the same type is returned, containing booleans.
    
    >>> df = pd.DataFrame([['ant', 'bee', 'cat'], ['dog', None, 'fly']])
    >>> df
         0     1    2
    0  ant   bee  cat
    1  dog  None  fly
    >>> pd.isna(df)
           0      1      2
    0  False  False  False
    1  False   True  False
    
    >>> pd.isna(df[1])
    0    False
    1     True
    Name: 1, dtype: bool

None
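
For this particular task a built-in chain gives the same per-column counts without a custom function. The sketch below compares the two on a toy frame; the column names are borrowed from the Titanic data and the values are invented.

```python
import pandas as pd

df = pd.DataFrame({
    'Age':   [22.0, None, 26.0, None],
    'Cabin': [None, 'C85', None, None],
    'Fare':  [7.25, 71.28, 7.93, 53.1],
})

def null_count(column):
    # Same logic as above: keep the missing entries, then count them.
    return len(column[pd.isnull(column)])

via_apply = df.apply(null_count)   # the function is called once per column
via_sum = df.isnull().sum()        # True counts as 1 when summed per column

print(via_apply.to_dict())
print(via_sum.to_dict())
```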

3.9 Iterating over rows and transforming data

ages = titanic_survival['Age']
print(ages.head())

def which_class (row):
    pclass = row['Pclass']
    if pd.isnull(pclass):
        return 'Unknown'
    elif pclass == 1:
        return 'First Class'
    elif pclass == 2:
        return 'Second Class'
    else:
        return 'Third Class'
    
# With axis=1, apply calls the function once per row, i.e. it iterates over the rows
result = titanic_survival.apply(which_class, axis=1)
print(result.head())

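# The labels stay in Chinese so they match the printed output: 年轻人 = young, 中年人 = middle-aged, 老年人 = elderly
def age_class(row):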
    age = row['Age']
    if pd.isna(age):
        return 'Unknown'
    elif age < 18:
        return '年轻人'
    elif age < 40:
        return '中年人'
    else:
        return '老年人'
age_lable = titanic_survival.apply(age_class, axis=1)
print(age_lable.tail())
0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64
0    Third Class
1    First Class
2    Third Class
3    First Class
4    Third Class
dtype: object
886        中年人
887        中年人
888    Unknown
889        中年人
890        中年人
dtype: object

3.10 Grouping cleverly to relate the data

# Add a new column to the DataFrame
titanic_survival['age_label'] = age_lable
result = titanic_survival.pivot_table(index='age_label', values='Survived')
print(result)
           Survived
age_label          
Unknown    0.293785
中年人        0.383562
年轻人        0.539823
老年人        0.374233
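
A survival rate is easier to judge next to the size of its group. As a sketch on invented rows (with English stand-in labels), `groupby` with two aggregations returns the mean and the row count side by side.

```python
import pandas as pd

df = pd.DataFrame({
    'age_label': ['young', 'young', 'older', 'Unknown', 'older', 'older'],
    'Survived':  [1, 0, 0, 1, 1, 0],
})

# One row per label: the survival rate and the number of passengers behind it.
summary = df.groupby('age_label')['Survived'].agg(['mean', 'count'])

print(summary)
```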
