Common pandas preprocessing methods

  1. Computing a mean when the table contains null values:

    # Assumes `import pandas as pd` and that titanic_survival is the Titanic DataFrame (the loading code is not shown in the source).
    # mean_age is nan here, because any arithmetic that involves a null value also yields a null value.
    mean_age = sum(titanic_survival["Age"]) / len(titanic_survival["Age"])
    print(mean_age)
    

    Output: nan
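The effect is easy to reproduce without the Titanic file. The sketch below uses a small made-up `ages` Series (an assumption, standing in for `titanic_survival["Age"]`) to show that one missing value is enough to turn the hand-computed mean into `nan`, and that `len()` counts the missing row while `count()` does not.

```python
import math
import pandas as pd

# Hypothetical stand-in for titanic_survival["Age"]; None is stored as NaN.
ages = pd.Series([22.0, 38.0, None, 35.0])

# Python's built-in sum() adds every element, so the single NaN poisons the total.
naive_mean = sum(ages) / len(ages)
print(naive_mean)

# len() counts every row; count() counts only the non-null values.
print(len(ages), ages.count())
```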

  2. The correct mean

    age = titanic_survival["Age"]
    # print(age.loc[0:10])
    age_is_null = pd.isnull(age)
    # Filter out the missing values before calculating the mean.
    good_ages = age[~age_is_null]
    # print(good_ages)
    correct_mean_age = sum(good_ages) / len(good_ages)
    print(correct_mean_age)
    

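The filtering step can be sketched on toy data as follows. The `ages` values are made up for illustration; the point is that the boolean mask from `isnull()` selects the rows to discard, and that `dropna()` gives the same result in one call.

```python
import pandas as pd

# Hypothetical stand-in for titanic_survival["Age"].
ages = pd.Series([22.0, 38.0, None, 35.0])

# True where the value is missing.
is_missing = ages.isnull()

# ~ inverts the mask, keeping only the rows that have a value.
good_ages = ages[~is_missing]

# With the missing row gone, the hand-computed mean is well defined.
manual_mean = sum(good_ages) / len(good_ages)
print(manual_mean)

# dropna() is a shorthand for the mask-and-select steps above.
same_result = good_ages.equals(ages.dropna())
print(same_result)
```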

  3. mean()

    # Missing data is so common that many pandas methods filter it out automatically.
    correct_mean_age = titanic_survival["Age"].mean()
    print(correct_mean_age)
    

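The automatic filtering is controlled by the `skipna` parameter of `Series.mean()`, which defaults to `True`. A minimal sketch on made-up data (the `ages` Series is an assumption, not the real column):

```python
import math
import pandas as pd

# Hypothetical stand-in for titanic_survival["Age"].
ages = pd.Series([22.0, 38.0, None, 35.0])

# Default behaviour: missing values are skipped.
mean_default = ages.mean()

# With skipna=False the NaN propagates, as in the hand-written sum()/len() version.
mean_strict = ages.mean(skipna=False)

print(mean_default, mean_strict)
```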

  4. Computing the mean for each category

    # Mean fare for each passenger class
    passenger_classes = [1, 2, 3]
    fares_by_class = {}
    for this_class in passenger_classes:
        # Select the rows that belong to this class, then average their fares.
        pclass_rows = titanic_survival[titanic_survival["Pclass"] == this_class]
        pclass_fares = pclass_rows["Fare"]
        fare_for_class = pclass_fares.mean()
        fares_by_class[this_class] = fare_for_class
    print(fares_by_class)
    

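The explicit loop can also be written with `groupby`, which avoids hard-coding the list of classes. This is an alternative sketch, not the original author's code, and the `passengers` frame holds made-up values.

```python
import pandas as pd

# Hypothetical miniature of the Titanic table.
passengers = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3],
    "Fare": [80.0, 60.0, 20.0, 8.0, 10.0],
})

# Group the rows by class, take the mean fare per group, and convert to a dict
# so the result has the same shape as fares_by_class in the loop version.
fares_by_class = passengers.groupby("Pclass")["Fare"].mean().to_dict()
print(fares_by_class)
```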

  5. Pivot table: survival rate per class

    # index tells the method which column to group by
    # values is the column that the calculation is applied to
    # aggfunc specifies the calculation to perform (the string "mean" avoids needing a numpy import)
    passenger_survival = titanic_survival.pivot_table(index="Pclass", values="Survived", aggfunc="mean")
    print(passenger_survival)
    

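Because `Survived` holds 0 or 1, its mean within a group is the fraction of that group that survived. The sketch below checks this on a tiny made-up table (the values are assumptions chosen so the rates are easy to verify by hand).

```python
import pandas as pd

# Hypothetical miniature of the Titanic table: 2 first-class, 2 second-class, 4 third-class rows.
passengers = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# One row per class; the "Survived" column holds the survival rate for that class.
survival_rate = passengers.pivot_table(index="Pclass", values="Survived", aggfunc="mean")
print(survival_rate)
```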

  6. Mean age per class

    # aggfunc defaults to the mean, so it can be omitted here.
    passenger_age = titanic_survival.pivot_table(index="Pclass", values="Age")
    print(passenger_age)
    


  7. Relating one column to two other columns

    # Total fare and number of survivors for each port of embarkation
    port_stats = titanic_survival.pivot_table(index="Embarked", values=["Fare", "Survived"], aggfunc="sum")
    print(port_stats)
    

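When `values` is a list, the same `aggfunc` is applied to every listed column. If a different calculation is wanted per column, `aggfunc` also accepts a dict mapping column names to functions. The following sketch uses made-up rows; note that the row whose `Embarked` value is missing is left out of the result.

```python
import pandas as pd

# Hypothetical miniature of the Titanic table; the last row has no port recorded.
passengers = pd.DataFrame({
    "Embarked": ["S", "S", "C", "Q", None],
    "Fare": [10.0, 20.0, 50.0, 8.0, 5.0],
    "Survived": [0, 1, 1, 0, 1],
})

# Same aggregation for both columns: total fare and number of survivors per port.
port_stats = passengers.pivot_table(index="Embarked", values=["Fare", "Survived"], aggfunc="sum")
print(port_stats)

# Per-column aggregation: total fare, but survival rate instead of survivor count.
mixed_stats = passengers.pivot_table(
    index="Embarked",
    values=["Fare", "Survived"],
    aggfunc={"Fare": "sum", "Survived": "mean"},
)
print(mixed_stats)
```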

  8. dropna

    # Specifying axis=1 or axis='columns' drops every column that contains a null value.
    drop_na_columns = titanic_survival.dropna(axis=1)
    # axis=0 with subset drops the rows that have a null value in "Age" or "Sex".
    new_titanic_survival = titanic_survival.dropna(axis=0, subset=["Age", "Sex"])
    # print(new_titanic_survival)
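A small made-up frame makes the difference between the two calls visible. The column names mirror the Titanic data, but the values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical frame: only "Pclass" is complete, and only row 0 has both Age and Sex.
df = pd.DataFrame({
    "Age": [22.0, None, 35.0],
    "Sex": ["male", "female", None],
    "Cabin": [None, "C85", None],
    "Pclass": [3, 1, 1],
})

# axis=1 removes every column that has at least one null value.
no_null_columns = df.dropna(axis=1)
print(list(no_null_columns.columns))

# axis=0 with subset removes rows that are null in "Age" or "Sex"; nulls in "Cabin" are ignored.
complete_rows = df.dropna(axis=0, subset=["Age", "Sex"])
print(complete_rows)
```

The surviving rows keep their original index labels, so `complete_rows` still starts at label 0 but may have gaps; `reset_index(drop=True)` renumbers the rows if consecutive labels are needed.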
    
  9. The loc indexer

    # loc selects by index label: the row labelled 83 (or 766) and the named column.
    row_index_83_age = titanic_survival.loc[83, "Age"]
    row_index_766_pclass = titanic_survival.loc[766, "Pclass"]
    print(row_index_83_age)
    print(row_index_766_pclass)
    

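`loc` looks rows up by index label, not by position, which matters once the index is no longer 0, 1, 2, ... (for example after `dropna`). The sketch below contrasts it with the position-based `iloc` on a made-up frame whose labels are deliberately not consecutive.

```python
import pandas as pd

# Hypothetical frame with non-consecutive index labels.
df = pd.DataFrame({"Age": [22.0, 38.0, 26.0, 35.0]}, index=[10, 20, 30, 40])

# loc uses the label 20; iloc uses the position 1. Both point at the same cell here.
by_label = df.loc[20, "Age"]
by_position = df.iloc[1, 0]
print(by_label, by_position)

# A loc slice includes both endpoints; an iloc slice excludes the stop position.
label_slice = df.loc[10:30, "Age"]
position_slice = df.iloc[0:2, 0]
print(len(label_slice), len(position_slice))
```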


Reposted from blog.csdn.net/weixin_42260102/article/details/103428209