pandas from beginner to proficient: descriptive statistics


This article covers the basics of descriptive statistics in pandas: the statistical (reduction) methods, correlation and covariance, unique values, value counts, and membership tests. Hopefully it is useful to anyone who needs it.

Descriptive statistics

pandas statistical functions

Most of these are reductions or summary statistics: methods that extract a single value (such as the sum or mean) from a Series, or a Series of values from the rows or columns of a DataFrame. Unlike the equivalent methods on NumPy arrays, they have built-in handling of missing data, which is skipped by default.

  • Take sum as an example
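# By default the calculation runs down the rows, giving the sum of each column
df.sum()
df.sum(axis='index')
# Specify axis to run across the columns instead, giving the sum of each row
df.sum(axis='columns')

# To stop skipping missing data (any NA in a row makes that row's result NA), pass skipna=False
df.sum(axis='columns', skipna=False)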
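
To make the effect of axis and skipna concrete, here is a minimal sketch on a small made-up DataFrame (the column names and values are assumptions for illustration only). It also shows the NumPy behaviour that pandas differs from.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data containing missing values
df = pd.DataFrame(
    {"one": [1.0, 2.0, np.nan], "two": [np.nan, 4.0, np.nan]},
    index=["a", "b", "c"],
)

# Sum of each column; NaN is skipped
col_sums = df.sum()

# Sum of each row; a row that is entirely NaN sums to 0.0
row_sums = df.sum(axis="columns")

# With skipna=False, any NaN in a row makes that row's result NaN
row_sums_strict = df.sum(axis="columns", skipna=False)

# The plain NumPy sum propagates NaN instead of skipping it
numpy_total = np.sum(df["one"].to_numpy())
```

Here row "c" contains only missing values, so it sums to 0.0 by default and to NaN with skipna=False.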

Commonly used parameters include axis (the axis to reduce over, 'index' or 'columns') and skipna (whether to exclude missing values, True by default).

  • Find the index label at which the maximum value occurs: df.idxmax()
  • Summary statistics for numeric and non-numeric data: df.describe()
    For numeric columns it returns the non-null count, mean, standard deviation, minimum, quartiles and maximum. For non-numeric columns it returns the count, the number of unique values, the most frequent value and its frequency.
  • Other descriptive statistical methods
    These include count, min, max, idxmin, quantile, mean, median, var, std, skew, kurt, cumsum, cumprod, diff and pct_change.
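
The methods in this list can be tried on a small example. The sketch below uses a made-up DataFrame (the column names "score" and "grade" are assumptions, not from the original article) to show what idxmax, describe and one of the cumulative methods return.

```python
import pandas as pd

# Hypothetical data: one numeric column and one non-numeric column
df = pd.DataFrame(
    {"score": [3.0, 9.0, 5.0], "grade": ["b", "a", "b"]},
    index=["x", "y", "z"],
)

# Index label of the row holding the largest score
best = df["score"].idxmax()

# Numeric summary: count, mean, std, min, 25%, 50%, 75%, max
numeric_summary = df["score"].describe()

# Non-numeric summary: count, unique, top, freq
text_summary = df["grade"].describe()

# Cumulative sum, one of the other descriptive methods
running_total = df["score"].cumsum()
```

Note that describe reports the standard deviation (std), not the variance. Use var when the variance is needed.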

Correlation and covariance

  • Correlation
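# Return the correlation coefficients between all pairs of columns in the DataFrame
df.corr()

# Compute the correlation between two specific columns
df['col1'].corr(df['col2'])

# Compute the correlation between one column and every column of the DataFrame
df.corrwith(df['col'])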
  • Covariance
# Return the covariances between all pairs of columns in the DataFrame
df.cov()

# Compute the covariance between two specific columns
df['col1'].cov(df['col2'])
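
As a quick check of these calls, the following sketch uses made-up columns with an exact linear relationship, so the expected correlations are easy to reason about. The column names x, y and z are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data: y is exactly 2 * x, and z falls linearly as x rises
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0],
    "z": [4.0, 3.0, 2.0, 1.0],
})

corr_matrix = df.corr()           # pairwise Pearson correlations, 3 x 3
xy_corr = df["x"].corr(df["y"])   # perfectly positively correlated
xz_corr = df["x"].corr(df["z"])   # perfectly negatively correlated
with_x = df.corrwith(df["x"])     # every column correlated against x
xy_cov = df["x"].cov(df["y"])     # sample covariance (divides by n - 1)
```

The correlation is unit-free and lies between -1 and 1, while the covariance depends on the scale of the data: here it is 10/3 because y is twice x.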


Unique values, frequency statistics, membership

1. Series.unique()

Returns the distinct values in the Series, in order of first appearance. The result is not sorted.
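
A minimal sketch of this behaviour, using a made-up Series (the values are an assumption for illustration):

```python
import pandas as pd

# Hypothetical Series with repeated values
obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

# Distinct values, kept in the order they first appear: c, a, d, b
uniques = obj.unique()

# Sort afterwards if an ordered result is needed
sorted_uniques = sorted(uniques)
```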

2. Series/DataFrame/array.value_counts()

For a Series, it counts how many times each distinct value occurs, sorted by count in descending order.
For a DataFrame, it treats each row as a unit and counts how many times each distinct row occurs.

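import pandas as pd

s.value_counts()     # s is a Series
df.value_counts()    # df is a DataFrame

# Count how many times each value occurs in each column of the DataFrame
# (pd.Series.value_counts is used because the top-level pd.value_counts is deprecated in recent pandas versions)
df.apply(pd.Series.value_counts).fillna(0)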
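
The three forms can be compared side by side. This is a minimal sketch on made-up data (the names obj, q1 and q2 are assumptions), using pd.Series.value_counts for the per-column count.

```python
import pandas as pd

# Hypothetical Series: "a" and "c" occur three times, "b" twice, "d" once
obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])
counts = obj.value_counts()

# Hypothetical DataFrame: the row (3, 3) appears twice
df = pd.DataFrame({"q1": [1, 3, 4, 3, 4], "q2": [2, 3, 1, 3, 4]})

# Each row is treated as a unit; the result is indexed by (q1, q2) pairs
row_counts = df.value_counts()

# Per-column counts: rows are the distinct values, columns are q1 and q2.
# Values missing from a column become NaN, which fillna(0) turns into 0.
per_column = df.apply(pd.Series.value_counts).fillna(0)
```

In per_column the value 2 never occurs in q1, so that cell is 0 after fillna.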

3. Series.isin()

Tests whether each element of the Series is contained in the passed sequence of values and returns a Boolean Series, which can be used as a mask to filter the data.

mask = obj.isin(['b','c'])
obj[mask]
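
The snippet above assumes obj already exists. A self-contained sketch, with a made-up Series standing in for obj, looks like this:

```python
import pandas as pd

# Hypothetical Series standing in for obj
obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

# True where the element is "b" or "c", False elsewhere
mask = obj.isin(["b", "c"])

# Keep only the matching elements; the original index labels are preserved
selected = obj[mask]
```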

4. Index.get_indexer(): converting values to index positions

Index_A.get_indexer(Series_B) returns an integer array giving, for each value in B, its position in index A. Index A must contain only unique values.

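import pandas as pd

to_match = pd.Series(['c','a','b','b','c','a'])
unique_vals = pd.Series(['c','b','a'])

# Position of each to_match value within unique_vals: array([0, 2, 1, 1, 0, 2])
indices = pd.Index(unique_vals).get_indexer(to_match)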
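
One detail worth knowing is what happens when a value has no match. In the sketch below (the value "z" is a made-up example), get_indexer marks the missing value with -1, which can then be used to filter it out.

```python
import pandas as pd

unique_vals = pd.Index(["c", "b", "a"])

# Hypothetical input: "z" does not appear in unique_vals
to_match = pd.Series(["c", "a", "b", "z"])

# Position of each value in unique_vals; -1 means "not found"
positions = unique_vals.get_indexer(to_match)

# Keep only the values that were found
found = to_match[positions != -1]
```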


Origin blog.csdn.net/qq_48081868/article/details/132512380