Basic introduction to pandas-command template
This article covers the basics of descriptive statistics in pandas: statistical functions, correlation and covariance, unique values, frequency counts, and membership tests. Hopefully it is useful as a quick reference.
Descriptive statistics
pandas statistical functions
pandas statistical functions are mostly reductions: methods that extract a single value (such as a sum or mean) from a Series, or a Series of values from the rows or columns of a DataFrame. Unlike the equivalent methods on NumPy arrays, they have built-in handling of missing data, which is skipped by default.
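- Using sum as an example:
# By default, compute down the rows, giving the sum of each column
df.sum()
df.sum(axis='index')
# Specify axis to compute across the columns instead, giving the sum of each row
df.sum(axis='columns')
# To stop skipping missing data, pass skipna=False; a row containing NA then yields NA
df.sum(axis='columns', skipna=False)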
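The effect of axis and skipna is easier to see on concrete data. The following is a minimal sketch; the DataFrame contents and the variable names are made up for illustration and do not come from the original article.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two columns with a few missing values
df = pd.DataFrame(
    {"one": [1.5, 7.0, np.nan, 0.5], "two": [np.nan, -4.5, np.nan, -1.5]},
    index=["a", "b", "c", "d"],
)

# One value per column; NaN entries are skipped
col_sums = df.sum()

# One value per row; NaN entries are skipped, so the all-NaN row "c" sums to 0.0
row_sums = df.sum(axis="columns")

# With skipna=False, any NaN in a row makes that row's result NaN
row_sums_strict = df.sum(axis="columns", skipna=False)

print(col_sums)
print(row_sums)
print(row_sums_strict)
```

Note that a row made up entirely of missing values sums to 0.0 under the default settings, which may or may not be what you want.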
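Parameters that can be selected: axis (the axis to reduce over, 'index' by default) and skipna (whether to exclude missing values, True by default).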
- Find the index label at which the maximum value occurs: df.idxmax()
- Summary statistics for numeric and non-numeric data: df.describe()
For numeric data, it returns the non-null count, mean, standard deviation, minimum, quartiles, and maximum. For non-numeric data, it returns the count, the number of unique values, the most frequent value, and the frequency of that value.
- Other descriptive statistical methods, such as count(), mean(), median(), std(), var(), min(), max(), and cumsum()
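As a rough illustration of idxmax() and describe(), the sketch below uses a small made-up DataFrame with one numeric and one text column. The column names and values are assumptions chosen only for the example.

```python
import pandas as pd

# Hypothetical data: one numeric column and one non-numeric column
df = pd.DataFrame(
    {"score": [3.0, 9.0, 5.0], "grade": ["c", "a", "a"]},
    index=["x", "y", "z"],
)

# Index label of the row holding the largest score
best = df["score"].idxmax()

# Numeric summary: count, mean, std, min, quartiles, max
numeric_summary = df["score"].describe()

# Non-numeric summary: count, unique, top (most frequent value), freq
text_summary = df["grade"].describe()

print(best)
print(numeric_summary)
print(text_summary)
```

Calling describe() on each column separately keeps the two kinds of summary apart; on a mixed DataFrame, df.describe() reports only the numeric columns unless told otherwise.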
Correlation and covariance
- Correlation
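# Return the matrix of correlation coefficients between the columns of the DataFrame
df.corr()
# Compute the correlation between two specific columns
df['col1'].corr(df['col2'])
# Compute the correlation between one column and every column of the DataFrame
df.corrwith(df['col'])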
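A small worked example may help here. The sketch below assumes three made-up columns, where col2 is an exact multiple of col1 and col3 decreases as col1 increases, so the expected correlations are easy to reason about.

```python
import pandas as pd

# Hypothetical data: col2 = 2 * col1, col3 runs in the opposite direction
df = pd.DataFrame(
    {
        "col1": [1.0, 2.0, 3.0, 4.0],
        "col2": [2.0, 4.0, 6.0, 8.0],
        "col3": [4.0, 3.0, 2.0, 1.0],
    }
)

# Full pairwise correlation matrix (Pearson by default)
corr_matrix = df.corr()

# Correlation between two specific columns: close to 1.0 here
pair = df["col1"].corr(df["col2"])

# Correlation of every column with col1: close to 1.0, 1.0 and -1.0
with_col1 = df.corrwith(df["col1"])

print(corr_matrix)
print(pair)
print(with_col1)
```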
- Covariance
# Return the covariance matrix of the columns of the DataFrame
df.cov()
# Compute the covariance between two specific columns
df['col1'].cov(df['col2'])
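To connect covariance back to correlation, the following sketch (on made-up data) checks that the diagonal of the covariance matrix holds each column's variance and that dividing the covariance by the two standard deviations recovers the Pearson correlation.

```python
import pandas as pd

# Hypothetical data for illustration only
df = pd.DataFrame(
    {"col1": [1.0, 2.0, 3.0, 4.0], "col2": [2.0, 4.0, 6.0, 8.0]}
)

# Covariance matrix; the diagonal holds each column's sample variance
cov_matrix = df.cov()

# Covariance between two specific columns
pair_cov = df["col1"].cov(df["col2"])

# Pearson correlation recovered from covariance and standard deviations
corr_from_cov = pair_cov / (df["col1"].std() * df["col2"].std())

print(cov_matrix)
print(pair_cov, corr_from_cov)
```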
Unique values, frequency statistics, membership
1. Series.unique()
Returns the distinct values of the Series with duplicates removed. The result is in order of first appearance and is not sorted.
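For instance, on a small made-up Series (the values are an assumption for the example), unique() keeps the order in which values first appear, and sorting has to be done explicitly:

```python
import pandas as pd

# Hypothetical data with repeated values
obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

# Distinct values in order of first appearance: c, a, d, b
uniques = obj.unique()

# Sort explicitly if an ordered result is needed
sorted_uniques = sorted(uniques)

print(list(uniques))
print(sorted_uniques)
```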
2. Series.value_counts() / DataFrame.value_counts()
On a Series, counts how many times each distinct value occurs, sorted by frequency in descending order
On a DataFrame, treats each row as a whole and counts how many times each distinct row occurs
Series.value_counts()
df.value_counts()
# Count how many times each value occurs in every column of the DataFrame
df.apply(pd.Series.value_counts).fillna(0)
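The three variants above behave quite differently, which the sketch below tries to show on made-up data. The column names Qu1 and Qu2 and all values are assumptions for the example.

```python
import pandas as pd

# Hypothetical Series with repeated values
obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

# Frequency of each distinct value, most frequent first
counts = obj.value_counts()

# Hypothetical DataFrame in which some rows repeat
df = pd.DataFrame({"Qu1": [1, 3, 4, 3, 4], "Qu2": [2, 3, 1, 3, 1]})

# Each distinct row is counted as a whole; the result has a MultiIndex
row_counts = df.value_counts()

# Per-column counts: rows are the values seen anywhere, columns are the
# original columns; values absent from a column become 0 after fillna
per_column = df.apply(pd.Series.value_counts).fillna(0)

print(counts)
print(row_counts)
print(per_column)
```

In per_column, the row labels are the distinct values found across all columns, so a 0 means that value never appears in that particular column.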
3. Series.isin()
Tests whether each element of the Series is contained in the given collection of values and returns a Boolean Series, which can be used as a mask to filter the data
mask = obj.isin(['b','c'])
obj[mask]
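The snippet above leaves obj undefined, so here is a self-contained sketch of the same filtering pattern. The contents of obj are made up for the example.

```python
import pandas as pd

# Hypothetical data
obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

# Boolean Series: True where the element is "b" or "c"
mask = obj.isin(["b", "c"])

# Keep only the matching elements; original index labels are preserved
selected = obj[mask]

print(mask)
print(selected)
```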
4. Index.get_indexer(): map values to index positions
Index_A.get_indexer(Series_B) returns an integer array giving, for each value in B, its position in Index A (-1 if the value is not found in A)
to_match = pd.Series(['c','a','b','b','c','a'])
unique_vals = pd.Series(['c','b','a'])
indices = pd.Index(unique_vals).get_indexer(to_match)
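To make the result of the call above concrete, the sketch below repeats it with its imports and adds one extra lookup for a value that is not in the index. The value "z" is an assumption added only to show the not-found case.

```python
import pandas as pd

to_match = pd.Series(["c", "a", "b", "b", "c", "a"])
unique_vals = pd.Series(["c", "b", "a"])

# Position of each value of to_match within unique_vals:
# "c" is at 0, "b" at 1, "a" at 2
indices = pd.Index(unique_vals).get_indexer(to_match)
print(indices)  # [0 2 1 1 0 2]

# Hypothetical extra case: a value that is absent maps to -1
missing = pd.Index(unique_vals).get_indexer(["z"])
print(missing)  # [-1]
```

get_indexer() expects the index to contain no duplicate values, which is why it pairs naturally with an array of distinct values.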