Summarizing and computing descriptive statistics
pandas objects come equipped with a set of common mathematical and statistical methods. Consider the following simple DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame([[1.4, np.nan], [7.1, 4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=list('abcd'),
                  columns=['one', 'tow'])
#     one  tow
# a  1.40  NaN
# b  7.10  4.5
# c   NaN  NaN
# d  0.75 -1.3
Calling the sum method adds up the values in each column (that is, it sums down the rows) and returns the result as a Series:
df.sum()
# one 9.25
# tow 3.20
# dtype: float64
To sum across the columns instead, giving one total per row, pass the axis parameter:
df.sum(axis = 1)
# a 1.40
# b 11.60
# c 0.00
# d -0.55
# dtype: float64
NaN values are excluded automatically. Passing skipna=False disables this, so that any row (or column) containing at least one NaN produces NaN:
df.sum(axis = 1,skipna = False)
# a NaN
# b 11.60
# c NaN
# d -0.55
# dtype: float64
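The behavior described above can be checked with a short sketch. It rebuilds the same frame; the variable names are illustrative, and the `min_count` argument is an addition not used in the original text, assuming a pandas version that supports it.

```python
import numpy as np
import pandas as pd

# Same data as the example frame above
df = pd.DataFrame(
    [[1.4, np.nan], [7.1, 4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=list("abcd"),
    columns=["one", "tow"],
)

col_totals = df.sum()                       # one total per column, NaN skipped
row_totals = df.sum(axis=1)                 # one total per row, NaN skipped
row_strict = df.sum(axis=1, skipna=False)   # any NaN in a row propagates

# Rows containing at least one NaN are the rows whose strict sum is NaN
rows_with_nan = df.isna().any(axis=1)

# Assumption: min_count=1 requires at least one valid value per row,
# so the all-NaN row "c" yields NaN instead of 0.0
row_min1 = df.sum(axis=1, min_count=1)
```

With the default settings the all-NaN row `c` sums to 0.0, which can hide missing data; `skipna=False` or `min_count=1` are two ways to surface it.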
The idxmax and idxmin methods return the index label at which the maximum or minimum value occurs, computed for each column by default or for each row with axis = 1:
df.idxmax(axis = 1)
# a one
# b one
# c NaN
# d one
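A minimal sketch of idxmax and idxmin along both axes follows. It drops the all-NaN row `c` first, on the assumption that recent pandas versions may warn or raise when a row has no valid values; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.4, np.nan], [7.1, 4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=list("abcd"),
    columns=["one", "tow"],
)

# Keep only rows that have at least one valid value (drops row "c")
valid = df.dropna(how="all")

max_label_per_col = valid.idxmax()       # row label of each column's maximum
min_label_per_col = valid.idxmin()       # row label of each column's minimum
max_col_per_row = valid.idxmax(axis=1)   # column label of each row's maximum

print(max_label_per_col)
print(min_label_per_col)
print(max_col_per_row)
```

Both columns peak in row `b` and bottom out in row `d`, while every remaining row has its largest value in column `one`.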
The describe method produces several summary statistics in a single call:
df.describe(percentiles=[])
# one tow
# count 3.000000 2.000000
# mean 3.083333 1.600000
# std 3.493685 4.101219
# min 0.750000 -1.300000
# 50% 1.400000 1.600000
# max 7.100000 4.500000
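As a rough check, the rows of a describe summary can be compared with the individual methods. This sketch uses the default call, which also reports the 25% and 75% quantiles, and adds a small string Series (hypothetical data) to show the different summary produced for non-numeric values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.4, np.nan], [7.1, 4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=list("abcd"),
    columns=["one", "tow"],
)

desc = df.describe()   # count, mean, std, min, quartiles, max per column

# Each summary row agrees with the corresponding single-statistic method
same_mean = np.allclose(desc.loc["mean"], df.mean())
same_std = np.allclose(desc.loc["std"], df.std())
same_median = np.allclose(desc.loc["50%"], df.median())

# For non-numeric data, describe reports count, unique, top and freq instead
text_desc = pd.Series(list("aabc")).describe()
print(desc)
print(text_desc)
```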
Common descriptive and summary statistics methods:
method | description |
---|---|
count | Number of non-NA values |
describe | Compute a set of summary statistics for a Series or for each DataFrame column |
min,max | Compute minimum and maximum values |
argmin,argmax | Compute the integer index positions at which the minimum or maximum value occurs |
idxmin,idxmax | Compute the index labels at which the minimum or maximum value occurs |
quantile | Compute a sample quantile, for a fraction ranging from 0 to 1 |
sum | Sum of values |
mean | Mean of values |
median | Median (50% quantile) of values |
mad | Mean absolute deviation from the mean (removed in pandas 2.0) |
prod | Product of all values |
var | Sample variance of values |
std | Sample standard deviation of values |
skew | Sample skewness (third moment) of values |
kurt | Sample kurtosis (fourth moment) of values |
cumsum | Cumulative sum of values |
cummin,cummax | Cumulative minimum and maximum of values |
cumprod | Cumulative product of values |
diff | Compute the first arithmetic difference (useful for time series) |
pct_change | Compute percent changes |
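A few of the methods in the table can be tried on a small Series. The data and variable names below are made up for illustration; the values were chosen so the results are easy to verify by hand.

```python
import pandas as pd

# Hypothetical sample data
s = pd.Series([2.0, 4.0, 4.0, 10.0], index=list("wxyz"))

variance = s.var()              # sample variance (divides by n - 1): 12.0
std_dev = s.std()               # square root of the sample variance
running_total = s.cumsum()      # 2, 6, 10, 20
running_product = s.cumprod()   # 2, 8, 32, 320
running_max = s.cummax()        # 2, 4, 4, 10
step = s.diff()                 # NaN, 2, 0, 6
rel_change = s.pct_change()     # NaN, 1.0, 0.0, 1.5
median_q = s.quantile(0.5)      # 4.0
pos_of_max = s.argmax()         # integer position: 3
label_of_max = s.idxmax()       # index label: 'z'
```

Note the difference between the last two lines: argmax reports a position, while idxmax reports the label stored in the index.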
Another class of methods extracts information about the values contained in a Series. First consider the following Series object:
obj = pd.Series(list('cadaabbcc'))
# 0 c
# 1 a
# 2 d
# 3 a
# 4 a
# 5 b
# 6 b
# 7 c
# 8 c
# dtype: object
The first is the unique method, which returns an array of the distinct values in the Series. The result is not sorted; values appear in the order in which they are first observed:
uniques = obj.unique()
# ['c' 'a' 'd' 'b']
The value_counts method computes how many times each value occurs in the Series:
obj.value_counts()
# a 3
# c 3
# b 2
# d 1
# dtype: int64
By default the result is sorted by count in descending order; pass sort = False to disable the sorting.
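The sorting options can be sketched as follows. The `normalize=True` call is an addition not covered in the text above; it returns relative frequencies instead of raw counts. The order of tied values and of the unsorted result can differ between pandas versions, so the sketch makes no claim about it.

```python
import pandas as pd

obj = pd.Series(list("cadaabbcc"))

counts = obj.value_counts()                # sorted by count, descending
unsorted = obj.value_counts(sort=False)    # same counts, no sorting applied
shares = obj.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(unsorted)
print(shares)
```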
isin performs a vectorized set-membership check: it tests each element of the Series against a given collection of values and returns a Boolean Series, which can then be used to filter the data:
obj.isin(['b','c'])
# 0 True
# 1 False
# 2 False
# 3 False
# 4 False
# 5 True
# 6 True
# 7 True
# 8 True
# dtype: bool
obj[obj.isin(['b','c'])]
# 0 c
# 5 b
# 6 b
# 7 c
# 8 c
# dtype: object
The unique-value, value-count, and set-membership methods are summarized below:
method | description |
---|---|
isin | Compute a Boolean array indicating whether each Series value is contained in the passed sequence of values |
match | Compute, for each value in an array, its integer index into another array of distinct values; helpful for data alignment and join-type operations |
unique | Compute an array of the unique values in a Series, returned in the order observed |
value_counts | Return a Series whose index holds the unique values and whose values are their frequencies, sorted by count in descending order |
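The membership methods in the table can be combined as in the sketch below. Because `match` may not be available in the installed pandas version (an assumption), the sketch swaps in `Index.get_indexer`, which performs the same kind of lookup; `to_match` and `unique_vals` are hypothetical example data.

```python
import pandas as pd

obj = pd.Series(list("cadaabbcc"))

mask = obj.isin(["b", "c"])   # True where the value is 'b' or 'c'
kept = obj[mask]              # elements inside the set
dropped = obj[~mask]          # elements outside the set (negated mask)
distinct = obj.unique()       # distinct values in order of first appearance

# Swapped-in technique: position of each value within an array of
# distinct values; values that are not found would be reported as -1
to_match = pd.Series(["c", "a", "b", "b", "c", "a"])
unique_vals = pd.Series(["c", "b", "a"])
positions = pd.Index(unique_vals).get_indexer(to_match)

print(kept.tolist())
print(dropped.tolist())
print(positions)
```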