pandas: Get summary statistics for each column with describe() (mean, standard deviation, etc.)
Using the describe() method of pandas.DataFrame and pandas.Series, you can obtain summary statistics for each column, such as the mean, standard deviation, maximum, minimum, and mode.
This article covers the following topics.
- Basic usage of describe()
- Specify the target type: include, exclude
- Specify non-numeric columns, such as strings
- Specify all types of columns
- Select or exclude arbitrary types
- Meaning of the items in describe() and the corresponding individual methods
- count: the number of elements
- unique: the number of unique values
- top: mode
- freq: frequency of the mode
- mean: arithmetic mean
- std: standard deviation
- min: minimum value
- max: maximum value
- 50%: Median
- 25%, 75%: first and third quartiles
- Specify the percentiles to calculate
- Calculate frequencies for numeric columns
- Calculate the mean, standard deviation, etc. of numeric strings
- describe() in pandas.Series
- Date and time (datetime64[ns] type)
- Apply describe() to rows
The sample code uses a pandas.DataFrame whose columns each have a different dtype.
import pandas as pd
df = pd.DataFrame({
'a': [1, 2, 1, 3],
'b': [0.4, 1.1, 0.1, 0.8],
'c': ['X', 'Y', 'X', 'Z'],
'd': ['3', '5', '2', '1'],
'e': [True, True, False, True]})
print(df)
# a b c d e
# 0 1 0.4 X 3 True
# 1 2 1.1 Y 5 True
# 2 1 0.1 X 2 False
# 3 3 0.8 Z 1 True
print(df.dtypes)
# a int64
# b float64
# c object
# d object
# e bool
# dtype: object
Basic usage of describe()
Calling the describe() method without any arguments on the example pandas.DataFrame returns the result as a pandas.DataFrame.
print(df.describe())
# a b
# count 4.000000 4.000000
# mean 1.750000 0.600000
# std 0.957427 0.439697
# min 1.000000 0.100000
# 25% 1.000000 0.325000
# 50% 1.500000 0.600000
# 75% 2.250000 0.875000
# max 3.000000 1.100000
print(type(df.describe()))
# <class 'pandas.core.frame.DataFrame'>
Rows and individual elements of the result can be selected with loc and at.
print(df.describe().loc['std'])
# a 0.957427
# b 0.439697
# Name: std, dtype: float64
print(df.describe().at['std', 'b'])
# 0.439696865275764
A pandas.DataFrame can hold columns of various types. By default, only numeric columns (integer int and floating-point float) are selected, and items such as the mean and standard deviation std are calculated. The meaning of each item is explained later.
Because selection is based strictly on dtype, columns of numeric strings, such as column d in the example, are excluded.
Any missing values NaN are excluded from the calculations.
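As a quick illustration with a separate toy Series (not part of the article's example), count reflects only the non-NaN elements:

```python
import pandas as pd
import numpy as np

# A toy Series with one missing value
s = pd.Series([1.0, np.nan, 3.0, 5.0])
print(s.describe())
# count is 3.0 (the NaN is ignored) and mean is (1 + 3 + 5) / 3 = 3.0
```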
Specify the target type: include, exclude
To obtain summary statistics for non-numeric columns as well, set the include and exclude parameters.
include specifies the types to include in the result, and exclude the types to exclude. Note that these specify types, not column names.
Specify non-numeric columns, such as strings
Numeric types are collectively represented by 'number', so exclude='number' computes results for the non-numeric columns, such as string columns.
print(df.describe(exclude='number'))
# c d e
# count 4 4 4
# unique 3 4 2
# top X 3 True
# freq 2 1 3
As the result shows, the calculated items differ from those for numeric columns. Their meaning is explained later.
If the pandas.DataFrame contains no numeric columns, describe() can also be called without any parameters.
df_notnum = df[['c', 'd', 'e']]
print(df_notnum)
# c d e
# 0 X 3 True
# 1 Y 5 True
# 2 X 2 False
# 3 Z 1 True
print(df_notnum.dtypes)
# c object
# d object
# e bool
# dtype: object
print(df_notnum.describe())
# c d e
# count 4 4 4
# unique 3 4 2
# top X 3 True
# freq 2 1 3
Specify all types of columns
With include='all', columns of all types are included.
print(df.describe(include='all'))
# a b c d e
# count 4.000000 4.000000 4 4 4
# unique NaN NaN 3 4 2
# top NaN NaN X 3 True
# freq NaN NaN 2 1 3
# mean 1.750000 0.600000 NaN NaN NaN
# std 0.957427 0.439697 NaN NaN NaN
# min 1.000000 0.100000 NaN NaN NaN
# 25% 1.000000 0.325000 NaN NaN NaN
# 50% 1.500000 0.600000 NaN NaN NaN
# 75% 2.250000 0.875000 NaN NaN NaN
# max 3.000000 1.100000 NaN NaN NaN
However, since numeric columns and columns of other types yield different items, any item that does not apply to a column is filled with the missing value NaN.
Select or exclude arbitrary types
Arbitrary types can be passed to the include and exclude parameters. Even if the result has only one column, a pandas.DataFrame is returned.
print(df.describe(include=int))
# a
# count 4.000000
# mean 1.750000
# std 0.957427
# min 1.000000
# 25% 1.000000
# 50% 1.500000
# 75% 2.250000
# max 3.000000
print(type(df.describe(include=int)))
# <class 'pandas.core.frame.DataFrame'>
Multiple types can be specified in a list. The items to be calculated are determined automatically by the selected types.
print(df.describe(include=[object, bool]))
# c d e
# count 4 4 4
# unique 3 4 2
# top X 3 True
# freq 2 1 3
print(df.describe(exclude=[float, object]))
# a e
# count 4.000000 4
# unique NaN 2
# top NaN True
# freq NaN 3
# mean 1.750000 NaN
# std 0.957427 NaN
# min 1.000000 NaN
# 25% 1.000000 NaN
# 50% 1.500000 NaN
# 75% 2.250000 NaN
# max 3.000000 NaN
'number' and 'all' must be quoted with ' or ", but Python's built-in types such as int and float can be given without quotes, as in include=int; include='int' also works.
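A small check, using a reduced version of the sample DataFrame, that the two spellings select the same columns. Note that the built-in int maps to the platform's default integer dtype, so the comparison assumes a platform where that is int64:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1, 3], 'b': [0.4, 1.1, 0.1, 0.8]})

r1 = df.describe(include=int)    # built-in type, no quotes
r2 = df.describe(include='int')  # dtype name as a string
print(r1.equals(r2))
print(list(r1.columns))
```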
Meaning of the items in describe() and the corresponding individual methods
This section explains the meaning of each item computed by describe() and the methods for calculating an item individually.
You can pick out each item from the describe() result with loc[], but if you do not need the other items, computing them all is wasteful; call the corresponding individual method instead.
As mentioned above, note that the items calculated by describe() differ between numeric and other columns. Also, describe() excludes missing values NaN when calculating every item.
count: the number of elements
Can be counted individually using the count() method.
print(df.count())
# a 4
# b 4
# c 4
# d 4
# e 4
# dtype: int64
unique: the number of unique values
Can be calculated individually using the nunique() method.
print(df.nunique())
# a 3
# b 4
# c 3
# d 4
# e 2
# dtype: int64
top: mode
Can be calculated individually using the mode() method. mode() returns a pandas.DataFrame: when several values tie for the highest frequency, all of them are listed, so the number of modes can differ per column.
print(df.mode())
# a b c d e
# 0 1.0 0.1 X 1 True
# 1 NaN 0.4 NaN 2 NaN
# 2 NaN 0.8 NaN 3 NaN
# 3 NaN 1.1 NaN 5 NaN
The number of modes in each column can be obtained by applying count() to the result of mode(), since count() counts the elements that are not missing values NaN.
print(df.mode().count())
# a 1
# b 4
# c 1
# d 4
# e 1
# dtype: int64
Selecting the first row with iloc[0] gives one mode for each column.
print(df.mode().iloc[0])
# a 1
# b 0.1
# c X
# d 1
# e True
# Name: 0, dtype: object
If a column has multiple modes, describe()'s top item returns only one of them; be aware that it does not always match the first row of mode().
freq: frequency of the mode
Can be counted individually using the value_counts() method of pandas.Series.
value_counts() returns a pandas.Series whose index consists of the unique values and whose data is their frequency (number of occurrences).
By default the result is sorted in descending order of frequency, so the first value of the pandas.Series returned by value_counts() is the frequency of the mode.
print(df['c'].value_counts().iat[0])
# 2
To get the mode's frequency for every column of a pandas.DataFrame, use the apply() method to apply a lambda that returns the first value of the value_counts() result to each column.
print(df.apply(lambda x: x.value_counts().iat[0]))
# a 2
# b 1
# c 2
# d 1
# e 3
# dtype: int64
In value_counts(), the unique values of the original pandas.Series become the index of the result. If those values are numbers, selecting with [number] is interpreted as a label and can raise an error or return the wrong element, so use positional access with iat[number].
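A minimal example, separate from the article's data, of the pitfall: with a numeric index, [number] is label-based, while iat[number] is positional.

```python
import pandas as pd

s = pd.Series([1, 1, 2])
vc = s.value_counts()  # index: [1, 2], data: [2, 1]

print(vc[2])      # 1 -- label-based: the count OF the value 2
print(vc.iat[0])  # 2 -- position-based: the largest count (the mode's frequency)
```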
mean: arithmetic mean
Can be calculated individually using the mean() method.
Setting the parameter numeric_only=True restricts the calculation to numeric columns. The same applies to the methods below.
print(df.mean(numeric_only=True))
# a 1.75
# b 0.60
# e 0.75
# dtype: float64
Columns of type bool are excluded by describe(), but mean() treats them as True=1 and False=0. The same applies to the methods below.
std: standard deviation
The denominator is n - 1, i.e., the unbiased (sample) standard deviation.
Can be calculated individually using the std() method.
print(df.std(numeric_only=True))
# a 0.957427
# b 0.439697
# e 0.500000
# dtype: float64
min: minimum value
It can be calculated separately using the min() method.
print(df.min(numeric_only=True))
# a 1.0
# b 0.1
# e 0.0
# dtype: float64
max: maximum value
Can be calculated separately using the max() method.
print(df.max(numeric_only=True))
# a 3.0
# b 1.1
# e 1.0
# dtype: float64
50%: Median
Also known as the 1/2 quantile or the 50th percentile.
It can be calculated separately with the median() method.
print(df.median(numeric_only=True))
# a 1.5
# b 0.6
# e 1.0
# dtype: float64
25%, 75%: first and third quartiles
Also known as the 25th and 75th percentiles.
Can be calculated individually using the quantile() method. Pass a list of quantiles between 0 and 1 to the q parameter.
print(df.quantile(q=[0.25, 0.75], numeric_only=True))
# a b e
# 0.25 1.00 0.325 0.75
# 0.75 2.25 0.875 1.00
You can also use quantile() to calculate the minimum, maximum, and median all at once.
print(df.quantile(q=[0, 0.25, 0.5, 0.75, 1], numeric_only=True))
# a b e
# 0.00 1.00 0.100 0.00
# 0.25 1.00 0.325 0.75
# 0.50 1.50 0.600 1.00
# 0.75 2.25 0.875 1.00
# 1.00 3.00 1.100 1.00
The percentiles computed by describe() can also be specified with the percentiles parameter, described in the next section.
Specify the percentiles to calculate
As the previous examples show, describe() by default calculates the minimum, the 25th, 50th (median), and 75th percentiles, and the maximum.
The minimum, median, and maximum are always calculated; the other percentiles can be specified with the percentiles parameter as a list of values between 0 and 1.
print(df.describe(percentiles=[0.2, 0.4, 0.6, 0.8]))
# a b
# count 4.000000 4.000000
# mean 1.750000 0.600000
# std 0.957427 0.439697
# min 1.000000 0.100000
# 20% 1.000000 0.280000
# 40% 1.200000 0.480000
# 50% 1.500000 0.600000
# 60% 1.800000 0.720000
# 80% 2.400000 0.920000
# max 3.000000 1.100000
Calculate frequencies for numeric columns
For example, with categorical data where numbers serve as codes, say 0 for male and 1 for female, or numeric IDs for place names, you may want the mode and its frequency rather than the mean and standard deviation, even for numeric data. In that case, convert the columns to strings with the astype() method.
print(df.astype('str').describe())
# a b c d e
# count 4 4 4 4 4
# unique 3 4 3 4 2
# top 1 1.1 X 3 True
# freq 2 1 2 1 3
print(df.astype({'a': str}).describe(exclude='number'))
# a c d e
# count 4 4 4 4
# unique 3 3 4 2
# top 1 X 3 True
# freq 2 2 1 3
Calculate the mean, standard deviation, etc. of numeric strings
Likewise, to calculate the mean or standard deviation of numeric strings, convert them with the astype() method.
print(df.astype({'d': int, 'e': int}).describe())
# a b d e
# count 4.000000 4.000000 4.000000 4.00
# mean 1.750000 0.600000 2.750000 0.75
# std 0.957427 0.439697 1.707825 0.50
# min 1.000000 0.100000 1.000000 0.00
# 25% 1.000000 0.325000 1.750000 0.75
# 50% 1.500000 0.600000 2.500000 1.00
# 75% 2.250000 0.875000 3.500000 1.00
# max 3.000000 1.100000 5.000000 1.00
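As a side note not covered above: astype(int) raises an error when a string is not a valid number. If that can happen, pd.to_numeric() with errors='coerce' is an alternative that turns unparsable strings into NaN, which describe() then excludes (a sketch with hypothetical data):

```python
import pandas as pd

s = pd.Series(['3', '5', 'x', '1'])       # 'x' is not numeric
num = pd.to_numeric(s, errors='coerce')   # 'x' becomes NaN
print(num.describe())
# count is 3.0 and mean is (3 + 5 + 1) / 3 = 3.0
```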
describe() in pandas.Series
pandas.Series also has a describe() method, which returns a pandas.Series.
s_int = df['a']
print(s_int)
# 0 1
# 1 2
# 2 1
# 3 3
# Name: a, dtype: int64
print(s_int.describe())
# count 4.000000
# mean 1.750000
# std 0.957427
# min 1.000000
# 25% 1.000000
# 50% 1.500000
# 75% 2.250000
# max 3.000000
# Name: a, dtype: float64
print(type(s_int.describe()))
# <class 'pandas.core.series.Series'>
The include and exclude parameters are ignored, and the items are determined by the dtype. astype() can be used for type conversion here as well.
s_str = df['d']
print(s_str.describe())
# count 4
# unique 4
# top 3
# freq 1
# Name: d, dtype: object
print(s_str.astype('int').describe())
# count 4.000000
# mean 2.750000
# std 1.707825
# min 1.000000
# 25% 1.750000
# 50% 2.500000
# 75% 3.500000
# max 5.000000
# Name: d, dtype: float64
Date and time (datetime64[ns] type)
For columns of type datetime64[ns], the items first and last are added. (This is the behavior of older pandas versions; newer versions instead treat datetime columns numerically, computing mean, min, max, and percentiles.)
df['dt'] = pd.to_datetime(['2018-01-01', '2018-03-15', '2018-02-20', '2018-03-15'])
print(df.dtypes)
# a int64
# b float64
# c object
# d object
# e bool
# dt datetime64[ns]
# dtype: object
print(df.describe(include='datetime'))
# dt
# count 4
# unique 3
# top 2018-03-15 00:00:00
# freq 2
# first 2018-01-01 00:00:00
# last 2018-03-15 00:00:00
As the names suggest, first is the earliest datetime and last is the latest. They can be calculated individually with min() and max().
print(df['dt'].min())
# 2018-01-01 00:00:00
print(df['dt'].max())
# 2018-03-15 00:00:00
Apply describe() to rows
describe() has no argument for specifying the axis. To get summary statistics for each row, transpose with .T and then call describe().
print(df.T.describe())
# 0 1 2 3
# count 6 6 6 6
# unique 5 6 6 6
# top 1 2018-03-15 00:00:00 2018-02-20 00:00:00 2018-03-15 00:00:00
# freq 2 1 1 1
In pandas, each column has its own dtype, so the basic assumption is that data of the same kind is arranged in columns.
Summary statistics per row are therefore rarely needed. If each row does hold the same kind of data, it is better to transpose the DataFrame, which simplifies not only describe() but many other kinds of processing.
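Although describe() itself has no axis parameter, many of the individual methods do. If only one statistic per row is needed, axis=1 avoids the transpose (a small sketch with toy data):

```python
import pandas as pd

df_num = pd.DataFrame({'x': [1, 2], 'y': [3, 5]})

print(df_num.mean(axis=1))  # row means: 2.0 and 3.5
print(df_num.max(axis=1))   # row maxima: 3 and 5
```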