59_Pandas uses describe to obtain summary statistics for each column (mean, standard deviation, etc.)

Using the describe() method of pandas.DataFrame and pandas.Series, you can obtain summary statistics for each column, such as the mean, standard deviation, maximum, minimum, and mode.

This article covers the following topics.

  • Basic usage of describe()
  • Specify the target type: include, exclude
    • Specify non-numeric columns, such as strings
    • Specify all types of columns
    • Select or exclude arbitrary types
  • Meaning of the items in describe() and the corresponding individual methods
    • count: the number of elements
    • unique: the number of unique values
    • top: mode
    • freq: frequency of the mode (number of occurrences)
    • mean: arithmetic mean
    • std: standard deviation
    • min: minimum value
    • max: maximum value
    • 50%: Median
    • 25%, 75%: first and third quartiles
  • Specify percentile increments
  • Calculate the frequency of numeric columns, etc.
  • Calculate the mean, standard deviation, etc. of numeric strings
  • describe() in pandas.Series
  • Date and time (datetime64[ns] type)
  • Apply describe() to rows

The sample code uses a pandas.DataFrame whose columns have different types dtype.

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1, 3],
                   'b': [0.4, 1.1, 0.1, 0.8],
                   'c': ['X', 'Y', 'X', 'Z'],
                   'd': ['3', '5', '2', '1'],
                   'e': [True, True, False, True]})

print(df)
#    a    b  c  d      e
# 0  1  0.4  X  3   True
# 1  2  1.1  Y  5   True
# 2  1  0.1  X  2  False
# 3  3  0.8  Z  1   True

print(df.dtypes)
# a      int64
# b    float64
# c     object
# d     object
# e       bool
# dtype: object

Basic usage of describe()

Calling the describe() method without any arguments on the example pandas.DataFrame returns the result as a pandas.DataFrame.

print(df.describe())
#               a         b
# count  4.000000  4.000000
# mean   1.750000  0.600000
# std    0.957427  0.439697
# min    1.000000  0.100000
# 25%    1.000000  0.325000
# 50%    1.500000  0.600000
# 75%    2.250000  0.875000
# max    3.000000  1.100000

print(type(df.describe()))
# <class 'pandas.core.frame.DataFrame'>

Rows and individual elements can be selected from the result using loc and at.

print(df.describe().loc['std'])
# a    0.957427
# b    0.439697
# Name: std, dtype: float64

print(df.describe().at['std', 'b'])
# 0.439696865275764

A pandas.DataFrame can contain columns of various types. By default, only numeric columns (integer type int and floating-point type float) are selected, and items such as the mean and the standard deviation std are calculated. The meaning of each item is explained later.

Because selection is based strictly on the type dtype, columns of numeric strings, such as column d in the example, are excluded.

Missing values NaN are excluded from the calculations.
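
As a minimal sketch (the DataFrame here is a hypothetical example, not from the sample data above), a column containing NaN illustrates this:

```python
import pandas as pd
import numpy as np

# Hypothetical column containing a missing value NaN.
df_nan = pd.DataFrame({'a': [1, 2, np.nan, 3]})

# NaN is excluded: count is 3 and the statistics use only the valid values.
print(df_nan.describe())
```

Here count is 3 rather than 4, and mean, std, etc. are computed from the three valid values only.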

Specify the target type: include, exclude

To obtain summary statistics for non-numeric columns, set the parameters include and exclude.

include specifies types to include in the result, and exclude specifies types to exclude from the result. Note that the type is specified, not the column name.

Specify non-numeric columns, such as strings

Numeric types are represented by 'number', so using exclude='number' will compute results for non-numeric columns such as string types.

print(df.describe(exclude='number'))
#         c  d     e
# count   4  4     4
# unique  3  4     2
# top     X  3  True
# freq    2  1     3

As the results show, the calculated items differ from those for numeric columns. The meaning of each item is explained later.

In addition, if the pandas.DataFrame contains no numeric columns, describe() can be called without any parameters.

df_notnum = df[['c', 'd', 'e']]
print(df_notnum)
#    c  d      e
# 0  X  3   True
# 1  Y  5   True
# 2  X  2  False
# 3  Z  1   True

print(df_notnum.dtypes)
# c    object
# d    object
# e      bool
# dtype: object

print(df_notnum.describe())
#         c  d     e
# count   4  4     4
# unique  3  4     2
# top     X  3  True
# freq    2  1     3

Specify all types of columns

With include='all', columns of all types are included.

print(df.describe(include='all'))
#                a         b    c    d     e
# count   4.000000  4.000000    4    4     4
# unique       NaN       NaN    3    4     2
# top          NaN       NaN    X    3  True
# freq         NaN       NaN    2    1     3
# mean    1.750000  0.600000  NaN  NaN   NaN
# std     0.957427  0.439697  NaN  NaN   NaN
# min     1.000000  0.100000  NaN  NaN   NaN
# 25%     1.000000  0.325000  NaN  NaN   NaN
# 50%     1.500000  0.600000  NaN  NaN   NaN
# 75%     2.250000  0.875000  NaN  NaN   NaN
# max     3.000000  1.100000  NaN  NaN   NaN

However, since different items are calculated for numeric columns and for other types, items that are not calculated for a given column are filled with the missing value NaN.

Select or exclude arbitrary types

Arbitrary types can be specified for the parameters include and exclude. Even if the result has only one column, a pandas.DataFrame is returned.

print(df.describe(include=int))
#               a
# count  4.000000
# mean   1.750000
# std    0.957427
# min    1.000000
# 25%    1.000000
# 50%    1.500000
# 75%    2.250000
# max    3.000000

print(type(df.describe(include=int)))
# <class 'pandas.core.frame.DataFrame'>

Multiple types can be specified as a list. The items to be calculated are determined automatically by the selected types.

print(df.describe(include=[object, bool]))
#         c  d     e
# count   4  4     4
# unique  3  4     2
# top     X  3  True
# freq    2  1     3

print(df.describe(exclude=[float, object]))
#                a     e
# count   4.000000     4
# unique       NaN     2
# top          NaN  True
# freq         NaN     3
# mean    1.750000   NaN
# std     0.957427   NaN
# min     1.000000   NaN
# 25%     1.000000   NaN
# 50%     1.500000   NaN
# 75%     2.250000   NaN
# max     3.000000   NaN

'number' and 'all' must be quoted with ' or ", but Python's standard built-in types such as int and float can be specified without quotes, as in include=int. The string form include='int' also works.
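
As a small sketch (note that matching of the string 'int' to a concrete dtype can be platform-dependent), both forms select the example's int64 column a:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1, 3],
                   'b': [0.4, 1.1, 0.1, 0.8],
                   'c': ['X', 'Y', 'X', 'Z'],
                   'd': ['3', '5', '2', '1'],
                   'e': [True, True, False, True]})

# Both forms should select the same int64 column.
print(df.describe(include=int))
print(df.describe(include='int'))
```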

Meaning of the items in describe() and the corresponding individual methods

This section explains the meaning of each item computed by describe() and the methods for calculating each item individually.

Each item's value can be selected with loc[] from the pandas.DataFrame returned by describe(), but if you do not need the other items, computing them all is wasteful; in that case, use the individual method instead.

As noted above, the items calculated by describe() differ between numeric columns and other types. Also, describe() excludes missing values NaN when calculating every item.

count: the number of elements

Can be counted individually using the count() method.

print(df.count())
# a    4
# b    4
# c    4
# d    4
# e    4
# dtype: int64

unique: the number of unique values

Can be calculated individually using the nunique() method.

print(df.nunique())
# a    3
# b    4
# c    3
# d    4
# e    2
# dtype: int64

top: mode

Can be calculated individually using the mode() method. mode() returns a pandas.DataFrame; if multiple values tie for the highest frequency, there are multiple modes, and the number of modes can differ per column.

print(df.mode())
#      a    b    c  d     e
# 0  1.0  0.1    X  1  True
# 1  NaN  0.4  NaN  2   NaN
# 2  NaN  0.8  NaN  3   NaN
# 3  NaN  1.1  NaN  5   NaN

The number of modes in each column can be obtained by applying the count() method to the result of mode(), since count() counts the elements that are not missing values NaN.

print(df.mode().count())
# a    1
# b    4
# c    1
# d    4
# e    1
# dtype: int64

If iloc[0] is used to select the first row, at least one mode is obtained for each column.

print(df.mode().iloc[0])
# a       1
# b     0.1
# c       X
# d       1
# e    True
# Name: 0, dtype: object

If the top item of describe() has multiple candidate modes, only one of them is returned; be aware that it does not always match the first row of mode().
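
A small sketch with a tie makes this caveat concrete (the Series here is a hypothetical example):

```python
import pandas as pd

# Hypothetical Series in which two values tie for the highest frequency.
s = pd.Series(['X', 'Y', 'X', 'Y'])

print(s.mode())             # both 'X' and 'Y' are modes
print(s.describe()['top'])  # describe() reports only one of them
```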

freq: frequency of the mode (number of occurrences)

Can be counted individually using the value_counts() method of pandas.Series.

value_counts() returns a pandas.Series whose index consists of the unique values and whose data is their frequencies (numbers of occurrences).

By default the result is sorted in descending order of frequency, so the first value of the pandas.Series returned by value_counts() is the frequency of the mode.

print(df['c'].value_counts().iat[0])
# 2

If you want the mode's frequency for each column of a pandas.DataFrame, use the apply() method to apply, to each column, an anonymous function that returns the first value of the value_counts() result.

print(df.apply(lambda x: x.value_counts().iat[0]))
# a    2
# b    1
# c    2
# d    1
# e    3
# dtype: int64

In value_counts(), the elements of the original pandas.Series become the index of the result. Because the index consists of element values, selecting the first value with [0] can raise an error (0 may be interpreted as an index label), so select strictly by position with iat[0].
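
Since the highest frequency is simply the maximum of the counts, an equivalent sketch uses max() instead of positional selection:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1, 3],
                   'b': [0.4, 1.1, 0.1, 0.8],
                   'c': ['X', 'Y', 'X', 'Z'],
                   'd': ['3', '5', '2', '1'],
                   'e': [True, True, False, True]})

# max() of the counts gives the mode's frequency directly,
# without relying on the sort order of value_counts().
print(df.apply(lambda x: x.value_counts().max()))
```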

mean: arithmetic mean

Can be calculated individually using the mean() method.

With the parameter numeric_only=True, only numeric columns are processed. The same applies to the methods below.

print(df.mean(numeric_only=True))
# a    1.75
# b    0.60
# e    0.75
# dtype: float64

Columns of type bool are excluded by default in describe(), but are treated as True=1, False=0 in mean(). The same applies to the methods below.

std: standard deviation

The denominator is n-1 (the unbiased estimator), which is also called the sample standard deviation.

Can be calculated individually using the std() method.

print(df.std(numeric_only=True))
# a    0.957427
# b    0.439697
# e    0.500000
# dtype: float64
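
As a sketch of the n-1 denominator, compare the default ddof=1 with ddof=0 (the population standard deviation, dividing by n) on column a:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1, 3]})

print(df['a'].std())        # default ddof=1: denominator n - 1
# ≈ 0.9574

print(df['a'].std(ddof=0))  # ddof=0: denominator n
# ≈ 0.8292
```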

min: minimum value

It can be calculated separately using the min() method.

print(df.min(numeric_only=True))
# a    1.0
# b    0.1
# e    0.0
# dtype: float64

max: maximum value

Can be calculated separately using the max() method.

print(df.max(numeric_only=True))
# a    3.0
# b    1.1
# e    1.0
# dtype: float64

50%: Median

Also known as the 1/2 quantile or the 50th percentile.

It can be calculated separately with the median() method.

print(df.median(numeric_only=True))
# a    1.5
# b    0.6
# e    1.0
# dtype: float64

25%, 75%: first and third quartiles

Also known as 1st quartile, 3rd quartile, 25th percentile, 75th percentile, etc.

Can be calculated individually using the quantile() method. Specify the quantiles to calculate with the parameter q as a list of values from 0 to 1.

print(df.quantile(q=[0.25, 0.75], numeric_only=True))
#          a      b     e
# 0.25  1.00  0.325  0.75
# 0.75  2.25  0.875  1.00

You can also use quantile() to calculate the minimum, maximum, and median all at once.

print(df.quantile(q=[0, 0.25, 0.5, 0.75, 1], numeric_only=True))
#          a      b     e
# 0.00  1.00  0.100  0.00
# 0.25  1.00  0.325  0.75
# 0.50  1.50  0.600  1.00
# 0.75  2.25  0.875  1.00
# 1.00  3.00  1.100  1.00

The percentiles to be calculated can also be specified in describe() using the percentiles parameter, described next.

Specify percentile increments

As the previous examples show, describe() by default calculates the minimum (0th percentile), the 25th, 50th (median), and 75th percentiles, and the maximum (100th percentile).

The minimum, median, and maximum are always calculated; the other percentiles can be specified with the parameter percentiles as a list of values from 0 to 1.

print(df.describe(percentiles=[0.2, 0.4, 0.6, 0.8]))
#               a         b
# count  4.000000  4.000000
# mean   1.750000  0.600000
# std    0.957427  0.439697
# min    1.000000  0.100000
# 20%    1.000000  0.280000
# 40%    1.200000  0.480000
# 50%    1.500000  0.600000
# 60%    1.800000  0.720000
# 80%    2.400000  0.920000
# max    3.000000  1.100000

Calculate the frequency of numeric columns, etc.

For example, with categorical data in which 0 represents male and 1 female, or in which numbers are assigned to place names, you may want the mode and its frequency rather than the mean and standard deviation even though the data is numeric. In that case, convert the columns to strings with astype().

print(df.astype('str').describe())
#         a    b  c  d     e
# count   4    4  4  4     4
# unique  3    4  3  4     2
# top     1  1.1  X  3  True
# freq    2    1  2  1     3

print(df.astype({'a': str}).describe(exclude='number'))
#         a  c  d     e
# count   4  4  4     4
# unique  3  3  4     2
# top     1  X  3  True
# freq    2  2  1     3

Calculate the mean, standard deviation, etc. of numeric strings

Likewise, to calculate the mean or standard deviation of numeric strings, convert them with the astype() method.

print(df.astype({'d': int, 'e': int}).describe())
#               a         b         d     e
# count  4.000000  4.000000  4.000000  4.00
# mean   1.750000  0.600000  2.750000  0.75
# std    0.957427  0.439697  1.707825  0.50
# min    1.000000  0.100000  1.000000  0.00
# 25%    1.000000  0.325000  1.750000  0.75
# 50%    1.500000  0.600000  2.500000  1.00
# 75%    2.250000  0.875000  3.500000  1.00
# max    3.000000  1.100000  5.000000  1.00

describe() in pandas.Series

pandas.Series also has a describe() method, which returns a pandas.Series.

s_int = df['a']
print(s_int)
# 0    1
# 1    2
# 2    1
# 3    3
# Name: a, dtype: int64

print(s_int.describe())
# count    4.000000
# mean     1.750000
# std      0.957427
# min      1.000000
# 25%      1.000000
# 50%      1.500000
# 75%      2.250000
# max      3.000000
# Name: a, dtype: float64

print(type(s_int.describe()))
# <class 'pandas.core.series.Series'>

The parameters include and exclude are ignored; the items are determined by the type dtype. Use astype() for type conversion if needed.

s_str = df['d']
print(s_str.describe())
# count     4
# unique    4
# top       3
# freq      1
# Name: d, dtype: object

print(s_str.astype('int').describe())
# count    4.000000
# mean     2.750000
# std      1.707825
# min      1.000000
# 25%      1.750000
# 50%      2.500000
# 75%      3.500000
# max      5.000000
# Name: d, dtype: float64

Date and time (datetime64[ns] type)

For columns of type datetime64[ns], the items first and last are added.

df['dt'] = pd.to_datetime(['2018-01-01', '2018-03-15', '2018-02-20', '2018-03-15'])

print(df.dtypes)
# a              int64
# b            float64
# c             object
# d             object
# e               bool
# dt    datetime64[ns]
# dtype: object

print(df.describe(include='datetime'))
#                          dt
# count                     4
# unique                    3
# top     2018-03-15 00:00:00
# freq                      2
# first   2018-01-01 00:00:00
# last    2018-03-15 00:00:00

Literally, first is the earliest datetime and last is the latest. They can be calculated individually with min() and max().

print(df['dt'].min())
# 2018-01-01 00:00:00

print(df['dt'].max())
# 2018-03-15 00:00:00

Apply describe() to rows

describe() has no argument for specifying the row or column axis. To get summary statistics for rows, transpose with .T and then call describe(); the result corresponds to the original rows.

print(df.T.describe())
#         0                    1                    2                    3
# count   6                    6                    6                    6
# unique  5                    6                    6                    6
# top     1  2018-03-15 00:00:00  2018-02-20 00:00:00  2018-03-15 00:00:00
# freq    2                    1                    1                    1

In pandas, each column has a type dtype, so columns are basically assumed to hold data of the same kind.

Therefore, getting summary statistics per row is rarely necessary; if every row holds the same kind of data, it is better to transpose, which makes all kinds of processing easier, not just describe().
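
As a sketch, keeping only the numeric columns a and b from the example and transposing yields numeric statistics per original row (df_num is a hypothetical name):

```python
import pandas as pd

# Only the numeric columns a and b from the example are kept here,
# so after transposing, describe() computes mean, std, etc. per original row.
df_num = pd.DataFrame({'a': [1, 2, 1, 3],
                       'b': [0.4, 1.1, 0.1, 0.8]})

print(df_num.T.describe())
```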

Origin blog.csdn.net/qq_18351157/article/details/130069107