pandas data analysis (4) data grouping

    This article again uses Paddle's practice questions and dataset to work through data grouping in pandas. The exercises come from "these ten sets of exercises to teach you how to use Pandas for data analysis - Flying Paddle AI Studio"; thanks to them for providing the dataset. The article follows the practice questions in order, but I will not simply read from the script: there are some extensions and pitfalls along the way.

Table of contents

1. Read data 

2. Data analysis

1. Average beer consumption per continent

2. Average beer consumption in all countries

3. Descriptive statistics

4. The average consumption of various beverages in each continent

5. Median Consumption of Various Beverages in Each Continent

6. Average, maximum and minimum consumption of spirit drinks per continent

7. Pitfalls to avoid

3. Complete code


 

1. Read data 

    Three steps: import the pandas library, set the maximum rows and columns to display, and read the data:

import pandas as pd

# Set the display width and related options so printed output is not truncated with ellipses
pd.set_option('display.width', 1000)
# Show all columns
pd.set_option('display.max_columns', None)
# Show all rows
pd.set_option('display.max_rows', None)
# Read the CSV file
path1 = "exercise_data/drinks.csv"
data = pd.read_csv(path1)

    After reading the data, call head() first to see what each column represents. The columns are country, beer servings, spirit servings, wine servings, litres of pure alcohol, and continent.
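    As a rough sketch of this first inspection step, the snippet below reads a tiny in-memory CSV instead of the real file. The three rows are made up for illustration, and the full set of column names is an assumption about the layout of drinks.csv (the article itself only uses 'beer_servings', 'spirit_servings' and 'continent').

```python
import io
import pandas as pd

# Made-up rows that mimic the assumed column layout of drinks.csv (not real data).
csv_text = """country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
Aland,10,20,30,1.5,EU
Bland,40,50,60,4.5,AS
Cland,70,80,90,7.5,AF
"""
sample = pd.read_csv(io.StringIO(csv_text))

# head(n) returns the first n rows; dtypes shows which columns are numeric.
print(sample.head(2))
print(sample.dtypes)

column_names = list(sample.columns)
```

    Checking dtypes early is useful here, because the later grouping steps only make sense for the numeric columns.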


2. Data analysis

1. Average beer consumption per continent

     Group by 'continent' and take the mean of 'beer_servings':

print(data.groupby('continent')['beer_servings'].mean())
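    To make it clearer what this call computes, here is a minimal sketch on a made-up frame (the values are invented, not taken from drinks.csv). Each continent label becomes one group, and mean() is applied to the selected column within each group.

```python
import pandas as pd

# Invented values, only to illustrate the split-apply-combine pattern.
toy = pd.DataFrame({
    "continent": ["EU", "EU", "AS", "AS"],
    "beer_servings": [100, 200, 10, 30],
})

# One row per continent; the group keys become the index, sorted by default.
beer_mean = toy.groupby("continent")["beer_servings"].mean()
print(beer_mean)
```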

2. Average beer consumption in all countries

    Take the mean of 'beer_servings' directly, rounded to the nearest integer:

print(round(data['beer_servings'].mean()))

3. Descriptive statistics

    For example, to print descriptive statistics for beer consumption per continent:

print(data.groupby('continent')['beer_servings'].describe())

    The printed descriptive statistics include the count, mean, standard deviation, minimum, 25th percentile, 50th percentile (median), 75th percentile, and maximum.
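    The statistics listed above map directly onto the columns of the describe() result. The sketch below shows this on a made-up frame (invented values), where individual statistics can then be picked out by column name.

```python
import pandas as pd

# Invented values for illustration only.
toy = pd.DataFrame({
    "continent": ["EU", "EU", "EU", "AS", "AS"],
    "beer_servings": [100, 200, 300, 10, 30],
})

stats = toy.groupby("continent")["beer_servings"].describe()
print(stats)

# The result is an ordinary DataFrame, so single statistics can be selected.
stat_names = list(stats.columns)
eu_median = stats.loc["EU", "50%"]
```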


4. The average consumption of various beverages in each continent

print(data.groupby('continent').mean())

    Note that this raises a FutureWarning:

FutureWarning: The default value of numeric_only in DataFrameGroupBy.mean is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
  print(data.groupby('continent').mean())

    This means that mean() currently defaults to numeric_only=True, so it only applies to numeric columns. A later version will change the default to False, so you must either select only numeric columns or pass numeric_only=True explicitly. Here we pass it explicitly:

print(data.groupby('continent').mean(numeric_only=True))
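    The warning mentions two ways out: specify numeric_only, or select the valid columns yourself. The sketch below shows both on a made-up frame (invented values); using select_dtypes to find the numeric columns is just one possible way to do the selection.

```python
import pandas as pd

# Invented values; "country" is a text column that cannot be averaged.
toy = pd.DataFrame({
    "country": ["Aland", "Bland", "Cland"],
    "continent": ["EU", "EU", "AS"],
    "beer_servings": [100, 200, 30],
    "wine_servings": [50, 150, 10],
})

# Option 1: state explicitly that only numeric columns should be averaged.
by_flag = toy.groupby("continent").mean(numeric_only=True)

# Option 2: select the numeric columns yourself before aggregating.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
by_selection = toy.groupby("continent")[numeric_cols].mean()

print(by_flag)
print(by_selection)
```

    Both options give the same table here; the second makes the set of aggregated columns visible in the code.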

5. Median Consumption of Various Beverages in Each Continent

print(data.groupby('continent').median(numeric_only=True))


6. Average, maximum and minimum consumption of spirit drinks per continent

print(data.groupby('continent')['spirit_servings'].agg(['mean', 'min', 'max']))
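    As an optional variation on the agg call above, pandas also accepts keyword arguments to name the output columns. The sketch below uses a made-up frame, and the column names spirit_mean, spirit_min and spirit_max are hypothetical names chosen for this example.

```python
import pandas as pd

# Invented values for illustration only.
toy = pd.DataFrame({
    "continent": ["EU", "EU", "AS"],
    "spirit_servings": [100, 300, 20],
})

# Each keyword is an output column name; each value is the aggregation to apply.
summary = toy.groupby("continent")["spirit_servings"].agg(
    spirit_mean="mean",
    spirit_min="min",
    spirit_max="max",
)
print(summary)
```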


7. Pitfalls to avoid

    At this point we have worked through all the exercises. But have you noticed a very big problem? Aren't there seven continents on Earth? Let's count on our fingers: Asia, Europe, Africa, Oceania, Antarctica, South America, North America. So why do we only get 5 groups after grouping by continent? We can set Antarctica aside: it has no permanent population, and it does not appear in the dataset. But North America is missing as well. So where did North America go?


    In the data, North America is abbreviated as NA. When pandas reads the CSV, it treats the string NA as a missing value (NaN) by default, and groupby then drops the rows whose key is missing. So how do we solve this? pandas offers two tools here: the dropna parameter of groupby, and fillna. With dropna=False it looks like this:

print(data.groupby('continent', dropna=False)['spirit_servings'].agg(['mean', 'min', 'max']))

     With this, pandas no longer drops the rows with a missing key, but it displays that group as NaN.
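    The effect of dropna can be sketched on a made-up frame (invented values) in which two rows have a missing continent. By default those rows vanish from the result; with dropna=False they form their own group keyed by NaN.

```python
import numpy as np
import pandas as pd

# Invented values; two rows have a missing continent.
toy = pd.DataFrame({
    "continent": ["EU", np.nan, np.nan],
    "spirit_servings": [20, 100, 300],
})

# Default behaviour: rows whose group key is missing are silently dropped.
default_groups = toy.groupby("continent")["spirit_servings"].mean()

# dropna=False keeps them as a separate group whose key is NaN.
with_nan = toy.groupby("continent", dropna=False)["spirit_servings"].mean()

print(default_groups)
print(with_nan)
```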


    So how do we get it displayed correctly as NA? Use fillna:

print(data.fillna({'continent': 'NA'}).groupby('continent')['spirit_servings'].agg(['mean', 'min', 'max']))


    Note: In many cases we cannot simply discard missing values, and of course we cannot always just relabel them as NA either. This case is special because the abbreviation for North America, NA, happens to coincide with one of the strings that pandas treats as a missing value when reading a file.
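    Since the root cause is how the file is parsed, another possible fix (not used in this article) is to tell read_csv not to treat NA as missing in the first place. The sketch below uses a made-up in-memory CSV; keep_default_na and na_values are real read_csv parameters, but the choice to keep only the empty string as a missing marker is an assumption that may not suit every file.

```python
import io
import pandas as pd

# Made-up rows; "NA" here is meant as the continent code for North America.
csv_text = """country,spirit_servings,continent
Aland,100,NA
Bland,300,NA
Cland,20,EU
"""

# Default parsing: the string "NA" is read as a missing value (NaN).
default_read = pd.read_csv(io.StringIO(csv_text))
missing_count = int(default_read["continent"].isna().sum())

# keep_default_na=False turns off the built-in missing-value strings;
# na_values=[""] still treats genuinely empty cells as missing.
literal_read = pd.read_csv(
    io.StringIO(csv_text), keep_default_na=False, na_values=[""]
)
groups = literal_read.groupby("continent")["spirit_servings"].mean()
print(groups)
```

    Note that these options apply to every column, so a numeric column that uses NA to mark real gaps would then be read as text. For a file like that, fillna on the single affected column is the safer choice.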

3. Complete code

import pandas as pd

# Set the display width and related options so printed output is not truncated with ellipses
pd.set_option('display.width', 1000)
# Show all columns
pd.set_option('display.max_columns', None)
# Show all rows
pd.set_option('display.max_rows', None)
# Read the CSV file
path1 = "exercise_data/drinks.csv"
data = pd.read_csv(path1)
print(data.head())
# Average beer consumption per continent (fillna keeps North America, as in the calls below)
print(data.fillna({'continent': 'NA'}).groupby('continent')['beer_servings'].mean())
# Average beer consumption across all countries
print(round(data['beer_servings'].mean()))
# Descriptive statistics of beer consumption per continent
print(data.fillna({'continent': 'NA'}).groupby('continent')['beer_servings'].describe())
# Average consumption of each beverage per continent
print(data.fillna({'continent': 'NA'}).groupby('continent').mean(numeric_only=True))
# Median consumption of each beverage per continent
print(data.fillna({'continent': 'NA'}).groupby('continent').median(numeric_only=True))
# Mean, minimum and maximum spirit consumption per continent
print(data.fillna({'continent': 'NA'}).groupby('continent')['spirit_servings'].agg(['mean', 'min', 'max']))

     At this point we really are done. I do have a few complaints: although the Flying Paddle platform provides pandas exercises and datasets, they contain pitfalls that the authors may not have noticed and that are not very friendly to beginners. In this recent series of blog posts, I will do my best to point out and correct the pitfalls and problems I find while following the practice questions.

 


Origin blog.csdn.net/qq_21154101/article/details/127591769