Use of pandas module (2)

Merging data with join:

join: by default, it combines rows that share the same row index.

In [6]: t1 = pd.DataFrame(np.zeros((2,5)),index=["A","B"],columns=list("VWXYZ"))

In [7]: t1
Out[7]: 
     V    W    X    Y    Z
A  0.0  0.0  0.0  0.0  0.0
B  0.0  0.0  0.0  0.0  0.0

In [8]: t2 = pd.DataFrame(np.ones((3,4)),index=list("ABC"),columns=list("0123"))

In [9]: t2
Out[9]: 
     0    1    2    3
A  1.0  1.0  1.0  1.0
B  1.0  1.0  1.0  1.0
C  1.0  1.0  1.0  1.0

In [10]: t1.join(t2)
Out[10]: 
     V    W    X    Y    Z    0    1    2    3
A  0.0  0.0  0.0  0.0  0.0  1.0  1.0  1.0  1.0
B  0.0  0.0  0.0  0.0  0.0  1.0  1.0  1.0  1.0

In [11]: t2.join(t1)
Out[11]: 
     0    1    2    3    V    W    X    Y    Z
A  1.0  1.0  1.0  1.0  0.0  0.0  0.0  0.0  0.0
B  1.0  1.0  1.0  1.0  0.0  0.0  0.0  0.0  0.0
C  1.0  1.0  1.0  1.0  NaN  NaN  NaN  NaN  NaN


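As shown, join combines rows that have the same index, using the left operand as the base: every row of the left frame is kept, and labels missing from the right frame are filled with NaN.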
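The join type can also be chosen explicitly through the how parameter of DataFrame.join. The sketch below rebuilds t1 and t2 from the session above and compares the default left join with how="inner" and how="outer"; the variable names left_join, inner_join, and outer_join are illustrative only.

```python
import numpy as np
import pandas as pd

# Same frames as in the session above
t1 = pd.DataFrame(np.zeros((2, 5)), index=["A", "B"], columns=list("VWXYZ"))
t2 = pd.DataFrame(np.ones((3, 4)), index=list("ABC"), columns=list("0123"))

# Default how="left": every index label of the calling frame is kept
left_join = t2.join(t1)
# how="inner": only labels present in both frames are kept
inner_join = t2.join(t1, how="inner")
# how="outer": the union of the labels of both frames is kept
outer_join = t1.join(t2, how="outer")

print(left_join.shape)   # (3, 9)
print(inner_join.shape)  # (2, 9)
print(outer_join.shape)  # (3, 9)
```

Row "C" exists only in t2, so it disappears in the inner join and carries NaN in the V to Z columns in the other two results.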

Merging data with merge:

In [25]: t1
Out[25]: 
     V    W  X    Y    Z
A  0.0  0.0  c  0.0  0.0
B  0.0  0.0  d  0.0  0.0

In [26]: t2
Out[26]: 
     M    N    P    Q  O
A  1.0  1.0  1.0  1.0  a
B  1.0  1.0  1.0  1.0  b
C  1.0  1.0  1.0  1.0  c

In [27]: t1.merge(t2,left_on="X",right_on="O")        # the default merge type is inner (intersection)
Out[27]: 
     V    W  X    Y    Z    M    N    P    Q  O
0  0.0  0.0  c  0.0  0.0  1.0  1.0  1.0  1.0  c

In [28]: t1.merge(t2,left_on="X",right_on="O",how="inner") # inner join
Out[28]: 
     V    W  X    Y    Z    M    N    P    Q  O
0  0.0  0.0  c  0.0  0.0  1.0  1.0  1.0  1.0  c

In [29]: t1.merge(t2,left_on="X",right_on="O",how="outer") # outer join: union, missing values filled with NaN
Out[29]: 
     V    W    X    Y    Z    M    N    P    Q    O
0  0.0  0.0    c  0.0  0.0  1.0  1.0  1.0  1.0    c
1  0.0  0.0    d  0.0  0.0  NaN  NaN  NaN  NaN  NaN
2  NaN  NaN  NaN  NaN  NaN  1.0  1.0  1.0  1.0    a
3  NaN  NaN  NaN  NaN  NaN  1.0  1.0  1.0  1.0    b

In [30]: t1.merge(t2,left_on="X",right_on="O",how="left") # left join: all rows of the left frame are kept, missing values filled with NaN
Out[30]: 
     V    W  X    Y    Z    M    N    P    Q    O
0  0.0  0.0  c  0.0  0.0  1.0  1.0  1.0  1.0    c
1  0.0  0.0  d  0.0  0.0  NaN  NaN  NaN  NaN  NaN

In [31]: t1.merge(t2,left_on="X",right_on="O",how="right") # right join: all rows of the right frame are kept, missing values filled with NaN
Out[31]: 
     V    W    X    Y    Z    M    N    P    Q  O
0  NaN  NaN  NaN  NaN  NaN  1.0  1.0  1.0  1.0  a
1  NaN  NaN  NaN  NaN  NaN  1.0  1.0  1.0  1.0  b
2  0.0  0.0    c  0.0  0.0  1.0  1.0  1.0  1.0  c



As shown, merge compares the two specified columns (left_on and right_on) and joins rows whose values match into a single row.
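To see which side each row of a merge result came from, merge accepts indicator=True. The sketch below uses reduced versions of t1 and t2 (only one value column each, which is an assumption made for brevity) and counts the rows by origin; the name origin_counts is illustrative.

```python
import pandas as pd

# Reduced versions of t1 and t2 above: one value column plus the key column
t1 = pd.DataFrame({"V": [0.0, 0.0], "X": ["c", "d"]}, index=["A", "B"])
t2 = pd.DataFrame({"M": [1.0, 1.0, 1.0], "O": ["a", "b", "c"]}, index=list("ABC"))

# indicator=True adds a "_merge" column: both, left_only, or right_only
merged = t1.merge(t2, left_on="X", right_on="O", how="outer", indicator=True)
print(merged[["X", "O", "_merge"]])

# Count the rows by origin
origin_counts = merged["_merge"].value_counts()
print(origin_counts)
```

Only the key "c" appears in both frames, so one row is marked both, the "d" row is left_only, and the "a" and "b" rows are right_only.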

example:

  • We have a set of statistics about Starbucks stores around the world. How can we find out whether the United States or China has more Starbucks stores, and how many stores there are in each Chinese province?

  • In other words, how do we count the Starbucks stores in the United States and in China?

  • Data source: https://www.kaggle.com/starbucks/store-locations/data

  • Data Format:

      Brand	Store Number	Store Name	Ownership Type	Street Address	City	State/Province	Country	Postcode	Phone Number	Timezone	Longitude	Latitude
      Starbucks	47370-257954	Meritxell, 96	Licensed	Av. Meritxell, 96	Andorra la Vella	7	AD	AD500	376818720	GMT+1:00 Europe/Andorra	1.53	42.51
      Starbucks	22331-212325	Ajman Drive Thru	Licensed	1 Street 69, Al Jarf	Ajman	AJ	AE			GMT+04:00 Asia/Dubai	55.47	25.42
      Starbucks	47089-256771	Dana Mall	Licensed	Sheikh Khalifa Bin Zayed St.	Ajman	AJ	AE			GMT+04:00 Asia/Dubai	55.47	25.39
    

Code:

import pandas as pd

df = pd.read_csv("./starbucks_store_worldwide.csv")

# print(df.info())
# print(df.head(1))

# Group by country (for aggregation)
country_info = df.groupby(by="Country")

# Iterate over the groups and print each key and its rows
for country, group in country_info:
    print("-"*50)
    print(country)
    print("*"*50)
    print(group)

# Count the stores per country (non-null Brand values in each group)
country_num = country_info["Brand"].count()
print(country_num)

df[df["Country"]=="US"]

# Print the Starbucks store counts for the US and for China
print(country_num["US"])
print(country_num["CN"])

# Count the stores in each Chinese province
china_data = df[df["Country"] == "CN"]

# Group by province
grouped = china_data.groupby(by="State/Province")["Brand"].count()

print(grouped)


# Group the data by multiple keys
grouped = df["Brand"].groupby(by=[df["Country"],df["State/Province"]]).count()
print(grouped)
print(type(grouped))

# Group by multiple keys and return a DataFrame
grouped1 = df[["Brand"]].groupby(by=[df["Country"],df["State/Province"]]).count()
grouped2 = df.groupby(by=[df["Country"],df["State/Province"]]).count()[["Brand"]]
grouped3 = df.groupby(by=[df["Country"],df["State/Province"]])[["Brand"]].count()

print(grouped1,type(grouped1)) # <class 'pandas.core.frame.DataFrame'>
print("*"*50)
print(grouped2,type(grouped2)) # <class 'pandas.core.frame.DataFrame'>
print("*"*50)
print(grouped3,type(grouped3)) # <class 'pandas.core.frame.DataFrame'>
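One detail of the code above is that count() counts only the non-missing values of the selected column, so it equals the number of stores only when Brand is never missing. A minimal sketch of the difference between count() and size(), using a small made-up table (the rows are hypothetical, not taken from the real CSV):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Starbucks table; the values are made up
stores = pd.DataFrame({
    "Brand": ["Starbucks", "Starbucks", np.nan, "Starbucks", "Starbucks"],
    "Country": ["US", "US", "US", "CN", "CN"],
    "State/Province": ["CA", "CA", "NY", "31", "11"],
})

# count() skips missing values in the selected column
brand_count = stores.groupby("Country")["Brand"].count()
# size() counts every row in the group, missing values included
row_count = stores.groupby("Country").size()

print(brand_count["US"], row_count["US"])  # 2 3
print(brand_count["CN"], row_count["CN"])  # 2 2
```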

Grouping and aggregation:

pandas makes this kind of grouping very simple:

df.groupby(by="columns_name")

grouped = df.groupby(by="columns_name")
grouped is a DataFrameGroupBy object and is iterable.
Each element of grouped is a tuple:
(group key (the value that was grouped on), DataFrame holding that group's rows)
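The iteration described above can be sketched on a small made-up table. The rows and the names stores and group_sizes are hypothetical; get_group is the pandas method for fetching one group by its key.

```python
import pandas as pd

# Hypothetical sample rows; the values are made up
stores = pd.DataFrame({
    "Country": ["US", "CN", "US", "CN", "AD"],
    "City": ["Seattle", "Shanghai", "Boston", "Beijing", "Andorra la Vella"],
})

grouped = stores.groupby(by="Country")

# Each element is a (group key, sub-DataFrame) tuple
group_sizes = {}
for country, group in grouped:
    group_sizes[country] = len(group)
    print(country, type(group).__name__, len(group))

# A single group can also be fetched directly by its key
us_rows = grouped.get_group("US")
print(us_rows)
```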



What if we need grouped statistics by both country and province?

grouped = df.groupby(by=[df["Country"],df["State/Province"]])


Often we want only part of the data after grouping, or we want to group only a few of the columns. What should we do in that case?


Select part of the data after grouping:

   df.groupby(by=["Country","State/Province"])["Country"].count()

Group only selected columns:

   df["Country"].groupby(by=[df["Country"],df["State/Province"]]).count()


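Note that because only one column was selected, the result is a Series.

t1 = df[["Country"]].groupby(by=[df["Country"],df["State/Province"]]).count()
t2 = df.groupby(by=["Country","State/Province"])[["Country"]].count()

The two commands above give the same result.
The difference from the earlier result is that a DataFrame is now returned.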
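The Series versus DataFrame distinction comes down to single versus double brackets in the column selection. A minimal sketch on a made-up table, using Brand as the counted column (an assumption chosen to keep the value column separate from the grouping keys):

```python
import pandas as pd

# Hypothetical sample rows; the values are made up
stores = pd.DataFrame({
    "Brand": ["Starbucks"] * 4,
    "Country": ["US", "US", "CN", "CN"],
    "State/Province": ["CA", "WA", "31", "31"],
})

keys = ["Country", "State/Province"]
as_series = stores.groupby(by=keys)["Brand"].count()   # single brackets
as_frame = stores.groupby(by=keys)[["Brand"]].count()  # double brackets

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
```

Both results carry the same counts and the same two-level index; only the container type differs.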

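The DataFrameGroupBy object has many optimized methods:

    Method      Description
    count       number of non-NA values in the group
    sum         sum of non-NA values
    mean        mean of non-NA values
    median      arithmetic median of non-NA values
    std, var    unbiased (denominator n-1) standard deviation and variance
    min, max    minimum and maximum of non-NA values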
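Several of these methods can be applied in one pass with agg. The sketch below uses made-up scores (the team and points columns are hypothetical) and also shows that the n-1 denominator of std can be changed with the ddof argument.

```python
import pandas as pd

# Hypothetical scores; the values are made up
scores = pd.DataFrame({
    "team": ["a", "a", "b", "b", "b"],
    "points": [4.0, 3.0, 5.0, 4.0, 3.0],
})

# agg() applies several aggregation methods at once
summary = scores.groupby("team")["points"].agg(["count", "mean", "std"])
print(summary)

# std() divides by n-1 by default; ddof=0 divides by n instead
population_std = scores.groupby("team")["points"].std(ddof=0)
print(population_std)
```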

Index and hierarchical index (MultiIndex):

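Simple index operations:
Get the index: df.index
Assign the index: df.index = ['x','y']
Reindex: df.reindex(list("abcedf")) # labels that are not in the original index get rows of NaN
Use a column as the index: df.set_index("Country",drop=False) # with drop=False the original column is kept in the data
Get the unique values of the index: df.set_index("Country").index.unique()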
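A minimal sketch of these operations on a small made-up frame (the Stores column and the labels x, y, z, q are hypothetical):

```python
import pandas as pd

# Hypothetical frame; the values are made up
df = pd.DataFrame({"Country": ["US", "CN", "US"], "Stores": [3, 2, 1]},
                  index=["x", "y", "z"])

# reindex keeps rows whose labels already exist and fills new labels with NaN
reindexed = df.reindex(["x", "y", "q"])
print(reindexed)

# drop=False keeps "Country" as a column in addition to using it as the index
kept = df.set_index("Country", drop=False)
dropped = df.set_index("Country")

print(list(kept.columns))            # ['Country', 'Stores']
print(list(dropped.columns))         # ['Stores']
print(list(dropped.index.unique()))  # ['US', 'CN']
```

Note that reindex and set_index return new objects; df itself is unchanged unless the result is assigned back.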



Suppose a is a DataFrame. What does the result look like when two index levels are set with a.set_index(["c","d"])?

a = pd.DataFrame({'a': range(7),'b': range(7, 0, -1),'c': ['one','one','one','two','two','two', 'two'],'d': list("hjklmno")})

Series with a hierarchical index:

In [52]: a
Out[52]: 
   a  b    c  d
0  0  7  one  h
1  1  6  one  j
2  2  5  one  k
3  3  4  two  l
4  4  3  two  m
5  5  2  two  n
6  6  1  two  o

In [53]: X = a.set_index(["c","d"])["a"]

In [54]: X
Out[54]: 
c    d
one  h    0
     j    1
     k    2
two  l    3
     m    4
     n    5
     o    6
Name: a, dtype: int64

In [55]: X["one","h"]    # selecting from a Series with a hierarchical index: write the labels directly in the brackets
Out[55]: 0

In [10]: type(X)
Out[10]: pandas.core.series.Series


In [11]: X.swaplevel()      # swap the inner and outer index levels
Out[11]: 
d  c  
h  one    0
j  one    1
k  one    2
l  two    3
m  two    4
n  two    5
o  two    6
Name: a, dtype: int64

In [12]: X.swaplevel()["h"] # now the "h" label can be selected directly
Out[12]: 
c
one    0
Name: a, dtype: int64

In [13]: X.index.levels
Out[13]: FrozenList([['one', 'two'], ['h', 'j', 'k', 'l', 'm', 'n', 'o']])

In [14]: X.swaplevel().index.levels
Out[14]: FrozenList([['h', 'j', 'k', 'l', 'm', 'n', 'o'], ['one', 'two']])


In [18]: a
Out[18]: 
   a  b    c  d
0  0  7  one  h
1  1  6  one  j
2  2  5  one  k
3  3  4  two  l
4  4  3  two  m
5  5  2  two  n
6  6  1  two  o

In [19]: x = a.set_index(["c","d"])[["a"]]   # pandas.core.frame.DataFrame

In [20]: x
Out[20]: 
       a
c   d   
one h  0
    j  1
    k  2
two l  3
    m  4
    n  5
    o  6

In [21]: x.loc["one"]
Out[21]: 
   a
d   
h  0
j  1
k  2

In [22]: x.loc["one"].loc["h"]
Out[22]: 
a    0
Name: h, dtype: int64
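The chained x.loc["one"].loc["h"] lookup can also be written as a single selection. The sketch below rebuilds a and x from the session above and shows two alternatives that exist in pandas: passing a tuple to .loc, and using xs to select on the inner level without swapping levels. The names row and inner are illustrative.

```python
import pandas as pd

# Same frame as in the session above
a = pd.DataFrame({
    "a": range(7),
    "b": range(7, 0, -1),
    "c": ["one", "one", "one", "two", "two", "two", "two"],
    "d": list("hjklmno"),
})
x = a.set_index(["c", "d"])[["a"]]

# A tuple selects on both index levels in a single .loc call
row = x.loc[("one", "h")]
print(row["a"])  # 0

# xs() selects on the inner level "d" without calling swaplevel first
inner = x.xs("h", level="d")
print(inner)
```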

Using the data from the previous example:

  • Use matplotlib to show the top 10 countries by total number of stores
  • Use matplotlib to show the number of stores in each city in China

Code 1:

import pandas as pd
from matplotlib import pyplot as plt

# Load the data
df = pd.read_csv("./starbucks_store_worldwide.csv")

# Extract the 10 countries with the most stores
country_data = df.groupby(by="Country")["Brand"].count().sort_values(ascending=False)[:10]

# Set the figure size
plt.figure(figsize=(20,8),dpi=80)

# Draw the bar chart
plt.bar(range(len(country_data)),country_data,width=0.4,color="pink")

# Set the x-axis ticks
plt.xticks(range(len(country_data)),country_data.index)

# Show the figure
plt.show()

Result:

[Figure: bar chart of store counts for the 10 countries with the most stores]

Code 2:

import pandas as pd
from matplotlib import pyplot as plt
import matplotlib


# Font settings so that Chinese text can be displayed
font = {
    'family': 'WenQuanYi Micro Hei',
    'weight': 'bold',
    'size': 10,
}
matplotlib.rc("font",**font)

# Load the data
df = pd.read_csv("./starbucks_store_worldwide.csv")

# info() prints its summary itself and returns None
df.info()

# Extract the 25 Chinese cities with the most stores
df = df[df["Country"]=="CN"]
china_data = df.groupby(by="City")["Brand"].count().sort_values(ascending=False)[:25]
print(china_data)

# Set the figure size
plt.figure(figsize=(20,8),dpi=80)

# Draw the bar chart
plt.bar(range(len(china_data)),china_data.values,width=0.4,color="green")

# Set the x-axis ticks
plt.xticks(range(len(china_data)),china_data.index)

# Show the figure
plt.show()

Result:

[Figure: bar chart of store counts for the 25 Chinese cities with the most stores]

example:

Now we have data on the world's top 10,000 books. Compute the following statistics:

  • Number of books in different years
  • Average rating of books in different years

Data source: https://www.kaggle.com/zygmunt/goodbooks-10k

Data Format:

[Image: preview of the data; the code below uses the original_publication_year, average_rating, and title columns]

Code:

import pandas as pd
from matplotlib import pyplot as plt

# Load the data
df = pd.read_csv("./books.csv")
# df.info()
# print(df.head(1))

# Drop the rows with a missing publication year
# df = df[pd.notnull(df["original_publication_year"])]
# Number of books per year
# data_book_count = df.groupby(by="original_publication_year").count()["title"]

# Average rating per year
data_book_avg = df["average_rating"].groupby(by=df["original_publication_year"]).mean()

_x = data_book_avg.index
_y = data_book_avg.values
# Set the figure size
plt.figure(figsize=(20,8),dpi=80)

# Draw the line chart
plt.plot(range(len(_x)),_y)

# Set the x-axis ticks, labeling every 10th point
plt.xticks(list(range(len(_x)))[::10],_x[::10],rotation=45)

# Show the figure
plt.show()
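The code above plots only the yearly average rating; the per-year book count is left commented out. Both questions can be answered in a single groupby with named aggregation. The sketch below is one possible approach, run on a made-up stand-in for books.csv (the rows are hypothetical; only the column names are taken from the code above).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for books.csv; the values are made up
books = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "original_publication_year": [1999.0, 1999.0, 2005.0, np.nan, 2005.0],
    "average_rating": [4.0, 3.0, 4.5, 3.9, 3.5],
})

# groupby skips NaN keys by default, so the row without a year is left out
per_year = books.groupby("original_publication_year").agg(
    book_count=("title", "count"),
    avg_rating=("average_rating", "mean"),
)
print(per_year)
```

Each output column is named by its keyword, so per_year["book_count"] and per_year["avg_rating"] can be passed to plt.plot in the same way as data_book_avg.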

Result:

[Figure: line chart of average book rating by original publication year]


Origin blog.csdn.net/qq_46456049/article/details/108922693