Data Analysis Learning Roadmap
3. Merging Data
3.1 Merging data with join
Think about it: what happens if we call t2.join(t1)?
With t2.join(t1), t2's index is the base: after join(), any row of t1 whose index label is missing from t2 (row C in the earlier example) is dropped.
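A minimal sketch of that behavior (t1 and t2 here are made-up DataFrames, not from the course data): join aligns on the index of the caller, so the direction of the call decides which rows survive.

```python
import pandas as pd
import numpy as np

# t1 has index rows A, B, C; t2 has only A, B
t1 = pd.DataFrame(np.ones((3, 4)), index=list("ABC"), columns=list("abcd"))
t2 = pd.DataFrame(np.zeros((2, 5)), index=list("AB"), columns=list("vwxyz"))

# t1.join(t2): keeps t1's index (A, B, C); t2 has no row C, so those cells become NaN
print(t1.join(t2))

# t2.join(t1): keeps t2's index (A, B); t1's row C is dropped entirely
print(t2.join(t1))
```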
3.2 Merging data with merge
Hands-on practice:
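As a practice sketch (the DataFrames below are invented for illustration): unlike join, merge matches on column values, and the how parameter controls which keys survive.

```python
import pandas as pd

t1 = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
t2 = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

# inner (the default): only keys present in both sides -> b, c
print(t1.merge(t2, on="key"))

# outer: union of keys -> a, b, c, d, with NaN where one side is missing
print(t1.merge(t2, on="key", how="outer"))

# left: all keys from t1 -> a, b, c
print(t1.merge(t2, on="key", how="left"))
```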
4. Grouping and Aggregating Data
- Suppose we have statistics on Starbucks stores worldwide. If we want to know whether the US or China has more Starbucks stores, or how many stores each Chinese province has, how should we do it?
Naive idea: iterate over every row, adding 1 each time???
Data source: https://www.kaggle.com/starbucks/store-locations/data
- pandas offers a much simpler way to perform this kind of grouping:
- df.groupby(by="columns_name")
- Which raises the question: what exactly does groupby return?
import pandas as pd
import numpy as np

file_path = "./starbucks_store_worldwide.csv"
df = pd.read_csv(file_path)
# print(df.head(1))
# print(df.info())
grouped = df.groupby(by="Country")
print(grouped)
# DataFrameGroupBy
# it can be iterated over
for i, j in grouped:
    print(i)
    print("-" * 100)
    print(j, type(j))
    print("*" * 100)
Output:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001D89D62F9C8>
AD  # first country
----------------------------------------------------------------------------------------------------
Brand Store Number ... Longitude Latitude
0 Starbucks 47370-257954 ... 1.53 42.51
[1 rows x 13 columns] <class 'pandas.core.frame.DataFrame'>
****************************************************************************************************
AE  # second country
----------------------------------------------------------------------------------------------------
Brand Store Number ... Longitude Latitude
1 Starbucks 22331-212325 ... 55.47 25.42
2 Starbucks 47089-256771 ... 55.47 25.39
3 Starbucks 22126-218024 ... 54.38 24.48
4 Starbucks 17127-178586 ... 54.54 24.51
5 Starbucks 17688-182164 ... 54.49 24.40
.. ... ... ... ... ...
140 Starbucks 34253-62541 ... 55.38 25.33
141 Starbucks 1359-138434 ... 55.38 25.32
142 Starbucks 34259-54260 ... 55.37 25.30
143 Starbucks 34217-27108 ... 55.48 25.30
144 Starbucks 22697-223524 ... 55.54 25.53
[144 rows x 13 columns] <class 'pandas.core.frame.DataFrame'>
****************************************************************************************************
# output truncated from here on because the data is large
# coding=utf-8
import pandas as pd
import numpy as np

file_path = "./starbucks_store_worldwide.csv"
df = pd.read_csv(file_path)
grouped = df.groupby(by="Country")
print(grouped)
# DataFrameGroupBy
# call an aggregation method
print(grouped["Brand"].count())
Output:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001DCABF33748>
Country
AD 1
AE 144
AR 108
AT 18
AU 22
...
TT 3
TW 394
US 13608
VN 25
ZA 3
Name: Brand, Length: 73, dtype: int64
Analysis
Which has more Starbucks stores, the US or China?
# coding=utf-8
import pandas as pd
import numpy as np

file_path = "./starbucks_store_worldwide.csv"
df = pd.read_csv(file_path)
grouped = df.groupby(by="Country")
print(grouped)
# call an aggregation method
# print(grouped["Brand"].count())
country_count = grouped["Brand"].count()
print(country_count["US"])
print(country_count["CN"])
Output:
13608
2734
Counting the Starbucks stores in each Chinese province
# coding=utf-8
import pandas as pd
import numpy as np

file_path = "./starbucks_store_worldwide.csv"
df = pd.read_csv(file_path)
# count the number of stores in each Chinese province
china_data = df[df["Country"] == "CN"]
grouped = china_data.groupby(by="State/Province").count()["Brand"]
print(grouped)
Output:
State/Province
11 236
12 58
13 24
14 8
15 8
21 57
22 13
23 16
31 551
32 354
33 315
34 26
...
91 162
92 13
Name: Brand, dtype: int64
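count() is only one of the available aggregations. A small sketch with toy data (the column names mirror the Starbucks CSV, but the rows below are invented) showing a few other things a grouped object supports:

```python
import pandas as pd

# Toy rows standing in for the Starbucks CSV
df = pd.DataFrame({
    "Country": ["CN", "CN", "CN", "US", "US"],
    "State/Province": ["11", "11", "31", "CA", "NY"],
    "Brand": ["Starbucks"] * 5,
})

g = df.groupby("Country")
print(g["Brand"].count())   # non-NaN values of Brand per country
print(g.size())             # rows per group, regardless of NaN
# sort the counts to rank countries by store count
print(g["Brand"].count().sort_values(ascending=False))
```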
# coding=utf-8
import pandas as pd
import numpy as np

file_path = "./starbucks_store_worldwide.csv"
df = pd.read_csv(file_path)
# group by multiple keys -> returns a Series
grouped = df["Brand"].groupby(by=[df["Country"], df["State/Province"]]).count()
print(grouped)
print(type(grouped))
# group by multiple keys -> returns a DataFrame; grouped1, grouped2 and grouped3 are equivalent
grouped1 = df[["Brand"]].groupby(by=[df["Country"], df["State/Province"]]).count()
# grouped2 = df.groupby(by=[df["Country"], df["State/Province"]])[["Brand"]].count()
# grouped3 = df.groupby(by=[df["Country"], df["State/Province"]]).count()[["Brand"]]
print(grouped1, type(grouped1))
Output:
Country State/Province
AD 7 1
AE AJ 2
AZ 48
DU 82
FU 2
..
US WV 25
WY 23
VN HN 6
SG 19
ZA GT 3
Name: Brand, Length: 545, dtype: int64
<class 'pandas.core.series.Series'>
Brand
Country State/Province
AD 7 1
AE AJ 2
AZ 48
DU 82
FU 2
... ...
US WV 25
WY 23
VN HN 6
SG 19
ZA GT 3
[545 rows x 1 columns] <class 'pandas.core.frame.DataFrame'>
5. Indexes and MultiIndex
index refers to the row index; columns refers to the column index.
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.arange(8).reshape((2, 4)), index=list("ab"), columns=list("WXYZ"))
print(df1)
print(df1.index)
print(df1.reindex(list("af")))
print(df1.reindex(list("abcf")))
print(df1.set_index("W"))
print(df1.set_index("W", drop=False))
print(df1.set_index("W").index)
df2 = pd.DataFrame(np.arange(8).reshape((2, 4)), index=list("AB"), columns=list("abcd"))
print(df2.set_index(["a", "b"]))
print(df2.set_index(["a", "b"], drop=False))
print(df2.set_index(["a", "b"]).index)
a = pd.DataFrame({
    'a': range(7),
    'b': range(7, 0, -1),
    'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
    'd': list("hjklmno"),
})
print(a)
print(a.set_index(["c", "d"]))
Output:
W X Y Z #df1
a 0 1 2 3
b 4 5 6 7
Index(['a', 'b'], dtype='object') # df1.index
W X Y Z # df1.reindex(list("af"))
a 0.0 1.0 2.0 3.0
f NaN NaN NaN NaN
W X Y Z # df1.reindex(list("abcf"))
a 0.0 1.0 2.0 3.0
b 4.0 5.0 6.0 7.0
c NaN NaN NaN NaN
f NaN NaN NaN NaN
X Y Z # df1.set_index("W")
W
0 1 2 3
4 5 6 7
W X Y Z # df1.set_index("W", drop=False)
W
0 0 1 2 3
4 4 5 6 7
Int64Index([0, 4], dtype='int64', name='W') # df1.set_index("W").index
c d # df2.set_index(["a", "b"])
a b
0 1 2 3
4 5 6 7
a b c d # df2.set_index(["a", "b"], drop=False)
a b
0 1 0 1 2 3
4 5 4 5 6 7
MultiIndex([(0, 1), # df2.set_index(["a", "b"]).index
(4, 5)],
names=['a', 'b'])
a b c d
0 0 7 one h
1 1 6 one j
2 2 5 one k
3 3 4 two l
4 4 3 two m
5 5 2 two n
6 6 1 two o
a b
c d
one h 0 7
j 1 6
k 2 5
two l 3 4
m 4 3
n 5 2
o 6 1
5.1 MultiIndex on a Series
The figure shows selecting column a as a Series X; X carries a MultiIndex of (c, d).
If we want the value corresponding to "h", we first need swaplevel() to move h to the outer level; then X.swaplevel()["h"] does it.
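A minimal sketch of that idea (X here is a small hand-built Series standing in for the one in the figure, with index levels named c and d):

```python
import pandas as pd

X = pd.Series(
    [0, 1, 2],
    index=pd.MultiIndex.from_tuples(
        [("one", "h"), ("one", "j"), ("two", "k")], names=["c", "d"]
    ),
)

# The outer level is c, so X["one"] works directly...
print(X["one"])
# ...but to index by the inner level d, swap the levels first
print(X.swaplevel()["h"])
```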
5.2 Working with a MultiIndex on Series and DataFrame
import pandas as pd
import numpy as np

a = pd.DataFrame({
    'a': range(7),
    'b': range(7, 0, -1),
    'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
    'd': list("hjklmno"),
})
b = a.set_index(["c", "d"])
print(b)
c = b["a"]
print(c)
print(c["one"]["j"])
print(c["one"])
# d is a Series
d = a.set_index(["d", "c"])["a"]
print(d)
# selecting "one"
print(d.index)
# step 1: d.swaplevel()
print(d.swaplevel())
print(d.swaplevel()["one"])
# b is a DataFrame; how do we get the row at ("one", "h")?
print(b)
print(b.loc["one"].loc["h"])
# what if we want to select by "h"? Again, swaplevel() first
print(b.swaplevel())
print(b.swaplevel().loc["h"])
Output:
a b
c d
one h 0 7
j 1 6
k 2 5
two l 3 4
m 4 3
n 5 2
o 6 1
c d
one h 0
j 1
k 2
two l 3
m 4
n 5
o 6
Name: a, dtype: int64
1
d
h 0
j 1
k 2
Name: a, dtype: int64
d c
h one 0
j one 1
k one 2
l two 3
m two 4
n two 5
o two 6
Name: a, dtype: int64
MultiIndex([('h', 'one'),
('j', 'one'),
('k', 'one'),
('l', 'two'),
('m', 'two'),
('n', 'two'),
('o', 'two')],
names=['d', 'c'])
c d
one h 0
j 1
k 2
two l 3
m 4
n 5
o 6
Name: a, dtype: int64
d
h 0
j 1
k 2
Name: a, dtype: int64
a b
c d
one h 0 7
j 1 6
k 2 5
two l 3 4
m 4 3
n 5 2
o 6 1
a 0
b 7
Name: h, dtype: int64
a b
d c
h one 0 7
j one 1 6
k one 2 5
l two 3 4
m two 4 3
n two 5 2
o two 6 1
a b
c
one 0 7
Process finished with exit code 0