Pandas: merge and group aggregation

One, counting movies per genre with pandas

## Build a new all-zero array whose column names are the genres
## If a genre appears in a row, change that row's 0 to 1

import numpy as np
import pandas as pd

data_movies = r"C:\Users\dell\Desktop\Python学习\14100_HM数据科学库课件\day04\datasets_IMDB-Movie-Data.csv"
df = pd.read_csv(data_movies)
print(df["Genre"])
# Count the number of movies in each genre
# First collect the list of genre labels
temp_list = df["Genre"].str.split(",").tolist()  # a list of lists: [[...], [...]]
genre_list = list(set([i for j in temp_list for i in j]))
# Build an all-zero DataFrame with one column per genre
zeros_df = pd.DataFrame(np.zeros(shape=(df.shape[0], len(genre_list)), dtype=int), columns=genre_list)

# Set a 1 at every genre a movie belongs to
for i in range(df.shape[0]):
    zeros_df.loc[i, temp_list[i]] = 1

# Sum each column to get the number of movies per genre
genre_count = zeros_df.sum(axis=0)
print(genre_count)
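
The loop above fills the 0/1 table one row at a time. As a rough sketch of a vectorized alternative, pandas' Series.str.get_dummies can build the same kind of table in one call. The three-row movies frame below is a made-up stand-in for the IMDB file, not data from it.

```python
import pandas as pd

# Hypothetical stand-in for the Genre column of the IMDB data.
movies = pd.DataFrame({"Genre": ["Action,Sci-Fi", "Drama", "Action,Drama"]})

# One 0/1 column per genre, built in a single vectorized call.
dummies = movies["Genre"].str.get_dummies(sep=",")

# Summing each column gives the number of movies per genre.
genre_count = dummies.sum(axis=0)
print(genre_count)
```

On the real file the same two calls would be applied to df["Genre"]; for a large table this usually avoids the cost of a Python-level loop over rows.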

Two, join usage

pd.concat can align two frames in three ways: on the union of their row labels (outer), on the intersection (inner), or on the index of one of the frames, as shown below.

df1=pd.DataFrame(np.ones((3,4))*0,columns=['A','B','C','D'],index=[1,2,3])
df2=pd.DataFrame(np.ones((3,4))*1,columns=['B','C','D','E'],index=[2,3,4])
print('First DataFrame:')
print(df1)
print('\n')
print('Second DataFrame:')
print(df2)
print('\n')

print('join="outer": keep the union of the row labels, like a full outer join')
res=pd.concat([df1,df2],axis=1,join='outer')
print(res)
print('\n')

print('join="inner": keep only the shared row labels, like an inner join')
res2=pd.concat([df1,df2],axis=1,join='inner')
print(res2)
print('\n')

print('Align on the index of df1, like a left join')
# join_axes was removed in pandas 1.0; reindex df2 to df1's index instead
res3=pd.concat([df1,df2.reindex(df1.index)],axis=1)
print(res3)
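
Because df1 and df2 share the column names B, C and D, each result above ends up with duplicate column labels. One way to keep the two sources apart, sketched here as an optional extra rather than something the original post does, is the keys argument of pd.concat, which adds an outer column level. The labels "df1" and "df2" are arbitrary names chosen for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.zeros((3, 4)), columns=list("ABCD"), index=[1, 2, 3])
df2 = pd.DataFrame(np.ones((3, 4)), columns=list("BCDE"), index=[2, 3, 4])

# keys= labels each input frame, giving two-level column names.
res = pd.concat([df1, df2], axis=1, join="inner", keys=["df1", "df2"])
print(res)

# Column B of the second frame can now be selected unambiguously.
b_from_df2 = res["df2"]["B"]
print(b_from_df2)
```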

Three, merge usage

(1) Merging on a single key
Left, right, inner, and outer joins are available; the default is inner.

left=pd.DataFrame({'key':['k0','k1','k2','k3'],
                   'A':['A0','A1','A2','A3'],
                   'B':['B0','B1','B2','B3']})
print('First DataFrame:')
print(left)
print('\n')

right=pd.DataFrame({'key':['k0','k1','k2','k3'],
                    'C':['C0','C1','C2','C3'],
                    'D':['D0','D1','D2','D3']})
print('Second DataFrame:')
print(right)
print('\n')

print('Merge on key:')
res=pd.merge(left,right,on='key')
print(res)
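
In the example above both frames contain exactly the same keys, so every join type would return the same four rows. The sketch below uses two small made-up frames whose keys only partly overlap, to show how the how argument changes which keys survive.

```python
import pandas as pd

# Hypothetical frames: k0 exists only on the left, k3 only on the right.
left = pd.DataFrame({"key": ["k0", "k1", "k2"], "A": ["A0", "A1", "A2"]})
right = pd.DataFrame({"key": ["k1", "k2", "k3"], "C": ["C1", "C2", "C3"]})

# Collect the surviving keys for each join type.
results = {}
for how in ["inner", "outer", "left", "right"]:
    merged = pd.merge(left, right, on="key", how=how)
    results[how] = list(merged["key"])
    print(how, results[how])
```

Rows that have no partner on the other side are kept by outer, left or right joins, with NaN in the columns that come from the missing side.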

(2) Merging on two keys
left=pd.DataFrame({'key1':['k0','k1','k2','k3'],
                   'key2':['k0','k1','k0','k1'],
                   'A':['A0','A1','A2','A3'],
                   'B':['B0','B1','B2','B3']})
print('First DataFrame:')
print(left)
print('\n')

right=pd.DataFrame({'key1':['k0','k1','k2','k3'],
                    'key2':['k0','k0','k0','k0'],
                    'C':['C0','C1','C2','C3'],
                    'D':['D0','D1','D2','D3']})
print('Second DataFrame:')
print(right)
print('\n')

print('Inner join')
res=pd.merge(left,right,on=['key1','key2'],how='inner')
print(res)
print('\n')

print('Outer join')
res2=pd.merge(left,right,on=['key1','key2'],how='outer')
print(res2)
print('\n')

print('Left join')
res3=pd.merge(left,right,on=['key1','key2'],how='left')
print(res3)
print('\n')

print('Right join')
res4=pd.merge(left,right,on=['key1','key2'],how='right')
print(res4)

(3) Merging on the index
left=pd.DataFrame({'A':['A0','A1','A2'],
                   'B':['B0','B1','B2']},
                  index=['k0','k1','k2'])

right=pd.DataFrame({'C':['C0','C1','C2'],
                    'D':['D0','D1','D2']},
                   index=['k0','k2','k3'])

print('First DataFrame:')
print(left)
print('\n')

print('Second DataFrame:')
print(right)
print('\n')

print('Merge on the index with an outer join')
res=pd.merge(left,right,left_index=True,right_index=True,how='outer')
print(res)
print('\n')

print('Merge on the index with an inner join')
res2=pd.merge(left,right,left_index=True,right_index=True,how='inner')
print(res2)
print('\n')
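
For index-on-index merges like the two above, DataFrame.join is a common shorthand. The sketch below, which uses trimmed copies of the frames with one column each, checks that it gives the same result as the explicit pd.merge call.

```python
import pandas as pd

left = pd.DataFrame({"A": ["A0", "A1", "A2"]}, index=["k0", "k1", "k2"])
right = pd.DataFrame({"C": ["C0", "C1", "C2"]}, index=["k0", "k2", "k3"])

# Explicit index-on-index merge.
via_merge = pd.merge(left, right, left_index=True, right_index=True, how="outer")

# The same operation written with DataFrame.join.
via_join = left.join(right, how="outer")

print(via_join)
print(via_merge.equals(via_join))
```

Note that join defaults to how="left", while merge defaults to how="inner", so it is safer to pass how explicitly when switching between the two.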

Four, groupby

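genre_count = genre_count.sort_values()
grouped = df.groupby(by="columns_name")
grouped is a DataFrameGroupBy object, and it is iterable.
Each element of grouped is a tuple.
The tuple holds (the group key, i.e. the value that was grouped on, and the DataFrame of rows in that group).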
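
A minimal sketch of that iteration, using a small made-up frame whose Country and State/Province columns are hypothetical sample values chosen only to match the column names used below:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "Country": ["US", "US", "CN", "CN", "GB"],
    "State/Province": ["CA", "NY", "BJ", "BJ", "EN"],
})

grouped = df.groupby(by="Country")

# Each iteration yields (group key, DataFrame with that group's rows).
sizes = {}
for country, group_df in grouped:
    sizes[country] = len(group_df)
    print(country, type(group_df).__name__, len(group_df))
```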
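Often we only want part of the data after grouping, or we only want to group a few of the columns. What should we do in those cases?
(1)
Selecting part of the data after grouping:
To get the Country column, append ["Country"].count() directly:

df.groupby(by=["Country","State/Province"])["Country"].count()

(2)
Grouping only a few columns:
df["Country"].groupby(by=[df["Country"],df["State/Province"]]).count()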

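Supplement:
t1 = df[["Country"]].groupby(by= [df["Country"],df["State/Province"]]).count()
t2 = df.groupby(by=["Country","State/Province"])[["Country"]].count()

The two commands above give the same result. The difference from the earlier results is that selecting with a list, [["Country"]], returns a DataFrame instead of a Series.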
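
To make the Series versus DataFrame difference concrete, here is a small sketch on made-up data (the rows are hypothetical; only the column names follow the text):

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "Country": ["US", "US", "CN", "CN", "GB"],
    "State/Province": ["CA", "NY", "BJ", "BJ", "EN"],
})

# Single brackets select one column: the result is a Series.
s = df.groupby(by=["Country", "State/Province"])["Country"].count()

# Double brackets select a list of columns: the result is a DataFrame.
d = df.groupby(by=["Country", "State/Province"])[["Country"]].count()

print(type(s).__name__)
print(type(d).__name__)
print(d)
```

Both results are indexed by the (Country, State/Province) pairs; only the container type differs.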

Five, composite index

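(1) Series
(2) DataFrame

Get the index: df.index
Assign the index: df.index = ['x','y']
Reindex (conform the rows to a new list of labels; labels that are not already present give NaN rows): df.reindex(list("abcedf"))
Use a column as the index: df.set_index("Country",drop=False)
Get the unique index values: df.set_index("Country").index.unique()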
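
The operations listed above can be tried on a small frame. This is only a sketch: the Stores column and all values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"Country": ["US", "CN", "US"], "Stores": [3, 5, 2]})

# Assigning to df.index relabels the existing rows in place.
df.index = ["x", "y", "z"]

# reindex selects rows by label; the unknown label "w" becomes a NaN row.
reindexed = df.reindex(["x", "z", "w"])

# drop=False keeps Country as a regular column as well as the index.
by_country = df.set_index("Country", drop=False)

# Unique index values, in order of first appearance.
unique_countries = df.set_index("Country").index.unique()

print(reindexed)
print(by_country)
print(list(unique_countries))
```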



Suppose a is a DataFrame. What does the result look like when we set two index levels with a.set_index(["c","d"])?

a = pd.DataFrame({'a': range(7),'b': range(7, 0, -1),'c': ['one','one','one','two','two','two', 'two'],'d': list("hjklmno")})
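
A sketch of what that call produces and how the resulting two-level index can be queried. The selections shown are illustrative choices, not taken from the original post.

```python
import pandas as pd

a = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                  'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                  'd': list("hjklmno")})

# Columns c and d become the two levels of a MultiIndex.
b = a.set_index(["c", "d"])
print(b)

# Selecting on the outer level returns all rows under "one".
one_rows = b.loc["one"]
print(one_rows)

# A tuple addresses both levels at once.
row = b.loc[("two", "m"), :]
print(row)

# swaplevel makes d the outer level, so it can be selected directly.
h_rows = b.swaplevel().loc["h"]
print(h_rows)
```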

Six, examples

from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
# Use a font that can display Chinese characters in the plot
plt.rcParams['font.sans-serif'] = ['SimHei']
file_path='C:/Users/ming/Desktop/DataAnalysis-master/day05/code/books.csv'
df=pd.read_csv(file_path)
df.info()
# Number of books per year
# Drop rows whose year is NaN
# data1=df[pd.notnull(df['original_publication_year'])]
# grouped=data1.groupby(by='original_publication_year').count()['title']
# Average rating per year
data1=df[pd.notnull(df['original_publication_year'])]
grouped=data1['average_rating'].groupby(by=data1['original_publication_year']).mean()
# print(grouped)
_x=grouped.index
_y=grouped.values
plt.plot(range(len(_x)),_y)
# Label every 10th position with its year
plt.xticks(list(range(len(_x)))[::10],_x[::10].astype(int),rotation=45)
plt.show()

Origin blog.csdn.net/tjjyqing/article/details/113643965