[Data Processing-Economic Management] How to use Python to perform grouping operations on data?

1. Preface and data format introduction

        Research in economics and management often involves data structured like this:

Name  Year  IncomeType                               Income
A     2015  Investment and wealth management income  123
A     2015  Salary income                            124
A     2015  Labor remuneration income                125
A     2015  Other income                             126
A     2016  Salary income                            213
A     2016  Labor remuneration income                214
A     2016  Other income                             521
A     2017  Salary income                            213
A     2018  Salary income                            322
A     2018  Labor remuneration income                123
...   ...   ...                                      ...
B     2015  Investment and wealth management income  122

        The table above is in "questionnaire format": each row records the income that a resident (here A) received from one income source in a given year, as collected by a survey (the data shown are made up). The same layout appears with listed companies, where the columns are company, year, income category (or expense category, etc.), and income (or expense). The categories may also be split into numbered sub-items, like this:

Name  Year  IncomeType                               Income
A     2015  Investment wealth management income 1    123
A     2015  Investment wealth management income 2    124
A     2015  Labor remuneration income 1              125
...   ...   ...                                      ...

       So how do we process this kind of data to get total income grouped by Name, Year, and IncomeType? Or to get panel data of annual total income keyed by Name, Year, and IncomeType? Below I answer with Python:
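Before turning to the full script, the idea can be sketched on a few rows shaped like the tables above. This is only an illustration: the rows are toy values, the column names are taken from the table header, and the rule that a trailing number marks a sub-item of one category is an assumption about the data.

```python
import pandas as pd

# Toy rows mirroring the second table; the values are illustrative only.
df = pd.DataFrame({
    "Name": ["A", "A", "A", "A", "B"],
    "Year": [2015, 2015, 2015, 2016, 2015],
    "IncomeType": [
        "Investment wealth management income 1",
        "Investment wealth management income 2",
        "Labor remuneration income 1",
        "Salary income",
        "Investment wealth management income 1",
    ],
    "Income": [123, 124, 125, 213, 122],
})

# Assumption: a trailing number only distinguishes sub-items of one category,
# so stripping it recovers the category name.
df["IncomeType"] = df["IncomeType"].str.replace(r"\s*\d+$", "", regex=True)

# Long format: one row per (Name, Year, IncomeType) with the summed income.
totals = df.groupby(["Name", "Year", "IncomeType"], as_index=False)["Income"].sum()

# Panel (wide) format: one row per (Name, Year), one column per income type.
panel = totals.pivot_table(index=["Name", "Year"], columns="IncomeType",
                           values="Income", aggfunc="sum", fill_value=0)
panel["Total"] = panel.sum(axis=1)

print(totals)
print(panel)
```

The long table answers the first question (totals per category), and the wide table gives one row per person-year, which is the usual shape for panel regressions.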

2. Code introduction

       The code below answers these questions. It works on company data with the columns code, stkcd, year, Sectors, and revenue; with small changes it can compute total income for other groupings, or totals after the sectors (departments) have been regrouped. When sectors (income sources) are classified, some items may be hard to classify and we may want to exclude them, so the code also computes the totals with those items removed.

import pandas as pd

# Read the Excel data
df = pd.read_excel(".xlsx", engine='openpyxl')
# Check which Python types occur in the revenue column (e.g. str mixed with float)
print(df['revenue'].apply(type).value_counts())
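If the type check reports strings mixed in with numbers, the column should be converted before summing, since summing strings either fails or concatenates them. A minimal sketch on a toy column, assuming the non-numeric cells are either numbers stored as text (possibly with thousands separators) or placeholders such as "n/a":

```python
import pandas as pd

# Toy column with the kind of mixed content an Excel import can produce.
df = pd.DataFrame({"revenue": [100, "200", "1,300", "n/a", 50.5]})
print(df["revenue"].apply(type).value_counts())

# Remove thousands separators from the text form of each cell, then convert;
# anything that still cannot be parsed becomes NaN instead of raising.
cleaned = df["revenue"].astype(str).str.replace(",", "", regex=False)
df["revenue"] = pd.to_numeric(cleaned, errors="coerce")

# sum() skips NaN values by default.
print(df["revenue"].sum())
```

Cells that become NaN are dropped silently from the totals, so it is worth counting them with `df["revenue"].isna().sum()` before going on.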

# Sum revenue by company, year, and sector
grouped_df = df.groupby(['code', 'stkcd', 'year', 'Sectors'])['revenue'].sum().reset_index()

# Total revenue of each company in each year across all sectors
# ('公司年度总收入' = company annual total revenue)
total_income = df.groupby(['code', 'stkcd', 'year'])['revenue'].sum().reset_index()
total_income.columns = ['code', 'stkcd', 'year', '公司年度总收入']

# The same total after dropping the '不便于分类' ("hard to classify") sector
# ('剔除不便于分类后公司年度总收入' = company annual total revenue excluding hard-to-classify)
df_without_unclassified = df[df['Sectors'] != '不便于分类']
total_income_without_unclassified = df_without_unclassified.groupby(['code', 'stkcd', 'year'])['revenue'].sum().reset_index()
total_income_without_unclassified.columns = ['code', 'stkcd', 'year', '剔除不便于分类后公司年度总收入']

# Merge both totals onto the grouped data; the left join keeps company-years
# that contain only hard-to-classify rows (their filtered total is NaN)
final_df = pd.merge(grouped_df, total_income, on=['code', 'stkcd', 'year'])
final_df = pd.merge(final_df, total_income_without_unclassified, on=['code', 'stkcd', 'year'], how='left')

# Save to a new Excel file
final_df.to_excel(".xlsx", index=False)
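The two merges above can also be written with `groupby(...).transform("sum")`, which returns one value per row and so attaches the totals directly. The sketch below is an alternative, not the method used in this article; the toy rows, the English label "unclassified", and the column names `total` and `total_classified` are stand-ins chosen for illustration.

```python
import pandas as pd

# Toy data; "unclassified" stands in for the hard-to-classify sector label.
df = pd.DataFrame({
    "code": ["c1", "c1", "c1", "c2"],
    "year": [2015, 2015, 2015, 2015],
    "Sectors": ["retail", "finance", "unclassified", "retail"],
    "revenue": [10.0, 20.0, 5.0, 7.0],
})

keys = ["code", "year"]
grouped = df.groupby(keys + ["Sectors"], as_index=False)["revenue"].sum()

# transform("sum") returns a value for every row of `grouped`,
# so no merge is needed to attach the company-year total.
grouped["total"] = grouped.groupby(keys)["revenue"].transform("sum")

# Zero out the unclassified rows before summing to get the filtered total.
classified = grouped["revenue"].where(grouped["Sectors"] != "unclassified", 0.0)
grouped["total_classified"] = classified.groupby(
    [grouped[k] for k in keys]
).transform("sum")

print(grouped)
```

One difference from the merge version: a company-year that contains only unclassified rows gets a filtered total of 0 here rather than NaN.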

3. Further processing

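        We may also want to find the largest income item in each group, label it as the main income source, and label the rest as auxiliary sources. The code below does this. Note that it reads a file whose columns are named 公司 (company), 年度 (year), 部门 (department), and 收入 (revenue), so rename them if you run it on the output of the previous step.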

import pandas as pd

df = pd.read_excel(".xlsx", sheet_name='')

# Group by company ('公司') and year ('年度') and get the largest revenue ('收入') in each group
grouped = df.groupby(['公司', '年度'])
max_revenues = grouped['收入'].max()

# Row index and name of the department ('部门') with the largest revenue in each group
max_department_indexes = grouped['收入'].idxmax()
max_departments = df.loc[max_department_indexes, '部门']

# Label the top department of each company-year 'Main_industry' and all others 'Sup_industry'
# ('部门类型' = department type)
df['部门类型'] = 'Sup_industry'
for (company, year), row_index in max_department_indexes.items():
    is_main = (df['公司'] == company) & (df['年度'] == year) & (df['部门'] == max_departments.loc[row_index])
    df.loc[is_main, '部门类型'] = 'Main_industry'

# Write the result to a new Excel file
with pd.ExcelWriter(".xlsx") as writer:
    df.to_excel(writer, index=False)

# Print a completion message
print('Results have been written to result.xlsx.')
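The labelling step can be checked on a small in-memory frame without any Excel file. This is a simplified sketch: the English column names and the rows are stand-ins for illustration, and the loop is replaced by a single assignment on the rows returned by `idxmax`.

```python
import pandas as pd

# Toy data; English column names stand in for 公司, 年度, 部门, 收入.
df = pd.DataFrame({
    "company": ["A", "A", "A", "B", "B"],
    "year": [2015, 2015, 2016, 2015, 2015],
    "department": ["retail", "finance", "retail", "finance", "retail"],
    "revenue": [10, 30, 5, 8, 2],
})

# Row label of the largest revenue within each company-year.
main_rows = df.groupby(["company", "year"])["revenue"].idxmax()

# Default every row to auxiliary, then overwrite the top row of each group.
df["department_type"] = "Sup_industry"
df.loc[main_rows, "department_type"] = "Main_industry"

print(df)
```

This version marks exactly one row per company-year (the first one on ties), whereas the loop above marks every row whose department matches the top one. The two agree when each department appears once per company and year, as it does after the grouped sum in section 2.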


Origin blog.csdn.net/m0_56120502/article/details/130537454