1. Preface and data format introduction
In economics and management research, data structured like the following is common:
Name | Year | IncomeType | Income |
A | 2015 | Investment and wealth management income | 123 |
A | 2015 | Salary income | 124 |
A | 2015 | Labor remuneration income | 125 |
A | 2015 | Other income | 126 |
A | 2016 | Salary income | 213 |
A | 2016 | Labor remuneration income | 214 |
A | 2016 | Other income | 521 |
A | 2017 | Salary income | 213 |
A | 2018 | Salary income | 322 |
A | 2018 | Labor remuneration income | 123 |
.... | .... | .... | .... |
B | 2015 | Investment and wealth management income | 122 |
The table above is in "questionnaire format": each row records the income that a resident (here, A) obtained from one income source in a given year, as collected in a questionnaire (the figures are made up). The same layout appears for listed companies, with columns for company, year, income category (or expense category), and income (or expense). The categories may even be split into numbered sub-items:
Name | Year | IncomeType | Income |
A | 2015 | Investment wealth management income 1 | 123 |
A | 2015 | Investment wealth management income 2 | 124 |
A | 2015 | Labor remuneration income 1 | 125 |
.... | .... | .... | .... |
So how can we process this kind of data to get the total income grouped by Name, Year, and IncomeType? And how can we build panel data of annual total income keyed by Name, Year, and IncomeType? The sections below answer these questions with Python.
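As a first minimal sketch of both questions, the snippet below rebuilds a few rows in the questionnaire format, collapses the numbered sub-items into their category, sums the income, and pivots the result into a panel. It assumes that sub-items differ only by a trailing number; the 2016 row, the row for resident B, and the column name `TotalIncome` are made up for illustration.

```python
import pandas as pd

# Toy rows in the questionnaire format shown above; the 2016 row and the
# row for resident B are hypothetical.
raw = pd.DataFrame(
    {
        "Name": ["A", "A", "A", "A", "B"],
        "Year": [2015, 2015, 2015, 2016, 2015],
        "IncomeType": [
            "Investment wealth management income 1",
            "Investment wealth management income 2",
            "Labor remuneration income 1",
            "Labor remuneration income 1",
            "Investment wealth management income 1",
        ],
        "Income": [123, 124, 125, 214, 122],
    }
)

# Assumption: sub-items differ only by a trailing number, so stripping it
# recovers the income category.
raw["IncomeType"] = raw["IncomeType"].str.replace(r"\s*\d+$", "", regex=True)

# Long format: one row per (Name, Year, IncomeType) with the summed income.
totals = raw.groupby(["Name", "Year", "IncomeType"], as_index=False)["Income"].sum()

# Panel (wide) format: one row per (Name, Year), one column per category.
panel = totals.pivot_table(
    index=["Name", "Year"],
    columns="IncomeType",
    values="Income",
    aggfunc="sum",
    fill_value=0,
)
panel["TotalIncome"] = panel.sum(axis=1)  # annual total across categories
panel = panel.reset_index()

print(totals)
print(panel)
```

Categories that a resident did not report in a given year show up as 0 in the panel because of `fill_value=0`; dropping that argument would leave them as missing values instead.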
2. Code introduction
The following code answers the questions above. With small changes it can compute total income and regroup the data by sector quickly. It is written for listed-company data, so its columns are code, stkcd, year, Sectors, and revenue rather than Name, Year, IncomeType, and Income. When classifying sectors (or income sources), some items may be difficult to classify and may need to be removed, so the code also computes the totals with those items excluded.
import pandas as pd
# Read the Excel data
df = pd.read_excel(".xlsx", engine='openpyxl')
# Check which Python types occur in the revenue column (for example, str mixed with float)
print(df['revenue'].apply(type).value_counts())
# Sum revenue by company, year, and sector
grouped_df = df.groupby(['code', 'stkcd', 'year', 'Sectors'])['revenue'].sum().reset_index()
# Compute each company's total revenue per year across all sectors
# (公司年度总收入 = company's total annual revenue)
total_income = df.groupby(['code', 'stkcd', 'year'])['revenue'].sum().reset_index()
total_income.columns = ['code', 'stkcd', 'year', '公司年度总收入']
# Compute the same total after dropping the "不便于分类" (difficult to classify) sector
# (剔除不便于分类后公司年度总收入 = total annual revenue excluding that sector)
df_without_unclassified = df[df['Sectors'] != '不便于分类']
total_income_without_unclassified = df_without_unclassified.groupby(['code', 'stkcd', 'year'])['revenue'].sum().reset_index()
total_income_without_unclassified.columns = ['code', 'stkcd', 'year', '剔除不便于分类后公司年度总收入']
# Merge both totals onto the per-sector sums
final_df = pd.merge(grouped_df, total_income, on=['code', 'stkcd', 'year'])
final_df = pd.merge(final_df, total_income_without_unclassified, on=['code', 'stkcd', 'year'], how='left')
# Save to a new Excel file
final_df.to_excel(".xlsx", index=False)
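The type check at the top of the code above hints that the revenue column may mix text and numbers. The sketch below is one possible way to handle that and to attach both totals without merging, using `groupby(...).transform`. It is an alternative under stated assumptions, not the code above: the rows are made up, only `code` is used as the company key, and the names `total`, `classified_revenue`, and `total_classified` are hypothetical.

```python
import pandas as pd

# Made-up rows; "不便于分类" is the "difficult to classify" label used above.
df = pd.DataFrame(
    {
        "code": ["c1", "c1", "c1", "c2"],
        "year": [2015, 2015, 2015, 2015],
        "Sectors": ["Retail", "Logistics", "不便于分类", "Retail"],
        "revenue": ["100", 50, 25, "n/a"],  # text and numbers mixed on purpose
    }
)

# Coerce text to numbers; entries that cannot be parsed become NaN,
# which sum() skips.
df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")

# Revenue per company, year, and sector.
by_sector = df.groupby(["code", "year", "Sectors"], as_index=False)["revenue"].sum()

# Company-year total, broadcast back onto every sector row.
keys = ["code", "year"]
by_sector["total"] = by_sector.groupby(keys)["revenue"].transform("sum")

# Same total with the unclassified sector counted as 0.
is_classified = by_sector["Sectors"] != "不便于分类"
by_sector["classified_revenue"] = by_sector["revenue"].where(is_classified, 0)
by_sector["total_classified"] = by_sector.groupby(keys)["classified_revenue"].transform("sum")

print(by_sector)
```

One difference from the merge-based version: a company-year that contains only the unclassified sector gets 0 here, whereas the left merge leaves a missing value for it.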
3. Further processing
We may also want to find the largest income item in each group, label it as the main income source, and label the others as auxiliary income sources. The following code builds on the code above to do this.
import pandas as pd
# Columns: 公司 = company, 年度 = year, 部门 = department, 收入 = revenue
df = pd.read_excel(".xlsx", sheet_name='')
# For each company and year, find the row label of the department with the largest revenue
max_department_indexes = df.groupby(['公司', '年度'])['收入'].idxmax()
# Label that department 'Main_industry' and every other department 'Sup_industry'
# (部门类型 = department type; assumes one row per company, year, and department)
df['部门类型'] = 'Sup_industry'
df.loc[max_department_indexes, '部门类型'] = 'Main_industry'
# Write the result to a new Excel file
with pd.ExcelWriter(".xlsx") as writer:
    df.to_excel(writer, index=False)
# Print a completion message
print('Results have been written to result.xlsx.')
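To make the labeling concrete, here is a small worked sketch on resident A's rows from the sample table in section 1. It uses that table's column names instead of the Excel columns, and the column name `IncomeRole` is a hypothetical choice for illustration.

```python
import pandas as pd

# Resident A's rows from the sample table in section 1.
df = pd.DataFrame(
    {
        "Name": ["A"] * 10,
        "Year": [2015, 2015, 2015, 2015, 2016, 2016, 2016, 2017, 2018, 2018],
        "IncomeType": [
            "Investment and wealth management income",
            "Salary income",
            "Labor remuneration income",
            "Other income",
            "Salary income",
            "Labor remuneration income",
            "Other income",
            "Salary income",
            "Salary income",
            "Labor remuneration income",
        ],
        "Income": [123, 124, 125, 126, 213, 214, 521, 213, 322, 123],
    }
)

# idxmax returns the row label of the largest Income in each (Name, Year)
# group; on ties it keeps the first occurrence.
main_rows = df.groupby(["Name", "Year"])["Income"].idxmax()

# Default every row to auxiliary, then flag the largest item per group.
df["IncomeRole"] = "Sup_industry"
df.loc[main_rows, "IncomeRole"] = "Main_industry"

print(df[df["IncomeRole"] == "Main_industry"])
```

With these figures the main source is Other income in 2015 and 2016 and Salary income in 2017 and 2018. If two items tie for the largest value, only the first one is flagged, which may or may not be the desired behavior.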