Splitting data into subtables (based on the DataFrame data structure in Python's pandas)


We often need to extract related subtables from a table.

For example: from a table of primary resources by country, obtain the annual change in the reserves of gold, silver and copper (as the picture shows).

A simple approach is to use multiple loops.
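For comparison, the multi-loop approach might look roughly like the sketch below. The sample rows are the Shu rows from the data_Shu table shown later in this post; the nested-dict result layout is an assumption made only for illustration.

```python
import pandas as pd

# Shu rows from this post; columns are resource name, country, year, reserve
data = pd.DataFrame({
    '资源名': ['金', '金', '金', '银', '银', '银', '铜', '铜', '铜'],
    '国家': ['蜀'] * 9,
    '年份': [1960, 1961, 1962] * 3,
    '储量': [11, 12, 13, 14, 15, 16, 17, 18, 19],
})

# One loop level per key: country -> resource -> year
result = {}
for country in data['国家'].unique():
    result[country] = {}
    for resource in data['资源名'].unique():
        result[country][resource] = {}
        for year in data['年份'].unique():
            mask = ((data['国家'] == country)
                    & (data['资源名'] == resource)
                    & (data['年份'] == year))
            values = data.loc[mask, '储量']
            if not values.empty:
                result[country][resource][int(year)] = int(values.iloc[0])

print(result['蜀']['金'])
```

Every cell costs a full scan of the table here, which is part of why the loop version feels unsatisfying.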

But I am often not satisfied with that approach. I prefer to use existing functions that process the entire table at once.

 

 

Analyzing the problem above: if the data set is first split into three data sets by country (Wei, Shu, Wu), then within each one the reserve is uniquely determined by the pair (resource name, year).

It is just like df['gold']['1960'] = 11, where df is an empty dataframe generated in advance.
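A minimal sketch of that idea, using the years and resources from this example. It writes the cell with df.loc[row, column] rather than the chained df[column][row] form, because depending on the pandas version the chained form may emit a warning or fail to modify df. The English labels are illustrative only.

```python
import pandas as pd

# Empty dataframe generated in advance: years as index, resource names as columns
df = pd.DataFrame(index=[1960, 1961, 1962], columns=['gold', 'silver', 'copper'])

# Fill one cell, addressed by (row label, column label)
df.loc[1960, 'gold'] = 11

print(df)
```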

# '国家' is the country column; '蜀', '吴', '魏' are Shu, Wu, Wei
data_Shu = data[data['国家'] == '蜀']
data_Wu = data[data['国家'] == '吴']
data_Wei = data[data['国家'] == '魏']

The result of data_Shu is:  

   Resource Name  Country  Year  Reserve
0  Gold           Shu      1960  11
1  Gold           Shu      1961  12
2  Gold           Shu      1962  13
3  Silver         Shu      1960  14
4  Silver         Shu      1961  15
5  Silver         Shu      1962  16
6  Copper         Shu      1960  17
7  Copper         Shu      1961  18
8  Copper         Shu      1962  19
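With many countries, writing one filter per country gets repetitive. One possible alternative, sketched here, is to let groupby do the split and keep the pieces in a dict. The Wu row below is a made-up placeholder so that there is more than one group; only the Shu rows come from this post.

```python
import pandas as pd

data = pd.DataFrame({
    '资源名': ['金', '银', '铜', '金'],
    '国家': ['蜀', '蜀', '蜀', '吴'],
    '年份': [1960, 1960, 1960, 1960],
    '储量': [11, 14, 17, 99],  # 99 is a made-up placeholder value
})

# Map each country to its own subtable
subtables = {country: group for country, group in data.groupby('国家')}

data_Shu = subtables['蜀']
print(data_Shu)
```

Each value in the dict is an ordinary dataframe, so data_Shu can be used exactly as before.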

So the problem can be seen as generating a new dataframe from data_Shu, with the resource names as columns, the years as index, and the reserves as data.

Here I'd like to ask the experts: is there a ready-made function that does this?

I don't know of one, so I wrote such a function myself. It is called as follows:

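df = create_df_by_2col(data_Shu, -1, 2, 0)
# Meaning: from the dataframe data_Shu, use the last column ("储量", reserve) as the data to fill in,
# column 2 ("年份", year) as the index, and column 0 ("资源名", resource name) as the columns,
# to generate the new dataframe df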

The specific definition of the function is as follows:

import pandas as pd

def create_df_by_2col(dataframe, col_no_as_data, col_no_as_index, col_no_as_columns):
    # In this example dataframe.columns is ['资源名', '国家', '年份', '储量']
    # (resource name, country, year, reserve)
    # Unique values of the chosen columns become the new index and column labels
    index_from_col = dataframe.iloc[:, col_no_as_index].unique()
    columns_from_col = dataframe.iloc[:, col_no_as_columns].unique()
    # Create an empty dataframe
    df = pd.DataFrame(index=index_from_col, columns=columns_from_col)
    for row in dataframe.itertuples(index=False):
        # e.g. df.loc[1960, '金'] = 11
        # Rows are read by position, so this works for any column names,
        # and .loc avoids chained assignment such as df['金'][1960] = 11
        df.loc[row[col_no_as_index], row[col_no_as_columns]] = row[col_no_as_data]
    return df

So the result can be obtained with the following code:

df_Shu = create_df_by_2col(data_Shu, -1, 2, 0)
df_Wu  = create_df_by_2col(data_Wu, -1, 2, 0)
df_Wei = create_df_by_2col(data_Wei, -1, 2, 0)

The result of df_Shu is:

      Gold  Silver  Copper
1960  11    14      17
1961  12    15      18
1962  13    16      19

That solves the original problem.

Now suppose you want gold, silver, copper and iron as the columns, even though the original data frame contains no iron. What should you do?

def by_2col(dataframe, col_data_index, col_index, col_columns):
    # Create an empty dataframe with the requested index and columns
    df = pd.DataFrame(index=col_index, columns=col_columns)
    for index, row in dataframe.iterrows():
        row = row.tolist()
        # Column 0 is the resource name and column 2 is the year;
        # of course these numbers could also become function parameters
        df.loc[row[2], row[0]] = row[col_data_index]
    return df

col_index = [1960, 1961, 1962]
col_columns = ['金', '银', '铜', '铁']  # gold, silver, copper, iron
df2_Shu = by_2col(data_Shu, -1, col_index, col_columns)
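Another way to get the extra column, sketched under the assumption that a result like df_Shu already exists, is DataFrame.reindex: labels missing from the source (here '铁', iron) come back filled with NaN. The frame below is typed in by hand from the df_Shu result above, using the resource names as they appear in the code.

```python
import pandas as pd

# Hand-built copy of the df_Shu result (gold, silver, copper)
df_Shu = pd.DataFrame(
    {'金': [11, 12, 13], '银': [14, 15, 16], '铜': [17, 18, 19]},
    index=[1960, 1961, 1962],
)

col_index = [1960, 1961, 1962]
col_columns = ['金', '银', '铜', '铁']  # '铁' (iron) is not in df_Shu

# reindex keeps the existing cells and adds the missing labels as NaN
df2_Shu = df_Shu.reindex(index=col_index, columns=col_columns)
print(df2_Shu)
```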

Finally, I would still sincerely like to ask: is there a ready-made function that does this, that is, one that generates a new dataframe from this one with the resource names as columns, the years as index, and the reserves as data?

If you have an answer or a better way, please feel free to comment. Thank you!
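One candidate worth trying, offered as a sketch rather than a definitive answer: pandas has DataFrame.pivot, which takes the names of the columns to use as index, columns and values. Assuming each (year, resource name) pair occurs only once, as it does within a single country here, it should give the same layout as create_df_by_2col. The data below is the Shu table from this post.

```python
import pandas as pd

data_Shu = pd.DataFrame({
    '资源名': ['金', '金', '金', '银', '银', '银', '铜', '铜', '铜'],
    '国家': ['蜀'] * 9,
    '年份': [1960, 1961, 1962] * 3,
    '储量': [11, 12, 13, 14, 15, 16, 17, 18, 19],
})

# Years become the index, resource names the columns, reserves the cell values
df_Shu = data_Shu.pivot(index='年份', columns='资源名', values='储量')

# pivot sorts the column labels, so restore the order of first appearance
df_Shu = df_Shu[data_Shu['资源名'].unique()]
print(df_Shu)
```

If a pair can occur more than once, pivot raises an error; pivot_table with an aggregation function is the usual alternative in that case.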

 


Origin blog.csdn.net/Cameback_Tang/article/details/102876947