Data Science Essentials: A Practical Summary of Pandas Data Splicing Operations

Learn Python data science, play games, learn Japanese, and engage in programming all in one package.

This time, let's learn how the data in the "Three Kingdoms" game is spliced.

In data science, processing large data sets in Python often requires merging and joining to integrate several data sets into one. The data types involved are Series and DataFrame, and pandas offers several ways to combine them. This article introduces three: .merge(), .join(), and .concat().

merge operation

The .merge() method combines data on common columns or indexes. It is similar to the JOIN operation in MySQL and supports inner, left, right, and full outer joins.

Joining on key columns supports many-to-one, one-to-many, and many-to-many connections. In the many-to-many case, each key produces the Cartesian product of its matching rows.
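
As a rough illustration of these relationships, here is a minimal sketch on toy data. The values and the "capital" and "city" columns are made up for the example and are not taken from the game's spreadsheets.

```python
import pandas as pd

# Toy data (hypothetical values, not from the game's spreadsheets)
officers = pd.DataFrame({
    "名前": ["曹操", "夏侯惇", "劉備"],
    "勢力": ["魏", "魏", "蜀"],
})
factions = pd.DataFrame({
    "勢力": ["魏", "蜀"],
    "capital": ["許昌", "成都"],
})

# Many-to-one: several officers map to a single faction row.
# validate raises MergeError if the right-hand keys are not unique.
many_to_one = pd.merge(officers, factions, on="勢力", validate="many_to_one")
print(len(many_to_one))  # 3

# Many-to-many: duplicate keys on both sides give every pairing per key.
cities = pd.DataFrame({
    "勢力": ["魏", "魏", "蜀"],
    "city": ["許昌", "鄴", "成都"],
})
many_to_many = pd.merge(officers, cities, on="勢力")
print(len(many_to_many))  # 2*2 rows for 魏 + 1*1 row for 蜀 = 5
```

The validate argument is optional; it is used here only to make the expected relationship explicit.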

Explanation of parameters in merge:

  • how: Defines the merge method. Accepted values include "inner" (the default), "outer", "left", and "right".
  • on: The column or index level names to join on. They must be present in both DataFrames.
  • left_on and right_on: The columns or index levels to use as keys in the left and right DataFrame respectively, for when the key has a different name on each side.
  • left_index and right_index: Default is False. Set to True to use the index of that DataFrame as the merge key.
  • suffixes: A tuple of strings appended to column names that appear in both DataFrames and are not merge keys. The default is ("_x", "_y").
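
The parameters above can be sketched on toy data as follows. The column names "faction_name" and "leader" and all values are assumptions made for this example only.

```python
import pandas as pd

# Toy data with hypothetical column names ("faction_name", "leader")
officers = pd.DataFrame({
    "名前": ["曹操", "劉備"],
    "勢力": ["魏", "蜀"],
    "leader": [True, True],
})
factions = pd.DataFrame({
    "faction_name": ["魏", "蜀"],
    "leader": ["曹操", "劉備"],
})

# left_on/right_on: the key column has a different name on each side.
# suffixes: the non-key column "leader" exists in both frames, so it is renamed.
merged = pd.merge(
    officers, factions,
    how="left",
    left_on="勢力", right_on="faction_name",
    suffixes=("_officer", "_faction"),
)
print(merged.columns.tolist())

# left_index/right_index: use the index instead of a column as the key.
by_index = pd.merge(
    officers.set_index("勢力"), factions.set_index("faction_name"),
    left_index=True, right_index=True,
    suffixes=("_officer", "_faction"),
)
print(by_index.columns.tolist())
```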

merge splicing method

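The how argument controls which keys survive the merge: "inner" keeps keys found in both DataFrames, "outer" keeps keys found in either one, "left" keeps every key from the left DataFrame, and "right" keeps every key from the right DataFrame.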

merge example

data read

We want to join each character to the faction they belong to. Two spreadsheets are read: the faction list and the character login history. Characters with no faction in the login history are removed.

import pandas as pd
country = pd.read_excel("Romance of the Three Kingdoms 13/势力列表.xlsx")
people = pd.read_excel("Romance of the Three Kingdoms 13/人物历史登入数据.xlsx")

# Drop rows without a faction ("-"), i.e. officers who are currently unaffiliated
people = people[people["勢力"] != "-"]

country.head()

people.head()

inner join

With its default parameters, merge performs an inner join: the result contains only rows whose key appears in both DataFrames.

We join each character to the faction they belong to. A character may change faction over time, so we take the faction they finally belong to, that is, the last record for each character in the login history.

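# Keep the last record per character. tail(1) keeps all original columns;
# groupby(...).nth(-1) no longer moves 名前 into the index since pandas 2.0.
people_new = people.groupby("名前").tail(1)
people_new = people_new.reset_index(drop=True)
people_new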
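
To see what "last record per character" means in practice, here is a small sketch on a toy login history. The rows are hypothetical, and groupby(...).tail(1) is one possible way to do it, not necessarily the only one.

```python
import pandas as pd

# Toy login history (hypothetical rows): 劉備 changes faction over time.
history = pd.DataFrame({
    "名前": ["劉備", "曹操", "劉備"],
    "勢力": ["公孫瓚", "魏", "蜀"],
})

# Keep only the final record per character, preserving the original row order.
latest = history.groupby("名前").tail(1).reset_index(drop=True)
print(latest)
```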

The order of the DataFrames passed to merge determines the column order of the result: the left DataFrame's columns come first.

inner_merged_total = pd.merge(country,people_new,on=["勢力"])
inner_merged_total.head()

inner_merged_total = pd.merge(people_new,country,on=["勢力"])
inner_merged_total.head()
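
The effect of argument order can be checked on a small example. The toy frames below are hypothetical stand-ins for the two spreadsheets.

```python
import pandas as pd

# Toy stand-ins for the character and faction tables (hypothetical values)
officers = pd.DataFrame({"名前": ["曹操", "劉備"], "勢力": ["魏", "蜀"]})
factions = pd.DataFrame({"勢力": ["蜀", "魏"], "capital": ["成都", "許昌"]})

# The left argument's columns come first in the result.
a = pd.merge(factions, officers, on="勢力")
b = pd.merge(officers, factions, on="勢力")
print(a.columns.tolist())  # ['勢力', 'capital', '名前']
print(b.columns.tolist())  # ['名前', '勢力', 'capital']
```

Both results contain the same matched pairs; only the column layout (and the row order, which follows the left DataFrame) differs.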

outer join

In an outer join (also called a full outer join), all rows from both DataFrames will appear in the new DataFrame.

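If every key value appears in both DataFrames, the outer join returns the same rows as the default inner join pd.merge(df_A, df_B, on=["key"]). The results differ only when some keys have no match, in which case the columns from the missing side are filled with NaN.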

outer_merged = pd.merge(people_new,country,how="outer",on=["勢力"])
outer_merged.head()
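
A minimal sketch of how an outer join treats unmatched keys, using toy data with hypothetical values. The indicator=True argument of merge adds a _merge column that records where each row came from.

```python
import pandas as pd

# Toy data (hypothetical): 呂布 has no faction, 蜀 has no officer listed.
officers = pd.DataFrame({"名前": ["曹操", "呂布"], "勢力": ["魏", "-"]})
factions = pd.DataFrame({"勢力": ["魏", "蜀"], "capital": ["許昌", "成都"]})

outer = pd.merge(officers, factions, how="outer", on="勢力", indicator=True)
inner = pd.merge(officers, factions, on="勢力")
print(outer)
print(len(inner), len(outer))  # 1 3
```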

If we do not remove the unaffiliated officers, every row from both tables appears in the result.

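country = pd.read_excel("Romance of the Three Kingdoms 13/势力列表.xlsx")
people = pd.read_excel("Romance of the Three Kingdoms 13/人物历史登入数据.xlsx")
# Rebuild people_new from the unfiltered data so unaffiliated officers ("-") are kept
people_new = people.groupby("名前").tail(1).reset_index(drop=True)
outer_merged = pd.merge(people_new, country, how="outer", on=["勢力"])
outer_merged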

left join

The merged DataFrame keeps all rows from the left DataFrame (i.e., the first DataFrame passed to merge), while rows in the right DataFrame whose key has no match in the left DataFrame are discarded.

left_merged = pd.merge(people_new,country,how="left",on=["勢力"])
left_merged

right join

The merged DataFrame keeps all rows from the right DataFrame (i.e., the second DataFrame passed to merge), while rows in the left DataFrame whose key has no match in the right DataFrame are discarded.

right_merged = pd.merge(people_new,country,how="right",on=["勢力"])
right_merged 
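
The difference between the left and right joins can be seen side by side on toy data. The values below are hypothetical and chosen so that each side has one unmatched key.

```python
import pandas as pd

# Toy data (hypothetical): "-" exists only on the left, 蜀 only on the right.
officers = pd.DataFrame({"名前": ["曹操", "呂布"], "勢力": ["魏", "-"]})
factions = pd.DataFrame({"勢力": ["魏", "蜀"], "capital": ["許昌", "成都"]})

left = pd.merge(officers, factions, how="left", on="勢力")
right = pd.merge(officers, factions, how="right", on="勢力")
print(left["勢力"].tolist())   # ['魏', '-']
print(right["勢力"].tolist())  # ['魏', '蜀']
```

In the left join, 呂布 is kept with a missing capital; in the right join, 蜀 is kept with a missing 名前.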

join operation

The join operation is very similar to merge: it combines data on columns or indexes. join is a DataFrame method, so the calling DataFrame always acts as the left (first) DataFrame of the merge. Columns with conflicting names can be renamed by defining suffixes.

The result is very similar to the left and right merges above.

Parameter explanation in join:

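  • other: The DataFrame (or Series, or list of DataFrames) to join with.
  • on: An optional column or index level name in the calling (left) DataFrame, matched against the index of other. If None (the default), the join is index on index.
  • how: Same as how in merge, except that the default is "left". If no column is specified with on, the join is performed on the index.
  • lsuffix and rsuffix: Suffixes added to overlapping column names from the left and right side, similar to suffixes in merge().
  • sort: Whether to sort the resulting DataFrame by the join key. The default is False.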

join example

people_new.join(country, lsuffix="left", rsuffix="right")

With no on argument, this simply places the two DataFrames side by side, aligned on their row index.
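
The sketch below contrasts the default index alignment with a join on a key column. It uses toy data with hypothetical values, and setting 勢力 as the index of the right DataFrame is a choice made for this example.

```python
import pandas as pd

# Toy data (hypothetical); note the factions are listed in a different order.
officers = pd.DataFrame({"名前": ["曹操", "劉備"], "勢力": ["魏", "蜀"]})
factions = pd.DataFrame({"勢力": ["蜀", "魏"], "capital": ["成都", "許昌"]})

# Default: align on the row index (0, 1), regardless of the 勢力 values.
by_position = officers.join(factions, lsuffix="_left", rsuffix="_right")
print(by_position)

# on="勢力": match the caller's 勢力 column against the index of other.
by_key = officers.join(factions.set_index("勢力"), on="勢力")
print(by_key)
```

In by_position, 曹操 ends up next to the 蜀 row because only the index is compared; by_key pairs each officer with the matching faction.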

concat operation

Concat is more flexible in operation, and can perform horizontal splicing operations as well as vertical splicing operations.

Vertical splicing (axis=0) stacks the rows of one DataFrame below the other. Horizontal splicing (axis=1) places the columns of the DataFrames side by side.

Explanation of parameters in concat:

  • objs: The data objects to concatenate: a sequence (such as a list) or a dict of Series or DataFrame objects.
  • axis: The axis to concatenate along. The default is 0 (rows, vertical splicing); 1 concatenates along columns (horizontal splicing).
  • join: Similar to the how parameter in merge, but only accepts "inner" or "outer" (the default).
  • ignore_index: Defaults to False. If True, the original index values along the concatenation axis are discarded and replaced with 0, 1, ..., n-1.
  • keys: Builds a hierarchical index that records which original data set each row came from.
  • copy: Whether to copy the source data. The default value is True.
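
These parameters can be tried on two small frames. The item names, the "value" and "kind" columns, and the numbers are all made up for illustration.

```python
import pandas as pd

# Toy item tables (hypothetical values and column names)
a = pd.DataFrame({"item": ["青龍偃月刀", "的盧"], "value": [90, 80]})
b = pd.DataFrame({"item": ["孟徳新書"], "kind": ["book"]})

rows = pd.concat([a, b], axis=0)                   # stack rows, union of columns
cols = pd.concat([a, b], axis=1)                   # side by side, aligned on index
renumbered = pd.concat([a, b], ignore_index=True)  # fresh index 0..2
shared = pd.concat([a, b], join="inner")           # only columns present in both
labeled = pd.concat([a, b], keys=["a", "b"])       # outer index level marks the source

print(rows.shape, cols.shape)     # (3, 3) (2, 4)
print(renumbered.index.tolist())  # [0, 1, 2]
print(shared.columns.tolist())    # ['item']
print(labeled.loc["b"])           # the rows that came from b
```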

concat example

We use the treasure (item) data from Romance of the Three Kingdoms 13, which has 74 rows.

import pandas as pd
items  = pd.read_excel("Romance of the Three Kingdoms 13/道具列表.xlsx")
items.head()

After horizontal splicing, the number of rows stays at 74, while every column appears twice.

pd.concat([items, items], axis=1)

After vertical splicing, the number of rows doubles to 148.

pd.concat([items, items], axis=0)

append example

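Append is another way to splice DataFrame data, and it behaves like the vertical splicing of concat. It returns a new DataFrame instead of modifying the original in place, so the result must be assigned to a variable to take effect. Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; on current versions, use pd.concat instead.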

Note the difference between the following 2 append lines

# Requires pandas < 2.0. The result is discarded, so items is unchanged.
items.append(items)
items

# Requires pandas < 2.0. Reassigning keeps the result, so items now has twice the rows.
items = items.append(items)
items
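
Because DataFrame.append is no longer available in pandas 2.0 and later, the same vertical splice can be written with pd.concat. This is a minimal sketch on a toy table with hypothetical values, not the game's item list.

```python
import pandas as pd

# Toy item table (hypothetical values)
items = pd.DataFrame({"item": ["青龍偃月刀", "的盧"]})

# Like append, concat returns a new DataFrame; items itself is unchanged.
doubled = pd.concat([items, items], ignore_index=True)
print(len(items), len(doubled))  # 2 4
```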

Origin blog.csdn.net/qq_20288327/article/details/124269511