Learn Python data science, play games, learn Japanese, and engage in programming all in one package.
This time, let's learn how the data in the "Three Kingdoms" game is spliced.
In data science, processing large data sets with Python often requires merging and joining several data sets into one. pandas stores such data in Series and DataFrame objects and offers several ways to combine them. This article introduces three of them, DataFrame.merge(), DataFrame.join(), and pd.concat(), and shows how to get the most out of the combined data.
merge operation
The .merge() method combines data on common columns or indexes. It works much like a JOIN in MySQL and supports inner, left, right, and full outer joins.
Rows are matched on the values of the key columns, which allows many-to-one, one-to-many, and many-to-many connections (a many-to-many match yields the Cartesian product of the matching rows for each key).
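The difference between a many-to-one and a many-to-many match can be seen on a small example. This is only a sketch: the toy tables and the "capital", "l", and "r" columns are made up for illustration, and only the column names 名前 and 勢力 come from the article's data.

```python
import pandas as pd

# Hypothetical toy data, not the game's real tables.
officers = pd.DataFrame({
    "名前": ["A", "B", "C"],
    "勢力": ["Wei", "Wei", "Shu"],
})
factions = pd.DataFrame({
    "勢力": ["Wei", "Shu"],
    "capital": ["Xuchang", "Chengdu"],
})

# Many-to-one: several officers point at the same single faction row.
# validate raises a MergeError if the right-hand keys are not unique.
many_to_one = pd.merge(officers, factions, on="勢力", validate="many_to_one")
print(len(many_to_one))  # 3

# Many-to-many: duplicated keys on both sides produce every pairing per key.
left = pd.DataFrame({"勢力": ["Wei", "Wei"], "l": [1, 2]})
right = pd.DataFrame({"勢力": ["Wei", "Wei", "Wei"], "r": [10, 20, 30]})
many_to_many = pd.merge(left, right, on="勢力")
print(len(many_to_many))  # 6 = 2 x 3
```

The validate argument is optional. It is a cheap way to catch an accidental many-to-many merge that would silently multiply rows.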
Explanation of parameters in merge:
- how: The type of join. Accepted values include "inner" (the default), "outer", "left", and "right".
- on: The column or index-level names to join on. They must exist in both DataFrames.
- left_on and right_on: The key columns (or index levels) to use from the left and right DataFrame when the key names differ.
- left_index and right_index: Default is False. Set to True to use the index of the left or right DataFrame as the join key.
- suffixes: A pair of strings appended to overlapping column names that are not merge keys. The default is ("_x", "_y").
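The parameters above can be tried out on two small tables whose key columns have different names and which share a non-key column. This is an illustrative sketch: the data and the "faction" and "score" columns are assumptions, not part of the game files.

```python
import pandas as pd

# Hypothetical toy data; "faction" and "score" are invented column names.
officers = pd.DataFrame({
    "名前": ["A", "B"],
    "faction": ["Wei", "Shu"],
    "score": [90, 80],
})
factions = pd.DataFrame({
    "勢力": ["Wei", "Shu"],
    "score": [5, 3],
})

# The key columns are named differently, so use left_on / right_on.
# Both tables have a non-key "score" column, so suffixes tells them apart.
merged = pd.merge(
    officers, factions,
    left_on="faction", right_on="勢力",
    suffixes=("_officer", "_faction"),
)
print(merged.columns.tolist())
# ['名前', 'faction', 'score_officer', '勢力', 'score_faction']

# Same join, but the right-hand key lives in the index.
by_index = pd.merge(
    officers, factions.set_index("勢力"),
    left_on="faction", right_index=True,
    suffixes=("_officer", "_faction"),
)
print(by_index["score_faction"].tolist())  # [5, 3]
```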
merge splicing method
One picture shows how the different values of the how argument combine the two DataFrames.
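The effect of each how value can also be checked by counting rows on a small example. The sketch below uses invented toy data in which one officer has no matching faction and one faction has no officers.

```python
import pandas as pd

# Hypothetical toy data: "Nanman" exists only on the left, "Wu" only on the right.
officers = pd.DataFrame({
    "名前": ["A", "B", "C"],
    "勢力": ["Wei", "Shu", "Nanman"],
})
factions = pd.DataFrame({
    "勢力": ["Wei", "Shu", "Wu"],
    "capital": ["Xuchang", "Chengdu", "Jianye"],
})

row_counts = {}
for how in ["inner", "outer", "left", "right"]:
    result = pd.merge(officers, factions, how=how, on="勢力")
    row_counts[how] = len(result)

print(row_counts)
# {'inner': 2, 'outer': 4, 'left': 3, 'right': 3}
```

The inner join keeps only the two shared keys, left and right each keep all three rows of one side, and outer keeps the union of all four keys.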
merge example
data read
We want to join each faction to the characters that belong to it. The data consists of the two tables read below; characters with no faction in the character history table are filtered out.
import pandas as pd
country = pd.read_excel("Romance of the Three Kingdoms 13/势力列表.xlsx")
people = pd.read_excel("Romance of the Three Kingdoms 13/人物历史登入数据.xlsx")
# Drop rows without a faction, i.e. officers who are currently unaffiliated
people = people[people["勢力"]!="-"]
country.head()
people.head()
inner join
With its default parameters, merge performs an inner join: the result contains only the rows whose keys appear in both DataFrames.
To join each character to a faction, we use the faction the character belonged to last, that is, the last record for each character after grouping the history by name.
# Keep the last record of each character; tail(1) leaves the 名前 column in place
people_new = people.groupby("名前").tail(1).reset_index(drop=True)
people_new
The order of the two DataFrames in the merge call determines the column order of the result: the columns of the first DataFrame come first.
inner_merged_total = pd.merge(country,people_new,on=["勢力"])
inner_merged_total.head()
inner_merged_total = pd.merge(people_new,country,on=["勢力"])
inner_merged_total.head()
outer join
In an outer join (also called a full outer join), all rows from both DataFrames appear in the new DataFrame; where one side has no matching key, its columns are filled with NaN.
If df_A and df_B contain exactly the same set of keys, the outer join returns the same rows as the default inner join pd.merge(df_A, df_B, on=["key"]).
outer_merged = pd.merge(people_new,country,how="outer",on=["勢力"])
outer_merged.head()
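To see which side each row of an outer join came from, merge accepts indicator=True, which adds a _merge column. The sketch below uses invented toy data rather than the game tables.

```python
import pandas as pd

# Hypothetical toy data: one key exists only on each side.
officers = pd.DataFrame({
    "名前": ["A", "B", "C"],
    "勢力": ["Wei", "Shu", "Nanman"],
})
factions = pd.DataFrame({
    "勢力": ["Wei", "Shu", "Wu"],
    "capital": ["Xuchang", "Chengdu", "Jianye"],
})

outer = pd.merge(officers, factions, how="outer", on="勢力", indicator=True)

# _merge is "both", "left_only", or "right_only" for every row.
counts = outer["_merge"].value_counts().to_dict()
print(counts)

# Rows that exist only on the left have no faction details, so they are NaN.
left_only = outer[outer["_merge"] == "left_only"]
print(left_only["capital"].isna().all())  # True
```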
If we do not filter out the unaffiliated officers, every row from both tables appears in the result.
country = pd.read_excel("Romance of the Three Kingdoms 13/势力列表.xlsx")
people = pd.read_excel("Romance of the Three Kingdoms 13/人物历史登入数据.xlsx")
# Rebuild people_new from the unfiltered data so unaffiliated officers are kept
people_new = people.groupby("名前").tail(1).reset_index(drop=True)
outer_merged = pd.merge(people_new,country,how="outer",on=["勢力"])
outer_merged
left join
The merged DataFrame keeps every row of the left DataFrame (i.e., the first DataFrame passed to merge); rows of the right DataFrame whose key has no match in the left DataFrame are discarded.
left_merged = pd.merge(people_new,country,how="left",on=["勢力"])
left_merged
right join
The merged DataFrame keeps every row of the right DataFrame (i.e., the second DataFrame passed to merge); rows of the left DataFrame whose key has no match in the right DataFrame are discarded.
right_merged = pd.merge(people_new,country,how="right",on=["勢力"])
right_merged
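A right join is the mirror image of a left join: swapping the two arguments and using how="left" gives the same rows, only with the columns in a different order. The sketch below checks this on invented toy data. The helper normalize is hypothetical and exists only to make the two results comparable.

```python
import pandas as pd

# Hypothetical toy data, not the game's real tables.
officers = pd.DataFrame({
    "名前": ["A", "B", "C"],
    "勢力": ["Wei", "Shu", "Nanman"],
})
factions = pd.DataFrame({
    "勢力": ["Wei", "Shu", "Wu"],
    "capital": ["Xuchang", "Chengdu", "Jianye"],
})

right_merged = pd.merge(officers, factions, how="right", on="勢力")
left_swapped = pd.merge(factions, officers, how="left", on="勢力")

def normalize(df):
    """Put columns and rows in a fixed order so the frames can be compared."""
    cols = ["名前", "勢力", "capital"]
    return df[cols].sort_values("勢力").reset_index(drop=True)

same_rows = normalize(right_merged).equals(normalize(left_swapped))
print(same_rows)  # True
```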
join operation
The join operation is very similar to merge: it also combines data on columns or indexes. Calling df.join(other) is like calling merge with df as the first (left) DataFrame, and by default it performs a left join on the index. Columns with conflicting names can be renamed by defining suffixes.
The result is very similar to the left and right merges above.
Parameter explanation in join:
- other: The DataFrame (or Series, or list of DataFrames) to join.
- on: An optional column or index-level name in the left DataFrame to match against the index of other. If None (the default), the join is index on index.
- how: Same options as how in merge, but the default is "left".
- lsuffix and rsuffix: Similar to suffixes in merge(); appended to overlapping column names from the left and right DataFrame.
- sort: Whether to sort the resulting DataFrame by the join key. Default is False.
join example
people_new.join(country, lsuffix="left", rsuffix="right")
Because no on column is given, this simply places the two DataFrames side by side, aligned on their index.
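To make join match on a key instead of on row position, one option is to move the key of the other DataFrame into its index and pass on=. The sketch below contrasts the two behaviors on invented toy data.

```python
import pandas as pd

# Hypothetical toy data, not the game's real tables.
officers = pd.DataFrame({
    "名前": ["A", "B", "C"],
    "勢力": ["Wei", "Shu", "Wei"],
})
factions = pd.DataFrame({
    "勢力": ["Wei", "Shu"],
    "capital": ["Xuchang", "Chengdu"],
})

# Index-on-index join: rows are paired by their labels 0, 1, 2.
# Row 2 has no partner in factions, so its right-hand columns are NaN.
by_position = officers.join(factions, lsuffix="_left", rsuffix="_right")
print(by_position.columns.tolist())
# ['名前', '勢力_left', '勢力_right', 'capital']

# Key-based join: the officers' 勢力 column is matched against the
# index of the other frame.
by_key = officers.join(factions.set_index("勢力"), on="勢力")
print(by_key["capital"].tolist())
# ['Xuchang', 'Chengdu', 'Xuchang']
```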
concat operation
concat is more flexible: it can combine data horizontally as well as vertically.
Vertical splicing operation
Horizontal splicing operation
Parameter explanation in concat:
- objs: The objects to concatenate: a list (or other sequence) or a dict of Series or DataFrame objects.
- axis: The axis to concatenate along. The default is 0 (rows, vertical stacking); 1 concatenates along columns (horizontal).
- join: Similar to the how parameter in merge, but only accepts "inner" or "outer" (the default). It controls how labels on the other axis are handled.
- ignore_index: Defaults to False. If True, the original index values along the concatenation axis are discarded and replaced with 0, 1, ..., n-1.
- keys: Builds a hierarchical index whose outer level records which original object each row came from.
- copy: Whether to copy the source data, the default value is True.
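The join, ignore_index, and keys parameters are easiest to understand on two small frames whose columns only partly overlap. The following is a sketch on invented data; the column names are assumptions.

```python
import pandas as pd

# Hypothetical toy data: the frames share only the "name" column.
a = pd.DataFrame({"name": ["x", "y"], "value": [1, 2]})
b = pd.DataFrame({"name": ["z"], "price": [3]})

# Default join="outer": union of columns, missing cells become NaN.
# ignore_index=True renumbers the rows 0..n-1.
stacked = pd.concat([a, b], ignore_index=True)
print(stacked.columns.tolist())  # ['name', 'value', 'price']
print(stacked.index.tolist())    # [0, 1, 2]

# join="inner": only the columns present in every frame survive.
inner = pd.concat([a, b], join="inner", ignore_index=True)
print(inner.columns.tolist())    # ['name']

# keys adds an outer index level recording the source of each row.
keyed = pd.concat([a, b], keys=["first", "second"])
print(keyed.loc["second"])
```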
concat example
We use the treasure (item) data from the game as an example; it has 74 rows.
import pandas as pd
items = pd.read_excel("Romance of the Three Kingdoms 13/道具列表.xlsx")
items.head()
After horizontal concatenation, the number of rows stays at 74 and the columns are duplicated.
pd.concat([items, items], axis=1)
After vertical concatenation, the number of rows doubles to 148.
pd.concat([items, items], axis=0)
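The row and column counts can be confirmed with .shape. Since the Excel file is not available here, the sketch below uses a made-up 74-row stand-in whose column names are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the 74-row item table.
items = pd.DataFrame({"id": range(74), "value": range(74)})

wide = pd.concat([items, items], axis=1)  # side by side
tall = pd.concat([items, items], axis=0)  # stacked

print(wide.shape)  # (74, 4)
print(tall.shape)  # (148, 2)

# Stacking keeps the original labels, so every index label appears twice.
print(tall.index.is_unique)  # False
```

Passing ignore_index=True to the vertical concat would give the stacked frame a fresh 0 to 147 index instead.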
append example
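append is another way to combine DataFrames; it behaves like vertical concatenation with concat. It does not modify the DataFrame in place, so the returned result must be assigned to a variable to take effect. Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the lines below only run on older versions.
Note the difference between the following two append calls: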
items.append(items)
items
items = items.append(items)
items
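On pandas 2.0 and later, where DataFrame.append no longer exists, the same comparison can be sketched with pd.concat. The small items table below is a made-up stand-in for the real item data.

```python
import pandas as pd

# Hypothetical stand-in for the item table.
items = pd.DataFrame({"name": ["a", "b"]})

# The result is not assigned, so items itself is unchanged.
pd.concat([items, items])
print(len(items))  # 2

# Assigning the result back is what actually grows the table.
items = pd.concat([items, items], ignore_index=True)
print(len(items))  # 4
```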