Ali Tianchi Task 04: Python Data Analysis: A Hands-On Data Analysis Project from Scratch (Part 4)


Learning content: Python data analysis


Python training camp content:


Title and content introduction section


Pokémon data analysis: the strongest Pokémon for ordinary trainers

The arrival of the data age has changed how people explore the unknown, from basic energy infrastructure to aerospace engineering. Professor Oak, who has been tirelessly studying Pokémon at his laboratory in Pallet Town in the Kanto region, is no exception. In the anime, we often see the Pokédex he created giving explorers a simple profile of each Pokémon, including its height, weight, and characteristics. But as someone who has dreamed since childhood of joining Professor Oak's laboratory and becoming the strongest trainer and Pokémon research master in the Pokémon League, I find that analyzing the data of one Pokémon at a time is far from enough.

Unlike other explorers, who travel from gym to gym to take on challenges, I decided to use data analysis to better understand these magical creatures and then pick the most economical and easiest-to-obtain Pokémon to challenge the League. Using a search engine, I found a data set covering all 801 Pokémon from the first through the seventh generation. And because the Professor's laboratory does not have generous research funding, I chose the DSW Explorer version, which offers free computing resources and comes with many commonly used data analysis libraries pre-installed, to carry out the analysis.


Data set download


# Download the data set
!wget -O pokemon_data.csv https://pai-public-data.oss-cn-beijing.aliyuncs.com/pokemon/pokemon.csv

Next we import the three most commonly used libraries, Pandas, Seaborn, and Matplotlib, and read the data:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("./pokemon_data.csv")

First, let's look at the size of the data, which df.shape reports. df.info() gives more detailed information about each column. The data set contains 801 rows and 41 columns: there are 801 Pokémon in total, and each one is described by 41 features.
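
To make the difference between the two calls concrete, here is a minimal sketch on a tiny hand-made frame. The rows, names, and values in toy are made up for illustration and are not taken from the real data set.

```python
import io
import pandas as pd

# Toy stand-in for the real data set (all values are invented for illustration).
toy = pd.DataFrame({
    "name": ["mon_a", "mon_b", "mon_c"],
    "type1": ["grass", "fire", "water"],
    "type2": ["poison", None, None],
    "hp": [45, 39, 44],
})

# shape is an attribute, not a method: (number of rows, number of columns).
n_rows, n_cols = toy.shape
print(n_rows, n_cols)

# info() prints column names, non-null counts and dtypes and returns None,
# so we capture the printed text in a buffer to be able to inspect it.
buffer = io.StringIO()
toy.info(buf=buffer)
info_text = buffer.getvalue()
print(info_text)
```

On the real data, the non-null counts printed by info() are already a first hint about which columns have missing values.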

Then comes our first question: with so many features, is any data missing? After all, some Pokémon are rather mysterious, and even Professor Oak may not know everything about them. We can check how much of each feature is missing with the following code:

# Compute the percentage of missing values for each feature
percent_missing = df.isnull().sum() * 100 / len(df)
missing_value_df = pd.DataFrame({
    'column_name': df.columns,
    'percent_missing': percent_missing
})
# Show the 10 features with the most missing values
missing_value_df.sort_values(by='percent_missing', ascending=False).head(10)

From the output above, we can see that type2 has the highest missing rate, at about 48%. In other words, nearly half of all Pokémon have only one type, and the rest have two.
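
The missing rate of type2 can be read directly as the share of Pokémon with a single type. A minimal sketch of that reading, using a made-up four-row frame (toy and its values are assumptions for illustration only):

```python
import pandas as pd

# Toy data: two entries have no second type (values invented for illustration).
toy = pd.DataFrame({
    "name": ["mon_a", "mon_b", "mon_c", "mon_d"],
    "type1": ["grass", "fire", "water", "bug"],
    "type2": ["poison", None, None, "flying"],
})

# isnull() marks missing type2 entries; the mean of a boolean Series
# is the fraction of True values, i.e. the share of single-type rows.
single_type_share = toy["type2"].isnull().mean()
dual_type_share = toy["type2"].notnull().mean()

print(f"single-type: {single_type_share:.0%}, dual-type: {dual_type_share:.0%}")
```

Applied to the real df, the same two lines would give the single-type and dual-type shares without building the full missing-value table.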

The second question is: how many Pokémon are there in each generation? df['generation'].value_counts() gives the answer directly. To show the differences between generations more intuitively, we can use pandas' built-in plotting to draw a bar chart:

# Number of Pokémon in each generation
df['generation'].value_counts().plot.bar()
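
One detail worth knowing: value_counts() orders its result by count, largest first, so the bars do not appear in generation order. A small sketch of how sort_index() restores that order, on made-up generation labels (toy is an assumption for illustration):

```python
import pandas as pd

# Toy generation labels (invented): three from gen 1, one from gen 2, two from gen 3.
toy = pd.DataFrame({"generation": [1, 1, 1, 2, 3, 3]})

# value_counts() sorts by frequency, largest first.
by_count = toy["generation"].value_counts()

# sort_index() reorders the same counts by generation number instead.
by_generation = by_count.sort_index()

print(by_count)
print(by_generation)
```

Calling .plot.bar() on by_generation instead of by_count would draw the bars from the first generation to the last, which can make trends over time easier to see.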

It is easy to see that generation 5 has the most Pokémon and generation 6 the fewest. Next we look at the distribution of primary types. We can start with a simple hypothesis: for example, that Bug is among the most common types, because Bug Pokémon appear frequently in the anime and many of them evolve.

# Number of Pokémon of each primary type
df['type1'].value_counts().sort_values(ascending=True).plot.barh()

Here we have drawn the bar chart horizontally to make it easier to read. The most common primary type is Water, followed by Normal and then Grass. Bug ranks only fourth, lower than expected.

After looking at some basic distributions, we can move on to a simple correlation analysis. The following code generates a correlation heatmap:

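# Correlation heatmap of all numeric features
fig, ax = plt.subplots(figsize=(20, 15))
ax.set_title("Correlation Heatmap")
# Keep only numeric columns: recent pandas versions raise an error
# if corr() is called on a frame that still contains text columns
corr = df.select_dtypes(include='number').corr()
sns.heatmap(corr,
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values,
            ax=ax)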

Understanding how the features relate to one another helps us understand Pokémon themselves. For example, attack is positively correlated with height_m, so taller Pokémon tend to have higher attack. Looking at height_m again, we find that it is negatively correlated with base_happiness, which suggests another playful conclusion: Pokémon that grow tall may not be happy. Keep in mind that a correlation only describes a tendency in the data and does not show that one feature causes the other.
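
Reading individual values off a large heatmap can be hard, so it may help to pull out the correlations of one feature as numbers. A minimal sketch, assuming a made-up five-row frame (toy and its values are invented and only mimic the direction of the relationships described above):

```python
import pandas as pd

# Toy data, invented so that attack rises and base_happiness falls with height.
toy = pd.DataFrame({
    "name": ["mon_a", "mon_b", "mon_c", "mon_d", "mon_e"],
    "height_m": [0.5, 1.0, 1.5, 2.0, 2.5],
    "attack": [40, 55, 70, 80, 100],
    "base_happiness": [70, 70, 50, 50, 35],
})

# Restrict to numeric columns before computing Pearson correlations.
corr = toy.select_dtypes(include="number").corr()

# Correlations of every other feature with height_m, strongest positive first.
height_corr = corr["height_m"].drop("height_m").sort_values(ascending=False)
print(height_corr)
```

The same pattern on the real df lists, for any chosen column, which features move with it and which move against it.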

Next, we look at the data from the perspective of actual battles. Here we focus only on the six base stats: HP, attack, defense, special attack, special defense, and speed. Setting type matchups aside, these six base stats are what determine a Pokémon's combat strength.

interested = ['hp','attack','defense','sp_attack','sp_defense','speed']
sns.pairplot(df[interested])

Here we can see that most pairs of stats are positively correlated: when one stat is higher, the others tend to be higher too. The correlation heatmap shows the same thing:

# Correlation heatmap of the six base stats
fig, ax = plt.subplots(figsize=(10, 8))
ax.set_title("Correlation Heatmap")
corr = df[interested].corr()
# annot=True writes each coefficient in its cell; ".2f" keeps two decimals
sns.heatmap(corr,
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values,
            annot=True, fmt=".2f", cmap="YlGnBu",
            ax=ax)

With this in hand, we can calculate each Pokémon's race value (the total of its six base stats) and pick a Pokémon for us ordinary trainers. After all, not everyone can capture legendary Pokémon such as Deoxys, Mewtwo, and Mew. We first convert the feature types and then compute the total:

for c in interested:
    df[c] = df[c].astype(float)
df = df.assign(total_stats = df[interested].sum(axis=1)) 
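
As a quick sanity check of the row-wise sum, here is a small worked example. The two rows in toy are made-up numbers, chosen so the totals are easy to verify by hand.

```python
import pandas as pd

interested = ["hp", "attack", "defense", "sp_attack", "sp_defense", "speed"]

# Two invented Pokémon with easy-to-add stats.
toy = pd.DataFrame(
    [[50, 60, 40, 70, 50, 30],
     [80, 100, 90, 60, 70, 100]],
    columns=interested,
    index=["mon_a", "mon_b"],
)

# astype(float) converts all six columns at once; sum(axis=1) adds across
# the columns of each row, giving one total per Pokémon.
toy = toy.assign(total_stats=toy[interested].astype(float).sum(axis=1))
print(toy["total_stats"])
```

Here mon_a totals 300 and mon_b totals 500. The key detail is axis=1: with the default axis=0 the sum would run down each column instead of across each row.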

We now have a total_stats field that stores the race value, that is, the sum of the six base stats. A histogram shows how these totals are distributed:

# Distribution of race values (base stat totals)
total_stats = df.total_stats
plt.hist(total_stats,bins=35)
plt.xlabel('total_stats')
plt.ylabel('Frequency')

We can also look at the distribution for each primary type:

# Distribution of race values by primary type
plt.subplots(figsize=(20,12))
ax = sns.violinplot(x="type1", y="total_stats",
                    data=df, palette="muted")
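
A violin plot gives the overall picture; a small table of per-type statistics can complement it with exact numbers. A minimal sketch using groupby, on invented data (toy, its types, and its totals are assumptions for illustration):

```python
import pandas as pd

# Toy totals for three primary types (all values invented).
toy = pd.DataFrame({
    "type1": ["water", "water", "water", "bug", "bug", "dragon"],
    "total_stats": [300, 500, 400, 200, 380, 600],
})

# For each primary type: how many Pokémon, their median total, and the best total.
summary = (
    toy.groupby("type1")["total_stats"]
       .agg(["count", "median", "max"])
       .sort_values("median", ascending=False)
)
print(summary)
```

The count column is worth keeping next to the median: a type with very few members can rank high or low simply because of one or two unusual Pokémon.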

Finding non-legendary Pokémon with legendary-level race values

Finally, a simple filter finds the Pokémon we should try to capture:

df[(df.total_stats >= 570) & (df.is_legendary == 0)]['name'].head(10)

Judging from the results, the ten Pokémon that ordinary trainers like us should consider are Venusaur, Charizard, Blastoise, Pidgeot, Alakazam, Slowbro, Gengar, Kangaskhan, Pinsir, and Gyarados. Note that head(10) returns the first ten matches in data-set order rather than the ten highest totals. With nothing more than simple data analysis, we have accomplished what most trainers in the Pokémon anime never manage. Seen this way, a promotion and a raise, the CEO's chair, and a dream partner are all within reach, and the day you become director of the research institute is just around the corner!
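
The filter above keeps rows in their original order. If we wanted an actual ranking by total_stats, one option is to sort the filtered rows before taking the head. This is a sketch of that variation rather than the analysis used above, and toy with its values is invented for illustration:

```python
import pandas as pd

# Toy candidates (invented): mon_b is legendary, mon_d is below the threshold.
toy = pd.DataFrame({
    "name": ["mon_a", "mon_b", "mon_c", "mon_d", "mon_e"],
    "total_stats": [580.0, 680.0, 600.0, 450.0, 570.0],
    "is_legendary": [0, 1, 0, 0, 0],
})

# Same filter as in the article: strong enough and not legendary.
candidates = toy[(toy.total_stats >= 570) & (toy.is_legendary == 0)]

# Sort from the highest total down, then keep at most ten rows.
ranked = candidates.sort_values("total_stats", ascending=False)
top = ranked[["name", "total_stats"]].head(10)
print(top)
```

On the real data this would likely produce a different list from the one above, since it ranks by total rather than by position in the data set.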

The results are for reference only...

 

 

Copyright statement: This article is the original article of the blogger and follows the  CC 4.0 BY-SA  copyright agreement. Please attach the original source link and this statement for reprinting.

Link to this article: https://blog.csdn.net/adminkeys/article/details/108532946

Attached ipython and data set file address: https://download.csdn.net/download/adminkeys/12837523
