Football EDA: historical data analysis and visualization

Background

The dataset contains the results of 44,341 international football matches, from the first official match in 1872 through 2023. Competitions range from the FIFA World Cup to the FIFI Wild Cup to regular friendlies. All matches are full men's internationals; the data exclude the Olympic Games and any match in which at least one side was a national B team, a U-23 team or a league select team.

Data introduction

results.csv includes the following columns:

  • date - the date of the match
  • home_team - the name of the home team
  • away_team - the name of the away team
  • home_score - full-time home team score, including extra time but excluding penalty shootouts
  • away_score - full-time away team score, including extra time but excluding penalty shootouts
  • tournament - the name of the tournament
  • city - the name of the city/town/administrative unit where the match was played
  • country - the name of the country where the match was played
  • neutral - true/false column indicating whether the match was played at a neutral venue

Some directions to follow when exploring the data:

Who is the best team of all time?

Which teams dominated football in different eras?

What trends has international football seen over time: home field advantage, total goals scored, the distribution of team strength, etc.?

Can we say anything about geopolitics from football, such as how the number of countries has changed?

Which teams play each other most often?

Which countries have hosted the most matches in which they did not play themselves?

How much does hosting a major tournament help a country's chances in that tournament?

Which teams are the most active in friendlies and friendly tournaments, and does that help or hurt them?
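As a starting point for the home field advantage question, here is a minimal sketch. It assumes a tiny hand-made sample with the same column names as results.csv; the team names, the scores and the `match_outcome` helper are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical sample rows using the same column names as results.csv.
sample = pd.DataFrame({
    "home_team": ["A", "B", "C", "A"],
    "away_team": ["B", "C", "A", "C"],
    "home_score": [2, 1, 0, 3],
    "away_score": [0, 1, 2, 3],
    "neutral": [False, False, True, False],
})

def match_outcome(row):
    """Label a match from the home team's point of view."""
    if row["home_score"] > row["away_score"]:
        return "home_win"
    if row["home_score"] < row["away_score"]:
        return "away_win"
    return "draw"

sample["outcome"] = sample.apply(match_outcome, axis=1)

# Matches at a neutral venue say nothing about home advantage, so drop them.
non_neutral = sample[~sample["neutral"]]

# Share of each outcome among the remaining matches.
outcome_share = non_neutral["outcome"].value_counts(normalize=True)
print(outcome_share)
```

On the real data, the same steps could be applied to `df` once it has been loaded, and the shares could be grouped by decade to see whether the advantage has changed over time.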

Data processing

import numpy as np 
import pandas as pd 
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Import libraries

import matplotlib.pyplot as plt
import seaborn as sns

Data exploration

df = pd.read_csv('/kaggle/input/international-football-results-from-1872-to-2017/results.csv')
df.head()

# df.shape is a (rows, columns) tuple
print(f"This Dataset Includes {df.shape}")

df.info()

df.describe()

df.describe(include=object)

df.isna().sum()

Convert "date" column to datetime type

df['date'] = pd.to_datetime(df['date'])
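To confirm that the conversion worked, it can help to check the resulting dtype and count any values that failed to parse. The sketch below is an illustration on a hypothetical three-row frame, not the real data; it uses `errors="coerce"`, which turns unparseable strings into `NaT` instead of raising.

```python
import pandas as pd

# Hypothetical sample: two valid dates and one deliberately broken value.
raw = pd.DataFrame({"date": ["1872-11-30", "1873-03-08", "not a date"]})

# errors="coerce" replaces values that cannot be parsed with NaT.
raw["date"] = pd.to_datetime(raw["date"], errors="coerce")

is_datetime = pd.api.types.is_datetime64_any_dtype(raw["date"])
unparsed = int(raw["date"].isna().sum())

print(f"datetime column: {is_datetime}, unparsed values: {unparsed}")
```

If the unparsed count is greater than zero on the real file, those rows are worth inspecting before any time-based analysis.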

Data visualization

Match analysis

plt.figure(figsize=(20, 12))
sns.countplot(x='tournament', data=df)
plt.xticks(rotation=90)
plt.title('Tournament Distribution')
plt.xlabel('Tournament')
plt.ylabel('Count')
plt.tight_layout()
plt.show()
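With many distinct tournaments, a count plot over every category can be hard to read. One option is to restrict the view to the most frequent tournaments first. The sketch below uses a hypothetical sample and a hypothetical cutoff `top_n`; only `value_counts` and `head` from pandas are involved.

```python
import pandas as pd

# Hypothetical sample of tournament labels.
sample = pd.DataFrame({
    "tournament": [
        "Friendly", "Friendly", "Friendly",
        "FIFA World Cup", "FIFA World Cup",
        "Copa América",
    ]
})

top_n = 2  # hypothetical cutoff; pick whatever fits the plot

# value_counts sorts by frequency (descending), so head() keeps the most common.
top_tournaments = sample["tournament"].value_counts().head(top_n)
print(top_tournaments)
```

On the real data, the index of `top_tournaments` could be used to filter `df` before plotting, or passed as the `order` argument of `sns.countplot` so that the bars appear sorted by frequency.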

Home and away scores

plt.figure(figsize=(12, 8))
plt.subplot(1, 2, 1)
sns.histplot(df['home_score'], bins=20, kde=True)
plt.title('Distribution of Home Scores')
plt.xlabel('Home Score')
plt.ylabel('Frequency')
# Fix the y-axis range so the two panels are directly comparable
plt.ylim(0, 40000)


plt.subplot(1, 2, 2)
sns.histplot(df['away_score'], bins=20, kde=True)
plt.title('Distribution of Away Scores')
plt.xlabel('Away Score')
plt.ylabel('Frequency')
# Use the same y-axis range as the first panel
plt.ylim(0, 40000)

plt.tight_layout()
plt.show()

Correlation analysis

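# Restrict to numeric/boolean columns; pandas 2.x raises on text columns otherwise
correlation_matrix = df.corr(numeric_only=True)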
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

Time series analysis

# Create a new column for the year
df['year'] = df['date'].dt.year

# Time series analysis
plt.figure(figsize=(10, 6))
sns.lineplot(x='year', y='home_score', data=df, label='Home Score')
sns.lineplot(x='year', y='away_score', data=df, label='Away Score')
plt.title('Trends in Home and Away Scores over Time')
plt.xlabel('Year')
plt.ylabel('Score')
plt.legend()
plt.tight_layout()
plt.show()
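When `sns.lineplot` receives several rows per year, it draws the yearly mean with a confidence band. Computing the yearly means explicitly makes the same numbers available for inspection. This is a minimal sketch on hypothetical sample rows, assuming the same `date`, `home_score` and `away_score` columns as the real data.

```python
import pandas as pd

# Hypothetical sample: two matches in each of two years.
sample = pd.DataFrame({
    "date": pd.to_datetime(
        ["1900-01-01", "1900-06-01", "1901-01-01", "1901-06-01"]
    ),
    "home_score": [2, 4, 1, 1],
    "away_score": [0, 2, 1, 3],
})

sample["year"] = sample["date"].dt.year

# Mean home and away goals per match for each year.
yearly_mean = sample.groupby("year")[["home_score", "away_score"]].mean()
print(yearly_mean)
```

The resulting frame has one row per year, so it can be plotted directly or compared across eras without recomputing anything.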

Summary

That concludes today's walkthrough.

Origin blog.csdn.net/m0_66106755/article/details/132487403