Football-EDA historical data analysis and visualization
Background
The dataset includes the results of 44,341 international football matches, from the first official match in 1872 through 2023. Competitions range from the FIFA World Cup to the FIFI Wild Cup to regular friendlies. The matches are strictly men's full internationals; the data do not include the Olympic Games or matches in which at least one team was a national B team, a U-23 side, or a league select team.
Data introduction
results.csv includes the following columns:
- date - the date of the match
- home_team - the name of the home team
- away_team - the name of the away team
- home_score - full-time home team score, including extra time, excluding penalty shootouts
- away_score - full-time away team score, including extra time, excluding penalty shootouts
- tournament - the name of the tournament
- city - the name of the city/town/administrative unit where the match is held
- country - the name of the country where the match is played
- neutral - true/false column indicating whether the match is played on a neutral ground
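As a quick check of the layout described above, the sketch below parses two made-up rows that follow the same column order. The team, city, and country names are placeholders rather than real results, and reading `TRUE`/`FALSE` as booleans relies on pandas' default parsing.

```python
import io

import pandas as pd

# Hypothetical rows that only mimic the column layout of results.csv.
SAMPLE_CSV = """date,home_team,away_team,home_score,away_score,tournament,city,country,neutral
2000-01-01,Team A,Team B,2,1,Friendly,City X,Country A,FALSE
2000-01-08,Team C,Team A,0,0,Friendly,City Y,Country B,TRUE
"""

# parse_dates turns the date column into datetimes while reading.
sample = pd.read_csv(io.StringIO(SAMPLE_CSV), parse_dates=["date"])

# Inspect the inferred type of every column.
print(sample.dtypes)
```

Checking the inferred types early makes it easier to spot a column that was read as text by mistake.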
Some directions to follow when exploring the data:
- Who is the best team of all time?
- Which teams dominated football in different eras?
- What are the trends in international football over time: home advantage, total goals scored, distribution of team strength, etc.?
- Can we say anything about geopolitics from football, for example from how the number of countries changes?
- Which teams play each other most often?
- Which countries hosted the most matches in which they did not play themselves?
- How much does hosting a major tournament help a country's chances in its matches?
- Which teams are the most active in friendlies and friendly tournaments, and does that help or hurt them?
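To make one of these directions concrete, here is a minimal sketch of how home advantage could be measured: label each match by outcome, then compute the share of home wins among matches that were not played on neutral ground. The rows and the `result` column are hypothetical and only illustrate the approach.

```python
import numpy as np
import pandas as pd

# Hypothetical scores; real values would come from results.csv.
matches = pd.DataFrame({
    "home_score": [2, 0, 1, 3],
    "away_score": [1, 0, 2, 0],
    "neutral": [False, False, True, False],
})

# Label each match from the home team's point of view.
goal_diff = matches["home_score"] - matches["away_score"]
matches["result"] = np.select(
    [goal_diff > 0, goal_diff < 0],
    ["home_win", "away_win"],
    default="draw",
)

# Only matches with a true home side say anything about home advantage.
true_home = matches[~matches["neutral"]]
home_win_rate = (true_home["result"] == "home_win").mean()
print(f"Home win rate on non-neutral ground: {home_win_rate:.2f}")
```

Grouping the same calculation by year or decade would show how the rate changes over time.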
Data processing
import numpy as np
import pandas as pd
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
Import libraries
import matplotlib.pyplot as plt
import seaborn as sns
Data exploration
df = pd.read_csv('/kaggle/input/international-football-results-from-1872-to-2017/results.csv')
df.head()
print(f"This Dataset Includes {df.shape}")
df.info()
df.describe()
df.describe(include=object)
df.isna().sum()
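If the check above reports missing values, one possible way to handle them is to drop the rows that lack a score before computing any score-based statistics. The sketch below uses a tiny hypothetical frame; whether dropping rows is appropriate for the real data is an assumption to verify.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in the score columns.
toy = pd.DataFrame({
    "home_score": [1, np.nan, 3],
    "away_score": [0, 2, np.nan],
    "tournament": ["Friendly", "Friendly", "Friendly"],
})

# Count missing values per column, as df.isna().sum() does.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Keep only the rows where both scores are present.
scored = toy.dropna(subset=["home_score", "away_score"])
print(f"{len(scored)} of {len(toy)} rows have both scores")
```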
Convert "date" column to datetime type
df['date'] = pd.to_datetime(df['date'])
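Once the column holds datetimes, components such as the year or decade can be derived through the `.dt` accessor. The sketch below also passes `errors="coerce"`, which turns unparseable strings into `NaT` instead of raising; the input strings are illustrative and the `decade` name is an assumption, not part of the dataset.

```python
import pandas as pd

# Illustrative date strings, including one that cannot be parsed.
raw = pd.Series(["1872-11-30", "2023-06-15", "not a date"])

# errors="coerce" replaces unparseable entries with NaT.
dates = pd.to_datetime(raw, errors="coerce")

# Derive the decade from the year (stays missing where the date is NaT).
decade = (dates.dt.year // 10) * 10
print(decade)
```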
Data visualization
Match analysis
plt.figure(figsize=(20, 12))
sns.countplot(x='tournament', data=df)
plt.xticks(rotation=90)
plt.title('Tournament Distribution')
plt.xlabel('Tournament')
plt.ylabel('Count')
plt.tight_layout()
plt.show()
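If the dataset contains many distinct tournaments, the x-axis of the count plot becomes crowded. A possible remedy, sketched here on hypothetical values, is to count the tournaments first and keep only the most frequent ones.

```python
import pandas as pd

# Hypothetical tournament column.
tournaments = pd.Series(
    ["Friendly", "Friendly", "Friendly",
     "FIFA World Cup", "FIFA World Cup",
     "Copa América"],
    name="tournament",
)

# value_counts sorts by frequency, so head(n) keeps the n most common.
top = tournaments.value_counts().head(2)
print(top)
```

The resulting index could then be passed to `sns.countplot` through its `order` parameter, after filtering the data to those tournaments.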
Home and away scores
plt.figure(figsize=(12, 8))
plt.subplot(1, 2, 1)
sns.histplot(df['home_score'], bins=20, kde=True)
plt.title('Distribution of Home Scores')
plt.xlabel('Home Score')
plt.ylabel('Frequency')
# Set the y-axis limit for the first plot
plt.ylim(0, 40000)
plt.subplot(1, 2, 2)
sns.histplot(df['away_score'], bins=20, kde=True)
plt.title('Distribution of Away Scores')
plt.xlabel('Away Score')
plt.ylabel('Frequency')
# Use the same y-axis limit as the first plot so both subplots are directly comparable
plt.ylim(0, 40000)
plt.tight_layout()
plt.show()
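Because scores are small integers, an exact frequency table can complement the histograms, whose bins may merge neighbouring scores. This is a minimal sketch on hypothetical scores; the `home` and `away` column names are chosen for illustration.

```python
import pandas as pd

# Hypothetical scores; real values would come from results.csv.
scores = pd.DataFrame({
    "home_score": [0, 1, 1, 2, 3],
    "away_score": [0, 0, 1, 1, 0],
})

# Count how often each score occurs and align both counts on the score value.
freq = (
    pd.DataFrame({
        "home": scores["home_score"].value_counts(),
        "away": scores["away_score"].value_counts(),
    })
    .fillna(0)      # a score that never occurred on one side counts as 0
    .astype(int)
    .sort_index()
)
print(freq)
```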
Correlation analysis
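# Restrict the correlation to numeric columns; pandas 2.0+ raises an error on text columns otherwise
correlation_matrix = df.corr(numeric_only=True)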
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
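Correlation is only defined for numeric data, so text columns such as team or tournament names have to be left out. The sketch below, on a hypothetical frame, shows which columns survive when `numeric_only=True` is passed to `DataFrame.corr`; note that the boolean `neutral` column is treated as numeric.

```python
import pandas as pd

# Hypothetical frame mixing numeric, boolean, and text columns.
toy = pd.DataFrame({
    "home_score": [2, 0, 1, 3],
    "away_score": [1, 0, 2, 0],
    "neutral": [False, False, True, False],
    "tournament": ["Friendly", "Friendly", "Friendly", "Friendly"],
})

# numeric_only=True keeps integer, float, and boolean columns only.
corr = toy.corr(numeric_only=True)
print(corr.round(2))
```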
Time series analysis
# Create a new column for the year
df['year'] = df['date'].dt.year
# Time series analysis
plt.figure(figsize=(10, 6))
sns.lineplot(x='year', y='home_score', data=df, label='Home Score')
sns.lineplot(x='year', y='away_score', data=df, label='Away Score')
plt.title('Trends in Home and Away Scores over Time')
plt.xlabel('Year')
plt.ylabel('Score')
plt.legend()
plt.tight_layout()
plt.show()
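When several rows share the same x value, `sns.lineplot` aggregates them with the mean by default, so the lines above show average goals per match for each year. The sketch below computes those yearly averages explicitly on hypothetical data; the `total_goals` column is an added illustration.

```python
import pandas as pd

# Hypothetical matches spread over two years.
toy = pd.DataFrame({
    "date": pd.to_datetime(["1990-01-01", "1990-06-01", "1991-03-01"]),
    "home_score": [2, 0, 3],
    "away_score": [1, 1, 0],
})
toy["year"] = toy["date"].dt.year

# Average home and away goals per match for each year.
yearly = toy.groupby("year")[["home_score", "away_score"]].mean()

# Average total goals per match.
yearly["total_goals"] = yearly["home_score"] + yearly["away_score"]
print(yearly)
```

Having the aggregated table at hand makes it easy to check individual years against the plot.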
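Summary
This notebook loaded the international football results dataset, inspected its structure and missing values, and visualized the tournament distribution, the home and away score distributions, the correlations between numeric columns, and the yearly trend in scores.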