Python code and Bayesian theory tell you who the best baseball player is

Compilation: Li Lei, Zhang Xinyue, Wang Mengze, Xiaoyu

In addition to the code blocks attached to the text, you can also find a link to the Jupyter Notebook for the entire program at the end of the text.

Among the many topics in data science and statistics, one that I find both interesting and difficult to understand is Bayesian analysis. I had the opportunity to study Bayesian statistical analysis in a course, but it is a subject I still need to review and reinforce.

From a personal point of view, I just want to better understand Bayesian theory and how to apply it in real life.

This article is mainly inspired by Rasmus Bååth's YouTube series "Introduction to Bayesian Data Analysis". Rasmus Bååth is very good at helping you understand Bayesian analysis intuitively: instead of throwing complicated formulas at you, he guides you to think through the problem step by step.

Video link by Rasmus Bååth:

https://www.youtube.com/user/rasmusab/feed

This article will analyze the batting averages of baseball players using Bayesian theory, and show you how to carry out such an analysis yourself. To be honest, I'm not a sports fan and rarely watch sports.

So why choose baseball?

"The beauty of baseball, whether you know it or not, is precision. No other sport is as completely dependent on the continuity, statistics, and order of athletic data as baseball. Baseball fans care more about numbers than CPAs."

—Sports reporter Jim Murray

Some say baseball may be the most thoroughly documented sport in the world: nearly a century of baseball statistics has been accumulated.

However, the mere accumulation of statistics is not what makes baseball statistically interesting; perhaps more important are the characteristics of the sport itself.

For example, during an at-bat (a baseball term for a batter's turn at the plate), who is playing in the outfield has very little effect on whether the batter hits a home run.

In other sports, especially football and basketball, the significance of an individual player's statistics can be diluted by important events happening elsewhere on the field. In baseball, statistics play a central role in comparing player performance.

Baseball stats consist of many metrics, some of which are straightforward to define, others more complex. The metric I chose to look at is the batting average (AVG). In baseball, batting average is defined as the number of hits divided by the number of at-bats, usually reported to three decimal places. For example, a batter with 27 hits in 100 at-bats has an AVG of .270.

Some have questioned the usefulness of batting average, but as C. Trent Rosecrans said, "Nevertheless, batting average does have historical and contextual significance relative to other stats. We all know what a .300 hitter is, how bad a .200 hitter is, and how great a .400 hitter is."

In Major League Baseball (MLB), spring training is a series of practices and exhibition games before the start of the regular season.

I will try to solve the following two problems:

1. How to interpret the batting averages from 2018 spring training

2. How to compare the batting averages of two players

Before getting into the code, I'll briefly go over what Rasmus Bååth covers in his video.

First, we need three things to perform Bayesian analysis.

1. Data

2. Generative model

3. Prior probability

In my case, the data is the batting average we observed for spring training in 2018.

A generative model is one that generates data when given parameters as input. These input parameters are used to generate a probability distribution. For example, if you know the mean and standard deviation, you can easily generate normally distributed data by running the following code. Later we will see other types of distributions used in Bayesian analysis.

import matplotlib.pyplot as plt
import numpy as np

mu, sigma = 0, 0.1  # mean and standard deviation
s = np.random.normal(mu, sigma, 1000)
plt.hist(s)

In Bayesian analysis, we run the generative model in reverse: instead of generating data from known parameters, we try to infer the parameters from the observed data.

[Video: Introduction to Bayesian Data Analysis, Part 1]

Finally, prior probabilities refer to information that the model already has before processing the data. For example, are events equally probable? Is there some prior data that can be leveraged? Is it possible to make an educated guess?
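
To make these three ingredients concrete, here is a minimal toy sketch of my own (not from the video): draw candidate parameter values from a prior, push them through a generative model, and keep only the values that reproduce the observed data.

# Toy example (my own illustration): infer a coin's probability of
# heads after observing 6 heads in 10 flips.
import numpy as np

n_flips, observed_heads = 10, 6
candidate_p = np.random.uniform(0, 1, size=100000)          # 1. prior
simulated_heads = np.random.binomial(n_flips, candidate_p)  # 2. generative model
kept = candidate_p[simulated_heads == observed_heads]       # 3. keep matches with data
print(kept.mean())  # posterior mean, roughly (6+1)/(10+2) ≈ 0.58

This draw-simulate-filter loop is exactly the procedure we will apply to the baseball data below.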

First, I'll define a function that scrapes player data from Fox Sports and returns a player's batting stats for either spring training or the regular season.

Fox Sports link:

https://www.foxsports.com/mlb/stats

import pandas as pd
import seaborn as sns
import requests
from bs4 import BeautifulSoup

plt.style.use('fivethirtyeight')
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

def batting_stats(url, season):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    table = soup.find_all("table", {"class": "wisbb_standardTable tablesorter"})[0]
    table_head = soup.find_all("thead", {"class": "wisbb_tableHeader"})[0]
    if season == 'spring':
        row_height = len(table.find_all('tr')[:-1])
    else:
        row_height = len(table.find_all('tr')[:-2])
    result_df = pd.DataFrame(columns=[row.text.strip() for row in table_head.find_all('th')],
                             index=range(0, row_height))
    row_marker = 0
    for row in table.find_all('tr')[:-1]:
        column_marker = 0
        columns = row.find_all('td')
        for column in columns:
            result_df.iat[row_marker, column_marker] = column.text.strip()
            column_marker += 1
        row_marker += 1
    return result_df

Next, we select a player of interest and analyze his statistics.

[Screenshot: New York Mets spring training stats page]

If you sort by batting average (AVG), you can see that Dominic Smith (DS) is in first place and Gavin Cecchini (GC) is second. Are they good players? I have no idea. But judging by AVG alone, DS tops the chart with a value of 1.000.

Googling around, I found that "in recent years, the league-wide batting average has typically been around .260". If so, the AVGs of DS and GC seem far too high. Looking further at the two players' at-bats (AB) and hits (H), it is clear that DS had only 1 AB while GC had 7. And after checking the other players' ABs, I found that the highest AB in 2018 spring training was 13, while the New York Mets' highest AB in 2017 was 60.

Scenario 1

Assume I know nothing about the players' past performance; the 2018 spring training data is my only source, so I don't know the plausible range of values for AVG. How, then, should I interpret the 2018 spring training stats?

First let's grab the spring training data for DS.

ds_url_st = "https://www.foxsports.com/mlb/dominic-smith-player-stats?seasonType=3"
dominic_smith_spring = batting_stats(ds_url_st, 'spring')
dominic_smith_spring.iloc[-1]

n_draw = 20000
prior_ni = pd.Series(np.random.uniform(0, 1, size=n_draw))

plt.figure(figsize=(8, 5))
plt.hist(prior_ni)
plt.title('Uniform distribution(0,1)')
plt.xlabel('Prior on AVG')
plt.ylabel('Frequency')

The prior represents our general beliefs about something before we have specific data. In the histogram above, all values are roughly equally frequent (the small differences are due to random sampling).

This means I know nothing about the player and cannot even make an educated guess about his AVG. I am assuming that an AVG of 0.000 is just as probable as an AVG of 1.000, or as any other value between 0 and 1.

Now we observe the data: with 1 AB and 1 H, the AVG is 1.000. This process can be represented by the binomial distribution. A random variable X with a binomial distribution represents the number of successes in a sequence of n independent yes/no trials, where each trial succeeds with probability p.

In our case, AVG is the probability of success, AB is the number of trials, and H is the number of successes.
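
As a quick illustration (my own, with made-up numbers), here is what the binomial generative model produces for a hypothetical .260 hitter over 60 at-bats:

# Illustration with made-up numbers: simulate hit totals for a
# hypothetical .260 hitter over 60 at-bats.
import numpy as np

simulated_hits = np.random.binomial(n=60, p=0.260, size=5)
print(simulated_hits)       # five simulated hit totals
print(simulated_hits / 60)  # the corresponding simulated AVGs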

Keep these terms in mind; we will now define the generative model and run it in reverse.

We will randomly pick a probability value from the defined uniform distribution and use this probability as a parameter for the generative model. Let's say we randomly pick a probability value of 0.230, which means a 23% chance of success in the binomial distribution.

The number of trials is 1 (DS had 1 AB), and if the result of the generative model matches what we observed (DS had 1 H), then we keep the value 0.230. If we repeat this draw-and-filter process many times, we end up with the distribution of probability values that yield the same results as we observed.

This is the posterior probability.

def posterior(n_try, k_success, prior):
    hit = list()
    for p in prior:
        hit.append(np.random.binomial(n_try, p))
    posterior = prior[list(map(lambda x: x == k_success, hit))]
    plt.figure(figsize=(8, 5))
    plt.hist(posterior)
    plt.title('Posterior distribution')
    plt.xlabel('Posterior on AVG')
    plt.ylabel('Frequency')
    print('Number of draws left: %d, Posterior mean: %.3f, Posterior median: %.3f, Posterior 95%% quantile interval: %.3f-%.3f' %
          (len(posterior), posterior.mean(), posterior.median(), posterior.quantile(.025), posterior.quantile(.975)))

ds_n_trials = int(dominic_smith_spring[['AB','H']].iloc[-1][0])
ds_k_success = int(dominic_smith_spring[['AB','H']].iloc[-1][1])
posterior(ds_n_trials, ds_k_success, prior_ni)

The 95% quantile interval in the posterior probability distribution is called the credible interval, which is slightly different from the confidence interval in frequentist statistics. There is another credible interval that can be used, which I will mention later when I get to Pymc3.

The main difference between Bayesian credible intervals and frequentist confidence intervals is their interpretation. Bayesian probability reflects a person's subjective beliefs, so under this view we can treat the probability that the true parameter lies within the credible interval as a measurable quantity. This is appealing because it allows us to describe parameters directly in terms of probabilities.

Many people consider this a more natural way to understand a probability interval, and it is easy to interpret. A frequentist confidence interval, by contrast, is a statement about the procedure rather than about any single interval: if we collect a new sample, calculate a confidence interval, and repeat this process many times, then 95% of the confidence intervals we calculate will contain the true AVG value.

Credible interval: based on the observed data, there is a 95% probability that the true value of AVG falls within the interval.

Confidence interval: when we calculate confidence intervals from many samples of this type of data, 95% of those intervals will contain the true value of AVG.

Note the difference between the two. The credible interval is a probability statement about the parameter value given fixed bounds, while the confidence interval is a probability statement about the bounds given a fixed parameter value.

In real life, we want to know the real parameters and not the bounds, so Bayesian credible intervals are a more appropriate choice. In this case, we are only interested in the player's real AVG.
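
To see what the frequentist claim actually means, here is a quick simulation of my own (not part of the original analysis), using a textbook normal-approximation confidence interval for a proportion:

# My own illustration: across many repeated samples, about 95% of the
# computed confidence intervals contain the true AVG.
import numpy as np

true_avg, n_ab, n_sims = 0.260, 500, 2000
covered = 0
for _ in range(n_sims):
    p_hat = np.random.binomial(n_ab, true_avg) / n_ab  # sample estimate
    se = np.sqrt(p_hat * (1 - p_hat) / n_ab)           # standard error
    if p_hat - 1.96 * se <= true_avg <= p_hat + 1.96 * se:
        covered += 1
print(covered / n_sims)  # ≈ 0.95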

Based on the posterior distribution above, I am 95% confident that the true AVG of DS lies between 0.155 and 0.987. But that range is far too wide. In other words, with no prior knowledge and only one observed trial, I cannot be at all sure what DS's true AVG is.

Scenario 2

For the second scenario, we assume that we know the statistics for the previous year's spring training.

dominic_smith_spring.iloc[-2:]

Now that we have the statistics from 2017 spring training, our prior should reflect this information. Note that DS's AVG in 2017 spring training was 0.167, so a uniform prior over AVG no longer makes sense.

The beta distribution is a continuous probability distribution that has two parameters, alpha and beta. One of the most common uses of the beta distribution is to model the uncertainty in the probability of success of an experiment.

In particular, given that k successes are observed in n trials (and starting from a uniform prior), the conditional distribution of the success probability is a beta distribution with alpha = k+1 and beta = n-k+1.

Beta distribution related content:

https://www.statlect.com/probability-distributions/beta-distribution
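
As a sanity check, here is a small sketch of my own (scipy is an extra dependency not used in the original code) showing that the Beta(k+1, n-k+1) formula matches the brute-force filtering posterior from Scenario 1:

# My own sanity check: the closed-form beta posterior should agree with
# the draw-simulate-filter posterior under a uniform prior.
import numpy as np
from scipy import stats

n, k = 10, 3                                  # made-up trials and successes
prior = np.random.uniform(0, 1, size=200000)  # uniform prior draws
sims = np.random.binomial(n, prior)           # generative model
filtered = prior[sims == k]                   # brute-force posterior
print(filtered.mean())                        # ≈ 0.333
print(stats.beta(k + 1, n - k + 1).mean())    # = (k+1)/(n+2) ≈ 0.333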

n_draw = 20000
prior_trials = int(dominic_smith_spring.iloc[3].AB)
prior_success = int(dominic_smith_spring.iloc[3].H)
prior_i = pd.Series(np.random.beta(prior_success+1, prior_trials-prior_success+1, size=n_draw))

plt.figure(figsize=(8, 5))
plt.hist(prior_i)
plt.title('Beta distribution(a=%d, b=%d)' % (prior_success+1, prior_trials-prior_success+1))
plt.xlabel('Prior on AVG')
plt.ylabel('Frequency')

posterior(ds_n_trials, ds_k_success, prior_i)

Compared with the posterior obtained under the uniform prior in Scenario 1, the 95% quantile interval has shrunk. Now I am 95% sure that DS's AVG lies between 0.095 and 0.340.

However, an AVG above 0.300 generally already marks a good hitter, so this estimate still spans everything from one of the worst hitters to one of the best. We need more data to narrow the credible interval.

Scenario 3

In this scenario, let's say I have not only the 2017 spring training stats, but also the 2017 regular season stats. So how does this affect the posterior results and conclusions?

ds_url = "https://www.foxsports.com/mlb/dominic-smith-player-stats?seasonType=1"
dominic_smith_reg = batting_stats(ds_url, 'regular')
dominic_smith = dominic_smith_reg.append(dominic_smith_spring.iloc[3], ignore_index=True)
dominic_smith

ds_prior_trials = pd.to_numeric(dominic_smith.AB).sum()
ds_prior_success = pd.to_numeric(dominic_smith.H).sum()

n_draw = 20000
prior_i_02 = pd.Series(np.random.beta(ds_prior_success+1, ds_prior_trials-ds_prior_success+1, size=n_draw))

plt.figure(figsize=(8, 5))
plt.hist(prior_i_02)
plt.title('Beta distribution(a=%d, b=%d)' % (ds_prior_success+1, ds_prior_trials-ds_prior_success+1))
plt.xlabel('Prior on AVG')
plt.ylabel('Frequency')

posterior(ds_n_trials, ds_k_success, prior_i_02)

Now I am 95% sure that the true AVG of DS lies between 0.146 and 0.258. Although the range is still not very precise, the credible interval in Scenario 3 is much narrower than in Scenarios 1 and 2.

Scenario 4

I would like to compare the two players and see who is better in terms of AVG. The observed data is from 2018 spring training, and the prior knowledge comes from the 2017 spring training and regular season. Now I'm going to compare the batting averages of DS and GC.

In the previous scenarios, I simulated by sampling and discarded every parameter value whose generated result was inconsistent with the observed data. But this kind of random generate-and-filter sampling is computationally expensive and slow.

Therefore, we can use tools that make the sampler spend more time in high-probability regions to improve efficiency. Probabilistic programming tools like Pymc3 handle the sampling process efficiently using clever algorithms such as HMC-NUTS.

Pymc3 link:

https://github.com/pymc-devs/pymc3

HMC-NUTS link:

http://blog.fastforwardlabs.com/2017/01/30/the-algorithms-behind-probabilistic-programming.html

Let's start by grabbing Gavin Cecchini's stats from Fox Sports.

gc_url_st = "https://www.foxsports.com/mlb/gavin-cecchini-player-stats?seasonType=3"
gc_url_reg = "https://www.foxsports.com/mlb/gavin-cecchini-player-stats?seasonType=1"
gavin_cecchini_spring = batting_stats(gc_url_st, 'spring')
gavin_cecchini_reg = batting_stats(gc_url_reg, 'regular')
gc_n_trials = int(gavin_cecchini_spring.iloc[1].AB)
gc_k_success = int(gavin_cecchini_spring.iloc[1].H)
gc_prior = pd.DataFrame(gavin_cecchini_reg.iloc[1]).transpose().append(gavin_cecchini_spring.iloc[0])
gc_prior

gc_prior_trials = pd.to_numeric(gc_prior.AB).sum()
gc_prior_success = pd.to_numeric(gc_prior.H).sum()

def observed_data_generator(n_try, observed_data):
    # Build a 0/1 outcome array: `observed_data` hits followed by misses.
    result = np.ones(observed_data)
    fails = n_try - observed_data
    result = np.append(result, np.zeros(fails))
    return result

ds_observed = observed_data_generator(ds_n_trials, ds_k_success)
gc_observed = observed_data_generator(gc_n_trials, gc_k_success)

Next, we fit a Pymc3 model.

import pymc3 as pm

with pm.Model() as model_a:
    D_p = pm.Beta('DS_AVG', ds_prior_success+1, ds_prior_trials-ds_prior_success+1)
    G_p = pm.Beta('GC_AVG', gc_prior_success+1, gc_prior_trials-gc_prior_success+1)
    DS = pm.Bernoulli('DS', p=D_p, observed=ds_observed)
    GC = pm.Bernoulli('GC', p=G_p, observed=gc_observed)
    DvG = pm.Deterministic('DvG', D_p - G_p)
    start = pm.find_MAP()
    trace = pm.sample(10000, start=start)

pm.plot_posterior(trace, varnames=['DS_AVG','GC_AVG','DvG'], ref_val=0)

If we plot the posterior distributions of DS_AVG, GC_AVG, and DvG (DS_AVG - GC_AVG) with the plot_posterior function in Pymc3, we can see that the intervals are labeled HPD rather than quantile.

The Highest Posterior Density (HPD) interval is another kind of credible interval we can use for the posterior density function. The HPD interval is the narrowest interval that contains the required probability mass, and it always includes the mode, the point of maximum posterior density.
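
Pymc3 computes this for us, but the idea is simple enough to sketch by hand. The following is my own illustration (Pymc3's implementation is more careful): sort the samples and slide a window covering 95% of them, keeping the narrowest one.

# My own sketch of an HPD interval: the narrowest window that covers
# the requested share of the posterior samples.
import numpy as np

def hpd_interval(samples, mass=0.95):
    sorted_s = np.sort(samples)
    n = len(sorted_s)
    window = int(np.floor(mass * n))                # samples inside the interval
    widths = sorted_s[window:] - sorted_s[:n - window]
    i = np.argmin(widths)                           # narrowest window wins
    return sorted_s[i], sorted_s[i + window]

# For a skewed distribution the two interval types differ visibly:
skewed = np.random.beta(2, 8, size=50000)
print(hpd_interval(skewed))                 # narrowest 95% interval
print(np.quantile(skewed, [0.025, 0.975]))  # equal-tailed 95% interval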

In another article, Rasmus Bååth compares quantile intervals and highest density intervals with a simple, clear comparison plot: the modes of six different posterior distributions together with the highest density intervals covering 95% of the probability mass.

Article link:

http://www.sumsar.net/blog/2014/10/probable-points-and-credible-intervals-part-one/

[Figure from "Probable Points and Credible Intervals, Part One": modes and 95% highest density intervals]

The quantile interval includes the median: the probability that the true value falls below the median is 50%, and the probability that it falls above is also 50%. For a 95% quantile interval, the probability of the true value falling outside the interval on each side is 2.5%.

[Figure from "Probable Points and Credible Intervals, Part One": medians and 95% quantile intervals]

As far as the AVGs of DS and GC are concerned, their modes and medians do not seem to be much different. If this is the case, the HPD intervals and quantile intervals of the AVGs of the two players should be roughly the same. Let's see what they look like.

pm.summary(trace)

We can see that for both DS and GC, the HPD interval and the quantile interval are either exactly the same or differ only slightly in the later decimal places.

The thing is, I want to judge who is the better player based on AVG, and at the moment I can't: as far as I can tell at the 95% level, the AVGs of these two players are about the same.

The calculated results and the plot show that the difference in AVG between the two players lies between -0.162 and 0.033 (we use DvG = DS - GC to represent the difference: a positive DvG means DS is better, a negative one means GC is better).

Since this interval includes 0.000, there may be no difference at all between the two players' AVGs. So even though there is some evidence that GC is better than DS (the posterior distribution of DvG has more area in the negative region than in the positive), I cannot be 95% certain that the two players' AVGs actually differ.
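
One way to quantify that evidence (my own addition, not part of the original analysis) is to compute the share of posterior samples in which GC beats DS, directly from the trace:

# My own addition: posterior probability that GC's true AVG is higher.
dvg_samples = trace['DvG']
p_gc_better = (dvg_samples < 0).mean()
print('P(GC has the higher true AVG) = %.3f' % p_gc_better)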

Maybe with more data I could pin down the difference between them. After all, this is the essence of Bayesian thinking: it's not that the truth doesn't exist, but that learning it is a slow process, and as new information arrives, all we can do is keep revising our beliefs.
