2023 MCM Problem C: a solution to the Wordle prediction problem

Question one

The number of reported results changes daily.
- Develop a model to account for this variation,
- and use your model to create a prediction interval.
- Do any properties of the word affect the reported percentage of scores played in hard mode? If so, how? If not, why not?

read data

import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import LabelEncoder
import lightgbm as lgb
from datetime import date, timedelta

import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif']=['SimHei'] # font that can render Chinese labels
plt.rcParams['axes.unicode_minus']=False # render the minus sign correctly
import seaborn as sns
%matplotlib inline

data=  pd.read_excel("./Problem_C_Data_Wordle.xlsx",header=1)
data

data preprocessing

data = data.drop(columns='Unnamed: 0')
data['Date'] = pd.to_datetime(data['Date'])
data

data.set_index("Date", inplace=True)
data.sort_index(ascending=True,inplace=True)

data=data.reset_index()
data

data analysis

Data Trends

plt.figure(figsize=(15,6))

data["Date"] =  pd.to_datetime(data["Date"])

plt.plot(data['Date'],data['Number of  reported results'],'r-o', markersize=3)

plt.legend(['Number of reported results'],fontsize=20)

plt.xlabel('Date',fontsize=14)
plt.ylabel('Number of reported results',fontsize=14)

data distribution

plt.figure(figsize=(10,8))
reported = data['Number of  reported results']

# fill=True shades the area under the curve (older seaborn versions used shade=True)
g=sns.kdeplot(reported,legend=True,fill=True,color='b',label='Number of  reported results')

plt.legend(loc='best', fontsize='large')

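from scipy.stats import norm
plt.figure(figsize=(10,8))
# fit a normal distribution to the daily counts
(mu, sigma) = norm.fit(data['Number of  reported results'])
print('\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))
# note: distplot is deprecated in recent seaborn releases
g = sns.distplot(data['Number of  reported results'], fit=norm)
# raw string so that \mu and \sigma reach matplotlib's mathtext unchanged
plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
           loc='best')

g.set(ylabel='Density')
g.set(title='Distribution of the number of reported results')
plt.show()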

Data statistics - mean, variance, maxima and minima...

data.describe()

Data correlation

# numeric_only=True skips text columns such as Word (needed on pandas 2.x)
corr = abs(data.corr(numeric_only=True))
corr['Number of  reported results'].sort_values(ascending=False)

Number in hard mode       0.922252
Contest number            0.821787
1 try                     0.342183
4 tries                   0.211693
2 tries                   0.118527
6 tries                   0.084180
5 tries                   0.077308
3 tries                   0.043624
7 or more tries (X)       0.033079

The closer the absolute value of the correlation coefficient is to 1, the stronger the correlation; the closer it is to 0, the weaker the correlation.

A common rule of thumb for the absolute value of the correlation coefficient:
- 0.8-1.0: very strong correlation
- 0.6-0.8: strong correlation
- 0.4-0.6: moderate correlation
- 0.2-0.4: weak correlation
- 0.0-0.2: very weak or no correlation

Pearson correlation, also known as product-moment correlation, is a measure of linear correlation developed by the British statistician Karl Pearson in the late 19th century. The size of the coefficient alone is not enough to judge the strength of a correlation.

Another common rule of thumb, after taking the absolute value, treats 0-0.09 as no correlation, 0.1-0.3 as weak correlation, 0.3-0.5 as moderate correlation, and 0.5-1.0 as strong correlation. In addition, a significance test (a t-test) is usually needed to check whether the two sets of data are significantly correlated; SPSS calculates this automatically.
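Outside SPSS, the same check can be sketched in Python. The example below is a minimal illustration on synthetic data (the arrays `x`, `y`, and `z` are made up, not columns of the Wordle table); it assumes SciPy is installed and uses `scipy.stats.pearsonr`, which returns the coefficient together with a two-sided p-value.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)                  # fixed seed so the sketch is repeatable
x = rng.normal(size=200)                        # synthetic stand-in for one column
y = 0.8 * x + rng.normal(scale=0.5, size=200)   # linearly related to x, plus noise
z = rng.normal(size=200)                        # generated independently of x

r_xy, p_xy = pearsonr(x, y)   # coefficient and two-sided p-value
r_xz, p_xz = pearsonr(x, z)

print(f"x vs y: r = {r_xy:.3f}, p = {p_xy:.3g}")
print(f"x vs z: r = {r_xz:.3f}, p = {p_xz:.3g}")
```

A large |r| with a small p-value supports a real linear relationship, while a small |r| with a large p-value gives no evidence of one.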

plt.figure(figsize=(15,15))
# numeric_only=True skips text columns such as Word (needed on pandas 2.x)
g=sns.heatmap(data.corr(numeric_only=True),cmap='RdYlGn',annot=True)
plt.show()

Regression prediction model - XGBoost

The earliest prototype of XGBoost appeared in 2014 as a research project led by Tianqi Chen during his Ph.D. After being open-sourced, it gradually developed into a mature framework with support for C++, Java, Python, R, and Julia. XGBoost is short for Extreme Gradient Boosting, an implementation of the gradient boosting algorithm.

The name Gradient Boosting combines two ideas: gradient descent and boosting. Start with boosting. As the word suggests, boosting is the process of improving weak learners until they form a strong learner. Weak learners are very simple models with low complexity; they are easy to train and not prone to overfitting.
Such models are typically only somewhat better than random guessing, for example decision trees of depth one. The chosen weak learner is called the base learner, and the improved learner is obtained by combining base learners.
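To make the idea concrete, here is a minimal sketch of plain gradient boosting for squared error, with depth-one trees as the base learners. It is not XGBoost itself (which adds regularization and second-order information); the synthetic data, `learning_rate`, and `n_rounds` are assumptions chosen for illustration, and the base learner comes from scikit-learn.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))                   # one synthetic feature
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)   # noisy target

learning_rate = 0.3   # shrinks each correction (illustrative value)
n_rounds = 50         # number of weak learners to add (illustrative value)

prediction = np.full_like(y, y.mean())        # start from a constant model
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(n_rounds):
    residual = y - prediction                        # negative gradient of the squared error
    stump = DecisionTreeRegressor(max_depth=1)       # weak base learner: a one-level tree
    stump.fit(X, residual)                           # fit the part still unexplained
    prediction += learning_rate * stump.predict(X)   # add a damped correction
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"training MSE before boosting: {mse_history[0]:.4f}")
print(f"training MSE after {n_rounds} rounds: {mse_history[-1]:.4f}")
```

Each stump is weak on its own, but because every round fits what the previous rounds left unexplained, the combined model's training error keeps shrinking.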

Evaluation index

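An evaluation metric is needed. For regression problems, the mean squared error (MSE) is commonly used. The formula is as follows:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

where $y_i$ is the observed value, $\hat{y}_i$ is the predicted value, and $n$ is the number of samples. The MSE can be calculated from the formula as: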

# compute the MSE; y holds the true values and y_ the predictions
np.square(np.subtract(y, y_)).mean()
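As a quick check of that one-liner, the sketch below applies it to four made-up values and compares the result with scikit-learn's `mean_squared_error`. The arrays are illustrative only, not taken from the Wordle data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y = np.array([3.0, 5.0, 2.5, 7.0])     # observed values (made up)
y_ = np.array([2.5, 5.0, 4.0, 8.0])    # model predictions (made up)

mse_manual = np.square(np.subtract(y, y_)).mean()   # the formula written out
mse_sklearn = mean_squared_error(y, y_)             # the library equivalent

print(mse_manual, mse_sklearn)
```

The squared errors are 0.25, 0, 2.25, and 1, so both calls give 3.5 / 4 = 0.875.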

Using the XGBoost framework

First run the following command to install it.

pip install xgboost  # install

For regression, use the XGBRegressor() interface.

We model the data with XGBoost. Its regressor class is XGBRegressor. It has many parameters; the most commonly used ones are:

max_depth – The maximum tree depth of the base learner.
learning_rate – Boosting learning rate.
n_estimators – Number of decision trees.
gamma – The penalty term: the minimum loss reduction required to split a node.
booster – Specifies the booster to use: gbtree, gblinear or dart.
n_jobs – Number of parallel threads.
reg_alpha – L1 regularization weight.
reg_lambda – L2 regularization weight.
scale_pos_weight – Balance between positive and negative class weights.
random_state – random number seed.

Initialize the model with default parameters.

Call XGBRegressor() to train the model and evaluate it.

import xgboost as xgb

model_r = xgb.XGBRegressor()

Divide the data set, 80% training data and 20% test data

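# features: every numeric column except the target (XGBoost cannot take the raw Date and Word columns)
X = data.drop(labels='Number of  reported results', axis=1).select_dtypes(include='number')
y = data['Number of  reported results']  # target values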

from sklearn.model_selection import train_test_split

# split the dataset: 80% training data and 20% test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

Train using the training data, then calculate the R^2 evaluation metric using the test data

model_r.fit(X_train, y_train)  # train on the training data
model_r.score(X_test, y_test)  # compute R^2 on the test data
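The fitted model returns point predictions only, whereas question one also asks for a prediction interval. One simple way to sketch such an interval is to take empirical quantiles of the residuals on held-out data and add them to a new prediction. The helper `residual_interval` below is hypothetical (it is not part of XGBoost or scikit-learn), the numbers are synthetic, and the approach assumes that future errors behave like the held-out ones, which may not hold for a series with a strong trend.

```python
import numpy as np

def residual_interval(y_val, pred_val, pred_new, coverage=0.90):
    """Empirical prediction interval built from held-out residuals (hypothetical helper)."""
    residuals = np.asarray(y_val) - np.asarray(pred_val)
    alpha = 1.0 - coverage
    # lower and upper residual quantiles, e.g. 5% and 95% for 90% coverage
    lo, hi = np.quantile(residuals, [alpha / 2, 1 - alpha / 2])
    pred_new = np.asarray(pred_new)
    return pred_new + lo, pred_new + hi

rng = np.random.default_rng(0)
pred_val = rng.uniform(20_000, 30_000, size=200)       # stand-in for model predictions
y_val = pred_val + rng.normal(scale=1_500, size=200)   # stand-in for observed counts

lower, upper = residual_interval(y_val, pred_val, pred_val)
covered = np.mean((y_val >= lower) & (y_val <= upper))
print(f"empirical coverage on the held-out set: {covered:.2f}")
```

With real data, `pred_val` would be something like `model_r.predict(X_test)` and `pred_new` the prediction for the target date.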

Parameters

Both XGBClassifier and XGBRegressor have an objective parameter.

For classification problems, objective='binary:logistic' is the default; for regression problems the default is objective='reg:squarederror' (called 'reg:linear' in older XGBoost versions).
As the name suggests, this parameter specifies the type of task the learner performs, which is why it is usually called the objective parameter.

For regression problems it is generally set to reg:squarederror or reg:logistic, which correspond to squared-error regression and logistic regression respectively.

Draw a decision tree

XGBoost provides the xgb.plot_tree method, which draws an individual decision tree of the model after training.

To use it, pass in the model and the index of the tree you want to draw.

Install the graphviz package

# install the graphviz package
!pip install graphviz

from matplotlib import pyplot as plt
from matplotlib.pylab import rcParams
%matplotlib inline

# set the figure size
rcParams['figure.figsize'] = [50, 10]

# model_r is the regressor trained above; num_trees selects which tree to draw
xgb.plot_tree(model_r, num_trees=1)

Cross-validation

This section shows how to use XGBoost for cross-validation.

Cross-validation is an important method for quickly evaluating models in machine learning.

The dataset is divided into N subsets; N-1 of them are used to train the model, and the remaining subset is used to evaluate it.

Each subset takes a turn as the evaluation set, and the average of the N evaluation scores is the final evaluation result of the model.

XGBoost provides the xgb.cv method to carry out the cross-validation process.

Because cross-validation does its own splitting, there is no need to divide the data into training and test sets beforehand; the complete dataset can be used directly.

# pass in the features and the target values

data_d = xgb.DMatrix(data=X, label=y)

xgb.cv(dtrain=data_d, params={'objective': 'reg:squarederror'},
       nfold=5, as_pandas=True)

In the call above,

dtrain passes in the dataset, params holds the custom parameters of the model,
nfold is the number N of subsets used for cross-validation,
and as_pandas indicates that the final output is returned as a DataFrame.

By default, XGBoost performs 10 boosting iterations, so the output has 10 rows.

You can change the num_boost_round parameter to set the maximum number of iterations.
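The fold-by-fold procedure that xgb.cv automates can also be written out by hand. The sketch below does so with scikit-learn's `KFold`, using a small decision tree as a stand-in model (not XGBoost) and synthetic data in place of the Wordle features.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))    # synthetic features
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.1, size=200)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)   # N = 5 subsets
fold_rmse = []
for train_idx, test_idx in kfold.split(X_demo):
    model = DecisionTreeRegressor(max_depth=4, random_state=0)
    model.fit(X_demo[train_idx], y_demo[train_idx])             # train on N-1 subsets
    error = y_demo[test_idx] - model.predict(X_demo[test_idx])  # evaluate on the held-out subset
    fold_rmse.append(np.sqrt(np.mean(error ** 2)))

print("RMSE per fold:", np.round(fold_rmse, 3))
print("mean RMSE:", np.mean(fold_rmse))   # the final cross-validated score
```

For data ordered in time, such as daily counts, shuffled folds let the model see the future of the points it is tested on; scikit-learn's `TimeSeriesSplit` is one alternative worth considering.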

Question two

For a given future solution word on a future date,
- Develop a model that allows you to predict the distribution of the reported results.
- In other words, predict the associated percentages of (1, 2, 3, 4, 5, 6, X) for a future date.
- What uncertainties are associated with your model and predictions?
- Give a specific example of your prediction for the word EERIE on March 1, 2023.
- How confident are you in your model's prediction?

There are five vowels: a, e, i, o, u. The rest are consonants.

The consonants are: b, c, d, f, g, h, j, k, l, m, n, p, q, r, s, t, v, w, x, y, z.

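import string

small = string.ascii_lowercase  # assumption: small is the 26 lowercase letters
Vowel = ['a','e','i','o','u']
Consonant = list(set(small).difference(set(Vowel)))

def count_Vowel(s):
    # number of vowel letters in the word s
    c = 0
    for i in range(len(s)):
        if s[i] in Vowel:
            c+=1
    return c

def count_Consonant(s):
    # number of consonant letters in the word s
    c = 0
    for i in range(len(s)):
        if s[i] in Consonant:
            c+=1
    return c

# assumption: df is the preprocessed table from above, indexed by Date
df = data.set_index('Date')
df['Vowel_fre'] = df['Word'].apply(lambda x:count_Vowel(x))
df['Consonant_fre'] = df['Word'].apply(lambda x:count_Consonant(x))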

Temporal Feature Transformation

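# these attributes require df to have a DatetimeIndex (the Date column as index)
df["year"] = df.index.year

df["quarter"] = df.index.quarter

df["month"] = df.index.month

# DatetimeIndex.week was removed in pandas 2.0; isocalendar().week replaces it
df["week"] = df.index.isocalendar().week

df["weekday"] = df.index.weekday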

Data standardization

Standardization converts the original data through a mathematical transformation so that it falls into a small, specific range, such as 0 to 1 or -1 to 1.

It removes differences in nature, units, and order of magnitude among variables by converting them into dimensionless relative values, so that the values of every indicator are on the same scale.
This makes it easier to analyze and compare indicators with different units or orders of magnitude. The StandardScaler used below rescales each feature to zero mean and unit variance.

from sklearn.preprocessing import StandardScaler

# standardize
std = StandardScaler()
X1 = std.fit_transform(X)
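The two kinds of rescaling mentioned above behave differently, which a tiny example makes visible. The sketch below uses a made-up 3x2 array (`X_demo`, not the Wordle features) and compares z-score standardization, which is what `StandardScaler` does, with min-max scaling to the range 0 to 1 via `MinMaxScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# two made-up features on very different scales
X_demo = np.array([[1.0, 100.0],
                   [2.0, 300.0],
                   [3.0, 500.0]])

# z-score by hand: subtract the column mean, divide by the population std (ddof=0)
z_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
z_sklearn = StandardScaler().fit_transform(X_demo)

# min-max scaling maps each column onto the interval [0, 1]
minmax = MinMaxScaler().fit_transform(X_demo)

print(z_sklearn)
print(minmax)
```

After either transformation the two columns become identical, because they differ only in scale and offset; the z-scores have zero mean and unit variance, while the min-max values run from 0 to 1.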

Ensemble Learning - Random Forest

Ensemble learning builds and combines multiple individual learners (called base learners) to complete a learning task.
[Figure: example table of base-learner results, where √ marks a correct classification and × an incorrect one]

Random forest

Random forest is an ensemble learner built on decision trees, but it selects split attributes differently from an ordinary decision tree.

At each node, a base decision tree in a random forest first randomly selects a subset of K attributes from that node's attribute set, and then chooses the best attribute within the subset for the split.
This satisfies the "good but different" condition: each base learner is reasonably accurate, yet the learners differ from one another. Random forest has low computational overhead and performs well compared with many other machine learning algorithms.
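In scikit-learn, the subset size K corresponds to the `max_features` parameter. The sketch below compares a single decision tree with a random forest on a synthetic classification problem; the data from `make_classification` and the setting `max_features=3` are assumptions for illustration, not values taken from the Wordle analysis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# synthetic data: 10 attributes, 5 of them informative
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     n_informative=5, random_state=0)

tree = DecisionTreeClassifier(random_state=0)
# each split considers a random subset of K = 3 attributes
forest = RandomForestClassifier(n_estimators=100, max_features=3, random_state=0)

tree_acc = cross_val_score(tree, X_demo, y_demo, cv=5, scoring='accuracy').mean()
forest_acc = cross_val_score(forest, X_demo, y_demo, cv=5, scoring='accuracy').mean()

print(f"single tree accuracy:   {tree_acc:.3f}")
print(f"random forest accuracy: {forest_acc:.3f}")
```

Restricting each split to a random subset makes the trees less alike, which is the source of the diversity that averaging over the forest relies on.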

sklearn parameter tuning

Cross Validation Method Tuning

We first tune n_estimators and max_depth.

First check the number of features, which determines the range of parameters such as max_depth.
Then tune the parameters with cross-validation.
This gives the optimal parameters n_estimators=100 and max_depth=10.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def para_tune(para, X, y):
    clf = RandomForestClassifier(n_estimators=para) # set n_estimators to para
    # mean accuracy over the cross-validation folds
    score = np.mean(cross_val_score(clf, X, y, scoring='accuracy'))
    return score

def accurate_curve(para_range, X, y, title):
    # plot the cross-validated score for each candidate parameter value
    score = []
    for para in para_range:
        score.append(para_tune(para, X, y))
    plt.figure()
    plt.title(title)
    plt.xlabel('Parameters')
    plt.ylabel('Score')
    plt.grid()
    plt.plot(para_range, score, 'o-')
    return plt

g = accurate_curve([2, 10, 50, 100, 150], X, y, 'n_estimator tuning')

def para_tune(para, X, y):
    # n_estimators is fixed while max_depth is set to para
    clf = RandomForestClassifier(n_estimators=300, max_depth=para)
    score = np.mean(cross_val_score(clf, X, y, scoring='accuracy'))
    return score

# accurate_curve is the plotting helper defined above
g = accurate_curve([2, 10, 20, 30, 40], X, y, 'max_depth tuning')

Automatic parameter tuning with scikit-learn's GridSearchCV

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator, title, X, y, cv=10,
                        train_sizes=np.linspace(.1, 1.0, 5)):
    plt.figure()
    plt.title(title) # figure title
    plt.xlabel('Training examples') # x-axis label
    plt.ylabel('Score') # y-axis label
    train_sizes, train_scores, test_scores = learning_curve(estimator, X, y, cv=cv,
                                                            train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1) # mean over the folds
    train_scores_std = np.std(train_scores, axis=1) # standard deviation over the folds
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid() # background grid

    # shaded bands: mean plus or minus one standard deviation
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std,
                     alpha=0.1, color='g')
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std,
                     alpha=0.1, color='r')
    plt.plot(train_sizes, train_scores_mean, 'o-', color='g',
             label='training score') # training score curve
    plt.plot(train_sizes, test_scores_mean, 'o-', color='r',
             label='testing score') # test score curve
    plt.legend(loc='best')
    return plt

clf = RandomForestClassifier()
# grid of parameter values to search over
para_grid = {'max_depth': [10], 'n_estimators': [100], 'max_features': [1, 5, 10],
             'criterion': ['gini', 'entropy'],
             'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 5, 10]}
gs = GridSearchCV(clf, param_grid=para_grid, cv=3, scoring='accuracy')
gs.fit(X, y)
gs_best = gs.best_estimator_ # the best estimator found
gs.best_score_ # accuracy of the best estimator

g = plot_learning_curve(gs_best, 'RFC', X, y) # draw the learning curve with plot_learning_curve defined above

Question three

Develop and summarize a model to classify solution words by difficulty.
- Identify the attributes of a given word that are associated with each classification.
- Using your model, how difficult is the word EERIE?
- Discuss the accuracy of your classification model.

K-means clustering algorithm

Algorithm idea

Through repeated iterations, find a partition of the samples into k clusters such that the total error is smallest when the mean of each cluster is used to represent the samples in it.

Objects in the same cluster are highly similar, while objects in different clusters are dissimilar.
The k-means algorithm is based on the minimum sum-of-squared-errors criterion, whose objective function is:

$$J = \sum_{i=1}^{m} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^2$$

In the above formula, $x^{(i)}$ is the i-th sample, $c^{(i)}$ is the cluster it is assigned to, and $\mu_{c^{(i)}}$ is the mean of that cluster.
The more similar the samples within each cluster, the smaller the squared error between them and the cluster mean.
The squared errors of all clusters are then summed,
so the smaller the value of J, the better.
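The objective J can be computed directly, which makes it easy to see why one partition is preferred over another. The sketch below uses four made-up points and two hypothetical helpers, `kmeans_objective` and `cluster_means`, to compare a natural grouping with a poor one.

```python
import numpy as np

def cluster_means(X, labels, k):
    """Mean of the samples assigned to each of the k clusters."""
    return np.array([X[labels == j].mean(axis=0) for j in range(k)])

def kmeans_objective(X, labels, centers):
    """J: sum of squared distances from each sample to its assigned cluster mean."""
    diffs = X - centers[labels]   # row i is x_i minus the mean of its cluster
    return float(np.sum(diffs ** 2))

# four made-up points forming two obvious groups (left and right)
X_demo = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

good_labels = np.array([0, 0, 1, 1])   # groups the left pair and the right pair
bad_labels = np.array([0, 1, 0, 1])    # groups the bottom pair and the top pair

J_good = kmeans_objective(X_demo, good_labels, cluster_means(X_demo, good_labels, 2))
J_bad = kmeans_objective(X_demo, bad_labels, cluster_means(X_demo, bad_labels, 2))
print(J_good, J_bad)
```

The natural grouping gives J = 4, while the poor grouping gives J = 100, so minimizing J favors the partition whose clusters are tight around their means.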

Algorithm implementation steps

The k-means algorithm groups the samples into k clusters, where k is chosen by us, that is, the number of categories we want to divide the data into.

The algorithm is as follows:

For the data to be clustered, randomly select k points as the initial cluster centroids;
Compute the distance from each point to every centroid, assign each point to the class of its nearest centroid, update the centroids, and repeat until the assignment converges.

# import the KMeans estimator
from sklearn.cluster import KMeans
est = KMeans(n_clusters=4)  # cluster into 4 groups
est.fit(X)
y_kmeans = est.predict(X)  # predicted cluster labels: an array of the digits 0, 1, 2, 3

# color the points by predicted cluster and visualize (X must be a NumPy array here)
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = est.cluster_centers_  # cluster centers
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.5)  # draw the center points

K-Means Algorithm: Expectation Maximization

K-Means is an algorithm that arrives at its result with an expectation-maximization approach. Expectation maximization works as follows:
1. Guess some cluster centers.
2. Repeat until convergence:
   - Expectation step (E-step): assign each point to its nearest cluster center.
   - Maximization step (M-step): set each cluster center to the mean of the coordinates of all points assigned to it.

from sklearn.metrics import pairwise_distances_argmin  # index of the nearest center
import numpy as np


def find_clusters(X, n_clusters, rseed=2):
    # 1. randomly choose the initial cluster centers
    rng = np.random.RandomState(rseed)
    i = rng.permutation(X.shape[0])[:n_clusters]
    centers = X[i]
    while True:
        # 2a. assign labels based on the nearest center
        labels = pairwise_distances_argmin(X, centers)
        # 2b. find the new centers from the mean of the points
        new_centers = np.array([X[labels == i].mean(0)
                                for i in range(n_clusters)])
        # 2c. check for convergence
        if np.all(centers == new_centers):
            break
        centers = new_centers
    return centers, labels


centers, labels = find_clusters(X, 4)
plt.scatter(X[:, 0], X[:, 1], c=labels,
            s=50, cmap='viridis')  # plot the clustering result