2023 MCM competition C question Wordle prediction problem solution!
Question one
- The number of reported results changes daily.
- Develop a model to account for this variation,
- and use your model to create a prediction interval.
- Do any properties of the word affect the reported percentage of scores played in hard mode? If so, how? If not, why not?
read data
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import LabelEncoder
import lightgbm as lgb
from datetime import date, timedelta
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif']=['SimHei'] # use a font that can render Chinese labels
plt.rcParams['axes.unicode_minus']=False # render the minus sign correctly
import seaborn as sns
%matplotlib inline
data= pd.read_excel("./Problem_C_Data_Wordle.xlsx",header=1)
data
data preprocessing
data = data.drop(columns='Unnamed: 0')
data['Date'] = pd.to_datetime(data['Date'])
data
data.set_index("Date", inplace=True)
data.sort_index(ascending=True,inplace=True)
data=data.reset_index()
data
data analysis
Data Trends
plt.figure(figsize=(15,6))
data["Date"] = pd.to_datetime(data["Date"])
plt.plot(data['Date'],data['Number of reported results'],'r-o', markersize=3)
plt.legend(['Number of reported results'],fontsize=20)
plt.xlabel('Date',fontsize=14)
plt.ylabel('Number of reported results',fontsize=14)
data distribution
plt.figure(figsize=(10,8))
kdeplo=data['Number of reported results']
g=sns.kdeplot(kdeplo,legend=True,fill=True,color='b',label='Number of reported results') # fill replaces the deprecated shade argument
plt.legend(loc='best', fontsize='large')
from scipy.stats import norm, skew
plt.figure(figsize=(10,8))
(mu, sigma) = norm.fit(data['Number of reported results'])
print('\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))
g = sns.distplot(data['Number of reported results'], fit=norm)
plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
           loc='best')  # raw string keeps the LaTeX backslashes intact
g.set(ylabel='Frequency')
g.set(title=' distribution')
plt.show()
Data statistics - mean, variance, maxima and minima...
data.describe()
data dependency
corr = data.corr(numeric_only=True).abs()  # skip non-numeric columns such as Date and Word
corr['Number of reported results'].sort_values(ascending=False)
Number in hard mode 0.922252
Contest number 0.821787
1 try 0.342183
4 tries 0.211693
2 tries 0.118527
6 tries 0.084180
5 tries 0.077308
3 tries 0.043624
7 or more tries (X) 0.033079
The closer the correlation coefficient is to 1 or -1 (that is, the larger its absolute value), the stronger the linear correlation; the closer it is to 0, the weaker the correlation.
- Correlation coefficient:
- 0.8-1.0 Very strong correlation
- 0.6-0.8 strong correlation
- 0.4-0.6 Moderate correlation
- 0.2-0.4 weak correlation
- 0.0-0.2 Very weak or no correlation
Pearson correlation, also known as product-moment correlation, is a measure of linear correlation developed by the British statistician Karl Pearson in the 1890s. The size of the coefficient alone is not enough to judge the strength of a relationship.
Cut-offs such as those above are rules of thumb and vary between sources; another common convention, after taking the absolute value, treats 0-0.09 as no correlation, 0.1-0.3 as weak, 0.3-0.5 as moderate, and 0.5-1.0 as strong. You often also need a significance test (a t-test on the coefficient) to check whether the two sets of data are significantly correlated; SPSS reports this automatically.
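The significance test does not require SPSS. The following is a minimal sketch, assuming SciPy is installed and using synthetic arrays `x` and `y` (made up for illustration, not the Wordle columns) to show how the coefficient, its p-value, and the underlying t statistic relate:

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic stand-ins for two columns; NOT the real Wordle data.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.8 * x + rng.normal(scale=0.5, size=200)

# pearsonr returns the coefficient and a two-sided p-value for H0: r = 0.
r, p_value = pearsonr(x, y)

# The same coefficient computed directly from the definition.
r_manual = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2)
)

# t statistic behind the test, with n - 2 degrees of freedom.
n = len(x)
t_stat = r * np.sqrt((n - 2) / (1 - r ** 2))

print(f"r = {r:.3f}, p = {p_value:.3g}, t = {t_stat:.2f}")
```

A small p-value indicates that a coefficient this large would be unlikely if the two variables were uncorrelated; it says nothing about how strong the relationship is, which is why both numbers are usually reported.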
plt.figure(figsize=(15,15))
g=sns.heatmap(data.corr(numeric_only=True),cmap='RdYlGn',annot=True)  # numeric columns only
plt.show()
Regression prediction model - XGBoost
The earliest prototype of XGBoost appeared in 2014 as a research project led by Tianqi Chen during his Ph.D. After being open-sourced, it gradually developed into a mature framework with bindings for C++, Java, Python, R, and Julia. XGBoost stands for Extreme Gradient Boosting, an optimized implementation of the gradient boosting algorithm.
The name Gradient Boosting combines two ideas: gradient descent and boosting. Boosting means what it says: combining weak learners step by step to obtain a strong learner. Weak learners are very simple models with low complexity that are easy to train and not prone to overfitting.
Such models are often only slightly better than random guessing, for example decision trees with a single level of depth. The chosen weak learner is called the base learner, and the boosted model is built by combining many base learners, each one fitted to correct the errors of the ensemble so far.
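To make the idea concrete, here is a minimal sketch of gradient boosting for squared error with depth-1 trees (stumps) as base learners. It is an illustration under simplifying assumptions (one feature, synthetic data, hypothetical helper names `fit_stump`, `predict_stump`, `gradient_boost`), not how XGBoost is implemented internally:

```python
import numpy as np

def fit_stump(x, residual):
    """Fit a depth-1 regression tree on a single feature by minimizing squared error."""
    best = None
    for threshold in np.unique(x)[:-1]:          # drop the max so the right side is never empty
        left = residual[x <= threshold]
        right = residual[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def gradient_boost(x, y, n_rounds=50, learning_rate=0.1):
    base = y.mean()                               # initial constant prediction
    prediction = np.full_like(y, base, dtype=float)
    stumps = []
    for _ in range(n_rounds):
        residual = y - prediction                 # negative gradient of 0.5 * (y - f)^2
        stump = fit_stump(x, residual)
        prediction += learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return base, stumps, prediction

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 6, size=200))
y = np.sin(x) + rng.normal(scale=0.1, size=200)

base, stumps, fitted = gradient_boost(x, y)
mse_start = np.mean((y - base) ** 2)
mse_end = np.mean((y - fitted) ** 2)
print(f"MSE before boosting: {mse_start:.4f}, after: {mse_end:.4f}")
```

Each stump on its own is a poor model of the sine curve, but adding many of them, each fitted to the current residuals, lowers the training error round by round.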
Evaluation index
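An evaluation index is needed. For regression problems, the mean squared error (MSE) is often used. For n samples with true values y_i and predictions ŷ_i, the formula is:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$
Calculate the value of MSE based on the formula:
# compute the MSE (y: true values, y_: predictions)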
np.square(np.subtract(y, y_)).mean()
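As a quick check of the expression above, here is a small worked example with made-up values for `y` and `y_`; the root of the MSE (RMSE) is included as well because it is in the same units as the target:

```python
import numpy as np

y = np.array([3.0, -0.5, 2.0, 7.0])    # made-up true values
y_ = np.array([2.5, 0.0, 2.0, 8.0])    # made-up predictions

# Errors are 0.5, -0.5, 0.0, -1.0, so the squared errors sum to 1.5.
mse = np.square(np.subtract(y, y_)).mean()
rmse = np.sqrt(mse)
print(mse, rmse)
```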
Using the XGBoost framework
First execute the following command to install it.
pip install xgboost # install
Regression uses the XGBRegressor() interface.
The model is built with XGBoost, whose regression estimator is XGBRegressor. It has many parameters; the commonly used ones are:
- max_depth – The maximum tree depth of the base learner.
- learning_rate – Boosting learning rate.
- n_estimators – Number of decision trees.
- gamma – Minimum loss reduction required to split a node further; larger values make the model more conservative.
- booster – Specifies the booster to use: gbtree, gblinear or dart.
- n_jobs – Number of parallel threads.
- reg_alpha – L1 regularization weight.
- reg_lambda – L2 regularization weight.
- scale_pos_weight – Balance between positive and negative class weights (relevant for imbalanced classification).
- random_state – random number seed.
Initialize the model with default parameters.
- Call XGBRegressor() to train the model and evaluate it.
import xgboost as xgb
model_r = xgb.XGBRegressor()
Divide the data set, 80% training data and 20% test data
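# keep only numeric feature columns; Date and Word cannot be passed to XGBoost as-is
X = data.drop(columns='Number of reported results').select_dtypes(include='number')
y = data['Number of reported results']  # target values
from sklearn.model_selection import train_test_split
# split the data set: 80% training data and 20% test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)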
X_train.shape, X_test.shape, y_train.shape, y_test.shape
Training and evaluation
- Train using the training data
- Calculate the R^2 evaluation metric using the test data
model_r.fit(X_train, y_train)  # train on the training data
model_r.score(X_test, y_test)  # compute the R^2 metric on the test data
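For reference, the R^2 value reported by score can be reproduced by hand as 1 - SS_res / SS_tot. A minimal sketch with made-up numbers (the helper name `r2_score_manual` is hypothetical):

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([3.0, -0.5, 2.0, 7.0])   # made-up targets
y_pred = np.array([2.5, 0.0, 2.0, 8.0])    # made-up predictions

r2 = r2_score_manual(y_true, y_pred)
# A model that always predicts the mean scores exactly 0.
r2_mean_model = r2_score_manual(y_true, np.full(4, y_true.mean()))
print(round(r2, 4), r2_mean_model)
```

A score of 1 means perfect predictions, 0 means no better than predicting the mean, and negative values mean worse than the mean.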
The objective parameter
Both XGBClassifier and XGBRegressor take an objective parameter.
- For classification, objective='binary:logistic' is the default; for regression, the default is objective='reg:squarederror' (called 'reg:linear' in older XGBoost releases).
- As the name suggests, this parameter specifies the learning task and the loss the learner minimizes.
For regression problems the usual choices are reg:squarederror (squared-error regression) and reg:logistic (logistic regression, for targets in [0, 1]).
Draw a decision tree
XGBoost provides the xgb.plot_tree method, which can draw the decision subtree after the model is trained.
- When using it, you only need to pass in the serial numbers of the model and subtree, and you can draw whichever one you want.
Install the graphviz package
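# install the graphviz package (the Graphviz system binaries must also be installed)
!pip install graphviz
from matplotlib import pyplot as plt
from matplotlib.pylab import rcParams
%matplotlib inline
# set the figure size
rcParams['figure.figsize'] = [50, 10]
xgb.plot_tree(model_r, num_trees=1)  # model_r is the regressor trained above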
Cross-validation
This section shows how to use XGBoost for cross-validation.
Cross-validation is a standard way to estimate how well a model generalizes.
- Divide the data set into N subsets (folds), train the model on N-1 of them, and evaluate it on the remaining one.
Rotate through the folds so that each serves as the evaluation set once, then average the N scores to obtain the final evaluation result.
- XGBoost provides the xgb.cv method to complete the cross-validation process.
Because xgb.cv does the splitting itself, there is no need to divide the data into training and test sets beforehand; the complete data set can be passed in directly.
# pass in the features and the target values
data_d = xgb.DMatrix(data=X, label=y)
xgb.cv(dtrain=data_d, params={'objective': 'reg:squarederror'}, nfold=5, as_pandas=True)
Among the above parameters,
- dtrain is the data set and params holds the model's custom parameters,
- nfold is the number of subsets (folds) used for cross-validation,
- as_pandas indicates that the result is returned as a DataFrame.
By default, XGBoost will perform Boosting iterations 10 times, so you can see 10 lines of output.
- Of course, you can modify the num_boost_round parameter to customize the maximum number of iterations.
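The fold rotation that xgb.cv performs can be written out by hand. The sketch below uses only NumPy, synthetic data, and a least-squares linear model standing in for the boosted trees (the helper name `k_fold_mse` is hypothetical), so it illustrates the procedure rather than XGBoost's own implementation:

```python
import numpy as np

def k_fold_mse(X, y, n_folds=5, seed=0):
    """Return the test MSE of each fold for a least-squares linear model."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(y))            # shuffle once, then cut into folds
    folds = np.array_split(indices, n_folds)
    scores = []
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        # Fit on the other N-1 folds (intercept column + features).
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        # Evaluate on the held-out fold.
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        scores.append(np.mean((y[test_idx] - A_test @ coef) ** 2))
    return np.array(scores)

# Synthetic regression data for illustration only.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

scores = k_fold_mse(X_demo, y_demo)
print(scores, scores.mean())
```

The mean of the per-fold scores is the cross-validated estimate; the spread across folds gives a rough sense of how stable that estimate is.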
question two
- For a given future solution word on a future date,
- develop a model that enables you to predict the distribution of reported outcomes.
- In other words, predict the relative percentages of (1, 2, 3, 4, 5, 6, X) for a future date.
- What uncertainties are associated with your models and forecasts?
- Give a concrete example of your prediction for the word EERIE on March 1, 2023.
- How confident are you in your model's predictions?
There are five vowels: a, e, i, o, u; the remaining letters are consonants.
The consonants are: b, c, d, f, g, h, j, k, l, m, n, p, q, r, s, t, v, w, x, y, z.
import string
small = string.ascii_lowercase  # assumed definition: the 26 lowercase letters
Vowel = ['a','e','i','o','u']
Consonant = list(set(small).difference(set(Vowel)))
def count_Vowel(s):
    # number of vowels in the word (case-insensitive)
    return sum(ch in Vowel for ch in s.lower())
def count_Consonant(s):
    # number of consonants in the word (case-insensitive)
    return sum(ch in Consonant for ch in s.lower())
# df is assumed to be the preprocessed DataFrame with a 'Word' column
df['Vowel_fre'] = df['Word'].apply(count_Vowel)
df['Consonant_fre'] = df['Word'].apply(count_Consonant)
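Applied to the target word EERIE, these letter counts look as follows. This is a standalone sketch; the helper `letter_features` and the extra `distinct_letters` feature are assumptions added for illustration and are not part of the original feature set:

```python
VOWELS = set("aeiou")

def letter_features(word):
    """Count vowels, consonants, and distinct letters in a word (case-insensitive)."""
    word = word.lower()
    n_vowels = sum(ch in VOWELS for ch in word)
    n_consonants = sum(ch.isalpha() and ch not in VOWELS for ch in word)
    return {
        "Vowel_fre": n_vowels,
        "Consonant_fre": n_consonants,
        "distinct_letters": len(set(word)),   # hypothetical extra feature
    }

features = letter_features("EERIE")
print(features)
```

EERIE has four vowels and a single consonant, and only three distinct letters, which is unusual among five-letter words and is the kind of property such features are meant to capture.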
Temporal Feature Transformation
df["year"] = df.index.year
df["quarter"] = df.index.quarter
df["month"] = df.index.month
df["week"] = df.index.isocalendar().week  # DatetimeIndex.week was removed in pandas 2.0
df["weekday"] = df.index.weekday
data standardization
Data standardization rescales the original data with a mathematical transformation so that every feature lies on a comparable scale. Min-max scaling maps values into a fixed interval such as 0 to 1 or -1 to 1, while the StandardScaler used here rescales each feature to zero mean and unit variance.
- It eliminates differences in units, dimensions, and orders of magnitude among variables by converting them into dimensionless relative values,
- that is, it puts the values of every indicator on the same quantitative level.
- This facilitates comprehensive analysis and comparison of indicators with different units or orders of magnitude.
from sklearn.preprocessing import StandardScaler
# standardize: zero mean and unit variance per feature
std = StandardScaler()
X1 = std.fit_transform(X)
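What StandardScaler computes can be reproduced with a few lines of NumPy. A minimal sketch on a made-up two-column array (`X_demo` is illustrative, not the Wordle features):

```python
import numpy as np

# Two features on very different scales (made-up values).
X_demo = np.array([[1.0, 200.0],
                   [2.0, 400.0],
                   [3.0, 600.0]])

mean = X_demo.mean(axis=0)
sigma = X_demo.std(axis=0)          # population standard deviation (ddof=0), as StandardScaler uses
X_scaled = (X_demo - mean) / sigma  # z-score per column

print(X_scaled)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

After the transformation both columns have mean 0 and standard deviation 1, so neither dominates a distance-based or gradient-based method simply because of its units.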
Ensemble Learning - Random Forest
Ensemble learning builds and combines multiple individual learners (called base learners) to complete a learning task. For example, in the table below, √ means the classification is correct and × means it is wrong.
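The benefit of combining learners can be seen in a small majority-vote sketch. The three classifiers and their predictions below are hypothetical values chosen for illustration: each one is wrong on a different sample, so the vote corrects every mistake.

```python
import numpy as np

y_true = np.array([1, 0, 1])          # made-up labels for three test samples

# Rows are three hypothetical classifiers h1..h3, columns are the samples.
predictions = np.array([
    [1, 0, 0],   # h1 is wrong on sample 3
    [0, 0, 1],   # h2 is wrong on sample 1
    [1, 1, 1],   # h3 is wrong on sample 2
])

individual_acc = (predictions == y_true).mean(axis=1)

# Majority vote: predict class 1 when at least two of the three classifiers do.
votes = predictions.sum(axis=0)
ensemble = (votes >= 2).astype(int)
ensemble_acc = (ensemble == y_true).mean()

print(individual_acc, ensemble_acc)
```

Each classifier alone is right two times out of three, while the ensemble is right on all three samples. The gain depends on the errors being different; three classifiers that make the same mistakes would gain nothing from voting.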
random forest
Random Forest is an ensemble of decision trees, but it selects split attributes differently from a single decision tree.
- At each node, a tree in the random forest first draws a random subset of K attributes from that node's attribute set, and then chooses the best attribute for the split from this subset.
- This satisfies the "good but different" condition: each tree is reasonably accurate, and the trees differ from one another. Random forest has low computational overhead and is regarded as a strong general-purpose method among machine learning algorithms.
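The attribute-subset step can be sketched for a single node as follows. This is an illustration only: the helper `best_split_random_subset` is hypothetical, the data is synthetic, and squared error is used as the split criterion rather than the Gini or entropy criteria a classifier would use.

```python
import numpy as np

def best_split_random_subset(X, y, k, rng):
    """Draw k random attributes and return the best split found among them."""
    n_features = X.shape[1]
    candidates = rng.choice(n_features, size=k, replace=False)
    best_feature, best_threshold, best_sse = None, None, np.inf
    for f in candidates:
        column = X[:, f]
        for t in np.unique(column)[:-1]:      # drop the max so the right side is never empty
            left, right = y[column <= t], y[column > t]
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best_sse:
                best_feature, best_threshold, best_sse = int(f), float(t), float(sse)
    return candidates, best_feature, best_threshold

# Synthetic data: only attribute 2 carries signal.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 5))
y_demo = (X_demo[:, 2] > 0).astype(float)

candidates, feature, threshold = best_split_random_subset(X_demo, y_demo, k=2, rng=rng)
print("candidate attributes:", candidates, "chosen:", feature, "threshold:", threshold)
```

Because each node sees only K of the attributes, the informative attribute is sometimes unavailable and a different split is chosen, which is what makes the trees in the forest differ from one another.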
sklearn parameter tuning
Cross Validation Method Tuning
We first adjust: n_estimators, max_depth.
- First observe the number of features, which determines the range of parameters such as max_depth.
- Then use the cross-validation method to tune the parameters.
Get the optimal parameters n_estimators=100, max_depth=10.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
def para_tune(para, X, y):
    clf = RandomForestClassifier(n_estimators=para)  # set n_estimators to para
    score = np.mean(cross_val_score(clf, X, y, scoring='accuracy'))
    return score
def accurate_curve(para_range, X, y, title):
    score = []
    for para in para_range:
        score.append(para_tune(para, X, y))
    plt.figure()
    plt.title(title)
    plt.xlabel('Parameters')
    plt.ylabel('Score')
    plt.grid()
    plt.plot(para_range, score, 'o-')
    return plt
g = accurate_curve([2, 10, 50, 100, 150], X, y, 'n_estimator tuning')
# redefine para_tune so that accurate_curve (defined above) now varies max_depth
def para_tune(para, X, y):
    clf = RandomForestClassifier(n_estimators=300, max_depth=para)
    score = np.mean(cross_val_score(clf, X, y, scoring='accuracy'))
    return score
g = accurate_curve([2, 10, 20, 30, 40], X, y, 'max_depth tuning')
Automatic parameter tuning with scikit-learn's GridSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import learning_curve
def plot_learning_curve(estimator, title, X, y, cv=10,
                        train_sizes=np.linspace(.1, 1.0, 5)):
    plt.figure()
    plt.title(title)  # figure title
    plt.xlabel('Training examples')  # x-axis label
    plt.ylabel('Score')  # y-axis label
    train_sizes, train_scores, test_scores = learning_curve(estimator, X, y, cv=cv,
                                                            train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)  # mean across folds
    train_scores_std = np.std(train_scores, axis=1)  # standard deviation across folds
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()  # background grid
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std,
                     alpha=0.1, color='g')  # shaded band of one standard deviation
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std,
                     alpha=0.1, color='r')
    plt.plot(train_sizes, train_scores_mean, 'o-', color='g',
             label='training score')  # training score curve
    plt.plot(train_sizes, test_scores_mean, 'o-', color='r',
             label='testing score')  # cross-validation score curve
    plt.legend(loc='best')
    return plt
clf = RandomForestClassifier()
para_grid = {
    'max_depth': [10], 'n_estimators': [100], 'max_features': [1, 5, 10], 'criterion': ['gini', 'entropy'],
    'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 5, 10]}  # grid-search over these parameters
gs = GridSearchCV(clf, param_grid=para_grid, cv=3, scoring='accuracy')
gs.fit(X, y)
gs_best = gs.best_estimator_  # the best estimator found
gs.best_score_  # cross-validated accuracy of the best estimator
g = plot_learning_curve(gs_best, 'RFC', X, y)  # plot the learning curve with plot_learning_curve defined above
question three
- Develop and summarize a model to classify solution words by difficulty.
- Identify the attributes of a given word associated with each category.
- Using your model, how hard is the word EERIE?
- Discuss the accuracy of your classification model.
Kmeans clustering algorithm
Algorithm idea
Through repeated iterations, find a partition of the samples into k clusters such that the total error incurred when each cluster's mean represents its samples is as small as possible.
- Objects in the same cluster are highly similar, while objects in different clusters are dissimilar.
- The k-means algorithm is based on the minimum sum-of-squared-errors criterion, with the objective function:
$$J = \sum_{i=1}^{m} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^2$$
In the above formula, x^{(i)} is the i-th of m samples, c^{(i)} is the cluster it is assigned to, and μ_{c^{(i)}} is the mean of that cluster.
- The more similar the samples within each cluster, the smaller the squared error between them and the cluster mean.
- The squared errors of all clusters are then summed,
- so the smaller the value of J, the better.
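The sum-of-squared-errors criterion can be evaluated directly for any assignment of points to clusters. A minimal sketch on four made-up points (the helper names `kmeans_objective` and `cluster_means` are hypothetical), comparing a sensible grouping with a poor one:

```python
import numpy as np

def kmeans_objective(X, centers, labels):
    """Sum of squared distances from each sample to the mean of its assigned cluster."""
    return float(np.sum((X - centers[labels]) ** 2))

def cluster_means(X, labels, k):
    return np.array([X[labels == c].mean(axis=0) for c in range(k)])

# Four made-up points forming two obvious groups (left pair, right pair).
X_demo = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

labels_good = np.array([0, 0, 1, 1])   # left pair vs right pair
labels_bad = np.array([0, 1, 0, 1])    # bottom pair vs top pair

J_good = kmeans_objective(X_demo, cluster_means(X_demo, labels_good, 2), labels_good)
J_bad = kmeans_objective(X_demo, cluster_means(X_demo, labels_bad, 2), labels_bad)
print(J_good, J_bad)
```

The grouping that keeps nearby points together gives a much smaller J, which is exactly the quantity k-means tries to reduce.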
Algorithm implementation steps
The k-means algorithm groups the samples into k clusters, where the value of k is chosen by us, that is, the number of categories we want to divide the data into.
The specific algorithm is described as follows:
- For the data to be clustered, randomly select k points as the initial cluster centroids;
- Compute the distance from each point to every centroid, assign the point to the nearest one, recompute each centroid as the mean of its points, and iterate until the assignments no longer change.
# import the KMeans estimator
from sklearn.cluster import KMeans
est = KMeans(n_clusters=4)  # cluster into 4 groups
est.fit(X)
y_kmeans = est.predict(X)  # predicted cluster labels: an array of 0, 1, 2, 3
# color the points by predicted cluster and visualize
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = est.cluster_centers_  # cluster centers
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.5)  # plot the centers
K-Means Algorithm: Expectation Maximization
K-Means is an algorithm that uses an expectation-maximization method to arrive at a result. Expectation maximization can be explained in two steps, and its working principle is as follows:
1. Guess some cluster centers.
2. Repeat until convergence:
   - Expectation step (E-step): assign each point to its nearest cluster center.
   - Maximization step (M-step): set each cluster center to the mean of the points assigned to it.
from sklearn.metrics import pairwise_distances_argmin  # index of the nearest center for each point
import numpy as np
def find_clusters(X, n_clusters, rseed=2):
    # 1. randomly choose initial cluster centers from the data
    rng = np.random.RandomState(rseed)
    i = rng.permutation(X.shape[0])[:n_clusters]
    centers = X[i]
    while True:
        # 2a. assign labels based on the nearest center
        labels = pairwise_distances_argmin(X, centers)
        # 2b. compute new centers as the mean of the assigned points
        new_centers = np.array([X[labels == k].mean(0)
                                for k in range(n_clusters)])
        # 2c. check for convergence
        if np.all(centers == new_centers):
            break
        centers = new_centers
    return centers, labels
centers, labels = find_clusters(X, 4)
plt.scatter(X[:, 0], X[:, 1], c=labels,
            s=50, cmap='viridis')  # plot the clustering result