2023 MCM Spring Contest Problem Y: a step-by-step approach with code

First question:

Develop a mathematical model to explain the listed prices for each sailboat in the provided spreadsheet. Include any predictors you find useful. You can take advantage of other resources to understand other characteristics of a given sailboat (such as beam, draft, displacement, rigging, sail area, hull material, engine hours, sleeping capacity, headroom, electronics, etc.) as well as economic data by year and region:

First, clean and preprocess the dataset, including handling missing values and outliers. This can be done in Excel or any other data analysis tool.

Next, carry out variable selection and feature engineering: pick features that are likely to correlate with sailboat price and build the model on them. For example, a regression model can explain price using predictors such as sailboat length, hull material, engine hours, sleeping capacity, and electronics.

Additionally, other data sources, such as economic data by year and region, can be leveraged to enhance the model's interpretability and predictive accuracy. For example, factors such as GDP, per capita income, price level, and exchange rate can be considered, and prices can be adjusted in combination with factors such as geographical location and market demand.

Finally, implement the model in statistical software or a programming language such as Python or R, and use techniques such as cross-validation to validate and test its performance. During modeling, pay attention to choosing an appropriate algorithm, tuning model parameters, and handling collinearity and nonlinearity. Interpretable models such as linear regression or decision trees are recommended, because they make it easier to see how each predictor drives the predicted price.

Specific steps:

Data cleaning and preprocessing:

Check the dataset for missing values and outliers, and handle them as necessary.

Transform and normalize features such as sailboat length, hull material, engine hours, sleeping capacity, and electronics: use one-hot encoding (dummy variables) for categorical variables, and standardization or normalization for continuous variables.
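The two cleaning steps above can be sketched on a toy table. This is only an illustration under stated assumptions: the column names (Make, Length, Price) and all values are invented, median filling is one of several reasonable ways to handle gaps, and the 1.5 × IQR rule is just one common outlier convention.

```python
import numpy as np
import pandas as pd

# Toy listings; column names and values are illustrative, not the spreadsheet's.
boats = pd.DataFrame({
    "Make": ["Bavaria", "Beneteau", "Bavaria", "Jeanneau", "Beneteau", "Jeanneau"],
    "Length": [38.0, 41.0, np.nan, 45.0, 50.0, 44.0],
    "Price": [120000.0, 150000.0, 135000.0, 180000.0, 5000000.0, 175000.0],
})

# 1) Missing values: fill the numeric gap with the column median.
boats["Length"] = boats["Length"].fillna(boats["Length"].median())

# 2) Outliers: keep only prices inside the 1.5 * IQR fences.
q1, q3 = boats["Price"].quantile([0.25, 0.75])
iqr = q3 - q1
inside = boats["Price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean = boats[inside].copy()

# 3) Categorical variable: one-hot encode, dropping one level so the
#    dummies are not perfectly collinear with the intercept.
encoded = pd.get_dummies(clean, columns=["Make"], drop_first=True)

# 4) Continuous variable: standardize to zero mean and unit variance.
encoded["Length"] = (
    encoded["Length"] - encoded["Length"].mean()
) / encoded["Length"].std()

print(encoded)
```

In a real run, the scaling statistics would ideally be computed on the training split only, so that no information from the test set leaks into the model.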

Feature selection and building models:

Based on business understanding and domain knowledge, select predictor variables that are likely to correlate with sailboat price.

Use a regression model, such as linear regression, ridge regression, or lasso regression, to establish the relationship between sailboat price and the predictor variables.

Tune and optimize the model, for example by choosing the regularization coefficient for ridge or lasso, or by fitting with ordinary least squares when no regularization is used.
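Choosing the regularization coefficient can be handed to cross-validation. The sketch below uses scikit-learn's RidgeCV and LassoCV on synthetic data; the data, the alpha grid, and the variable names are assumptions made for illustration, not values tuned for the sailboat spreadsheet.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix: 5 predictors, of which only
# the first two actually drive the synthetic "price".
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=n)

# Candidate regularization strengths (an arbitrary illustrative grid).
alphas = np.logspace(-3, 2, 20)

# Scaling lives inside the pipeline, so it is fit together with the model.
ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas))
lasso = make_pipeline(StandardScaler(), LassoCV(alphas=alphas, cv=5, random_state=0))
ridge.fit(X, y)
lasso.fit(X, y)

best_ridge_alpha = ridge[-1].alpha_
best_lasso_alpha = lasso[-1].alpha_
lasso_coef = lasso[-1].coef_

print("Best ridge alpha:", best_ridge_alpha)
print("Best lasso alpha:", best_lasso_alpha)
print("Lasso coefficients:", lasso_coef)
```

Lasso tends to shrink the coefficients of irrelevant predictors toward zero, which is useful when one-hot encoding produces a large number of dummy columns.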

Enhance model interpretability and predictive accuracy:

Leverage additional data sources, such as economic data by year and region, to enhance model interpretability and predictive accuracy.

Adjust sailboat prices according to regional economic data such as GDP, per capita income, price level, and exchange rate.

Also account for geographic location and market demand, since prices may differ across regions and markets.
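One way to bring economic data into the model is to join an indicator table onto the listings by region and year. The sketch below is hypothetical throughout: the column names, the PriceIndex indicator, and every number are placeholders rather than real statistics, and dividing by a price index is only one possible adjustment.

```python
import pandas as pd

# Toy listings; names and values are placeholders.
listings = pd.DataFrame({
    "Region": ["Europe", "Europe", "USA", "Caribbean"],
    "Year": [2010, 2015, 2015, 2012],
    "Price": [100000.0, 150000.0, 200000.0, 120000.0],
})

# Placeholder economic indicators (made-up numbers, not real data).
economy = pd.DataFrame({
    "Region": ["Europe", "Europe", "USA"],
    "Year": [2010, 2015, 2015],
    "PriceIndex": [100.0, 110.0, 125.0],
})

# A left join keeps every listing; validate guards against duplicated
# (Region, Year) rows in the indicator table.
merged = listings.merge(
    economy, on=["Region", "Year"], how="left", validate="many_to_one"
)

# Flag listings with no matching indicator instead of silently dropping them.
merged["HasEconData"] = merged["PriceIndex"].notna()

# Express each price relative to a base index of 100.
merged["RealPrice"] = merged["Price"] / merged["PriceIndex"] * 100.0

print(merged)
```

Alternatively, the indicators can be passed to the regression as extra predictors, leaving the price itself unadjusted.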

Model Evaluation and Validation:

Use techniques such as cross-validation to evaluate and verify the model's performance, reporting metrics such as root mean square error (RMSE) and mean absolute error (MAE).

Interpret the model: analyze how each predictor affects price, and assess the model's predictive power and interpretability.
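A k-fold evaluation reporting both RMSE and MAE might look like the sketch below. The data are synthetic and the choice of 5 folds is an assumption; with the real spreadsheet, X and y would be the prepared feature matrix and the price column.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in for the prepared features and prices.
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 3))
y = 50.0 + 10.0 * X[:, 0] + 5.0 * X[:, 1] + rng.normal(scale=1.0, size=150)

# Shuffled 5-fold split with a fixed seed for reproducibility.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# scikit-learn reports errors as negative scores (higher is better),
# so the sign is flipped afterwards.
scores = cross_validate(
    LinearRegression(),
    X,
    y,
    cv=cv,
    scoring=("neg_root_mean_squared_error", "neg_mean_absolute_error"),
)
rmse_per_fold = -scores["test_neg_root_mean_squared_error"]
mae_per_fold = -scores["test_neg_mean_absolute_error"]

print("RMSE per fold:", rmse_per_fold, "mean:", rmse_per_fold.mean())
print("MAE per fold:", mae_per_fold, "mean:", mae_per_fold.mean())
```

Averaging over folds gives a more stable estimate than a single train/test split, which matters when some makes or regions have only a few listings.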

Code:

1. Import libraries and load the data

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression

from sklearn.model_selection import train_test_split

from sklearn.metrics import mean_squared_error, r2_score

# Load the dataset (adjust sheet_name and the column names below to match the actual spreadsheet)

df = pd.read_excel('2023_MCM_Problem_Y_Boats.xlsx', sheet_name='Monohull')

2. Data cleaning and processing:

# Drop rows with missing values

df.dropna(inplace=True)



# One-hot encode the categorical variables

df = pd.get_dummies(df, columns=['Manufacturer', 'Model', 'Region', 'Country'])



# Standardize the continuous variables

df['Length'] = (df['Length'] - df['Length'].mean()) / df['Length'].std()

df['Displacement'] = (df['Displacement'] - df['Displacement'].mean()) / df['Displacement'].std()

df['EngineHours'] = (df['EngineHours'] - df['EngineHours'].mean()) / df['EngineHours'].std()



# Separate the features from the target variable

X = df.drop(columns=['Price'])

y = df['Price']

3. Feature Selection and Model Construction

# Split the dataset into training and test sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)



# Build the linear regression model

reg = LinearRegression()



# Fit the model

reg.fit(X_train, y_train)



# Predict on the test set

y_pred = reg.predict(X_test)



4. Model evaluation and validation:

# Evaluate model performance

rmse = np.sqrt(mean_squared_error(y_test, y_pred))

r2 = r2_score(y_test, y_pred)



# Print the performance metrics

print('RMSE:', rmse)

print('R2 Score:', r2)

5. Model interpretation:

# Print the model coefficients and intercept

coef = pd.DataFrame({'Features': X.columns, 'Coefficients': reg.coef_})

print(coef)

print('Intercept:', reg.intercept_)
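After one-hot encoding, the coefficient table can run to hundreds of rows, so it may help to rank the coefficients by absolute size before reading them. A small sketch on synthetic data follows; the feature names Length, Age, and Noise are invented for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic, already-standardized features; names are illustrative only.
rng = np.random.default_rng(2)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["Length", "Age", "Noise"])
y = 20.0 * X["Length"] - 8.0 * X["Age"] + rng.normal(scale=0.5, size=100)

reg = LinearRegression().fit(X, y)

# Rank coefficients by magnitude, keeping their signs for interpretation.
coef = pd.Series(reg.coef_, index=X.columns)
ranked = coef.reindex(coef.abs().sort_values(ascending=False).index)

print(ranked)
```

Comparing magnitudes this way is only meaningful when the features share a common scale, which is one reason to standardize the continuous variables first.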

The problem also asks us to identify and describe all the data sources used. What does that mean?

It simply means we need to work out what each data source contains and describe it in the paper.

2023_MCM_Problem_Y_Boats.xlsx file: This is the main source of data and includes data on approximately 3,500 sailing boats between 36 and 56 feet in length, including make, model, length, geographic region, country/state, price, and year.

To improve the interpretability and predictive accuracy of the model, it may be necessary to draw on other sources for additional sailboat characteristics such as beam, draft, displacement, rigging, sail area, hull material, engine hours, sleeping capacity, headroom, and electronics. This data may come from sailboat manufacturers, third-party data providers, sailing associations, and so on.

Next, to better understand the sailboat market, it may be necessary to consider economic and regional data by year, such as GDP, per capita income, price level, and exchange rate. These data can come from official statistical agencies, economic research institutes, the World Bank, and similar sources.

Also, the background mentions that a Hong Kong (SAR) sailing broker commissioned a modeling team to prepare a report on the pricing of used sailing yachts. Therefore, it may also be necessary to obtain data from brokers about the used sailboat market, such as sales prices, transaction cycles, transaction volumes, market trends, etc. These data can be obtained through communication and negotiation with the broker.

Here is a concrete example, taking the extraction of additional sailboat features as the case:

Review and selection of available data sources: You first need to identify available data sources and review the quality and availability of the data. Possible data sources include official websites of sailing boat manufacturers, third-party data providers, sailing associations, etc.

Determine the required features and variables: Decide which features and variables the model needs. These may include beam, draft, displacement, rigging, sail area, hull material, engine hours, sleeping capacity, headroom, electronics, etc. Check both whether each selected feature is related to price and whether the features are correlated with one another.

Extract and process data: Extract and clean required feature and variable data from data sources as needed. Transformation and normalization of the data may be required, such as one-hot encoding or dummy variables for categorical variables and standardization or normalization for continuous variables.

Integrate and merge data: Integrate and merge newly extracted feature and variable data with the original dataset. We need to ensure that each sailboat in the merged dataset has complete feature and variable information.

Further feature selection and model building: Use the new feature and variable data for further feature selection and model building to improve model interpretability and predictive accuracy. As before, check each new feature's relationship with price and its correlation with the other features.
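The merge and correlation-check steps above can be sketched as follows. Everything here is hypothetical: the make and variant labels, the Beam and Draft values, and the idea that (Make, Variant) is a usable join key are assumptions made for the illustration.

```python
import pandas as pd

# Toy listings from the main spreadsheet (placeholder values).
listings = pd.DataFrame({
    "Make": ["MakeA", "MakeA", "MakeB", "MakeC"],
    "Variant": ["38", "41", "45", "50"],
    "Price": [100000.0, 120000.0, 150000.0, 210000.0],
})

# Extra specifications gathered from an outside source (invented numbers).
specs = pd.DataFrame({
    "Make": ["MakeA", "MakeA", "MakeB"],
    "Variant": ["38", "41", "45"],
    "Beam": [3.9, 4.1, 4.4],
    "Draft": [1.8, 2.0, 2.1],
})

# indicator=True adds a _merge column showing which rows found a match.
merged = listings.merge(specs, on=["Make", "Variant"], how="left", indicator=True)

# Listings without specs need another source, imputation, or exclusion.
unmatched = merged[merged["_merge"] == "left_only"]
complete = merged[merged["_merge"] == "both"].drop(columns="_merge")

# Correlation of the new features with price and with each other.
corr = complete[["Price", "Beam", "Draft"]].corr()

print("Listings without specs:", len(unmatched))
print(corr)
```

A high correlation between two new features, such as beam and draft here, suggests keeping only one of them or using a regularized model.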

Origin blog.csdn.net/shumofuhuayuan/article/details/129880474