Full analysis and code of the multiple linear regression model for 2023 MCM Problem Y: factors affecting sailboat listing prices!

The provided "2023_MCM_Problem_Y_Boats.xlsx" file contains information on approximately 3,500 sailboats of different lengths, regions, and manufacturing years, split into two tables: monohull sailboats and catamarans.

We can use this data to develop a mathematical model that explains the listing price of each sailboat, including any predictors that prove useful. Additional data may be added as needed, but the data in the "2023_MCM_Problem_Y_Boats.xlsx" file must be included in the model, and the source of any supplemental data must be fully identified and documented.

In order to develop a mathematical model that explains listing prices, we need to consider what factors affect the price of a sailboat.

Here are some predictive factors that may be useful:

Year of manufacture: The year a sailboat was built can affect its value, as older sailboats usually show more wear and require more maintenance.

Length: The length of a sailboat can also be an important predictor. Generally speaking, longer sailboats tend to be more expensive.

Geographic region: The region may also affect the price of a sailboat, as some areas have a more active sailboat market and higher prices.

Make and model: The make and model of a sailboat may also affect the price, as some brands are more popular than others and therefore cost more.

Hull materials: Different hull materials can affect the price of a sailboat, as some materials are more durable or more expensive.

Engine age: Engine age may also affect the price of a sailboat, as a newer or less-used engine may be worth more.

Electronics: Electronics may also be a useful predictor, as certain equipment may increase the value of a sailboat.

Other factors: Other factors such as the number of cabins, ventilation, water treatment, and power systems may also affect the price of a sailboat.

We can use these predictors to develop a multiple linear regression model that explains the listing price of each sailboat. Missing values can be filled in by imputation.

The final model will provide a predicted listing price for each sailboat; the accuracy of the predictions depends on how well the model fits and on the quality of the data.

Based on the above factors, we can build a multiple linear regression model to explain sailboat prices. The model might take the following form:

Price = β0 + β1·Length + β2·Material + β3·Displacement + β4·Tonnage + β5·Propulsion + β6·Structure + β7·Equipment + β8·Region + β9·Year + β10·Market Conditions + ε

Here, Price is the dependent variable, the price of the boat; Length, Material, Displacement, Tonnage, Propulsion, Structure, and Equipment are independent variables describing the boat's characteristics; Region is a dummy variable for the boat's geographical location; Year is the year of manufacture; Market Conditions represents market conditions and can be an indicator variable; β0, β1, ..., β10 are the regression coefficients; and ε is the error term.
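A minimal sketch of how a categorical predictor such as Region can enter the equation as dummy variables. The four-row table and its values are made up for illustration and are not taken from the boats file; the column names are assumptions as well.

```python
import pandas as pd

# Hypothetical toy listings; the values are illustrative only.
boats = pd.DataFrame({
    "Length": [38.0, 42.5, 45.0, 40.0],
    "Year": [2008, 2012, 2015, 2010],
    "Region": ["Europe", "Caribbean", "USA", "Europe"],
})

# One-hot encode Region. drop_first=True drops one level (the baseline),
# so the dummy columns are not collinear with the intercept term β0.
design = pd.get_dummies(boats, columns=["Region"], drop_first=True, dtype=int)

print(design.columns.tolist())
```

Each remaining dummy column gets its own coefficient, which is read as the price difference relative to the dropped baseline region.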

We can use a linear regression model to build a mathematical model that predicts listing prices. The linear regression model assumes a linear relationship between the predictor variables and the response variable. We obtain the regression coefficients of the predictor variables by fitting the existing data, and then use these coefficients to predict the listing price of a sailboat. The formula of the linear regression model is:

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \epsilon_i$$

Here, $y_i$ is the response variable of the $i$-th sample (i.e., the listing price), $x_{ij}$ is the $j$-th predictor variable of the $i$-th sample (such as length or geographical region), $\beta_0$ is the intercept, $\beta_j$ is the regression coefficient of the $j$-th predictor variable, $p$ is the number of predictors, and $\epsilon_i$ is the error term. We can use information such as length, geographical location, and year in the existing data as predictor variables and the listing price as the response variable, fit the regression coefficient of each predictor, and substitute the coefficients into the formula above to obtain the predicted listing price.
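As a rough illustration of what "fitting the regression coefficients" means, the sketch below estimates them by ordinary least squares with NumPy. The data is synthetic, and the "true" coefficients are hypothetical values chosen only so the result can be checked.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (not from the boats file): two predictors, known coefficients.
n_samples = 200
X = rng.normal(size=(n_samples, 2))
true_beta = np.array([3.0, 1.5, -2.0])  # beta_0, beta_1, beta_2 (hypothetical)
y = true_beta[0] + X @ true_beta[1:] + rng.normal(scale=0.1, size=n_samples)

# Prepend a column of ones so that the intercept beta_0 is estimated as well.
A = np.column_stack([np.ones(n_samples), X])

# Least-squares estimate of the coefficients.
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

# Predicted values: y_hat_i = beta_0 + sum_j beta_j * x_ij
y_hat = A @ beta_hat

print(beta_hat)
```

The estimates should land close to the coefficients used to generate the data, which is a quick sanity check to run before moving to real listings.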

Polynomial regression is an extension of linear regression that assumes the relationship between the independent variable $X$ and the response variable $Y$ can be described by a polynomial of degree $n$. The general form of a polynomial regression model is:

$$Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \cdots + \beta_n X^n + \epsilon$$

Here, $\beta_0, \beta_1, \beta_2, ..., \beta_n$ are the regression coefficients, and $\epsilon$ is the error term.
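A small sketch of a degree-2 polynomial fit using `np.polyfit`. The data is generated from a hypothetical noise-free quadratic, so the recovered coefficients can be compared with the ones used to build it.

```python
import numpy as np

# Synthetic example: y = 1 + 2x - 0.5x^2 with no noise (hypothetical coefficients).
x = np.linspace(0.0, 10.0, 50)
y = 1.0 + 2.0 * x - 0.5 * x**2

# np.polyfit returns the coefficients from the highest degree to the lowest.
coeffs = np.polyfit(x, y, deg=2)

# Reorder to beta_0, beta_1, beta_2 to match the formula above.
beta = coeffs[::-1]

print(np.round(beta, 3))
```

The model is still linear in the coefficients, so the same least-squares machinery applies; only the predictor columns ($X$, $X^2$, ...) change.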

# Import the required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Read the data (monohull sheet)
df = pd.read_excel('2023_MCM_Problem_Y_Boats.xlsx', sheet_name='Monohull')

# Data cleaning: drop rows with missing values and non-positive prices
df = df.dropna()
df = df[df['Listing Price'] > 0]

# Feature extraction: one-hot encode the categorical region column
X = df[['Length (ft)', 'Year', 'Country/Region/State']]
X = pd.get_dummies(X)
y = df['Listing Price']

# Split into training and test sets (80% / 20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model training
model = LinearRegression()
model.fit(X_train, y_train)

# Model evaluation on the test set
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = model.score(X_test, y_test)  # coefficient of determination R^2

# Print the evaluation results
print('MSE:', mse)
print('RMSE:', rmse)
print('R2 score:', r2)
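
For reference, the three values printed by the script are defined as follows. Here $m$ (a symbol introduced only for this note) is the number of test samples, $\hat{y}_i$ is the predicted price of the $i$-th test sample, and $\bar{y}$ is the mean of the observed test prices:

$$\mathrm{MSE} = \frac{1}{m}\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2, \qquad \mathrm{RMSE} = \sqrt{\mathrm{MSE}}, \qquad R^2 = 1 - \frac{\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{m}\left(y_i - \bar{y}\right)^2}$$

RMSE is in the same units as the listing price, which makes it easier to interpret than MSE; $R^2$ is what `model.score` returns for a `LinearRegression`.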

Specifically, we have some further plans for optimizing the model:

Consider more features: In addition to the columns given in the problem, other characteristics may affect the price of a sailboat, such as its weight, age, engine hours, interior, and materials. You can try adding these features and use feature selection methods to decide which ones are most relevant.
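One possible feature selection approach, sketched here as an assumption rather than the method used in this analysis, is univariate scoring with scikit-learn's `SelectKBest` and `f_regression`. The data is synthetic: only columns 0 and 2 actually drive the response.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)

# Synthetic data: five candidate features, but y depends only on columns 0 and 2.
X = rng.normal(size=(300, 5))
y = 4.0 * X[:, 0] - 3.0 * X[:, 2] + rng.normal(scale=0.5, size=300)

# Score each feature with a univariate F-test and keep the two best.
selector = SelectKBest(score_func=f_regression, k=2).fit(X, y)
selected = selector.get_support(indices=True)

print(selected)
```

Univariate scores ignore interactions between features, so this works best as a first filter before a more careful comparison of candidate models.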

Handling missing values: The data set may contain missing values, which can be filled in by imputation, for example mean imputation, median imputation, or K-nearest-neighbor imputation.
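A minimal sketch of median imputation with pandas, as an alternative to dropping incomplete rows with `dropna`. The table and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps; the values are illustrative only.
df = pd.DataFrame({
    "Length": [38.0, np.nan, 45.0, 40.0, 50.0],
    "Year": [2008, 2012, np.nan, 2010, 2016],
})

# Median imputation: replace each missing entry with the median of its column.
filled = df.fillna(df.median(numeric_only=True))

print(filled)
```

For the K-nearest-neighbor variant, scikit-learn's `KNNImputer` is one option; it fills each gap from the most similar rows instead of from a single column-wide statistic.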

Handling outliers: The data may contain outliers. Outlier detection methods such as boxplot detection or clustering-based detection can be used to identify and handle them.
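The boxplot rule can be sketched as follows: a value counts as an outlier when it lies more than 1.5 times the interquartile range (IQR) outside the quartiles. The prices are made up, with one deliberately extreme entry.

```python
import pandas as pd

# Hypothetical prices; the last one is deliberately extreme.
prices = pd.Series([95_000, 120_000, 135_000, 150_000, 160_000, 175_000, 2_500_000])

# Quartiles and interquartile range.
q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Boxplot whiskers: anything outside [lower, upper] is flagged.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (prices < lower) | (prices > upper)

cleaned = prices[~is_outlier]
print(cleaned.tolist())
```

Whether to drop flagged listings or keep them (for example, after a log transform of the price) is a modeling decision; the rule only identifies candidates.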

Handling nonlinear relationships: The relationship between the features and the price may be nonlinear. Nonlinear modeling methods such as polynomial regression, support vector machines, or neural networks can be used to capture these relationships.

Consider geographical location: The price of a sailboat may be affected by where it is located. You can consider using geographic information system (GIS) technology to add the sailboat's location and characteristics of the surrounding area (such as population density or the number of nearby ports) to the model.

Use a time series model: You can consider using a time series model to capture changes in sailboat prices over time, which may involve time trends, seasonal factors, holiday effects, and so on.

See here for more details: In-depth analysis of the 2023 MCM (Problem Y) | Complete mathematical modeling code + full analysis of the modeling process - Zhihu (zhihu.com)

Origin blog.csdn.net/qq_25834913/article/details/132497788