2023 Mathematical Contest in Modeling (MCM): Analysis of Problem Y

The following is an analysis of approaches to Problem Y of the US spring contest:

Problem Y: Understanding Used Sailboat Prices

Like many luxury items, sailboats change in value as they age and as market conditions change. The attached "2023_MCM_Problem_Y_Boats.xlsx" file includes data on approximately 3,500 sailboats between 36 and 56 feet in length that were advertised for sale in Europe, the Caribbean, and the United States in December 2020.

A boating enthusiast provided the data to COMAP. Like most real-world datasets, it may have missing data or other issues that require some data cleaning before analysis. The Excel file includes two tabs, one for monohull sailboats and another for catamarans. Within each tab, the columns are labeled Make, Variant, Length (in feet), Geographic Region, Country/State, Listing Price (in USD), and Year (year of manufacture). For a given Make, Variant, and Year, there are many other sources besides the provided Excel file that may provide a detailed description of a particular sailboat.

Background analysis: The passage above sets up the problem, and the core task is to solve it using the data supplied with the problem statement. The statement says that data cleaning is needed before analysis, which means the paper should open with a data preprocessing section.

Common data preprocessing methods include data cleaning, data integration, data transformation, data reduction, and data discretization. Different treatments can be applied to different problems according to the characteristics of the data indicators. The problem statement specifically mentions missing data, for which there are three common approaches:

(1) Deleting tuples: delete the objects (tuples, records) that have missing attribute values, so that a complete information table is obtained.

(2) Data completion: fill each empty value with a certain value so that the information table becomes complete. Besides conventional methods, intelligent algorithms such as random forests and neural networks can also be used for completion.

(3) Mean/mode completion: split the attributes in the information table into numeric and non-numeric attributes and process them separately. If the missing value is numeric, fill it with the average of that attribute over all other objects; if it is non-numeric, follow the majority principle in statistics and fill it with the value that occurs most frequently for that attribute among all other objects.

(The specific preprocessing scheme for this problem will be given in the dataset analysis in Appendix 1 for reference.)
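As a minimal sketch of mean/mode completion, the snippet below fills missing numeric values with the column mean and missing non-numeric values with the most frequent value. The function name fill_missing and the sample records are made up for illustration; they are not rows from the contest file.

```python
from collections import Counter
from statistics import mean


def fill_missing(rows, numeric_cols, categorical_cols):
    """Return copies of rows with None replaced by the column mean or mode."""
    fill = {}
    for col in numeric_cols:
        observed = [r[col] for r in rows if r[col] is not None]
        fill[col] = mean(observed)  # average over the non-missing objects
    for col in categorical_cols:
        observed = [r[col] for r in rows if r[col] is not None]
        fill[col] = Counter(observed).most_common(1)[0][0]  # most frequent value
    return [
        {k: (fill[k] if v is None and k in fill else v) for k, v in r.items()}
        for r in rows
    ]


# Hypothetical records used only to exercise the function.
boats = [
    {"Length (ft)": 40.0, "Geographic Region": "Europe"},
    {"Length (ft)": None, "Geographic Region": "Europe"},
    {"Length (ft)": 50.0, "Geographic Region": None},
    {"Length (ft)": 45.0, "Geographic Region": "Caribbean"},
]
filled = fill_missing(boats, ["Length (ft)"], ["Geographic Region"])
```

The original list is left unchanged, so the deleted-tuple and filled versions of the data can be compared side by side.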

You may supplement the provided dataset with any additional data of your choice; however, you must include the data in "2023_MCM_Problem_Y_Boats.xlsx" in your modeling. Ensure that the source of any supplementary data used is fully identified and documented. Sailboats are often sold through brokers. In order to better understand the sailboat market, a sailboat broker in Hong Kong, China has commissioned your team to prepare a report on the pricing of used sailboats. The broker wants you to:

Background analysis: Since additional data can be used to supplement the provided dataset, data collection is also a top priority for this problem, and some data will be shared with everyone in the group as it is updated. Since the broker is based in Hong Kong, China, the Asia-Pacific region can be chosen as the place of sale, and existing data for that region can be collected.

Problem One: Build a mathematical model to explain the listing price of each sailboat in the spreadsheet provided. Include any predictors that you find useful. You can use other sources to learn about other characteristics of a given sailboat (such as beam, draft, displacement, rigging, sail area, hull material, engine hours, sleeping capacity, headroom, electronics, etc.), as well as economic data by year and region. Identify and describe all data sources used. Include a discussion of the precision of your price estimate for each sailboat variant.

Tip: Before doing the first question, you need to preprocess the data set given in the question.

Analysis of Question 1: The first question can be divided into three sub-questions. The first is to describe the datasets obtained from other sources, which tests the team's ability to find information; overseas statistics websites and other good sources on the Internet are worth checking. It is best to start with a preliminary EDA (Exploratory Data Analysis) of the dataset, including some data visualization. This includes, but is not limited to:

  • Data volume, number of features, data type
  • Data distribution (standard deviation, quantiles, maximum and minimum values)
  • Duplicate handling (keep or delete): keep duplicates when they carry information for your problem. For example, if you want to discover a user's behavior pattern and the user performs the same operation at different times, the repeated records can reveal the user's preferences, so they can be kept.
  • Outlier handling (keep or delete): keep outliers when your task needs them. For example, in an anomaly detection task this information helps you label the data effectively.
  • Missing value handling (deletion, filling)
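The checklist above can be sketched for a single numeric column as follows. This is only an illustration: the summarize helper, the 1.5 × IQR outlier rule, and the sample prices are assumptions chosen for the example, not something the problem statement prescribes.

```python
from statistics import mean, quantiles, stdev


def summarize(values):
    """Basic EDA numbers for one numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    q1, median, q3 = quantiles(present, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # assumed outlier fences
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "duplicates": len(present) - len(set(present)),
        "min": min(present),
        "max": max(present),
        "mean": mean(present),
        "std": stdev(present),
        "q1": q1,
        "median": median,
        "q3": q3,
        "outliers": [v for v in present if v < low or v > high],
    }


# Hypothetical listing prices in USD, not taken from the contest file.
prices = [100000, 120000, 120000, 150000, 180000, None, 2000000]
summary = summarize(prices)
```

Whether the flagged outliers and duplicates are kept or deleted still depends on the needs of the model, as noted in the list.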

The data visualization methods recommended here are:

Univariate visualization: view data distribution - histogram, box plot

Visualization of two variables: correlation analysis - line graph, scatter plot, heat map.

The second sub-question requires a model that explains whether the prices in the dataset are reasonable. This can be kept fairly simple: first use correlation analysis to pick out the highly correlated indicators, then fit a multiple linear regression to judge whether the prices are reasonable (a route suitable for beginners). You can also use factor analysis, association rules, or predictive algorithms for discriminant analysis.
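A minimal sketch of the beginner route, correlation analysis followed by multiple linear regression, is given below using NumPy. The data are synthetic and generated inside the snippet (the coefficients 20000 and -15000 are invented), so the numbers say nothing about real sailboat prices; only the workflow is illustrated.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic predictors: length in feet and age in years (assumed ranges).
length = rng.uniform(36, 56, n)
age = rng.uniform(0, 20, n)
# Synthetic price with invented coefficients plus noise.
price = 50_000 + 20_000 * length - 15_000 * age + rng.normal(0, 10_000, n)

# Step 1: correlation of each candidate indicator with price.
corr_length = np.corrcoef(length, price)[0, 1]
corr_age = np.corrcoef(age, price)[0, 1]

# Step 2: multiple linear regression by ordinary least squares.
X = np.column_stack([np.ones(n), length, age])  # intercept, length, age
coef, *_ = np.linalg.lstsq(X, price, rcond=None)

# Goodness of fit: coefficient of determination R^2.
pred = X @ coef
r2 = 1 - np.sum((price - pred) ** 2) / np.sum((price - price.mean()) ** 2)
```

A listing whose price lies far from the fitted value can then be examined as possibly unreasonable.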

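The third sub-question asks for a discussion of how precise the model's price estimates are, for which confidence levels and confidence intervals can be analyzed. You can also split the data into a training set and a test set and compute the error on the test set directly. Accuracy and the ROC curve are classification measures, so they can be computed and visualized only if prices are first grouped into classes (for example, above or below a price threshold).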
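The train/test idea can be sketched as below with a one-predictor least-squares line. The data are synthetic, and the "1.96 × residual standard deviation" half-width is only a rough band that assumes roughly normal residuals; the helper names are hypothetical.

```python
import math
import random
from statistics import mean, stdev


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def fit_line(pairs):
    """Ordinary least squares for y = slope * x + intercept."""
    xs, ys = zip(*pairs)
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in pairs) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx


def error_report(pairs, slope, intercept):
    """Error measures on held-out data."""
    residuals = [y - (slope * x + intercept) for x, y in pairs]
    return {
        "mae": mean(abs(r) for r in residuals),
        "rmse": math.sqrt(mean(r * r for r in residuals)),
        "interval_half_width": 1.96 * stdev(residuals),  # rough 95% band
    }


# Synthetic (age, price) pairs with invented parameters.
noise = random.Random(1)
rows = [(i % 20, 300_000 - 10_000 * (i % 20) + noise.gauss(0, 5_000)) for i in range(40)]
train, test = train_test_split(rows)
slope, intercept = fit_line(train)
report = error_report(test, slope, intercept)
```

Reporting the error separately for each variant, where enough listings exist, addresses the requirement to discuss precision per variant.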

1. Precision

The term precision has been translated in several different ways, and some of the translations are easily confused with accuracy, so it is safest to stick to the English term precision.

Precision is the proportion of samples predicted as positive that are actually positive; in other words, it measures how many of the positive predictions are correct. According to Figure 1-1, the calculation formula is: P = TP / (TP + FP)

2. Recall rate (recall)

Recall is the proportion of actual positive samples that are predicted as positive; in other words, it measures how many of the positive samples are recalled. According to Figure 1-1, the calculation formula is: R = TP / (TP + FN)

3. Accuracy

Accuracy is the proportion of correct predictions among all predicted samples. Its calculation formula is: A = (TP + TN) / (TP + FN + FP + TN)

Summary: The numerators of the formulas for precision and recall are both TP, the number of positive samples predicted as positive, so they can be read as the precision and the recall of the positive class. Accuracy characterizes the proportion of all predictions that are correct.
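The three formulas can be checked with a small sketch. Because price is a continuous quantity, the example assumes a hypothetical labeling in which a boat counts as positive when its price exceeds 200,000 USD; the threshold and all prices are invented.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, FN, TN for two equal-length lists of booleans."""
    tp = sum(a and p for a, p in zip(actual, predicted))
    fp = sum((not a) and p for a, p in zip(actual, predicted))
    fn = sum(a and (not p) for a, p in zip(actual, predicted))
    tn = sum((not a) and (not p) for a, p in zip(actual, predicted))
    return tp, fp, fn, tn


THRESHOLD = 200_000  # assumed cut-off for the "high-priced" class

# Hypothetical actual and model-predicted listing prices in USD.
actual_prices = [250000, 310000, 220000, 150000, 180000, 120000, 90000, 400000]
predicted_prices = [240000, 330000, 190000, 210000, 205000, 130000, 100000, 380000]

actual = [p > THRESHOLD for p in actual_prices]
predicted = [p > THRESHOLD for p in predicted_prices]

tp, fp, fn, tn = confusion_counts(actual, predicted)
precision = tp / (tp + fp)                  # P = TP / (TP + FP)
recall = tp / (tp + fn)                     # R = TP / (TP + FN)
accuracy = (tp + tn) / (tp + fn + fp + tn)  # A = (TP + TN) / all samples
```

With these made-up numbers TP = 3, FP = 2, FN = 1 and TN = 2, so precision is 0.6, recall is 0.75 and accuracy is 0.625.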

Here are some suggestions for dataset processing:

1 Time series analysis can be done according to the year

2 There are many quantitative indicators, and such data can be normalized

3 Listing Price (USD) needs to be converted from a text format such as $204,921 to the number 204921

4 Cluster analysis can be done on Geographic Region by country

5 Some non-numeric data, if needed, can be converted into numerical data using one hot encoding
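Suggestions 2, 3 and 5 can be sketched in plain Python as follows. The helper names are hypothetical, and min-max scaling is just one common choice of normalization.

```python
def parse_price(text):
    """Convert a price string such as '$204,921' to the integer 204921."""
    return int(text.replace("$", "").replace(",", "").strip())


def one_hot(values):
    """Return the sorted categories and one 0/1 row per input value."""
    categories = sorted(set(values))
    return categories, [[1 if v == c else 0 for c in categories] for v in values]


def min_max(values):
    """Scale numbers to the range [0, 1]; a constant column maps to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


price = parse_price("$204,921")
categories, encoded = one_hot(["Europe", "Caribbean", "USA", "Europe"])
scaled_lengths = min_max([36, 46, 56])
```

The encoded rows and scaled columns can then be fed directly to a regression or clustering model.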

The above is only part of the approach to the first question and will be improved later.

Question Two: Discuss how your modeling of the given geographic regions can be useful in the Hong Kong (SAR) market.

Select an informative subset of sailboats, divided into monohulls and catamarans, from the spreadsheet provided.

Find comparable listing price data for this subset from the Hong Kong (SAR) market.

What regional effect, if any, would Hong Kong (SAR) have on the price of each sailboat in your subset?

Is the effect on catamarans and monohulls the same?

Question three:

Question four:

A PDF solution of no more than 25 pages in total should include:

· A one-page summary sheet that clearly describes your approach to the problem and the most important conclusions from your analysis in the context of the problem.

· Table of contents.

· Your complete solution.

· A one- to two-page report to the broker.

· Reference list.

Note: The MCM contest is limited to 25 pages.

All aspects of your submission count towards the 25 page limit (summary sheet, table of contents, report, one or two page report to broker, reference list and any appendices).

You must cite the sources of your ideas, data, images, and any other material used in your report.

Dataset Description

Data file input description:
Make: The name of the boat's manufacturer.
Variant: The name that identifies the specific model of the boat.
Length (Ft): The length of the boat in feet.
Geographic Region: The geographic region where the boat is located (Caribbean, Europe, United States).
Country/State: The specific country/state where the boat is located.
Listing Price (USD): The advertised price in USD to purchase the boat.
Year: The year the boat was built.

Glossary: (not translated; the original wording is kept)


Origin blog.csdn.net/weixin_43345535/article/details/129871102