[Continuously Updated] Meisai (MCM/ICM) Modeling Notes (Overview)

As long as you want to learn, it's never too late



Reference

Bilibili video


1. Topic selection

If you're a bro, choose Problem C

2. Modeling method

Focus on each method's advantages, disadvantages, and usage scenarios. Only the basic methods are listed here.
In actual modeling you need to extend them according to the situation; the literature is a good place to look for ideas.

1. Prediction problem

Differential equation:
Grey prediction:
Markov:
Time series:
Interpolation and fitting:
(Omitted) Neural network:

2. Classification problem

(Omitted) Support vector machine:
Cluster analysis:
Principal component analysis:
Discriminant analysis:
Canonical correlation analysis:

3. Optimization problem

Linear programming:
Non-linear programming:
Tabu search:
Simulated annealing:
Genetic algorithm:
(Omitted) Artificial neural network:

4. Evaluation and decision-making

TOPSIS (ideal solution method):
Fuzzy comprehensive evaluation:
Data envelopment analysis (DEA):
Grey relational analysis:
Principal component analysis:
(Omitted) Rank-sum ratio (RSR) comprehensive evaluation:


3. Data search

1. Literature search

CNKI (China National Knowledge Infrastructure)
Web of Science
Google Scholar, Baidu Scholar
Wikipedia

2. Database

Kaggle Datasets
National database
Heywhale (和鲸) datasets
Alibaba Tianchi
GitHub public datasets

4. Data processing

1. Data cleaning

(1) Missing value

Delete the variable: if it has a high missing rate and low coverage, drop it
Fixed-value fill: usually a sentinel such as 9999 (standing in for infinity)
Statistic fill: fill according to the data distribution (mean for evenly, roughly symmetrically distributed data; median for skewed data)
Imputation and interpolation fill: random imputation, multiple imputation, hot-deck imputation, Lagrange interpolation, Newton interpolation
(Omitted) Model-based fill: regression, Bayes, random forest, decision tree
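A minimal sketch of three of the fills listed above, using only the Python standard library. The function names and the sample data are made up for illustration, and missing values are assumed to be stored as `None`. Plain linear interpolation stands in here for the Lagrange and Newton methods, since it is the simplest case of the same idea.

```python
from statistics import mean, median

def missing_rate(values):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)

def fill_with_statistic(values, skewed=False):
    """Fill None with the median (skewed data) or the mean (roughly symmetric data)."""
    observed = [v for v in values if v is not None]
    fill = median(observed) if skewed else mean(observed)
    return [fill if v is None else v for v in values]

def fill_linear(values):
    """Fill interior gaps by linear interpolation between the nearest observed neighbours.

    Leading or trailing gaps are left as None.
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled

income = [10, 12, None, 11, 95, None, 13]  # hypothetical right-skewed sample
series = [1.0, None, None, 4.0, 5.0]       # hypothetical evenly spaced sample

print(missing_rate(income))
print(fill_with_statistic(income, skewed=True))
print(fill_linear(series))
```

With the skewed sample, the median fill (12) stays close to the bulk of the data, whereas the mean (28.2) would be pulled up by the single large value.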

(2) Outliers

Detection: simple statistical analysis (box plot, checking each quantile), median absolute deviation (MAD), and distance-based, density-based, or clustering-based methods.
Handling: delete the records; apply a log transform to dampen extreme values; replace with the mean or median; or leave them alone when the model is robust to outliers (e.g. tree models).
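The box-plot and MAD checks above can be sketched as follows, again with the standard library only. The cutoffs (1.5 times the IQR, a modified z-score of 3.5 with the 0.6745 scaling constant) are common conventions and are assumptions here, not values from these notes; the helper names and data are hypothetical.

```python
from statistics import median, quantiles

def iqr_outliers(values, k=1.5):
    """Box-plot rule: flag points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def mad_outliers(values, threshold=3.5):
    """Flag points whose MAD-based modified z-score exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:  # more than half the values are identical; the score is undefined
        return []
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

def replace_with_median(values, outliers):
    """Median replacement: swap each flagged value for the sample median."""
    med = median(values)
    return [med if v in outliers else v for v in values]

data = [10, 12, 11, 13, 12, 11, 10, 95]  # hypothetical sample with one extreme value

print(iqr_outliers(data))
print(mad_outliers(data))
print(replace_with_median(data, iqr_outliers(data)))
```

Both rules flag only the value 95 for this sample. Note that `quantiles` uses the "exclusive" method by default, so the quartiles can differ slightly from those of other tools.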

Outlier handling in Matlab: link

(3) Noise treatment

Binning: smooth the data by replacing the values in each bin with a statistic of that bin (for example the bin mean or median)
Regression: fit a regression model between the variable and its predictor variables, then use the regression coefficients and the predictors to back out an approximate (smoothed) value of the variable
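A small sketch of smoothing by bin means. It assumes equal-frequency bins of a fixed size, which is one common choice rather than the only one; the function name and the sample values are illustrative.

```python
from statistics import mean

def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of bin_size, and replace each value by its bin mean.

    The result is returned in sorted order, not in the original order.
    """
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # hypothetical noisy measurements
print(smooth_by_bin_means(prices, 3))
```

Here the three bins are [4, 8, 15], [21, 21, 24], and [25, 28, 34], so every value is replaced by 9, 22, or 29 respectively.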

2. Data integration

Entity resolution: determine, for example, that customer_id in database A and club_id in database B refer to the same entity
Redundancy (sort-merge): sort the records and detect duplicates by checking whether adjacent records are similar. Redundant attributes can be detected with correlation analysis: a correlation coefficient matrix for numeric variables, a chi-square test for nominal variables
Conflict handling: when merging different data sets into one, keep the values standardized and de-duplicated
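The redundancy checks above can be sketched like this. The record fields, the key function, and all numbers are made up for illustration. The chi-square helper returns only the test statistic; turning it into a p-value would need a chi-square distribution table or a statistics library.

```python
from math import sqrt

def adjacent_duplicates(records, key):
    """Sort-merge idea: sort by a normalized key and compare each record with its neighbour."""
    ordered = sorted(records, key=key)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if key(a) == key(b)]

def pearson(x, y):
    """Pearson correlation coefficient; assumes neither variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def chi_square(table):
    """Chi-square statistic of a contingency table given as a list of rows of counts."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

def name_key(record):
    """Hypothetical normalization: trim whitespace and ignore case."""
    return record["name"].strip().lower()

records = [
    {"name": "Alice Wang", "city": "Xi'an"},
    {"name": "Bob Li", "city": "Wuhan"},
    {"name": " alice wang", "city": "Xi'an"},
]
height_cm = [150.0, 160.0, 170.0, 180.0]
height_in = [h / 2.54 for h in height_cm]  # same quantity in other units: redundant
counts = [[250, 200], [50, 1000]]          # made-up counts for two nominal variables

print(adjacent_duplicates(records, name_key))
print(pearson(height_cm, height_in))
print(chi_square(counts))
```

A correlation close to 1 or -1 suggests one of the two numeric columns can be dropped, and a large chi-square statistic suggests the two nominal variables are not independent.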

3. Data reduction

4. Data transformation

(1) Standardized processing

Min-max normalization
Z-score standardization
Log transformation
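The three transformations can be written in a few lines each. This sketch uses the population standard deviation for the z-score and `log1p` (log of 1 + x) for the log transformation so that zeros are allowed; both are assumptions, and the function names and data are illustrative.

```python
from math import log1p
from statistics import mean, pstdev

def min_max(values):
    """Rescale to [0, 1]; assumes the values are not all equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Centre on the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def log_transform(values):
    """log(1 + x); assumes every value is greater than -1."""
    return [log1p(v) for v in values]

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample: mean 5, population std 2

print(min_max(data))
print(z_score(data))
print(log_transform(data))
```

For this sample the z-scores run from -1.5 (for 2) to 2.0 (for 9), and min-max maps 2 to 0 and 9 to 1.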

(2) Discretization processing

When to use: the model requires continuous data to be split into discrete intervals; discretized features are easier to interpret; discretization can compensate for hidden defects in the data.
Methods: equal-frequency binning; equal-width binning; clustering
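A sketch of the equal-width and equal-frequency methods, returning a bin index for each value. The clustering method is not shown. Function names and the sample ages are hypothetical.

```python
def equal_width_bins(values, k):
    """Bin index 0..k-1 using k intervals of equal width; assumes max > min."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The maximum value would land in bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Bin index 0..k-1 so that each bin holds roughly the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins

ages = [18, 22, 25, 31, 40, 47, 58, 90]  # hypothetical sample

print(equal_width_bins(ages, 3))      # → [0, 0, 0, 0, 0, 1, 1, 2]
print(equal_frequency_bins(ages, 3))  # → [0, 0, 0, 1, 1, 1, 2, 2]
```

The output shows the trade-off: with equal width, the single large value (90) stretches the range and most ages fall into the first bin, while equal frequency keeps the bins balanced.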

(3) Sparse processing

Encode as 0/1 dummy variables
Group values into the same category
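A minimal sketch of both steps. The notes do not say which values get grouped together, so merging infrequent values into a single "other" category is an assumption here, as are the function names, the `min_count` threshold, and the sample data.

```python
from collections import Counter

def merge_rare(values, min_count, other="other"):
    """Assumed grouping rule: values seen fewer than min_count times become `other`."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]

def one_hot(values):
    """Return the sorted category list and one 0/1 row per value."""
    categories = sorted(set(values))
    rows = [[int(v == c) for c in categories] for v in values]
    return categories, rows

colors = ["red", "blue", "red", "green", "blue", "teal"]  # hypothetical nominal variable

merged = merge_rare(colors, min_count=2)
categories, rows = one_hot(merged)
print(merged)
print(categories)
print(rows)
```

Grouping before encoding keeps the number of dummy columns small: the four original colors become three columns (blue, other, red) instead of four.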

Origin blog.csdn.net/weixin_45660543/article/details/113112111