Machine Learning Algorithm Competition in Practice -- 2. Problem Modeling

Table of contents

1. Comprehension of the competition questions

2. Data understanding

3. Evaluation indicators (classification and regression)

Thinking exercise

When contestants receive the competition topic, the first thing to consider is problem modeling, along with building the pipeline for a baseline model, so that feedback on results arrives as early as possible and guides the follow-up work. In addition, competitions are built on real business scenarios and complex data. Contestants usually have many ideas about these, but the number of online submissions that can be verified is often limited, so it is very important to split the training and validation sets reasonably and to build a trustworthy offline validation scheme. This is also the basis for guaranteeing the generalization of the model.
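As a minimal sketch of such an offline validation scheme (using a made-up toy dataset, not the competition data), stratified K-fold splitting keeps the class ratio inside every validation fold, which makes the offline score a more trustworthy proxy for the online leaderboard:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# toy imbalanced labels: 15 negatives, 5 positives (invented data)
X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 15 + [1] * 5)

# stratification keeps the 3:1 class ratio inside every validation fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (tr_idx, va_idx) in enumerate(skf.split(X, y)):
    print(fold, np.bincount(y[va_idx]))  # every fold: 3 negatives, 1 positive
```

With a plain (unstratified) split on data this small, some folds could end up with no positives at all, making the fold score meaningless.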

Problem modeling in a competition can be divided into three parts: comprehension of the problem, sample selection, and the offline evaluation strategy.

1. Comprehension of the competition questions

Understanding the competition problem means sorting out the problem intuitively and analyzing how to solve it. The competition background reveals the main pain point behind the problem. For the analysis of the real business, we can use our own prior knowledge to conduct a preliminary analysis, which paves the way for the next part.

2. Data understanding

We can divide data understanding into two parts: the data base layer and the data description layer. In the exploration stage, we further understand the data and discover key information from it.
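As a minimal sketch of the description layer, pandas can surface basic statistics and missing values in a couple of lines; the toy table below is invented for illustration, not competition data:

```python
import pandas as pd

# toy table standing in for competition data (hypothetical columns)
df = pd.DataFrame({
    "age": [23, 35, None, 41],
    "clicks": [0, 3, 1, 7],
})

print(df.describe())      # basic statistics: count, mean, std, quantiles
print(df.isnull().sum())  # missing values per column
```

Even this quick pass answers base-layer questions (how many rows, which columns have gaps) before any modeling starts.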

3. Evaluation indicators (classification and regression)

In real data sets there is often an imbalance between positive and negative samples, that is, there are many more negative samples than positive ones, or the reverse, and the distribution of positive and negative samples in the test set may also change over time. The ROC curve has the nice property of remaining essentially unchanged in this situation. However, the ROC curve itself is not commonly used in competitions; AUC, by contrast, can be called our old friend, appearing frequently in classification problems.


AUC is an extremely common evaluation metric in Internet search, recommendation, and advertising-ranking services. It is defined as the area under the ROC curve; because the ROC curve generally lies above the line y = x, its value ranges between 0.5 and 1. AUC is used as an evaluation metric because the ROC curve alone often does not clearly indicate which classifier is better, whereas AUC is a single number: the larger the value, the better the classifier. The ranking property of AUC is worth mentioning: compared with metrics such as precision and recall, AUC has nothing to do with the absolute values of the probabilities predicted by the model; it depends only on the ordering of the samples, so it is especially suitable as an evaluation metric for ranking-related problems. AUC is also a probability: if we randomly select one positive sample and one negative sample, AUC is the probability that the classifier, according to its computed scores, ranks the positive sample ahead of the negative one. Hence the larger the AUC, the more likely the classifier is to rank positive samples ahead of negative ones, that is, the better the classification.
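That pairwise definition can be checked directly with a small brute-force implementation; the labels and scores below are invented toy values, not from any real model:

```python
def auc_by_rank(y_true, y_score):
    """AUC = P(score of a random positive > score of a random negative),
    counting ties as half a win."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 0, 0, 1]
y_score = [0.9, 0.7, 0.6, 0.2, 0.4]
print(auc_by_rank(y_true, y_score))                    # 5/6 ≈ 0.833
# any monotone rescaling of the scores leaves AUC unchanged (ranking property)
print(auc_by_rank(y_true, [s * 10 for s in y_score]))  # still 5/6
```

The second call demonstrates the ranking property from the paragraph above: multiplying every score by 10 changes the absolute values but not the ordering, so AUC is identical.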

Logarithmic loss mainly evaluates whether the probabilities predicted by the model are accurate enough; it pays more attention to agreement with the observed data, while AUC evaluates the model's ability to rank positive samples ahead of negative ones. Because the two metrics emphasize different things, the chosen metric will differ depending on the question the contestant is considering. For advertising CTR estimation, if the effect of ad ranking is what matters, we can choose AUC, so that the result is not affected by extreme values. Logarithmic loss, in addition, reflects the average deviation and tends to favor accurate predictions on the class with more samples.

Although the mean absolute error solves the problem of positive and negative residuals canceling each other out, and so measures the quality of a regression model better, the absolute value makes the function non-smooth: it cannot be differentiated at some points. In other words, the mean absolute error is not second-order continuously differentiable, and its second derivative is 0 wherever it exists.
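The non-smoothness is easy to see from the gradient with respect to a single prediction: for MAE it is just the sign of the residual (constant magnitude, undefined at the kink where prediction equals target). A toy numeric check with invented values:

```python
def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mae_grad(t, p):
    # d|t - p| / dp = -sign(t - p); undefined when t == p (the kink)
    return -1.0 if t > p else 1.0 if t < p else None

print(mae([3.0, 5.0], [2.0, 7.0]))             # (1 + 2) / 2 = 1.5
print(mae_grad(3.0, 2.0), mae_grad(3.0, 4.0))  # -1.0 1.0: constant magnitude
print(mae_grad(3.0, 3.0))                      # None: not differentiable
```

Because the gradient never shrinks as the residual approaches zero and the second derivative is 0, optimizers that rely on curvature (e.g. Newton-style boosting) cannot use MAE directly, which is why smooth surrogates such as the Huber loss exist.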

Even in actual competitions, the data provided by the organizer may have quality problems that give contestants a real headache. This undoubtedly has a great impact on the final prediction results, so it is necessary to consider how to select appropriate sample data for training. How, then, do we select appropriate samples? Before answering this question, let's look at what specifically affects the results. There are four main causes: an overly large data set seriously hurts model performance; noise and abnormal data lead to insufficient accuracy; redundant or irrelevant sample data bring no benefit to the model; and an uneven distribution of positive and negative samples skews the data.
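For the last cause, one simple remedy is to downsample the majority class before training. The sketch below uses invented sample IDs and a hypothetical 3:1 target ratio, chosen only for illustration:

```python
import random

random.seed(0)
# made-up dataset: 90 negative and 10 positive sample ids
samples = [(i, 0) for i in range(90)] + [(i, 1) for i in range(90, 100)]

neg = [s for s in samples if s[1] == 0]
pos = [s for s in samples if s[1] == 1]

# keep at most 3 negatives per positive (a hypothetical target ratio)
neg_kept = random.sample(neg, min(len(neg), 3 * len(pos)))
balanced = neg_kept + pos
print(len(balanced))  # 40 samples at a 3:1 negative-to-positive ratio
```

Note that downsampling shifts the base rate, so probability outputs may need recalibration afterwards; ranking metrics like AUC are less affected.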

Thinking exercise:


Evaluation indicators and loss functions in machine learning - Yasin_'s blog (CSDN) https://blog.csdn.net/Yasin0/article/details/94435677

Summary of 7 major loss functions in machine learning (with Python drill) - Zhihu https://zhuanlan.zhihu.com/p/80370381

[Deep Learning] One article to understand common machine learning loss functions (Loss Function) - Tencent Cloud Developer Community https://cloud.tencent.com/developer/article/1165263

Machine learning: the difference between loss function (loss) and evaluation metric (metric) - Zhihu https://zhuanlan.zhihu.com/p/373032887

Loss function VS evaluation metric - cnblogs https://www.cnblogs.com/pythonfl/p/13705143.html

Ideas for solving unbalanced-class problems in machine learning - Zhihu https://zhuanlan.zhihu.com/p/84322912

When the sample classes of a dataset are unbalanced, how should the training and test sets be built? - Zhihu https://www.zhihu.com/question/373862904

How does "cross-validation" choose the K value? - Zhihu https://zhuanlan.zhihu.com/p/31924220

Cross Validation and Hyperparameter Tuning: How to Optimize Your Machine Learning Model - Zhihu https://zhuanlan.zhihu.com/p/184608795

Advantages and disadvantages of k-fold cross-validation; training, validation, and test sets - CSDN blog https://blog.csdn.net/weixin_35988311/article/details/112540577

Do you really understand cross-validation and overfitting? - Solong1989 (cnblogs) https://www.cnblogs.com/solong1989/p/9415606.html

Classification and regression (how to transform classification problems into regression problems) - matrix_studio's blog (CSDN) https://blog.csdn.net/matrix_studio/article/details/121100472

Origin blog.csdn.net/m0_63309778/article/details/128800775