Machine Learning Algorithm Competitions in Practice (3): Data Exploration

Data exploration is one of the core modules of a competition, and how well it is carried out is often key to the final result. So what is data exploration, and what problems can it solve? First, clarify three points: how to make sure the data is ready for the algorithm model used in the competition, how to choose the most suitable algorithm for the data set, and how to define the feature variables that the model can use.

Data exploration helps answer these three points and lays the groundwork for a good competition result. It is a way of summarizing, visualizing and becoming familiar with the important features of a data set. It helps us discover characteristics of the data, and the correlations it reveals between variables help with subsequent feature construction.

Data exploration can be regarded as groundwork done before the competition proper. It mainly covers the analysis ideas, the analysis methods and the purpose of the analysis. Through systematic exploration, we can deepen our understanding of the data.

In an actual competition, it is better to explore each variable along several paths and with several methods, and then compare the results. Once the data set is fully understood, you can move on to the data preprocessing and feature extraction stages, where the data is transformed according to the desired business outcome. The purpose of this step is to be confident that the data set is ready to be fed to the machine learning algorithm.

Exploration should cover not only each variable on its own but also the relationships between variables and the correlation between variables and labels, with hypothesis testing where appropriate, to help us extract useful features.
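As one possible illustration of hypothesis testing between a variable and the label, the sketch below runs a chi-square test of independence with `scipy.stats.chi2_contingency`. The column names (`plan`, `label`), the toy values and the 0.05 significance level are all assumptions made for the example, not taken from any competition data set.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data: a categorical feature and a binary label.
df = pd.DataFrame({
    "plan": ["basic"] * 20 + ["premium"] * 20,
    "label": [0] * 18 + [1] * 2 + [0] * 3 + [1] * 17,
})

# Contingency table of feature value versus label.
table = pd.crosstab(df["plan"], df["label"])

# Null hypothesis: the feature and the label are independent.
chi2, p_value, dof, expected = chi2_contingency(table)

ALPHA = 0.05  # conventional significance level, chosen here as an assumption
print(table)
print(f"chi2={chi2:.2f}, p={p_value:.4g}, dof={dof}")
if p_value < ALPHA:
    print("Independence rejected: the feature looks related to the label.")
else:
    print("No evidence of a relationship at this significance level.")
```

A small p-value only suggests that the feature is worth keeping for further inspection; it does not measure how useful the feature will be to a model.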

Correlation analysis can only compare numerical features, so letter or string features must first be encoded as numerical values before the relationships between features can be examined. In actual competitions, correlation analysis is an effective way to filter out features that are not directly related to the label, and this method has worked well in many competitions.
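A minimal sketch of this idea, assuming pandas is available: a string column is converted to integer codes, each feature's Pearson correlation with the label is computed, and features below a threshold are dropped. The data, the column names and the 0.3 threshold are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data: column names and values are invented for illustration.
df = pd.DataFrame({
    "age": [22, 35, 47, 51, 29, 63, 41, 38],
    "city": ["A", "B", "A", "C", "B", "C", "A", "B"],  # string feature
    "noise": [5, 1, 4, 2, 3, 5, 1, 2],
    "label": [0, 0, 1, 1, 0, 1, 1, 0],
})

# Encode the string column as integer codes so it can enter a correlation matrix.
# Integer codes impose an arbitrary order; one-hot encoding is an alternative.
encoded = df.copy()
encoded["city"] = encoded["city"].astype("category").cat.codes

# Absolute Pearson correlation of every feature with the label, strongest first.
corr_with_label = (
    encoded.corr()["label"].drop("label").abs().sort_values(ascending=False)
)

# Keep features whose absolute correlation exceeds an arbitrary threshold.
THRESHOLD = 0.3
selected = corr_with_label[corr_with_label > THRESHOLD].index.tolist()

print(corr_with_label)
print("selected:", selected)
```

Pearson correlation only captures linear relationships, so a feature with a low score here may still be useful to a non-linear model; the threshold is a starting point rather than a rule.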

The purpose of data exploration is to help us understand the data and construct effective features.

Univariate analysis is too simple to reveal the internal relationships between variables, so multivariate analysis is needed to obtain more detailed information.
 

Analyzing the relationships among feature variables helps to build better features while reducing the chance of constructing redundant features.
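One common way to look for redundant features is to scan the feature-to-feature correlation matrix for pairs that are almost perfectly correlated. The sketch below assumes NumPy and pandas; the synthetic columns (`x1`, `x1_scaled`, `x2`) and the 0.95 cutoff are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Synthetic data: x1_scaled is deliberately a near-duplicate of x1.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
df = pd.DataFrame({
    "x1": x1,
    "x1_scaled": 3.0 * x1 + rng.normal(scale=0.01, size=n),
    "x2": rng.normal(size=n),
})

# Absolute correlation between every pair of features.
corr = df.corr().abs()

# Keep only the strict upper triangle so each pair is checked once
# and the diagonal (always 1.0) is ignored.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flag a column if it is highly correlated with any earlier column.
CUTOFF = 0.95  # arbitrary cutoff for this sketch
redundant = [col for col in upper.columns if (upper[col] > CUTOFF).any()]
reduced = df.drop(columns=redundant)

print("redundant:", redundant)
print("remaining:", list(reduced.columns))
```

Which member of a correlated pair gets dropped depends only on column order here; in practice one might keep the feature with the stronger relationship to the label.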

 
The learning curve is a widely used evaluation tool in machine learning. It shows how the training-set and validation-set scores change over the training iterations and helps us quickly understand how well the model is learning. We can use the learning curve to check whether the model is overfitting, and then decide how to improve the model by judging the fit.
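A minimal sketch of a learning curve with scikit-learn follows. Note one difference from the description above: `sklearn.model_selection.learning_curve` varies the size of the training set rather than the number of training iterations, but the curve is read in the same way. The synthetic data set, the logistic regression model and the cross-validation settings are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic binary classification data, used only for illustration.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)

# Score the model on growing fractions of the training data with 5-fold CV.
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X,
    y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
    scoring="accuracy",
)

# Average over the CV folds to get one point per training-set size.
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

for size, tr, va in zip(train_sizes, train_mean, val_mean):
    print(f"n={size:4d}  train={tr:.3f}  validation={va:.3f}  gap={tr - va:.3f}")
```

A training score that stays well above the validation score suggests overfitting, while two low scores that sit close together suggest the model is too simple for the data. Plotting `train_mean` and `val_mean` against `train_sizes` gives the usual picture.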

 



Origin blog.csdn.net/m0_63309778/article/details/128808325