Introduction to Data Mining: Study Notes (1)

I have recently been watching Tsinghua University's Introduction to Data Mining course and want to keep notes for my own review. I am organizing what I learn here, and I hope a beginner's notes can be of some help to fellow students.

Classification:

Definition: Given a training set {(x1, y1), ..., (xn, yn)}, produce a classifier (a function) that maps any unknown object xi to its class label yi.

Classic algorithms:

  • Decision Tree
  • KNN
  • Neural Networks
  • Support Vector Machines

Note: We want a good classifier to get most results right; we do not insist on 100% correct results.

Cross-validation of classification algorithms:

Process:

  1. Generate a model from the training data set.
  2. Evaluate the model on the test set (Evaluation).
  3. Feed the evaluation results back into model generation.
  4. If the evaluation result is satisfactory, output the generated model. Otherwise, regenerate it.
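
The four steps above can be sketched as a small loop. This is only an illustration under my own assumptions, not the course's code: the data are synthetic heights invented for the example, the "model" is a single height threshold, and the names `make_data`, `train`, `fit_until_satisfactory`, and the `target` accuracy of 0.8 are all hypothetical. It also uses one fixed train/test split, the simplest form of the process.

```python
import random

def make_data(n, seed):
    """Synthetic (height_cm, label) pairs; label 1 = male, 0 = female (invented data)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        mean = 175.0 if label == 1 else 162.0
        data.append((rng.gauss(mean, 6.0), label))
    return data

def accuracy(threshold, data):
    """Fraction of samples where the rule 'height >= threshold means male' matches the label."""
    correct = sum(1 for h, y in data if (1 if h >= threshold else 0) == y)
    return correct / len(data)

def train(train_set, candidates):
    """Step 1: generate a model, here the candidate threshold that fits the training set best."""
    return max(candidates, key=lambda t: accuracy(t, train_set))

def fit_until_satisfactory(train_set, test_set, target=0.8, max_rounds=5):
    step = 8.0  # spacing of the candidate thresholds, in cm
    model, score = None, 0.0
    for _ in range(max_rounds):
        candidates = [150.0 + step * i for i in range(int(40 / step) + 1)]
        model = train(train_set, candidates)   # step 1: generate
        score = accuracy(model, test_set)      # step 2: evaluate on the test set
        if score >= target:                    # step 4: satisfactory, so output the model
            break
        step /= 2                              # step 3: feedback, search a finer grid next round
    return model, score

data = make_data(400, seed=0)
train_set, test_set = data[:300], data[300:]
model, score = fit_until_satisfactory(train_set, test_set)
print(f"threshold = {model:.1f} cm, test accuracy = {score:.2f}")
```

The feedback in step 3 is deliberately crude here (just a finer grid of thresholds); with a real algorithm it would mean changing features or parameters before regenerating the model.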

To see how the evaluation is carried out, we first need to understand one term: the confusion matrix, a table that crosses the actual classes with the predicted classes.

Here is an example to help understand it:

We take gender as y, so there are two classes, male and female. We treat male as positive and female as negative.

If a person is male, the actual value is positive. If we feed this person's attributes (which attributes to use is up to you) into the model and it outputs positive, the case falls in the true positive cell: the prediction succeeded. If the model outputs negative, the case falls in the false negative cell: a man was predicted to be a woman. The other two cells (false positive and true negative) work the same way for people who are actually female.

The accuracy of the model can then be computed as accuracy = (TP + TN) / (P + N), that is, the fraction of the test data set that the model predicts correctly. Here P and N are the numbers of actual positives and actual negatives.
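
To make the four cells concrete, here is a minimal sketch that counts them for the male/female example and applies the accuracy formula. The eight labels are made up for illustration, and `confusion_counts` is a hypothetical helper name.

```python
def confusion_counts(actual, predicted, positive="male"):
    """Count the four confusion-matrix cells, treating `positive` as the positive class."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1   # actual male, predicted male
        elif a == positive:
            fn += 1   # actual male, predicted female
        elif p == positive:
            fp += 1   # actual female, predicted male
        else:
            tn += 1   # actual female, predicted female
    return tp, fn, fp, tn

# Invented test-set labels, for illustration only.
actual    = ["male", "male", "male", "female", "female", "female", "female", "male"]
predicted = ["male", "female", "male", "female", "male", "female", "female", "male"]

tp, fn, fp, tn = confusion_counts(actual, predicted)
p, n = tp + fn, fp + tn              # P = actual positives, N = actual negatives
accuracy = (tp + tn) / (p + n)
print(tp, fn, fp, tn, accuracy)      # prints: 3 1 1 3 0.75
```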

Next, the ROC curve:

First, look at a figure!

Suppose we predict gender directly from the height attribute:

The horizontal axis is height, the two curves represent men and women, and the vertical line in the middle is the threshold.

The purple region is where TP and FP overlap.

Now the second figure. There is not much to say about it: the 1 in it corresponds to the area enclosed under a curve in the first figure.

Now look at the third figure:

If we set the threshold to 1 m, everyone is predicted to be male. TP is 100% and FP is 100%, which corresponds to the upper right corner of the third figure (a very small threshold).

If the threshold is set to 5 m, both FP and TP are 0, which corresponds to the lower left corner (a very large threshold).

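Different thresholds correspond to different points in this coordinate system.

The diagonal connecting the two opposite corners is the random guess line: whoever comes along, whatever their attributes, guessing at random gives exactly this result.

In theory, we want this curve to be as high as possible. To measure how good the curve is, we define AUC, the area under the curve, as one indicator of model quality. The closer it is to 1, the better the model.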
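
A small sketch of how the threshold traces out the curve, and how the area under it can be computed. The eight heights (in metres) and their labels are invented for illustration, `roc_points` and `auc` are hypothetical helper names, and the area uses the trapezoid rule, which is one common way to compute it and not necessarily the one used in the course.

```python
def roc_points(heights, labels, thresholds):
    """Return sorted (FP rate, TP rate) points; predict male (1) when height >= threshold."""
    p = sum(labels)              # number of actual positives (men)
    n = len(labels) - p          # number of actual negatives (women)
    points = []
    for t in thresholds:
        tp = sum(1 for h, y in zip(heights, labels) if h >= t and y == 1)
        fp = sum(1 for h, y in zip(heights, labels) if h >= t and y == 0)
        points.append((fp / n, tp / p))
    return sorted(points)

def auc(points):
    """Area under the piecewise-linear curve through the points (trapezoid rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Invented data: heights in metres, label 1 = male, 0 = female.
heights = [1.58, 1.62, 1.65, 1.70, 1.68, 1.74, 1.78, 1.83]
labels  = [0,    0,    0,    0,    1,    1,    1,    1]

# 1.0 m and 5.0 m are the two extreme thresholds discussed in the text.
thresholds = [1.0] + sorted(heights) + [5.0]
points = roc_points(heights, labels, thresholds)

print(points[-1])                          # threshold 1 m: (1.0, 1.0), upper right corner
print(points[0])                           # threshold 5 m: (0.0, 0.0), lower left corner
print(auc(points))                         # 0.9375 for this data
print(auc([(0.0, 0.0), (1.0, 1.0)]))       # random-guess diagonal: 0.5
```

Only one man in this data (1.68 m) is shorter than a woman (1.70 m), so the curve stays close to the upper left corner and the AUC is close to 1.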

 

I am still a beginner, so if anything here is wrong, I hope any experts passing by will point it out.

Origin www.cnblogs.com/jameschou/p/10989908.html