Classification
There are three classification methods covered in this chapter:
- decision tree classification
- naive Bayes classification
- IF-THEN rule classification
Three attribute selection measures for decision trees are detailed here: information gain, gain ratio, and the Gini index.
information gain
Let D be the training set of class-labeled tuples.
The entropy of D, i.e. the expected information needed to classify a tuple in D, is:
Info(D) = -Σ_{i=1}^{m} p_i log2(p_i)
where p_i is the probability that a tuple in D belongs to class C_i. If D is partitioned on attribute A into v subsets D_1, …, D_v, the expected information required after the split is:
Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × Info(D_j)
The information gain of attribute A is:
Gain(A) = Info(D) - Info_A(D)
ID3 uses information gain as its attribute selection measure. The information gain measure is biased toward tests with many outcomes, i.e. toward attributes with a large number of values.
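The entropy and gain formulas above can be sketched in a few lines of Python; the toy data and function names here are my own illustration, not code from the chapter:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Info(D): entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr_index):
    """Gain(A) = Info(D) - Info_A(D) for a categorical attribute."""
    n = len(labels)
    parts = {}
    # Partition the class labels by the value of attribute A
    for row, y in zip(rows, labels):
        parts.setdefault(row[attr_index], []).append(y)
    expected = sum(len(p) / n * entropy(p) for p in parts.values())
    return entropy(labels) - expected

# Toy data: one attribute ("weather") and a yes/no class label
rows = [("sunny",), ("sunny",), ("rain",), ("rain",)]
labels = ["no", "no", "yes", "yes"]
print(info_gain(rows, labels, 0))  # attribute separates the classes perfectly -> 1.0
```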
Gain ratio
However, if an attribute (such as a product ID) takes a distinct value for each tuple, splitting on it yields pure one-tuple partitions and a maximal information gain, yet the split provides no useful information for classification.
Partitioning D on attribute A yields the split information:
SplitInfo_A(D) = -Σ_{j=1}^{v} (|D_j| / |D|) log2(|D_j| / |D|)
The gain ratio is:
GainRatio(A) = Gain(A) / SplitInfo_A(D)
C4.5 uses the gain ratio. The gain ratio was introduced to overcome the bias of information gain, but it in turn tends to prefer unbalanced splits in which one partition is much smaller than the others.
Note that the gain ratio simply normalizes the information gain by the split information of the same partition.
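A minimal sketch of the split-information penalty, showing how a many-valued attribute is punished (the numbers are my own toy values, assuming a gain of 1 bit in both cases):

```python
from math import log2

def split_info(sizes):
    """SplitInfo_A(D) from the sizes of the partitions induced by A."""
    n = sum(sizes)
    return -sum((s / n) * log2(s / n) for s in sizes)

def gain_ratio(gain, sizes):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D)."""
    return gain / split_info(sizes)

# A split into two equal halves has SplitInfo = 1 bit
print(gain_ratio(1.0, [2, 2]))        # 1.0
# Four singleton partitions have SplitInfo = 2 bits, so the ratio is halved
print(gain_ratio(1.0, [1, 1, 1, 1]))  # 0.5
```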
Gini Index
The Gini index is defined as:
Gini(D) = 1 - Σ_{i=1}^{m} p_i^2
For a binary split of D on attribute A into D_1 and D_2, the Gini index of the split is:
Gini_A(D) = (|D_1| / |D|) Gini(D_1) + (|D_2| / |D|) Gini(D_2)
The reduction in impurity offered by attribute A is:
ΔGini(A) = Gini(D) - Gini_A(D)
CART uses the Gini index.
The Gini index measures the impurity of a data partition or training set D. The attribute that minimizes Gini_A(D), i.e. maximizes the reduction in impurity, is selected as the splitting attribute. The Gini index is biased toward multi-valued attributes, becomes computationally expensive when the number of classes is large, and tends to favor partitions of equal size and purity.
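The Gini formulas can be sketched as follows; the two-class toy data is my own illustration:

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum of p_i^2 over the classes in D."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(part1, part2):
    """Gini_A(D) for a binary split of D into part1 and part2."""
    n = len(part1) + len(part2)
    return len(part1) / n * gini(part1) + len(part2) / n * gini(part2)

labels = ["yes", "yes", "no", "no"]
print(gini(labels))                              # 0.5
print(gini_split(["yes", "yes"], ["no", "no"]))  # pure split -> 0.0
# Reduction in impurity for this split: 0.5 - 0.0 = 0.5
```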
Naive Bayes Classification
First, the basic idea behind Bayes' theorem: classification amounts to predicting, for a tuple X, the class with the highest posterior probability, where
P(H|X) = P(X|H) P(H) / P(X)
Posterior probability
P(H|X) is the posterior probability of H conditioned on X. Suppose the data tuples describe customers by the attributes age and income, and X is a 25-year-old customer with an income of 40,000 yuan. Let H be the hypothesis that the customer will buy a computer.
Then P(H|X) is the probability that the customer buys a computer given that we know the customer's attribute values.
Prior probability
P(H) is the prior probability of H: the probability that any given customer will buy a computer, regardless of age, income, or any other attribute. P(H) is independent of X.
What makes naive Bayes "naive"?
Naive Bayes assumes that the effect of an attribute value on a given class is independent of the values of the other attributes. This assumption is called class-conditional independence.
It simplifies the computation of P(X|H) to a product of per-attribute conditionals:
P(X|H) = Π_{k=1}^{n} P(x_k|H)
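Under this assumption, an unnormalized posterior score per class is just the prior times the product of the per-attribute conditionals. A minimal sketch with made-up probability estimates for the buy/don't-buy example (all numbers are hypothetical):

```python
def naive_posterior_score(prior, cond_probs):
    """Unnormalized posterior: P(H) * product of P(x_k|H)."""
    score = prior
    for p in cond_probs:
        score *= p
    return score

# Hypothetical estimates for the two classes given X = (age=25, income=40k)
buys    = naive_posterior_score(0.6, [0.5, 0.4])  # 0.6 * 0.5 * 0.4 = 0.12
no_buys = naive_posterior_score(0.4, [0.2, 0.3])  # 0.4 * 0.2 * 0.3 = 0.024
print("buys" if buys > no_buys else "no_buys")    # predicts "buys"
```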
What should be done when a zero probability value occurs?
Laplacian correction: add 1 to each of the q counts and add q to the corresponding denominator.
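The correction can be sketched in one small function; the counts below are my own toy example:

```python
def laplace_estimate(counts):
    """Laplacian correction: add 1 to each of the q counts
    and add q to the shared denominator."""
    q = len(counts)
    total = sum(counts) + q
    return [(c + 1) / total for c in counts]

# Without the correction, the zero count would make the whole product P(X|H) zero
print(laplace_estimate([0, 990, 10]))  # [1/1003, 991/1003, 11/1003]
```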
IF-THEN rule classification
Model Evaluation and Selection
Metrics for evaluating classifier performance
The confusion matrix is used to evaluate the quality of a classifier; for binary classification it records the numbers of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).
The evaluation measures include: accuracy, sensitivity (recall), specificity, precision, F1, and Fβ.
When the main class of interest is a minority, relying on accuracy alone can be deceptive. For example, if only 3% of tuples belong to the positive class, a classifier that labels every tuple negative still achieves 97% accuracy while recognizing none of the positives.
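These measures follow directly from the four confusion-matrix counts; a minimal sketch with made-up counts for an imbalanced data set:

```python
def metrics(tp, tn, fp, fn):
    """Confusion-matrix measures for a binary classifier."""
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # recall / true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    precision   = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, sensitivity, specificity, precision, f1

# Imbalanced data: only 3 of 100 tuples are positive. A classifier that finds
# just 1 of them still scores 98% accuracy but only 33% sensitivity.
acc, sens, spec, prec, f1 = metrics(tp=1, tn=97, fp=0, fn=2)
print(acc, sens)  # 0.98, then roughly 0.333
```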
Data set partitioning
- Holdout
- Random subsampling
- Cross-validation (k-fold)
- Bootstrap
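In k-fold cross-validation, each fold serves as the test set exactly once while the remaining folds form the training set. A minimal index-level sketch (function name is my own):

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k roughly equal folds; yield one
    (train, test) pair per fold."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        yield train, test

for train, test in kfold_indices(6, 3):
    print(sorted(test))  # [0, 3], then [1, 4], then [2, 5]
# Every index appears in exactly one test fold.
```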
Significance test and ROC, AUC curves
The significance test is used to assess whether the difference in the accuracy of the two classifiers is due to chance. (This is useful) The
ROC curve plots the true and false positive rates for one or more classifiers.
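A minimal sketch of building ROC points from ranked scores and computing the AUC by the trapezoidal rule (toy scores and labels are my own):

```python
def roc_points(scores, labels):
    """(FPR, TPR) points obtained by sweeping the score threshold,
    highest scores first."""
    pos = sum(labels)
    neg = len(labels) - pos
    pts, tp, fp = [(0.0, 0.0)], 0, 0
    for _, y in sorted(zip(scores, labels), reverse=True):
        if y:
            tp += 1
        else:
            fp += 1
        pts.append((fp / neg, tp / pos))
    return pts

def auc(pts):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

# All positives ranked above all negatives -> perfect ranking
pts = roc_points([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
print(auc(pts))  # 1.0
```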
Improving Model Accuracy: Ensemble Methods
- Bagging (sampling with replacement)
- Boosting (weighted tuples)
- Random forest (an ensemble of decision trees)
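The core of bagging is drawing bootstrap samples with replacement, training one model per sample, and taking a majority vote. A minimal sketch with a stand-in weak learner (all names and the toy learner are my own, not from the chapter):

```python
import random
from collections import Counter

def bagging_predict(train, learn, predict, x, k=25, seed=0):
    """Train k models on bootstrap samples (drawn with replacement)
    and return the majority vote for x."""
    rng = random.Random(seed)
    votes = []
    for _ in range(k):
        sample = [rng.choice(train) for _ in train]  # bootstrap sample
        votes.append(predict(learn(sample), x))
    return Counter(votes).most_common(1)[0][0]

# Hypothetical weak learner: always predicts the majority class of its sample
learn = lambda sample: Counter(y for _, y in sample).most_common(1)[0][0]
predict = lambda model, x: model

train = [(i, "yes") for i in range(7)] + [(7, "no")]
print(bagging_predict(train, learn, predict, x=8))  # majority vote of 25 models
```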
Class imbalance problem
- Oversampling and undersampling
- Threshold moving
- Ensemble techniques
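Random oversampling duplicates minority-class tuples (sampling with replacement) until the classes are balanced. A minimal sketch, with my own toy data:

```python
import random

def random_oversample(data, seed=0):
    """Duplicate minority-class tuples at random until every class
    has as many tuples as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in data:
        by_class.setdefault(y, []).append((x, y))
    target = max(len(v) for v in by_class.values())
    out = []
    for tuples in by_class.values():
        out.extend(tuples)
        out.extend(rng.choice(tuples) for _ in range(target - len(tuples)))
    return out

data = [(1, "pos"), (2, "neg"), (3, "neg"), (4, "neg")]
balanced = random_oversample(data)
print(len(balanced))  # 6: both classes now have 3 tuples
```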