"Data Mining Concepts and Techniques" Chapter 8 Classification: Basic Concepts

Classification

There are three classification methods covered in this chapter: decision tree classification, naive Bayes classification, and rule-based (IF-THEN rule) classification.

Decision Tree Classification

Three attribute selection measures for decision trees are detailed in this section: information gain, gain ratio, and the Gini index.

Information Gain

Let D, a data partition, be the training set of class-labeled tuples. The entropy of D, that is, the expected information needed to classify a tuple in D, is:

Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)

where p_i is the probability that a tuple in D belongs to class C_i.

If D is partitioned on attribute A into v subsets D_1, ..., D_v, the expected information still needed to classify a tuple is:

Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)
The information gain of attribute A is:

Gain(A) = Info(D) - Info_A(D)
ID3 uses information gain as its attribute selection measure.
The information gain measure is biased toward tests with many outcomes, i.e., toward attributes with a large number of distinct values.
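As a minimal sketch (not the book's code), information gain for a categorical attribute could be computed as follows; the toy attribute, values, and labels are made up for illustration:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Info(D): expected information needed to classify a tuple in D."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def info_gain(rows, attr, labels):
    """Gain(attr) = Info(D) - Info_attr(D) for a categorical attribute."""
    total = len(labels)
    groups = {}                      # attribute value -> class labels of that subset
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    info_a = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - info_a

# Hypothetical toy data: does the customer buy a computer?
rows = [{"age": "youth"}, {"age": "youth"}, {"age": "senior"}, {"age": "senior"}]
labels = ["no", "no", "yes", "no"]
print(info_gain(rows, "age", labels))   # ~0.311 bits
```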

Gain Ratio

Information gain breaks down in the extreme case: if a split (for example, on a unique identifier) leaves only one tuple in each partition, every partition is pure, so the gain is maximal, yet the split provides no useful information for classification.
C4.5 normalizes the information gain using the split information of attribute A:

SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2 \frac{|D_j|}{|D|}

The gain ratio is then:

GainRatio(A) = \frac{Gain(A)}{SplitInfo_A(D)}
C4.5 uses the gain ratio as its attribute selection measure.
The gain ratio was introduced to overcome the bias of information gain toward multi-valued attributes: it normalizes the gain by the information generated by the split itself. However, the gain ratio tends to prefer unbalanced splits in which one partition is much smaller than the others.
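Continuing the sketch above (and reusing its info_gain helper), the gain ratio could be computed as follows:

```python
from collections import Counter
from math import log2

def gain_ratio(rows, attr, labels):
    """GainRatio(attr) = Gain(attr) / SplitInfo_attr(D).

    Reuses info_gain() from the information-gain sketch above."""
    total = len(labels)
    counts = Counter(row[attr] for row in rows)
    split_info = -sum((n / total) * log2(n / total) for n in counts.values())
    if split_info == 0:      # the attribute has a single value; the split tells us nothing
        return 0.0
    return info_gain(rows, attr, labels) / split_info
```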

Gini Index

The Gini index of a data partition D is defined as:

Gini(D) = 1 - \sum_{i=1}^{m} p_i^2

For a binary split of D on attribute A into partitions D_1 and D_2, the Gini index of the split is:

Gini_A(D) = \frac{|D_1|}{|D|} Gini(D_1) + \frac{|D_2|}{|D|} Gini(D_2)

The reduction in impurity offered by attribute A is:

\Delta Gini(A) = Gini(D) - Gini_A(D)
CART uses the Gini index.
The Gini index measures the impurity of a data partition or set D of training tuples. The attribute that minimizes Gini_A(D), i.e., the one that maximizes the reduction in impurity, is selected as the splitting attribute.
Like information gain, the Gini index is biased toward multi-valued attributes, and it becomes computationally expensive when the number of classes is large. It also tends to favor tests that result in equal-sized, high-purity partitions.
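A minimal sketch of the Gini computations above, using made-up labels for a hypothetical binary split:

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum of squared class probabilities."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def gini_split(left_labels, right_labels):
    """Gini_A(D) for a binary split of D into two partitions."""
    total = len(left_labels) + len(right_labels)
    return (len(left_labels) / total) * gini(left_labels) \
         + (len(right_labels) / total) * gini(right_labels)

# Toy example: the impurity reduction Delta Gini(A) of a hypothetical split.
labels = ["yes", "yes", "no", "no", "no"]
left, right = ["yes", "yes"], ["no", "no", "no"]
print(gini(labels) - gini_split(left, right))   # 0.48
```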

Naive Bayes Classification

First, the basic concept behind the classifier, Bayes' theorem:

P(H|X) = \frac{P(X|H) P(H)}{P(X)}

The naive Bayes classifier predicts the class of X to be the class with the highest posterior probability.

Posterior probability

P(H|X) is the posterior probability of H conditioned on X. Suppose the data tuples describe customers by the attributes age and income, and X is a 25-year-old customer with an income of 40,000. Let H be the hypothesis that the customer will buy a computer.
Then P(H|X) is the probability that the customer buys a computer given that we know the customer's age and income.

Prior probability

P(H) is the prior probability of H: the probability that any given customer will buy a computer, regardless of age, income, or any other attribute. P(H) is independent of X.

Where is Naive Bayes Naive?

Naive Bayes assumes that the effect of an attribute value on a given class is independent of the values of the other attributes. This assumption is called class-conditional independence.
It simplifies the computation of P(X|H) to a product of per-attribute probabilities: P(X|H) = \prod_{k=1}^{n} P(x_k|H).
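A minimal sketch of a categorical naive Bayes classifier built on this independence assumption; the data structures and names are made up, and zero counts are left unhandled until the Laplacian correction below:

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Estimate P(H) and the per-attribute counts needed for P(x_k|H)."""
    priors = Counter(labels)                      # class -> count
    cond = defaultdict(Counter)                   # (class, attribute) -> value counts
    for row, label in zip(rows, labels):
        for attr, value in row.items():
            cond[(label, attr)][value] += 1
    return priors, cond, len(labels)

def posterior_scores(x, priors, cond, n):
    """Score each class by P(H) * prod_k P(x_k|H); the largest score wins."""
    scores = {}
    for label, count in priors.items():
        score = count / n
        for attr, value in x.items():
            score *= cond[(label, attr)][value] / count   # zero if the value was never seen
        scores[label] = score
    return scores
```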

What should I do when there are zero probability values?

Laplacian correction (Laplace estimator): for an attribute with q distinct values, add 1 to each of the q counts and add q to the corresponding denominator, so that no conditional probability estimate is ever zero.
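Applied to the sketch above, the correction could replace the raw frequency estimate with the following (value_counts is assumed to be the Counter of an attribute's values within one class, and q the number of distinct values the attribute can take):

```python
def smoothed_prob(value_counts, value, q):
    """Laplacian-corrected estimate of P(x_k = value | H)."""
    total = sum(value_counts.values())
    return (value_counts[value] + 1) / (total + q)   # never zero, even for unseen values
```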

IF-THEN rule classification

Model Evaluation and Selection

Metrics for evaluating classifier performance

The confusion matrix is used to evaluate the quality of a classifier. For a binary classification problem it records the true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN):

              Predicted: yes    Predicted: no
Actual: yes   TP                FN
Actual: no    FP                TN
The evaluation measures include accuracy, sensitivity (recall), specificity, precision, F_1, and F_\beta:

accuracy = \frac{TP + TN}{P + N}
sensitivity (recall) = \frac{TP}{P}
specificity = \frac{TN}{N}
precision = \frac{TP}{TP + FP}
F_1 = \frac{2 \times precision \times recall}{precision + recall}
F_\beta = \frac{(1 + \beta^2) \times precision \times recall}{\beta^2 \times precision + recall}

where P = TP + FN is the number of positive tuples and N = FP + TN is the number of negative tuples.

When the main class of interest is rare, relying on the accuracy measure alone can be misleading. For example, if only 3% of the tuples belong to the positive class (say, cancer = yes), a classifier that always predicts the negative class is 97% accurate while detecting no positive case at all.
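A minimal sketch of the measures above, computed from confusion-matrix counts (the counts in the usage line are made up):

```python
def classifier_metrics(tp, fn, fp, tn, beta=1.0):
    """Evaluation measures derived from the confusion-matrix counts."""
    p, n = tp + fn, fp + tn                      # actual positives / negatives
    precision = tp / (tp + fp)
    recall = tp / p                              # sensitivity
    return {
        "accuracy": (tp + tn) / (p + n),
        "sensitivity": recall,
        "specificity": tn / n,
        "precision": precision,
        "F1": 2 * precision * recall / (precision + recall),
        "F_beta": (1 + beta**2) * precision * recall / (beta**2 * precision + recall),
    }

print(classifier_metrics(tp=70, fn=30, fp=50, tn=850))
```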

Data set partitioning

  • Holdout
  • Random subsampling
  • Cross-validation (k-fold)
  • Bootstrap
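As an illustration of one of these, a minimal k-fold cross-validation sketch; train_fn and predict_fn stand for a hypothetical classifier's training and prediction functions:

```python
import random

def k_fold_indices(n, k, seed=0):
    """Split n tuple indices into k roughly equal, mutually exclusive folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(rows, labels, k, train_fn, predict_fn):
    """k-fold cross-validation: every fold serves as the test set exactly once."""
    folds = k_fold_indices(len(rows), k)
    accuracies = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        model = train_fn([rows[j] for j in train_idx], [labels[j] for j in train_idx])
        correct = sum(predict_fn(model, rows[j]) == labels[j] for j in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k       # average accuracy over the k folds
```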

Significance tests and ROC/AUC curves

A significance test is used to assess whether the difference in accuracy between two classifiers is statistically significant or merely due to chance.
The ROC curve plots the true positive rate against the false positive rate for one or more classifiers as the decision threshold varies; the area under the curve (AUC) summarizes the curve in a single number.
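A minimal sketch of how ROC points and AUC could be computed by sweeping a threshold over classifier scores (ties in scores are handled naively here):

```python
def roc_points(scores, labels):
    """ROC points (FPR, TPR) obtained by lowering the decision threshold step by step.

    scores: classifier scores for the positive class; labels: 1 = positive, 0 = negative."""
    pairs = sorted(zip(scores, labels), reverse=True)    # most confident first
    p = sum(labels)
    n = len(labels) - p
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in pairs:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n, tp / p))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

print(auc(roc_points([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])))   # 0.75 on this toy data
```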

Improving Model Accuracy: Ensemble Methods

  • Bagging (sampling with replacement)
  • Boosting (reweights training tuples, e.g., AdaBoost)
  • Random forest (an ensemble of decision trees)
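A minimal bagging sketch; train_fn and predict_fn again stand for a hypothetical base classifier:

```python
import random
from collections import Counter

def bagging_predict(rows, labels, x, k, train_fn, predict_fn, seed=0):
    """Bagging: train k classifiers on bootstrap samples and take a majority vote."""
    rng = random.Random(seed)
    votes = Counter()
    n = len(rows)
    for _ in range(k):
        idx = [rng.randrange(n) for _ in range(n)]   # sample n tuples with replacement
        model = train_fn([rows[j] for j in idx], [labels[j] for j in idx])
        votes[predict_fn(model, x)] += 1
    return votes.most_common(1)[0][0]                # majority vote
```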

Class imbalance problem

  • Oversampling and undersampling
  • Threshold moving
  • Ensemble techniques
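A minimal random-oversampling sketch (one of the techniques listed above); the function name and label handling are illustrative only:

```python
import random

def oversample_minority(rows, labels, minority_label, seed=0):
    """Random oversampling: duplicate minority-class tuples (sampled with
    replacement) until the two classes are the same size."""
    rng = random.Random(seed)
    minority = [(r, l) for r, l in zip(rows, labels) if l == minority_label]
    majority = [(r, l) for r, l in zip(rows, labels) if l != minority_label]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    combined = majority + minority + extra
    rng.shuffle(combined)
    new_rows, new_labels = zip(*combined)
    return list(new_rows), list(new_labels)
```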
