Data Mining (4.1)--Classification and Prediction

Table of contents

Foreword

1. Classification and prediction

Classification

Prediction

2. Issues in classification and prediction

Preparing data for classification and prediction

Evaluating classification and prediction methods

Confusion matrix

Evaluating accuracy

References


Foreword

Classification: discrete class labels; assigns new data to known categories

Prediction (forecasting): continuous values; predicts unknown values

Descriptive attributes: continuous or discrete

Class attribute: discrete

Supervised learning:

Classification

The training samples are labeled

Classifies previously unseen data

Unsupervised learning:

Clustering

The training samples are unlabeled

Partitions the data into clusters

1. Classification and prediction

Classification

The classification process has two steps. The first step is the model-building phase, also called the training phase; its purpose is to build a classifier that describes a predetermined set of data classes or concepts. In this step, a classification algorithm analyzes the existing data (the training set) to construct the classifier. The training set consists of data tuples, each of which is assumed to belong to a pre-specified class (determined by the class label attribute).

In the second step, the classifier obtained in the first step is used to classify new data, and its predictive accuracy is evaluated. Specifically, a test set, consisting of a set of test tuples and their associated class labels, is used; the test set is independent of the training set.
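The two-step process described above can be sketched in Python with a toy 1-nearest-neighbour classifier; the data, function names, and choice of classifier here are illustrative assumptions, not part of the original text:

```python
# A minimal sketch of the two-step classification process:
# step 1 trains a model on labeled tuples, step 2 evaluates it on
# an independent test set. The "model" is a toy 1-nearest-neighbour
# classifier over one numeric attribute (all data is made up).

def train_1nn(training_set):
    """Step 1 (training): the model is simply the stored labeled tuples."""
    return list(training_set)

def classify(model, x):
    """Assign the label of the closest training tuple."""
    return min(model, key=lambda pair: abs(pair[0] - x))[1]

def accuracy(model, test_set):
    """Step 2 (evaluation): fraction of test tuples classified correctly."""
    hits = sum(1 for x, label in test_set if classify(model, x) == label)
    return hits / len(test_set)

# Labeled training tuples: (attribute value, class label)
train = [(1.0, "low"), (2.0, "low"), (8.0, "high"), (9.0, "high")]
test = [(1.5, "low"), (8.5, "high"), (3.0, "low")]

model = train_1nn(train)
print(accuracy(model, test))  # 1.0 on this toy data
```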

In machine learning, classification is also often referred to as supervised learning. "Supervised" means that the class labels of the training tuples are known, and new data is classified based on the training set. Its counterpart is clustering, known in machine learning as unsupervised learning. "Unsupervised" means that the class labels of the training tuples are unknown; this kind of learning aims to discover classes or clusters in the data.

Prediction

Data prediction is also a two-step process. Unlike classification, the attribute values to be predicted are continuous and ordered, whereas in classification they are discrete and unordered. A predictor is similar to a classifier and can likewise be viewed as a mapping or function y = f(x), where x is an input tuple and the output y is a continuous or ordered value. As with classification, the test and training sets should be independent in prediction tasks. Prediction accuracy is evaluated using the difference between the predicted value and the actual known value of y for each test tuple.
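A sketch of prediction as a function y = f(x): fit a simple least-squares line on toy training data, then evaluate it on an independent test set by the difference between predicted and actual values. The data and the choice of a linear model are illustrative assumptions; the text does not prescribe a particular predictor:

```python
# Step 1 (training): fit y = a*x + b by ordinary least squares.
# Step 2 (evaluation): mean absolute difference between predicted
# and actual y over the independent test tuples.

def fit_line(points):
    """Ordinary least squares for y = a*x + b."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    a = (sum((x - mx) * (y - my) for x, y in points)
         / sum((x - mx) ** 2 for x, _ in points))
    return a, my - a * mx

def mean_absolute_error(model, test_points):
    a, b = model
    return sum(abs((a * x + b) - y) for x, y in test_points) / len(test_points)

train = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 8.0)]
test = [(5, 10.1), (6, 11.8)]

model = fit_line(train)
print(mean_absolute_error(model, test))
```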

2. Issues in classification and prediction

Preparing data for classification and prediction

The data used for classification and prediction is first preprocessed; preprocessing generally comprises three steps:
(1) Data cleaning. The main purpose is to reduce noise and handle missing values.

Although most classification algorithms have some mechanism for dealing with noise and missing values, this step helps reduce confusion during learning.
(2) Relevance analysis. The purpose is to remove irrelevant or redundant attributes from the data.

This speeds up classifier training and can improve classifier accuracy.
(3) Data transformation. The purpose is to generalize or normalize the data.

Normalization prevents attributes with different initial value ranges from unduly influencing distance-based measurements.
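The data-transformation step can be sketched with min-max normalization, one common way to rescale each attribute to [0, 1] so that attributes with large value ranges do not dominate distance measurements. The data is illustrative:

```python
# Min-max normalization: rescale each attribute value v to
# (v - min) / (max - min), mapping the attribute's range onto [0, 1].

def min_max_normalize(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = [20, 30, 40, 60]          # small initial range
incomes = [20000, 50000, 80000]  # large initial range

print(min_max_normalize(ages))     # [0.0, 0.25, 0.5, 1.0]
print(min_max_normalize(incomes))  # [0.0, 0.5, 1.0]
```

After normalization, both attributes contribute on the same scale to a distance computation, rather than income dwarfing age.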

Evaluating classification and prediction methods

(1) Accuracy rate.

Classification accuracy refers to the ability of a classifier to correctly predict the class label of new or previously unseen data tuples. The accuracy of a predictor refers to how well it estimates the value of the predicted attribute for new or previously unseen data tuples.
(2) Speed.

Refers to the time overhead of building a model (training) and using the model (classification/prediction).
(3) Robustness.

Refers to the ability of a classifier or predictor to handle noisy or missing value data.
(4) Scalability.

Refers to the ability of a classifier or predictor to handle large-scale data efficiently.
(5) Interpretability.

Refers to the degree of understandability and insight provided by a classifier or predictor.

The accuracy and error rate of a classifier or predictor on a test set are two commonly used metrics. The accuracy on the test set is the proportion of test-set tuples that are correctly classified or predicted; conversely, the error rate is the proportion of test-set tuples that are misclassified or mispredicted.

Confusion matrix

A useful tool for analyzing how well a classifier recognizes tuples of different classes.

True positives (TP) are positive tuples correctly labeled by the classifier.

True negatives (TN) are negative tuples correctly labeled by the classifier.

False positives (FP) are negative tuples mislabeled as positive.

False negatives (FN) are positive tuples mislabeled as negative.

Accuracy:

\frac{TP+TN}{TP+FN+FP+TN}

Precision:

\frac{TP}{TP+FP}
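The two formulas above can be computed directly from the raw confusion-matrix counts; the counts below are illustrative, not from the text:

```python
# Accuracy and precision from TP/TN/FP/FN confusion-matrix counts.

def accuracy(tp, tn, fp, fn):
    """Proportion of all tuples (positive and negative) labeled correctly."""
    return (tp + tn) / (tp + fn + fp + tn)

def precision(tp, fp):
    """Proportion of tuples labeled positive that really are positive."""
    return tp / (tp + fp)

TP, TN, FP, FN = 90, 85, 15, 10
print(accuracy(TP, TN, FP, FN))  # (90 + 85) / 200 = 0.875
print(precision(TP, FP))         # 90 / 105
```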

Evaluating accuracy

Holdout, random subsampling, and cross-validation are commonly used techniques for evaluating accuracy based on randomly sampled partitions of the given data. Using these techniques increases the overall computational overhead but benefits model selection.

The holdout method is the default method when accuracy is discussed in general. This approach splits the given data into two independent sets: a training set and a test set. Typically, 2/3 of the data is used as the training set and 1/3 as the test set. The training set is used to build the model, and accuracy is evaluated on the test set.
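The holdout split can be sketched as follows; the data, the fixed random seed, and the function name are illustrative assumptions:

```python
# Holdout method: randomly shuffle the data, then keep 2/3 for
# training and the remaining 1/3 for testing.
import random

def holdout_split(data, seed=0):
    items = list(data)
    random.Random(seed).shuffle(items)  # fixed seed for reproducibility
    cut = len(items) * 2 // 3           # 2/3 training, 1/3 testing
    return items[:cut], items[cut:]

train, test = holdout_split(range(9))
print(len(train), len(test))  # 6 3
```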

Random subsampling is a simple variant of the holdout method: the holdout method is repeated k times, and the overall accuracy estimate is the average of the accuracies over the k iterations.

In k-fold cross-validation, the initial data is randomly divided into k mutually disjoint subsets S1, S2, ..., Sk of approximately equal size. Training and testing are performed k times; at the i-th iteration, subset Si serves as the test set and the remaining subsets are used to train the model.
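The k-fold partitioning scheme can be sketched as follows; this shows only the splitting into disjoint train/test sets per fold, with illustrative data (a real run would shuffle the data first and train a model on each fold):

```python
# k-fold cross-validation splits: partition the data into k disjoint,
# near-equal subsets; on iteration i, subset S_i is the test set and
# the union of the other k-1 subsets is the training set.

def k_fold_splits(data, k):
    folds = [data[i::k] for i in range(k)]  # k disjoint, near-equal subsets
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

data = list(range(10))
for train, test in k_fold_splits(data, 5):
    print(len(train), len(test))  # 8 2 on each of the 5 folds
```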

References

"Data Mining: Methods and Applications" by Xu Hua

Origin blog.csdn.net/weixin_53197693/article/details/130201595