2. Automatic text classification

We now have an idea of the definition and scope of text classification. When we speak of a "text classification system", we mean one that assigns text documents to the classes or categories they represent; text classification has also been defined formally, both conceptually and mathematically. If a few people browse through the documents and categorize each one by hand, they are themselves part of the document classification system we are talking about. However, once the number of documents exceeds a million and they need to be sorted quickly, this manual approach does not scale. To make document classification faster and more efficient, we need to automate the task, which brings us to automatic text classification.

To automate text classification, we can draw on several machine learning techniques and concepts. Two families of techniques are mainly relevant to this problem:

  • Supervised machine learning.
  • Unsupervised machine learning.

There are also other families of machine learning algorithms, such as reinforcement learning and semi-supervised learning. Next, we look at supervised and unsupervised learning in more detail, from a machine learning perspective, and at how they can be used to classify text documents.

Unsupervised learning refers to machine learning techniques or algorithms that build a model without requiring labeled training samples in advance. Typically there is a set of data points, which may be text or numeric depending on the problem to be solved. Through a process called "feature extraction", we extract features from each data point and feed the resulting feature set to the algorithm. The algorithm then tries to extract meaningful patterns from the data, for example by grouping similar data points with clustering, or by summarizing documents with topic models. Applied to text classification, this is also known as document clustering: documents are grouped purely on the basis of their features and similarity, without training any model on labeled data. Later sections discuss unsupervised learning further, including topic modeling, document summarization, similarity analysis, and clustering.
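The idea of grouping documents by their features alone can be sketched in a few lines of plain Python. This is only a minimal illustration, not a full clustering algorithm: it assumes bag-of-words counts as the features, cosine similarity as the similarity measure, and a simple greedy grouping rule. The stop-word list, the threshold value, and the sample documents are all made up for the example.

```python
import math
from collections import Counter

# Hypothetical, tiny stop-word list used only for this example.
STOP_WORDS = {"the", "on"}


def extract_features(text):
    """Bag-of-words feature extraction: map each token to its count."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return Counter(tokens)


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_documents(docs, threshold=0.5):
    """Greedy grouping: add each document to the first group whose first
    (seed) document is similar enough, otherwise start a new group.
    No labels are used at any point."""
    features = [extract_features(d) for d in docs]
    groups = []  # each group is a list of document indices
    for i, feat in enumerate(features):
        for group in groups:
            if cosine_similarity(feat, features[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups


docs = [
    "the team won the football match",
    "the football team lost the match",
    "stock prices fell on the market",
    "the market saw stock prices rise",
]
groups = group_documents(docs)
print(groups)
```

A real system would more likely use weighted features such as TF-IDF and an established clustering algorithm; the greedy rule here is chosen only because it fits in a short example.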


Supervised learning refers to machine learning techniques or algorithms that are trained on pre-labeled data samples (also called training data). Features or attributes are extracted from the data using feature extraction, and each data point has its own feature set and a corresponding class/label. From the training data, the algorithm learns the distinct patterns of each category. When learning is complete, the result is a trained model. Once the features of a future data sample are fed into this model, it can predict the class of that sample. In this way the machine learns, from the training samples, how to predict the class of new, previously unseen data samples.

Currently, there are two main types of supervised learning algorithms.

Classification: when the predicted output is a discrete category, the supervised learning process is called classification, so the output variable in this case is a categorical variable. Examples include news categorization and movie classification.

Regression: when the output we want to predict is a continuous numeric variable, the supervised learning algorithm is called a regression algorithm. Examples include predicting housing prices or a person's weight.

Here we are concerned with the classification problem: we are trying to assign text documents to discrete categories or classes.

We are now ready to define automatic, machine learning based text classification mathematically. Suppose we have a collection of documents, each labeled with a corresponding class or category. This collection can be denoted by $TS$, a set of (document, label) pairs, $TS = \{(D_1, C_1), (D_2, C_2), \dots, (D_n, C_n)\}$, where $D_1, D_2, \dots, D_n$ is the list of text documents and $C_1, C_2, \dots, C_n$ are their corresponding labels. Each $C_x \in C$, where $C_x$ denotes the class of document $x$ and $C$ denotes the set of all possible discrete categories, any one or more of which may be the class of a document. Assuming we have such a training data set, we can define a supervised learning algorithm $F$ such that, when it is trained on $TS$, we obtain a trained classifier $Y$; this can be expressed as $F(TS) = Y$. Thus the supervised learning algorithm $F$ takes the set of (document, class) pairs $TS$ as input and yields the trained classifier $Y$, which is our model. This process is called the training process.

Given a new, previously unseen document $ND$, this model can predict its class $C_{ND}$ such that $C_{ND} \in C$. This is called the prediction process and can be expressed as $Y: ND \to C_{ND}$. Supervised text classification therefore consists of two main processes:

  • Training.
  • Prediction.
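The two processes can be sketched in plain Python. The text does not name a specific learning algorithm, so this example swaps in a small multinomial naive Bayes with add-one smoothing purely for illustration. The function `train` plays the role of $F$, the function it returns plays the role of $Y$, and the training documents and the labels "sports" and "finance" are invented for the example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Very simple feature extraction: lowercase whitespace tokens."""
    return text.lower().split()


def train(training_set):
    """F: take the (document, label) pairs TS and return a classifier Y."""
    class_doc_counts = Counter()        # number of documents per class
    word_counts = defaultdict(Counter)  # token counts per class
    vocabulary = set()
    for document, label in training_set:
        class_doc_counts[label] += 1
        for token in tokenize(document):
            word_counts[label][token] += 1
            vocabulary.add(token)
    total_docs = sum(class_doc_counts.values())

    def classifier(new_document):
        """Y: map a new document ND to its predicted class C_ND."""
        best_label, best_score = None, float("-inf")
        for label, doc_count in class_doc_counts.items():
            # log prior of the class
            score = math.log(doc_count / total_docs)
            total_words = sum(word_counts[label].values())
            for token in tokenize(new_document):
                if token not in vocabulary:
                    continue  # ignore tokens never seen in training
                # log likelihood with add-one (Laplace) smoothing
                score += math.log(
                    (word_counts[label][token] + 1)
                    / (total_words + len(vocabulary))
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    return classifier


# Training process: F(TS) = Y
TS = [
    ("the team won the match", "sports"),
    ("a great goal in the football match", "sports"),
    ("stock prices fell sharply", "finance"),
    ("the bank raised interest rates", "finance"),
]
Y = train(TS)

# Prediction process: Y maps a new document ND to a class in C
print(Y("football match tonight"))
print(Y("interest rates and stock prices"))
```

Note that the labels in `TS` had to be supplied by hand, while prediction on new documents needs no further manual input.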

A key point to remember is that supervised text classification requires some manually labeled training data. So even though we are talking about automatic text classification, some manual work is still needed to start the automated process. The payoff is substantial, though: once the model is trained, it can keep predicting and classifying new documents with little effort or human supervision.

The following sections discuss various learning methods or algorithms. They are not specific to text data: they are general machine learning algorithms that can be applied to many types of data after preprocessing and feature extraction. We will cover several supervised machine learning algorithms and use them to solve real text classification problems. These algorithms are usually trained on a training data set, optionally together with a validation data set, so that the trained model does not overfit the training data. Overfitting basically means that the model fails to generalize and cannot predict well for new documents. The model's internal parameters are often tuned by evaluating performance metrics (for example, accuracy on the validation set) or by cross-validation, in which the training data is split into training and validation sets by random sampling. Together these steps make up the training process, whose output is a fully trained model ready to predict. In the prediction stage, we usually have new data points from a test data set. After normalization and feature extraction, they are fed into the trained model, and we see how well the model performs by evaluating its predictions.
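As a rough illustration of cross-validation, the sketch below splits a labeled data set into k random folds and uses each fold once as the validation set. It assumes plain k-fold splitting and accuracy as the metric. The learner `train_majority` is a deliberately trivial stand-in (it always predicts the most common training label), and the "spam"/"ham" data is synthetic; neither comes from the text.

```python
import random
from collections import Counter


def k_fold_splits(samples, k=5, seed=0):
    """Yield (train, validation) lists for k-fold cross-validation.
    The samples are shuffled once, then dealt into k folds."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = [samples[j] for j in folds[i]]
        train = [
            samples[j]
            for f, fold in enumerate(folds)
            if f != i
            for j in fold
        ]
        yield train, validation


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def train_majority(train_set):
    """Stand-in learner: always predicts the most common training label."""
    most_common = Counter(label for _, label in train_set).most_common(1)[0][0]
    return lambda document: most_common


# Synthetic (document, label) pairs, for illustration only.
data = [(f"doc {i}", "spam" if i % 3 == 0 else "ham") for i in range(12)]

scores = []
for train_set, validation_set in k_fold_splits(data, k=4):
    model = train_majority(train_set)
    predictions = [model(doc) for doc, _ in validation_set]
    actual = [label for _, label in validation_set]
    scores.append(accuracy(predictions, actual))

print("mean validation accuracy:", sum(scores) / len(scores))
```

Because every sample is used for validation exactly once, the averaged score is a less optimistic estimate than accuracy measured on the training data itself.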

There are several types of text classification, depending on the number of classes to predict and the nature of the predictions. These types are determined by the data set, the number of classes or categories associated with it, and the number of classes that can be predicted for a single data point.

  • Binary classification: the total number of distinct classes is 2, and each prediction is one of the two.
  • Multi-class classification, also known as multinomial classification, refers to problems where the total number of classes is more than 2 and each prediction yields exactly one of them. It extends binary classification to more than two classes.
  • Multi-label classification refers to problems where each prediction can yield more than one outcome/predicted class for a data point.
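The difference between the three types shows up most clearly in the shape of the labels. The snippet below is a small sketch with invented category names: binary and multi-class problems carry exactly one label per document, while multi-label problems carry a set of labels, which is often encoded as a 0/1 indicator row per document.

```python
# Binary: each document gets one of exactly two classes.
binary_labels = ["spam", "ham", "spam"]

# Multi-class: each document gets exactly one of more than two classes.
multiclass_labels = ["sports", "finance", "politics"]

# Multi-label: each document gets a set of classes (possibly empty).
multilabel_labels = [{"sports", "finance"}, {"politics"}, set()]


def to_indicator_matrix(label_sets, classes):
    """Encode each label set as a 0/1 row, one column per class."""
    return [[1 if c in labels else 0 for c in classes] for labels in label_sets]


classes = ["finance", "politics", "sports"]
matrix = to_indicator_matrix(multilabel_labels, classes)
for row in matrix:
    print(row)
```

In the indicator form, a multi-label problem can be treated as one binary decision per class, which is one common way to reuse binary classifiers for it.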

Origin www.cnblogs.com/dalton/p/11353926.html