7. Document Clustering

Document clustering, also called cluster analysis, is an interesting area of NLP and text analytics that applies unsupervised ML concepts and techniques. Its main premise is similar to that of document classification: you start with a complete corpus of documents and divide it into groups based on the distinctive features, attributes, and properties of the documents. Document classification needs pre-labeled training data to build a model before it can classify documents. Document clustering instead uses unsupervised ML algorithms to group the documents into clusters. The clusters are formed so that documents within a cluster are more similar and more closely related to each other than to documents in other clusters.
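To make "more similar to each other" concrete, here is a minimal sketch that compares documents by the cosine similarity of their term counts. The three toy documents, the whitespace tokenizer, and the function names are assumptions made up for illustration; a real pipeline would normally use proper tokenization and weighting such as TF-IDF.

```python
import math
from collections import Counter


def term_counts(text):
    """Lowercase, split on whitespace, and count terms (a deliberately naive tokenizer)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy documents, invented for this example.
docs = [
    "the spy steals secret plans",
    "the spy hides the secret plans",
    "a farmer grows wheat and corn",
]
vectors = [term_counts(d) for d in docs]

sim_01 = cosine_similarity(vectors[0], vectors[1])  # two spy documents
sim_02 = cosine_similarity(vectors[0], vectors[2])  # spy vs. farming document
print(round(sim_01, 3), round(sim_02, 3))
```

The two spy documents share most of their terms and score high, while the farming document shares none and scores zero. A clustering algorithm exploits exactly this kind of signal, without ever being told which documents belong together.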

One important thing to remember here is that clustering is an unsupervised learning technique, so there will always be some overlap between clusters, because there is no perfect definition of a cluster. All of these techniques rest on mathematical formulations, heuristics, and assumptions inherent to how they generate clusters, so they are never 100% accurate. There are many techniques for finding clusters; several popular families of clustering algorithms are briefly described as follows:

  • Hierarchical clustering models : Also known as connectivity-based clustering methods, these rest on the idea that similar objects lie closer to each other in the vector space than unrelated objects, which lie farther apart. Clusters are formed by connecting objects based on their distance, and the result can be visualized as a tree (a dendrogram). The output of these models is a complete, exhaustive hierarchy of clusters. This family is divided into agglomerative and divisive clustering models.
  • Centroid-based clustering models : These models build clusters so that each cluster has a central, representative member that stands for the cluster and carries the features that distinguish it from the other clusters. Centroid-based clustering includes algorithms such as k-means and k-medoids, which require the number of clusters k to be set in advance and minimize a distance measure (for example, the squared distance from each data point to its centroid). The disadvantage of these models is that you must specify k up front, and the optimization may converge to a local minimum, so you may not get a true clustered representation of the data.
  • Distribution-based clustering models : These models use concepts from probability distributions to cluster the data points. The idea is that objects with similar distributions can be clustered into the same group. Gaussian mixture models (GMM) use algorithms such as expectation-maximization to build these clusters. These models can also capture correlations and dependencies among features and attributes, but they are prone to overfitting.
  • Density-based clustering models : These models generate clusters from data points that are grouped together in regions of high density, in contrast to other data points that may occur randomly in sparse regions of the vector space. The sparse regions are treated as noise and serve as border points that separate the clusters. Two popular algorithms in this family are DBSCAN and OPTICS.
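The centroid-based idea from the list above can be sketched in a few lines. The following is a minimal, standard-library version of Lloyd's algorithm for k-means; the 2-D points, the function names, and the choice to pass starting centroids explicitly are assumptions for illustration, and real documents would first be turned into feature vectors.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def k_means(points, centroids, max_iter=100):
    """Lloyd's algorithm with caller-supplied starting centroids (k = len(centroids))."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            new_centroids.append(mean_point(members) if members else old)
        if new_centroids == centroids:  # converged: nothing moved
            break
        centroids = new_centroids
    return labels, centroids


def inertia(points, labels, centroids):
    """Objective being minimized: total squared distance to assigned centroids."""
    return sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))


# Hypothetical toy data: two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = k_means(points, centroids=[points[0], points[3]])
print(labels, centroids, inertia(points, labels, centroids))
```

Because each step only ever lowers the objective from wherever it starts, different starting centroids can converge to different partitions. That is the local-minimum caveat mentioned above, and it is why practical implementations usually try several initializations and keep the best one.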

Several other clustering models have been introduced more recently, including BIRCH and CLARANS. Entire books and journals are devoted to clustering, because it is a very useful and valuable topic. We will cover three major clustering algorithms and illustrate them with real-world data for better understanding:

  • k-means clustering.
  • Affinity propagation (AP) clustering.
  • Ward's agglomerative hierarchical clustering.
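As a preview of the last item, here is a rough sketch of Ward's merge criterion: at every step, merge the two clusters whose union increases the within-cluster sum of squares the least. The brute-force pair search, the function names, and the toy 2-D points are assumptions for illustration only; library implementations are far more efficient and would operate on document feature vectors.

```python
def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(coord) / n for coord in zip(*cluster))


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b are merged."""
    ca, cb = centroid(a), centroid(b)
    gap = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * gap


def ward_clustering(points, n_clusters):
    """Agglomerative clustering: start from singletons and merge the cheapest pair."""
    clusters = [[tuple(p)] for p in points]
    merge_costs = []
    while len(clusters) > n_clusters:
        best = None
        # Brute-force search over all pairs for the smallest Ward cost.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = ward_cost(clusters[i], clusters[j])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        cost, i, j = best
        merge_costs.append(cost)
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, merge_costs


# Hypothetical toy data: two tight pairs and one isolated point.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
clusters, merge_costs = ward_clustering(points, n_clusters=3)
print(clusters, merge_costs)
```

Recording the cost of each merge is what makes a dendrogram possible: the merge order and the merge heights together describe the full hierarchy, and cutting it at a chosen level yields the final clusters.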

For each algorithm, we will introduce its theoretical concepts, as we did for the other methods earlier. We will then apply each clustering algorithm to real-world data about movies and their synopses to explain how it works. We will also look at detailed cluster statistics and focus on visualizing the clusters produced by each algorithm, because clustering results are usually hard to visualize and practitioners often face this challenge.

Origin www.cnblogs.com/dalton/p/11354023.html