A brief introduction to unsupervised learning

Unsupervised learning is a branch of machine learning whose goal is to identify the underlying structure and patterns of samples in unlabeled data, based only on the data's inherent structure and relationships. Unlike supervised learning, it aims to discover unknown structure without using any predefined target variable.

The main unsupervised learning techniques include clustering and dimensionality reduction.

Clustering

Clustering is one of the most common unsupervised learning methods. It divides the samples in a dataset into groups, or clusters, so that samples within the same cluster are as similar as possible while samples in different clusters differ markedly. Clustering can help us discover underlying patterns and structure in a dataset, thereby deepening our understanding of it.

Commonly used clustering algorithms include K-means, Hierarchical clustering, and DBSCAN.

K-means

The K-means algorithm is one of the simplest and most popular clustering algorithms. It works as follows (a from-scratch sketch follows the list):

  1. First, choose the number of clusters k.
  2. Randomly select k sample points as the initial cluster centers.
  3. Assign every sample point to its nearest cluster center.
  4. Update each cluster's center to the mean of its assigned points.
  5. Repeat steps 3 and 4 until the convergence condition is met.
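
To make these steps concrete, here is a minimal from-scratch sketch in NumPy (the function name kmeans_simple is ours, and it assumes no cluster ever goes empty); in practice you would use scikit-learn's KMeans, as in the example below.

import numpy as np

def kmeans_simple(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick k random samples as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # Step 3: assign every sample to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each center to the mean of its assigned samples
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centers no longer move (convergence)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels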

Code:

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np

# Dataset
X = np.array([[5, 3], [10, 15], [15, 12], [24, 10], [30, 45], [85, 70], [71, 80], [60, 78], [55, 52], [80, 91]])

# Visualize the raw data
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.show()

# Cluster the data into 2 groups with K-means
kmeans = KMeans(n_clusters=2)

# Fit the model
kmeans.fit(X)

# Visualize the clustering result
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, s=50)
plt.show()

# Print the cluster centers
print(kmeans.cluster_centers_)

# Print the cluster label assigned to each sample
print(kmeans.labels_)
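
Step 1 requires choosing k in advance. A common heuristic is the elbow method: run K-means for several values of k, plot the within-cluster sum of squares (inertia), and pick the k where the curve bends. A small sketch reusing X, KMeans, and plt from the code above (the range of k values is our assumption):

inertias = []
for k in range(1, 7):
    # inertia_ is the sum of squared distances to the nearest center
    inertias.append(KMeans(n_clusters=k, n_init=10).fit(X).inertia_)
plt.plot(range(1, 7), inertias, marker='o')
plt.xlabel('k')
plt.ylabel('inertia')
plt.show()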

Hierarchical clustering

Hierarchical clustering builds a tree-shaped hierarchy of clusters, working either bottom-up (agglomerative) or top-down (divisive), and different similarity measures can be used to generate the hierarchy.

Code:

from scipy.cluster.hierarchy import dendrogram, linkage
from matplotlib import pyplot as plt
import numpy as np

# Dataset
X = np.array([[5, 3], [10, 15], [15, 12], [24, 10], [30, 45], [85, 70], [71, 80], [60, 78], [55, 52], [80, 91]])

# Hierarchical clustering with single linkage
linked = linkage(X, 'single')

# Plot the dendrogram
plt.figure(figsize=(10, 7))
dendrogram(linked, orientation='top', distance_sort='descending', show_leaf_counts=True)
plt.show()
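
The dendrogram shows the order in which clusters merge, but to obtain concrete cluster assignments you can cut the tree at a chosen level. A minimal usage sketch with SciPy's fcluster (cutting into 2 flat clusters is our assumption):

from scipy.cluster.hierarchy import fcluster

# Cut the tree so that at most 2 flat clusters remain
labels = fcluster(linked, t=2, criterion='maxclust')
print(labels)  # cluster id (1 or 2) for each of the 10 samples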

DBSCAN

The DBSCAN algorithm determines the number of clusters from the data rather than requiring it to be preset. It divides a given dataset into clusters that may take arbitrary shapes, and it can also identify noisy data points that belong to no cluster.

Code:

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
import matplotlib.pyplot as plt

# Dataset: two interleaving half-moons
X, y = make_moons(n_samples=200, noise=0.05, random_state=0)

# DBSCAN clustering
dbscan = DBSCAN(eps=0.2, min_samples=5)
clusters = dbscan.fit_predict(X)

# Plot the clustering result
plt.scatter(X[:, 0], X[:, 1], c=clusters, s=50, cmap='viridis')
plt.show()

The above code generates a two-moons dataset of 200 points, clusters it with DBSCAN, and visualizes the result: the two interleaving half-moons are recovered even though they are not linearly separable.
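
DBSCAN labels the noise points it finds with -1. A short follow-up sketch (reusing the clusters array from above) that counts the clusters and noise points:

# Noise points are labeled -1 by DBSCAN
n_clusters = len(set(clusters)) - (1 if -1 in clusters else 0)
n_noise = (clusters == -1).sum()
print(f'clusters found: {n_clusters}, noise points: {n_noise}')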

Dimensionality reduction

Dimensionality reduction is another very important part of unsupervised learning. Its goal is to map high-dimensional data into a low-dimensional space. Dimensionality reduction can help us understand the data better while reducing the number of features, which lowers the computational cost of machine learning algorithms and thus speeds up model training.

Common dimensionality reduction algorithms include PCA and t-SNE.

PCA

PCA (Principal Component Analysis) is a linear algorithm that transforms high-dimensional data into low-dimensional data. It creates new low-dimensional features by finding the main directions of variation in the data.

Code:

from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

# Dataset
iris = load_iris()
X = iris.data
y = iris.target

# Visualize the first two original features
plt.scatter(X[:, 0], X[:, 1], c=y, s=50)
plt.show()

# Reduce the data to 2 dimensions with PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Plot the projected data
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, s=50, cmap='viridis')
plt.show()
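
A fitted PCA object also reports how much of the data's variance each principal component captures, which is a quick check on how much information the 2-D projection retains (a small follow-up sketch using the pca object from above):

print(pca.explained_variance_ratio_)        # variance share of each component
print(pca.explained_variance_ratio_.sum())  # total variance retained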

t-SNE

t-SNE (t-Distributed Stochastic Neighbor Embedding) is one of the most popular nonlinear dimensionality reduction algorithms. It is able to map high-dimensional data points to low-dimensional space and preserve the local structure between high-dimensional data points as much as possible.

Code:

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import seaborn as sns

# Dataset
digits = load_digits()
X = digits.data
y = digits.target

# t-SNE embedding into 2 dimensions
tsne = TSNE(n_components=2, perplexity=30, verbose=2)
X_tsne = tsne.fit_transform(X)

# Plot the embedding
plt.figure(figsize=(10, 10))
sns.scatterplot(x=X_tsne[:, 0], y=X_tsne[:, 1], hue=y, legend='full', palette='Spectral')
plt.title('t-SNE')
plt.show()

Other Unsupervised Learning Techniques

Besides clustering and dimensionality reduction, there are many other unsupervised learning techniques, such as anomaly detection, association rule mining, and deep-learning autoencoders. Their application scenarios differ, and the appropriate technique can be chosen according to the task at hand.
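
As a taste of one of these, here is a minimal anomaly-detection sketch using scikit-learn's IsolationForest (the toy data and the contamination value are our assumptions):

from sklearn.ensemble import IsolationForest
import numpy as np

# Toy data: 100 normal points around the origin plus 3 far-away outliers
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               [[8, 8], [-9, 7], [10, -10]]])

# IsolationForest isolates anomalies with random splits;
# fit_predict returns 1 for inliers and -1 for anomalies
iso = IsolationForest(contamination=0.03, random_state=0)
labels = iso.fit_predict(X)
print(X[labels == -1])  # the detected outliers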

Conclusion

This tutorial has covered the most common clustering and dimensionality reduction algorithms in unsupervised learning, as well as some other unsupervised learning techniques. Hopefully readers now have a deeper understanding of unsupervised learning and can apply it to practical problems.
