Dimensionality reduction visualization (t-SNE, UMAP, hypertools, etc.): code and comparison of results

  In machine learning and deep learning, features are often high-dimensional. Unfortunately, our computer screens are two-dimensional and our eyes can perceive at most three dimensions, so features must be reduced in dimensionality before they can be visualized.

1. Preparation: extracting MNIST features with LeNet5

  The method is simple. Starting from the code in the first section, we take the output of the penultimate fully connected layer (fc2), which has 84 dimensions, as the feature. The modified forward pass below just returns one extra tensor, emb.

def forward(self, x):
    x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
    x = F.max_pool2d(F.relu(self.conv2(x)), (2, 2))
    x = x.view(-1, self.num_flat_features(x))
    x = F.relu(self.fc1(x))
    emb = F.relu(self.fc2(x))  # 84-dim feature used for visualization
    x = self.fc3(emb)          # class scores from the final layer
    return emb, x

  Then run the trained model over the test set to collect the embedding vectors embs of all samples, along with their labels so that each category can be drawn in its own color later:

import numpy as np
import torch

model.eval()
embs = []
labels = []
with torch.no_grad():  # no gradients needed for feature extraction
    for data, target in test_loader:
        data, target = data.cuda(), target.cuda()
        emb, output = model(data)
        embs.append(emb.cpu().numpy())
        labels.append(target.cpu().numpy())
embs = np.concatenate(embs)
labels = np.concatenate(labels)
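
  As a quick sanity check (assuming the standard 10,000-image MNIST test split), the collected arrays should have these shapes:

print(embs.shape)    # expected: (10000, 84) -- 84-dim fc2 features per test image
print(labels.shape)  # expected: (10000,)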

2. Using t-SNE visualization in sklearn

  Visualizing with t-SNE requires no change to the network structure; it operates directly on the features output by the original network. The algorithm is already encapsulated in sklearn:

from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, learning_rate=200, metric='cosine', n_jobs=-1)
outs_2d = tsne.fit_transform(embs)  # fit_transform returns the 2-D embedding directly

import matplotlib.pyplot as plt
import matplotlib.colors as mcolors

css4 = list(mcolors.CSS4_COLORS.keys())
# Hand-picked indices of clearly distinguishable colors, so the plot stays
# legible even when there are more classes
color_ind = [2,7,9,10,11,13,14,16,17,19,20,21,25,28,30,31,32,37,38,40,47,51,
             55,60,65,82,85,88,106,110,115,118,120,125,131,135,139,142,146,147]
css4 = [css4[v] for v in color_ind]
for lbi in range(10):
    temp = outs_2d[labels == lbi]           # points belonging to class lbi
    plt.plot(temp[:, 0], temp[:, 1], '.', color=css4[lbi])
plt.title('feats dimensionality reduction visualization by tSNE, test data')
plt.show()

  Note that the distance metric can be chosen via the metric parameter of TSNE. The default is 'euclidean'; I use 'cosine' here, and other distances such as 'correlation' can also be used.
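
  For example, a minimal variant with a different metric (the outs_2d_corr name is only illustrative; everything else is as above):

# Same pipeline as above, only the distance metric changes
tsne_corr = TSNE(n_components=2, learning_rate=200, metric='correlation', n_jobs=-1)
outs_2d_corr = tsne_corr.fit_transform(embs)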

Figure 1. Distribution of the MNIST embedded features drawn with sklearn's t-SNE, from two separate runs

  Because the random seed differs between runs, the result is different each time: the shape of each cluster stays roughly the same, but the relative positions change. This illustrates that t-SNE preserves the distance relationships between nearby points but not between distant ones; it usually reflects intra-class compactness and class separation well, but the distances between classes are not meaningful. For example, the black and red classes are close in the first image but far apart in the second. Viewed from another angle, t-SNE also has a certain class-separating ability while reducing dimensionality.
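
  If a reproducible layout is needed, the randomness can be pinned down; a minimal sketch (the seed value 0 is arbitrary):

# Fixing random_state makes the t-SNE layout reproducible across runs;
# different seeds give the rotated/shifted layouts seen in Figure 1
tsne_fixed = TSNE(n_components=2, learning_rate=200, metric='cosine',
                  n_jobs=-1, random_state=0)
outs_2d_fixed = tsne_fixed.fit_transform(embs)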

3. Using multiple dimensionality reduction methods in hypertools

  Hypertools is a dimensionality reduction and visualization toolkit (developed at Dartmouth's Contextual Dynamics Lab) that integrates a variety of dimensionality reduction algorithms, such as PCA, TSNE, Isomap and UMAP, as well as clustering and alignment algorithms, which makes it very convenient to use. It can be installed with pip install hypertools. The core dimensionality reduction and drawing code is a single line:

import hypertools as hyp
import matplotlib.pyplot as plt

hyp.plot(embs, '.', reduce='TSNE', ndims=2, hue=labels)
plt.title('TSNE')

  Note that hyp.plot calls matplotlib.pyplot internally, so plt commands such as plt.title() above can be mixed in directly to get extra plotting functionality. The results of several dimensionality reduction methods are shown below:

Figure 2. Drawing the distribution of MNIST embedded features using various dimensionality reduction methods of hypertools

  Usually t-SNE and UMAP give the best visualization results. Methods such as PCA preserve more of the proportional distance relationships between samples, but the categories do not separate as cleanly; UMAP sits at the other extreme and separates the classes very distinctly.
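
  A comparison like Figure 2 can be sketched by looping over the reduce argument (a minimal sketch; the 'UMAP' option assumes the umap-learn package is installed):

# One figure per dimensionality reduction method, all colored by the true labels
for method in ['PCA', 'TSNE', 'UMAP', 'Isomap']:
    hyp.plot(embs, '.', reduce=method, ndims=2, hue=labels)
    plt.title(method)
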
  Clustering can also be done with hypertools, using algorithms such as KMeans, AgglomerativeClustering, Birch, FeatureAgglomeration and SpectralClustering. If no label information is available, you can cluster first and use the cluster assignments instead of labels to color the plot:

clust = hyp.cluster(embs, cluster='KMeans', n_clusters=10)  # list of cluster assignments
hyp.plot(embs, '.', reduce='TSNE', ndims=2, hue=clust)
plt.title('TSNE, clustering by KMeans')


Figure 3. Using hypertools' KMeans clustering and then using TSNE to draw the distribution of MNIST embedded features

  Finally, a small detail: in Figure 3 each cluster still contains a few "noise points" that belong to other clusters. These come both from t-SNE placing points in the wrong region and from KMeans assigning them to the wrong cluster, which also shows that the stray points in Figure 2 are not all caused by labeling errors.
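
  This can be checked quantitatively; a minimal sketch comparing the KMeans assignments against the true labels (using sklearn's adjusted Rand index, which is invariant to the arbitrary numbering of the clusters):

from sklearn.metrics import adjusted_rand_score

# A value close to 1.0 means the clusters almost reproduce the true classes;
# the gap accounts for the "noise points" visible in Figure 3
print(adjusted_rand_score(labels, clust))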

Original post: blog.csdn.net/Brikie/article/details/114375837