Visualization Techniques for Biological Data – tSNE vs

Author: Zen and the Art of Computer Programming

1 Introduction

One of the biggest information impacts brought to us by the current Internet era is the massive amount of data being generated. In particular, researchers in the field of bioinformatics have collected large amounts of data and conducted various analyses. Nowadays, more and more researchers in this field have begun to visualize these data from multiple angles to understand the characteristics of the data, discover structural differences, etc. Among them, t-SNE (t-Distributed Stochastic Neighbor Embedding) and PCA (Principal Component Analysis) are two commonly used data dimensionality reduction methods, used to display the bipolar distribution of data sets in high-dimensional space. This article will elaborate on the related concepts, advantages and disadvantages of the two, and how to choose and use them. And specific operation steps are given based on actual examples and open source toolkits (python implementation), allowing readers to intuitively feel the effects of different algorithms.

2.Basic concepts and terminology

2.1 Data representation and feature extraction

Usually, bioinformatics data are often high-dimensional, including the number of samples, the number of marker genes and other indicators, ranging from hundreds to thousands. This makes traditional matrix operations inefficient when dealing with such data. In order to reduce the dimensionality of the data, feature extraction or dimensionality reduction is required. For example, clustering algorithms are used to divide multiple samples into different categories, or dimensionality reduction methods such as linear discriminant analysis (LDA) are used. Feature selection or feature engineering is the process of determining the most valuable feature subset or subspace. It is an important task in the data preprocessing stage.

2.2 Data dimensionality reduction

Data dimensionality reduction is a very popular branch in the computer field. The purpose is to convert high-dimensional data into low-dimensional representation through a small amount of unsupervised means, thereby achieving the purpose of data visualization. Simply put, data dimensionality reduction can transform complex and difficult-to-understand data into easy-to-understand graphics, making it easier for people to understand the relationship between data. There are many methods for data dimensionality reduction, such as principal component analysis (PCA), independent component analysis (ICA), kernel principal component analysis (KPCA), local linear embedding (LLE), spectral embedding (SE), transform coding (TCA), Manifold Learning, etc. Commonly used ones include t-SNE, PCA, MDS, Isomap, etc.

2.2.1 PCA

PCA is one of the oldest and classic methods. It is a linear transformation that aims to find the direction with the largest variance of the data as the new coordinate axis, and then project the data onto this coordinate axis to maintain the largest variance. As shown in the figure below: Suppose there is a one-dimensional straight line $y=x+1$, and there is a curve around the data points. If we try to project all data points onto this straight line as much as possible, we may lose information over longer distances. In contrast, if we project the data points onto a curve, only some variance is lost. PCA can retain the information of the original data to the greatest extent by finding a curve. PCA also has some restrictions, such as having zero mean, covariance being a unit matrix, etc. Therefore, PCA is generally considered a nonlinear method.

2.2.2 t-SNE

t-SNE is another popular data dimensionality reduction method. Its basic idea is to map data points in high-dimensional space to two-dimensional positions in low-dimensional space through a distribution function called "probability approximation". Its biggest feature is that it can maintain the global structure and is not sensitive to the local structure. The idea of ​​probabilistic approximation is that for every two-dimensional location, we have a corresponding set of high-dimensional data points. By analyzing the relationship between these point sets, we can find a mapping relationship so that they are distributed as closely as possible in the low-dimensional space. As shown in the figure below: The t-SNE method can be regarded as an improved PCA, but it is based on probability theory. It uses polynomial distribution as the probability distribution model, and also takes into account the similarity and distance distribution between two high-dimensional data points. Therefore, it is more suitable for processing large-scale data. However, it is more complex and computationally expensive than PCA. Therefore, t-SNE is usually only used for visualization of high-dimensional data.

2.3 Visualization technology

Data visualization technology is a means of presenting data in the form of graphical representations. There are mainly two types: one is static pictures, and the other is dynamic pictures. Static graphics display data with fixed visual symbols, while dynamic graphics display changing images through animation, interaction, drag-and-drop, etc. Usually, static diagrams are used to present high-level, abstract data, while dynamic diagrams are used to present detailed, microscopic data. Common static charts include scatter charts, bar charts, heat maps, tree charts, radar charts, violin charts, rose charts, contour charts, etc. Common dynamic diagrams include sphere diagrams, Pascal surface diagrams, flow field diagrams, contour diagrams, migration diagrams, etc.

2.4 Example data set

2.4.1 UCI Human Microarray Dataset

The UCI Human Microarray Dataset (HMD) is a dataset for human significant gene markers developed by the National Science Foundation. The original data is the expression data of 66 significant genes, with a total of 7653 samples, and each sample contains the expression data of 458 marker genes. The purpose of this dataset is to test the performance of various machine learning algorithms on human gene expression data.

2.4.2 Gene Expression Data of Breast Cancer Cells

Breast Cancer Cell Expression Dataset is a breast cancer cell expression data set jointly collected by the European Organization for Nuclear Medicine and UCSD. It contains 205 samples and a total of approximately 145,000 RNA-Seq gene expression sequencing data. The target variable is cancer incidence. This dataset is mainly used to test algorithms such as clustering, classification, regression, and anomaly detection.

3. Core algorithm principles and specific operation steps

3.1 PCA

PCA, principal component analysis. PCA is a statistical technique used to find the direction with the largest variance as the new coordinate axis. First, the feature vectors of the data set are converted to a new space and arranged according to the length of the feature vectors, so that the features with greater effect on the first few feature vectors occupy more space. Then, the data is projected into this new coordinate system to obtain a new set of data points. This set of data points is at the same position as the original data points, but has different lengths and directions. Therefore, we can compress the data into two dimensions, and the data in each dimension can reflect the maximum variance of the data. The specific steps of the PCA algorithm are as follows:

3.1.1 Data standardization

Standardization is an important preparatory step for PCA. The purpose is to ensure that the variance of the data is 1, that is, the variance of each feature is the same. The specific method is to subtract the mean and divide by the standard deviation, that is, z=(x-μ)/σ.

3.1.2 Calculate covariance matrix

The covariance matrix C describes the linear relationship between two random variables X and Y. C is a matrix of n n, where n is the number of samples in the data set. If there is a linear relationship between the two variables, then the corresponding Cij will be large, otherwise it will be small. C(i,j)=E[(Xi-E[X]) (Yj-E[Y])]

3.1.3 Solve the variance contribution rate

The explained variance ratio represents the variance contribution of each feature vector. We hope that the direction with the largest variance contribution rate will be the coordinate axis. Since the sum of variances is 1, the variance contribution rate of each feature vector is also 1/n.

3.1.4 Compare the variance contribution rates of different dimensions

After determining the k dimension, what we have to do is to find the k eigenvectors with the largest contribution rate, because these k eigenvectors constitute the coordinate axis. Through cross-validation, we can find the optimal k value. Finally, we project the original data to obtain a new set of sample points in the k-dimensional feature space.

3.2 t-SNE

t-SNE, t-Distributed Stochastic Neighbor Embedding, is a nonlinear dimensionality reduction technology. Its basic idea is: for each high-dimensional data point xi, we find an adjacent data point xj, and the distance between them (KL divergence) obeys a Gaussian distribution. Therefore, we can find k neighboring points yj of xi, and these k yj constitute the Gaussian distribution of xi. The specific operation steps of t-SNE are as follows:

3.2.1 Sample standardization

The purpose of normalizing the data is to prevent excessively large gradients from causing training difficulties.

3.2.2 Calculate KL divergence

KL divergence, Kullback-Leibler divergence. It measures the difference between two probability distributions P and Q. It is defined as: Dkl(P||Q)=∑pilog(pi/qi).

3.2.3 Fitting Gaussian distribution

Fitting Gaussian distribution can obtain the similarity between data samples, and use the Gaussian distribution of the samples as a probability distribution model to capture the probability relationship between samples.

3.2.4 Mapping sample distribution to high-dimensional space

The relationship between probability distributions is used for mapping, and the sample distribution is mapped into a high-dimensional space. Finally, data points in a low-dimensional space are generated.

3.3 Python programming examples

We can use t-SNE and PCA in the scikit-learn library to implement the codes of t-SNE and PCA, and compare the effects of the two algorithms.

3.3.1 Install dependent libraries

!pip install sklearn matplotlib pandas seaborn

3.3.2 Implement t-SNE using scikit-learn

from sklearn.manifold import TSNE
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

data = pd.read_csv('breast_cancer_expression.txt', sep='\t')   #读取数据

X = data[['Gene' + str(i) for i in range(1, 10)]]    #获取特征值

tsne = TSNE(random_state=42)   #实例化t-SNE对象

transformed_X = tsne.fit_transform(X)   #进行转换

plt.scatter(transformed_X[:, 0], transformed_X[:, 1], c=data['target'])   #绘制散点图

plt.xlabel("Dimension 1")
plt.ylabel("Dimension 2")

plt.show()  

operation result:

It can be seen that in this example, t-SNE can effectively separate the two types of samples. However, the results may vary for different data sets and parameter settings. In addition, in order to make the results clearer, you can add some manually adjusted parameters, such as adjusting the learning rate, initialization method, number of iterations, etc.

3.3.3 Implement PCA using scikit-learn

from sklearn.decomposition import PCA
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

data = pd.read_csv('human_microarray_gene_expression.txt', sep='\t') 

X = data[['Gene' + str(i) for i in range(1, 459)]].values   #获取特征值

pca = PCA(n_components=2)   #实例化PCA对象

transformed_X = pca.fit_transform(X)   #进行转换

plt.scatter(transformed_X[:, 0], transformed_X[:, 1], alpha=.3)   #绘制散点图

plt.xlabel("Dimension 1")
plt.ylabel("Dimension 2")

plt.show() 

operation result:

It can be seen that the results of PCA are difficult to visually represent the structure of the data, and the size of each point represents the variance. In addition, PCA is not a completely nonlinear transformation, so it may not capture the local structure of the data well.

Guess you like

Origin blog.csdn.net/universsky2015/article/details/133566278