Title
SpaGCN: Integrating gene expression, spatial location and histology to identify spatial domains and spatially variable genes by graph convolutional network
SpaGCN is a method for identifying spatial domains and spatially variable genes by integrating gene expression, spatial location, and histological information through graph convolutional networks.
In SpaGCN, we combine gene expression, spatial location, and histological information to build a graph to represent the relationship between all points in the data. Through graph convolutional layers, SpaGCN can aggregate gene expression information from neighboring points. Then, SpaGCN utilizes the aggregated expression matrix to cluster the points using an unsupervised iterative clustering algorithm, considering each cluster as a spatial domain. Next, SpaGCN detects spatially variable genes enriched in specific domains by differential expression analysis.
The key strength of SpaGCN is that it comprehensively considers gene expression, spatial location, and histology information, thereby enabling the identification of spatial domains with consistent gene expression and histology and the detection of spatially variable genes with clear spatial expression patterns. Compared with other methods, the spatially variable genes detected by SpaGCN have better biological interpretation and transferability, which can be used for further research and analysis.
All in all, SpaGCN provides a powerful tool for spatial transcriptomics research by integrating different information sources through graph convolutional networks. It can reveal the spatial variation of gene expression in the tissue microenvironment and provide important clues for further understanding cellular mechanisms and disease pathology.
Abstract
Recent advances in spatially resolved transcriptomics (SRT) technologies have enabled comprehensive characterization
of gene expression patterns in the context of tissue microenvironment. To elucidate spatial gene expression variation, we
present SpaGCN, a graph convolutional network approach that integrates gene expression, spatial location and histology
in SRT data analysis. Through graph convolution, SpaGCN aggregates gene expression of each spot from its neighboring
spots, which enables the identification of spatial domains with coherent expression and histology. The subsequent domain
guided differential expression (DE) analysis then detects genes with enriched expression patterns in the identified domains.
Analyzing seven SRT datasets using SpaGCN, we show it can detect genes with much more enriched spatial expression patterns than competing methods. Furthermore, genes detected by SpaGCN are transferrable and can be utilized to study spatial
variation of gene expression in other datasets. SpaGCN is computationally fast, platform independent, making it a desirable
tool for diverse SRT studies.
Significant advances have recently been made in Spatially Resolved Transcriptomics (SRT) techniques, which allow us to comprehensively describe gene expression patterns in tissue microenvironments. To elucidate spatial variation in gene expression, we propose SpaGCN, a graph convolutional network approach that integrates gene expression, spatial location, and histology into the analysis of SRT data. Through graph convolution, SpaGCN combines the gene expression of each point with that of its neighbors, enabling the identification of spatial regions with consistent expression and histology. Subsequent region-guided differential expression (DE) analysis can detect genes with enriched expression patterns in defined regions. By analyzing seven SRT datasets using SpaGCN, we show that it is able to detect genes with more enriched spatial expression patterns than other competing methods. Furthermore, the genes detected by SpaGCN are transferable and can be used to study the spatial variation of gene expression in other datasets. SpaGCN is computationally fast and platform-independent, making it an ideal tool for various SRT studies.
Introduction
Recent technological advances in SRT have enabled gene
expression profiling with spatial information in tissues1.
Knowledge of the relative locations of different cells in a tissue is critical for understanding disease pathology because spatial
information helps in understanding how the gene expression of a
cell is influenced by its surrounding environment. Popular experimental methods for SRT can be broadly classified into two categories. The first category is in situ hybridization or sequencing-based
technologies with single-cell resolution, which includes seqFISH2,3, seqFISH+4, MERFISH5,6, STARmap7 and FISSEQ8 that measure the
expression level for hundreds to thousands of genes in cells within
their tissue context. The second category is in situ capturing-based
technologies with spatial barcoding followed by sequencing, which
includes spatial transcriptomics (ST)9, SLIDE-seq10, SLIDE-seqV2
(ref. 11), HDST12 and 10x Visium that measure the expression level
for thousands of genes in captured locations, referred to as spots.
These different SRT technologies have made it possible to uncover
the complex transcriptional architecture of heterogeneous tissues and enhanced our understanding of cellular mechanisms in
diseases13,14.
In SRT studies, an important step is identifying spatial domains defined as regions that are spatially coherent in both gene expression and histology. Traditional clustering methods such as K-means and Louvain’s method15 only take gene expression data as input, and the resulting clusters may not be contiguous due to the lack of consideration of spatial information and histology. To account for spatial dependency of gene expression, new methods have been developed. For example, Zhu et al.16 use a hidden Markov random field (HMRF) approach to model spatial dependency of gene expression; stLearn17 uses features extracted from histology image as
well as expression of neighboring spots spatially to normalize gene
expression data before clustering; BayesSpace18 employs a Bayesian
approach for clustering by imposing a prior that gives higher weight
to physically close spots. Although these methods can cluster spots
or cells into distinct groups, the lack of flexibility with different
modalities has made them less versatile. As newer SRT technologies
continue to be developed19–22, it is desirable to have methods that are
compatible with different SRT platforms.
To link spatial domains with biological functions, it is crucial
to identify genes that show enriched expression in the identified
domains. Methods such as Trendsceek23, SpatialDE24 and SPARK25
have been developed to detect spatially variable genes (SVGs). These
methods examine each gene independently and return a P value to
represent the spatial variability of a gene. However, due to the lack
of consideration of spatial domains, genes detected by these methods do not have guaranteed spatial expression patterns, making it
difficult to utilize these genes for further biological investigations.
Rather than considering spatial domain and SVG identification
as separate problems, we developed SpaGCN, a graph convolutional
network (GCN)-based approach that considers these two problems
jointly. SpaGCN first identifies spatial domains by integrating gene
expression, spatial location and histology through the construction
of an undirected weighted graph that represents the spatial dependency of the data. For each spatial domain, SpaGCN then detects SVGs that are enriched in the domain. By restricting the search
space to spatial domains, the SVGs detected by SpaGCN are guaranteed to have spatial expression patterns. The spatial domains and
the corresponding SVGs provide a comprehensive picture of the
spatial gradients in gene expression in tissue. SpaGCN is versatile
in analyzing many types of SRT data, including ST, 10x Visium,
SLIDE-seqV2, STARmap, and MERFISH.
Recent technological advances in SRT have enabled gene expression profiling with spatial information in tissues. Knowing the relative location of different cells in a tissue is critical to understanding disease pathology, as spatial information helps explain how a cell's gene expression is affected by its surrounding environment. Popular SRT experimental approaches can be broadly divided into two categories. The first category is in situ hybridization-based or sequencing-based technologies with single-cell resolution, including seqFISH, seqFISH+, MERFISH, STARmap and FISSEQ, which measure the expression levels of hundreds to thousands of genes in cells within their tissue context. The second category is in situ capturing-based technologies that use spatial barcoding followed by sequencing, including spatial transcriptomics (ST), SLIDE-seq, SLIDE-seqV2, HDST and 10x Visium, which measure the expression levels of thousands of genes at captured locations, referred to as spots. These diverse SRT technologies allow us to reveal the complex transcriptional architecture of heterogeneous tissues and deepen our understanding of cellular mechanisms in disease.
An important step in SRT studies is the identification of spatial domains, defined as regions that are spatially coherent in both gene expression and histology. Traditional clustering methods such as K-means and Louvain's method take only gene expression data as input, so the resulting clusters may not be contiguous because spatial information and histology are not considered. To account for the spatial dependence of gene expression, new methods have been developed. For example, Zhu et al. use a hidden Markov random field (HMRF) approach to model the spatial dependence of gene expression; stLearn uses features extracted from the histology image, together with the expression of spatially neighboring spots, to normalize gene expression data before clustering; BayesSpace takes a Bayesian approach to clustering, imposing a prior that gives higher weight to physically close spots. While these methods can cluster spots or cells into distinct groups, their lack of flexibility across modalities limits their applicability. As new SRT technologies continue to be developed, methods that are compatible with different SRT platforms are desirable.
To link spatial domains to biological function, it is critical to identify genes whose expression is enriched in defined domains. Several methods have been developed to detect Spatially Variable Genes (SVGs), such as Trendsceek, SpatialDE, and SPARK. These methods examine each gene independently and return a P-value representing the gene's spatial variability. However, due to the lack of consideration of the spatial domain, the genes detected by these methods do not have guaranteed spatial expression patterns, making it difficult to use these genes for further biological studies.
Rather than treating spatial domain identification and SVG identification as separate problems, we developed SpaGCN, a graph convolutional network (GCN)-based approach that considers the two problems jointly. SpaGCN first identifies spatial domains by integrating gene expression, spatial location and histology through the construction of an undirected weighted graph that represents the spatial dependency of the data. For each spatial domain, SpaGCN then detects SVGs enriched in that domain. By restricting the search space to spatial domains, the SVGs detected by SpaGCN are guaranteed to have spatial expression patterns. The spatial domains and the corresponding SVGs provide a comprehensive picture of the spatial gradients of gene expression in tissue. SpaGCN is suitable for analyzing many types of SRT data, including ST, 10x Visium, SLIDE-seqV2, STARmap and MERFISH.
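SRT technologies fall into two classes depending on the instrument used, iST and sST. iST is based on in situ hybridization, e.g. seqFISH, seqFISH+, MERFISH, STARmap and FISSEQ.
sST is based on in situ capturing-based technologies, e.g. SLIDE-seq10, SLIDE-seqV2 (ref. 11), HDST12 and 10x Visium.
SRT analysis proceeds in two steps. The first step is identifying spatial domains. Common methods are K-means and Louvain, but they do not consider spatial or histological information;
other methods are HMRF, stLearn (normalization) and BayesSpace (a prior that adds spatial information), but they lack flexibility across modalities and have poor compatibility.
The second step is linking domains to biological functions, i.e. identifying genes with enriched expression in the domains. Methods such as Trendsceek, SpatialDE
and SPARK detect spatially variable genes (SVGs) and use a P value to represent a gene's spatial variability. However, these methods do not take spatial domains into account.
SpaGCN treats domain identification and SVG detection as a joint problem. It builds an undirected weighted graph that combines gene expression, spatial
location and histology, and uses it to identify spatial domains.
SVGs are then detected for each domain. The method applies to many types of SRT data, such as ST, 10x Visium, SLIDE-seqV2, STARmap and MERFISH.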
in situ hybridization is iST, focusing on hundreds to thousands of genes in cells
in situ capturing-based technologies with spatial barcoding are sST, focusing on thousands of genes in spots
heterogeneous tissues refers to tissues made up of different cells or groups of cells
Spatially variable genes (SVGs) refer to genes enriched in different domains
Results / Experiments
Overview of SpaGCN and evaluation. We explain the workflow
of SpaGCN using in situ capturing-based SRT data as an example,
but the method can be easily modified to analyze other types of SRT
data. As shown in Fig. 1a, SpaGCN first builds a graph to represent
the relationship of all spots considering both spatial location and
histology information. Next, SpaGCN utilizes a graph convolutional
layer to aggregate gene expression information from neighboring
spots. Then, SpaGCN uses the aggregated expression matrix to
cluster spots using an unsupervised iterative clustering algorithm26.
Each cluster is considered as a spatial domain from which SpaGCN
then detects SVGs that are enriched in a domain by DE analysis
(Fig. 1b). When a single gene cannot mark the expression pattern
of a domain, SpaGCN will construct a meta gene, formed by the
combination of multiple genes, to represent the expression pattern
of the domain.
To showcase the strength of SpaGCN, we applied it to seven publicly available datasets (Supplementary Table 1). The spatial domains
identified by SpaGCN agree better with known tissue structures
than Louvain, stLearn, and BayesSpace. We also compared SVGs
detected by SpaGCN with those detected by SpatialDE and SPARK,
and found that the SpaGCN-detected SVGs have more coherent
expression patterns and better biological interpretability than the
other two methods. The specificity of spatial expression patterns
revealed by SpaGCN-detected SVGs was further confirmed by
Moran’s I and Geary’s C statistics27, two commonly used metrics for
quantifying spatial autocorrelation of gene expression28,29.
Overview and evaluation of SpaGCN. We explain the SpaGCN workflow using in situ capturing-based SRT data as an example, but the method can be easily modified to analyze other types of SRT data. As shown in Fig. 1a, SpaGCN first constructs a graph to represent the relationship among all spots, considering both spatial location and histology information. Next, SpaGCN uses a graph convolutional layer to aggregate gene expression information from neighboring spots. Then, SpaGCN uses the aggregated expression matrix to cluster the spots with an unsupervised iterative clustering algorithm. Each cluster is considered a spatial domain, from which SpaGCN then detects SVGs enriched in that domain by differential expression analysis (Fig. 1b). When a single gene cannot mark the expression pattern of a domain, SpaGCN constructs a meta gene, formed by combining multiple genes, to represent the expression pattern of the domain.
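The graph-building and aggregation steps described above can be pictured with a small sketch. It is not the SpaGCN implementation: the Gaussian kernel, the bandwidth name `l`, and the single row-normalized propagation step without learned weights are all assumptions made for illustration, and the coordinates and expression values are toy data.

```python
import numpy as np

def spot_graph(coords, l=1.0):
    """Undirected weighted graph from spot coordinates.

    Assumption: edge weights come from a Gaussian kernel on pairwise
    Euclidean distance, with hypothetical bandwidth `l`.
    """
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    d2 = (diff ** 2).sum(axis=-1)          # squared pairwise distances
    return np.exp(-d2 / (2.0 * l ** 2))    # symmetric, self-weight = 1

def aggregate(adj, expr):
    """One propagation step: each spot becomes a weighted mean of all spots,
    dominated by its close neighbors (no learned weights in this sketch)."""
    adj = np.asarray(adj, dtype=float)
    norm = adj / adj.sum(axis=1, keepdims=True)   # rows sum to 1
    return norm @ np.asarray(expr, dtype=float)

# Toy data: two pairs of nearby spots, two genes
coords = [[0, 0], [0, 1], [5, 5], [5, 6]]
expr = np.array([[10.0, 0.0], [8.0, 0.0], [0.0, 9.0], [0.0, 11.0]])

adj = spot_graph(coords, l=1.0)
agg = aggregate(adj, expr)   # spots x genes, smoothed within each pair
```

After aggregation, each spot's profile is pulled toward its close neighbors, which is what makes the subsequent clustering favor spatially contiguous domains.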
To demonstrate the advantages of SpaGCN, we applied it to seven publicly available datasets (Supplementary Table 1). The spatial domains identified by SpaGCN agree better with known tissue structures than those identified by Louvain, stLearn and BayesSpace. We also compared the SVGs detected by SpaGCN with those detected by SpatialDE and SPARK, and found that the SpaGCN-detected SVGs have more coherent expression patterns and better biological interpretability. The specificity of the spatial expression patterns revealed by SpaGCN-detected SVGs was further confirmed by Moran's I and Geary's C statistics, two commonly used metrics for quantifying the spatial autocorrelation of gene expression.
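For reference, the two autocorrelation statistics can be computed as follows. This sketch uses the standard textbook definitions; the spatial weight matrix `w` (here, adjacency along a one-dimensional chain of spots) is an assumption, since this section does not say how neighbors are weighted. In the standard form, Geary's C below 1 indicates positive autocorrelation, so a higher-is-better reading as in the comparisons here would imply a rescaled version, which is not shown.

```python
import numpy as np

def morans_i(x, w):
    """Moran's I: positive when neighboring spots have similar values."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    z = x - x.mean()
    return len(x) / w.sum() * (w * np.outer(z, z)).sum() / (z ** 2).sum()

def gearys_c(x, w):
    """Geary's C (standard form): below 1 when neighbors are similar."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    z = x - x.mean()
    sq_diff = (x[:, None] - x[None, :]) ** 2
    return (len(x) - 1) / (2.0 * w.sum()) * (w * sq_diff).sum() / (z ** 2).sum()

# Toy layout: six spots in a row, each linked to its immediate neighbors
n = 6
w = np.zeros((n, n))
for i in range(n - 1):
    w[i, i + 1] = w[i + 1, i] = 1.0

clustered = [1, 1, 1, 0, 0, 0]     # expression confined to one side
alternating = [1, 0, 1, 0, 1, 0]   # no spatial coherence

i_clustered = morans_i(clustered, w)
i_alternating = morans_i(alternating, w)
c_clustered = gearys_c(clustered, w)
c_alternating = gearys_c(alternating, w)
```

The spatially clustered pattern gives a clearly positive Moran's I, while the alternating pattern gives a negative one.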
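SpaGCN applies to in situ capturing-based SRT data. It first builds a graph that accounts for both spatial location and histology
information;
it then uses a graph convolutional layer (GCL) to aggregate gene expression information from neighboring spots, giving an aggregated expression matrix (AEM);
a clustering algorithm is applied to the AEM, clustering spots into domains;
DE analysis is then run on each domain to obtain individual SVGs; when a single gene cannot mark a domain, a meta gene, formed from multiple genes, is constructed to represent the expression pattern of the domain.
Domain identification is better than Louvain, stLearn and BayesSpace, with a higher ARI.
SVG detection is better than SpatialDE and SPARK, with better Moran's I and Geary's C statistics, which quantify the spatial autocorrelation of gene expression.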
Application to human primary pancreatic cancer ST data. To
demonstrate the importance of incorporating histology information, we analyzed a human primary pancreatic cancer dataset generated using the ST technology13. This dataset includes 224 spots
and 16,448 genes with three manually annotated tissue regions.
The cancer region detected by Louvain based on gene expression
alone did not closely match the pathologist-annotated cancer region
(Fig. 2a). Spatial clustering methods such as stLearn and BayesSpace
did not detect the cancer region either. SpaGCN revealed a similar pattern when using default parameters. As the histology image
shows a clear difference between the cancer and noncancer regions,
it suggests histology is informative for clustering. SpaGCN has the
flexibility of modeling histology with a scaling parameter s, which
controls the weight given to histology when detecting neighbors
for each spot. By increasing the value of s from 1 to 2, SpaGCN
detected a cluster that agrees well with the manually annotated cancer region. It is worth noting that when s was set at the default value
of 1, SpaGCN detected the noncancer regions well. When s was
increased to 2, SpaGCN not only maintained the ability to detect
the noncancer regions but also detected the cancer region. This
example showed that SpaGCN is flexible in incorporating histology
information in clustering. Although stLearn can incorporate histology data, its use of histology information is pre-fixed by the radius
when defining neighboring spots. The lack of flexibility in adjusting
histology weight led to the discrepancy between their clustering and
the pathologist’s manual annotation.
Next, we detected SVGs using SpaGCN, SPARK and SpatialDE.
In total, SpaGCN detected 12 SVGs, with three, eight and one SVGs
for domains 0, 1 and 2, respectively (Fig. 2b; Supplementary Fig. 1).
Furthermore, a meta gene using KRT17, MMP11 and SERPINA1 marked the cancer region better than the originally identified
KRT17 for domain 2 (Fig. 2c). KRT17 functions as a tumor promoter
and regulates proliferation in pancreatic cancer30, and MMP11 is a
prognostic biomarker for pancreatic cancer31. Our identification of
KRT17 and MMP11 as the two positive genes for the cancer region
agrees well with pancreatic cancer biology. SPARK and SpatialDE
detected 203 and 163 SVGs, with their P or Q values highly skewed
towards 0 (Supplementary Figs. 2 and 3). However, the Moran’s
I and Geary’s C values for their SVGs are much lower than those
detected by SpaGCN, suggesting their lack of spatial patterns
(Fig. 2d). Furthermore, genes with smaller P or Q values do not
necessarily show better spatial expression patterns than those with
larger P or Q values (Supplementary Figs. 4 and 5). More stringent
filtering of spots and genes did not improve the spatial pattern for
SpatialDE and SPARK-detected SVGs (Supplementary Fig. 6).
Application to human primary pancreatic cancer ST data. To demonstrate the importance of incorporating histology information, we analyzed a human primary pancreatic cancer dataset generated using the ST technology. The dataset includes 224 spots and 16,448 genes with three manually annotated tissue regions. The cancer region detected by Louvain based on gene expression alone did not closely match the pathologist-annotated cancer region (Fig. 2a). Spatial clustering methods such as stLearn and BayesSpace did not detect the cancer region either. SpaGCN revealed a similar pattern when using default parameters. However, since the histology image shows a clear difference between the cancer and noncancer regions, histology appears to be informative for clustering. SpaGCN models histology flexibly through a scaling parameter s, which controls the weight given to histology when detecting neighbors for each spot. When s was increased from 1 to 2, SpaGCN detected a cluster that agrees well with the manually annotated cancer region. It is worth noting that when s was set to the default value of 1, SpaGCN already detected the noncancer regions well. When s was increased to 2, SpaGCN not only maintained the ability to detect the noncancer regions but also detected the cancer region. This example shows that SpaGCN is flexible in incorporating histology information in clustering. Although stLearn can incorporate histology data, its use of histology information is pre-fixed by the radius used to define neighboring spots. This lack of flexibility in adjusting the histology weight led to the discrepancy between its clustering and the pathologist's manual annotation.
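One way to picture the role of s is to treat histology as an extra coordinate whose spread is scaled by s before distances between spots are computed. The sketch below assumes each spot's histology is summarized by a single value `z` (for example, derived from pixel colors), standardized to the spread of the spatial coordinates; this summary, the function name and the toy values are assumptions for illustration, not details given in this section.

```python
import numpy as np

def histology_aware_distance(x, y, z, s=1.0):
    """Pairwise distance between spots in (x, y, scaled z) space.

    Assumption: z is one histology value per spot; it is standardized,
    matched to the spread of the spatial coordinates, then multiplied by s.
    Requires z to vary across spots (nonzero standard deviation).
    """
    x, y, z = (np.asarray(v, dtype=float) for v in (x, y, z))
    z_scaled = (z - z.mean()) / z.std() * max(x.std(), y.std()) * s
    pts = np.stack([x, y, z_scaled], axis=1)
    diff = pts[:, None, :] - pts[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Toy data: four spots on a line; the first two share one histology value,
# the last two share another
x = [0.0, 1.0, 2.0, 3.0]
y = [0.0, 0.0, 0.0, 0.0]
z = [0.2, 0.2, 0.8, 0.8]

d1 = histology_aware_distance(x, y, z, s=1.0)
d2 = histology_aware_distance(x, y, z, s=2.0)
```

Raising s leaves the distance between spots with the same histology unchanged (`d1[0, 1]` equals `d2[0, 1]`) but pushes apart physically adjacent spots whose histology differs (`d2[1, 2]` exceeds `d1[1, 2]`), so neighbors are chosen more by histology.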
Next, we detected SVGs using SpaGCN, SPARK and SpatialDE. In total, SpaGCN detected 12 SVGs, with three, eight and one SVGs for domains 0, 1 and 2, respectively (Fig. 2b; Supplementary Fig. 1). Furthermore, a meta gene built from KRT17, MMP11 and SERPINA1 marked the cancer region better than KRT17, the gene originally identified for domain 2 (Fig. 2c). KRT17 functions as a tumor promoter and regulates proliferation in pancreatic cancer, and MMP11 is a prognostic biomarker for pancreatic cancer. Our identification of KRT17 and MMP11 as the two positive genes for the cancer region agrees well with pancreatic cancer biology. SPARK and SpatialDE detected 203 and 163 SVGs, respectively, with P or Q values highly skewed towards 0 (Supplementary Figs. 2 and 3). However, the Moran's I and Geary's C values of their SVGs are much lower than those of the SVGs detected by SpaGCN, suggesting a lack of spatial patterns (Fig. 2d). Furthermore, genes with smaller P or Q values do not necessarily show better spatial expression patterns than those with larger P or Q values (Supplementary Figs. 4 and 5). More stringent filtering of spots and genes did not improve the spatial patterns of SVGs detected by SpatialDE and SPARK (Supplementary Fig. 6).
Application to human dorsolateral prefrontal cortex 10x Visium
data. To show quantitatively that SpaGCN outperforms Louvain,
stLearn and BayesSpace in spatial domain detection, we analyzed
the LIBD human dorsolateral prefrontal cortex (DLPFC) data generated using 10x Visium32. This study sequenced 12 tissue slices that
span six neuronal layers plus white matter from the DLPFC in three
human brains. The manual annotation of the tissue layers provided
by the original study allows us to evaluate the accuracy of spatial
domain detection. Figure 3a shows that for the representative tissue slice 151673, both SpaGCN and BayesSpace revealed spatial
domains that agree better with the manually annotated tissue layers
than Louvain. Although stLearn utilized histology information, its
performance is not much better than Louvain and is substantially
worse than SpaGCN and BayesSpace. The relative performance
of these methods remains the same when considering all 12 slices
(Fig. 3b and Supplementary Table 2); the median ARI is 0.36 for
stLearn, 0.42 for BayesSpace and 0.45 for SpaGCN.
To validate further the identified spatial domains, we detected
SVGs for each domain in slice 151673. In total, SpaGCN detected
67 SVGs, with 53 of them being specific to domain 5, which corresponds to white matter (Supplementary Fig. 7). Patterns of SVGs
for other domains are not very clear. These results indicate that
gene expression profiles of spots from white matter are distinct
from spots in the neuronal layers, while gene expression differences
among the six neuronal layers are much smaller and more difficult to distinguish using individual marker genes. SVGs detected
by SPARK and SpatialDE also suffered from the same problem.
SPARK detected 3,187 SVGs with 1,131 of them having false discovery rate (FDR)-adjusted P values equal to 0, most of which
only marked the white matter region (Supplementary Figs. 8 and
9). We also found that the SVGs detected by SPARK lack domain
specificity (Supplementary Fig. 10). SpatialDE detected 3,654 SVGs
with 806 of them having Q values equal to 0, but these genes do
not necessarily show better spatial patterns than genes with larger
Q values (Supplementary Fig. 11). Although SPARK and SpatialDE
detected much larger numbers of SVGs than SpaGCN, the genes
detected by these two methods cannot distinguish different degrees
of spatial expression variability as their P or Q value distributions
are highly skewed towards 0. Figure 3c shows that the Moran’s I values for SpaGCN-detected SVGs are significantly higher than genes
detected by SpatialDE and SPARK (median of 0.39 for SpaGCN
against 0.09 for SPARK and 0.08 for SpatialDE). More stringent
filtering of spots and genes did not improve the performance of
SpatialDE and SPARK (Supplementary Fig. 12). For three out of the
six neuronal layers, SpaGCN detected a single SVG to mark that
region (Fig. 3d). For example, CAMK2N1 is enriched in domain 0
(layers 1 and 2), PCP4 is enriched in domain 1 (layer 4) and NEFM
is enriched in domain 3 (layer 3).
To show that SpaGCN-detected SVGs are useful for downstream
analysis, we performed K-means clustering on slice 151507, which is from a different brain, using all 67 SVGs detected from slice
151673 by SpaGCN. Compared with manually curated layer assignment, this clustering analysis had an Adjusted Rand Index (ARI) of
0.23 (Fig. 3e,f). We performed similar analysis using SVGs detected
by SpatialDE and SPARK. When randomly selecting 67 SVGs with
0 P or Q value from genes detected by SpatialDE/SPARK, the ARI is
only 0.13 for SpatialDE and 0.14 for SPARK. The ARIs for SpatialDE
and SPARK did not improve even with increased numbers of SVGs
(Fig. 3e). These results further confirmed the lack of spatial patterns
for genes detected by SPARK and SpatialDE.
Although it is difficult to identify single genes to mark certain
neuronal layers, SpaGCN was able to find domain-specific meta
genes. As shown in Fig. 3g, SpaGCN detected meta genes for
domains 1, 2, 4 and 6. The meta gene for domain 2 is specific to layer 1. As layer 1 only has a few spots, it is difficult to find a highly
enriched gene. However, by adding depleted genes such as FTH1,
MBP, MT-CO3 and PLP1, the expression pattern in this region is
strengthened. Furthermore, the SVGs and meta genes detected by
SpaGCN are transferrable to slice 151507 obtained from a different brain, in which the meta genes detected in slice 151673 mark
the same layers in slice 151507 (Fig. 3g and Supplementary Fig. 13).
Application to human dorsolateral prefrontal cortex (DLPFC) 10x Visium data. To show quantitatively that SpaGCN outperforms Louvain, stLearn and BayesSpace in spatial domain detection, we analyzed the LIBD human DLPFC data generated using 10x Visium. This study sequenced 12 tissue slices that span six neuronal layers plus white matter from the DLPFC in three human brains. The manual annotation of the tissue layers provided by the original study allows us to evaluate the accuracy of spatial domain detection. Figure 3a shows that for the representative tissue slice 151673, both SpaGCN and BayesSpace revealed spatial domains that agree better with the manually annotated tissue layers than Louvain. Although stLearn uses histology information, its performance is not much better than Louvain and is substantially worse than SpaGCN and BayesSpace. The relative performance of these methods remains the same when all 12 slices are considered (Fig. 3b and Supplementary Table 2); the median ARI is 0.36 for stLearn, 0.42 for BayesSpace and 0.45 for SpaGCN.
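The ARI values quoted here compare a clustering with the manual layer annotation. As a reminder of what the score measures, the sketch below computes the Adjusted Rand Index from its standard pair-counting definition using only the Python standard library; the labels are toy values, not data from the study.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index from the contingency table of two labelings."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare; treat as trivial agreement
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)   # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings have one cluster
    return (sum_cells - expected) / (max_index - expected)

# Toy annotation with three regions, and two candidate clusterings
truth = ["L1", "L1", "L2", "L2", "WM", "WM"]
perfect = [2, 2, 0, 0, 1, 1]    # same partition, different label names
partial = [0, 0, 0, 1, 1, 1]    # merges and splits the middle region

ari_perfect = adjusted_rand_index(truth, perfect)
ari_partial = adjusted_rand_index(truth, partial)
```

The score is 1 for identical partitions regardless of label names and falls toward 0 as agreement approaches chance level, which is why it suits comparisons against manual annotations with arbitrary cluster numbering.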
To further validate the identified spatial domains, we detected SVGs for each domain in slice 151673. In total, SpaGCN detected 67 SVGs, 53 of which were specific to domain 5, which corresponds to white matter (Supplementary Fig. 7). The patterns of SVGs for other domains are less clear. These results indicate that the gene expression profiles of spots in white matter are distinct from those of spots in the neuronal layers, while gene expression differences among the six neuronal layers are much smaller and more difficult to distinguish using individual marker genes. SVGs detected by SPARK and SpatialDE suffered from the same problem. SPARK detected 3,187 SVGs, of which 1,131 had false discovery rate (FDR)-adjusted P values equal to 0, and most of these only marked the white matter region (Supplementary Figs. 8 and 9). We also found that the SVGs detected by SPARK lack domain specificity (Supplementary Fig. 10). SpatialDE detected 3,654 SVGs, of which 806 had Q values equal to 0, but these genes do not necessarily show better spatial patterns than genes with larger Q values (Supplementary Fig. 11). Although SPARK and SpatialDE detected many more SVGs than SpaGCN, the genes detected by these two methods cannot distinguish different degrees of spatial expression variability because their P or Q value distributions are highly skewed towards 0. Figure 3c shows that the Moran's I values of SpaGCN-detected SVGs are significantly higher than those of genes detected by SpatialDE and SPARK (median of 0.39 for SpaGCN against 0.09 for SPARK and 0.08 for SpatialDE). More stringent filtering of spots and genes did not improve the performance of SpatialDE and SPARK (Supplementary Fig. 12). For three of the six neuronal layers, SpaGCN detected a single SVG to mark the region (Fig. 3d). For example, CAMK2N1 is enriched in domain 0 (layers 1 and 2), PCP4 in domain 1 (layer 4) and NEFM in domain 3 (layer 3).
To show that SpaGCN-detected SVGs are useful for downstream analysis, we performed K-means clustering on slice 151507, which is from a different brain, using all 67 SVGs detected by SpaGCN from slice 151673. Compared with the manually curated layer assignment, this clustering analysis had an Adjusted Rand Index (ARI) of 0.23 (Fig. 3e,f). We performed a similar analysis using SVGs detected by SpatialDE and SPARK. When randomly selecting 67 SVGs with a P or Q value of 0 from the genes detected by SpatialDE/SPARK, the ARI is only 0.13 for SpatialDE and 0.14 for SPARK. The ARIs for SpatialDE and SPARK did not improve even with increased numbers of SVGs (Fig. 3e). These results further confirm the lack of spatial patterns in genes detected by SPARK and SpatialDE.
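The transfer experiment amounts to restricting a new slice's expression matrix to a gene list chosen elsewhere and then clustering the spots. The sketch below illustrates that idea on toy data with a plain Lloyd's K-means. The deterministic farthest-point initialization, the function names and the matrix values are assumptions made so the example is reproducible; they are not the settings used in the study.

```python
import numpy as np

def farthest_point_init(data, k):
    """Deterministic initialization: start at the first spot, then repeatedly
    add the spot farthest from all centers chosen so far."""
    centers = [data[0]]
    for _ in range(k - 1):
        d2 = np.min([((data - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(data[d2.argmax()])
    return np.array(centers)

def kmeans(data, k, n_iter=50):
    """Plain Lloyd's algorithm; returns one cluster label per spot."""
    data = np.asarray(data, dtype=float)
    centers = farthest_point_init(data, k)
    for _ in range(n_iter):
        dist = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

# Toy slice: 6 spots x 4 genes. Columns 0 and 1 stand in for transferred SVGs;
# columns 2 and 3 are uninformative.
expr = np.array([
    [5, 4, 1, 9],
    [6, 5, 8, 2],
    [5, 5, 3, 7],
    [0, 1, 9, 1],
    [1, 0, 2, 8],
    [0, 0, 7, 3],
], dtype=float)
svg_idx = [0, 1]   # hypothetical gene list selected on another slice

labels = kmeans(expr[:, svg_idx], k=2)
```

Clustering on the two informative columns separates the first three spots from the last three; the resulting labels could then be scored against an annotation with ARI.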
Although it is difficult to identify single genes that mark certain neuronal layers, SpaGCN was able to find domain-specific meta genes. As shown in Fig. 3g, SpaGCN detected meta genes for domains 1, 2, 4 and 6. The meta gene for domain 2 is specific to layer 1. As layer 1 has only a few spots, it is difficult to find a highly enriched gene. However, adding depleted genes such as FTH1, MBP, MT-CO3 and PLP1 strengthens the expression pattern in this region. Furthermore, the SVGs and meta genes detected by SpaGCN are transferable to slice 151507, obtained from a different brain, in which the meta genes detected in slice 151673 mark the same layers (Fig. 3g and Supplementary Fig. 13).
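The idea of strengthening a weak marker with depleted genes can be sketched as a signed combination of expression vectors. The rule used below (add enriched genes, subtract depleted genes, shift so the minimum is zero) is an assumption for illustration; this section only says that a meta gene combines multiple genes. The gene names and values are hypothetical toy data.

```python
import numpy as np

def meta_gene(expr, positive, negative=()):
    """Combine genes into one score per spot.

    Assumption: enriched (positive) genes are added, depleted (negative)
    genes are subtracted, and the result is shifted to be non-negative.
    `expr` maps a gene name to its per-spot expression values.
    """
    n_spots = len(next(iter(expr.values())))
    score = np.zeros(n_spots)
    for gene in positive:
        score += np.asarray(expr[gene], dtype=float)
    for gene in negative:
        score -= np.asarray(expr[gene], dtype=float)
    return score - score.min()

# Toy data: spots 0 and 1 form the target domain
expr = {
    "gene_a": [3, 2, 2, 1, 1, 1],   # weakly enriched in the target domain
    "gene_b": [0, 0, 3, 3, 2, 3],   # depleted in the target domain
}

single = np.asarray(expr["gene_a"], dtype=float)
meta = meta_gene(expr, positive=["gene_a"], negative=["gene_b"])
```

With `gene_a` alone, a spot outside the domain reaches the same value as one inside it, so the domain is not cleanly marked. After subtracting the depleted `gene_b`, every in-domain spot scores higher than every spot outside.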
Application to mouse posterior brain 10x Visium data. Next,
we analyzed a 10x Visium dataset generated from mouse posterior brain that includes 3,353 spots and 31,053 genes33. This dataset shows much more complex tissue structure than the previous
two datasets. We compared the clustering result of SpaGCN with
Louvain, stLearn and BayesSpace when the number of clusters was
set at ten for all methods. Figure 4a shows that Louvain’s clustering is similar to stLearn, BayesSpace and SpaGCN, but the spatial
domains detected by the latter three methods are more spatially
contiguous due to their ability to account for spatial dependency of
gene expression.
We further investigated the ability of each method in detecting
more refined tissue structure. Specifically, we performed subclustering analysis for spots in domain 5 detected by SpaGCN, which
corresponds to the cortex (Fig. 4b). The subdomains detected by
SpaGCN agree well with the Allen Brain Institute reference atlas
diagram of the mouse cortex (Fig. 4c). The detected subdomains
include layers 2/3, layers 4/5, layer 6, a hippocampal region (CA1)
and the subiculum. Layers 2/3 are the ‘external’ cortical layers that
are biologically responsible for local networks in which neurons in
this subdomain communicate to other neurons in adjacent neocortical regions. Layers 4/5 are the ‘internal’ cortical layers that are biologically responsible for longer range neural networks. For example,
the visual cortex, which corresponds to the neocortical region, is
responsible for receiving visual information from the lateral geniculate nucleus that is far away. SpaGCN was able to separate the
molecular (layer 1), external (layers 2/3), internal (layers 4/5) and
the plexiform (6) layers. More importantly, SpaGCN outperformed
Louvain and stLearn, both of which merged neocortical layers.
SpaGCN also outperformed BayesSpace in distinguishing between
the plexiform layer (subdomain 1) and the non-neocortical CA1
region of the hippocampus (subdomain 3). In contrast, BayesSpace
combined layer 6 of the neocortex with the non-neocortical CA1
layer of the hippocampus.
Next, we compared SpaGCN with SPARK and SpatialDE for
SVG detection. SpaGCN detected 1,028 SVGs for the ten spatial
domains while SPARK and SpatialDE detected 9,678 and 12,676
SVGs, respectively (Supplementary Fig. 14). As shown in Fig. 4d,
the Moran’s I values of SpaGCN-detected SVGs are much higher
than those detected by SPARK and SpatialDE (median of 0.54 for
SpaGCN against 0.20 for SPARK and 0.16 for SpatialDE). More
stringent filtering of spots and genes did not improve the performance of SPARK and SpatialDE (Supplementary Fig. 15). The
P or Q value distributions of SpatialDE and SPARK are highly skewed towards 0 (Supplementary Fig. 16), and genes with similar P or Q values do not necessarily show similar spatial patterns
and a smaller P or Q value does not guarantee a better spatial pattern (Supplementary Figs. 17 and 18). In contrast, multiple domain
adaptive filtering criteria implemented in SpaGCN allow it to eliminate false positive SVGs and ensure all detected SVGs have clear
spatial expression patterns.
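Moran's I, used above to compare the spatial autocorrelation of the detected SVGs, can be sketched as follows. The binary neighborhood weights (1 for spots within a given radius, 0 otherwise) are an assumption made for illustration; the weighting scheme used in the paper is not specified in this section, and the grid data are hypothetical.

```python
from math import dist


def morans_i(coords, values, radius):
    """Moran's I with binary weights: w_ij = 1 if spots i and j lie within `radius`."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    denom = sum(d * d for d in dev)
    if denom == 0:
        raise ValueError("expression is constant; Moran's I is undefined")

    num = 0.0      # sum over neighbor pairs of dev_i * dev_j
    w_total = 0.0  # sum of all weights
    for i in range(n):
        for j in range(n):
            if i != j and dist(coords[i], coords[j]) <= radius:
                num += dev[i] * dev[j]
                w_total += 1.0
    if w_total == 0:
        raise ValueError("no pair of spots lies within the radius")
    return (n / w_total) * (num / denom)


# Toy 4 x 4 grid of spots with unit spacing (hypothetical data).
grid = [(x, y) for x in range(4) for y in range(4)]
striped = [1.0 if x < 2 else 0.0 for x, y in grid]   # one contiguous domain
checker = [float((x + y) % 2) for x, y in grid]      # alternating pattern

print(morans_i(grid, striped, radius=1.0))  # positive, about 0.67
print(morans_i(grid, checker, radius=1.0))  # about -1
```

A gene expressed in one contiguous region gives a clearly positive value, whereas an alternating pattern gives a negative one, which is the sense in which higher Moran's I indicates a clearer spatial pattern.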
To illustrate how the filtering in SpaGCN works, we use domains
1, 5 and 8 as an example. For each of these domains, SpaGCN
detected a single SVG enriched in that region. As shown in Fig. 4e,
PVALB is enriched in domain 1 and TRM62 is enriched in domain
8. Although domains 1 and 8 are adjacent to each other, these
two SVGs still mark these domains well. NRGN is an SVG that
SpaGCN detected for domains 5 and 7. The high expression of
NRGN in domains 5 and 7 also indicates that these two domains are
neuroanatomically similar—both consisting of cortex and the pyramidal layer of the hippocampus. Both the cortex and hippocampus
are regions that are on the curved surface of the brain. Domains
5 and 7, which would be contiguous in a three-dimensional (3D)
reconstruction, are artifactually separated as a result of how the section was cut. Therefore, it is not surprising that in addition to NRGN,
SpaGCN also detected many other SVGs for domains 5 and 7, some
of which are highly expressed in both domains (Supplementary
Fig. 19). The unique and powerful SVG detection procedure in
SpaGCN ensures that genes such as these are not missed.
SpaGCN only identified four SVGs for domain 0. However, we
reason that a meta gene, formed by the combination of multiple
genes, may better reveal spatial patterns than any single gene. We
used domain 0 as an example to show how SpaGCN can create
informative meta genes to mark a spatial domain (Fig. 4f). First, by
lowering the filtering thresholds, SpaGCN identified KLK6 which
is highly expressed in the lower part of domain 0. Using KLK6 as a
starting gene, SpaGCN used a novel approach to find a log-linear
combination of gene expression of KLK6, MBP and ATP1B1, which
accurately marked the spatial domain 0. In this meta gene, KLK6
and MBP are considered as positive markers because they are
highly expressed in some spots in domain 0, whereas ATP1B1 is
considered a negative marker as it is mainly expressed in regions
other than domain 0. Previous studies have shown that KLK6 and
MBP expression is restricted to oligodendrocytes, while ATP1B1 is
mainly expressed in neurons and astrocytes (ref. 34). This resonates with
the fact that domain 0 represents white matter which is dominated
by oligodendrocytes and has few neuronal cell bodies. Therefore,
the genes that make up this meta gene have meaningful biological
interpretations.
While we focused our analyses on one tissue section, SpaGCN
can also jointly analyze multiple tissue sections. We show two examples using this mouse brain Visium data provided by 10x Genomics.
Figure 5a shows SpaGCN clustering results for two mouse posterior sections. As these two tissue sections are from the same region,
SpaGCN was able to infer cluster correspondence between the two
tissue sections. Next, we used SpaGCN to analyze jointly two tissue sections with one from the mouse posterior brain and the other
from the mouse anterior brain. As the anterior section and posterior section are adjacent in the brain, we modified the coordinates
for spots in the posterior section such that the revised coordinates
reflect the spatial adjacency of the two tissue sections. Using the
modified coordinates as input, SpaGCN was able to produce clustering results that reflect the shared layer structure in the anterior
and posterior brain (Fig. 5b).
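The coordinate modification described above can be illustrated with a simple translation. This is only a sketch under the assumption that the two sections are placed side by side along the x axis with a chosen gap; the exact transformation used in the paper is not given in this section, and the function name, coordinates and gap value are hypothetical.

```python
def place_side_by_side(coords_a, coords_b, gap):
    """Translate section B along x so that it starts `gap` units to the right of section A."""
    max_x_a = max(x for x, _ in coords_a)
    min_x_b = min(x for x, _ in coords_b)
    shift = max_x_a + gap - min_x_b
    return [(x + shift, y) for x, y in coords_b]


# Hypothetical spot coordinates for two sections, each in its own image frame.
anterior = [(0.0, 0.0), (10.0, 5.0)]
posterior = [(3.0, 1.0), (8.0, 4.0)]

posterior_shifted = place_side_by_side(anterior, posterior, gap=2.0)
print(posterior_shifted)  # [(12.0, 1.0), (17.0, 4.0)]
```

After the shift, distances between spots of different sections reflect the assumed adjacency, so a graph built from the combined coordinates connects spots across the section boundary.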
Next, we analyzed the 10x Visium dataset obtained from the mouse posterior brain, which contains 3,353 spots and 31,053 genes. This dataset exhibits a more complex tissue structure than the previous two datasets. We compared the clustering results of SpaGCN with those of Louvain, stLearn and BayesSpace, setting the number of clusters to ten for all methods. Figure 4a shows that the clustering results of Louvain are similar to those of stLearn, BayesSpace and SpaGCN, but the spatial domains detected by the latter three methods are more spatially contiguous because they account for the spatial dependence of gene expression.
We further investigated the ability of each method to detect finer tissue structures. Specifically, we subclustered the spots in domain 5 detected by SpaGCN, which corresponds to the cortex (Fig. 4b). The subdomains detected by SpaGCN were in good agreement with the Allen Brain Institute reference atlas diagram of the mouse cortex (Fig. 4c). The detected subdomains included layers 2/3, layers 4/5, layer 6, a hippocampal region (CA1) and the subiculum. Layers 2/3 are the "external" cortical layers, biologically responsible for local networks in which neurons in this subdomain communicate with other neurons in adjacent neocortical regions. Layers 4/5 are the "internal" cortical layers, biologically responsible for longer-range neural networks. For example, the visual cortex, which corresponds to the neocortical region, is responsible for receiving visual information from the distant lateral geniculate nucleus. SpaGCN was able to separate the molecular layer (layer 1), the external layers (layers 2/3), the internal layers (layers 4/5) and the plexiform layer (layer 6). More importantly, SpaGCN outperformed Louvain and stLearn, both of which merged neocortical layers. SpaGCN also outperformed BayesSpace in distinguishing between the plexiform layer (subdomain 1) and the non-neocortical CA1 region of the hippocampus (subdomain 3). In contrast, BayesSpace combined layer 6 of the neocortex with the non-neocortical CA1 layer of the hippocampus.
Next, we compared SpaGCN with SPARK and SpatialDE for SVG detection. SpaGCN detected 1,028 SVGs for the ten spatial domains, while SPARK and SpatialDE detected 9,678 and 12,676 SVGs, respectively (Supplementary Fig. 14). As shown in Figure 4d, the Moran's I values of the SVGs detected by SpaGCN are much higher than those detected by SPARK and SpatialDE (median of 0.54 for SpaGCN, 0.20 for SPARK, and 0.16 for SpatialDE). Stricter filtering of spots and genes did not improve the performance of SPARK and SpatialDE (Supplementary Fig. 15). The distributions of P or Q values for SpatialDE and SPARK are highly skewed towards 0 (Supplementary Fig. 16), genes with similar P or Q values do not necessarily show similar spatial patterns, and a smaller P or Q value does not guarantee a better spatial pattern (Supplementary Figs. 17 and 18). In contrast, the multiple domain-adaptive filtering criteria implemented in SpaGCN enable it to eliminate false positive SVGs and ensure that all detected SVGs have clear spatial expression patterns.
To illustrate how filtering in SpaGCN works, we take domains 1, 5 and 8 as examples. For each of these domains, SpaGCN detected a single SVG enriched in that region. As shown in Figure 4e, PVALB was enriched in domain 1 and TRM62 in domain 8. Even though domains 1 and 8 are adjacent to each other, these two SVGs still mark the domains well. NRGN is an SVG that SpaGCN detected for domains 5 and 7. The high expression of NRGN in domains 5 and 7 also suggests that these two domains are neuroanatomically similar, both consisting of cortex and the pyramidal layer of the hippocampus. Both the cortex and the hippocampus are regions that lie on the curved surface of the brain. Domains 5 and 7, which would be contiguous in a three-dimensional (3D) reconstruction, are artifactually separated as a result of how the section was cut. It is therefore not surprising that, in addition to NRGN, SpaGCN detected many other SVGs for domains 5 and 7, some of which are highly expressed in both domains (Supplementary Fig. 19). The SVG detection procedure in SpaGCN ensures that genes such as these are not missed.
SpaGCN identified only four SVGs for domain 0. However, we reason that a metagene, formed by the combination of multiple genes, may reveal spatial patterns better than any single gene. Using domain 0 as an example, we show how SpaGCN creates informative metagenes to mark a spatial domain (Fig. 4f). First, by lowering the filtering thresholds, SpaGCN identified KLK6, which is highly expressed in the lower part of domain 0. Using KLK6 as the starting gene, SpaGCN employed a novel approach to find a log-linear combination of the expression of KLK6, MBP and ATP1B1, which accurately marks spatial domain 0. In this metagene, KLK6 and MBP are considered positive markers because they are highly expressed in some spots in domain 0, whereas ATP1B1 is considered a negative marker because it is mainly expressed in regions other than domain 0. Previous studies have shown that the expression of KLK6 and MBP is restricted to oligodendrocytes, whereas ATP1B1 is mainly expressed in neurons and astrocytes. This is consistent with the fact that domain 0 represents white matter, which is dominated by oligodendrocytes and has few neuronal cell bodies. Therefore, the genes that make up this metagene have meaningful biological interpretations.
Although our analyses focused on a single tissue section, SpaGCN can also jointly analyze multiple tissue sections. We show two examples using this mouse brain Visium data provided by 10x Genomics. Figure 5a shows the SpaGCN clustering results for two mouse posterior brain sections. Since these two tissue sections are from the same region, SpaGCN was able to infer the cluster correspondence between them. Next, we used SpaGCN to jointly analyze two tissue sections, one from the mouse posterior brain and the other from the mouse anterior brain. Since the anterior and posterior sections are adjacent in the brain, we modified the coordinates of the spots in the posterior section so that the revised coordinates reflect the spatial adjacency of the two tissue sections. Using the modified coordinates as input, SpaGCN was able to produce clustering results reflecting the layer structure shared by the anterior and posterior brain (Fig. 5b).
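The metagene for domain 0 described above combines positive and negative marker genes on the log scale. As a minimal sketch, we assume here that the log-linear combination adds the log-transformed expression of the positive markers, subtracts that of the negative markers, and shifts the result by a constant so that it is non-negative; the exact formula is not given in this section. The expression values below are invented for illustration only.

```python
def meta_gene(log_expr, positive, negative):
    """Combine log-scale expression: sum of positive markers minus negative markers.

    log_expr maps a gene name to its log-transformed expression per spot.
    The result is shifted so that its minimum over spots is zero (assumed constant C).
    """
    n_spots = len(next(iter(log_expr.values())))
    raw = [
        sum(log_expr[g][s] for g in positive) - sum(log_expr[g][s] for g in negative)
        for s in range(n_spots)
    ]
    offset = min(raw)
    return [value - offset for value in raw]


# Hypothetical log-expression for four spots; the first two lie in domain 0.
toy_expr = {
    "KLK6":   [2.0, 0.5, 0.0, 0.0],
    "MBP":    [1.0, 2.0, 0.5, 0.0],
    "ATP1B1": [0.0, 0.5, 2.0, 1.5],
}

domain0_meta = meta_gene(toy_expr, positive=["KLK6", "MBP"], negative=["ATP1B1"])
print(domain0_meta)  # [4.5, 3.5, 0.0, 0.0]
```

In this toy case the second spot has low KLK6 but is still marked by the metagene because MBP is high and ATP1B1 is low, which mirrors why a combination can mark a domain better than any single gene.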
Application to mouse visual cortex STARmap data. Finally, we
analyzed a STARmap dataset that has single-cell resolution (ref. 7). This
dataset was generated from mouse visual cortex that spans from
hippocampus to corpus callosum, and the six neocortical layers. In total, 1,020 genes were measured in 1,207 cells that include
non-neuronal cells, excitatory and inhibitory neurons. The layer
structure and cell type distribution of the tissue section provided
by the original study are shown in Fig. 6a. As the tissue capture area
of STARmap is much smaller than 10x Visium, we increased the
contribution of neighboring cells from 0.5 to 1 when calculating
the weighted gene expression of each cell in SpaGCN. Using this
approach, SpaGCN detected spatial domains that agreed well with
the annotated tissue structure (Fig. 6a,c), achieving an ARI of 0.51.
By contrast, the ARIs of the other methods are much lower (0.30 for
Louvain, 0.37 for BayesSpace and 0.03 for HMRF) (Fig. 6b). This
example demonstrates that SpaGCN utilizes spatial information
more efficiently than BayesSpace and HMRF. Using SpaGCN, we
further detected 25 SVGs including genes LAMP5, HPCAL1, CPLX1,
PLP1, NRSN1, ATP1A2 and BSG that showed enriched expression
patterns for domains 0 to 6 (Fig. 6e and Supplementary Fig. 20).
Similar to previous analyses, SPARK and SpatialDE detected a much
larger number of SVGs but many of the SVGs lack spatial expression
patterns (Fig. 6d and Supplementary Figs. 21–24).
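The change in the contribution of neighboring cells from 0.5 to 1 can be made concrete with a small sketch. We assume here that edge weights follow a Gaussian kernel of the distance with a length scale l, that each cell has a self-weight of 1, and that the contribution of neighbors is the average total weight of all other cells; l is then found by bisection. These modeling details and all names below are assumptions for illustration, not a description of the SpaGCN implementation.

```python
from math import dist, exp


def neighbor_contribution(coords, length_scale):
    """Average total weight of the other cells, relative to a self-weight of 1."""
    n = len(coords)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                d = dist(coords[i], coords[j])
                total += exp(-d * d / (2 * length_scale ** 2))
    return total / n


def find_length_scale(coords, target, low=1e-3, high=1e3, tol=1e-6, max_iter=200):
    """Bisection on the length scale; the contribution grows monotonically with it."""
    if not neighbor_contribution(coords, low) <= target <= neighbor_contribution(coords, high):
        raise ValueError("target contribution is outside the searchable range")
    for _ in range(max_iter):
        mid = (low + high) / 2
        p = neighbor_contribution(coords, mid)
        if abs(p - target) < tol:
            return mid
        if p < target:
            low = mid
        else:
            high = mid
    return (low + high) / 2


# Hypothetical 5 x 5 grid of cells with unit spacing.
cells = [(float(x), float(y)) for x in range(5) for y in range(5)]
l_half = find_length_scale(cells, target=0.5)
l_one = find_length_scale(cells, target=1.0)
print(l_half, l_one)
```

Under these assumptions, a larger target contribution corresponds to a larger length scale, so each cell borrows more expression from its surroundings.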
Finally, we analyzed a STARmap dataset with single-cell resolution. The dataset was obtained from mouse visual cortex, spanning the hippocampus to the corpus callosum and the six neocortical layers. In total, 1,020 genes were measured in 1,207 cells, including non-neuronal cells, excitatory neurons, and inhibitory neurons. The layer structure and cell type distribution of the tissue section provided by the original study are shown in Fig. 6a. Since the tissue capture area of STARmap is much smaller than that of 10x Visium, we increased the contribution of neighboring cells from 0.5 to 1 when calculating the weighted gene expression of each cell in SpaGCN. Using this approach, the spatial domains detected by SpaGCN agreed well with the annotated tissue structure (Fig. 6a,c), with an ARI of 0.51. In contrast, the ARIs of the other methods are much lower (0.30 for Louvain, 0.37 for BayesSpace, and 0.03 for HMRF) (Fig. 6b). This example demonstrates that SpaGCN utilizes spatial information more efficiently than BayesSpace and HMRF. Using SpaGCN, we further detected 25 SVGs, including the genes LAMP5, HPCAL1, CPLX1, PLP1, NRSN1, ATP1A2, and BSG, which showed enriched expression patterns for domains 0 to 6 (Fig. 6e and Supplementary Fig. 20). Similar to previous analyses, SPARK and SpatialDE detected a much larger number of SVGs, but many of them lacked spatial expression patterns (Fig. 6d and Supplementary Figs. 21–24).
Discussion
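In this paper, we presented SpaGCN, a method that integrates gene expression, spatial location and histology to model the spatial dependency of gene expression for identifying spatial domains and domain-enriched SVGs. SpaGCN has been extensively tested on datasets from different species, regions and tissues generated using different SRT technologies. Additional analyses of ST, SLIDE-seqV2 and MERFISH data are presented in Supplementary Notes 1–3. Our results consistently show that SpaGCN can identify spatial domains with coherent gene expression and histology, and
detect SVGs and meta genes that have much clearer spatial expression patterns and biological interpretations than genes detected by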
SpatialDE and SPARK. Additionally, the SpaGCN-detected SVGs
are transferrable and can be utilized for downstream analyses in
independent tissue sections. SpaGCN is also computationally
fast and memory efficient compared to SPARK and SpatialDE
(Supplementary Note 4).
The spatial domain detection step in SpaGCN is flexible. First,
SpaGCN can adjust the weight of histology in gene expression
smoothing. For datasets with clear tissue structure in histology,
higher weight led to clearer separation of cancer versus noncancer regions. Second, during the GCN fitting procedure, the graph
weights are updated, which allows SpaGCN to learn an efficient way
to aggregate gene expression from neighboring spots for each gene.
For data generated from different platforms, the spatial dependency
between spots/cells is different as the size of the captured tissue
area varies. The flexibility in modeling spatial dependency makes
SpaGCN versatile for different types of SRT data.
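The adjustable weight of histology mentioned above can be sketched by treating a histology-derived value as a third coordinate whose scale is controlled by a factor s. The rescaling below (standardize the histology value, then multiply by the larger of the x and y standard deviations and by s) is an assumption made for illustration, as are the function name and the toy values; this section does not spell out the formula.

```python
from math import dist
from statistics import mean, pstdev


def rescale_histology(xs, ys, zs, s):
    """Standardize histology values zs and scale them to the spatial range times s."""
    mu, sigma = mean(zs), pstdev(zs)
    if sigma == 0:
        return [0.0 for _ in zs]  # uniform histology carries no information
    scale = max(pstdev(xs), pstdev(ys)) * s
    return [(z - mu) / sigma * scale for z in zs]


# Hypothetical spots: spots 0 and 1 share one histology value, spots 2 and 3 another.
xs = [0.0, 0.0, 10.0, 10.0]
ys = [0.0, 10.0, 0.0, 10.0]
zs = [1.0, 1.0, 3.0, 3.0]

for s in (1.0, 2.0):
    z_star = rescale_histology(xs, ys, zs, s)
    # Distance between spot 0 and spot 2, which differ in histology.
    d = dist((xs[0], ys[0], z_star[0]), (xs[2], ys[2], z_star[2]))
    print(s, z_star, d)
```

With a larger s, spots that differ in histology move further apart in the combined space, so they contribute less to each other's smoothed expression.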
A limitation of SpaGCN is that the spatial domain detection is
mainly driven by gene expression, which may lead to the discrepancy
between the detected domains and the underlying tissue anatomical structure. This is a general problem for gene expression-based
clustering methods. Another limitation of SpaGCN is the lack of
separation of spatial variation and cell type variation in gene expression patterns for the detected SVGs. To address these limitations,
methods that can jointly consider gene expression and histological
features in clustering are needed. Further, cell type-specific gene
expression needs to be estimated to tease out the contribution of cell
types and spatial location in gene expression variation. We anticipate that methods development along these directions is warranted
for future research.
In this paper, we introduce SpaGCN, a method that integrates gene expression, spatial location, and histology information to model the spatial dependence of gene expression and to identify spatial domains and domain-enriched SVGs. SpaGCN has been extensively tested on datasets from different species, regions and tissues generated using different SRT techniques. Additional analyses of ST, SLIDE-seqV2, and MERFISH data are presented in Supplementary Notes 1–3. Our results consistently show that SpaGCN is able to identify spatial domains with consistent gene expression and histology, and to detect SVGs and metagenes with clearer spatial expression patterns and biological interpretations than genes detected by SpatialDE and SPARK. Furthermore, SVGs detected by SpaGCN are transferable and can be used for downstream analysis in independent tissue sections. Compared with SPARK and SpatialDE, SpaGCN is also computationally fast and memory efficient (Supplementary Note 4).
The spatial domain detection step in SpaGCN is flexible. First, SpaGCN can adjust the weight of histology in gene expression smoothing. For datasets with clear tissue structure in histology, a higher weight led to a clearer separation of cancer from noncancer regions. Second, during the GCN fitting process, the graph weights are updated, which allows SpaGCN to learn an efficient way to aggregate gene expression from neighboring spots for each gene. For data generated from different platforms, the spatial dependency between spots/cells differs because the size of the captured tissue area varies. The flexibility in modeling spatial dependency makes SpaGCN suitable for different types of SRT data.
A limitation of SpaGCN is that spatial domain detection is mainly driven by gene expression, which may lead to discrepancies between the detected domains and the underlying tissue anatomy. This is a general problem for gene expression-based clustering methods. Another limitation of SpaGCN is the lack of separation between spatial variation and cell type variation in the gene expression patterns of the detected SVGs. To address these limitations, methods that can jointly consider gene expression and histological features in clustering need to be developed. Furthermore, cell type-specific gene expression needs to be estimated to distinguish the contributions of cell type and spatial location to gene expression variation. We anticipate that future research will require further method development along these directions.