SpaGCN: Integrating gene expression, spatial location and histology to identify spatial domains an

Title

SpaGCN: Integrating gene expression, spatial location and histology to identify spatial domains and spatially variable genes by graph convolutional network

SpaGCN is a method for identifying spatial domains and spatially variable genes by integrating gene expression, spatial location, and histological information through graph convolutional networks.

In SpaGCN, we combine gene expression, spatial location, and histological information to build a graph to represent the relationship between all points in the data. Through graph convolutional layers, SpaGCN can aggregate gene expression information from neighboring points. Then, SpaGCN utilizes the aggregated expression matrix to cluster the points using an unsupervised iterative clustering algorithm, considering each cluster as a spatial domain. Next, SpaGCN detects spatially variable genes enriched in specific domains by differential expression analysis.

The key strength of SpaGCN is that it comprehensively considers gene expression, spatial location, and histology information, thereby enabling the identification of spatial domains with consistent gene expression and histology and the detection of spatially variable genes with clear spatial expression patterns. Compared with other methods, the spatially variable genes detected by SpaGCN have better biological interpretation and transferability, which can be used for further research and analysis.

All in all, SpaGCN provides a powerful tool for spatial transcriptomics research by integrating data from different information sources and taking advantage of graph convolutional networks. It can reveal the spatial variation of gene expression in the tissue microenvironment and provide important clues for further understanding cellular mechanisms and disease pathology.

Abstract

Recent advances in spatially resolved transcriptomics (SRT) technologies have enabled comprehensive characterization of gene expression patterns in the context of tissue microenvironment. To elucidate spatial gene expression variation, we present SpaGCN, a graph convolutional network approach that integrates gene expression, spatial location and histology in SRT data analysis. Through graph convolution, SpaGCN aggregates gene expression of each spot from its neighboring spots, which enables the identification of spatial domains with coherent expression and histology. The subsequent domain-guided differential expression (DE) analysis then detects genes with enriched expression patterns in the identified domains. Analyzing seven SRT datasets using SpaGCN, we show it can detect genes with much more enriched spatial expression patterns than competing methods. Furthermore, genes detected by SpaGCN are transferrable and can be utilized to study spatial variation of gene expression in other datasets. SpaGCN is computationally fast and platform independent, making it a desirable tool for diverse SRT studies.

Significant advances have recently been made in Spatially Resolved Transcriptomics (SRT) techniques, which allow us to comprehensively describe gene expression patterns in tissue microenvironments. To elucidate spatial variation in gene expression, we propose SpaGCN, a graph convolutional network approach that integrates gene expression, spatial location, and histology into the analysis of SRT data. Through graph convolution, SpaGCN combines the gene expression of each point with that of its neighbors, enabling the identification of spatial regions with consistent expression and histology. Subsequent region-guided differential expression (DE) analysis can detect genes with enriched expression patterns in defined regions. By analyzing seven SRT datasets using SpaGCN, we show that it is able to detect genes with more enriched spatial expression patterns than other competing methods. Furthermore, the genes detected by SpaGCN are transferable and can be used to study the spatial variation of gene expression in other datasets. SpaGCN is computationally fast and platform-independent, making it an ideal tool for various SRT studies.

Introduction

Recent technological advances in SRT have enabled gene expression profiling with spatial information in tissues [1]. Knowledge of the relative locations of different cells in a tissue is critical for understanding disease pathology because spatial information helps in understanding how the gene expression of a cell is influenced by its surrounding environment. Popular experimental methods for SRT can be broadly classified into two categories. The first category is in situ hybridization or sequencing-based technologies with single-cell resolution, including seqFISH [2,3], seqFISH+ [4], MERFISH [5,6], STARmap [7] and FISSEQ [8], which measure the expression level for hundreds to thousands of genes in cells within their tissue context. The second category is in situ capturing-based technologies with spatial barcoding followed by sequencing, including spatial transcriptomics (ST) [9], SLIDE-seq [10], SLIDE-seqV2 [11], HDST [12] and 10x Visium, which measure the expression level for thousands of genes in captured locations, referred to as spots. These different SRT technologies have made it possible to uncover the complex transcriptional architecture of heterogeneous tissues and enhanced our understanding of cellular mechanisms in diseases [13,14].
In SRT studies, an important step is identifying spatial domains, defined as regions that are spatially coherent in both gene expression and histology. Traditional clustering methods such as K-means and Louvain’s method [15] only take gene expression data as input, and the resulting clusters may not be contiguous due to the lack of consideration of spatial information and histology. To account for spatial dependency of gene expression, new methods have been developed. For example, Zhu et al. [16] use a hidden Markov random field (HMRF) approach to model spatial dependency of gene expression; stLearn [17] uses features extracted from the histology image as well as expression of spatially neighboring spots to normalize gene expression data before clustering; BayesSpace [18] employs a Bayesian approach for clustering by imposing a prior that gives higher weight to physically close spots. Although these methods can cluster spots or cells into distinct groups, the lack of flexibility with different modalities has made them less versatile. As newer SRT technologies continue to be developed [19–22], it is desirable to have methods that are compatible with different SRT platforms.
To link spatial domains with biological functions, it is crucial to identify genes that show enriched expression in the identified domains. Methods such as Trendsceek [23], SpatialDE [24] and SPARK [25] have been developed to detect spatially variable genes (SVGs). These methods examine each gene independently and return a P value to represent the spatial variability of a gene. However, due to the lack of consideration of spatial domains, genes detected by these methods do not have guaranteed spatial expression patterns, making it difficult to utilize these genes for further biological investigations.
Rather than considering spatial domain and SVG identification as separate problems, we developed SpaGCN, a graph convolutional network (GCN)-based approach that considers these two problems jointly. SpaGCN first identifies spatial domains by integrating gene expression, spatial location and histology through the construction of an undirected weighted graph that represents the spatial dependency of the data. For each spatial domain, SpaGCN then detects SVGs that are enriched in the domain. By restricting the search space to spatial domains, the SVGs detected by SpaGCN are guaranteed to have spatial expression patterns. The spatial domains and the corresponding SVGs provide a comprehensive picture of the spatial gradients in gene expression in tissue. SpaGCN is versatile in analyzing many types of SRT data, including ST, 10x Visium, SLIDE-seqV2, STARmap, and MERFISH.

Recent technological advances in SRT have enabled gene expression profiling with spatial information in tissues. Knowing the relative location of different cells in a tissue is critical to understanding disease pathology, as spatial information helps to understand how a cell's gene expression is affected by its surrounding environment. Popular SRT experimental approaches can be broadly divided into two categories. The first category is in situ hybridization-based or sequencing-based technologies with single-cell resolution, including seqFISH, seqFISH+, MERFISH, STARmap, and FISSEQ, which measure the expression levels of hundreds to thousands of genes in cells within their tissue context. The second category is in situ capturing-based technologies, which use spatial barcoding followed by sequencing, including spatial transcriptomics (ST), SLIDE-seq, SLIDE-seqV2, HDST, and 10x Visium, which measure the expression levels of thousands of genes at captured locations, referred to as spots. These diverse SRT techniques allow us to reveal the complex transcriptional architecture of heterogeneous tissues and deepen our understanding of cellular mechanisms in disease.

An important step in SRT studies is the identification of spatial domains, regions that are spatially coherent in gene expression and histology. Traditional clustering methods such as K-means and the Louvain method only use gene expression data as input, and the resulting clusters may not be contiguous because spatial information and histology are not considered. To account for the spatial dependence of gene expression, new methods have been developed. For example, Zhu et al. used a hidden Markov random field (HMRF) approach to model the spatial dependence of gene expression; stLearn uses features extracted from histology images together with the expression of spatially neighboring spots to normalize gene expression data before clustering; BayesSpace clusters with a Bayesian approach, imposing a prior that gives higher weight to physically close spots. While these methods can cluster spots or cells into distinct groups, their lack of flexibility across modalities limits their applicability. As new SRT technologies continue to be developed, methods that are compatible with different SRT platforms are desirable.

To link spatial domains to biological function, it is critical to identify genes whose expression is enriched in defined domains. Several methods have been developed to detect Spatially Variable Genes (SVGs), such as Trendsceek, SpatialDE, and SPARK. These methods examine each gene independently and return a P-value representing the gene's spatial variability. However, due to the lack of consideration of the spatial domain, the genes detected by these methods do not have guaranteed spatial expression patterns, making it difficult to use these genes for further biological studies.

We developed SpaGCN, a graph convolutional network (GCN)-based approach that treats the identification of spatial domains and SVGs as a joint problem. SpaGCN first identifies spatial domains by constructing an undirected weighted graph that integrates gene expression, spatial location, and histology. For each spatial domain, SpaGCN then detects SVGs enriched in that domain. By restricting the search space to spatial domains, the SVGs detected by SpaGCN are guaranteed to have spatial expression patterns. The spatial domains and corresponding SVGs provide a comprehensive picture of the spatial gradients of gene expression in tissue. SpaGCN is suitable for analyzing many types of SRT data, including ST, 10x Visium, SLIDE-seqV2, STARmap, and MERFISH.

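SRT technologies fall into two categories according to the instrumentation used: iST and sST. iST is based on in situ hybridization, for example seqFISH, seqFISH+, MERFISH, STARmap and FISSEQ.
sST is based on in situ capturing, for example SLIDE-seq [10], SLIDE-seqV2 [11], HDST [12] and 10x Visium.

SRT analysis proceeds in two steps. The first step is identifying spatial domains. Common methods such as K-means and Louvain do not consider spatial or histological information;
methods such as HMRF, stLearn (normalization) and BayesSpace (a prior) add spatial information, but they lack flexibility across modalities and are poorly compatible with different platforms.

The second step is linking domains to biological functions, that is, identifying genes with enriched expression in the domains. Methods such as Trendsceek, SpatialDE
and SPARK detect spatially variable genes (SVGs) and use a P value to represent the spatial variability of a gene. However, these methods do not take spatial domains into account.

SpaGCN treats domain identification and SVG detection as a joint problem. It builds an undirected weighted graph that combines gene expression, spatial
location and histology in order to identify spatial domains.

SVGs are then detected for each domain. The method applies to many types of SRT data, such as ST, 10x Visium, SLIDE-seqV2, STARmap and MERFISH.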

in situ hybridization is iST, focusing on hundreds of genes in the cell

in situ capturing-based technologies with spatial barcoding are sST, focusing on thousands of genes in the spot

heterogeneous tissues refers to different cells or groups of cells

Spatially variable genes (SVGs) refer to different genes enriched in different domains

Results / Experiments

Overview of SpaGCN and evaluation. We explain the workflow of SpaGCN using in situ capturing-based SRT data as an example, but the method can be easily modified to analyze other types of SRT data. As shown in Fig. 1a, SpaGCN first builds a graph to represent the relationship of all spots considering both spatial location and histology information. Next, SpaGCN utilizes a graph convolutional layer to aggregate gene expression information from neighboring spots. Then, SpaGCN uses the aggregated expression matrix to cluster spots using an unsupervised iterative clustering algorithm [26]. Each cluster is considered as a spatial domain from which SpaGCN then detects SVGs that are enriched in a domain by DE analysis (Fig. 1b). When a single gene cannot mark the expression pattern of a domain, SpaGCN will construct a meta gene, formed by the combination of multiple genes, to represent the expression pattern of the domain.
To showcase the strength of SpaGCN, we applied it to seven publicly available datasets (Supplementary Table 1). The spatial domains identified by SpaGCN agree better with known tissue structures than those identified by Louvain, stLearn, and BayesSpace. We also compared SVGs detected by SpaGCN with those detected by SpatialDE and SPARK, and found that the SpaGCN-detected SVGs have more coherent expression patterns and better biological interpretability than those of the other two methods. The specificity of spatial expression patterns revealed by SpaGCN-detected SVGs was further confirmed by Moran’s I and Geary’s C statistics [27], two commonly used metrics for quantifying spatial autocorrelation of gene expression [28,29].

Overview and evaluation of SpaGCN. We explain the SpaGCN workflow using in situ capturing-based SRT data as an example, but the method can be easily modified to analyze other types of SRT data. As shown in Figure 1a, SpaGCN first constructs a graph to represent the relationship among all spots, considering both spatial location and histological information. Next, SpaGCN uses a graph convolutional layer to aggregate gene expression information from neighboring spots. Then, SpaGCN uses the aggregated expression matrix to cluster the spots with an unsupervised iterative clustering algorithm. Each cluster is considered a spatial domain, from which SpaGCN detects SVGs enriched in that domain by differential expression analysis (Fig. 1b). When a single gene cannot mark the expression pattern of a domain, SpaGCN constructs a metagene, composed of multiple genes, to represent the expression pattern of the domain.

To demonstrate the advantages of SpaGCN, we applied it to seven publicly available datasets (Supplementary Table 1). The spatial domains identified by SpaGCN agree better with known tissue structures than those identified by Louvain, stLearn and BayesSpace. We also compared the SVGs detected by SpaGCN with those detected by SpatialDE and SPARK, and found that the SVGs detected by SpaGCN have more coherent expression patterns and better biological interpretability. The specificity of the spatial expression patterns revealed by SpaGCN-detected SVGs was further confirmed by Moran's I and Geary's C statistics, two commonly used metrics for quantifying the spatial autocorrelation of gene expression.

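SpaGCN is applicable to in situ capturing-based SRT data. It first builds a graph that takes into account both spatial location and histology information;
it then uses a graph convolutional layer (GCL) to aggregate gene expression information from neighboring spots, producing an aggregated expression matrix (AEM);
a clustering algorithm is applied to the AEM, grouping the spots into domains;
DE analysis is then run on each domain to find individual SVGs. When a single gene cannot mark a domain, a meta gene composed of multiple genes is constructed to represent the SVG pattern of that domain.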
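
To make the graph-building and aggregation steps more concrete, here is a minimal sketch of my own. It is not the SpaGCN implementation: the Gaussian kernel, the `length_scale` parameter, the row normalization and the function names are all assumptions chosen for illustration, and there are no learned weights as a real graph convolutional layer would have.

```python
import math


def gaussian_weights(coords, length_scale):
    """Dense weighted adjacency (assumed form): w_ij = exp(-d_ij^2 / (2 * length_scale^2))."""
    n = len(coords)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d2 = sum((a - b) ** 2 for a, b in zip(coords[i], coords[j]))
            weights[i][j] = math.exp(-d2 / (2.0 * length_scale ** 2))
    return weights


def aggregate_expression(weights, expression):
    """Replace each spot's profile by the weight-normalized mean over all spots."""
    n_genes = len(expression[0])
    aggregated = []
    for row in weights:
        total = sum(row)
        aggregated.append([
            sum(w * spot[g] for w, spot in zip(row, expression)) / total
            for g in range(n_genes)
        ])
    return aggregated


# Toy data: spots 0 and 1 are adjacent, spot 2 is far away.
coords = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
expression = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]  # rows: spots, columns: genes

weights = gaussian_weights(coords, length_scale=1.0)
aggregated = aggregate_expression(weights, expression)
print(aggregated)
```

In this toy case spots 0 and 1 borrow expression from each other, while the isolated spot 2 keeps essentially its own profile. The aggregated matrix is what a clustering step would then operate on.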

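Domain identification is better than with Louvain, stLearn and BayesSpace, with a higher ARI.
SVG detection is better than with SpatialDE and SPARK, as judged by Moran's I and Geary's C, two statistics used to quantify the spatial autocorrelation of gene expression.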
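
For reference, the two autocorrelation statistics can be computed directly from their textbook definitions. The sketch below is my own illustration: the binary "neighbors on a chain" weights and the toy expression vectors are assumptions, not the spatial weights used in the paper.

```python
def moran_i(values, weights):
    """Moran's I = (N / W) * sum_ij w_ij (x_i - m)(x_j - m) / sum_i (x_i - m)^2."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_sum = sum(sum(row) for row in weights)
    num = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / w_sum) * (num / den)


def geary_c(values, weights):
    """Geary's C = ((N - 1) / (2 W)) * sum_ij w_ij (x_i - x_j)^2 / sum_i (x_i - m)^2."""
    n = len(values)
    mean = sum(values) / n
    w_sum = sum(sum(row) for row in weights)
    num = sum(weights[i][j] * (values[i] - values[j]) ** 2
              for i in range(n) for j in range(n))
    den = sum((v - mean) ** 2 for v in values)
    return ((n - 1) / (2.0 * w_sum)) * (num / den)


def chain_weights(n):
    """Binary weights for spots on a line: only direct neighbors are connected."""
    return [[1.0 if abs(i - j) == 1 else 0.0 for j in range(n)] for i in range(n)]


w = chain_weights(6)
clustered = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]    # expression confined to one side
alternating = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]  # no spatially coherent pattern

print(moran_i(clustered, w), moran_i(alternating, w))   # about 0.6 and -1.0
print(geary_c(clustered, w), geary_c(alternating, w))   # about 0.33 and 1.67
```

With these standard definitions, a spatially coherent gene has a high Moran's I and a Geary's C below 1, while a gene with no coherent pattern shows the opposite.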
Application to human primary pancreatic cancer ST data. To demonstrate the importance of incorporating histology information, we analyzed a human primary pancreatic cancer dataset generated using the ST technology [13]. This dataset includes 224 spots and 16,448 genes with three manually annotated tissue regions. The cancer region detected by Louvain based on gene expression alone did not closely match the pathologist-annotated cancer region (Fig. 2a). Spatial clustering methods such as stLearn and BayesSpace did not detect the cancer region either. SpaGCN revealed a similar pattern when using default parameters. As the histology image shows a clear difference between the cancer and noncancer regions, it suggests histology is informative for clustering. SpaGCN has the flexibility of modeling histology with a scaling parameter s, which controls the weight given to histology when detecting neighbors for each spot. By increasing the value of s from 1 to 2, SpaGCN detected a cluster that agrees well with the manually annotated cancer region. It is worth noting that when s was set at the default value of 1, SpaGCN detected the noncancer regions well. When s was increased to 2, SpaGCN not only maintained the ability to detect the noncancer regions but also detected the cancer region. This example showed that SpaGCN is flexible in incorporating histology information in clustering. Although stLearn can incorporate histology data, its use of histology information is pre-fixed by the radius when defining neighboring spots. The lack of flexibility in adjusting histology weight led to the discrepancy between their clustering and the pathologist’s manual annotation.
Next, we detected SVGs using SpaGCN, SPARK and SpatialDE. In total, SpaGCN detected 12 SVGs, with three, eight and one SVGs for domains 0, 1 and 2, respectively (Fig. 2b; Supplementary Fig. 1). Furthermore, a meta gene using KRT17, MMP11 and SERPINA1 marked the cancer region better than the originally identified KRT17 for domain 2 (Fig. 2c). KRT17 functions as a tumor promoter and regulates proliferation in pancreatic cancer [30], and MMP11 is a prognostic biomarker for pancreatic cancer [31]. Our identification of KRT17 and MMP11 as the two positive genes for the cancer region agrees well with pancreatic cancer biology. SPARK and SpatialDE detected 203 and 163 SVGs, with their P or Q values highly skewed towards 0 (Supplementary Figs. 2 and 3). However, the Moran’s I and Geary’s C values for their SVGs are much lower than those detected by SpaGCN, suggesting their lack of spatial patterns (Fig. 2d). Furthermore, genes with smaller P or Q values do not necessarily show better spatial expression patterns than those with larger P or Q values (Supplementary Figs. 4 and 5). More stringent filtering of spots and genes did not improve the spatial pattern for SpatialDE and SPARK-detected SVGs (Supplementary Fig. 6).

Application to human primary pancreatic cancer ST data. To demonstrate the importance of integrating histological information, we analyzed a human primary pancreatic cancer dataset generated using ST technology. The dataset includes 224 spots and 16,448 genes with three manually annotated tissue regions. The cancer region detected by Louvain clustering based only on gene expression did not closely match the cancer region annotated by pathologists. Spatial clustering methods such as stLearn and BayesSpace also failed to detect the cancer region. SpaGCN revealed a similar pattern when using the default parameters. However, since the histology image shows clear differences between cancerous and non-cancerous regions, histology appears to be informative for clustering. SpaGCN models histology flexibly through a scaling parameter s, which controls the weight given to histology when detecting neighbors of each spot. When s was increased from 1 to 2, SpaGCN detected a cluster that agrees well with the manually annotated cancer region. It is worth noting that when s is set to the default value of 1, SpaGCN detects the non-cancer regions well. When s is increased to 2, SpaGCN not only maintains the ability to detect the non-cancer regions, but also detects the cancer region. This example demonstrates the flexibility of SpaGCN in integrating histological information in clustering. While stLearn can integrate histological data, its use of histology is pre-fixed by the radius used when defining neighboring spots. This inability to adjust the histology weight led to the discrepancy between its clustering results and the pathologist's manual annotation.
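
To get an intuition for what the scaling parameter s does, here is a small sketch of my own. It assumes that each spot carries a histology-derived value z that is treated as a third coordinate and multiplied by s before computing distances; this representation, the function names and the toy numbers are assumptions for illustration, not details given in the text above.

```python
import math


def spot_distance(spot_a, spot_b, s=1.0):
    """Euclidean distance over (x, y, s * z), where z is a histology-derived value (assumed)."""
    (xa, ya, za), (xb, yb, zb) = spot_a, spot_b
    return math.sqrt((xa - xb) ** 2 + (ya - yb) ** 2 + (s * (za - zb)) ** 2)


def nearest_neighbor(index, spots, s=1.0):
    """Index of the closest other spot under the histology-weighted distance."""
    others = [j for j in range(len(spots)) if j != index]
    return min(others, key=lambda j: spot_distance(spots[index], spots[j], s))


# Each spot is (x, y, z). Values are made up for the example.
spots = [
    (0.0, 0.0, 0.0),  # query spot
    (1.0, 0.0, 1.5),  # physically close, but histology looks different
    (2.0, 0.0, 0.0),  # farther away, same histology as the query spot
]

print(nearest_neighbor(0, spots, s=1.0))  # 1: physical proximity dominates
print(nearest_neighbor(0, spots, s=2.0))  # 2: histology similarity dominates
```

Raising s makes spots with different histology look farther apart, so neighbors are drawn preferentially from histologically similar tissue. That is consistent with the cancer region emerging as its own cluster once s is increased.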

Next, we detected SVGs using SpaGCN, SPARK and SpatialDE. In total, SpaGCN detected 12 SVGs, with 3, 8, and 1 SVGs in domains 0, 1, and 2, respectively (Fig. 2b; Supplementary Fig. 1). Furthermore, a metagene constructed from KRT17, MMP11, and SERPINA1 marked the cancer region better than KRT17, the gene initially identified for domain 2 (Fig. 2c). KRT17 functions as a tumor promoter and regulates proliferation in pancreatic cancer, while MMP11 is a prognostic biomarker for pancreatic cancer. Our identification of KRT17 and MMP11 as the two positive genes for the cancer region fits well with the biology of pancreatic cancer. SPARK and SpatialDE detected 203 and 163 SVGs, with P or Q values highly skewed toward 0. However, the Moran's I and Geary's C values of their SVGs are much lower than those of the SVGs detected by SpaGCN, indicating that they lack spatial patterns. Furthermore, genes with smaller P or Q values do not necessarily show better spatial expression patterns than genes with larger P or Q values. For SVGs detected by SpatialDE and SPARK, stricter spot and gene filtering did not improve the spatial patterns.

Application to human dorsolateral prefrontal cortex 10x Visium data. To show quantitatively that SpaGCN outperforms Louvain, stLearn and BayesSpace in spatial domain detection, we analyzed the LIBD human dorsolateral prefrontal cortex (DLPFC) data generated using 10x Visium [32]. This study sequenced 12 tissue slices that span six neuronal layers plus white matter from the DLPFC in three human brains. The manual annotation of the tissue layers provided by the original study allows us to evaluate the accuracy of spatial domain detection. Figure 3a shows that for the representative tissue slice 151673, both SpaGCN and BayesSpace revealed spatial domains that agree better with the manually annotated tissue layers than Louvain. Although stLearn utilized histology information, its performance is not much better than Louvain and is substantially worse than SpaGCN and BayesSpace. The relative performance of these methods remains the same when considering all 12 slices (Fig. 3b and Supplementary Table 2); the median ARI is 0.36 for stLearn, 0.42 for BayesSpace and 0.45 for SpaGCN.
To validate further the identified spatial domains, we detected SVGs for each domain in slice 151673. In total, SpaGCN detected 67 SVGs, with 53 of them being specific to domain 5, which corresponds to white matter (Supplementary Fig. 7). Patterns of SVGs for other domains are not very clear. These results indicate that gene expression profiles of spots from white matter are distinct from spots in the neuronal layers, while gene expression differences among the six neuronal layers are much smaller and more difficult to distinguish using individual marker genes. SVGs detected by SPARK and SpatialDE also suffered from the same problem. SPARK detected 3,187 SVGs with 1,131 of them having false discovery rate (FDR)-adjusted P values equal to 0, most of which only marked the white matter region (Supplementary Figs. 8 and 9). We also found that the SVGs detected by SPARK lack domain specificity (Supplementary Fig. 10). SpatialDE detected 3,654 SVGs with 806 of them having Q values equal to 0, but these genes do not necessarily show better spatial patterns than genes with larger Q values (Supplementary Fig. 11). Although SPARK and SpatialDE detected much larger numbers of SVGs than SpaGCN, the genes detected by these two methods cannot distinguish different degrees of spatial expression variability as their P or Q value distributions are highly skewed towards 0. Figure 3c shows that the Moran’s I values for SpaGCN-detected SVGs are significantly higher than genes detected by SpatialDE and SPARK (median of 0.39 for SpaGCN against 0.09 for SPARK and 0.08 for SpatialDE). More stringent filtering of spots and genes did not improve the performance of SpatialDE and SPARK (Supplementary Fig. 12). For three out of the six neuronal layers, SpaGCN detected a single SVG to mark that region (Fig. 3d). For example, CAMK2N1 is enriched in domain 0 (layers 1 and 2), PCP4 is enriched in domain 1 (layer 4) and NEFM is enriched in domain 3 (layer 3).
To show that SpaGCN-detected SVGs are useful for downstream analysis, we performed K-means clustering on slice 151507, which is from a different brain, using all 67 SVGs detected from slice 151673 by SpaGCN. Compared with manually curated layer assignment, this clustering analysis had an adjusted Rand index (ARI) of 0.23 (Fig. 3e,f). We performed similar analysis using SVGs detected by SpatialDE and SPARK. When randomly selecting 67 SVGs with 0 P or Q value from genes detected by SpatialDE/SPARK, the ARI is only 0.13 for SpatialDE and 0.14 for SPARK. The ARIs for SpatialDE and SPARK did not improve even with increased numbers of SVGs (Fig. 3e). These results further confirmed the lack of spatial patterns for genes detected by SPARK and SpatialDE.
Although it is difficult to identify single genes to mark certain neuronal layers, SpaGCN was able to find domain-specific meta genes. As shown in Fig. 3g, SpaGCN detected meta genes for domains 1, 2, 4 and 6. The meta gene for domain 2 is specific to layer 1. As layer 1 only has a few spots, it is difficult to find a highly enriched gene. However, by adding depleted genes such as FTH1, MBP, MT-CO3 and PLP1, the expression pattern in this region is strengthened. Furthermore, the SVGs and meta genes detected by SpaGCN are transferrable to slice 151507 obtained from a different brain, in which the meta genes detected in slice 151673 mark the same layers in slice 151507 (Fig. 3g and Supplementary Fig. 13).

Application to human dorsolateral prefrontal cortex (DLPFC) 10x Visium data. To quantitatively demonstrate that SpaGCN outperforms Louvain, stLearn, and BayesSpace in spatial domain detection, we analyzed the LIBD human DLPFC data generated using 10x Visium technology. The study sequenced 12 tissue slices spanning six neuronal layers plus white matter from the DLPFC in three human brains. The manual annotation of tissue layers provided by the original study allows us to assess the accuracy of spatial domain detection. Figure 3a shows that for the representative tissue slice 151673, the spatial domains revealed by SpaGCN and BayesSpace agree better with the manually annotated tissue layers than those from Louvain. Although stLearn uses histological information, its performance is not much better than Louvain and is substantially worse than SpaGCN and BayesSpace. The relative performance of the methods remains the same when all 12 slices are considered (Fig. 3b and Supplementary Table 2); the median ARI is 0.36 for stLearn, 0.42 for BayesSpace, and 0.45 for SpaGCN.
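
The ARI values quoted above compare a clustering against the manual layer annotation. As a reminder of what the number measures, here is a small standard-library sketch of the adjusted Rand index computed from its usual contingency-table definition. The layer labels are toy values, and in practice a library routine such as scikit-learn's `adjusted_rand_score` would normally be used instead.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """ARI = (index - expected) / (max_index - expected), from pair counts."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # degenerate case: nothing to compare
    pair_counts = Counter(zip(labels_true, labels_pred))  # contingency table cells
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    index = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2.0
    if max_index == expected:
        return 1.0  # both partitions are trivial (e.g. one cluster each)
    return (index - expected) / (max_index - expected)


annotated = ["L1", "L1", "WM", "WM"]  # toy manual annotation
same_split = [7, 7, 3, 3]             # same partition, different cluster names
crossed = [0, 1, 0, 1]                # splits every annotated layer in half

print(adjusted_rand_index(annotated, same_split))  # 1.0
print(adjusted_rand_index(annotated, crossed))     # about -0.5
```

Cluster names do not matter, only which spots are grouped together. A value of 1 means perfect agreement with the annotation and values near 0 mean agreement no better than chance.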

To further validate the identified spatial domains, we detected SVGs for each domain in slice 151673. In total, SpaGCN detected 67 SVGs, 53 of which were specific to domain 5, corresponding to white matter (Supplementary Fig. 7). The patterns for SVGs in other domains are less clear. These results suggest that the gene expression profiles of spots in white matter differ from those of spots in neuronal layers, while gene expression differences between the six neuronal layers are much smaller and more difficult to distinguish using a single marker gene. SVGs detected by SPARK and SpatialDE have the same problem. SPARK detected 3,187 SVGs, of which 1,131 had FDR-adjusted P values equal to 0, most of which only marked the white matter region (Supplementary Figs. 8 and 9). We also found that the SVGs detected by SPARK lack domain specificity (Supplementary Fig. 10). SpatialDE detected 3,654 SVGs, of which 806 had Q values equal to 0, but the spatial patterns of these genes were not necessarily better than those of genes with larger Q values (Supplementary Fig. 11). Although SPARK and SpatialDE detect many more SVGs than SpaGCN, the genes detected by these two methods cannot distinguish different degrees of spatial expression variability because their P or Q value distributions are highly skewed toward 0. Figure 3c shows that the Moran's I values of SVGs detected by SpaGCN are significantly higher than those of genes detected by SpatialDE and SPARK (median of 0.39 for SpaGCN, 0.09 for SPARK, and 0.08 for SpatialDE). Stricter filtering of spots and genes did not improve the performance of SpatialDE and SPARK (Supplementary Fig. 12). For three of the six neuronal layers, SpaGCN detected a single SVG to mark the region (Fig. 3d). For example, CAMK2N1 is enriched in domain 0 (layers 1 and 2), PCP4 in domain 1 (layer 4) and NEFM in domain 3 (layer 3).

To demonstrate the usefulness of the SVGs detected by SpaGCN in downstream analysis, we performed K-means clustering on slice 151507, which is from another brain, using all 67 SVGs detected by SpaGCN from slice 151673. Compared with the manually curated layer assignment, the adjusted Rand index (ARI) for this clustering was 0.23 (Fig. 3e,f). We performed a similar analysis using SVGs detected by SpatialDE and SPARK. When randomly selecting 67 SVGs with P or Q values of 0 from the genes detected by SpatialDE/SPARK, the ARI is only 0.13 for SpatialDE and 0.14 for SPARK. Even with more SVGs, the ARIs of SpatialDE and SPARK do not improve (Fig. 3e). These results further confirm the lack of spatial patterns in genes detected by SPARK and SpatialDE.

While it is difficult to pinpoint individual genes to mark specific neuronal layers, SpaGCN is able to find domain-specific metagenes. As shown in Figure 3g, SpaGCN detected metagenes of domains 1, 2, 4 and 6. Domain 2 metagenes are specific to layer 1. Since layer 1 has only a few spots, it is difficult to find highly enriched genes. However, the expression pattern in this region was enhanced by the addition of depleted genes such as FTH1, MBP, MT-CO3 and PLP1. Furthermore, SVGs and metagenes detected by SpaGCN could be transferred to slice 151507 obtained from another brain where the metagenes detected by SpaGCN labeled the same layer (Fig. 3g and Supplementary Fig. 13).
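
The idea of strengthening a weak marker by folding in depleted genes can be sketched as follows. This is a toy illustration under my own assumptions: the rule used here (base gene plus enriched genes minus depleted genes, per spot), the `domain_contrast` helper and the gene names `GENE_BASE` and `GENE_DEPLETED` are hypothetical, and the exact construction used by SpaGCN is not spelled out in this section.

```python
def build_meta_gene(expression, base_gene, enriched=(), depleted=()):
    """Per-spot meta gene: base + enriched genes - depleted genes (assumed rule)."""
    meta = list(expression[base_gene])
    for gene in enriched:
        meta = [m + v for m, v in zip(meta, expression[gene])]
    for gene in depleted:
        meta = [m - v for m, v in zip(meta, expression[gene])]
    return meta


def domain_contrast(values, in_domain):
    """Mean expression inside the domain minus mean expression outside it."""
    inside = [v for v, flag in zip(values, in_domain) if flag]
    outside = [v for v, flag in zip(values, in_domain) if not flag]
    return sum(inside) / len(inside) - sum(outside) / len(outside)


# Toy data: four spots, the first two belong to the target domain.
in_domain = [True, True, False, False]
expression = {
    "GENE_BASE": [2.0, 1.0, 1.0, 1.0],      # weakly enriched in the domain
    "GENE_DEPLETED": [0.0, 0.0, 3.0, 3.0],  # absent from the domain, high elsewhere
}

meta = build_meta_gene(expression, "GENE_BASE", depleted=["GENE_DEPLETED"])
print(domain_contrast(expression["GENE_BASE"], in_domain))  # 0.5
print(domain_contrast(meta, in_domain))                     # 3.5
```

Subtracting a gene that is depleted in the domain widens the gap between in-domain and out-of-domain spots, which is the effect described for layer 1 above.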

Application to mouse posterior brain 10x Visium data. Next, we analyzed a 10x Visium dataset generated from mouse posterior brain that includes 3,353 spots and 31,053 genes [33]. This dataset shows much more complex tissue structure than the previous two datasets. We compared the clustering result of SpaGCN with Louvain, stLearn and BayesSpace when the number of clusters was set at ten for all methods. Figure 4a shows that Louvain’s clustering is similar to stLearn, BayesSpace and SpaGCN, but the spatial domains detected by the latter three methods are more spatially contiguous due to their ability to account for spatial dependency of gene expression.
We further investigated the ability of each method in detecting 
more refined tissue structure. Specifically, we performed subclustering analysis for spots in domain 5 detected by SpaGCN, which 
corresponds to the cortex (Fig. 4b). The subdomains detected by 
SpaGCN agree well with the Allen Brain Institute reference atlas 
diagram of the mouse cortex (Fig. 4c). The detected subdomains 
include layers 2/3, layers 4/5, layer 6, a hippocampal region (CA1) 
and the subiculum. Layers 2/3 are the ‘external’ cortical layers that 
are biologically responsible for local networks in which neurons in 
this subdomain communicate to other neurons in adjacent neocortical regions. Layers 4/5 are the ‘internal’ cortical layers that are biologically responsible for longer range neural networks. For example, 
the visual cortex, which corresponds to the neocortical region, is 
responsible for receiving visual information from the lateral geniculate nucleus that is far away. SpaGCN was able to separate the 
molecular (layer 1), external (layers 2/3), internal (layers 4/5) and 
the plexiform (6) layers. More importantly, SpaGCN outperformed 
Louvain and stLearn, both of which merged neocortical layers. 
SpaGCN also outperformed BayesSpace in distinguishing between 
the plexiform layer (subdomain 1) and the non-neocortical CA1 
region of the hippocampus (subdomain 3). In contrast, BayesSpace 
combined layer 6 of the neocortex with the non-neocortical CA1 
layer of the hippocampus.
Next, we compared SpaGCN with SPARK and SpatialDE for 
SVG detection. SpaGCN detected 1,028 SVGs for the ten spatial 
domains while SPARK and SpatialDE detected 9,678 and 12,676 
SVGs, respectively (Supplementary Fig. 14). As shown in Fig. 4d, 
the Moran’s I values of SpaGCN-detected SVGs are much higher 
than those detected by SPARK and SpatialDE (median of 0.54 for 
SpaGCN against 0.20 for SPARK and 0.16 for SpatialDE). More 
stringent filtering of spots and genes did not improve the performance of SPARK and SpatialDE (Supplementary Fig. 15). The 
P or Q value distributions of SpatialDE and SPARK are highly skewed towards 0 (Supplementary Fig. 16), and genes with similar P or Q values do not necessarily show similar spatial patterns 
and a smaller P or Q value does not guarantee a better spatial pattern (Supplementary Figs. 17 and 18). In contrast, multiple domain 
adaptive filtering criteria implemented in SpaGCN allow it to eliminate false positive SVGs and ensure all detected SVGs have clear 
spatial expression patterns.
To illustrate how the filtering in SpaGCN works, we use domains 
1, 5 and 8 as an example. For each of these domains, SpaGCN 
detected a single SVG enriched in that region. As shown in Fig. 4e, 
PVALB is enriched in domain 1 and TRM62 is enriched in domain 
8. Although domains 1 and 8 are adjacent to each other, these 
two SVGs still mark these domains well. NRGN is an SVG that 
SpaGCN detected for domains 5 and 7. The high expression of 
NRGN in domains 5 and 7 also indicates that these two domains are 
neuroanatomically similar—both consisting of cortex and the pyramidal layer of the hippocampus. Both the cortex and hippocampus 
are regions that are on the curved surface of the brain. Domains 
5 and 7, which would be contiguous in a three-dimensional (3D) 
reconstruction, are artifactually separated as a result of how the section was cut. Therefore, it is not surprising that in addition to NRGN, 
SpaGCN also detected many other SVGs for domains 5 and 7, some 
of which are highly expressed in both domains (Supplementary 
Fig. 19). The unique and powerful SVG detection procedure in 
SpaGCN ensures that genes such as these are not missed.
SpaGCN only identified four SVGs for domain 0. However, we 
reason that a meta gene, formed by the combination of multiple 
genes, may better reveal spatial patterns than any single gene. We 
used domain 0 as an example to show how SpaGCN can create 
informative meta genes to mark a spatial domain (Fig. 4f). First, by 
lowering the filtering thresholds, SpaGCN identified KLK6 which 
is highly expressed in the lower part of domain 0. Using KLK6 as a 
starting gene, SpaGCN used a novel approach to find a log-linear 
combination of gene expression of KLK6, MBP and ATP1B1, which 
accurately marked the spatial domain 0. In this meta gene, KLK6
and MBP are considered as positive markers because they are 
highly expressed in some spots in domain 0, whereas ATP1B1 is 
considered a negative marker as it is mainly expressed in regions 
other than domain 0. Previous studies have shown that KLK6 and 
MBP expression is restricted to oligodendrocytes, while ATP1B1 is 
mainly expressed in neurons and astrocytes (ref. 34). This resonates with 
the fact that domain 0 represents white matter which is dominated 
by oligodendrocytes and has few neuronal cell bodies. Therefore, 
the genes that make up this meta gene have meaningful biological 
interpretations.
While we focused our analyses on one tissue section, SpaGCN 
can also jointly analyze multiple tissue sections. We show two examples using this mouse brain Visium data provided by 10x Genomics. 
Figure 5a shows SpaGCN clustering results for two mouse posterior sections. As these two tissue sections are from the same region, 
SpaGCN was able to infer cluster correspondence between the two 
tissue sections. Next, we used SpaGCN to analyze jointly two tissue sections with one from the mouse posterior brain and the other 
from the mouse anterior brain. As the anterior section and posterior section are adjacent in the brain, we modified the coordinates 
for spots in the posterior section such that the revised coordinates 
reflect the spatial adjacency of the two tissue sections. Using the 
modified coordinates as input, SpaGCN was able to produce clustering results that reflect the shared layer structure in the anterior 
and posterior brain (Fig. 5b).
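The coordinate modification for the joint anterior/posterior analysis can be pictured with the sketch below. The text does not state how the coordinates were revised, so the rule used here (translating the posterior section along x until it sits just to the right of the anterior section) is an assumption, as are the function name `place_side_by_side` and the `gap` parameter.

```python
def place_side_by_side(anterior_xy, posterior_xy, gap=1.0):
    """Shift posterior spots along x so the section starts `gap` units right of the anterior one.

    Hypothetical helper: the actual transformation used in the paper is not specified here.
    """
    anterior_right = max(x for x, _ in anterior_xy)
    posterior_left = min(x for x, _ in posterior_xy)
    shift = anterior_right - posterior_left + gap
    return [(x + shift, y) for x, y in posterior_xy]


# Toy coordinates (made up) for two sections captured in separate coordinate frames.
anterior = [(0, 0), (10, 5)]
posterior = [(2, 1), (8, 4)]
print(place_side_by_side(anterior, posterior))  # [(11.0, 1), (17.0, 4)]
```

After such a shift, spots at the facing edges of the two sections become spatial neighbors, so a graph built from the combined coordinates can link them.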

Next, we analyzed a 10x Visium dataset generated from mouse posterior brain, which contains 3,353 spots and 31,053 genes. This dataset shows a much more complex tissue structure than the previous two datasets. We compared the clustering results of SpaGCN with those of Louvain, stLearn and BayesSpace, with the number of clusters set to ten for all methods. Figure 4a shows that the clustering result of Louvain is similar to those of stLearn, BayesSpace and SpaGCN, but the spatial domains detected by the latter three methods are more spatially contiguous because they account for the spatial dependency of gene expression.

We further investigated the ability of each method to detect finer tissue structure. Specifically, we performed subclustering analysis of the spots in domain 5 detected by SpaGCN, which corresponds to the cortex (Fig. 4b). The subdomains detected by SpaGCN agree well with the Allen Brain Institute reference atlas diagram of the mouse cortex (Fig. 4c). The detected subdomains include layers 2/3, layers 4/5, layer 6, a hippocampal region (CA1) and the subiculum. Layers 2/3 are the "external" cortical layers, which are biologically responsible for local networks in which neurons in this subdomain communicate with neurons in adjacent neocortical regions. Layers 4/5 are the "internal" cortical layers, which are biologically responsible for longer-range neural networks. For example, the visual cortex, which is a neocortical region, receives visual information from the distant lateral geniculate nucleus. SpaGCN was able to separate the molecular (layer 1), external (layers 2/3), internal (layers 4/5) and plexiform (layer 6) layers. More importantly, SpaGCN outperformed Louvain and stLearn, both of which merged neocortical layers. SpaGCN also outperformed BayesSpace in distinguishing between the plexiform layer (subdomain 1) and the non-neocortical CA1 region of the hippocampus (subdomain 3). In contrast, BayesSpace combined layer 6 of the neocortex with the non-neocortical CA1 layer of the hippocampus.

Next, we compared SpaGCN with SPARK and SpatialDE for SVG detection. SpaGCN detected 1,028 SVGs for the ten spatial domains, while SPARK and SpatialDE detected 9,678 and 12,676 SVGs, respectively (Supplementary Fig. 14). As shown in Fig. 4d, the Moran's I values of SVGs detected by SpaGCN are much higher than those of SVGs detected by SPARK and SpatialDE (median of 0.54 for SpaGCN, 0.20 for SPARK and 0.16 for SpatialDE). Stricter filtering of spots and genes did not improve the performance of SPARK and SpatialDE (Supplementary Fig. 15). The distributions of P or Q values for SpatialDE and SPARK are highly skewed toward 0 (Supplementary Fig. 16), genes with similar P or Q values do not necessarily show similar spatial patterns, and a smaller P or Q value does not guarantee a better spatial pattern (Supplementary Figs. 17 and 18). In contrast, the multiple domain-adaptive filtering criteria implemented in SpaGCN allow it to eliminate false-positive SVGs and ensure that all detected SVGs have clear spatial expression patterns.

To illustrate how filtering in SpaGCN works, we take domains 1, 5 and 8 as examples. For each of these domains, SpaGCN detected a single SVG enriched in that region. As shown in Fig. 4e, PVALB is enriched in domain 1 and TRM62 in domain 8. Although domains 1 and 8 are adjacent to each other, these two SVGs still mark the domains well. NRGN is an SVG that SpaGCN detected for domains 5 and 7. The high expression of NRGN in domains 5 and 7 also indicates that these two domains are neuroanatomically similar, both consisting of cortex and the pyramidal layer of the hippocampus. Both the cortex and the hippocampus lie on the curved surface of the brain. Domains 5 and 7, which would be contiguous in a three-dimensional (3D) reconstruction, are artifactually separated because of how the section was cut. It is therefore not surprising that, in addition to NRGN, SpaGCN detected many other SVGs for domains 5 and 7, some of which are highly expressed in both domains (Supplementary Fig. 19). The SVG detection procedure in SpaGCN ensures that genes such as these are not missed.

SpaGCN identified only four SVGs for domain 0. However, we reasoned that a metagene, formed by combining multiple genes, may reveal spatial patterns better than any single gene. Using domain 0 as an example, we show how SpaGCN creates informative metagenes to mark a spatial domain (Fig. 4f). First, by lowering the filtering thresholds, SpaGCN identified KLK6, which is highly expressed in the lower part of domain 0. Using KLK6 as the starting gene, SpaGCN used a novel approach to find a log-linear combination of the expression of KLK6, MBP and ATP1B1 that accurately marks spatial domain 0. In this metagene, KLK6 and MBP are considered positive markers because they are highly expressed in some spots in domain 0, whereas ATP1B1 is considered a negative marker because it is mainly expressed outside domain 0. Previous studies have shown that KLK6 and MBP expression is restricted to oligodendrocytes, whereas ATP1B1 is mainly expressed in neurons and astrocytes. This is consistent with the fact that domain 0 represents white matter, which is dominated by oligodendrocytes and has few neuronal cell bodies. Therefore, the genes that make up this metagene have meaningful biological interpretations.

Although our analysis focused on a single tissue section, SpaGCN can also jointly analyze multiple tissue sections. We show two examples using the mouse brain Visium data provided by 10x Genomics. Figure 5a shows the SpaGCN clustering results for two mouse posterior brain sections. Because these two sections are from the same region, SpaGCN was able to infer the cluster correspondence between them. Next, we used SpaGCN to jointly analyze two tissue sections, one from the mouse posterior brain and the other from the mouse anterior brain. Because the anterior and posterior sections are adjacent in the brain, we modified the coordinates of the spots in the posterior section so that the revised coordinates reflect the spatial adjacency of the two sections. Using the modified coordinates as input, SpaGCN produced clustering results that reflect the layer structure shared by the anterior and posterior brain (Fig. 5b).
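Returning to the domain 0 metagene described above, the idea of a log-linear combination with positive and negative markers can be sketched as follows. This is only an illustration under stated assumptions: unit coefficients, a `log1p` transform of raw counts, the function name `metagene_score` and the toy counts are all made up here, and SpaGCN's actual procedure for choosing genes and weights is not reproduced.

```python
from math import log1p


def metagene_score(expr, positive, negative):
    """Per-spot score: sum of log1p(counts) of positive genes minus that of negative genes.

    `expr` maps gene name -> list of per-spot counts (hypothetical input layout).
    """
    n_spots = len(next(iter(expr.values())))
    scores = []
    for s in range(n_spots):
        up = sum(log1p(expr[g][s]) for g in positive)
        down = sum(log1p(expr[g][s]) for g in negative)
        scores.append(up - down)
    return scores


# Toy counts (made up): spots 0-1 mimic white matter, spots 2-3 mimic other regions.
expr = {
    "KLK6":   [5, 0, 0, 0],
    "MBP":    [20, 15, 2, 1],
    "ATP1B1": [1, 2, 10, 12],
}
scores = metagene_score(expr, positive=["KLK6", "MBP"], negative=["ATP1B1"])
print([round(s, 2) for s in scores])
```

In this toy example, spot 1 has no KLK6 counts but still scores high because MBP is present and ATP1B1 is low, which mirrors how combining genes can mark a domain more completely than the starting gene alone.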

Application to mouse visual cortex STARmap data. Finally, we 
analyzed a STARmap dataset that has single-cell resolution (ref. 7). This 
dataset was generated from mouse visual cortex that spans from 
hippocampus to corpus callosum, and the six neocortical layers. In total, 1,020 genes were measured in 1,207 cells that include 
non-neuronal cells, excitatory and inhibitory neurons. The layer 
structure and cell type distribution of the tissue section provided 
by the original study are shown in Fig. 6a. As the tissue capture area 
of STARmap is much smaller than 10x Visium, we increased the 
contribution of neighboring cells from 0.5 to 1 when calculating 
the weighted gene expression of each cell in SpaGCN. Using this 
approach, SpaGCN detected spatial domains that agreed well with 
the annotated tissue structure (Fig. 6a,c), achieving an ARI of 0.51. 
By contrast, the ARIs of the other methods are much lower (0.30 for 
Louvain, 0.37 for BayesSpace and 0.03 for HMRF) (Fig. 6b). This 
example demonstrates that SpaGCN utilizes spatial information 
more efficiently than BayesSpace and HMRF. Using SpaGCN, we 
further detected 25 SVGs including genes LAMP5, HPCAL1, CPLX1, 
PLP1, NRSN1, ATP1A2 and BSG that showed enriched expression 
patterns for domains 0 to 6 (Fig. 6e and Supplementary Fig. 20). 
Similar to previous analyses, SPARK and SpatialDE detected a much 
larger number of SVGs, but many of these SVGs lack spatial expression 
patterns (Fig. 6d and Supplementary Figs. 21–24).

Finally, we analyzed a STARmap dataset with single-cell resolution. The dataset was generated from mouse visual cortex and spans from the hippocampus to the corpus callosum, including the six neocortical layers. In total, 1,020 genes were measured in 1,207 cells, which include non-neuronal cells, excitatory neurons and inhibitory neurons. The layer structure and cell type distribution of the tissue section provided by the original study are shown in Fig. 6a. Because the tissue capture area of STARmap is much smaller than that of 10x Visium, we increased the contribution of neighboring cells from 0.5 to 1 when calculating the weighted gene expression of each cell in SpaGCN. Using this approach, SpaGCN detected spatial domains that agree well with the annotated tissue structure (Fig. 6a,c), achieving an ARI of 0.51. In contrast, the ARIs of the other methods are much lower (0.30 for Louvain, 0.37 for BayesSpace and 0.03 for HMRF) (Fig. 6b). This example demonstrates that SpaGCN utilizes spatial information more effectively than BayesSpace and HMRF. Using SpaGCN, we further detected 25 SVGs, including LAMP5, HPCAL1, CPLX1, PLP1, NRSN1, ATP1A2 and BSG, which showed enriched expression patterns for domains 0 to 6 (Fig. 6e and Supplementary Fig. 20). Similar to previous analyses, SPARK and SpatialDE detected a much larger number of SVGs, but many of them lacked spatial expression patterns (Fig. 6d and Supplementary Figs. 21–24).
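The "contribution of neighboring cells" mentioned above can be thought of as a tunable quantity, as in the sketch below. It assumes a Gaussian kernel weight exp(-d²/(2l²)) between cells, a self-weight of 1, and defines the contribution as the average total neighbor weight per cell; the length scale `l` is then found by bisection. These definitions and the function names are assumptions for illustration and are not taken from the text above.

```python
from itertools import product
from math import dist, exp


def neighbor_contribution(coords, l):
    """Average summed Gaussian weight of all other cells, relative to a self-weight of 1."""
    total = 0.0
    for i, a in enumerate(coords):
        for j, b in enumerate(coords):
            if i != j:
                total += exp(-dist(a, b) ** 2 / (2 * l * l))
    return total / len(coords)


def search_length_scale(coords, target_p, lo=1e-3, hi=1e3, tol=1e-4, max_iter=200):
    """Bisection on l: the contribution grows monotonically as l increases."""
    for _ in range(max_iter):
        mid = (lo + hi) / 2
        p = neighbor_contribution(coords, mid)
        if abs(p - target_p) < tol:
            return mid
        if p < target_p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Toy 5 x 5 grid of cells (made-up coordinates).
coords = list(product(range(5), range(5)))
l_half = search_length_scale(coords, target_p=0.5)
l_one = search_length_scale(coords, target_p=1.0)
print(l_half < l_one)  # True: a larger neighbor contribution needs a wider kernel
```

Under these assumptions, raising the target contribution from 0.5 to 1 simply widens the kernel, so each cell's smoothed expression draws more heavily on its surroundings.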

Discussion

detect SVGs and meta genes that have much clearer spatial expression patterns and biological interpretations than genes detected by 
SpatialDE and SPARK. Additionally, the SpaGCN-detected SVGs 
are transferrable and can be utilized for downstream analyses in 
independent tissue sections. SpaGCN is also computationally 
fast and memory efficient compared to SPARK and SpatialDE 
(Supplementary Note 4).
The spatial domain detection step in SpaGCN is flexible. First, 
SpaGCN can adjust the weight of histology in gene expression 
smoothing. For datasets with clear tissue structure in histology, 
higher weight led to clearer separation of cancer versus noncancer regions. Second, during the GCN fitting procedure, the graph 
weights are updated, which allows SpaGCN to learn an efficient way 
to aggregate gene expression from neighboring spots for each gene. 
For data generated from different platforms, the spatial dependency 
between spots/cells is different as the size of the captured tissue 
area varies. The flexibility in modeling spatial dependency makes 
SpaGCN versatile for different types of SRT data.
A limitation of SpaGCN is that the spatial domain detection is 
mainly driven by gene expression, which may lead to the discrepancy 
between the detected domains and the underlying tissue anatomical structure. This is a general problem for gene expression-based 
clustering methods. Another limitation of SpaGCN is the lack of 
separation of spatial variation and cell type variation in gene expression patterns for the detected SVGs. To address these limitations, 
methods that can jointly consider gene expression and histological 
features in clustering are needed. Further, cell type-specific gene 
expression needs to be estimated to tease out the contribution of cell 
types and spatial location in gene expression variation. We anticipate that methods development along these directions is warranted 
for future research.

In this paper, we introduced SpaGCN, a method that integrates gene expression, spatial location and histology to model the spatial dependency of gene expression for identifying spatial domains and domain-enriched SVGs. SpaGCN has been extensively tested on datasets from different species, regions and tissues generated using different SRT technologies. Additional analyses of ST, SLIDE-seqV2 and MERFISH data are presented in Supplementary Notes 1–3. Our results consistently show that SpaGCN can identify spatial domains with coherent gene expression and histology, and can detect SVGs and metagenes that have much clearer spatial expression patterns and biological interpretations than genes detected by SpatialDE and SPARK. Furthermore, SVGs detected by SpaGCN are transferable and can be used for downstream analyses in independent tissue sections. SpaGCN is also computationally fast and memory efficient compared with SPARK and SpatialDE (Supplementary Note 4).
The spatial domain detection step in SpaGCN is flexible. First, SpaGCN can adjust the weight of histology in gene expression smoothing. For datasets with clear tissue structure in histology, a higher weight led to clearer separation of cancer from noncancer regions. Second, during the GCN fitting procedure, the graph weights are updated, which allows SpaGCN to learn an efficient way to aggregate gene expression from neighboring spots for each gene. For data generated from different platforms, the spatial dependency between spots/cells differs because the size of the captured tissue area varies. The flexibility in modeling spatial dependency makes SpaGCN versatile for different types of SRT data.
A limitation of SpaGCN is that spatial domain detection is mainly driven by gene expression, which may lead to discrepancies between the detected domains and the underlying anatomical structure of the tissue. This is a general problem for gene expression-based clustering methods. Another limitation of SpaGCN is that it does not separate spatial variation from cell type variation in the gene expression patterns of the detected SVGs. To address these limitations, methods that can jointly consider gene expression and histological features in clustering are needed. Furthermore, cell type-specific gene expression needs to be estimated to tease apart the contributions of cell type and spatial location to gene expression variation. We anticipate that method development along these directions is warranted for future research.

Origin blog.csdn.net/qq_43369406/article/details/131706526