SpaGCN: Integrating gene expression, spatial location and histology to identify spatial domains an

Title

SpaGCN: Integrating gene expression, spatial location and histology to identify spatial domains and spatially variable genes by graph convolutional network

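SpaGCN is a graph convolutional network method that integrates gene expression, spatial location and histology to identify spatial domains and spatially variable genes.

SpaGCN combines gene expression, spatial location and histology to build a graph that represents the relationships among all spots in the data. A graph convolutional layer then aggregates gene expression from neighboring spots. SpaGCN clusters the spots on the aggregated expression matrix with an unsupervised iterative clustering algorithm and treats each cluster as a spatial domain. Finally, it runs differential expression analysis to detect spatially variable genes that are enriched in specific domains.

The key strength of SpaGCN is that it considers gene expression, spatial location and histology together, so it can identify spatial domains with coherent expression and histology and detect spatially variable genes with clear spatial expression patterns. Compared with other methods, the genes it detects are easier to interpret biologically and transfer better to other datasets, which makes them useful for follow-up analysis.

In short, by integrating several data sources through a graph convolutional network, SpaGCN gives spatial transcriptomics a tool for revealing how gene expression varies across the tissue microenvironment, and it offers leads for understanding cellular mechanisms and disease pathology.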

Abstract

Recent advances in spatially resolved transcriptomics (SRT) technologies have enabled comprehensive characterization 
of gene expression patterns in the context of tissue microenvironment. To elucidate spatial gene expression variation, we 
present SpaGCN, a graph convolutional network approach that integrates gene expression, spatial location and histology 
in SRT data analysis. Through graph convolution, SpaGCN aggregates gene expression of each spot from its neighboring 
spots, which enables the identification of spatial domains with coherent expression and histology. The subsequent domain 
guided differential expression (DE) analysis then detects genes with enriched expression patterns in the identified domains. 
Analyzing seven SRT datasets using SpaGCN, we show it can detect genes with much more enriched spatial expression patterns than competing methods. Furthermore, genes detected by SpaGCN are transferrable and can be utilized to study spatial 
variation of gene expression in other datasets. SpaGCN is computationally fast, platform independent, making it a desirable 
tool for diverse SRT studies.

Introduction

Recent technological advances in SRT have enabled gene expression profiling with spatial information in tissues1. Knowledge of the relative locations of different cells in a tissue is critical for understanding disease pathology because spatial information helps in understanding how the gene expression of a cell is influenced by its surrounding environment. Popular experimental methods for SRT can be broadly classified into two categories. The first category is in situ hybridization or sequencing-based technologies with single-cell resolution, which includes seqFISH2,3, seqFISH+4, MERFISH5,6, STARmap7 and FISSEQ8 that measure the expression level for hundreds to thousands of genes in cells within their tissue context. The second category is in situ capturing-based technologies with spatial barcoding followed by sequencing, which includes spatial transcriptomics (ST)9, SLIDE-seq10, SLIDE-seqV2 (ref. 11), HDST12 and 10x Visium that measure the expression level for thousands of genes in captured locations, referred to as spots. These different SRT technologies have made it possible to uncover the complex transcriptional architecture of heterogeneous tissues and enhanced our understanding of cellular mechanisms in diseases13,14.
In SRT studies, an important step is identifying spatial domains, defined as regions that are spatially coherent in both gene expression and histology. Traditional clustering methods such as K-means and Louvain’s method15 only take gene expression data as input, and the resulting clusters may not be contiguous because they ignore spatial information and histology. To account for spatial dependency of gene expression, new methods have been developed. For example, Zhu et al.16 use a hidden Markov random field (HMRF) approach to model spatial dependency of gene expression; stLearn17 uses features extracted from histology image as 
well as expression of neighboring spots spatially to normalize gene 
expression data before clustering; BayesSpace18 employs a Bayesian 
approach for clustering by imposing a prior that gives higher weight 
to physically close spots. Although these methods can cluster spots 
or cells into distinct groups, the lack of flexibility with different 
modalities has made them less versatile. As newer SRT technologies 
continue to be developed19–22, it is desirable to have methods that are 
compatible with different SRT platforms.
To link spatial domains with biological functions, it is crucial 
to identify genes that show enriched expression in the identified 
domains. Methods such as Trendsceek23, SpatialDE24 and SPARK25
have been developed to detect spatially variable genes (SVGs). These 
methods examine each gene independently and return a P value to 
represent the spatial variability of a gene. However, due to the lack 
of consideration of spatial domains, genes detected by these methods do not have guaranteed spatial expression patterns, making it 
difficult to utilize these genes for further biological investigations.
Rather than considering spatial domain and SVG identification 
as separate problems, we developed SpaGCN, a graph convolutional 
network (GCN)-based approach that considers these two problems 
jointly. SpaGCN first identifies spatial domains by integrating gene 
expression, spatial location and histology through the construction 
of an undirected weighted graph that represents the spatial dependency of the data. For each spatial domain, SpaGCN then detects SVGs that are enriched in the domain. By restricting the search 
space to spatial domains, the SVGs detected by SpaGCN are guaranteed to have spatial expression patterns. The spatial domains and 
the corresponding SVGs provide a comprehensive picture of the 
spatial gradients in gene expression in tissue. SpaGCN is versatile 
in analyzing many types of SRT data, including ST, 10x Visium, 
SLIDE-seqV2, STARmap, and MERFISH.

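SRT technologies fall into two classes according to the instrumentation used, iST and sST. iST is based on in situ hybridization, for example seqFISH, seqFISH+, MERFISH, STARmap and FISSEQ.
sST is based on in situ capturing, for example ST, SLIDE-seq, SLIDE-seqV2, HDST and 10x Visium.

An SRT analysis has two steps. The first step is identifying spatial domains. Common methods such as K-means and Louvain ignore spatial and histology information.
HMRF, stLearn (normalization) and BayesSpace (a prior) add spatial information, but they lack flexibility across modalities and are not very compatible across platforms.

The second step links the domains to biological functions, that is, it identifies genes with enriched expression in each domain. Trendsceek, SpatialDE
and SPARK detect spatially variable genes (SVGs) and use a P value to represent each gene's spatial variability, but they do not take spatial domains into account.

SpaGCN treats domain identification and SVG detection as a joint problem. It builds an undirected weighted graph that combines gene expression, spatial
location and histology, and uses it to identify spatial domains.

SVGs are then detected for each domain. The method works on many types of SRT data, such as ST, 10x Visium, SLIDE-seqV2, STARmap and MERFISH.

In situ hybridization corresponds to iST, which measures hundreds to thousands of genes per cell.

In situ capturing with spatial barcoding corresponds to sST, which measures thousands of genes per spot.

Heterogeneous tissues are tissues made up of different cells or cell populations.

Spatially variable genes are genes that are enriched differently across domains.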

Results / Experiments

Overview of SpaGCN and evaluation. We explain the workflow 
of SpaGCN using in situ capturing-based SRT data as an example, 
but the method can be easily modified to analyze other types of SRT 
data. As shown in Fig. 1a, SpaGCN first builds a graph to represent 
the relationship of all spots considering both spatial location and 
histology information. Next, SpaGCN utilizes a graph convolutional 
layer to aggregate gene expression information from neighboring 
spots. Then, SpaGCN uses the aggregated expression matrix to 
cluster spots using an unsupervised iterative clustering algorithm26. 
Each cluster is considered as a spatial domain from which SpaGCN 
then detects SVGs that are enriched in a domain by DE analysis 
(Fig. 1b). When a single gene cannot mark the expression pattern 
of a domain, SpaGCN will construct a meta gene, formed by the 
combination of multiple genes, to represent the expression pattern 
of the domain.
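The graph construction and aggregation steps described above can be sketched as follows. This is only an illustration under stated assumptions, not SpaGCN's implementation: the Gaussian kernel, the `length_scale` parameter and both function names are hypothetical, histology is left out, and the learned convolution filter is omitted so that only the neighbor averaging is visible.

```python
import math

def gaussian_weights(coords, length_scale):
    """Undirected weighted graph: w[u][v] = exp(-d(u, v)^2 / (2 * l^2)).

    The Gaussian kernel is an assumption made for this sketch.
    """
    n = len(coords)
    weights = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            d2 = sum((a - b) ** 2 for a, b in zip(coords[u], coords[v]))
            weights[u][v] = math.exp(-d2 / (2.0 * length_scale ** 2))
    return weights

def aggregate_expression(weights, expression):
    """Weighted sum of each spot's own and neighboring expression, per gene."""
    n_spots = len(expression)
    n_genes = len(expression[0])
    return [
        [sum(weights[u][v] * expression[v][g] for v in range(n_spots))
         for g in range(n_genes)]
        for u in range(n_spots)
    ]

# Toy data: spots 0 and 1 are close together, spot 2 is far away.
coords = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
expression = [[1.0, 0.0], [3.0, 0.0], [0.0, 5.0]]  # rows: spots, columns: genes

weights = gaussian_weights(coords, length_scale=1.0)
aggregated = aggregate_expression(weights, expression)
```

In the toy data, spot 0 borrows expression from the nearby spot 1 but receives almost nothing from the distant spot 2, which is the behavior that makes the resulting clusters spatially contiguous.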
To showcase the strength of SpaGCN, we applied it to seven publicly available datasets (Supplementary Table 1). The spatial domains 
identified by SpaGCN agree better with known tissue structures 
than Louvain, stLearn, and BayesSpace. We also compared SVGs 
detected by SpaGCN with those detected by SpatialDE and SPARK, 
and found that the SpaGCN-detected SVGs have more coherent 
expression patterns and better biological interpretability than the 
other two methods. The specificity of spatial expression patterns 
revealed by SpaGCN-detected SVGs was further confirmed by 
Moran’s I and Geary’s C statistics27, two commonly used metrics for 
quantifying spatial autocorrelation of gene expression28,29.
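For readers unfamiliar with the two statistics, the sketch below computes Moran's I and Geary's C from their standard textbook definitions. The spatial weight matrix (a simple chain of four spots) and the toy expression vectors are assumptions chosen for illustration; the source does not say which weights were used in the paper.

```python
def morans_i(values, weights):
    """Moran's I: positive when neighboring spots have similar values."""
    n = len(values)
    mean = sum(values) / n
    dev = [x - mean for x in values]
    w_total = sum(sum(row) for row in weights)
    num = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / w_total) * num / den

def gearys_c(values, weights):
    """Geary's C: below 1 when neighboring spots have similar values."""
    n = len(values)
    mean = sum(values) / n
    w_total = sum(sum(row) for row in weights)
    num = sum(weights[i][j] * (values[i] - values[j]) ** 2
              for i in range(n) for j in range(n))
    den = sum((x - mean) ** 2 for x in values)
    return ((n - 1) / (2.0 * w_total)) * num / den

# Four spots on a line; each spot neighbors the next one (symmetric weights).
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
clustered = [1.0, 1.0, 0.0, 0.0]    # spatially coherent pattern
alternating = [1.0, 0.0, 1.0, 0.0]  # no spatial coherence

i_clustered = morans_i(clustered, chain)
i_alternating = morans_i(alternating, chain)
c_clustered = gearys_c(clustered, chain)
c_alternating = gearys_c(alternating, chain)
```

The coherent pattern gives a positive Moran's I and a Geary's C below 1, while the alternating pattern gives a negative Moran's I and a Geary's C above 1. Note that the two statistics run in opposite directions, so a gene with a strong spatial pattern has a high Moran's I but a low Geary's C.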

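SpaGCN applies to in situ capturing-based SRT data, which is used here as the example. It first builds a graph that accounts for spatial location and histology
information;
it then uses a graph convolutional layer (GCL) to aggregate gene expression information from neighboring spots, which yields an aggregated expression matrix (AEM);
a clustering algorithm is applied to the AEM, grouping the spots into domains;
finally, DE analysis on each domain yields its SVGs. When a single gene cannot mark a domain, SpaGCN builds a meta gene from several genes to represent the domain's expression pattern.

For domain identification, SpaGCN does better than Louvain, stLearn and BayesSpace, with a higher ARI.
For SVG detection, it does better than SpatialDE and SPARK on Moran's I and Geary's C, two statistics that quantify spatial autocorrelation of gene expression.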

Application to human primary pancreatic cancer ST data. To 
demonstrate the importance of incorporating histology information, we analyzed a human primary pancreatic cancer dataset generated using the ST technology13. This dataset includes 224 spots 
and 16,448 genes with three manually annotated tissue regions. 
The cancer region detected by Louvain based on gene expression 
alone did not closely match the pathologist-annotated cancer region 
(Fig. 2a). Spatial clustering methods such as stLearn and BayesSpace 
did not detect the cancer region either. SpaGCN revealed a similar pattern when using default parameters. As the histology image 
shows a clear difference between the cancer and noncancer regions, 
it suggests histology is informative for clustering. SpaGCN has the 
flexibility of modeling histology with a scaling parameter s, which 
controls the weight given to histology when detecting neighbors 
for each spot. By increasing the value of s from 1 to 2, SpaGCN 
detected a cluster that agrees well with the manually annotated cancer region. It is worth noting that when s was set at the default value 
of 1, SpaGCN detected the noncancer regions well. When s was 
increased to 2, SpaGCN not only maintained the ability to detect 
the noncancer regions but also detected the cancer region. This 
example showed that SpaGCN is flexible in incorporating histology 
information in clustering. Although stLearn can incorporate histology data, its use of histology information is pre-fixed by the radius 
when defining neighboring spots. The lack of flexibility in adjusting 
histology weight led to the discrepancy between their clustering and 
the pathologist’s manual annotation.
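One way to picture the role of the scaling parameter s is to treat a histology-derived value as a third coordinate whose spread is matched to the spatial coordinates and then multiplied by s. The sketch below follows that idea. It is an assumption made for illustration: the per-spot value `z` (for example, a summary of pixel color around the spot), the rescaling formula and the function names are hypothetical and are not taken from the text above.

```python
import math
import statistics

def histology_coordinate(z_values, xs, ys, s=1.0):
    """Rescale a per-spot histology value to spatial units and weight it by s.

    Assumed rescaling: standardize z, stretch it to the larger of the x and y
    standard deviations, then multiply by the scaling parameter s.
    """
    mu = statistics.mean(z_values)
    sd_z = statistics.pstdev(z_values)
    if sd_z == 0:
        return [0.0] * len(z_values)  # flat image: histology adds no information
    spread = max(statistics.pstdev(xs), statistics.pstdev(ys))
    return [(z - mu) / sd_z * spread * s for z in z_values]

def distance(u, v, xs, ys, zs):
    """Euclidean distance between spots u and v in (x, y, z) space."""
    return math.sqrt((xs[u] - xs[v]) ** 2 + (ys[u] - ys[v]) ** 2 + (zs[u] - zs[v]) ** 2)

# Four spots on a line; spots 0-1 look different from spots 2-3 in the image.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 0.0, 0.0, 0.0]
z = [0.2, 0.2, 0.8, 0.8]  # hypothetical histology values

zs_s1 = histology_coordinate(z, xs, ys, s=1.0)
zs_s2 = histology_coordinate(z, xs, ys, s=2.0)

d_across_s1 = distance(1, 2, xs, ys, zs_s1)  # across the histology boundary
d_across_s2 = distance(1, 2, xs, ys, zs_s2)
d_within_s2 = distance(0, 1, xs, ys, zs_s2)  # same histology on both sides
```

Raising s pushes spots with different histology further apart while leaving spots with similar histology at their physical distance, so fewer neighbors are shared across a boundary that is visible in the image.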
Next, we detected SVGs using SpaGCN, SPARK and SpatialDE. 
In total, SpaGCN detected 12 SVGs, with three, eight and one SVGs 
for domains 0, 1 and 2, respectively (Fig. 2b; Supplementary Fig. 1). 
Furthermore, a meta gene using KRT17, MMP11 and SERPINA1 marked the cancer region better than the originally identified 
KRT17 for domain 2 (Fig. 2c). KRT17 functions as a tumor promoter 
and regulates proliferation in pancreatic cancer30, and MMP11 is a 
prognostic biomarker for pancreatic cancer31. Our identification of 
KRT17 and MMP11 as the two positive genes for the cancer region 
agrees well with pancreatic cancer biology. SPARK and SpatialDE 
detected 203 and 163 SVGs, with their P or Q values highly skewed 
towards 0 (Supplementary Figs. 2 and 3). However, the Moran’s 
I and Geary’s C values for their SVGs are much lower than those 
detected by SpaGCN, suggesting their lack of spatial patterns 
(Fig. 2d). Furthermore, genes with smaller P or Q values do not 
necessarily show better spatial expression patterns than those with 
larger P or Q values (Supplementary Figs. 4 and 5). More stringent 
filtering of spots and genes did not improve the spatial pattern for 
SpatialDE and SPARK-detected SVGs (Supplementary Fig. 6).

Application to human dorsolateral prefrontal cortex 10x Visium 
data. To show quantitatively that SpaGCN outperforms Louvain, 
stLearn and BayesSpace in spatial domain detection, we analyzed 
the LIBD human dorsolateral prefrontal cortex (DLPFC) data generated using 10x Visium32. This study sequenced 12 tissue slices that 
span six neuronal layers plus white matter from the DLPFC in three 
human brains. The manual annotation of the tissue layers provided 
by the original study allows us to evaluate the accuracy of spatial 
domain detection. Figure 3a shows that for the representative tissue slice 151673, both SpaGCN and BayesSpace revealed spatial 
domains that agree better with the manually annotated tissue layers 
than Louvain. Although stLearn utilized histology information, its 
performance is not much better than Louvain and is substantially 
worse than SpaGCN and BayesSpace. The relative performance 
of these methods remains the same when considering all 12 slices 
(Fig. 3b and Supplementary Table 2); the median ARI is 0.36 for 
stLearn, 0.42 for BayesSpace and 0.45 for SpaGCN.
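The ARI values quoted above compare a clustering against the manual layer annotation. As a reminder of what the number measures, here is a small self-contained implementation of the standard adjusted Rand index formula. The toy labels are made up for illustration and have nothing to do with the DLPFC data.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index: 1 for identical partitions, about 0 for random ones."""
    n = len(labels_true)
    # Count pairs of items that fall in the same cell of the contingency table.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions have a single cluster
    return (sum_cells - expected) / (max_index - expected)

annotation = [0, 0, 0, 1, 1, 1]
same_up_to_renaming = [1, 1, 1, 0, 0, 0]
over_split = [0, 0, 1, 1, 2, 2]

ari_perfect = adjusted_rand_index(annotation, same_up_to_renaming)
ari_partial = adjusted_rand_index(annotation, over_split)
```

The index ignores how clusters are numbered, which is why it suits comparisons between an unsupervised clustering and named tissue layers.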
To validate further the identified spatial domains, we detected 
SVGs for each domain in slice 151673. In total, SpaGCN detected 
67 SVGs, with 53 of them being specific to domain 5, which corresponds to white matter (Supplementary Fig. 7). Patterns of SVGs 
for other domains are not very clear. These results indicate that 
gene expression profiles of spots from white matter are distinct 
from spots in the neuronal layers, while gene expression differences 
among the six neuronal layers are much smaller and more difficult to distinguish using individual marker genes. SVGs detected 
by SPARK and SpatialDE also suffered from the same problem. 
SPARK detected 3,187 SVGs with 1,131 of them having false discovery rate (FDR)-adjusted P values equal to 0, most of which 
only marked the white matter region (Supplementary Figs. 8 and 
9). We also found that the SVGs detected by SPARK lack domain 
specificity (Supplementary Fig. 10). SpatialDE detected 3,654 SVGs 
with 806 of them having Q values equal to 0, but these genes do 
not necessarily show better spatial patterns than genes with larger 
Q values (Supplementary Fig. 11). Although SPARK and SpatialDE 
detected much larger numbers of SVGs than SpaGCN, the genes 
detected by these two methods cannot distinguish different degrees 
of spatial expression variability as their P or Q value distributions 
are highly skewed towards 0. Figure 3c shows that the Moran’s I values for SpaGCN-detected SVGs are significantly higher than genes 
detected by SpatialDE and SPARK (median of 0.39 for SpaGCN 
against 0.09 for SPARK and 0.08 for SpatialDE). More stringent 
filtering of spots and genes did not improve the performance of 
SpatialDE and SPARK (Supplementary Fig. 12). For three out of the 
six neuronal layers, SpaGCN detected a single SVG to mark that 
region (Fig. 3d). For example, CAMK2N1 is enriched in domain 0 
(layers 1 and 2), PCP4 is enriched in domain 1 (layer 4) and NEFM
is enriched in domain 3 (layer 3).
To show that SpaGCN-detected SVGs are useful for downstream 
analysis, we performed K-means clustering on slice 151507, which is from a different brain, using all 67 SVGs detected from slice 
151673 by SpaGCN. Compared with manually curated layer assignment, this clustering analysis had an Adjusted Rand Index (ARI) of 
0.23 (Fig. 3e,f). We performed similar analysis using SVGs detected 
by SpatialDE and SPARK. When randomly selecting 67 SVGs with 
0 P or Q value from genes detected by SpatialDE/SPARK, the ARI is 
only 0.13 for SpatialDE and 0.14 for SPARK. The ARIs for SpatialDE 
and SPARK did not improve even with increased numbers of SVGs 
(Fig. 3e). These results further confirmed the lack of spatial patterns 
for genes detected by SPARK and SpatialDE.
Although it is difficult to identify single genes to mark certain 
neuronal layers, SpaGCN was able to find domain-specific meta 
genes. As shown in Fig. 3g, SpaGCN detected meta genes for 
domains 1, 2, 4 and 6. The meta gene for domain 2 is specific to layer 1. As layer 1 only has a few spots, it is difficult to find a highly 
enriched gene. However, by adding depleted genes such as FTH1, 
MBP, MT-CO3 and PLP1, the expression pattern in this region is 
strengthened. Furthermore, the SVGs and meta genes detected by 
SpaGCN are transferrable to slice 151507 obtained from a different brain, in which the meta genes detected in slice 151673 mark 
the same layers in slice 151507 (Fig. 3g and Supplementary Fig. 13).
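The idea of strengthening a weak marker by subtracting depleted genes can be sketched as below. This is a hedged illustration only: the additive combination on log-scale expression, the shift to a non-negative range, the gene names `GENE_A` and `GENE_B` and all numbers are assumptions, and SpaGCN's actual search for which genes to add is not shown.

```python
def meta_gene(log_expr, positive, negative):
    """Per-spot score: enriched genes added, depleted genes subtracted.

    log_expr maps a gene name to its log-scale expression per spot.
    The score is shifted so that its minimum is 0.
    """
    n_spots = len(next(iter(log_expr.values())))
    raw = [
        sum(log_expr[g][i] for g in positive) - sum(log_expr[g][i] for g in negative)
        for i in range(n_spots)
    ]
    low = min(raw)
    return [r - low for r in raw]

def contrast(score, in_domain, out_domain):
    """Mean score inside the domain minus mean score outside it."""
    inside = sum(score[i] for i in in_domain) / len(in_domain)
    outside = sum(score[i] for i in out_domain) / len(out_domain)
    return inside - outside

# Toy values: spots 0-1 are in the target domain, spots 2-3 are not.
log_expr = {
    "GENE_A": [1.0, 0.8, 0.6, 0.5],  # only weakly enriched in the domain
    "GENE_B": [0.1, 0.2, 2.0, 1.8],  # depleted in the domain
}
in_domain, out_domain = [0, 1], [2, 3]

score = meta_gene(log_expr, positive=["GENE_A"], negative=["GENE_B"])
single_contrast = contrast(log_expr["GENE_A"], in_domain, out_domain)
meta_contrast = contrast(score, in_domain, out_domain)
```

In this toy case the weakly enriched gene alone separates the domain poorly, while the combined score separates it clearly, which mirrors the argument made for small domains such as layer 1.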

Application to mouse posterior brain 10x Visium data. Next, 
we analyzed a 10x Visium dataset generated from mouse posterior brain that includes 3,353 spots and 31,053 genes33. This dataset shows much more complex tissue structure than the previous 
two datasets. We compared the clustering result of SpaGCN with 
Louvain, stLearn and BayesSpace when the number of clusters was 
set at ten for all methods. Figure 4a shows that Louvain’s clustering is similar to stLearn, BayesSpace and SpaGCN, but the spatial 
domains detected by the latter three methods are more spatially 
contiguous due to their ability to account for spatial dependency of 
gene expression.
We further investigated the ability of each method in detecting 
more refined tissue structure. Specifically, we performed subclustering analysis for spots in domain 5 detected by SpaGCN, which 
corresponds to the cortex (Fig. 4b). The subdomains detected by 
SpaGCN agree well with the Allen Brain Institute reference atlas 
diagram of the mouse cortex (Fig. 4c). The detected subdomains 
include layers 2/3, layers 4/5, layer 6, a hippocampal region (CA1) 
and the subiculum. Layers 2/3 are the ‘external’ cortical layers that 
are biologically responsible for local networks in which neurons in 
this subdomain communicate to other neurons in adjacent neocortical regions. Layers 4/5 are the ‘internal’ cortical layers that are biologically responsible for longer range neural networks. For example, 
the visual cortex, which corresponds to the neocortical region, is 
responsible for receiving visual information from the lateral geniculate nucleus that is far away. SpaGCN was able to separate the 
molecular (layer 1), external (layers 2/3), internal (layers 4/5) and 
the plexiform (6) layers. More importantly, SpaGCN outperformed 
Louvain and stLearn, which show combining of neocortical layers. 
SpaGCN also outperformed BayesSpace in distinguishing between 
the plexiform layer (subdomain 1) and the non-neocortical CA1 
region of the hippocampus (subdomain 3). In contrast, BayesSpace 
combined layer 6 of the neocortex with the non-neocortical CA1 
layer of the hippocampus.
Next, we compared SpaGCN with SPARK and SpatialDE for 
SVG detection. SpaGCN detected 1,028 SVGs for the ten spatial 
domains while SPARK and SpatialDE detected 9,678 and 12,676 
SVGs, respectively (Supplementary Fig. 14). As shown in Fig. 4d, 
the Moran’s I values of SpaGCN-detected SVGs are much higher 
than those detected by SPARK and SpatialDE (median of 0.54 for 
SpaGCN against 0.20 for SPARK and 0.16 for SpatialDE). More 
stringent filtering of spots and genes did not improve the performance of SPARK and SpatialDE (Supplementary Fig. 15). The 
P or Q value distributions of SpatialDE and SPARK are highly skewed towards 0 (Supplementary Fig. 16), and genes with similar P or Q values do not necessarily show similar spatial patterns 
and a smaller P or Q value does not guarantee a better spatial pattern (Supplementary Figs. 17 and 18). In contrast, multiple domain 
adaptive filtering criteria implemented in SpaGCN allow it to eliminate false positive SVGs and ensure all detected SVGs have clear 
spatial expression patterns.
To illustrate how the filtering in SpaGCN works, we use domains 
1, 5 and 8 as an example. For each of these domains, SpaGCN 
detected a single SVG enriched in that region. As shown in Fig. 4e, 
PVALB is enriched in domain 1 and TRM62 is enriched in domain 
8. Although domains 1 and 8 are adjacent to each other, these 
two SVGs still mark these domains well. NRGN is an SVG that 
SpaGCN detected for domains 5 and 7. The high expression of 
NRGN in domains 5 and 7 also indicates that these two domains are 
neuroanatomically similar—both consisting of cortex and the pyramidal layer of the hippocampus. Both the cortex and hippocampus 
are regions that are on the curved surface of the brain. Domains 
5 and 7, which would be contiguous in a three-dimensional (3D) 
reconstruction, are artifactually separated as a result of how the section was cut. Therefore, it is not surprising that in addition to NRGN, 
SpaGCN also detected many other SVGs for domains 5 and 7, some 
of which are highly expressed in both domains (Supplementary 
Fig. 19). The unique and powerful SVG detection procedure in 
SpaGCN ensures that genes such as these are not missed.
SpaGCN only identified four SVGs for domain 0. However, we 
reason that a meta gene, formed by the combination of multiple 
genes, may better reveal spatial patterns than any single genes. We 
used domain 0 as an example to show how SpaGCN can create 
informative meta genes to mark a spatial domain (Fig. 4f). First, by 
lowering the filtering thresholds, SpaGCN identified KLK6 which 
is highly expressed in the lower part of domain 0. Using KLK6 as a 
starting gene, SpaGCN used a novel approach to find a log-linear 
combination of gene expression of KLK6, MBP and ATP1B1, which 
accurately marked the spatial domain 0. In this meta gene, KLK6
and MBP are considered as positive markers because they are 
highly expressed in some spots in domain 0, whereas ATP1B1 is 
considered a negative marker as it is mainly expressed in regions 
other than domain 0. Previous studies have shown that KLK6 and 
MBP expression is restricted to oligodendrocytes, while ATP1B1 is 
mainly expressed in neurons and astrocytes34. This resonates with 
the fact that domain 0 represents white matter which is dominated 
by oligodendrocytes and has few neuronal cell bodies. Therefore, 
the genes that make up this meta gene have meaningful biological 
interpretations.
While we focused our analyses on one tissue section, SpaGCN 
can also jointly analyze multiple tissue sections. We show two examples using this mouse brain Visium data provided by 10x Genomics. 
Figure 5a shows SpaGCN clustering results for two mouse posterior sections. As these two tissue sections are from the same region, 
SpaGCN was able to infer cluster correspondence between the two 
tissue sections. Next, we used SpaGCN to analyze jointly two tissue sections with one from the mouse posterior brain and the other 
from the mouse anterior brain. As the anterior section and posterior section are adjacent in the brain, we modified the coordinates 
for spots in the posterior section such that the revised coordinates 
reflect the spatial adjacency of the two tissue sections. Using the 
modified coordinates as input, SpaGCN was able to produce clustering results that reflect the shared layer structure in the anterior 
and posterior brain (Fig. 5b).
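The coordinate modification used for the joint analysis of adjacent sections can be pictured as a simple shift. The sketch below places a second section to the right of the first one. The direction of the shift, the `gap` value and the function name are assumptions for illustration; the source does not specify how the coordinates were revised.

```python
def place_side_by_side(coords_a, coords_b, gap=1.0):
    """Shift section B along x so that it starts `gap` units right of section A."""
    max_x_a = max(x for x, _ in coords_a)
    min_x_b = min(x for x, _ in coords_b)
    offset = max_x_a + gap - min_x_b
    return coords_a + [(x + offset, y) for x, y in coords_b]

# Each section has its own coordinate system that starts at the origin.
section_a = [(0.0, 0.0), (5.0, 0.0)]
section_b = [(0.0, 0.0), (4.0, 1.0)]

combined = place_side_by_side(section_a, section_b, gap=1.0)
```

After the shift, spots at the touching edges of the two sections become spatial neighbors, so a graph built on `combined` can link domains that continue from one section into the other.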

Application to mouse visual cortex STARmap data. Finally, we 
analyzed a STARmap dataset that has single-cell resolution7. This 
dataset was generated from mouse visual cortex that spans from 
hippocampus to corpus callosum, and the six neocortical layers. 
In total, 1,020 genes were measured in 1,207 cells that include 
non-neuronal cells, excitatory and inhibitory neurons. The layer 
structure and cell type distribution of the tissue section provided 
by the original study are shown in Fig. 6a. As the tissue capture area 
of STARmap is much smaller than 10x Visium, we increased the 
contribution of neighboring cells from 0.5 to 1 when calculating 
the weighted gene expression of each cell in SpaGCN. Using this 
approach, SpaGCN detected spatial domains that agreed well with 
the annotated tissue structure (Fig. 6a,c), achieving an ARI of 0.51. 
By contrast, the ARIs of the other methods are much lower (0.30 for 
Louvain, 0.37 for BayesSpace and 0.03 for HMRF) (Fig. 6b). This 
example demonstrates that SpaGCN utilizes spatial information 
more efficiently than BayesSpace and HMRF. Using SpaGCN, we 
further detected 25 SVGs including genes LAMP5, HPCAL1, CPLX1, 
PLP1, NRSN1, ATP1A2 and BSG that showed enriched expression 
patterns for domains 0 to 6 (Fig. 6e and Supplementary Fig. 20). 
Similar to previous analyses, SPARK and SpatialDE detected much 
larger number of SVGs but many of the SVGs lack spatial expression 
patterns (Fig. 6d and Supplementary Figs. 21–24).
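The "contribution of neighboring cells" mentioned above can be read as the total weight a cell gives to its neighbors relative to itself. Under that reading, a kernel length scale can be tuned until the average neighbor weight reaches the target (0.5 or 1). The sketch below does this by bisection. It rests on assumptions: the Gaussian kernel, the definition of the contribution as the mean sum of off-diagonal weights, and the function names are all hypothetical.

```python
import math

def neighbor_contribution(dist, length_scale):
    """Mean over cells of the summed Gaussian weights given to other cells.

    Each cell's own weight is 1, so this is the neighbors' share relative to self.
    """
    n = len(dist)
    total = 0.0
    for u in range(n):
        total += sum(
            math.exp(-dist[u][v] ** 2 / (2.0 * length_scale ** 2))
            for v in range(n) if v != u
        )
    return total / n

def search_length_scale(dist, target_p, low=1e-3, high=1e3, tol=1e-6, max_iter=200):
    """Bisection on the length scale; the contribution grows with the length scale."""
    for _ in range(max_iter):
        mid = 0.5 * (low + high)
        p = neighbor_contribution(dist, mid)
        if abs(p - target_p) < tol:
            return mid
        if p < target_p:
            low = mid
        else:
            high = mid
    return 0.5 * (low + high)

# Four cells on a line at unit spacing: dist[u][v] = |u - v|.
dist = [[abs(u - v) for v in range(4)] for u in range(4)]

l_half = search_length_scale(dist, target_p=0.5)
l_one = search_length_scale(dist, target_p=1.0)
```

A larger target contribution yields a larger length scale, meaning each cell's aggregated expression draws more heavily on its surroundings, which is the adjustment described for the small STARmap capture area.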

Discussion

In this paper, we presented SpaGCN, a method that integrates gene expression, spatial location and histology to model the spatial dependency of gene expression, in order to identify spatial domains and domain-enriched SVGs. SpaGCN has been extensively tested on datasets from different species, regions and tissues generated using different SRT technologies. Additional analyses of ST, SLIDE-seqV2 and MERFISH data are shown in Supplementary Notes 1-3. Our results consistently showed that SpaGCN can identify spatial domains with coherent gene expression and histology, and detect SVGs and meta genes that have much clearer spatial expression patterns and biological interpretations than genes detected by SpatialDE and SPARK. Additionally, the SpaGCN-detected SVGs are transferrable and can be utilized for downstream analyses in independent tissue sections. SpaGCN is also computationally fast and memory efficient compared to SPARK and SpatialDE (Supplementary Note 4).
The spatial domain detection step in SpaGCN is flexible. First, SpaGCN can adjust the weight of histology in gene expression smoothing. For datasets with clear tissue structure in histology, higher weight led to clearer separation of cancer versus noncancer regions. Second, during the GCN fitting procedure, the graph weights are updated, which allows SpaGCN to learn an efficient way to aggregate gene expression from neighboring spots for each gene. For data generated from different platforms, the spatial dependency between spots/cells is different as the size of the captured tissue area varies. The flexibility in modeling spatial dependency makes SpaGCN versatile for different types of SRT data.
A limitation of SpaGCN is that the spatial domain detection is mainly driven by gene expression, which may lead to the discrepancy between the detected domains and the underlying tissue anatomical structure. This is a general problem for gene expression-based clustering methods. Another limitation of SpaGCN is the lack of separation of spatial variation and cell type variation in gene expression patterns for the detected SVGs. To address these limitations, methods that can jointly consider gene expression and histological features in clustering are needed. Further, cell type-specific gene expression needs to be estimated to tease out the contribution of cell types and spatial location in gene expression variation. We anticipate that methods development along these directions is warranted for future research.