Seurat package standard workflow

1 Install the packages that need to be loaded

# Package names are case-sensitive: the CRAN package is 'Seurat', not 'seurat'
install.packages(c('dplyr','patchwork','Seurat'))

2 Load the required packages

library(dplyr)
library(Seurat)
library(patchwork)

3 Read the data

pbmc.data <- Read10X(data.dir = "../data/pbmc3k/filtered_gene_bc_matrices/hg19/")
# Set the path to wherever you downloaded the data

4 Create a Seurat object

pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
# Cells with fewer than 200 detected genes (min.features = 200) and genes detected in fewer than 3 cells (min.cells = 3) are filtered out
pbmc

5 Standard preprocessing workflow

The following steps cover the standard preprocessing workflow for scRNA-seq data in Seurat: filtering cells based on QC metrics, normalizing and scaling the data, and detecting highly variable features.

QC (quality control) and selection of cells for further analysis

Seurat allows you to easily explore QC metrics and filter cells based on any user-defined criteria. Some commonly used QC metrics include:

  • The number of unique genes detected in each cell.

      • Low-quality cells or empty droplets often have very few genes.

      • Cell doublets or multiplets may exhibit an abnormally high gene count.

  • Likewise, the total number of molecules detected within a cell (which correlates strongly with the number of unique genes).

  • The percentage of reads that map to the mitochondrial genome.

      • Low-quality or dying cells often exhibit extensive mitochondrial contamination.

      • We calculate the mitochondrial QC metric with the PercentageFeatureSet() function, which calculates the percentage of counts originating from a set of features.

      • We use the set of all genes starting with MT- as the set of mitochondrial genes.

# The [[ operator can add columns to the object metadata. This is a good place to store QC statistics.
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
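
To make the metric concrete, here is a minimal base-R sketch of what a percentage-of-counts calculation looks like. It uses a small made-up count matrix (the gene names and values are invented for illustration, not taken from the PBMC data) and assumes the metric is simply the counts in MT- genes divided by each cell's total counts, times 100.

```r
# Toy count matrix: genes in rows, cells in columns (values are invented)
toy_counts <- matrix(
  c(10,  0,  5,    # MT-CO1
     5,  0,  5,    # MT-ND1
    85, 50, 90),   # ACTB
  nrow = 3, byrow = TRUE,
  dimnames = list(c("MT-CO1", "MT-ND1", "ACTB"),
                  c("cell1", "cell2", "cell3"))
)

# Select rows whose names start with "MT-", the same pattern used above
mt_genes <- grepl("^MT-", rownames(toy_counts))

# Percentage of each cell's counts that come from mitochondrial genes
toy_percent_mt <- colSums(toy_counts[mt_genes, , drop = FALSE]) /
  colSums(toy_counts) * 100
print(toy_percent_mt)
```

In this toy matrix, cell1 has 15 of its 100 counts in MT- genes and cell2 has none, so a 5% cutoff would remove cell1 and keep cell2.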

In the example below, we visualize QC metrics and use these metrics to filter cells.

  • We filter out cells with unique feature counts over 2,500 or under 200

  • We filter out cells with more than 5% mitochondrial counts

# Visualize the QC metrics as violin plots, then use these metrics to filter cells
VlnPlot(pbmc,features = c("nFeature_RNA","nCount_RNA","percent.mt"),ncol = 3)

# FeatureScatter is typically used to visualize the relationship between two features
plot1<-FeatureScatter(pbmc,feature1 = "nCount_RNA",feature2 = "percent.mt")
plot2<-FeatureScatter(pbmc,feature1 = "nCount_RNA",feature2 = "nFeature_RNA")
plot1+plot2

# Filter out cells with unique feature counts over 2,500 or under 200, and cells with more than 5% mitochondrial counts
pbmc<-subset(pbmc,subset=nFeature_RNA>200 & nFeature_RNA<2500 & percent.mt<5)
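
Before committing to thresholds, it can help to see how many cells each criterion removes on its own. The sketch below does this in base R on an invented metadata table; the column names mirror the ones Seurat uses, but the values and cell names are hypothetical.

```r
# Toy per-cell metadata (invented values) using Seurat's column names
toy_meta <- data.frame(
  nFeature_RNA = c(150, 800, 1200, 3000, 950),
  percent.mt   = c(2.0, 3.5, 7.2, 1.0, 4.9),
  row.names    = paste0("cell", 1:5)
)

# One logical vector per criterion, so each filter can be inspected separately
enough_genes <- toy_meta$nFeature_RNA > 200
not_too_many <- toy_meta$nFeature_RNA < 2500
low_mito     <- toy_meta$percent.mt < 5

# A cell is kept only if it passes all three criteria
keep <- enough_genes & not_too_many & low_mito

# Number of cells failing each criterion, then the cells that survive
print(c(too_few_genes  = sum(!enough_genes),
        too_many_genes = sum(!not_too_many),
        high_mito      = sum(!low_mito)))
print(rownames(toy_meta)[keep])
```

On a real object the same logic could be applied to the metadata data frame; if one criterion removes most of the cells, that threshold is probably worth revisiting for the dataset at hand.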

6 Normalize the data

After removing unwanted cells from the dataset, the next step is to normalize the data. By default, we employ a global-scaling normalization method, "LogNormalize", which normalizes the feature expression measurements for each cell by the total expression, multiplies the result by a scale factor (10,000 by default), and log-transforms it. Normalized values are stored in pbmc[["RNA"]]@data.

pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)

For clarity, in the previous line of code (and in later commands) we spell out the default values of certain parameters in the function call. This is not required; the same behavior can be achieved with:

pbmc <- NormalizeData(pbmc)
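
The arithmetic behind this kind of normalization can be sketched in a few lines of base R. The example assumes the transformation is log1p(count / total counts in the cell × scale factor) with a natural logarithm; the toy matrix and its gene names are invented for illustration.

```r
# Toy count matrix: genes in rows, cells in columns (values are invented)
toy_counts <- matrix(
  c(2,  0,
    3, 10,
    5, 10),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), c("cell1", "cell2"))
)

scale_factor <- 10000

# Divide each column (cell) by its total counts, multiply by the scale factor,
# then apply log(1 + x) so that zero counts stay at zero
toy_norm <- log1p(sweep(toy_counts, 2, colSums(toy_counts), "/") * scale_factor)
print(round(toy_norm, 3))
```

Because every cell is divided by its own total, cell2 (20 counts) and cell1 (10 counts) end up on a comparable scale even though cell2 was sequenced more deeply.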

7 Feature Selection

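Next, we select the features that show high cell-to-cell variation in the dataset. By default, FindVariableFeatures() returns 2,000 features per dataset. These are used in downstream analysis such as PCA.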

pbmc<-FindVariableFeatures(pbmc,selection.method = "vst",nfeatures = 2000)
# Identify the 10 most highly variable genes
top10<-head(VariableFeatures(pbmc),10)
plot1<-VariableFeaturePlot(pbmc)
plot2<-LabelPoints(plot = plot1,points = top10,repel = TRUE)
plot1+plot2
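
As a rough intuition for what "highly variable" means, the sketch below ranks genes by a simple dispersion measure (variance divided by mean) on simulated counts. This is deliberately not the "vst" method used above, which is more involved; it is a simpler stand-in, and all gene names and simulation settings are invented.

```r
set.seed(1)
n_cells <- 200
n_genes <- 50

# Simulate Poisson counts: for these genes the variance is close to the mean
gene_means <- seq(1, 20, length.out = n_genes)
toy_counts <- t(sapply(gene_means, function(m) rpois(n_cells, lambda = m)))
rownames(toy_counts) <- paste0("gene", seq_len(n_genes))

# Plant one overdispersed gene (negative binomial, variance far above the mean)
toy_counts["gene25", ] <- rnbinom(n_cells, mu = 10, size = 0.5)

# Dispersion = variance / mean, computed per gene across cells
gene_mean  <- rowMeans(toy_counts)
gene_var   <- apply(toy_counts, 1, var)
dispersion <- gene_var / gene_mean

# Genes with the highest dispersion are the candidate variable features
top_variable <- names(sort(dispersion, decreasing = TRUE))[1:5]
print(top_variable)
```

Dividing by the mean matters because raw variance grows with expression level; without some correction for that trend, the ranking would mostly pick out highly expressed genes rather than variable ones.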

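8 Scale the data

The ScaleData() function:

  • Shifts the expression of each gene so that the mean expression across cells is 0

  • Scales the expression of each gene so that the variance across cells is 1

      • This step gives each gene equal weight in downstream analyses, so that highly expressed genes do not dominate

  • The results are stored in pbmc[["RNA"]]@scale.data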

all.genes<-rownames(pbmc)
pbmc<-ScaleData(pbmc,features = all.genes)
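
The centering and scaling described in this section amounts to a per-gene z-score. A minimal base-R sketch on an invented matrix is shown below; it covers only the shift-and-scale part and leaves out anything else ScaleData() may do, such as clipping extreme values.

```r
# Toy expression matrix: genes in rows, cells in columns (values are invented)
toy_expr <- matrix(
  c( 1,  2,  3,  4,
    10, 20, 30, 40),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("geneA", "geneB"), paste0("cell", 1:4))
)

# scale() works column-wise, so transpose to put genes in columns,
# z-score each gene, then transpose back to genes-by-cells
toy_scaled <- t(scale(t(toy_expr), center = TRUE, scale = TRUE))

print(round(rowMeans(toy_scaled), 10))  # per-gene means after scaling
print(apply(toy_scaled, 1, var))        # per-gene variances after scaling
```

geneB is expressed ten times higher than geneA, yet after scaling both rows are identical, which is exactly why highly expressed genes no longer dominate downstream steps.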

9 Linear dimensional reduction

pbmc<-RunPCA(pbmc,features = VariableFeatures(object=pbmc))
print(pbmc[["pca"]],dims = 1:5,nfeatures = 5)

# Seurat provides several useful ways of visualizing both the cells and the features that define the PCA, including VizDimLoadings(), DimPlot(), and DimHeatmap()
VizDimLoadings(pbmc,dims = 1:2,reduction = "pca")

DimPlot(pbmc,reduction = "pca")

DimHeatmap(pbmc,dims = 1,cells = 500,balanced = TRUE)

DimHeatmap(pbmc,dims = 1:15,cells=500,balanced = TRUE)

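10 Determine the dimensionality of the dataset

Seurat clusters cells based on their PCA scores, with each PC essentially representing a "metafeature" that combines information across a correlated set of features. To decide how many PCs to keep, the JackStraw procedure randomly permutes a subset of the data (1% by default) and reruns PCA, constructing a "null distribution" of feature scores, and repeats this process. "Significant" PCs are those with a strong enrichment of low p-value features.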

pbmc<-JackStraw(pbmc,num.replicate = 100)
pbmc<-ScoreJackStraw(pbmc,dims = 1:20)
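# JackStrawPlot() provides a visualization tool for comparing the distribution of p-values for each PC with a uniform distribution (dashed line). "Significant" PCs appear as curves above the dashed line.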
JackStrawPlot(pbmc,dims = 1:15)

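ElbowPlot() ranks the principal components by the percentage of variance each one explains. In this example, we can observe an "elbow" around PC9-10, suggesting that the majority of the true signal is captured in the first 10 PCs.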
ElbowPlot(pbmc)
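
The quantity behind an elbow plot can be computed by hand. The sketch below uses base R's prcomp() on simulated data (not a Seurat object) in which two variables share one strong signal, and derives the percentage of variance explained by each PC from the component standard deviations.

```r
set.seed(42)

# 100 samples x 6 variables: the first two share one signal, the rest are noise
signal <- rnorm(100)
toy_data <- cbind(
  signal + rnorm(100, sd = 0.1),
  signal + rnorm(100, sd = 0.1),
  matrix(rnorm(100 * 4, sd = 0.3), ncol = 4)
)

pca <- prcomp(toy_data, center = TRUE, scale. = FALSE)

# Variance of each PC is its standard deviation squared;
# express each as a percentage of the total variance
pct_var <- pca$sdev^2 / sum(pca$sdev^2) * 100
print(round(pct_var, 1))
```

Here the first PC should carry most of the variance and the remaining ones form a flat tail; the "elbow" is the point where the values stop dropping sharply.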

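11 Cluster the cells

# Build a KNN graph and refine the edge weight between any two cells based on the shared overlap in their local neighborhoods
pbmc<-FindNeighbors(pbmc,dims = 1:10)
# Cluster the cells by applying a modularity optimization technique, either the Louvain algorithm (default) or SLM,
# which iteratively groups cells together with the goal of optimizing the standard modularity function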
pbmc<-FindClusters(pbmc,resolution = 0.5)
head(Idents(pbmc),5)
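
The "shared overlap in local neighborhoods" idea can be illustrated in base R. The sketch below finds each cell's k nearest neighbors in a toy 2-D space and scores every pair of cells by the Jaccard index of their neighbor sets. It assumes Jaccard overlap as the edge weight and skips details such as pruning weak edges; the coordinates, k, and cell names are invented.

```r
set.seed(7)

# Two well-separated toy groups of 10 cells each in a 2-D space
toy_coords <- rbind(
  matrix(rnorm(20, mean = 0),  ncol = 2),
  matrix(rnorm(20, mean = 20), ncol = 2)
)
rownames(toy_coords) <- paste0("cell", 1:20)
k <- 5

# k nearest neighbors of each cell (a cell counts as its own nearest neighbor)
d   <- as.matrix(dist(toy_coords))
knn <- t(apply(d, 1, function(row) order(row)[1:k]))

# Jaccard index between the neighbor sets of every pair of cells
n <- nrow(toy_coords)
jaccard <- matrix(0, n, n,
                  dimnames = list(rownames(toy_coords), rownames(toy_coords)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    shared <- length(intersect(knn[i, ], knn[j, ]))
    jaccard[i, j] <- shared / length(union(knn[i, ], knn[j, ]))
  }
}

# Largest edge weight between the two groups
print(max(jaccard[1:10, 11:20]))
```

Cells from different groups share no neighbors, so all edges between the groups have weight zero, and a community detection algorithm run on this graph would have a clear boundary to find.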

12 Run non-linear dimensional reduction (UMAP/tSNE)

pbmc<-RunUMAP(pbmc,dims = 1:10)
DimPlot(pbmc,reduction = "umap")

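13 Find differentially expressed features (cluster biomarkers)

# FindAllMarkers() automates this process for all clusters; you can also test groups of clusters against each other, or against all cells
# By default, FindMarkers() identifies positive and negative markers of a single cluster (specified in ident.1) compared to all other cells
# The min.pct argument requires a feature to be detected at a minimum percentage in either of the two groups of cells
# The logfc.threshold argument requires a feature to be differentially expressed (on average) by some amount between the two groups
# Find all markers of cluster 2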
cluster2.markers<-FindMarkers(pbmc,ident.1 = 2,min.pct = 0.25)
head(cluster2.markers,n=5)
# Find all markers that distinguish cluster 5 from clusters 0 and 3
cluster5.markers<-FindMarkers(pbmc,ident.1 = 5,ident.2=c(0,3),min.pct = 0.25)
head(cluster5.markers,n=5)
# Find markers for every cluster compared to all remaining cells, and report only the positive ones
pbmc.markers<-FindAllMarkers(pbmc,only.pos = TRUE,min.pct = 0.25,logfc.threshold =0.25 )
pbmc.markers %>%
  group_by(cluster) %>%
  slice_max(n=2,order_by = avg_log2FC)
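
To show what a single marker test involves, the sketch below compares one simulated gene between a cluster and all other cells using base R's wilcox.test(), on the assumption that a Wilcoxon rank-sum test is the default test applied by FindMarkers(). The expression values, group sizes, and variable names are invented.

```r
set.seed(3)

# Simulated normalized expression of one gene (negative values clipped to 0)
in_cluster  <- pmax(rnorm(50,  mean = 2.0, sd = 0.5), 0)
other_cells <- pmax(rnorm(150, mean = 0.2, sd = 0.5), 0)

# Fraction of cells in each group where the gene is detected,
# the quantity that a min.pct-style filter looks at
pct_in    <- mean(in_cluster > 0)
pct_other <- mean(other_cells > 0)

# Rank-based test for a shift in expression between the two groups
test <- wilcox.test(in_cluster, other_cells)

print(c(pct_in = pct_in, pct_other = pct_other))
print(test$p.value)
```

A real marker search repeats this kind of comparison for every gene that passes the detection and fold-change filters, which is why raising min.pct or logfc.threshold speeds it up.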
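# Seurat also includes tools for visualizing marker expression. VlnPlot() (shows expression probability distributions across clusters) and FeaturePlot() (visualizes feature expression on a tSNE or PCA plot) are the most commonly used; RidgePlot(), CellScatter(), and DotPlot() are also worth exploring as other ways to view the dataset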
VlnPlot(pbmc,features = c("MS4A1","CD79A"))
# You can also plot the raw counts
VlnPlot(pbmc,features = c("NKG7","PF4"),slot = "counts",log = TRUE)
FeaturePlot(pbmc,features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP",
                              "CD8A"))

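# DoHeatmap() generates an expression heatmap for the given cells and features. In this case, we plot the top 10 markers (or all markers, if fewer than 10) for each cluster.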
pbmc.markers%>%
  group_by(cluster)%>%
  top_n(n=10,wt=avg_log2FC)->top10
DoHeatmap(pbmc,features = top10$gene)+NoLegend()

14 Assign cell type identities to clusters

new.cluster.ids<-c("Naive CD4 T", "CD14+ Mono", "Memory CD4 T", "B", "CD8 T", "FCGR3A+ Mono",
                   "NK", "DC", "Platelet")
names(new.cluster.ids)<-levels(pbmc)
pbmc<-RenameIdents(pbmc,new.cluster.ids)
DimPlot(pbmc,reduction = "umap",label = TRUE,pt.size = 0.5)+NoLegend()

saveRDS(pbmc,file = "./pbmc3k_final.rds")

Origin blog.csdn.net/weixin_56845253/article/details/129249902