Circos plot visualization in R

Prepare data

Two datasets need to be prepared: a gene expression profile and an annotation of the genes (KO annotation here, but any other annotation works too).

Gene expression profile

sample1 sample2 sample3 ...
gene1 1.0 2.0 2.0 ...
gene2 3.0 3.0 4.0 ...
gene3 5.0 5.0 5.0 ...
gene4 6.0 7.0 9.0 ...
... ... ... ... ...

Pathway annotation

gene KO pathway
gene1 KO1 pathway1
gene2 KO2 pathway1
gene3 KO2 pathway2
... ... ...

Simulated data

library(tidyverse)
library(magrittr)
library(circlize)
# Simulated data
## Data1: expression matrix, 500 genes x 9 samples (groups A, B, C)
fpkm <- cbind(matrix(rnorm(500*3, mean = 1), nrow = 500),
              matrix(rnorm(500*3, mean = 2), nrow = 500),
              matrix(rnorm(500*3, mean = 3), nrow = 500))
fpkm <- fpkm[sample(500, 500), ] # randomly permute rows
rownames(fpkm) <- paste0("gene", seq(500))
colnames(fpkm) <- c("A1", "A2", "A3", "B1", "B2", "B3", "C1", "C2", "C3")
fpkm %<>% as.data.frame() %>% mutate(gene = row.names(.))
## Data2: gene-KO-pathway annotation for 200 randomly chosen genes
pathways <- rep(paste0("pathway", seq(6)),  sample(12:20, size = 6)) %>% sample(70)
KOs <- rep(paste0("KO", seq(20)),  sample(5:20, size = 20, replace = TRUE)) %>% sample(70)
KOannotation <- data.frame(KO=KOs, pathway=pathways)
KOannotation <- KOannotation[sample(70, 200, TRUE),]
KOannotation$gene <- sample(paste0("gene", seq(500)),200)

# Suppose these are the enriched pathways you want to visualize
maps <- c("pathway1", "pathway2", "pathway3")
samples <- c("A1", "A2", "A3", "B1", "B2", "B3", "C1", "C2", "C3")

Think about the data each plot element needs

First, a brief introduction to the Circos object. Just as an ordinary plot has an x-axis and a y-axis, a Circos plot can be understood as having many axes (how many is determined by your data). Each axis naturally has positions along it (1, 2, 3, 4 in the figure below).

So it is easy to see that drawing a link requires data like this:

from_axis from_position to_axis to_position
A 1 B 4
A 2 C 5
A 3 D 6

Furthermore, if you want to control the width of the link, you should specify the starting and ending positions of each link.

from_axis from_position_start from_position_end to_axis to_position_start to_position_end
A 0.5 1.5 B 3.5 4.5
A 1.5 2.5 C 4.5 5.5
A 2.5 3.5 D 5.5 6.5
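
As a rough sketch of how such a table turns into links, the toy example below draws the three links with circos.link() from circlize. The sector names A-D and the x-range 0 to 10 are assumptions made only for this illustration; they are not part of the real data.

```r
library(circlize)

# Toy link table (same values as the table above)
links <- data.frame(
  from_axis  = c("A", "A", "A"),
  from_start = c(0.5, 1.5, 2.5),
  from_end   = c(1.5, 2.5, 3.5),
  to_axis    = c("B", "C", "D"),
  to_start   = c(3.5, 4.5, 5.5),
  to_end     = c(4.5, 5.5, 6.5)
)

# Assumed layout: four sectors, each with an x-range of 0 to 10
circos.clear()
circos.initialize(c("A", "B", "C", "D"), xlim = c(0, 10))
circos.track(ylim = c(0, 1), track.height = 0.1, panel.fun = function(x, y) {
  circos.text(CELL_META$xcenter, CELL_META$ycenter, CELL_META$sector.index)
})

# One circos.link() call per row; a length-2 vector gives the link a width
for (i in seq_len(nrow(links))) {
  circos.link(links$from_axis[i], c(links$from_start[i], links$from_end[i]),
              links$to_axis[i],   c(links$to_start[i],   links$to_end[i]),
              col = "#fb9b9a80", border = NA)
}
circos.clear()
```

Passing a single number instead of a length-2 vector draws a thin point-to-point link, which corresponds to the first table.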

Having thought about links, consider the heat map. First, note that the heat map is drawn with circos.heatmap() (as shown above). Its data is relatively simple, so draw the heat map first and then work out how to draw the chord diagram in the middle.

Heat map

Data cleaning

# Heat map color mapping
col_fun1 = colorRamp2(c(-2, 0, 2), c("#247ab5", "white", "#fda1a0"))
# Genes to plot
plot_gene <- KOannotation %>% 
  filter(pathway %in% maps) %>% 
  pull(gene) %>%
  {filter(fpkm, row.names(fpkm) %in% .)}
# KOs to plot (expression summed over member genes)
plot_KO <- plot_gene %>% 
  left_join(KOannotation, by = "gene") %>%
  filter(pathway %in% maps) %>% # drop pathways that are not enriched
  group_by(KO) %>%
  summarise(across(all_of(samples), sum))
# Pathways to plot (expression summed over member genes)
plot_map <- plot_gene %>% 
    left_join(KOannotation, by = "gene") %>%
    filter(pathway %in% maps) %>% # drop pathways that are not enriched
    group_by(pathway) %>%
    summarise(across(all_of(samples), sum))

plot_data1 <- bind_rows(plot_gene %>% rename(id=gene), 
                        plot_KO %>% rename(id=KO), 
                        plot_map %>% rename(id=pathway)) %>%
    `row.names<-`(.$id) %>%
    select(-id)
plot_data1 <- t(scale(t(plot_data1))) %>% as.data.frame() # row-wise z-score
# Split the heat map into gene, KO, and pathway sectors
lev_split = row.names(plot_data1) %>% str_extract("[a-zA-Z]+") %>% factor()
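
The scaling step above, t(scale(t(x))), standardizes each row rather than each column. A minimal base R sketch on a made-up 2 x 3 matrix (the names geneA, geneB, s1-s3 are hypothetical) shows the effect:

```r
# Toy matrix (made-up values): 2 genes x 3 samples
x <- matrix(c(1, 2, 3,
              10, 20, 30),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("geneA", "geneB"), c("s1", "s2", "s3")))

# scale() works column-wise, so transpose, scale, and transpose back
z <- t(scale(t(x)))

rowMeans(z)      # each row now has mean 0
apply(z, 1, sd)  # and standard deviation 1
```

After this step genes, KOs, and pathways share one color scale even though their raw sums differ by orders of magnitude. A row with zero variance would come out as NaN, so such rows are worth checking for in real data.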

Draw the plot

circos.clear()
circos.par(gap.degree=10, track.height=0.1)
# Draw the expression data in several tracks for a more layered look
circos.heatmap(plot_data1[samples[1:3]], split = lev_split , col = col_fun1, rownames.side = "outside", cluster = TRUE)
circos.heatmap(plot_data1[samples[7:9]], split = lev_split , col = col_fun1)
circos.heatmap(plot_data1[samples[4:6]], split = lev_split , col = col_fun1)

According to lev_split, this heat map is divided into three axes: gene, KO, and pathway. Note that every gene, KO, and pathway occupies a length of 1 on its axis. For example, within its own cell gene67 runs from 0 to 1 on the gene axis, and pathway2 runs from 0 to 1 on the pathway axis. Therefore, following the earlier discussion, if three KOs need to be connected to pathway2, we need data similar to the following:

from_axis from_position_start from_position_end to_axis to_position_start to_position_end
KO 0.5 1.5 pathway 0 0.333
KO 1.5 2.5 pathway 0.333 0.667
KO 2.5 3.5 pathway 0.667 1

Note the table above. Several KOs may connect to the same pathway, so the start and end positions must be split sensibly to avoid overlap. Another advantage is that the links then vary in thickness, which looks much better.
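
The splitting described above can be sketched with a small helper. The function name split_cell is hypothetical and it assumes every cell has width 1, as in the heat map here:

```r
# Hypothetical helper: cut a cell of width 1 that starts at `cell_start`
# into n equal, non-overlapping slices, one per incoming link
split_cell <- function(cell_start, n) {
  width <- 1 / n
  data.frame(start = cell_start + (seq_len(n) - 1) * width,
             end   = cell_start + seq_len(n) * width)
}

# Three KOs linking into a pathway cell that spans 0-1
slices <- split_cell(0, 3)
slices
```

Each slice ends exactly where the next one starts, so the three links fill the pathway cell without overlapping.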

On the other hand, because the genes on the Circos heat map are arranged by the clustering result, their order differs from the row order of the data frame. Therefore, after drawing the heat map, we first need to obtain the coordinates of each gene, KO, and pathway on its axis. This is done by reading the corresponding information from the plot with circlize's get.cell.meta.data().
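
To make the reordering concrete, here is a toy sketch in base R. The IDs and the order vector are made up; in the real plot the order for each sector would come from get.cell.meta.data("row_order", sector.index = ...), and the sketch assumes the k-th drawn row occupies the interval from k - 1 to k on its axis.

```r
# Toy example (made-up IDs): rows as stored in the data frame
ids <- c("gene1", "gene2", "gene3", "gene4")

# Assumed clustering result: indices of the rows in the order they are drawn
row_order <- c(3, 1, 4, 2)

# Start position of each element on the axis after reordering
pos <- data.frame(id = ids[row_order],
                  position = seq_along(row_order) - 1)
pos
```

Here gene3 is drawn first and starts at 0, while gene2 is drawn last and starts at 3, even though gene2 is the second row of the data.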

Chord diagram

Data cleaning

According to the results of the above discussion, our data cleaning should achieve two purposes:

  1. After the heat map is generated, obtain the position of each gene, KO, and pathway on its axis, organized in the following format:

    id sector position
    gene1 gene 23
    KO1 KO 45
    pathway1 pathway 56

    Table A

  2. Calculate the one-to-many gene-KO relationship and KO-pathway relationship to obtain relative positions

    from_axis from_position_start from_position_end to_axis to_position_start to_position_end
    gene 0.5 1.5 KO 5 5.33
    gene 1.5 2.5 KO 5.33 5.66
    gene 2.5 3.5 KO 5.66 6

    Table B

    Note that in the table above the three genes connect to the same KO, so each is attached to a different position within that KO.

    To obtain this table, the process can be split into the following steps:

    2.1 Generate a connection object table

    from to
    gene1 KO2
    gene2 KO1
    KO1 pathway1

    2.2 Calculate an offset based on the number of times each connection object appears in the table

    from to from_start from_end to_start to_end
    gene1 KO2 0 1 0 0.333
    gene2 KO1 0 1 0.333 0.667
    KO1 pathway1 0 0.5 0.667 1
    KO1 pathway2 0.5 1 0 0.333

    Table C

    2.3 Combine Table A and Table C to calculate Table B
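
Step 2.3 can be illustrated on toy values in base R before applying it to the real data. All IDs, positions, and offsets below are made up for the illustration, and the names table_a, table_c, and table_b simply mirror the table labels above.

```r
# Table A (toy values): start position of each element on its axis
table_a <- data.frame(id = c("gene1", "gene2", "KO1"),
                      sector = c("gene", "gene", "KO"),
                      position = c(0, 1, 5))

# Table C (toy values): relative offsets of each link inside its cells
table_c <- data.frame(from = c("gene1", "gene2"),
                      to = c("KO1", "KO1"),
                      from_start = c(0, 0), from_end = c(1, 1),
                      to_start = c(0, 0.5), to_end = c(0.5, 1))

# Look up both endpoints in Table A, then add the offsets
from_pos <- table_a[match(table_c$from, table_a$id), ]
to_pos <- table_a[match(table_c$to, table_a$id), ]
table_b <- data.frame(
  from_axis = from_pos$sector,
  from_position_start = from_pos$position + table_c$from_start,
  from_position_end = from_pos$position + table_c$from_end,
  to_axis = to_pos$sector,
  to_position_start = to_pos$position + table_c$to_start,
  to_position_end = to_pos$position + table_c$to_end
)
table_b
```

Both genes land on the KO1 cell, which starts at 5, and each takes half of it: 5 to 5.5 and 5.5 to 6.
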

    With the approach clear, the code for the simulated data follows.

# 1. Get the position of each gene, KO, and pathway on its axis
plot_data3 = data.frame()
for(lev in levels(lev_split)){
  a <- rownames(plot_data1)[lev_split==lev][get.cell.meta.data("row_order", sector.index = lev)]
  a <- seq(length(a)) %>% `names<-`(a) %>% enframe("id", "position")
  a$sector = lev
  plot_data3 <- rbind(plot_data3, a)
}
plot_data3$position <- plot_data3$position - 1 # each element spans a width of 1, so subtract 1 to get its start position, which makes the later additions easier
# 2.1 Build the point-to-point connection table
plot_data2 <- KOannotation %>%
    filter(gene %in% row.names(plot_data1), KO %in% row.names(plot_data1)) %>%
    select(gene, KO) %>%
    rename(from=gene, to=KO)
tmp_obj1 <- KOannotation %>%
    filter(KO %in% row.names(plot_data1), pathway %in% row.names(plot_data1)) %>%
    select(KO, pathway) %>%
    rename(from=KO, to=pathway)
plot_data2 %<>% bind_rows(tmp_obj1) %>% distinct(from, to)
# 2.2 Calculate the offsets
tmp_obj1<-plot_data2 %>% 
    group_by(from) %>% 
    mutate(V1=1/n()) %>% 
    group_by(from) %>% 
    mutate(from_end = cumsum(V1), 
           from_start = from_end-V1) %>%
    select(from, to, from_start, from_end)
tmp_obj2<-plot_data2 %>% 
    group_by(to) %>% 
    mutate(V1=1/n()) %>% 
    group_by(to) %>% 
    mutate(to_end = cumsum(V1), 
           to_start = to_end-V1) %>%
    select(from, to, to_start, to_end)
plot_data2 %<>% left_join(tmp_obj1) %>% left_join(tmp_obj2)
# 2.3 Merge the two tables above
plot_data2 %<>% 
  left_join(plot_data3, by=c('from'='id')) %>% 
  rename('from_position'=position, 'from_sector'=sector) %>%
  left_join(plot_data3, by=c('to'='id')) %>%
  rename('to_position'=position, 'to_sector'=sector)

Because plot_data2 above contains the connections between all points, which may be more than we need, you can select the data you want to display. This step may require drawing the plot several times to decide which links to show.

# 3. Further filter the links to display
# highlight the pathway I wanted
highlight_pathway <- c("pathway1", "pathway2")
highlight_KO <- c("KO5", "KO6", "KO20")
highlight_gene <-  c("gene292", "gene256", "gene67", "gene146", "gene52", "gene391", "gene139", "gene327", "gene218", "gene142", "gene375", "gene194")

plot_link <- plot_data2 %>% filter(to %in% c(highlight_pathway, highlight_KO)) %>% mutate(col="#dcdcdc80")
highlight_link <- plot_link %>% filter (from %in% c(highlight_gene, highlight_KO)) %>% mutate(col="#fb9b9a80")
plot_link <- plot_link %>% filter (!from %in% c(highlight_gene, highlight_KO))
plot_data2<-rbind(plot_link, highlight_link)

Drawing the links with circos.link in a loop

for ( idx in seq_len(nrow(plot_data2))){
  tmp_obj <- plot_data2[idx,]
  circos.link(tmp_obj[['from_sector']], 
              c(tmp_obj[['from_position']] + tmp_obj[['from_start']], 
                tmp_obj[['from_position']] + tmp_obj[['from_end']]), 
              tmp_obj[['to_sector']], 
              c(tmp_obj[['to_position']] + tmp_obj[['to_start']], 
                tmp_obj[['to_position']] + tmp_obj[['to_end']]),
              col = tmp_obj[['col']],
              border = NA
              )
}

All code

library(tidyverse)
library(magrittr)
library(circlize)
# Simulated data
## Data1: expression matrix, 500 genes x 9 samples (groups A, B, C)
fpkm <- cbind(matrix(rnorm(500*3, mean = 1), nrow = 500),
              matrix(rnorm(500*3, mean = 2), nrow = 500),
              matrix(rnorm(500*3, mean = 3), nrow = 500))
fpkm <- fpkm[sample(500, 500), ] # randomly permute rows
rownames(fpkm) <- paste0("gene", seq(500))
colnames(fpkm) <- c("A1", "A2", "A3", "B1", "B2", "B3", "C1", "C2", "C3")
fpkm %<>% as.data.frame() %>% mutate(gene = row.names(.))
## Data2: gene-KO-pathway annotation for 200 randomly chosen genes
pathways <- rep(paste0("pathway", seq(6)),  sample(12:20, size = 6)) %>% sample(70)
KOs <- rep(paste0("KO", seq(20)),  sample(5:20, size = 20, replace = TRUE)) %>% sample(70)
KOannotation <- data.frame(KO=KOs, pathway=pathways)
KOannotation <- KOannotation[sample(70, 200, TRUE),]
KOannotation$gene <- sample(paste0("gene", seq(500)),200)

# Suppose these are the enriched pathways you want to visualize
maps <- c("pathway1", "pathway2", "pathway3")
samples <- c("A1", "A2", "A3", "B1", "B2", "B3", "C1", "C2", "C3")
# Heat map color mapping
col_fun1 = colorRamp2(c(-2, 0, 2), c("#247ab5", "white", "#fda1a0"))
# Genes to plot
plot_gene <- KOannotation %>% 
  filter(pathway %in% maps) %>% 
  pull(gene) %>%
  {filter(fpkm, row.names(fpkm) %in% .)}
# KOs to plot (expression summed over member genes)
plot_KO <- plot_gene %>% 
  left_join(KOannotation, by = "gene") %>%
  filter(pathway %in% maps) %>% # drop pathways that are not enriched
  group_by(KO) %>%
  summarise(across(all_of(samples), sum))
# Pathways to plot (expression summed over member genes)
plot_map <- plot_gene %>% 
    left_join(KOannotation, by = "gene") %>%
    filter(pathway %in% maps) %>% # drop pathways that are not enriched
    group_by(pathway) %>%
    summarise(across(all_of(samples), sum))

plot_data1 <- bind_rows(plot_gene %>% rename(id=gene), 
                        plot_KO %>% rename(id=KO), 
                        plot_map %>% rename(id=pathway)) %>%
    `row.names<-`(.$id) %>%
    select(-id)
plot_data1 <- t(scale(t(plot_data1))) %>% as.data.frame() # row-wise z-score
# Split the heat map into gene, KO, and pathway sectors
lev_split = row.names(plot_data1) %>% str_extract("[a-zA-Z]+") %>% factor()
circos.clear()
circos.par(gap.degree=10, track.height=0.1)
# Draw the expression data in several tracks for a more layered look
circos.heatmap(plot_data1[samples[1:3]], split = lev_split , col = col_fun1, rownames.side = "outside", cluster = TRUE)
circos.heatmap(plot_data1[samples[7:9]], split = lev_split , col = col_fun1)
circos.heatmap(plot_data1[samples[4:6]], split = lev_split , col = col_fun1)
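# 1. Get the position of each gene, KO, and pathway on its axis
plot_data3 = data.frame()
for(lev in levels(lev_split)){
  a <- rownames(plot_data1)[lev_split==lev][get.cell.meta.data("row_order", sector.index = lev)]
  a <- seq(length(a)) %>% `names<-`(a) %>% enframe("id", "position")
  a$sector = lev
  plot_data3 <- rbind(plot_data3, a)
}
plot_data3$position <- plot_data3$position - 1 # each element spans a width of 1, so subtract 1 to get its start position, which makes the later additions easier
# 2.1 Build the point-to-point connection table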
plot_data2 <- KOannotation %>%
    filter(gene %in% row.names(plot_data1), KO %in% row.names(plot_data1)) %>%
    select(gene, KO) %>%
    rename(from=gene, to=KO)
tmp_obj1 <- KOannotation %>%
    filter(KO %in% row.names(plot_data1), pathway %in% row.names(plot_data1)) %>%
    select(KO, pathway) %>%
    rename(from=KO, to=pathway)
plot_data2 %<>% bind_rows(tmp_obj1) %>% distinct(from, to)
# 2.2 Calculate the offsets
tmp_obj1<-plot_data2 %>% 
    group_by(from) %>% 
    mutate(V1=1/n()) %>% 
    group_by(from) %>% 
    mutate(from_end = cumsum(V1), 
           from_start = from_end-V1) %>%
    select(from, to, from_start, from_end)
tmp_obj2<-plot_data2 %>% 
    group_by(to) %>% 
    mutate(V1=1/n()) %>% 
    group_by(to) %>% 
    mutate(to_end = cumsum(V1), 
           to_start = to_end-V1) %>%
    select(from, to, to_start, to_end)
plot_data2 %<>% left_join(tmp_obj1) %>% left_join(tmp_obj2)
# 2.3 Merge the two tables above
plot_data2 %<>% 
  left_join(plot_data3, by=c('from'='id')) %>% 
  rename('from_position'=position, 'from_sector'=sector) %>%
  left_join(plot_data3, by=c('to'='id')) %>%
  rename('to_position'=position, 'to_sector'=sector)
highlight_pathway <- c("pathway1", "pathway2")
highlight_KO <- c("KO5", "KO6", "KO20")
highlight_gene <-  c("gene292", "gene256", "gene67", "gene146", "gene52", "gene391", "gene139", "gene327", "gene218", "gene142", "gene375", "gene194")

plot_link <- plot_data2 %>% filter(to %in% c(highlight_pathway, highlight_KO)) %>% mutate(col="#dcdcdc80")
highlight_link <- plot_link %>% filter (from %in% c(highlight_gene, highlight_KO)) %>% mutate(col="#fb9b9a80")
plot_link <- plot_link %>% filter (!from %in% c(highlight_gene, highlight_KO))
plot_data2<-rbind(plot_link, highlight_link)
for ( idx in seq_len(nrow(plot_data2))){
  tmp_obj <- plot_data2[idx,]
  circos.link(tmp_obj[['from_sector']], 
              c(tmp_obj[['from_position']] + tmp_obj[['from_start']], 
                tmp_obj[['from_position']] + tmp_obj[['from_end']]), 
              tmp_obj[['to_sector']], 
              c(tmp_obj[['to_position']] + tmp_obj[['to_start']], 
                tmp_obj[['to_position']] + tmp_obj[['to_end']]),
              col = tmp_obj[['col']],
              border = NA
              )
}

Finished product

Important points

  1. When drawing a heat map with several tracks, the tracks are aligned by row number, not by row name. Therefore, the expression data for all groups must be prepared together in one step.

  2. There is a lot of data cleaning and calculation between the heat map code and the chord diagram code. That is fine and to be expected, because the positions of the genes, KOs, and pathways can only be obtained after the heat map has been drawn.

  3. It is a good habit to call circos.clear() to reset the plot state before each drawing.
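
For point 1, if the groups do end up in separate matrices, one way to keep the tracks consistent is to reorder every matrix by row name before plotting. A minimal base R sketch, with made-up matrices group_a and group_b:

```r
# Toy matrices (made-up values) holding the same genes in different row order
group_a <- matrix(1:6, nrow = 3,
                  dimnames = list(c("gene1", "gene2", "gene3"), c("A1", "A2")))
group_b <- matrix(7:12, nrow = 3,
                  dimnames = list(c("gene3", "gene1", "gene2"), c("B1", "B2")))

# Reorder the second matrix by row name so row i is the same gene in both
group_b <- group_b[rownames(group_a), , drop = FALSE]
stopifnot(identical(rownames(group_a), rownames(group_b)))
```

The code in this post avoids the problem differently: all nine samples live in plot_data1, and each track only selects columns from it.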


Origin blog.csdn.net/qq_42458954/article/details/127207703