Drawing a chromosome distribution map of mutation sites in R with the RIdeogram package

Chromosome distribution map of mutation sites

This post shows how to draw a chromosome distribution map of loci with the RIdeogram package, and introduces a way to highlight loci that differ between samples.

In genetic research, genotype information at specific positions on the genome is obtained by sequencing and other methods. In the example input table, the first column is the ID of the mutation site, the second column is the chromosome, the third column is the physical position, and the last two columns are the genotypes of the two samples.

This file usually has tens of thousands of rows. A chromosome distribution map of the mutation sites makes it possible to extract the useful information quickly. For example, when the sites where sample A and sample B differ are drawn in red, the regions that matter most stand out at a glance.

Data preparation

There are two main input files. The first holds the chromosome structure information (length, centromere position, etc.); the second holds the sites to be analyzed and annotated (the genotypes of the different materials at each position).

How to operate

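First, load the R packages (install them beforehand if they are not yet available):

library(RIdeogram)  # ideogram() and convertSVG()
library(tidyverse)  # dplyr verbs and stringr::str_split()
library(xlsx)       # Excel I/O; not called in the code shown in this post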

Setting parameters

Save the prepared data in CSV format and set the file name. my_centwidth is the display width of the centromere and my_singewidth is the display width of a single locus, both in base pairs. Both values can be customized (larger values give thicker lines).

my_filename <- "data.csv"
my_centwidth <- 5000000
my_color <- c("#e0e0e0", "#304ffe", "#dd2c00", "#00bfa5")
my_singewidth <- 5000000

Add chromosome information

Drawing requires the length of each chromosome (start and end positions) and the position of the centromere, so prepare a txt file with this information in advance. Judging from the reading code, the file needs the columns chr, length and centromere.
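As a minimal sketch of what such a file might look like: the column names chr, length and centromere are inferred from the reading code in this post, the three rows reuse values from the head() output shown further down, and the temporary file path is only for illustration.

```r
# Toy chromosome table; integer literals avoid scientific notation in the file
chrom_info <- data.frame(
  chr = c("1A", "1B", "1D"),
  length = c(594102056L, 689851870L, 495453186L),
  centromere = c(213000000L, 240600000L, 170000000L),
  stringsAsFactors = FALSE
)

# Write a tab-separated file with a header row (hypothetical location)
ref_path <- file.path(tempdir(), "Ref_chromedata.txt")
write.table(chrom_info, ref_path, sep = "\t", quote = FALSE, row.names = FALSE)

# Read it back the same way the post does
check <- read.table(ref_path, header = TRUE)
print(check)
```

Any whitespace-separated layout with a header row should work with read.table(); tabs are used here only for readability.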

Here is the code to read chromosome information:

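# Read the chromosome table (expects the columns chr, length, centromere)
df_chr_pos_cent <- read.table("./Ref_chromedata.txt",header = TRUE)
# Every chromosome starts at 0; the centromere spans my_centwidth bp
df_chr_pos_cent <- cbind(df_chr_pos_cent$chr,"0",
                         df_chr_pos_cent$length,
                         df_chr_pos_cent$centromere,
                         as.numeric(df_chr_pos_cent$centromere) + my_centwidth)
# cbind() returns a character matrix, so convert it back to a data frame
df_chr_pos_cent <- as.data.frame(df_chr_pos_cent)
# Keep the first 21 chromosomes and restore integer coordinates
df_chr_pos_cent <- df_chr_pos_cent[1:21,] %>% 
      mutate_at(vars(V2,V3,V4,V5),as.integer)
colnames(df_chr_pos_cent) <- c("Chr","Start","End","CE_start","CE_end")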
> head(df_chr_pos_cent)
  Chr Start       End  CE_start    CE_end
1  1A     0 594102056 213000000 218000000
2  1B     0 689851870 240600000 245600000
3  1D     0 495453186 170000000 175000000
4  2A     0 780798557 326650000 331650000
5  2B     0 801256715 347850000 352850000
6  2D     0 651852609 268450000 273450000

The code first reads "Ref_chromedata.txt" into the data frame df_chr_pos_cent, then binds the required columns into a new matrix. From it, the start and end positions of each chromosome and of each centromere are derived: every chromosome starts at 0 and ends at its length, and each centromere runs from the given position to that position plus my_centwidth.
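The same table can also be built without the intermediate character matrix. This is a sketch of an alternative, not the original code; the toy chrom_info data frame stands in for the file contents and reuses two rows from the output above.

```r
my_centwidth <- 5000000

# Toy stand-in for the contents of Ref_chromedata.txt
chrom_info <- data.frame(
  chr = c("1A", "1B"),
  length = c(594102056L, 689851870L),
  centromere = c(213000000L, 240600000L),
  stringsAsFactors = FALSE
)

# Build the five karyotype columns directly, keeping coordinates as integers
karyotype <- data.frame(
  Chr = chrom_info$chr,
  Start = 0L,
  End = chrom_info$length,
  CE_start = chrom_info$centromere,
  CE_end = as.integer(chrom_info$centromere + my_centwidth),
  stringsAsFactors = FALSE
)
print(karyotype)
```

Because data.frame() keeps each column's type, no conversion back to integer is needed afterwards.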

Create a function to determine the relationship between offspring and parents

To work out which parent a progeny genotype comes from at each site, and so quickly trace the source of each variant in the progeny, we create a function that determines the genetic relationship.

# Function: given the genotypes of two parents and a progeny, determine the progeny's source
determine_progeny <- function(p1,p2,prog,name_p1,name_p2){
      if (p1 == p2){
            if (p1 == prog && p2 == prog){
                  return("Both")
            }else{
                  return("Unknown")
            }
      }else{
            if (p1 == prog){
                  return(name_p1)
            }else{
                  if (p2 == prog){
                        return(name_p2)
                  }else{
                        return("Unknown")
                  }
            }
      }
}

This function is called determine_progeny and it is used to determine the relationship between two parents (p1 and p2) and an offspring (prog).

Parameter Description:

  • p1 - the genotype of the first parent.
  • p2 - the genotype of the second parent.
  • prog - the genotype of the progeny.
  • name_p1 - the name (label) of the first parent.
  • name_p2 - the name (label) of the second parent.

Logic:

If the two parental genotypes p1 and p2 are equal, there are two cases:

  1. If the progeny genotype prog is equal to them, "Both" is returned: the genotype is shared by both parents, so the source cannot be narrowed down to one of them.
  2. Otherwise, "Unknown" is returned: the progeny genotype matches neither parent.

If p1 and p2 are not equal, there are three cases:

  1. If p1 is equal to prog, name_p1 is returned: the progeny genotype comes from the first parent.
  2. If p2 is equal to prog, name_p2 is returned: the progeny genotype comes from the second parent.
  3. Otherwise, "Unknown" is returned: the source cannot be determined.

In short, the function compares the genotypes of the two parents and the progeny and returns a label describing the source of the progeny genotype.
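The same decision rules can be applied to whole columns at once. The following is a sketch under stated assumptions, not the original function: determine_progeny_vec, the parent labels "ParentA" and "ParentB", and the genotypes are all made up for illustration.

```r
# Vectorized variant of the decision rules (hypothetical helper name)
determine_progeny_vec <- function(p1, p2, prog, name_p1, name_p2) {
  out <- rep("Unknown", length(prog))
  out[p1 != p2 & p1 == prog] <- name_p1   # progeny matches parent 1 only
  out[p1 != p2 & p2 == prog] <- name_p2   # progeny matches parent 2 only
  out[p1 == p2 & p1 == prog] <- "Both"    # both parents share the genotype
  out
}

calls <- determine_progeny_vec(
  p1   = c("AA", "AA", "AA", "AA", "AA"),
  p2   = c("AA", "GG", "GG", "GG", "AA"),
  prog = c("AA", "AA", "GG", "CC", "GG"),
  name_p1 = "ParentA",
  name_p2 = "ParentB"
)
print(calls)
```

The five toy sites cover every branch: "Both", "ParentA", "ParentB", and the two ways of reaching "Unknown".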

Create a function to determine heterozygosity

If some sites are in a heterozygous state, they need to be identified and marked. The following provides an algorithm for identification:

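# Function: decide whether a genotype is heterozygous
decide_hybrid <- function(genetype){
      # Split the genotype string into single characters, e.g. "AG" -> "A" "G"
      tem <- str_split(genetype,"")
      if (length(tem[[1]]) != 2){
            # Anything other than exactly two characters is treated as invalid
            return("Error")
      }else{
            first <- tem[[1]][1]
            second <- tem[[1]][2]
            if (first == second){
                  return("equal")   # homozygous
            }else{ return("diff")}  # heterozygous
      }
}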

This code defines a function called decide_hybrid, which takes a genotype string genetype, splits it into single characters and checks whether the two characters are equal.

If the string does not consist of exactly two characters, the function returns "Error", indicating invalid input.

If the two characters are equal, it returns "equal" (homozygous).

If they are not equal, it returns "diff" (heterozygous).
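For a quick check of the three outcomes, here is a base R sketch that applies the same rule to a vector of genotypes. classify_genotype is a hypothetical name and the inputs are invented; it is an illustration of the rule, not a replacement for the function above.

```r
# Classify each genotype as "equal", "diff" or "Error" (hypothetical helper)
classify_genotype <- function(genotypes) {
  genotypes <- as.character(genotypes)
  first <- substr(genotypes, 1, 1)
  second <- substr(genotypes, 2, 2)
  # Missing values and strings that are not exactly two characters are invalid
  invalid <- is.na(genotypes) | nchar(genotypes) != 2
  ifelse(invalid, "Error", ifelse(first == second, "equal", "diff"))
}

result <- classify_genotype(c("AA", "AG", "A", "AGG"))
print(result)
```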

Calculate variant type at each site

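# Read the site table. Assumption: the columns are chr, pos, parent 1,
# parent 2 and progeny, in that order, as the indexing below implies.
df <- read.csv(my_filename, stringsAsFactors = FALSE)
df$type <- NA_character_
for (i in 1:nrow(df)){
      # Sites without a chromosome or a position cannot be drawn
      if (df$chr[i] == "#N/A" || df$pos[i] == "#N/A"){
            df$type[i] <- "del"
            next
      }
      # Sites that are heterozygous in any of the three samples are skipped
      if (decide_hybrid(df[i,3]) == "diff" ||
          decide_hybrid(df[i,4]) == "diff" ||
          decide_hybrid(df[i,5]) == "diff"){
            df$type[i] <- "del"
            next
      }
      df$type[i] <- determine_progeny(df[i,3],df[i,4],df[i,5],
                                      colnames(df)[3],colnames(df)[4])
}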

This loop fills the type column of df row by row. A row is marked "del" if its chromosome or position is missing ("#N/A") or if any of the three genotypes is heterozygous. Every other row is passed to determine_progeny, and the returned label is stored as its type.
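The exclusion rule on its own can be sketched on a toy table. The column names ParentA, ParentB and Progeny and all values are invented for illustration, and the heterozygosity check is simplified to two-character genotypes.

```r
# Toy site table: rows 3 and 4 lack coordinates, row 5 has a heterozygous parent
sites <- data.frame(
  chr     = c("chr1A", "chr1A", "#N/A", "chr2B", "chr2B"),
  pos     = c("1000", "2000", "3000", "#N/A", "5000"),
  ParentA = c("AA", "AA", "AA", "AA", "AG"),
  ParentB = c("GG", "GG", "GG", "GG", "GG"),
  Progeny = c("AA", "GG", "AA", "AA", "GG"),
  stringsAsFactors = FALSE
)

# Simplified check: the two characters of the genotype differ
is_het <- function(g) substr(g, 1, 1) != substr(g, 2, 2)

missing_coord <- sites$chr == "#N/A" | sites$pos == "#N/A"
any_het <- is_het(sites$ParentA) | is_het(sites$ParentB) | is_het(sites$Progeny)

# TRUE marks the rows that would receive the type "del"
drop_row <- missing_coord | any_het
print(drop_row)
```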

Preliminary preparation of the plotting data

# Keep only the sites that can be drawn
df_marker <- df %>% filter(type != "del")
# Strip the "chr" prefix from the chromosome names
df_marker$chr <- sub("chr","",df_marker$chr)
df_marker$Start <- df_marker$pos
df_marker$End <- as.numeric(df_marker$pos) + my_singewidth
# Create the column first so the indexed assignments below always fit
df_marker$Value <- NA
df_marker$Value[which(df_marker$type == "Both")] <- 4
df_marker$Value[which(df_marker$type == colnames(df)[3])] <- 3
df_marker$Value[which(df_marker$type == colnames(df)[4])] <- 2
df_marker$Value[which(df_marker$type == "Unknown")] <- 1

First, the rows whose type is not "del" are kept from the data frame df, and the resulting data frame df_marker is then processed as follows:

The "chr" prefix is removed from df_marker$chr. The values of df_marker$pos are copied into a new column, df_marker$Start. A new column, df_marker$End, is computed as df_marker$pos plus my_singewidth.

Update the df_marker$Value column according to the conditions

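  • If type is "Both", df_marker$Value is set to 4.
  • If type equals the name of the third column of the data frame, df_marker$Value is set to 3.
  • If type equals the name of the fourth column of the data frame, df_marker$Value is set to 2.
  • If type is "Unknown", df_marker$Value is set to 1.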

Overall, this step filters the df data frame and builds a new data frame, df_marker, by adding the Start and End columns and assigning a numeric Value to each type.
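The type-to-number mapping can also be written as a single lookup with a named vector. This is a sketch of an alternative; the parent labels "ParentA" and "ParentB" are hypothetical stand-ins for the names of columns 3 and 4.

```r
# Hypothetical names of the two parent columns
parent_names <- c("ParentA", "ParentB")

# Named vector: each type label points to its plotting value
type_to_value <- c(4L, 3L, 2L, 1L)
names(type_to_value) <- c("Both", parent_names[1], parent_names[2], "Unknown")

# Look up the value for each marker type in one step
marker_type <- c("ParentA", "Both", "Unknown", "ParentB")
marker_value <- unname(type_to_value[marker_type])
print(marker_value)
```

A type that is missing from the lookup yields NA, which makes unexpected labels easy to spot.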

Convert data format

# Columns 1, 7, 8 and 9 are expected to be chr, Start, End and Value
df_marker_plot <- df_marker[,c(1,7,8,9)]
df_marker_plot <- df_marker_plot %>% 
      mutate_at(vars(Start,End,Value),as.integer)
# Drop the rows assigned to the "Un" chromosome
df_marker_plot <- df_marker_plot[which(df_marker_plot$chr !="Un"),]
colnames(df_marker_plot) <- c("Chr","Start","End","Value")

This code prepares df_marker for plotting. It selects columns 1, 7, 8 and 9, converts Start, End and Value to integers, removes the rows whose chr value is "Un", and finally renames the columns to "Chr", "Start", "End" and "Value" for the plotting step.
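The resulting four-column layout can be illustrated with base R on a toy table. The data frame df_marker_toy and its values are invented; only the column names and the "Un" filter follow the code in this post.

```r
# Toy marker table with one marker on the "Un" chromosome
df_marker_toy <- data.frame(
  chr   = c("1A", "Un", "2B"),
  Start = c("1000", "2000", "3000"),   # positions often arrive as text
  End   = c(5001000, 5002000, 5003000),
  Value = c(4, 3, 1),
  stringsAsFactors = FALSE
)

# Drop "Un", convert the numeric columns to integer, and rename
overlay <- df_marker_toy[df_marker_toy$chr != "Un", ]
overlay$Start <- as.integer(overlay$Start)
overlay$End <- as.integer(overlay$End)
overlay$Value <- as.integer(overlay$Value)
names(overlay) <- c("Chr", "Start", "End", "Value")
rownames(overlay) <- NULL
print(overlay)
```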

Draw a panoramic view of chromosomes

# Draw the panoramic view; ideogram() writes "chromosome.svg" by default
ideogram(karyotype = df_chr_pos_cent,
         overlaid = df_marker_plot,
         colorset1 = my_color)
# Convert the SVG file to PDF
convertSVG("chromosome.svg", device = "pdf")

Finally, the code above generates the final figure. The package builds the graphic as SVG and provides functions for converting the SVG to PDF, PNG and other formats.

Tip: mapping discrete categories to numbers, as done with the Value column here, lets the plotting function treat them like continuous values. If several dimensions of data need to be displayed, other types of annotations can be added.

Multi-omics applications:

  • Distribution of differentially expressed genes (RNA-seq)
  • Distribution of open chromatin (ATAC-seq)
  • Distribution of CTCF binding sites (ChIP-seq)
  • Distribution of mutation sites on chromosomes (WGS)
  • Distribution of DNA methylation (WGBS)
  • Distribution of exons on chromosomes (WES)

Genetics applications:

  • Panoramic mapping of variant sites
  • Maps of the physical positions of markers
  • Display of sources of genetic diversity
  • Display of gene linkage mapping results

References:

https://github.com/cran/RIdeogram
http://doi.org/10.7717/peerj-cs.251
https://www.jianshu.com/p/07ae1fe18071

Note: all example data in this article were randomly generated and have no real meaning.
