Calculating the abundance of contigs and genes in metagenomic samples with BWA, Bowtie2, samtools, CheckM and other tools, with a comparison of calculation methods and scripts

Calculating the relative abundance of contigs and genes can provide information about the structure and function of microbial communities. Here’s what it means to calculate these two metrics:

1. Relative abundance of contigs: contigs are contiguous sequences assembled from overlapping sequencing reads. By calculating the relative abundance of contigs, we can estimate the relative abundance of the different species in the microbial community. This helps researchers understand the species composition and community structure.

2. Relative abundance of genes: genes are the basic units of function in organisms. By calculating the relative abundance of genes, we can understand the functional characteristics of different bacterial groups. This helps researchers understand the metabolic capabilities, biosynthetic capabilities and environmental adaptability of microbial communities.

Calculating the relative abundance of contigs and genes together gives a fuller picture of the community. This information helps researchers understand the composition and function of microbial communities and the interactions between microorganisms and their hosts.

The first method: Bowtie2, samtools, CheckM

To calculate contig abundance, the assembly output is generally used. To calculate gene abundance, the Prodigal output is generally used; when running Prodigal, it is recommended to write both the protein and the nucleotide sequence files. Gene annotation is generally done with DIAMOND on the protein sequences, whereas abundance is calculated by mapping the original nucleotide reads, so the nucleotide sequences are used here. Prodigal gives the nucleotide and protein sequences the same IDs, so the two sets of results can be joined by sequence ID at the end.
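The ID-based join described above can be sketched in Python. This is a minimal sketch under assumptions that are not in the original post: the abundance table is taken to be a two-column TSV (gene ID, abundance), the annotation is taken to be DIAMOND tabular output (query ID in column 1, subject ID in column 2), and the function name `join_by_gene_id` and the toy IDs are hypothetical.

```python
import csv
import io


def join_by_gene_id(abundance_tsv, diamond_tsv):
    """Attach the first listed DIAMOND hit of each gene to its abundance value."""
    # Gene ID -> first subject ID listed for that query.
    first_hit = {}
    for row in csv.reader(diamond_tsv, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) >= 2 and row[0] not in first_hit:
            first_hit[row[0]] = row[1]

    joined = []
    for row in csv.reader(abundance_tsv, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        gene_id, abundance = row[0], float(row[1])
        joined.append((gene_id, abundance, first_hit.get(gene_id, "unannotated")))
    return joined


# Toy data with hypothetical IDs; real input would be opened file handles.
abundance = io.StringIO("gene_1\t0.25\ngene_2\t0.75\n")
hits = io.StringIO("gene_1\thit_A\t98.0\ngene_1\thit_B\t90.0\n")
for record in join_by_gene_id(abundance, hits):
    print(record)
```

Because the join key is the Prodigal sequence ID, genes without any hit are kept and labeled rather than dropped, so the abundance table stays complete.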

First build a new index from the assembled contigs, as shown below:

bowtie2-build --threads 20 sample1/final_assembly.fasta sample1.contig

# or, for the Prodigal output
bowtie2-build --threads 20 sample1.nucle_seq.fa sample1.gene

Next, all reads from metagenomic sequencing are mapped to the assembled contigs. Each read can be assigned to at most one contig:

# Note: the -x argument must match the index name produced in the previous step
# If using the assembly
bowtie2 -p 20 \
    -x sample1.contig \
    -1 sample1_clean_reads_1.fq \
    -2 sample1_clean_reads_2.fq \
    -S sample1.contig.sam \
    --fast

# or, for the Prodigal output
bowtie2 -p 20 \
    -x sample1.gene \
    -1 sample1_clean_reads_1.fq \
    -2 sample1_clean_reads_2.fq \
    -S sample1.gene.sam \
    --fast
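Since each read is assigned to at most one reference sequence, per-contig (or per-gene) read counts can be tallied directly from the SAM file. The following is a minimal Python sketch, not part of the original workflow: the function name `count_primary_alignments` and the toy records are hypothetical, and it counts only mapped primary alignments, using the standard SAM flag bits for unmapped (0x4), secondary (0x100) and supplementary (0x800) records.

```python
from collections import Counter


def count_primary_alignments(sam_lines):
    """Count mapped primary alignments per reference sequence."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed lines
        flag, ref_name = int(fields[1]), fields[2]
        # Skip unmapped, secondary and supplementary records.
        if flag & 0x4 or flag & 0x100 or flag & 0x800:
            continue
        counts[ref_name] += 1
    return counts


# Toy SAM records (hypothetical); real input would be an open SAM file handle.
toy_sam = [
    "@SQ\tSN:contig_1\tLN:1000",
    "r1\t99\tcontig_1\t10\t42\t4M\t=\t200\t194\tACGT\tIIII",
    "r1\t147\tcontig_1\t200\t42\t4M\t=\t10\t-194\tACGT\tIIII",
    "r2\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r3\t256\tcontig_2\t5\t0\t4M\t*\t0\t0\tACGT\tIIII",
    "r4\t0\tcontig_2\t5\t30\t4M\t*\t0\t0\tACGT\tIIII",
]
for ref, n in sorted(count_primary_alignments(toy_sam).items()):
    print(f"{ref}\t{n}")
```

For paired-end data this counts each mate separately, so the totals are read counts rather than fragment counts.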

The steps below are shared with the second method.

Use the samtools tool to convert sam files into bam files:

samtools view -bS --threads 20 sample1.contig.sam > sample1.contig.bam

# Prodigal output
samtools view -bS --threads 20 sample1.gene.sam > sample1.gene.bam

Sort the reads in the BAM file by their aligned position coordinates:

samtools sort sample1.contig.bam -o sample1.contig.reads.sorted.bam --threads 20

# Prodigal output
samtools sort sample1.gene.bam -o sample1.gene.reads.sorted.bam --threads 20

This BAM file stores the read mapping results. The usual next step is to parse it with a custom script, but CheckM can do the job as well. To calculate coverage, first prepare the BAM index file, as shown below:

samtools index sample1.contig.reads.sorted.bam

# Prodigal output
samtools index sample1.gene.reads.sorted.bam

After this runs, the index file sample1.contig.reads.sorted.bam.bai is generated in the same path as the corresponding sorted BAM. CheckM is a tool for evaluating metagenomic bins. Here, all contig sequences of a sample can be treated as a single "bin" and placed in a bins folder. Then use CheckM to calculate coverage:

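# One folder per sample, treated as a single bin
# The bin must contain the same contigs that were indexed for mapping
mkdir -p sample1_contig_bin
cp sample1/final_assembly.fasta sample1_contig_bin/
checkm coverage \
    -x fasta \
    -m 20 \
    -t 20 \
    sample1_contig_bin \
    sample1.contigs_coverage.out \
    sample1.contig.reads.sorted.bam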

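### Prodigal output
# Use a separate folder so that only the gene sequences are read;
# the extension passed to -x must match the file (.fa)
mkdir -p sample1_gene_bin
cp sample1.nucle_seq.fa sample1_gene_bin/
checkm coverage \
    -x fa \
    -m 20 \
    -t 20 \
    sample1_gene_bin \
    sample1.gene_coverage.out \
    sample1.gene.reads.sorted.bam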

The results include the contig sequence ID, the ID of the bin it belongs to, coverage and other information, as described below. The tab-separated file can be opened in Excel to line up the columns:

  • Sequence Id: Unique identifier for the sequence.
  • Bin Id: The bin to which this sequence belongs. In metagenomics, a bin is a group of contigs inferred to come from the same genome; here the whole sample is treated as one bin.
  • Sequence length (bp): The length of the sequence in base pairs (bp).
  • Bam Id: The BAM file from which the values for this sequence were computed.
  • Coverage: The average depth of the sequence, that is, the average number of reads covering each of its bases. It is derived from the alignment of the sequencing reads.
  • Mapped reads: The number of sequencing reads that were successfully mapped to this sequence.

Relative abundance calculation formula:

To calculate the relative abundance of a gene or contig in a sample, you can use the coverage and mapped reads columns. Typically, abundance is estimated from these values together with the total number of sequenced reads. For example, relative abundance can be calculated using the following formula:

where:

  • Mapped reads is the number of mapped reads for this contig.
  • Average read length is the average length of sequencing reads.
  • Total reads is the total number of all sequencing reads.

Total reads statistics:

 python script:

def count_reads_fastq(fastq_file):
    with open(fastq_file, 'r') as f:
        count = sum(1 for line in f) // 4  # each read spans 4 lines, so divide the line count by 4
    return count

# Replace with the path to your FASTQ file
file_path = 'path/to/your/fastq_file.fastq'

# Count the reads
reads_count = count_reads_fastq(file_path)
print(f"Number of reads in the FASTQ file: {reads_count}")

bash script:

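# Count the reads in a FASTQ file (4 lines per read)
# Counting lines is safer than grep -c "^@", because quality lines can also start with "@"
echo $(( $(wc -l < your_fastq_file.fastq) / 4 ))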

Note the sequence ID in the first column: it must be matched to the corresponding sequence ID in the gene annotation results. Beyond that, only the read mapping information is needed. The relative abundance can then be calculated from the number of mapped reads by dividing by the contig length and the total number of reads, similar to RPKM normalization in RNA-seq. If multiple samples were co-assembled, repeat the steps above with each sample's reads and merge the results by contig ID at the end.
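The RPKM-style normalization mentioned above can be sketched in Python as follows. This is a minimal sketch, not the post's own script: it assumes a tab-separated coverage table for a single BAM file with the column names listed earlier ("Sequence Id", "Sequence length (bp)", "Mapped reads"), and the function name `rpkm_from_coverage_table` and the toy values are hypothetical.

```python
import csv
import io


def rpkm_from_coverage_table(table, total_reads):
    """Return {sequence ID: mapped reads per kilobase per million total reads}."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    result = {}
    for row in csv.DictReader(table, delimiter="\t"):
        length_kb = float(row["Sequence length (bp)"]) / 1_000
        mapped = float(row["Mapped reads"])
        # Divide by sequence length (kb) and by sequencing depth (millions of reads).
        result[row["Sequence Id"]] = mapped / length_kb / (total_reads / 1_000_000)
    return result


# Toy table (hypothetical values); real input would be the opened coverage file.
demo = io.StringIO(
    "Sequence Id\tBin Id\tSequence length (bp)\tBam Id\tCoverage\tMapped reads\n"
    "contig_1\tsample1\t2000\tsample1.bam\t5.0\t100\n"
    "contig_2\tsample1\t500\tsample1.bam\t10.0\t50\n"
)
for seq_id, value in rpkm_from_coverage_table(demo, total_reads=1_000_000).items():
    print(f"{seq_id}\t{value:.1f}")
```

The `total_reads` argument is the per-sample read count obtained with either of the counting scripts above; for co-assembled samples the function would be called once per sample and the resulting dictionaries merged by sequence ID.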

The second method: BWA (recommended), samtools, CheckM

# First build an index for the reference sequences:
bwa index -p sample1_gene sample1.nucle_seq.fa
# Align with BWA-MEM:
bwa mem \
    -t 20 \
    sample1_gene \
    sample1_clean_1.fastq \
    sample1_clean_2.fastq \
    >sample1_gene.sam

From here on, follow the same samtools steps as in the first method.

The third method: bedtools

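# Step 1: align the sequencing reads to the reference sequences
# Bowtie2 is assumed here
bowtie2-build your_genome.fa your_genome_index  # skip if the index already exists
bowtie2 -x your_genome_index -U your_reads.fastq -S aligned.sam

# Step 2: convert the alignments to BAM format
samtools view -S -b aligned.sam > aligned.bam
# then sort and index
samtools sort aligned.bam -o aligned.sorted.bam --threads 20
samtools index aligned.sorted.bam

# Step 3: extract coverage information
# bedtools is assumed here; -d reports the depth at every position,
# which is then summed per sequence
bedtools genomecov -ibam aligned.sorted.bam -d > coverage.bed
awk 'BEGIN {OFS="\t"} {sum[$1]+=$3} END {for (id in sum) print id, sum[id]}' coverage.bed > genes_coverage.txt

# Step 4: get the sequence lengths
# if no length file exists yet, genes_lengths.txt can be taken from the FASTA index
samtools faidx your_genome.fa
cut -f 1,2 your_genome.fa.fai > genes_lengths.txt

# Step 5: calculate relative abundance (summed depth divided by length)
# the lengths file is read first, then each coverage value is divided by its length
awk 'BEGIN {OFS="\t"} NR==FNR {len[$1]=$2; next} ($1 in len) {print $1, $2/len[$1]}' genes_lengths.txt genes_coverage.txt > relative_abundance.txt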
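When bedtools genomecov is run without -d or -bg, it reports a depth histogram per sequence (sequence name, depth, number of bases at that depth, sequence length, fraction) followed by a genome-wide summary. A minimal Python sketch for turning that histogram into a mean depth per sequence is shown below; the function name `mean_depth_from_histogram` and the toy rows are hypothetical, and the sketch assumes no reference sequence is literally named "genome".

```python
from collections import defaultdict


def mean_depth_from_histogram(lines):
    """Mean depth per sequence = sum(depth * bases_at_depth) / sequence length."""
    weighted = defaultdict(float)
    length = {}
    for line in lines:
        fields = line.split()
        # Skip malformed lines and the genome-wide summary rows.
        if len(fields) < 5 or fields[0] == "genome":
            continue
        seq_id = fields[0]
        depth, n_bases, seq_len = int(fields[1]), int(fields[2]), int(fields[3])
        weighted[seq_id] += depth * n_bases
        length[seq_id] = seq_len
    return {seq_id: weighted[seq_id] / length[seq_id] for seq_id in weighted}


# Toy histogram (hypothetical values); real input would be the opened output file.
toy_hist = [
    "gene_1\t0\t20\t100\t0.2",
    "gene_1\t2\t80\t100\t0.8",
    "gene_2\t5\t50\t50\t1",
    "genome\t0\t20\t150\t0.133333",
]
for seq_id, depth in mean_depth_from_histogram(toy_hist).items():
    print(f"{seq_id}\t{depth:.2f}")
```

The mean depths can then be divided by their sum across all sequences if a proportion per sequence is wanted.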

Full process calculation script

For multiple samples, wrap the commands in a for loop over the sample names.
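One way to organize such a loop is sketched below in Python. This is only an illustration under assumptions: it reuses the Bowtie2 and samtools commands and file naming of the first method, assumes the index `<sample>.contig` has already been built, and uses hypothetical sample names and a hypothetical helper `commands_for_sample`. The commands are only printed; nothing is executed.

```python
import shlex


def commands_for_sample(sample, threads=20):
    """Build the mapping, conversion, sorting and indexing commands for one sample."""
    prefix = f"{sample}.contig"
    return [
        ["bowtie2", "-p", str(threads), "-x", prefix,
         "-1", f"{sample}_clean_reads_1.fq", "-2", f"{sample}_clean_reads_2.fq",
         "-S", f"{prefix}.sam", "--fast"],
        ["samtools", "view", "-bS", "--threads", str(threads),
         "-o", f"{prefix}.bam", f"{prefix}.sam"],
        ["samtools", "sort", f"{prefix}.bam",
         "-o", f"{prefix}.reads.sorted.bam", "--threads", str(threads)],
        ["samtools", "index", f"{prefix}.reads.sorted.bam"],
    ]


for sample in ["sample1", "sample2"]:  # hypothetical sample names
    for cmd in commands_for_sample(sample):
        # Print only; replace with subprocess.run(cmd, check=True) to execute.
        print(shlex.join(cmd))
```

Building each command as a list keeps file names with unusual characters safe when the commands are later passed to subprocess.run.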

Automatic analysis script 1

This script calculates the relative abundance of contigs based on bwa-mem, samtools and CheckM. It assumes that you already have the reference sequences and sequencing data, and that the required software is installed.

#!/bin/bash

# File paths
reference_genome="your_reference_genome.fa"
reads="your_reads.fastq"

# Step 1: align the sequencing reads to the reference with bwa-mem
bwa index "$reference_genome"  # skip if the index already exists
bwa mem -t 4 "$reference_genome" "$reads" > aligned.sam

# Step 2: convert the alignments to a sorted, indexed BAM
samtools view -bS aligned.sam > aligned.bam
samtools sort -o aligned_sorted.bam aligned.bam
samtools index aligned_sorted.bam

# Step 3: place the contigs in a folder so that CheckM treats them as one bin
mkdir -p contigs_dir
cp "$reference_genome" contigs_dir/

# Step 4: compute per-contig coverage and mapped read counts with CheckM
checkm coverage -x fa -t 4 contigs_dir/ checkm_coverage.txt aligned_sorted.bam

Automatic analysis script 2

This script calculates the relative abundance of contigs or genes in a sample based on bowtie2, samtools and bedtools. Please note that it is for reference only and cannot be run directly, because the file paths, parameters and data need to be adjusted to the actual situation.

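#!/bin/bash

# Assumes a reference file your_genome.fa, a reads file your_reads.fastq and a gene annotation file genes_annotation.gff

# Step 1: build the reference index
bowtie2-build your_genome.fa your_genome_index

# Step 2: align the sequencing reads to the reference
bowtie2 -x your_genome_index -U your_reads.fastq -S aligned.sam

# Step 3: convert the alignments to a coordinate-sorted, indexed BAM
samtools view -S -b aligned.sam > aligned.bam
samtools sort aligned.bam -o aligned.sorted.bam
samtools index aligned.sorted.bam

# Step 4: extract the gene intervals from the annotation as BED (0-based start)
awk 'BEGIN {FS=OFS="\t"} $3=="gene" {print $1, $4-1, $5, $9}' genes_annotation.gff > genes.bed

# Step 5: count the reads overlapping each gene
bedtools coverage -a genes.bed -b aligned.sorted.bam -counts > genes_coverage.txt

# Step 6: divide each read count by the gene length (end - start in BED coordinates)
awk 'BEGIN {FS=OFS="\t"} {print $4, $5/($3-$2)}' genes_coverage.txt > relative_abundance.txt

# Report completion
echo "Relative abundance calculation finished. Results saved to relative_abundance.txt."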

Automatic analysis script 3 

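# python
import os
import subprocess
from collections import defaultdict

# File paths
ref_genome = 'path/to/your_reference_genome.fasta'
sample_reads = 'path/to/your_sample_reads.fastq'
gene_lengths = 'path/to/your_gene_lengths.txt'  # tab-separated: sequence ID, length

# Step 1: align the sequencing reads to the reference with Bowtie2
bowtie_index = 'your_genome_index'
subprocess.run(['bowtie2-build', ref_genome, bowtie_index], check=True)
subprocess.run(['bowtie2', '-x', bowtie_index, '-U', sample_reads, '-S', 'aligned.sam'], check=True)

# Step 2: convert the alignments to BAM and sort by coordinate (bedtools needs sorted input)
subprocess.run(['samtools', 'view', '-S', '-b', 'aligned.sam', '-o', 'aligned.bam'], check=True)
subprocess.run(['samtools', 'sort', 'aligned.bam', '-o', 'aligned.sorted.bam'], check=True)

# Step 3: extract per-base depth
# '>' is not interpreted without a shell, so stdout is redirected from Python instead
with open('coverage.bed', 'w') as cov_out:
    subprocess.run(['bedtools', 'genomecov', '-ibam', 'aligned.sorted.bam', '-d'],
                   stdout=cov_out, check=True)

# Step 4: read the sequence lengths (the length file is assumed to exist already)
lengths = {}
with open(gene_lengths, 'r') as len_file:
    for line in len_file:
        gene_id, length = line.rstrip('\n').split('\t')[:2]
        lengths[gene_id] = float(length)

# Step 5: sum the depth per sequence and divide by its length
depth_sum = defaultdict(float)
with open('coverage.bed', 'r') as cov_file:
    for line in cov_file:
        seq_id, _position, depth = line.rstrip('\n').split('\t')
        depth_sum[seq_id] += float(depth)

with open('relative_abundance.txt', 'w') as output:
    for seq_id, total in depth_sum.items():
        if seq_id in lengths:  # match by ID rather than by line order
            output.write(f"{seq_id}\t{total / lengths[seq_id]}\n")

# Remove temporary files (optional)
for tmp in ('aligned.sam', 'aligned.bam', 'aligned.sorted.bam', 'coverage.bed'):
    os.remove(tmp)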

Automatic analysis script 4

NGless is a domain-specific language (DSL) for analyzing metagenomic data.

Installation and usage reference: "Introduction, installation, configuration and detailed usage of NGLess (Next Generation Less), a domain-specific language environment for bioinformatics analysis" (CSDN Blog)

Here is an example script using NGless to calculate the relative abundance of contigs or genes in a sample. Please note that this is just a simplified example and actual analysis scripts may need to be adjusted based on your specific data and needs.

# Declare the NGLess version
ngless "1.0"

# Define the input files (paired-end reads)
input = paired('sample_R1.fastq.gz', 'sample_R2.fastq.gz')

# Map the reads to the contigs or genes
mapped = map(input, fafile='contigs.fasta')

# Count the reads per reference sequence, normalized by sequence length
counts = count(mapped, features=['seqname'], normalization={normed})

# Write the results
write(counts, ofile='gene_relative_abundance.txt')

NGless script description:

  1. Version declaration and input definition: declares the NGLess version and defines the input as the sample's paired-end sequencing reads.

  2. Align reads to contigs or genes: the map function aligns the sequencing reads to the reference FASTA file of contigs or genes (contigs.fasta is a placeholder and needs to be replaced with the actual file name).

  3. Count and normalize: the count function with features=['seqname'] counts the reads per reference sequence, and normalization={normed} divides each count by the sequence length.

  4. Output results: the write function writes the results to gene_relative_abundance.txt as tab-delimited text.

Automatic analysis script 5

# R
# Load the required package (Bioconductor; install with BiocManager::install("GenomicAlignments"))
library(GenomicAlignments)  # provides readGAlignments() and attaches GenomicRanges

## ###################### Get coverage data for contigs or genes
# Assumes a reference file genome.fa and a reads file reads.fastq

# Step 1: align the sequencing reads to the reference
# Bowtie2 is assumed here and must be installed
system("bowtie2-build genome.fa genome_index")  # build the index

system("bowtie2 -x genome_index -U reads.fastq -S aligned.sam")  # align

# Step 2: convert the alignments to BAM format
# requires samtools
system("samtools view -S -b aligned.sam > aligned.bam")

# Step 3: compute coverage
# Read the BAM file
bam <- readGAlignments("aligned.bam")

# Per-base coverage of every reference sequence (an RleList)
cvg <- coverage(bam)

# Sum the per-base depth of each sequence and write it as a two-column table
coverage_data <- data.frame(Gene = names(cvg), Coverage = sum(cvg))
write.table(coverage_data, file="coverage_data.txt", sep="\t", quote=FALSE, row.names=FALSE)

############## Get the lengths of contigs and genes
# Load the required R package
library("data.table")

# Step 1: get the contig lengths from the assembly results
# Assumes an illustrative assembly results file (example data)
assembly_data <- fread("assembly_results.csv")  # read the assembly results file

# Calculate the contig lengths
contigs_lengths <- nchar(assembly_data$Contig_Sequence)  # assumes the Contig_Sequence column holds the contig sequences
contigs_data <- data.frame(Contig = assembly_data$Contig_ID, Length = contigs_lengths)

# Step 2: get the gene lengths from the gene prediction results
# Assumes an illustrative gene prediction results file (example data)
gene_prediction_data <- fread("gene_prediction_results.csv")  # read the gene prediction results file

# Calculate the gene lengths
genes_lengths <- gene_prediction_data$Gene_End - gene_prediction_data$Gene_Start + 1
genes_data <- data.frame(Gene = gene_prediction_data$Gene_ID, Length = genes_lengths)

# Show the contig and gene lengths
print("Contig lengths:")
print(contigs_data)

print("Gene lengths:")
print(genes_data)


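############# Calculate abundance
# Step 1: read the data
# Assumes coverage and length files for the contigs or genes (illustrative),
# each with a header line and a shared "Gene" ID column
coverage_data <- read.table("coverage_data.txt", header=TRUE)  # coverage data for contigs or genes
gene_lengths <- read.table("gene_lengths.txt", header=TRUE)  # length data for contigs or genes

# Step 2: calculate relative abundance
# Merge the coverage data and the length data by ID
merged_data <- merge(coverage_data, gene_lengths, by="Gene")

# Calculate relative abundance (illustrative: coverage divided by length)
merged_data$Relative_Abundance <- merged_data$Coverage / merged_data$Length

# Show the result
print(merged_data)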


Origin blog.csdn.net/zrc_xiaoguo/article/details/135026392