Introduction to and detailed usage of idba-ud, a reference-free (de novo) assembly tool for metagenomic sequences

Introduction

idba-ud is a de novo (reference-free) assembler that turns high-throughput sequencing reads into assembled genome sequences. It is an extension of the idba assembler, designed for sequencing data with highly uneven depth, such as metagenomic data.

Its main job is to assemble sequencing reads into genome sequences without a reference genome. It accepts both short reads (length up to 600, passed with -r) and long reads (length over 600, passed with -l), and it is built to cope with highly uneven coverage during assembly. It is also multithreaded (--num_threads), so it can use the available cores to speed up assembly.

The motivation for idba-ud is the need for reference-free assembly in biology. For some species no suitable reference sequence exists to compare against, so the genome sequence has to be assembled de novo. Because sequencing depth can vary widely across such data, idba-ud is optimized for the uneven-depth case.

idba-ud builds on earlier work in de novo assembly. It uses a data structure called a de Bruijn graph: reads are broken into k-mers, overlapping k-mers are linked in the graph, and the sequences are then aligned, connected, and oriented to assemble the genome. As its full name says (Iterative de Bruijn Graph Assembler), it works through a range of k values rather than a single k. It also uses several strategies to handle uneven depth, which improves the accuracy and reliability of the assembly.
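
To make the k-mer idea concrete, here is a minimal shell sketch that splits one read into its overlapping k-mers. In a de Bruijn graph, consecutive k-mers that overlap by k-1 bases are linked. This is only an illustration, not idba-ud's own code; the helper name kmers and the sample read are made up.

```shell
# Hypothetical helper: print every k-mer of a sequence, one per line.
kmers() {
    awk -v s="$1" -v k="$2" 'BEGIN {
        # slide a window of width k across the sequence
        for (i = 1; i + k - 1 <= length(s); i++)
            print substr(s, i, k)
    }'
}

# A 6-base read with k=4 yields three k-mers: ACGT, CGTA, GTAC.
kmers ACGTAC 4
```

Each k-mer shares its last three bases with the first three of the next one, which is exactly the overlap the graph records.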

In short, idba-ud assembles genome sequences without a reference and so provides important basic data for biological research. It grew out of the need for reference-free assembly, builds on earlier research, and combines multithreading with the ability to handle data of highly uneven depth.

Install

$ git clone https://github.com/loneknightpy/idba.git
$ cd idba
$ ./configure
$ make

Add the built executables to your PATH if you like. Personally, I just call them by absolute path.
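
If you do want the tools on your PATH, something like the following should work. This is a sketch under assumptions: it supposes the repository was cloned to $HOME/idba and that the build left the executables in its bin/ subdirectory. IDBA_HOME is a made-up variable name, so adjust it to your setup.

```shell
# Assumption: clone location is $HOME/idba and executables are in bin/.
IDBA_HOME="$HOME/idba"
export PATH="$IDBA_HOME/bin:$PATH"

# Report whether the shell can now find the assembler.
command -v idba_ud || echo "idba_ud not found on PATH; check IDBA_HOME"
```

Put the two assignment lines in your shell startup file to make the change permanent.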

Usage

Sequence conversion

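idba takes FASTA files as input, so FASTQ files, including paired-end FASTQ files, must first be converted with fq2fa. For paired-end data, --merge combines the two files into a single FASTA file and --filter drops reads that contain N.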

fq2fa read.fq read.fa

# paired-end conversion
fq2fa --merge --filter read_1.fq read_2.fq read.fa
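
After merging, a quick sanity check is to count the records: a merged paired-end file should normally hold an even number of reads. The sketch below runs on a tiny stand-in file so it can be tried without real data. The helper name count_reads and the read names are made up; point it at your own read.fa instead.

```shell
# Hypothetical helper: count FASTA records by counting header lines.
count_reads() {
    grep -c '^>' "$1"
}

# Tiny stand-in for a merged file (two pairs, mates adjacent).
demo_fa=$(mktemp)
cat > "$demo_fa" <<'EOF'
>pair1/1
ACGTACGT
>pair1/2
TTGCAAGC
>pair2/1
GGCATCGA
>pair2/2
CCATGGTA
EOF

n=$(count_reads "$demo_fa")
if [ $((n % 2)) -eq 0 ]; then
    echo "$n reads, $((n / 2)) pairs"
else
    echo "odd read count ($n): pairing may be broken" >&2
fi
```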

Sequence assembly:

It's super simple, but keep an eye on machine memory. Small data sets need little, but memory use climbs quickly once the data set gets even slightly larger.

idba_ud -r read.fa -o idba_assembly

# -r input reads (FASTA)
# -o output directory
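
A slightly more defensive version of the same run, as a sketch: it sets the thread count explicitly with --num_threads and checks for a result afterwards. The thread count of 8 is an arbitrary example, and the contig.fa file name inside the output directory is an assumption to verify against your own run.

```shell
# Placeholders: adjust to your own files and machine.
reads=read.fa
outdir=idba_assembly
threads=8   # assumed value; pick what your machine allows

if command -v idba_ud >/dev/null 2>&1 && [ -f "$reads" ]; then
    idba_ud -r "$reads" -o "$outdir" --num_threads "$threads"
    # Assumption: contigs are written to contig.fa in the output directory.
    if [ -f "$outdir/contig.fa" ]; then
        echo "contigs: $(grep -c '^>' "$outdir/contig.fa")"
    fi
else
    echo "skipping: idba_ud or $reads not available"
fi
```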

Full parameter help information (idba_ud has no --help option, so it reports the option as unrecognized and then prints its usage):

idba_ud --help
idba_ud: unrecognized option '--help'
uknown option
IDBA-UD - Iterative de Bruijn Graph Assembler for sequencing data with highly uneven depth.
Usage: idba_ud -r read.fa -o output_dir
Allowed Options: 
  -o, --out arg (=out)                   output directory
  -r, --read arg                         fasta read file (<=600)
      --read_level_2 arg                 paired-end reads fasta for second level scaffolds
      --read_level_3 arg                 paired-end reads fasta for third level scaffolds
      --read_level_4 arg                 paired-end reads fasta for fourth level scaffolds
      --read_level_5 arg                 paired-end reads fasta for fifth level scaffolds
  -l, --long_read arg                    fasta long read file (>600)
      --mink arg (=20)                   minimum k value (<=312)
      --maxk arg (=100)                  maximum k value (<=312)
      --step arg (=20)                   increment of k-mer of each iteration
      --inner_mink arg (=10)             inner minimum k value
      --inner_step arg (=5)              inner increment of k-mer
      --prefix arg (=3)                  prefix length used to build sub k-mer table
      --min_count arg (=2)               minimum multiplicity for filtering k-mer when building the graph
      --min_support arg (=1)             minimum supoort in each iteration
      --num_threads arg (=0)             number of threads
      --seed_kmer arg (=30)              seed kmer size for alignment
      --min_contig arg (=200)            minimum size of contig
      --similar arg (=0.95)              similarity for alignment
      --max_mismatch arg (=3)            max mismatch of error correction
      --min_pairs arg (=3)               minimum number of pairs
      --no_bubble                        do not merge bubble
      --no_local                         do not use local assembly
      --no_coverage                      do not iterate on coverage
      --no_correct                       do not do correction
      --pre_correction                   perform pre-correction before assembly
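
To see what the --mink, --maxk and --step defaults in the help output imply, the sketch below lists the k values visited if k simply starts at mink and grows by step until it passes maxk. That stepping rule is my reading of the option descriptions, not something checked against the source code.

```shell
# Defaults taken from the help text.
mink=20
maxk=100
step=20

# Collect each k from mink up to maxk in increments of step.
k=$mink
k_values=""
while [ "$k" -le "$maxk" ]; do
    k_values="$k_values$k "
    k=$((k + step))
done

echo "k values: ${k_values% }"
```

With the defaults this prints "k values: 20 40 60 80 100", so five rounds. Changing the three variables shows how --mink, --maxk and --step alter the number of rounds.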

Origin blog.csdn.net/zrc_xiaoguo/article/details/135335349