Introduction
IDBA-UD is a de novo genome assembler: it turns high-throughput sequencing reads into genome sequences without needing a reference genome. It is an extension of the IDBA assembler, designed for sequencing data with highly uneven depth, such as single-cell and metagenomic data.
Its main job is to assemble reads into contigs and scaffolds when no reference genome is available. It accepts short reads (the -r option) as well as longer reads (the -l option), and it copes with data whose coverage varies widely across the genome. idba_ud is also multithreaded (the --num_threads option), so it can use several CPU cores to speed up assembly.
The motivation for IDBA-UD is the need for de novo assembly in biology. For many species there is no suitable reference sequence to align reads against, so the genome has to be reconstructed from the reads alone. Because sequencing depth can differ greatly between regions and between samples, IDBA-UD is tuned specifically for such uneven data.
IDBA-UD builds on earlier work in de novo assembly. It breaks the reads into k-mers, builds a de Bruijn graph from them, and assembles the genome by connecting overlapping sequences and resolving their order and orientation. As the name (Iterative de Bruijn Graph Assembler) suggests, it repeats this over a range of k values rather than a single one. To deal with uneven depth it combines several strategies that are visible in its options, including error correction, bubble merging, local assembly, and iterating on coverage, which improves the accuracy and reliability of the assembly.
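To make the k-mer idea concrete, here is a minimal sketch that lists the overlapping k-mers of a single read. The `kmers` function is a hypothetical helper written for this post, not part of IDBA, and it only illustrates one common way of describing a de Bruijn graph (k-mers as nodes, k-1 base overlaps as edges).

```shell
# Illustration only: print every overlapping k-mer of one read.
# kmers is a hypothetical helper, not an IDBA command.
kmers() {
  # $1 = read sequence, $2 = k
  awk -v s="$1" -v k="$2" 'BEGIN {
    for (i = 1; i + k - 1 <= length(s); i++) print substr(s, i, k)
  }'
}

# Consecutive k-mers share k-1 bases; those overlaps are what the
# assembler follows when it connects sequences.
kmers ACGTAC 4
```

For the read ACGTAC and k = 4 this prints ACGT, CGTA and GTAC, one per line.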
In short, IDBA-UD recovers the genome sequence of a species by de novo assembly and so provides basic data for biological research. It grew out of the need to assemble genomes without a reference, builds on earlier assemblers, and adds multithreading and the ability to handle data with highly uneven depth.
Install
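$ git clone https://github.com/loneknightpy/idba.git
$ cd idba
$ ./configure
$ make
If a fresh clone has no configure script yet, the build.sh script in the repository should generate it; run that first.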
The compiled programs end up in the bin directory. Add that directory to your PATH if you like; personally, I just call the programs by their absolute path.
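If you prefer the PATH route, something like the following should work. `IDBA_HOME` and the directory it points to are assumptions for illustration; replace them with the location of your own clone.

```shell
# IDBA_HOME is a hypothetical location; point it at your own clone.
IDBA_HOME="$HOME/software/idba"

# The build places the executables (idba_ud, fq2fa, ...) under bin/.
export PATH="$IDBA_HOME/bin:$PATH"

# Check whether the shell can now find the assembler.
command -v idba_ud || echo "idba_ud not found on PATH; check IDBA_HOME"
```

Put the two assignment lines in your shell startup file (for example ~/.bashrc) to make the change permanent.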
Usage
Sequence conversion:
IDBA takes FASTA files as input, so FASTQ files, including paired-end FASTQ files, must first be converted with fq2fa.
fq2fa read.fq read.fa
# paired-end conversion: merge the two files into one FASTA file
fq2fa --merge --filter read_1.fq read_2.fq read.fa
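To show what the conversion amounts to for a single-end file, here is a rough awk equivalent. `fastq_to_fasta` is a hypothetical stand-in written for this post; it is not how fq2fa is implemented, it does no filtering or pair merging, and it assumes strict four-line FASTQ records.

```shell
# fastq_to_fasta is a hypothetical illustration, not fq2fa itself.
# Assumes each record is exactly 4 lines: @id, sequence, +, quality.
fastq_to_fasta() {
  awk '
    NR % 4 == 1 { print ">" substr($0, 2) }  # "@id" becomes ">id"
    NR % 4 == 2 { print }                    # keep the sequence line
  '                                          # "+" and quality lines are dropped
}

printf '@r1\nACGT\n+\nIIII\n' | fastq_to_fasta
```

For real data, stick with fq2fa, since idba_ud expects the read layout that fq2fa produces for paired-end input.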
Sequence assembly:
The command is very simple, but keep an eye on the machine's memory. Small data sets need little, but memory use grows quickly once the data set gets even moderately large.
idba_ud -r read.fa -o idba_assembly
# -r input reads (FASTA)
# -o output directory
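After the run finishes, a quick sanity check is to count the contigs and their total length. The sketch below assumes the contigs are in contig.fa inside the output directory; `fasta_summary` is a hypothetical helper, not something shipped with IDBA.

```shell
# fasta_summary is a hypothetical helper: counts sequences and total bases.
fasta_summary() {
  awk '
    /^>/ { n++; next }            # header line: one more sequence
         { bases += length($0) }  # sequence line: add its length
    END  { printf "%d sequences, %d bp\n", n, bases }
  ' "$1"
}

# Only summarize if the assembly output is actually there.
if [ -f idba_assembly/contig.fa ]; then
  fasta_summary idba_assembly/contig.fa
fi
```

A very large number of very short contigs usually means the assembly is fragmented, which is worth checking before moving on.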
Full parameter help. idba_ud does not recognize --help, but any unrecognized option makes it print the usage text:
idba_ud --help
idba_ud: unrecognized option '--help'
uknown option
IDBA-UD - Iterative de Bruijn Graph Assembler for sequencing data with highly uneven depth.
Usage: idba_ud -r read.fa -o output_dir
Allowed Options:
-o, --out arg (=out) output directory
-r, --read arg fasta read file (<=600)
--read_level_2 arg paired-end reads fasta for second level scaffolds
--read_level_3 arg paired-end reads fasta for third level scaffolds
--read_level_4 arg paired-end reads fasta for fourth level scaffolds
--read_level_5 arg paired-end reads fasta for fifth level scaffolds
-l, --long_read arg fasta long read file (>600)
--mink arg (=20) minimum k value (<=312)
--maxk arg (=100) maximum k value (<=312)
--step arg (=20) increment of k-mer of each iteration
--inner_mink arg (=10) inner minimum k value
--inner_step arg (=5) inner increment of k-mer
--prefix arg (=3) prefix length used to build sub k-mer table
--min_count arg (=2) minimum multiplicity for filtering k-mer when building the graph
--min_support arg (=1) minimum supoort in each iteration
--num_threads arg (=0) number of threads
--seed_kmer arg (=30) seed kmer size for alignment
--min_contig arg (=200) minimum size of contig
--similar arg (=0.95) similarity for alignment
--max_mismatch arg (=3) max mismatch of error correction
--min_pairs arg (=3) minimum number of pairs
--no_bubble do not merge bubble
--no_local do not use local assembly
--no_coverage do not iterate on coverage
--no_correct do not do correction
--pre_correction perform pre-correction before assembly
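The --mink, --maxk and --step options together decide which k values the iterations use. The sketch below prints that sequence for the default values shown above. `k_schedule` is a hypothetical helper, and it assumes k simply advances by --step from --mink for as long as it does not exceed --maxk; how idba_ud treats a final partial step is not stated in the help text.

```shell
# k_schedule is a hypothetical helper, not an IDBA command.
# Usage: k_schedule MINK MAXK STEP
k_schedule() {
  k=$1
  while [ "$k" -le "$2" ]; do
    printf '%s\n' "$k"
    k=$((k + $3))
  done
}

# Defaults from the help text: --mink 20, --maxk 100, --step 20.
k_schedule 20 100 20

# The same values passed explicitly, using only options listed above:
# idba_ud -r read.fa -o idba_assembly --mink 20 --maxk 100 --step 20 --num_threads 8
```

With the defaults this prints 20, 40, 60, 80 and 100. A smaller step means more iterations and a longer run time.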