Installation and detailed usage of error correction and genome assembly tools for third-generation sequencing data or long contigs

Introduction

Canu is an error correction and genome assembly tool for long reads. It was designed for the long, noisy DNA reads produced by third-generation sequencing technologies such as PacBio, and it also supports Oxford Nanopore reads.

Canu aims to produce high-quality genome assemblies from long reads, and its design is built around self-correction: the reads are corrected against each other rather than against a second data type. The pipeline has three stages. First, Canu computes overlaps among the raw reads and uses them to correct errors in the reads. Next, it trims the corrected reads to remove low-quality or unsupported regions. Finally, it computes overlaps among the trimmed reads and assembles them into contigs.

Whether Canu is the right tool depends on the problem. It is a good choice for high-quality de novo assembly from long reads and is used across fields such as microbiology, botany, and zoology. It also handles large genomes, especially those that short reads alone cannot assemble accurately, because long reads yield longer contigs and more complete genome coverage, which in turn helps in identifying genes and other genetic elements.

As usual, read the papers first:

Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation

De novo assembly of haplotype-resolved genomes with trio binning | Nature Biotechnology 

HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads | bioRxiv 

Then look at the GitHub repository: marbl/canu: A single molecule sequence assembler for genomes large and small. (github.com)

 

Error correction and genome assembly are important tasks in genomics that help researchers obtain high-quality genome sequences quickly. The rest of this article covers the installation and use of Canu for third-generation sequencing data or long contigs:

Installation method

Compile and install from source code

Clone the Canu project source code library:

Note that the Canu developers do not recommend downloading the zip file from GitHub, so clone the repository instead.

git clone https://github.com/marbl/canu.git
cd canu/src

Install dependencies (if not already installed)

Canu relies on some third-party software and libraries, such as zlib, bzip2, perl, c++ compiler, etc. Make sure these dependencies are installed correctly on the system.

Installing and troubleshooting these dependencies is not covered here.

Compile Canu

make -j

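Set the PATH environment variable (optional). If you skip this step, run canu by its absolute path instead. Replace the path below with the directory that actually contains the canu executable after the build; in recent releases this is typically build/bin inside the source tree rather than bin.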

export PATH="/your-path-to-canu/canu/bin:$PATH"

Install using a package management tool (e.g. conda)

It is recommended to use mamba, which is fast.

mamba create -n canu
mamba activate canu
mamba install -c conda-forge -c bioconda -c defaults canu
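
Whichever installation route was used, a quick check that the canu executable is reachable avoids confusing errors later. The following is a minimal sketch: the helper name locate_tool and its messages are made up for this article, and -version is the flag listed in Canu's own usage text.

```shell
#!/bin/sh
# Hypothetical helper: report where an executable lives, or fail with a
# message if it cannot be found on PATH.
locate_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1 -> $(command -v "$1")"
    else
        echo "$1 is not on PATH; use its absolute path or activate the environment" >&2
        return 1
    fi
}

# Print the Canu version only when the executable was found.
if locate_tool canu; then
    canu -version
fi
```

If canu is reported as missing after a conda or mamba install, the usual cause is that the environment has not been activated in the current shell.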

 Conda environment configuration reference: Miniconda3 installation configuration under Linux-centos9stream-Miniconda3 Linux 64-bit-CSDN blog

Canu assembly usage and specific steps

Assuming you have an Oxford Nanopore raw data file named nanopore_reads.fastq.gz and want to perform genome assembly, here is a basic Canu command line example:

canu -p project_name \
    -d output_directory \
    genomeSize=genome_size_in_bp \
    useGrid=false \
    maxMemory=memory_limit_in_GB \
    maxThreads=num_threads \
    -nanopore-raw nanopore_reads.fastq.gz
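
As a concrete illustration, the placeholders might be filled in as follows. All values here (a 4.8 Mbp genome, 32 GB of memory, 8 threads, and the names ecoli and ecoli_canu) are assumptions chosen for the example, not recommendations. The sketch passes the resource limits in Canu's key=value form, maxMemory (in gigabytes) and maxThreads, and it only prints the command unless canu is installed and the read file exists.

```shell
#!/bin/sh
# Example values only; every name and number below is an assumption.
prefix=ecoli
outdir=ecoli_canu
reads=nanopore_reads.fastq.gz

# Build the argument list once so it can be printed and then executed.
set -- -p "$prefix" -d "$outdir" \
    genomeSize=4.8m \
    useGrid=false \
    maxMemory=32 \
    maxThreads=8 \
    -nanopore-raw "$reads"

# Show the exact command line for review.
cmd_line="canu $*"
echo "$cmd_line"

# Run only when canu is installed and the input file is present.
if command -v canu >/dev/null 2>&1 && [ -f "$reads" ]; then
    canu "$@"
fi
```

Printing the command first makes it easy to review the settings before committing to a long run.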


# Official reference usage:
canu [-haplotype|-correct|-trim] \
   [-s <assembly-specifications-file>] \
   -p <assembly-prefix> \
   -d <assembly-directory> \
   genomeSize=<number>[g|m|k] \
   [other-options] \
   [-trimmed|-untrimmed|-raw|-corrected] \
   [-pacbio|-nanopore|-pacbio-hifi] *fastq

Parameter explanation:

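  • -p project_name: Prefix for the output file names.
  • -d output_directory: Directory in which the assembly is computed and the results are written.
  • genomeSize: Estimated haploid genome size. A plain number is read as base pairs; the suffixes k, m, and g are also accepted (for example 4.8m).
  • useGrid=false: Run locally instead of submitting jobs to a grid scheduler.
  • -nanopore-raw: Path to the raw (uncorrected) Nanopore read file(s).
  • maxMemory: Maximum memory Canu may use, in gigabytes. It is given as key=value (for example maxMemory=32), not as a dash-prefixed flag.
  • maxThreads: Maximum number of threads Canu may use, also given as key=value (for example maxThreads=8).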
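
Because genomeSize accepts a k, m, or g suffix (see the official usage above), it can help to see what a suffixed value means in base pairs. The helper below is a hypothetical illustration written for this article and is not part of Canu. It assumes awk is available and treats k, m, and g as powers of 1000, in line with the equivalence of 4.7m, 4700k, and 4700000 stated in Canu's help text.

```shell
#!/bin/sh
# Hypothetical helper (not part of Canu): convert a genomeSize value such as
# 4.7m, 4700k, or 4700000 into a plain number of base pairs.
genome_size_to_bp() {
    echo "$1" | awk '{
        value = tolower($0)
        factor = 1
        if (value ~ /k$/) { factor = 1000 }
        else if (value ~ /m$/) { factor = 1000000 }
        else if (value ~ /g$/) { factor = 1000000000 }
        sub(/[kmg]$/, "", value)          # drop the suffix, keep the number
        printf "%.0f\n", value * factor   # round to a whole number of bases
    }'
}

# The first two calls describe the same genome size in different notations.
genome_size_to_bp 4.7m
genome_size_to_bp 4700k
genome_size_to_bp 1g
```

Canu does this conversion itself, so the helper is only useful for sanity-checking a value before passing it on the command line.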

Note the useGrid parameter. If the system has a grid scheduler such as Slurm configured (for example on an HPC cluster), Canu detects it and submits its jobs to the grid by default. If you do not want to use the scheduler, add useGrid=false so that the whole run stays on the local machine.

In this example, contigs assembled from second-generation sequencing data are used directly as the input. Running the job in the background with nohup is recommended, since an assembly can take a long time.
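
A background run with nohup might look like the following sketch. The wrapper function name, the log file name canu.log, and the Canu arguments in the commented example are assumptions for illustration; the pattern itself (nohup, output redirected to a log file, trailing &) is standard shell usage.

```shell
#!/bin/sh
# Hypothetical wrapper: start any command in the background, immune to
# hangups, with stdout and stderr collected in a log file.
# Prints the PID of the background process.
run_in_background() {
    log_file=$1
    shift
    nohup "$@" >"$log_file" 2>&1 &
    echo $!
}

# Example use with Canu (all values are placeholders):
# pid=$(run_in_background canu.log canu -p project_name -d output_directory \
#       genomeSize=4.8m useGrid=false -nanopore-raw nanopore_reads.fastq.gz)
# tail -f canu.log    # follow progress; the job keeps running after logout
```

Keeping the PID makes it possible to check on the job later with ps or to stop it with kill.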

Full parameter help information:

canu --help

usage:   canu [-version] [-citation] \
              [-haplotype | -correct | -trim | -assemble | -trim-assemble] \
              [-s <assembly-specifications-file>] \
               -p <assembly-prefix> \
               -d <assembly-directory> \
               genomeSize=<number>[g|m|k] \
              [other-options] \
              [-haplotype{NAME} illumina.fastq.gz] \
              [-corrected] \
              [-trimmed] \
              [-pacbio |
               -nanopore |
               -pacbio-hifi] file1 file2 ...

example: canu -d run1 -p godzilla genomeSize=1g -nanopore-raw reads/*.fasta.gz 


  To restrict canu to only a specific stage, use:
    -haplotype     - generate haplotype-specific reads
    -correct       - generate corrected reads
    -trim          - generate trimmed reads
    -assemble      - generate an assembly
    -trim-assemble - generate trimmed reads and then assemble them

  The assembly is computed in the -d <assembly-directory>, with output files named
  using the -p <assembly-prefix>.  This directory is created if needed.  It is not
  possible to run multiple assemblies in the same directory.

  The genome size should be your best guess of the haploid genome size of what is being
  assembled.  It is used primarily to estimate coverage in reads, NOT as the desired
  assembly size.  Fractional values are allowed: '4.7m' equals '4700k' equals '4700000'

  Some common options:
    useGrid=string
      - Run under grid control (true), locally (false), or set up for grid control
        but don't submit any jobs (remote)
    rawErrorRate=fraction-error
      - The allowed difference in an overlap between two raw uncorrected reads.  For lower
        quality reads, use a higher number.  The defaults are 0.300 for PacBio reads and
        0.500 for Nanopore reads.
    correctedErrorRate=fraction-error
      - The allowed difference in an overlap between two corrected reads.  Assemblies of
        low coverage or data with biological differences will benefit from a slight increase
        in this.  Defaults are 0.045 for PacBio reads and 0.144 for Nanopore reads.
    gridOptions=string
      - Pass string to the command used to submit jobs to the grid.  Can be used to set
        maximum run time limits.  Should NOT be used to set memory limits; Canu will do
        that for you.
    minReadLength=number
      - Ignore reads shorter than 'number' bases long.  Default: 1000.
    minOverlapLength=number
      - Ignore read-to-read overlaps shorter than 'number' bases long.  Default: 500.
  A full list of options can be printed with '-options'.  All options can be supplied in
  an optional sepc file with the -s option.

  For TrioCanu, haplotypes are specified with the -haplotype{NAME} option, with any
  number of haplotype-specific Illumina read files after.  The {NAME} of each haplotype
  is free text (but only letters and numbers, please).  For example:
    -haplotypeNANNY nanny/*gz
    -haplotypeBILLY billy1.fasta.gz billy2.fasta.gz

  Reads can be either FASTA or FASTQ format, uncompressed, or compressed with gz, bz2 or xz.

  Reads are specified by the technology they were generated with, and any processing performed.

  [processing]
    -corrected
    -trimmed

  [technology]
    -pacbio      <files>
    -nanopore    <files>
    -pacbio-hifi <files>
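
The stage flags listed in the help text (-correct, -trim, -assemble) allow the pipeline to be run one step at a time. The sketch below only prints the three commands so they can be reviewed first. The function name, prefix, directories, and genome size are example values, and the intermediate file names (prefix.correctedReads.fasta.gz and prefix.trimmedReads.fasta.gz) are assumptions based on Canu's usual output naming, so check the actual files in your assembly directory before running the later stages.

```shell
#!/bin/sh
# Print (not run) one Canu command per stage. All names are example values.
print_staged_commands() {
    prefix=$1      # assembly prefix, e.g. asm
    workdir=$2     # directory for the correction and trimming stages
    size=$3        # genomeSize value, e.g. 4.8m
    reads=$4       # raw Nanopore reads

    # Stage 1: correct the raw reads.
    echo "canu -correct -p ${prefix} -d ${workdir} genomeSize=${size} -nanopore ${reads}"
    # Stage 2: trim the corrected reads (assumed output name of stage 1).
    echo "canu -trim -p ${prefix} -d ${workdir} genomeSize=${size} -corrected -nanopore ${workdir}/${prefix}.correctedReads.fasta.gz"
    # Stage 3: assemble the trimmed reads in a separate directory (an assumption),
    # so that different assembly settings can reuse the same trimmed reads.
    echo "canu -assemble -p ${prefix} -d ${workdir}-assembly genomeSize=${size} -trimmed -corrected -nanopore ${workdir}/${prefix}.trimmedReads.fasta.gz"
}

print_staged_commands asm asm_canu 4.8m nanopore_reads.fastq.gz
```

Running the stages separately is mainly useful when the corrected or trimmed reads are needed for other tools, or when several assembly settings are to be compared without repeating the correction step.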

Origin blog.csdn.net/zrc_xiaoguo/article/details/135332993