Transcriptome quality control and read trimming with fastp [study notes, easy-to-understand version]

Introduction: This article is based on my own study and practice, and I referred to a lot of materials. If anyone can point out mistakes, I would be deeply grateful; please leave a message in the comment area, thank you.
Update date: July 14, 2023
Address of this article: https://zhuanlan.zhihu.com/p/643683121

Author: CYH-BI

Reference: OpenGene/fastp: An ultra-fast all-in-one FASTQ preprocessor (QC/adapters/trimming/filtering/splitting/merging…) (github.com)

Introduction to fastp

fastp is a tool developed by Shifu Chen (co-founder/CTO of HaploX) that performs quality control and read trimming on raw (off-sequencer) transcriptome data. In other words, it covers the basic functions of FastQC + Trimmomatic, and it is very fast.

Some of its functions are listed below (this summary refers to "fastp: an ultra-fast full-featured FASTQ file automatic quality control + filtering + correction + preprocessing software" on Zhihu (zhihu.com)):

  • Automatically performs comprehensive quality control on the data and generates a human-friendly report;
  • Filtering (low quality, too short, too many Ns, ...);
  • For the head or tail of each read, computes the mean quality within a sliding window and trims the sub-sequence whose mean falls below the threshold (similar to Trimmomatic, but much faster);
  • Global trimming (at the head/tail, without affecting deduplication); for Illumina raw data the last one or two cycles often need this;
  • Removes adapter contamination. Conveniently, you do not need to supply the adapter sequence, because the algorithm detects and trims it automatically;
  • For paired-end (PE) data, automatically finds the overlapping region of each read pair and corrects mismatched bases in that region;
  • Trims polyG from read tails. For Illumina NextSeq/NovaSeq data, polyG artifacts are common because of the two-color chemistry, so this feature is enabled by default for those platforms;
  • Corrects inconsistent bases in the overlapping region of PE reads according to their quality values;
  • Preprocesses data carrying unique molecular identifiers (UMIs), whether the UMI is on the insert or on the index;
  • Splits the output, with two modes supported: specifying the number of output files, or the number of lines per file;
  • ...and more.

The summary is:

  • Input and output file settings
  • Adapter trimming
  • Global trimming (directly cutting a fixed number of bases from the start and end of reads)
  • Sliding-window quality trimming (similar to Trimmomatic; that software's introduction and usage can be found on my homepage)
  • Filtering of too-short reads
  • Base correction (for paired-end data)
  • Quality filtering

In short, it is very convenient; a concrete example follows below.
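
To make this concrete, here is a minimal sketch of a paired-end run exercising the functions summarized above: adapter trimming (on by default), sliding-window quality trimming, length filtering, and overlap-based base correction. The file names (sample1_1.fq.gz and so on) and the thresholds are placeholders; adjust them to your own data.

# Hypothetical paired-end example combining the functions summarized above.
# Adapter trimming and quality/length filtering are on by default; the flags
# below only make the sliding-window trimming and PE base correction explicit.
fastp \
    -i sample1_1.fq.gz -I sample1_2.fq.gz \
    -o sample1_1.trimmed.fq.gz -O sample1_2.trimmed.fq.gz \
    --cut_right --cut_window_size 4 --cut_mean_quality 20 \
    --length_required 36 \
    --correction \
    --html sample1.fastp.html --json sample1.fastp.json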

fastp installation

  • Method One

    Install using the prebuilt binary from the official website (GitHub)

    # Download with wget and make the binary executable
    wget http://opengene.org/fastp/fastp
    chmod a+x ./fastp
    
  • Method Two

    Install using conda

    conda install -c bioconda fastp
    

    This method is the easiest (recommended)
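
Whichever method you use, a quick way to check that the installation worked (use ./fastp instead if you downloaded the binary without adding it to your PATH) is simply to print the built-in help:

# Sanity check: print fastp's built-in help text.
fastp --help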

The use of fastp

fastp accepts **.fq** and **.fq.gz** files for both input and output, and it generates a report in HTML format. The report contains no static images; all charts are drawn dynamically with JavaScript (see the examples below).

Many features are enabled by default; for example, adapter handling is performed automatically (the adapter is detected and trimmed by the algorithm), which is very convenient. Next, let's take a look at its parameters:

Use the following command to view the parameters (note that -h in fastp is short for --html, so fastp -h prints an "option needs value" warning and then the usage; fastp --help gives the same usage without the warning):

fastp -h
[chenyh@mu01 ~]$ fastp -h
option needs value: --html
usage: fastp [options] ... 
options:
  -i, --in1                            read1 input file name (string [=])
  -o, --out1                           read1 output file name (string [=])
  -I, --in2                            read2 input file name (string [=])
  -O, --out2                           read2 output file name (string [=])
      --unpaired1                      for PE input, if read1 passed QC but read2 not, it will be written to unpaired1. Default is to discard it. (string [=])
      --unpaired2                      for PE input, if read2 passed QC but read1 not, it will be written to unpaired2. If --unpaired2 is same as --unpaired1 (default mode), both unpaired reads will be written to this same file. (string [=])
      --overlapped_out                 for each read pair, output the overlapped region if it has no any mismatched base. (string [=])
      --failed_out                     specify the file to store reads that cannot pass the filters. (string [=])
  -m, --merge                          for paired-end input, merge each pair of reads into a single read if they are overlapped. The merged reads will be written to the file given by --merged_out, the unmerged reads will be written to the files specified by --out1 and --out2. The merging mode is disabled by default.
      --merged_out                     in the merging mode, specify the file name to store merged output, or specify --stdout to stream the merged output (string [=])
      --include_unmerged               in the merging mode, write the unmerged or unpaired reads to the file specified by --merge. Disabled by default.
  -6, --phred64                        indicate the input is using phred64 scoring (it'll be converted to phred33, so the output will still be phred33)
  -z, --compression                    compression level for gzip output (1 ~ 9). 1 is fastest, 9 is smallest, default is 4. (int [=4])
      --stdin                          input from STDIN. If the STDIN is interleaved paired-end FASTQ, please also add --interleaved_in.
      --stdout                         stream passing-filters reads to STDOUT. This option will result in interleaved FASTQ output for paired-end output. Disabled by default.
      --interleaved_in                 indicate that <in1> is an interleaved FASTQ which contains both read1 and read2. Disabled by default.
      --reads_to_process               specify how many reads/pairs to be processed. Default 0 means process all reads. (int [=0])
      --dont_overwrite                 don't overwrite existing files. Overwritting is allowed by default.
      --fix_mgi_id                     the MGI FASTQ ID format is not compatible with many BAM operation tools, enable this option to fix it.
  -V, --verbose                        output verbose log information (i.e. when every 1M reads are processed).
  -A, --disable_adapter_trimming       adapter trimming is enabled by default. If this option is specified, adapter trimming is disabled
  -a, --adapter_sequence               the adapter for read1. For SE data, if not specified, the adapter will be auto-detected. For PE data, this is used if R1/R2 are found not overlapped. (string [=auto])
      --adapter_sequence_r2            the adapter for read2 (PE data only). This is used if R1/R2 are found not overlapped. If not specified, it will be the same as <adapter_sequence> (string [=auto])
      --adapter_fasta                  specify a FASTA file to trim both read1 and read2 (if PE) by all the sequences in this FASTA file (string [=])
      --detect_adapter_for_pe          by default, the auto-detection for adapter is for SE data input only, turn on this option to enable it for PE data.
  -f, --trim_front1                    trimming how many bases in front for read1, default is 0 (int [=0])
  -t, --trim_tail1                     trimming how many bases in tail for read1, default is 0 (int [=0])
  -b, --max_len1                       if read1 is longer than max_len1, then trim read1 at its tail to make it as long as max_len1. Default 0 means no limitation (int [=0])
  -F, --trim_front2                    trimming how many bases in front for read2. If it's not specified, it will follow read1's settings (int [=0])
  -T, --trim_tail2                     trimming how many bases in tail for read2. If it's not specified, it will follow read1's settings (int [=0])
  -B, --max_len2                       if read2 is longer than max_len2, then trim read2 at its tail to make it as long as max_len2. Default 0 means no limitation. If it's not specified, it will follow read1's settings (int [=0])
  -D, --dedup                          enable deduplication to drop the duplicated reads/pairs
      --dup_calc_accuracy              accuracy level to calculate duplication (1~6), higher level uses more memory (1G, 2G, 4G, 8G, 16G, 24G). Default 1 for no-dedup mode, and 3 for dedup mode. (int [=0])
      --dont_eval_duplication          don't evaluate duplication rate to save time and use less memory.
  -g, --trim_poly_g                    force polyG tail trimming, by default trimming is automatically enabled for Illumina NextSeq/NovaSeq data
      --poly_g_min_len                 the minimum length to detect polyG in the read tail. 10 by default. (int [=10])
  -G, --disable_trim_poly_g            disable polyG tail trimming, by default trimming is automatically enabled for Illumina NextSeq/NovaSeq data
  -x, --trim_poly_x                    enable polyX trimming in 3' ends.
      --poly_x_min_len                 the minimum length to detect polyX in the read tail. 10 by default. (int [=10])
  -5, --cut_front                      move a sliding window from front (5') to tail, drop the bases in the window if its mean quality < threshold, stop otherwise.
  -3, --cut_tail                       move a sliding window from tail (3') to front, drop the bases in the window if its mean quality < threshold, stop otherwise.
  -r, --cut_right                      move a sliding window from front to tail, if meet one window with mean quality < threshold, drop the bases in the window and the right part, and then stop.
  -W, --cut_window_size                the window size option shared by cut_front, cut_tail or cut_sliding. Range: 1~1000, default: 4 (int [=4])
  -M, --cut_mean_quality               the mean quality requirement option shared by cut_front, cut_tail or cut_sliding. Range: 1~36 default: 20 (Q20) (int [=20])
      --cut_front_window_size          the window size option of cut_front, default to cut_window_size if not specified (int [=4])
      --cut_front_mean_quality         the mean quality requirement option for cut_front, default to cut_mean_quality if not specified (int [=20])
      --cut_tail_window_size           the window size option of cut_tail, default to cut_window_size if not specified (int [=4])
      --cut_tail_mean_quality          the mean quality requirement option for cut_tail, default to cut_mean_quality if not specified (int [=20])
      --cut_right_window_size          the window size option of cut_right, default to cut_window_size if not specified (int [=4])
      --cut_right_mean_quality         the mean quality requirement option for cut_right, default to cut_mean_quality if not specified (int [=20])
  -Q, --disable_quality_filtering      quality filtering is enabled by default. If this option is specified, quality filtering is disabled
  -q, --qualified_quality_phred        the quality value that a base is qualified. Default 15 means phred quality >=Q15 is qualified. (int [=15])
  -u, --unqualified_percent_limit      how many percents of bases are allowed to be unqualified (0~100). Default 40 means 40% (int [=40])
  -n, --n_base_limit                   if one read's number of N base is >n_base_limit, then this read/pair is discarded. Default is 5 (int [=5])
  -e, --average_qual                   if one read's average quality score <avg_qual, then this read/pair is discarded. Default 0 means no requirement (int [=0])
  -L, --disable_length_filtering       length filtering is enabled by default. If this option is specified, length filtering is disabled
  -l, --length_required                reads shorter than length_required will be discarded, default is 15. (int [=15])
      --length_limit                   reads longer than length_limit will be discarded, default 0 means no limitation. (int [=0])
  -y, --low_complexity_filter          enable low complexity filter. The complexity is defined as the percentage of base that is different from its next base (base[i] != base[i+1]).
  -Y, --complexity_threshold           the threshold for low complexity filter (0~100). Default is 30, which means 30% complexity is required. (int [=30])
      --filter_by_index1               specify a file contains a list of barcodes of index1 to be filtered out, one barcode per line (string [=])
      --filter_by_index2               specify a file contains a list of barcodes of index2 to be filtered out, one barcode per line (string [=])
      --filter_by_index_threshold      the allowed difference of index barcode for index filtering, default 0 means completely identical. (int [=0])
  -c, --correction                     enable base correction in overlapped regions (only for PE data), default is disabled
      --overlap_len_require            the minimum length to detect overlapped region of PE reads. This will affect overlap analysis based PE merge, adapter trimming and correction. 30 by default. (int [=30])
      --overlap_diff_limit             the maximum number of mismatched bases to detect overlapped region of PE reads. This will affect overlap analysis based PE merge, adapter trimming and correction. 5 by default. (int [=5])
      --overlap_diff_percent_limit     the maximum percentage of mismatched bases to detect overlapped region of PE reads. This will affect overlap analysis based PE merge, adapter trimming and correction. Default 20 means 20%. (int [=20])
  -U, --umi                            enable unique molecular identifier (UMI) preprocessing
      --umi_loc                        specify the location of UMI, can be (index1/index2/read1/read2/per_index/per_read, default is none (string [=])
      --umi_len                        if the UMI is in read1/read2, its length should be provided (int [=0])
      --umi_prefix                     if specified, an underline will be used to connect prefix and UMI (i.e. prefix=UMI, UMI=AATTCG, final=UMI_AATTCG). No prefix by default (string [=])
      --umi_skip                       if the UMI is in read1/read2, fastp can skip several bases following UMI, default is 0 (int [=0])
      --umi_delim                      delimiter to use between the read name and the UMI, default is : (string [=:])
  -p, --overrepresentation_analysis    enable overrepresented sequence analysis.
  -P, --overrepresentation_sampling    one in (--overrepresentation_sampling) reads will be computed for overrepresentation analysis (1~10000), smaller is slower, default is 20. (int [=20])
  -j, --json                           the json format report file name (string [=fastp.json])
  -h, --html                           the html format report file name (string [=fastp.html])
  -R, --report_title                   should be quoted with ' or ", default is "fastp report" (string [=fastp report])
  -w, --thread                         worker thread number, default is 3 (int [=3])
  -s, --split                          split output by limiting total split file number with this option (2~999), a sequential number prefix will be added to output name ( 0001.out.fq, 0002.out.fq...), disabled by default (int [=0])
  -S, --split_by_lines                 split output by limiting lines of each file with this option(>=1000), a sequential number prefix will be added to output name ( 0001.out.fq, 0002.out.fq...), disabled by default (long [=0])
  -d, --split_prefix_digits            the digits for the sequential number padding (1~10), default is 4, so the filename will be padded as 0001.xxx, 0 to disable padding (int [=4])
      --cut_by_quality5                DEPRECATED, use --cut_front instead.
      --cut_by_quality3                DEPRECATED, use --cut_tail instead.
      --cut_by_quality_aggressive      DEPRECATED, use --cut_right instead.
      --discard_unmerged               DEPRECATED, no effect now, see the introduction for merging.
  -?, --help                           print this message
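
Several of the behaviors listed above are enabled by default: adapter trimming, quality filtering, length filtering, and polyG trimming for NextSeq/NovaSeq data. If you only want the report and prefer to leave the reads untouched, a sketch using the corresponding "disable" flags looks like this (file names are placeholders):

# Hypothetical example: generate the report only, turning off the default
# adapter trimming (-A), quality filtering (-Q), length filtering (-L) and
# polyG trimming (-G).
fastp -i sample1_1.fq.gz -I sample1_2.fq.gz \
      -o sample1_1.out.fq.gz -O sample1_2.out.fq.gz \
      -A -Q -L -G \
      -h report_only.html -j report_only.json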

As mentioned earlier, fastp can handle both single-end and paired-end data, so I will give an example of each:

  • For single-end data

    fastp -i in_put.fq -o out_put.fq
    

    -i input (.fq format, or .fq.gz)

    -o output (.fq format, or .fq.gz)
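
    A slightly fuller single-end sketch (the file names are placeholders) that also names the report files explicitly:

    # Hypothetical single-end run with explicitly named report files.
    fastp -i sample1.fq.gz -o sample1_trimmed.fq.gz \
          -h sample1.fastp.html -j sample1.fastp.json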

  • For paired-end data

    fastp -i /home/chenyh/RNAseq_data/sample1_1.fq.gz -I /home/chenyh/RNAseq_data/sample1_2.fq.gz -o /home/chenyh/ly_NT_RNAseq/fastp_result/sample1-1_trimmed.fq.gz -O /home/chenyh/ly_NT_RNAseq/fastp_result/sample1_2_trimmed.fq.gz -h /home/chenyh/ly_NT_RNAseq/fastp_result/html/sample1.html
    

    The code is explained as follows:

    -i first reads (.fq format, or .fq.gz)
    -I second reads (.fq format, or .fq.gz)
    -o output for the first reads (the letter "o", not the digit zero)
    -O output for the second reads (lowercase -i pairs with lowercase -o, uppercase -I with uppercase -O)
    -h the HTML quality-control report, covering both reads

    Other parameters depend on your needs! I only need to remove low-quality bases and adapter sequences, so the parameters above are enough for me; a sketch for processing several samples in a loop follows below.
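
    When there are many samples, a simple loop over the paired files is convenient. A minimal sketch, assuming the files are named <sample>_1.fq.gz / <sample>_2.fq.gz and sit in the current directory (paths and names are placeholders):

    # Hypothetical batch run: loop over all *_1.fq.gz files and process each pair.
    mkdir -p fastp_result/html
    for r1 in *_1.fq.gz; do
        sample=${r1%_1.fq.gz}      # strip the suffix to get the sample name
        r2=${sample}_2.fq.gz
        fastp -i "$r1" -I "$r2" \
              -o fastp_result/${sample}_1_trimmed.fq.gz \
              -O fastp_result/${sample}_2_trimmed.fq.gz \
              -h fastp_result/html/${sample}.html \
              -j fastp_result/html/${sample}.json
    done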

Whether the data are single-end or paired-end, an HTML file will be generated, which is the quality-control report. For paired-end data it contains the basic information of both reads before and after quality control. Next, with the HTML file in hand, let's look at the quality-control report.

Interpretation of fastp quality control report

The fastp report is similar to the one generated by FastQC, but more detailed.

  • summary section

    This part shows the summary information for each read file. From it we can clearly see the quality-control statistics before and after processing, which helps with subsequent analysis.

    Q20 and Q30 denote the percentage of bases whose quality value reaches at least Q20 or Q30, respectively (a Phred score of 20 corresponds to a base-call error probability of 1%, and 30 to 0.1%). It is a bit like the pass rate of a product: different quality standards give different pass rates, and the higher the standard, the fewer bases qualify. Generally speaking, for next-generation sequencing it is best for more than 95% of bases to reach Q20 (no less than 90% at worst) and more than 85% to reach Q30 (no less than 80% at worst). (This passage is adapted from "Shengxin Study Notes: Interpretation of the report results generated by fastp quality control processing", twocanis' blog on CSDN.) A rough sketch of how these percentages could be computed by hand follows the figure below.

[Figure: summary section of the fastp report]
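
To make the Q20/Q30 numbers less abstract, here is a rough sketch of computing them by hand from a FASTQ file. This only illustrates what the percentages mean; fastp already reports these values. The file name is a placeholder, and the sketch assumes standard 4-line FASTQ records with Phred+33 quality encoding:

# Count the fraction of bases with Phred quality >= 20 and >= 30 (Phred+33 assumed).
zcat sample1_1.fq.gz | awk '
    BEGIN { for (c = 33; c <= 126; c++) chars = chars sprintf("%c", c) }
    NR % 4 == 0 {
        for (i = 1; i <= length($0); i++) {
            q = index(chars, substr($0, i, 1)) - 1   # Phred score of this base
            total++
            if (q >= 20) q20++
            if (q >= 30) q30++
        }
    }
    END { printf "Q20: %.2f%%  Q30: %.2f%%\n", 100 * q20 / total, 100 * q30 / total }'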

  • Adapters (adapter section)

    Adapters appear in the reads when the insert is shorter than the read length, so that during the sequencing cycles the machine reads through the insert and into the adapter (my reads are 150 bases; your situation may differ), as shown in the figure below:
    [Figure: adapter read-through when the insert is shorter than the read length]

In the fastp report, for each of the two read files the adapter section lists the adapter fragments that were trimmed and how many times each occurred, from a single base up to tens of bases, plus a count of other adapters not listed individually (in my report the listed fragments only go up to 8 bases, which shows that the adapters were detected automatically).

[Figure: adapter counts in the fastp report]

  • Insert size estimation part

    This estimate is based on paired-end overlap analysis, which found that 22.768265% of reads did not overlap.
    Non-overlapping read pairs may have an insert size of less than 30 or greater than 270, or contain too many sequencing errors to be detected as overlapping.

    For background on why there is an "insert" at all, see this article: An article that clarifies what is an "insert"? - Zhihu (zhihu.com)

    (In my data, 22.768265% of the read pairs do not overlap; you have to look at your own data, I can only give an example here.)

    Hover the mouse over the graph to view the value at a specific position (try it yourself).
    [Figure: insert size distribution]

  • filtering part

    This section has a lot of content, including the results before and after filtering for each read file.

    quality section

    This shows the quality distribution of bases at each position along the read. Generally speaking, data are good when the quality is above 30 and the fluctuation is small.

    Hover the mouse over a curve to view the value at a specific position; you can also hold the left mouse button and drag to draw a box, and the boxed region will be zoomed in.
    [Figure: per-position base quality curves]

base content part

The distribution of base proportions at each position of the read is used to analyze base balance (the degree of base separation). What is base separation? We know that A pairs with T and C pairs with G, so if the sequencing process is sufficiently random (random is good here), the proportions of A and T should be similar at each position, and likewise the proportions of C and G. As the figure shows, even if the two deviate, the deviation should not be large, preferably within about 1% on average. If it is much higher, then unless there is a reasonable explanation, such as certain targeted capture sequencing, you need to check whether the sequencing was biased.

The strong fluctuations at the front of the reads are basically an artifact introduced during library construction, and this problem can only be partially resolved.

Hover the mouse over a curve to view the value at a specific position (as shown in the figure below); you can also hold the left mouse button and drag to draw a box, and the boxed region will be zoomed in.
[Figure: per-position base content proportions]

  • KMER count

    fastp counts the occurrences of every combination of 5 bases (5-mers) and displays them in a table. Each cell of the table shows white text on a colored background, and the darker the background, the more often that combination occurs; abnormal patterns can be spotted by looking at the colors. Hovering the mouse over a specific combination shows its count and average proportion.

    Hover the mouse over a cell to view the exact data for that combination.
    [Figure: 5-mer count table]

All the other sections can be interpreted in the same way and are not described here.

This concludes the interpretation of the fastp report.

At this point, this article is complete. It is based on my own study and practice, and I referred to a lot of materials. If anyone can point out mistakes, I would be deeply grateful; please leave a message in the comment area, thank you.
