Transcriptome read trimming with Trimmomatic [easy-to-follow study notes]

About this article: It is based on my own study and practice, and I consulted a lot of material along the way. If you spot an error, please leave a message in the comment area; I would be very grateful. Last updated: July 6, 2023

Zhihu address of this article: https://zhuanlan.zhihu.com/p/642000061

First, why is sequence trimming necessary? Trimming is a preprocessing step in sequencing data analysis. Its purpose is to remove low-quality bases and noise, which improves data quality and accuracy.

Some common reasons for trimming sequences:

  1. Removal of low-quality bases: During sequencing, each base is assigned a quality value that indicates the confidence that the base was called correctly. Trimming can remove low-quality bases at the start or end of a read to improve data quality and reliability.
  2. Removal of adapter sequences: During library preparation, adapter sequences are ligated to the DNA fragments so that they can attach to the flow cell. Adapter sequences may remain in the sequencing data and interfere with downstream analysis. Trimming can remove them.
  3. Handling of overrepresented sequences: PCR amplification can produce repeated sequences that affect the accuracy of downstream analysis. Trimmomatic does not remove duplicate reads, but a known overrepresented sequence (for example a primer) can be added to the adapter FASTA file so that ILLUMINACLIP clips it.
  4. Removal of noise and contaminants: Sequencing data may contain noise, sequencing errors, or other contaminants. Trimming helps reduce these, improving the accuracy and confidence of the data.

Trimming improves the quality and accuracy of the sequencing data, reduces errors in downstream analysis, and helps extract more meaningful information from the raw reads.

Where the adapters in reads come from

Before library construction, nucleic acids are randomly fragmented, and some molecules (such as mRNA) vary in length to begin with, so the inserts between the adapters also vary in length. The read length in next-generation sequencing is generally fixed, however, so some inserts are inevitably shorter than the read length. In that case the sequencer reads through the insert into the adapter, and the read contains part or all of the adapter sequence. These adapter sequences have to be detected, and the affected reads either filtered out or truncated at the adapter. (See the blue adapter region in the figure below; part or all of an adapter may be detected. For the underlying reason, look up the principles of next-generation sequencing; there is too much to cover here.)
[Figure: reads with the detected adapter portion shown in blue]
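The read-through described above comes down to simple arithmetic. The sketch below is only an illustration: `adapter_bases` is a hypothetical helper, and the read and insert lengths are made-up numbers, not values from the data used later in this article.

```shell
# Hypothetical helper: number of bases at the 3' end of a read that lie
# beyond the insert (adapter read-through) for a fixed read length.
# $1 = read length, $2 = insert (fragment) length
adapter_bases() (
    read_len=$1
    insert_len=$2
    if [ "$insert_len" -ge "$read_len" ]; then
        # The insert is at least as long as the read: no adapter is reached.
        echo 0
    else
        # The read runs past the end of the insert into the adapter.
        echo $((read_len - insert_len))
    fi
)

adapter_bases 150 200   # prints 0
adapter_bases 150 100   # prints 50
```

With a 150-base read and a 100-base insert, the last 50 bases of the read come from beyond the insert, and that is the part adapter clipping is meant to remove.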

Introduction to Trimmomatic Software

Trimmomatic is a popular data filtering tool for the Illumina platform. Data from other platforms, such as Ion Torrent PGM, can be filtered with fastx_toolkit or NGS QC Toolkit. Trimmomatic supports multi-threading and processes data quickly. It is mainly used to remove adapters from Illumina FASTQ reads and to trim reads according to base quality values. The software has two modes, SE and PE, for single-end and paired-end sequencing data, and it supports gzip and bzip2 compressed files.

Last time we used FastQC to produce a quality control report. From that report we can tell whether a dataset needs trimming, how much to trim, and whether adapters are present. The Overrepresented sequences module of FastQC shows repeated sequences; if a pattern emerges there, those sequences can also be used to remove unwanted parts (I have not tried this, so I will not go into it).

Downloading Trimmomatic (Linux instructions only)

1. Download the Trimmomatic software.

wget https://github.com/usadellab/Trimmomatic/files/5854859/Trimmomatic-0.39.zip

2. Unzip

unzip Trimmomatic-0.39.zip

Configure the environment (I configured it in the ~/.bashrc file)

# trimmomatic
echo 'export PATH="/home/cyh/biosoft/Trimmomatic-0.39:$PATH"' >> ~/.bashrc

3. Test (help)

java -jar /home/cyh/biosoft/Trimmomatic-0.39/trimmomatic-0.39.jar -h

Use of Trimmomatic software

Official website for reference: USADELLAB.org - Trimmomatic: A flexible read trimming tool for Illumina NGS data

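Introduction to some parameters:

ILLUMINACLIP: Cut adapter and other Illumina-specific sequences from the read.
ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 # Clip the Illumina adapters listed in TruSeq3-PE.fa; 2:30:10 are the seed mismatches, the palindrome clip threshold, and the simple clip threshold
SLIDINGWINDOW: Perform a sliding window trimming, cutting once the average quality within the window falls below a threshold.
SLIDINGWINDOW:4:15 # Scan the read with a 4-base sliding window and cut once the average quality within the window drops below 15
LEADING: Cut bases off the start of a read, if below a threshold quality
LEADING:3      # Remove low-quality bases (quality below 3) from the start of the read
TRAILING: Cut bases off the end of a read, if below a threshold quality
TRAILING:3      # Remove low-quality bases (quality below 3) from the end of the read
CROP: Cut the read to a specified length
CROP:length    # Keep the first `length` bases of each read and cut the rest
HEADCROP: Cut the specified number of bases from the start of the read
HEADCROP:12      # Remove the first 12 bases
MINLEN: Drop the read if it is below a specified length
MINLEN:36      # After the steps above, drop reads shorter than 36 bases (put this step last)
TOPHRED33: Convert quality scores to Phred-33
TOPHRED64: Convert quality scores to Phred-64
-phred33    # Option (not a trimming step): the input quality scores are Phred-33 encoded
-phred64    # Option (not a trimming step): the input quality scores are Phred-64 encoded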

"phred33" is a common encoding of base quality values in sequencing data. Each base's quality score is stored as a single ASCII character whose code is the score plus 33. The character "!" (ASCII 33) therefore means quality 0, and the character "I" (ASCII 73) means quality 40. This encoding is used by Sanger-format FASTQ files and by Illumina data from version 1.8 onward; older Illumina pipelines used Phred-64.
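The character-to-score mapping can be checked directly in the shell. This is a minimal sketch; `phred33_score` is a hypothetical helper name, not part of Trimmomatic.

```shell
# Hypothetical helper: decode one Phred-33 quality character to its score.
# $1 = a single quality character from the fourth line of a FASTQ record
phred33_score() (
    # printf with a leading single quote prints the character's numeric code
    code=$(printf '%d' "'$1")
    echo $((code - 33))
)

phred33_score '!'   # prints 0
phred33_score 'I'   # prints 40
```

For Phred-64 data the same idea applies with 64 subtracted instead of 33.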

Single-end (example):

java -jar trimmomatic-0.39.jar SE -phred33 input.fq.gz output.fq.gz ILLUMINACLIP:TruSeq3-SE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

In this example, SE selects single-end mode and -phred33 states that the input uses Phred-33 quality encoding. input.fq.gz is the input file and output.fq.gz is the output file; both are gzip-compressed FASTQ files (Trimmomatic can read and write compressed files). ILLUMINACLIP:TruSeq3-SE.fa:2:30:10 removes adapters using the Illumina adapter file that ships with the software. LEADING:3 removes bases with quality below 3 from the start of the read, and TRAILING:3 does the same at the end of the read. SLIDINGWINDOW:4:15 scans with a 4-base sliding window and cuts once the average quality within the window drops below 15. MINLEN:36 drops reads shorter than 36 bases (put this step last).

Paired-end:

java -jar trimmomatic-0.39.jar PE -phred33 input_forward.fq.gz input_reverse.fq.gz output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

Paired-end mode is similar to single-end mode. The PE argument indicates paired-end data, so there are two input files, and they must be a matching pair (the forward and reverse reads of the same fragments). There are four output files, two for each input file:

  • output_forward_paired.fq.gz: forward reads that survived trimming and whose mate also survived.
  • output_forward_unpaired.fq.gz: forward reads that survived trimming but whose mate was dropped.
  • output_reverse_paired.fq.gz: reverse reads that survived trimming and whose mate also survived.
  • output_reverse_unpaired.fq.gz: reverse reads that survived trimming but whose mate was dropped.

Note that the adapter file must be the PE one (ILLUMINACLIP:TruSeq3-PE.fa:2:30:10).
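Because the two inputs and four outputs follow a fixed pattern, the command line can be generated from a sample name. The sketch below only prints the command and does not run Trimmomatic. The helper name `trimmomatic_pe_cmd` and the `<sample>_1.fastq` / `<sample>_2.fastq` naming scheme are assumptions made for illustration.

```shell
# Hypothetical helper: print (do not run) a Trimmomatic PE command line.
# $1 = path to the Trimmomatic jar
# $2 = sample prefix; inputs are assumed to be <prefix>_1.fastq and <prefix>_2.fastq
# $3 = adapter FASTA file
trimmomatic_pe_cmd() (
    jar=$1
    sample=$2
    adapters=$3
    printf 'java -jar %s PE -phred33 ' "$jar"
    # two input files
    printf '%s_1.fastq %s_2.fastq ' "$sample" "$sample"
    # four output files: paired and unpaired for each input
    printf '%s_1_paired.fq.gz %s_1_unpaired.fq.gz ' "$sample" "$sample"
    printf '%s_2_paired.fq.gz %s_2_unpaired.fq.gz ' "$sample" "$sample"
    # trimming steps, applied in the order given
    printf 'ILLUMINACLIP:%s:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36\n' "$adapters"
)

trimmomatic_pe_cmd trimmomatic-0.39.jar sample TruSeq3-PE.fa
```

Printing the command first makes it easy to check the file order before running it: both inputs, then the paired and unpaired outputs for the forward reads, then the same for the reverse reads.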

HEADCROP:12 removes the first 12 bases of each read.

Note: If the adapter file is not in the current directory, give its full path; otherwise an error will be reported.
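One way to avoid that error is to resolve the adapter file to a full path before building the command. This is a minimal sketch; `adapter_path` is a hypothetical helper name. In the unzipped Trimmomatic-0.39 folder the adapter FASTA files are in the adapters/ subdirectory.

```shell
# Hypothetical helper: print the absolute path of an adapter FASTA file,
# or fail with a message if the file does not exist.
# $1 = path to the adapter file (relative or absolute)
adapter_path() (
    file=$1
    if [ ! -f "$file" ]; then
        echo "adapter file not found: $file" >&2
        exit 1
    fi
    # Resolve the containing directory to an absolute path
    dir=$(cd "$(dirname "$file")" && pwd)
    echo "$dir/$(basename "$file")"
)

# Example use (path taken from the install location used in this article):
# adapters=$(adapter_path /home/cyh/biosoft/Trimmomatic-0.39/adapters/TruSeq3-PE.fa) || exit 1
# ... ILLUMINACLIP:"$adapters":2:30:10 ...
```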

Practical example:

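Here I used one pair of files (forward and reverse reads), SRR12415652_1.fastq and SRR12415652_2.fastq. The data can be downloaded from a public database. I no longer remember whether I got it from NCBI or TCGA, since it was a while ago, but the SRR prefix is an SRA run accession, so NCBI SRA is the likely source.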

java -jar /home/cyh/biosoft/Trimmomatic-0.39/trimmomatic-0.39.jar PE /home/cyh/Desktop/fastq_dir/SRR12415652_1.fastq /home/cyh/Desktop/fastq_dir/SRR12415652_2.fastq /home/cyh/Desktop/SRR124_result/SRR12415652_1_paired.fq.gz /home/cyh/Desktop/SRR124_result/SRR12415652_1_unpaired.fq.gz /home/cyh/Desktop/SRR124_result/SRR12415652_2_paired.fq.gz /home/cyh/Desktop/SRR124_result/SRR12415652_2_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 HEADCROP:12 MINLEN:36 

After the run finishes, perform quality control (FastQC) on the generated results again (the paired files only) and look at the new report.
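Selecting only the paired outputs for FastQC can be sketched as below. The helper name `fastqc_paired_cmd` and the `*_paired.fq.gz` naming are assumptions that follow the file names used in the command above. The helper only prints the fastqc command; `-o` is the FastQC option for the output directory, which must already exist when the command is run.

```shell
# Hypothetical helper: print (do not run) a FastQC command that covers only
# the *_paired.fq.gz files in a Trimmomatic result directory.
# $1 = directory with Trimmomatic outputs, $2 = directory for FastQC reports
fastqc_paired_cmd() (
    result_dir=$1
    report_dir=$2
    printf 'fastqc -o %s' "$report_dir"
    for f in "$result_dir"/*_paired.fq.gz; do
        # Skip the unexpanded pattern when no file matches
        [ -e "$f" ] && printf ' %s' "$f"
    done
    printf '\n'
)

# Example use (paths from this article):
# fastqc_paired_cmd /home/cyh/Desktop/SRR124_result /home/cyh/Desktop/SRR124_result
```

The pattern `*_paired.fq.gz` does not match the `_unpaired` files, because the character before "paired" has to be an underscore.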

The results are as follows: (the reports for SRR12415652_1 and SRR12415652_2 are almost the same, so they are not shown separately)

  • Per base sequence quality

    Comparison before and after trimming

[Figure: Per base sequence quality, before and after trimming]

The improvement in this module is obvious at a glance.

  • Per base sequence content

    Before trimming

[Figure: Per base sequence content, before trimming]

    After trimming

[Figure: Per base sequence content, after trimming]

In this module, the start of the reads is no longer present because I removed the first 12 bases. (Whether to cut these bases depends on your needs.)

  • Adapter Content

    Before trimming

[Figure: Adapter Content, before trimming]

    After trimming

[Figure: Adapter Content, after trimming]

The change in the adapter plot is visible at a glance: the adapters are gone. This step is very important. Adapters in the data must be removed so that they do not affect downstream analysis; removing them reduces noise and improves data quality and accuracy.

I will not go through the other FastQC modules here. You can find a dataset and run it yourself.

That is the end of this article. It is based on my own study and practice, and I consulted a lot of material. If anyone can point out errors, I would be grateful.

Origin blog.csdn.net/qq_74093550/article/details/131587363