Quality control and filtering of raw transcriptome data: trying various methods

I started directly with Trinity and only added data pre-processing afterwards, mostly as practice. In a real workflow this step should come first.
To keep the test small, download a dataset from an organism with a very small genome: Pelagibacter phage Greip EXVC021P (a bacteriophage, not a bacterium).

1. Download and unzip

nohup wget www.XXXX &
nohup fastq-dump --gzip --split-3 -A SRR11559267 &
# wait for the background download to finish, then unzip both mate files
gunzip SRR11559267_1.fastq.gz SRR11559267_2.fastq.gz

2. Raw data quality control report (FastQC)

Install fastqc

conda create -n fastqc
conda activate fastqc
conda install -c bioconda fastqc
fastqc --help

Run fastqc

fastqc -t 4 -o ./ SRR11559267_1.fastq SRR11559267_2.fastq

This produces the files SRR11559267_1_fastqc.html and SRR11559267_2_fastqc.html.
Open them in a browser to view the quality control report. For this dataset the result is not very good.

3. Filter reads

The official NGSQCToolkit website could not be opened, and the tool does not seem to be widely used.
Install libgd and GD before installing NGSQCToolkit. (The GD library provides an API for creating and processing images, for example to generate thumbnails, add watermarks, or draw report graphics.)
conda install libgd


A. FASTX-Toolkit

Before running the toolkit, check whether the quality encoding is Phred+33 or Phred+64. Quality strings that contain = (ASCII 61, below the usual Phred+64 minimum of 64) or digits are generally Phred+33. Sequencing results from recent years are almost always Phred+33; older data downloaded from the Internet may be Phred+64. One quick check some people use:

grep 2 rosalind_filt_1_dataset.txt  # matches found
grep X rosalind_filt_1_dataset.txt  # no matches
# so this is almost certainly Phred+33
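The grep check above also matches header and sequence lines, so it can mislead. A minimal sketch of a stricter check, assuming Illumina-style short reads in a 4-line-per-record FASTQ, looks only at the quality lines and at the range of ASCII codes they contain. The function names and the cut-offs (59 and 74) are illustrative assumptions, not part of FASTX-Toolkit; codes between 59 and 63 could still come from old Solexa+64 data, so that range is reported as inconclusive.

```python
def quality_lines(fastq_lines):
    """Yield every 4th line (the quality string) of a 4-line-per-record FASTQ."""
    for i, line in enumerate(fastq_lines):
        if i % 4 == 3:
            yield line


def guess_phred_offset(qualities):
    """Guess the quality offset from the ASCII range of the quality strings.

    Returns 33, 64, or None when the observed range is inconclusive.
    """
    codes = [ord(ch) for line in qualities for ch in line.strip()]
    if not codes:
        return None
    if min(codes) < 59:   # below ';': not possible in any +64 variant
        return 33
    if max(codes) > 74:   # above 'J': beyond typical Illumina Phred+33 scores
        return 64
    return None           # overlap region, cannot tell
```

On a real file this could be called as `guess_phred_offset(quality_lines(open("SRR11559267_1.fastq")))`; scanning only the first few thousand reads is usually enough.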

B. FASTQ/A Clipper: removing adapter sequences

Here -v prints a summary of how many reads were read and written, and -l 18 discards reads shorter than 18 nt; fastx_clipper -h lists the remaining parameters so you can choose the ones you need. -Q 33 must be added when using the FASTX-Toolkit on Phred+33 data, although it is not shown in -h. The only explanation I have found so far is that -Q is an undocumented parameter indicating that the quality values use ASCII offset 33. The command is as follows:

fastx_clipper -Q 33 -l 18 -a TGGAATTCTCGGGTGCCAAGG -v -i SRR11559267_1.fastq -o SRR11559267_1_clipped.fastq

Regarding the adapter sequence: for data you sequenced yourself, you will naturally know it. For data downloaded from the Internet, for example from NCBI, the adapter sequence is often not stated. In that case, run FastQC and check whether the Adapter Content module is marked with a red cross; the sequences it reports can be tallied to infer the adapter. If very little adapter remains, this step can be skipped.
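To decide whether the clipping step is worth running, one rough option is to count how many reads still contain the start of the adapter. The sketch below is only an illustration: the function name and the 12-base seed length are assumptions, and it looks for exact matches only, so it will undercount adapters that contain sequencing errors or are cut off at the read end.

```python
def adapter_fraction(sequences, adapter, seed_len=12):
    """Return the fraction of reads containing the first seed_len adapter bases."""
    seed = adapter[:seed_len]
    total = 0
    hits = 0
    for seq in sequences:
        total += 1
        if seed in seq:   # exact substring match only
            hits += 1
    return hits / total if total else 0.0
```

With the adapter used in the fastx_clipper command above, a result close to zero suggests that clipping can be skipped.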

C. fastq_quality_filter: removing low-quality reads

fastq_quality_filter -Q 33 -v -q 30 -p 80 -i SRR11559267_1.fastq -o SRR11559267_1_qualified.fastq

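About -q and -p: -q is the minimum quality score to keep, and -p is the minimum percentage of bases in a read that must reach that score. The picture explains this clearly: the reads filtered out by -q 30 -p 80 fall between those filtered out by -q 20 -p 90 and by -q 20 -p 100.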
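The -q/-p rule can be sketched in a few lines. This is an illustration of the rule as described by fastq_quality_filter -h, not the tool's actual code; the function name is hypothetical, and the tool's exact rounding may differ.

```python
def passes_quality_filter(quality, min_quality=30, min_percent=80, offset=33):
    """Return True if at least min_percent of bases have quality >= min_quality.

    quality is the raw FASTQ quality string; offset is 33 for Phred+33.
    """
    if not quality:
        return False
    good = sum(1 for ch in quality if ord(ch) - offset >= min_quality)
    # Integer comparison avoids floating-point surprises at the boundary.
    return good * 100 >= min_percent * len(quality)
```

For example, a 10-base read with eight bases at Q40 and two at Q20 passes -q 30 -p 80, while one with only seven bases at Q40 does not.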

D. Trimmomatic: trimming low-quality bases

If FastQC shows that base quality is poor at the start or end of the reads, Trimmomatic can trim the read ends against a quality threshold. For example,
java -classpath trimmomatic-0.22.jar org.usadellab.trimmomatic.TrimmomaticSE -phred33 data/s1.fq data/tmp.fq TRAILING:30 MINLEN:50
trims bases with quality below 30 from the 3' end of each read (add LEADING:30 to trim the 5' end as well), and then discards reads shorter than 50 bases.

# Single-end
trimmomatic SE -phred33 SRR11559267_2.fastq out1.fq LEADING:22 TRAILING:22
# Paired-end
trimmomatic PE -threads 5 -phred33 SRR11559267_1.fastq SRR11559267_2.fastq -baseout SRR11559267.fastq SLIDINGWINDOW:4:5 LEADING:5 TRAILING:5 MINLEN:25
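To make the LEADING, TRAILING and MINLEN steps concrete, here is a minimal sketch of what they do to a single read. It is an illustration under stated assumptions, not Trimmomatic's implementation: the function name is hypothetical, SLIDINGWINDOW is not modeled, and the steps are applied in a fixed order, whereas Trimmomatic applies them in the order given on the command line.

```python
def trim_ends(seq, qual, leading=0, trailing=0, minlen=1, offset=33):
    """Trim low-quality bases from both ends, then apply a length cut-off.

    Returns (seq, qual) after trimming, or None if the read is too short.
    """
    scores = [ord(ch) - offset for ch in qual]

    # LEADING: drop bases from the 5' end while quality is below the threshold.
    start = 0
    while start < len(scores) and scores[start] < leading:
        start += 1

    # TRAILING: drop bases from the 3' end while quality is below the threshold.
    end = len(scores)
    while end > start and scores[end - 1] < trailing:
        end -= 1

    # MINLEN: discard the read if what remains is shorter than minlen.
    if end - start < minlen:
        return None
    return seq[start:end], qual[start:end]
```

With `leading=22, trailing=22`, as in the single-end command above, a read whose first and last two bases are below Q22 loses exactly those four bases.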

While learning, I found that many tools can do quality filtering: besides FASTX-Toolkit and Trimmomatic, mentioned above, there are also sickle and seqtk.
One post compared Trimmomatic, sickle, and seqtk. Its conclusions:
If you also need to remove sequencing adapters, Trimmomatic is recommended;
if you only need to filter low-quality bases or low-quality reads, choose Trimmomatic or sickle (sickle is sometimes faster);
if you do not want any reads to be discarded and the quality encoding is Phred+33, choose seqtk.

In my own test, the data contained very little adapter sequence. However, FASTX-Toolkit does not seem able to deal with the mismatch between the left and right read files that results from removing low-quality reads from paired-end (PE) data (corrections are welcome if it can keep them in sync), so I went on to learn sickle. For details, see the next blog post: Sickle Transcriptome Data Filter·Use Case
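The paired-end problem described here can be illustrated with a small re-pairing sketch: after each file has been filtered separately, keep only the reads whose mate survived in the other file. This is not a FASTX-Toolkit feature. All function names are hypothetical, read names are assumed to match once a trailing /1 or /2 and any text after the first space are removed, and the first file is held in memory, so it only suits small datasets.

```python
def read_key(header):
    """Reduce a FASTQ header to a name shared by both mates."""
    name = header[1:].split()[0]          # drop '@' and any description
    if name.endswith(("/1", "/2")):       # drop an explicit mate suffix
        name = name[:-2]
    return name


def parse_fastq(lines):
    """Group an iterable of lines into 4-line FASTQ records."""
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            yield tuple(record)
            record = []


def keep_pairs(lines_1, lines_2):
    """Return two record lists containing only reads present in both inputs."""
    records_1 = {read_key(r[0]): r for r in parse_fastq(lines_1)}
    paired_1, paired_2 = [], []
    for rec in parse_fastq(lines_2):
        mate = records_1.get(read_key(rec[0]))
        if mate is not None:
            paired_1.append(mate)
            paired_2.append(rec)
    return paired_1, paired_2
```

Both returned lists have the same length and the same read order, which is what paired-end assemblers generally expect.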

PS: How to install a manually downloaded package with conda:
1. Download the .tar.bz2 file with wget and move it into the miniconda/pkgs/ folder
2. Find the urls.txt file in pkgs and manually add the download address
3. Run the install

More trimmomatic tutorials

Several good assembly cases:
1. Assemble the bacterial genome-Trimmomatic
2. Transcriptome analysis study notes (continuous supplement)
3. phylogenomic_dataset_construction