Filter out chloroplast and mitochondrial reads from transcriptome data

Chloroplast and mitochondrial reads can also be extracted from transcriptome data!

Today I tested Ya Yang's transcriptome assembly pipeline. Its highlight is the ability to extract organelle genome reads. For a detailed introduction to extracting organelle genes from transcriptomes, see the blog post: Literature sharing: RNA-Seq data: a goldmine for organelle research

This post mainly covers quality control and filtering of raw transcriptome data. After a run, the following steps will have been performed:


  1. Correct random sequencing errors with Rcorrector
  2. Remove read pairs that cannot be corrected
  3. Remove sequencing adapters and low-quality sequences with Trimmomatic
  4. Filter organelle reads (cpDNA, mtDNA or both) with Bowtie2. Files containing only organelle reads are produced, which can be used to assemble, for example, plastomes with Fast-Plast
  5. Run FastQC to check read quality and detect over-represented reads
  6. Remove over-represented sequences

Download the code

wget https://bitbucket.org/yanglab/phylogenomic_dataset_construction/get/00bf25405914.zip

Install the software:

These days I have become used to installing software with conda, which is faster.

The following scripts should be run with Python 2.7; they are not compatible with Python 3!

The software used in quality control and screening includes:

biopython
fastqc
rcorrector
trimmomatic
bowtie2

Other software may also be needed. Because some tools were already installed on my Linux system, I am not sure whether anything is missing from this list (BLAST, for example).
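To find out what is actually missing before a run, a small check of the PATH can help. The following is a minimal standalone Python 3 sketch (separate from the pipeline's Python 2.7 scripts). The executable names in REQUIRED_TOOLS are assumptions based on typical installs and are not taken from the repository, so adjust them to your setup.

```python
import shutil

# Executable names are assumptions; adjust them to whatever your
# installation actually provides.
REQUIRED_TOOLS = ["fastqc", "bowtie2", "bowtie2-build", "java", "perl"]


def missing_tools(tools):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

Rcorrector and Trimmomatic are located through APPS_HOME in the wrapper scripts, so they do not necessarily need to be on the PATH.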

Change the software paths (very important)


After cloning the repository you need to change several paths in the following scripts so that they work on your local computer.
Tip: in vi, type /xxx to search within a file, and use the command find -name XXX to locate files.

extract_sequences.py: Change CP_DATABASE and MT_DATABASE to the paths of the two database files, which are in the repository folder called databases.

rcorrector_wrapper.py: Change APPS_HOME to the directory where Rcorrector is installed on your computer. Also check the location of run_rcorrector.pl on the next line of the script. You may need to modify both lines, because different versions can use different path names.

trimmomatic_wrapper.py: Change APPS_HOME to the directory where Trimmomatic is installed on your computer. Also check the location of trimmomatic.jar on the next line of the script. You may need to modify both lines, because different versions can use different path names. Then change TruSeq_ADAPTER to the adapter file in the repository folder called databases.

run_chimera_detection.py: Change SCRIPTS_HOME to the path of the scripts folder in the cloned repository.

transdecoder_wrapper.py: Change BLASTP_DB_PATH to the path of your custom BLAST database. One containing the proteomes of Arabidopsis and Beta is provided as db in the repository folder called databases.
For the organelle filtering step, you can also download a chloroplast genome in FASTA format yourself and use it as the reference genome; under the hood the filtering is essentially just a bowtie2 run.
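To review all of these settings in one pass, something like the following standalone Python 3 sketch can list every assignment of the constants named above. It assumes each constant is set on a single line of the form NAME = value, which I have not verified against every script, and the folder name "scripts" is a placeholder for your local copy.

```python
import os
import re

# Constant names taken from the instructions above.
PATH_CONSTANTS = ("CP_DATABASE", "MT_DATABASE", "APPS_HOME",
                  "TruSeq_ADAPTER", "SCRIPTS_HOME", "BLASTP_DB_PATH")

# Assumes single-line assignments such as: APPS_HOME = "/some/path/"
ASSIGNMENT = re.compile(r"^\s*(%s)\s*=\s*(.+)$" % "|".join(PATH_CONSTANTS))


def find_path_constants(scripts_dir):
    """Yield (file name, line number, constant, value) for each assignment."""
    for name in sorted(os.listdir(scripts_dir)):
        if not name.endswith(".py"):
            continue
        path = os.path.join(scripts_dir, name)
        with open(path, errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                match = ASSIGNMENT.match(line)
                if match:
                    yield name, number, match.group(1), match.group(2).strip()


if __name__ == "__main__":
    # "scripts" is a placeholder for the scripts folder of the repository.
    if os.path.isdir("scripts"):
        for hit in find_path_constants("scripts"):
            print("%s:%d  %s = %s" % hit)
```

Any value that still points to someone else's home directory is a path you have not updated yet.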


Run the program and start debugging

Official instructions:

For paired end reads:
python filter_fq.py taxonID_1.fq.gz taxonID_2.fq.gz Order_name genome_to_filter[cp, mt or both] num_cores output_dir

The first two arguments are the read files. Order_name is the plant order (e.g. Caryophyllales); bowtie2 uses it to build a database for filtering the organelle reads, and it can be replaced with any plant order (or any taxonomic rank following the NCBI taxonomy) to which your study group belongs. For a list of available genomes and their corresponding taxonomy, check the cp_lookout or mt_lookout tables in the databases folder. For the organelle genome you can specify cpDNA, mtDNA or both. num_cores is the number of CPUs or threads to use. output_dir is where all the output files will be saved (any existing directory can be used).

Argument by argument:

taxonID_1.fq.gz taxonID_2.fq.gz : the two paired-end read files.

Order_name : the plant order, which bowtie2 uses to build the database for filtering organelle reads. It can be any plant order (or any other rank in the NCBI taxonomy) to which your study group belongs. The available genomes and their taxonomy are listed in the cp_lookout and mt_lookout tables in the databases folder.

genome_to_filter[cp, mt or both] : which organelle genome to filter: cpDNA, mtDNA or both.

num_cores : the number of CPUs or threads to use.

output_dir : where all output files are saved (any existing directory can be used).
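The argument order is easy to get wrong, so here is a minimal sketch of a helper that assembles the paired-end command and rejects an invalid genome_to_filter value. The function name build_filter_fq_command is hypothetical and not part of the repository; it only builds the argument list described above and does not run anything.

```python
VALID_GENOMES = ("cp", "mt", "both")


def build_filter_fq_command(read1, read2, order_name, genome, num_cores, output_dir):
    """Return the filter_fq.py argument list for paired-end reads."""
    if genome not in VALID_GENOMES:
        raise ValueError("genome_to_filter must be one of %s" % (VALID_GENOMES,))
    if int(num_cores) < 1:
        raise ValueError("num_cores must be at least 1")
    return ["python", "filter_fq.py", read1, read2, order_name,
            genome, str(num_cores), output_dir]


if __name__ == "__main__":
    # File and directory names below are placeholders.
    command = build_filter_fq_command(
        "taxonID_1.fq.gz", "taxonID_2.fq.gz", "Caryophyllales", "cp", 4, "results")
    print(" ".join(command))
```

The returned list can be passed to subprocess.call from within the scripts directory, using a Python 2.7 interpreter for the pipeline itself.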

The command I used is as follows (run in the scripts directory):

python filter_fq.py ../SRR11559267_1.fastq ../SRR11559267_2.fastq Caryophyllales cp  5  ../results

The resulting files are as follows:

$ ls -al
total 642660
drwxrwxr-x 4 user user      4096 11月 27 21:20 .
drwxrwxr-x 7 user user      4096 11月 27 21:20 ..
-rw-rw-r-- 1 user user   6221991 11月 27 20:13 Caryophyllales_cp.1.bt2
-rw-rw-r-- 1 user user   1513468 11月 27 20:13 Caryophyllales_cp.2.bt2
-rw-rw-r-- 1 user user       728 11月 27 20:13 Caryophyllales_cp.3.bt2
-rw-rw-r-- 1 user user   1513462 11月 27 20:13 Caryophyllales_cp.4.bt2
-rw-rw-r-- 1 user user   6163301 11月 27 20:13 Caryophyllales_cp.fa
-rw-rw-r-- 1 user user   6221991 11月 27 20:13 Caryophyllales_cp.rev.1.bt2
-rw-rw-r-- 1 user user   1513468 11月 27 20:13 Caryophyllales_cp.rev.2.bt2
-rw-rw-r-- 1 user user  57869675 11月 27 20:13 SRR11559267_1.cor.fq
-rw-rw-r-- 1 user user  53354170 11月 27 20:13 SRR11559267_1.fix.fq
drwxrwxr-x 4 user user      4096 11月 27 20:13 SRR11559267_1.org_filtered_fastqc
-rw-rw-r-- 1 user user    233889 11月 27 20:13 SRR11559267_1.org_filtered_fastqc.html
-rw-rw-r-- 1 user user    251358 11月 27 20:13 SRR11559267_1.org_filtered_fastqc.zip
-rw-rw-r-- 1 user user  53275949 11月 27 20:13 SRR11559267_1.org_filtered.fq
-rw-rw-r-- 1 user user      2676 11月 27 20:13 SRR11559267_1.org_reads.fq
-rw-rw-r-- 1 user user  50319461 11月 27 20:13 SRR11559267_1.overep_filtered.fq
-rw-rw-r-- 1 user user  53278625 11月 27 20:13 SRR11559267_1.paired.trim.fq
-rw-rw-r-- 1 user user     70912 11月 27 20:13 SRR11559267_1.unpaired.trim.fq
-rw-rw-r-- 1 user user  56126823 11月 27 20:13 SRR11559267_2.cor.fq
-rw-rw-r-- 1 user user  51650784 11月 27 20:13 SRR11559267_2.fix.fq
drwxrwxr-x 4 user user      4096 11月 27 20:13 SRR11559267_2.org_filtered_fastqc
-rw-rw-r-- 1 user user    241238 11月 27 20:13 SRR11559267_2.org_filtered_fastqc.html
-rw-rw-r-- 1 user user    265855 11月 27 20:13 SRR11559267_2.org_filtered_fastqc.zip
-rw-rw-r-- 1 user user  51574065 11月 27 20:13 SRR11559267_2.org_filtered.fq
-rw-rw-r-- 1 user user      2672 11月 27 20:13 SRR11559267_2.org_reads.fq
-rw-rw-r-- 1 user user  48706654 11月 27 20:13 SRR11559267_2.overep_filtered.fq
-rw-rw-r-- 1 user user  51576737 11月 27 20:13 SRR11559267_2.paired.trim.fq
-rw-rw-r-- 1 user user       328 11月 27 20:13 SRR11559267_2.unpaired.trim.fq
-rw-rw-r-- 1 user user       129 11月 27 20:13 SRR11559267_fix_pe.log
-rw-rw-r-- 1 user user        69 11月 27 20:13 SRR11559267_over_pe.log
-rw-rw-r-- 1 user user 106052315 11月 27 20:13 SRR11559267.sam

XXX.overep_filtered.fq is the transcriptome data with the cp (chloroplast) reads filtered out and quality control applied, while XXX.org_reads.fq is the file that stores the chloroplast reads.
Unfortunately, I only realized after running the pipeline that I was using a bacterial transcriptome. Getting even 6 cp reads out of it was generous; had none been found, I would have spent several days debugging!
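A quick way to see how many reads ended up in a file such as XXX.org_reads.fq is to count its records. This is a minimal standalone Python 3 sketch, assuming an uncompressed FASTQ file with the usual four lines per read.

```python
def count_fastq_reads(path):
    """Count reads in an uncompressed FASTQ file with 4 lines per record."""
    with open(path) as handle:
        line_count = sum(1 for _ in handle)
    if line_count % 4 != 0:
        raise ValueError("%s has %d lines, not a multiple of 4" % (path, line_count))
    return line_count // 4
```

For example, count_fastq_reads("SRR11559267_1.org_reads.fq") gives the number of forward reads that were assigned to the chloroplast.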

If you don't want to use the quality control in the Yang pipeline, or prefer to do it yourself, and only want to extract the chloroplast reads, you can refer to the following:

# My read files were too big, so I split them and kept one chunk of about 300 MB from each

split -l 5000000 SRR11554880_1.fastq
mv xaf mollen300_1.fq
rm x*
split -l 5000000 SRR11554880_2.fastq
mv xaf mollen300_2.fq
rm x*

# This gives two raw mollendorffi transcriptome files of about 300 MB each

# Only filter the chloroplast reads, with no quality control
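The split commands above keep the FASTQ records intact because 5,000,000 is a multiple of 4, and they keep the pairs in sync because the same chunk (xaf, the sixth one) is taken from both files. The same idea as a minimal standalone Python 3 sketch, where the function name extract_chunk is hypothetical:

```python
from itertools import islice

LINES_PER_RECORD = 4  # standard 4-line FASTQ layout


def extract_chunk(in_path, out_path, chunk_index, lines_per_chunk=5000000):
    """Copy the chunk_index-th block (0-based) of lines_per_chunk lines.

    Returns the number of complete FASTQ records written.
    """
    if lines_per_chunk % LINES_PER_RECORD != 0:
        raise ValueError("lines_per_chunk must be a multiple of 4")
    start = chunk_index * lines_per_chunk
    written = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in islice(src, start, start + lines_per_chunk):
            dst.write(line)
            written += 1
    return written // LINES_PER_RECORD
```

Calling extract_chunk("SRR11554880_1.fastq", "mollen300_1.fq", 5) and then the same for the _2 file with the same chunk_index corresponds to keeping xaf from each split, without creating the other chunks on disk.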

python ./scripts/filter_organelle_reads.py ./databases/mollendorfi.fasta.txt  mollen300_1.fq mollen300_2.fq  5 ./results_mi/
cd results_mi/
# mollendorfi.fasta.txt: the mollendorfi chloroplast genome file downloaded from NCBI
more mollen300_1.org_reads.fq
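Since the filtering boils down to a bowtie2 alignment against the organelle reference, the decision it makes can be illustrated with the SAM file that bowtie2 writes. The sketch below is only an illustration and not necessarily how filter_organelle_reads.py does it. It uses the standard SAM flag bits (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped) and treats a pair as organelle-derived when either mate aligned, which is my own assumption about the criterion.

```python
PAIRED = 0x1          # SAM flag bit: read is part of a pair
UNMAPPED = 0x4        # SAM flag bit: this read is unmapped
MATE_UNMAPPED = 0x8   # SAM flag bit: the mate is unmapped


def organelle_read_names(sam_lines):
    """Return names of reads where the read or its mate aligned."""
    names = set()
    for line in sam_lines:
        if line.startswith("@"):      # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue
        flag = int(fields[1])
        read_mapped = not flag & UNMAPPED
        mate_mapped = bool(flag & PAIRED) and not flag & MATE_UNMAPPED
        if read_mapped or mate_mapped:
            names.add(fields[0])
    return names
```

With a SAM file from your own run, open it and pass the file handle to the function: `with open("your_run.sam") as handle: names = organelle_read_names(handle)`.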


Origin blog.csdn.net/mushroom234/article/details/110247207