You can also extract chloroplast and mitochondrial reads from a transcriptome!
Today I tested Ya Yang's transcriptome assembly pipeline. The highlight is its ability to extract organelle genome reads. For a detailed introduction to extracting organelle genes from transcriptomes, see the blog post: Literature sharing: RNA-Seq data: a goldmine for organelle research
This post mainly covers quality control and filtering of raw transcriptome data. After the run, the following steps will have been performed:
- Correct random sequencing errors with Rcorrector
- Remove read pairs that cannot be corrected
- Remove sequencing adapters and low-quality sequences with Trimmomatic
- Filter organelle reads (cpDNA, mtDNA or both) with Bowtie2. Files containing only organelle reads are produced, which can be used to assemble, for example, the plastomes with Fast-Plast
- Run FastQC to check read quality and detect over-represented reads
- Remove over-represented sequences
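To make the second step concrete, here is a minimal sketch of dropping read pairs that could not be corrected. It is not the repository's code. It assumes that uncorrectable reads carry an `unfixable_error` tag in their FASTQ header line, which is my understanding of Rcorrector's output; check your own `.cor.fq` files before relying on it. The function names are hypothetical, and the sketch is standalone Python 3, separate from the Python 2.7 scripts.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (header, seq, plus, qual) tuples from an iterable of FASTQ lines."""
    it = iter(lines)
    while True:
        record = tuple(line.rstrip("\n") for line in islice(it, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def drop_unfixable_pairs(reads1, reads2, tag="unfixable_error"):
    """Yield only the pairs where neither mate's header contains the tag.

    The default tag is an assumption about how Rcorrector marks reads
    it could not correct.
    """
    for r1, r2 in zip(read_fastq(reads1), read_fastq(reads2)):
        if tag in r1[0] or tag in r2[0]:
            continue  # drop the whole pair so both files stay in sync
        yield r1, r2
```

Both inputs can be open file handles; a pair is dropped as soon as either mate is flagged, so the two output files keep the same read order.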
Download the code:
wget https://bitbucket.org/yanglab/phylogenomic_dataset_construction/get/00bf25405914.zip
Install the software:
These days I usually install software with conda, which is faster.
The scripts below should be run with Python 2.7; they are not compatible with Python 3!
The software used in quality control and screening includes:
biopython
fastqc
rcorrector
trimmomatic
bowtie2
Other software may also be needed. Because I already had some tools installed on my Linux system, I am not sure whether anything is missing from this list (BLAST, for example).
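Since I am not sure the list is complete, a quick check of what is on the PATH can help. The following is a minimal sketch in standalone Python 3 (it uses `shutil.which`, which Python 2.7 does not have). The executable names are assumptions; conda packages may install the tools under different names on your system.

```python
import shutil

# Executable names are assumptions; adjust them to match your installation.
TOOLS = ["fastqc", "bowtie2", "bowtie2-build", "run_rcorrector.pl", "trimmomatic", "blastp"]


def find_tools(names):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names):
    """Return the names that could not be found on PATH."""
    return [name for name, path in find_tools(names).items() if path is None]
```

Calling `missing_tools(TOOLS)` lists what still needs to be installed. Biopython is a Python package rather than an executable, so check it separately with `python -c "import Bio"` inside the Python 2.7 environment.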
Change the software locations (very important)
After cloning the repository, you need to change several paths within the following scripts to make them work on your local computer:
Tip: make good use of vi's search (/xxx) and the command find -name XXX to locate files.
extract_sequences.py: Change the paths of CP_DATABASE and MT_DATABASE; both files are in the repository folder called databases.
Change CP_DATABASE and MT_DATABASE in extract_sequences.py to the path where the databases are located in your folder.
rcorrector_wrapper.py: Change the path of APPS_HOME, which is where Rcorrector is located on your computer.
Change APPS_HOME in rcorrector_wrapper.py to the path where Rcorrector is installed. Pay attention to the location of run_rcorrector.pl on the following line of the script; two lines need to be modified, because the path names can differ between versions.
trimmomatic_wrapper.py: Change the path of APPS_HOME, which is where Trimmomatic is located on your computer. Also change TruSeq_ADAPTER; this file is in the repository folder called databases.
Change APPS_HOME in trimmomatic_wrapper.py to the folder where Trimmomatic is installed. Pay attention to the location of trimmomatic.jar on the following line of the script; two lines need to be modified, because the path names can differ between versions. Change TruSeq_ADAPTER to the path of the adapter file in your databases folder.
run_chimera_detection.py: Change the path of SCRIPTS_HOME; this is the path to the scripts folder of the cloned repository.
Change SCRIPTS_HOME in run_chimera_detection.py to the path where the scripts are located in your folder.
transdecoder_wrapper.py: Change the path of BLASTP_DB_PATH; this is the path to your custom BLAST database. One with the proteomes of Arabidopsis and Beta is provided as db in the repository folder called databases.
Change BLASTP_DB_PATH in transdecoder_wrapper.py to the path where the databases are located in your folder. You can also download a chloroplast genome in FASTA format yourself and use it as the reference genome; in essence, the filtering step is just a bowtie2 run.
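To see at a glance which lines need editing, the path variables named above can be located with a small scan. This is a minimal sketch in standalone Python 3, not part of the repository. It assumes the variables are assigned at the start of a line in each script, which I have not verified for every file; `find_path_settings` is a hypothetical helper name.

```python
import os
import re

# Variable names taken from the list above.
PATH_VARS = ["CP_DATABASE", "MT_DATABASE", "APPS_HOME",
             "TruSeq_ADAPTER", "SCRIPTS_HOME", "BLASTP_DB_PATH"]


def find_path_settings(scripts_dir, names=PATH_VARS):
    """Return (filename, line_number, line) for each assignment to a listed variable.

    Only lines that begin with the variable name followed by '=' are reported
    (an assumption about how the scripts define their paths).
    """
    pattern = re.compile(r"^(%s)\s*=" % "|".join(re.escape(n) for n in names))
    hits = []
    for filename in sorted(os.listdir(scripts_dir)):
        if not filename.endswith(".py"):
            continue
        path = os.path.join(scripts_dir, filename)
        with open(path, errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if pattern.match(line):
                    hits.append((filename, number, line.rstrip()))
    return hits
```

Running `find_path_settings("scripts")` from the repository root and printing the result gives the file and line number of each path to change.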
Run the program and start debugging
Official instructions:
For paired end reads:
python filter_fq.py taxonID_1.fq.gz taxonID_2.fq.gz Order_name genome_to_filter[cp, mt or both] num_cores output_dir
The first two arguments are the read files. Order_name is the plant order (e.g. Caryophyllales) that bowtie2 will use to create a database for filtering the organelle reads; it can be replaced with any plant order (or any taxonomic rank following the NCBI taxonomy) to which your study group belongs. For a list of available genomes with their corresponding taxonomy, check the cp_lookout or mt_lookout tables in the databases folder. For the organelle genome you can specify cpDNA, mtDNA or both. num_cores is the number of CPUs or threads to use. output_dir is where all the output files will be saved (any existing directory can be used).
filter_fq.py taxonID_1.fq.gz taxonID_2.fq.gz : The first two parameters are the read files.
Order_name : the plant order, which bowtie2 uses to create a database for filtering organelle reads. It can be replaced with any plant order (or any rank in the NCBI taxonomy) to which your study group belongs. For a list of available genomes and their corresponding taxonomy, check the cp_lookout or mt_lookout table in the databases folder.
genome_to_filter[cp, mt or both] : the organelle genome to filter; you can specify cpDNA, mtDNA or both.
num_cores : the number of CPUs or threads to use.
output_dir : the location where all output files are saved (any existing directory can be used).
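The argument order is easy to get wrong, so here is a minimal sketch that assembles the paired-end command and checks the values first. It is only an illustration: `build_filter_fq_command` is a hypothetical helper, and the checks reflect my reading of the instructions above (genome is one of cp, mt or both; output_dir already exists), not validation taken from the repository.

```python
import os

VALID_GENOMES = ("cp", "mt", "both")


def build_filter_fq_command(read1, read2, order_name, genome, num_cores,
                            output_dir, script="filter_fq.py"):
    """Return the argument list for a paired-end filter_fq.py run after basic checks."""
    if genome not in VALID_GENOMES:
        raise ValueError("genome_to_filter must be one of: %s" % ", ".join(VALID_GENOMES))
    if int(num_cores) < 1:
        raise ValueError("num_cores must be at least 1")
    if not os.path.isdir(output_dir):
        raise ValueError("output_dir must be an existing directory: %s" % output_dir)
    return ["python", script, read1, read2, order_name, genome,
            str(num_cores), output_dir]
```

The returned list can be passed to `subprocess.check_call` from the scripts directory, with `python` pointing at the Python 2.7 interpreter.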
The command I used is as follows (run in the scripts directory):
python filter_fq.py ../SRR11559267_1.fastq ../SRR11559267_2.fastq Caryophyllales cp 5 ../results
The resulting files are as follows:
$ ls -al
total 642660
drwxrwxr-x 4 user user 4096 11月 27 21:20 .
drwxrwxr-x 7 user user 4096 11月 27 21:20 ..
-rw-rw-r-- 1 user user 6221991 11月 27 20:13 Caryophyllales_cp.1.bt2
-rw-rw-r-- 1 user user 1513468 11月 27 20:13 Caryophyllales_cp.2.bt2
-rw-rw-r-- 1 user user 728 11月 27 20:13 Caryophyllales_cp.3.bt2
-rw-rw-r-- 1 user user 1513462 11月 27 20:13 Caryophyllales_cp.4.bt2
-rw-rw-r-- 1 user user 6163301 11月 27 20:13 Caryophyllales_cp.fa
-rw-rw-r-- 1 user user 6221991 11月 27 20:13 Caryophyllales_cp.rev.1.bt2
-rw-rw-r-- 1 user user 1513468 11月 27 20:13 Caryophyllales_cp.rev.2.bt2
-rw-rw-r-- 1 user user 57869675 11月 27 20:13 SRR11559267_1.cor.fq
-rw-rw-r-- 1 user user 53354170 11月 27 20:13 SRR11559267_1.fix.fq
drwxrwxr-x 4 user user 4096 11月 27 20:13 SRR11559267_1.org_filtered_fastqc
-rw-rw-r-- 1 user user 233889 11月 27 20:13 SRR11559267_1.org_filtered_fastqc.html
-rw-rw-r-- 1 user user 251358 11月 27 20:13 SRR11559267_1.org_filtered_fastqc.zip
-rw-rw-r-- 1 user user 53275949 11月 27 20:13 SRR11559267_1.org_filtered.fq
-rw-rw-r-- 1 user user 2676 11月 27 20:13 SRR11559267_1.org_reads.fq
-rw-rw-r-- 1 user user 50319461 11月 27 20:13 SRR11559267_1.overep_filtered.fq
-rw-rw-r-- 1 user user 53278625 11月 27 20:13 SRR11559267_1.paired.trim.fq
-rw-rw-r-- 1 user user 70912 11月 27 20:13 SRR11559267_1.unpaired.trim.fq
-rw-rw-r-- 1 user user 56126823 11月 27 20:13 SRR11559267_2.cor.fq
-rw-rw-r-- 1 user user 51650784 11月 27 20:13 SRR11559267_2.fix.fq
drwxrwxr-x 4 user user 4096 11月 27 20:13 SRR11559267_2.org_filtered_fastqc
-rw-rw-r-- 1 user user 241238 11月 27 20:13 SRR11559267_2.org_filtered_fastqc.html
-rw-rw-r-- 1 user user 265855 11月 27 20:13 SRR11559267_2.org_filtered_fastqc.zip
-rw-rw-r-- 1 user user 51574065 11月 27 20:13 SRR11559267_2.org_filtered.fq
-rw-rw-r-- 1 user user 2672 11月 27 20:13 SRR11559267_2.org_reads.fq
-rw-rw-r-- 1 user user 48706654 11月 27 20:13 SRR11559267_2.overep_filtered.fq
-rw-rw-r-- 1 user user 51576737 11月 27 20:13 SRR11559267_2.paired.trim.fq
-rw-rw-r-- 1 user user 328 11月 27 20:13 SRR11559267_2.unpaired.trim.fq
-rw-rw-r-- 1 user user 129 11月 27 20:13 SRR11559267_fix_pe.log
-rw-rw-r-- 1 user user 69 11月 27 20:13 SRR11559267_over_pe.log
-rw-rw-r-- 1 user user 106052315 11月 27 20:13 SRR11559267.sam
XXX.overep_filtered.fq is the transcriptome data with the cp (chloroplast) reads filtered out and quality control applied, while XXX.org_reads.fq is the file that stores the chloroplast reads.
Unfortunately, I only realized after running the pipeline that I had been using a bacterial transcriptome. Getting 6 cp reads out of it was already generous; if none had been found, I would have spent several days debugging! Amazing.
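A quick way to check how many reads ended up in an org_reads file is to count its records. This is a minimal sketch in standalone Python 3; it assumes the standard four-lines-per-record FASTQ layout, and `count_fastq_reads` is a hypothetical helper name.

```python
import gzip


def count_fastq_reads(path):
    """Count records in a FASTQ file (plain or .gz), assuming four lines per record."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError("%s has %d lines, which is not a multiple of 4" % (path, n_lines))
    return n_lines // 4
```

For example, `count_fastq_reads("SRR11559267_1.org_reads.fq")` returns the number of reads in the first-mate organelle file.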
If you do not want to use the quality control in Yang's pipeline, or prefer to do quality control yourself and only want to extract the chloroplast reads, you can refer to the following:
#My file is too big, so I split it and kept one chunk of about 300M
split -l 5000000 SRR11554880_1.fastq
mv xaf mollen300_1.fq
rm x*
split -l 5000000 SRR11554880_2.fastq
mv xaf mollen300_2.fq
rm x*
#This gives raw mollendorffi transcriptome files of about 300M each
#Filter chloroplast reads only, no quality control
python ./scripts/filter_organelle_reads.py ./databases/mollendorfi.fasta.txt mollen300_1.fq mollen300_2.fq 5 ./results_mi/
cd results_mi/
#mollendorfi.fasta.txt: the mollendorfi chloroplast genome file downloaded from NCBI
more mollen300_1.org_reads.fq
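Since the filtering is in essence a bowtie2 run, the same idea can be sketched directly. The snippet below only builds the two command lines (index the reference, then align the pairs and split them into aligned and unaligned files); it does not reproduce the exact options used by filter_organelle_reads.py, which I have not checked. The output file names and the function name are hypothetical. In bowtie2, a `%` in the `--al-conc` and `--un-conc` paths is replaced with 1 or 2 for the two mates.

```python
def bowtie2_filter_commands(reference_fasta, read1, read2, threads, out_prefix):
    """Build bowtie2-build and bowtie2 argument lists for splitting organelle reads.

    Pairs that align concordantly to the reference go to the --al-conc files;
    all other pairs go to the --un-conc files.
    """
    index = out_prefix + "_index"  # hypothetical index name
    build = ["bowtie2-build", reference_fasta, index]
    align = [
        "bowtie2", "-x", index,
        "-1", read1, "-2", read2,
        "-p", str(threads),
        "--al-conc", out_prefix + ".org_reads.%.fq",     # pairs matching the organelle reference
        "--un-conc", out_prefix + ".org_filtered.%.fq",  # pairs that did not match
        "-S", out_prefix + ".sam",
    ]
    return build, align
```

Each list can be run with `subprocess.check_call`, for example with `./databases/mollendorfi.fasta.txt` as the reference and the two mollen300 files as reads.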