Filtering chloroplast and mitochondrial reads out of transcriptome data

You can pull chloroplast and mitochondrial reads out of transcriptome data too!

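Today I tried out Ya Yang's transcriptome assembly pipeline. Its highlight is that it can extract organelle genome reads. For a detailed introduction to extracting organelle genes from transcriptomes, see the blog post "Literature sharing: RNA-Seq data: a goldmine for organelle research".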

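This post mainly covers quality control and filtering of raw transcriptome data. Once the run finishes, you will find that the following steps have been carried out: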

Random sequencing error correction with Rcorrector

Removes read pairs that cannot be corrected

Remove sequencing adapters and low quality sequences with Trimmomatic

Filter organelle reads (cpDNA, mtDNA or both) with Bowtie2. Files containing only organelle reads will be produced, which can be used to assemble, for example, plastomes with Fast-Plast

Runs FastQC to check read quality and detect over-represented reads

Remove over-represented sequences

Download the code

wget https://bitbucket.org/yanglab/phylogenomic_dataset_construction/get/00bf25405914.zip

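Install the software:

Lately I have been installing software with conda, since it is quicker.

Run the scripts below with Python 2.7; they are not compatible with Python 3!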

The software used for quality control and filtering:

biopython
fastqc
rcorrector
trimmomatic
bowtie2

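Other software may be needed as well. My Linux system already had some tools installed, so I am not sure what else I might have missed. BLAST, perhaps?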
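Since it is easy to lose track of what is already installed, a quick check of the PATH can help. The following is a minimal sketch, written as a standalone Python 3 helper that is separate from the pipeline's own Python 2.7 scripts. The executable names in `REQUIRED_TOOLS` are assumptions (conda packages may expose different names), and `missing_tools` is a hypothetical helper, not part of the repository.

```python
import shutil

# Executable names are assumptions; adjust them to match your own installation.
# "java" is listed because trimmomatic.jar needs a Java runtime.
REQUIRED_TOOLS = ["fastqc", "bowtie2", "bowtie2-build", "run_rcorrector.pl", "java"]


def missing_tools(names):
    """Return the names from `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

A tool reported as missing may still be usable if the wrapper scripts point at it through a hard-coded path, so treat the result as a hint rather than a verdict.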

Change the software paths `very important`

After cloning the repository you need to change several paths in the following scripts to make them work on your local computer:

– Tip: use `/xxx` to search inside vi, and the command `find -name XXX` to locate files –

extract_sequences.py: Change the paths of CP_DATABASE and MT_DATABASE. Both files are in the repository folder called databases, so point them at the databases folder inside your own copy.

rcorrector_wrapper.py: Change the path of APPS_HOME, which is where Rcorrector is located on your computer.
Also check the next line in the script, which gives the location of run_rcorrector.pl. You need to edit both lines, because the path name can differ between versions.

trimmomatic_wrapper.py: Change the path of APPS_HOME, which is where Trimmomatic is located on your computer. Also change TruSeq_ADAPTER; this file is in the repository folder called databases.
Also check the next line in the script, which gives the location of trimmomatic.jar. You need to edit both lines, because the path name can differ between versions.

run_chimera_detection.py: Change the path of SCRIPTS_HOME to the scripts folder of your cloned repository.

transdecoder_wrapper.py: Change the path of BLASTP_DB_PATH, which is the path to your custom BLAST database. One with the proteomes of Arabidopsis and Beta is provided as db in the repository folder called databases.
You can also download a chloroplast genome in FASTA format yourself and use it as the reference genome; under the hood, the organelle filtering is essentially just Bowtie2.
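To double-check the edits listed above, it can help to print what each script currently has configured. The sketch below assumes the variables are defined as simple one-line string assignments such as `APPS_HOME = "/some/path/"`; I have not verified that every script follows that form, and `find_path_settings` is a hypothetical helper, not part of the repository.

```python
import re


def find_path_settings(source_text, names):
    """Return {name: value} for simple assignments of the form NAME = "value".

    Only single-line, quoted string assignments at the start of a line
    are recognised; anything else is silently skipped.
    """
    found = {}
    for name in names:
        pattern = r'^%s\s*=\s*["\'](.*?)["\']' % re.escape(name)
        match = re.search(pattern, source_text, flags=re.MULTILINE)
        if match:
            found[name] = match.group(1)
    return found
```

For example, reading trimmomatic_wrapper.py into a string and calling `find_path_settings(text, ["APPS_HOME", "TruSeq_ADAPTER"])` should show both values, which you can then test with `os.path.exists`.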

Run the program and start to `debug`

Official instructions:

For paired end reads:
python filter_fq.py taxonID_1.fq.gz taxonID_2.fq.gz Order_name genome_to_filter[cp, mt or both] num_cores output_dir

The first two arguments are the read files. Order_name is the plant order (e.g. Caryophyllales) that bowtie2 will use to create a database for filtering the organelle reads; it can be replaced with any plant order (or any taxonomic rank following the NCBI taxonomy) to which your study group belongs. For a list of available genomes with their corresponding taxonomy, check the cp_lookout or mt_lookout tables in the databases folder. For the organelle genome you can specify cpDNA, mtDNA or both. num_cores is the number of CPUs or threads to use. output_dir is where all the output files will be saved (any existing directory can be used).

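In short, argument by argument:

filter_fq.py taxonID_1.fq.gz taxonID_2.fq.gz: the first two arguments are the read files.

Order_name: the plant order from which Bowtie2 builds the filtering database. Any order (or other NCBI taxonomic rank) that contains your study group will do, as long as it appears in the cp_lookout or mt_lookout tables in the databases folder.

genome_to_filter[cp, mt or both]: which organelle reads to filter out: chloroplast, mitochondrial or both.

num_cores: the number of CPUs or threads to use.

output_dir: where all output files are saved (any existing directory can be used).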

The command I used (run from the scripts directory):

python filter_fq.py ../SRR11559267_1.fastq ../SRR11559267_2.fastq Caryophyllales cp  5  ../results
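With six positional arguments it is easy to get the order wrong, so here is a small sketch that assembles the same command and rejects an unexpected genome choice before anything runs. The accepted values `cp`, `mt` and `both` are taken from the usage line above; `build_filter_fq_command` is a hypothetical helper of mine, not something the repository provides.

```python
# Values taken from the usage line: genome_to_filter[cp, mt or both]
VALID_GENOMES = ("cp", "mt", "both")


def build_filter_fq_command(read1, read2, order_name, genome, num_cores, output_dir):
    """Return the filter_fq.py call as an argument list, in the documented order."""
    if genome not in VALID_GENOMES:
        raise ValueError("genome must be one of %s, got %r" % (VALID_GENOMES, genome))
    if int(num_cores) < 1:
        raise ValueError("num_cores must be at least 1")
    return ["python", "filter_fq.py", read1, read2,
            order_name, genome, str(num_cores), output_dir]
```

The returned list can be handed to `subprocess.call` from the scripts directory. The sketch does not check that output_dir exists, so create it beforehand.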

The output files are listed below:

$ ls -al
total 642660
drwxrwxr-x 4 user user      4096 11月 27 21:20 .
drwxrwxr-x 7 user user      4096 11月 27 21:20 ..
-rw-rw-r-- 1 user user   6221991 11月 27 20:13 Caryophyllales_cp.1.bt2
-rw-rw-r-- 1 user user   1513468 11月 27 20:13 Caryophyllales_cp.2.bt2
-rw-rw-r-- 1 user user       728 11月 27 20:13 Caryophyllales_cp.3.bt2
-rw-rw-r-- 1 user user   1513462 11月 27 20:13 Caryophyllales_cp.4.bt2
-rw-rw-r-- 1 user user   6163301 11月 27 20:13 Caryophyllales_cp.fa
-rw-rw-r-- 1 user user   6221991 11月 27 20:13 Caryophyllales_cp.rev.1.bt2
-rw-rw-r-- 1 user user   1513468 11月 27 20:13 Caryophyllales_cp.rev.2.bt2
-rw-rw-r-- 1 user user  57869675 11月 27 20:13 SRR11559267_1.cor.fq
-rw-rw-r-- 1 user user  53354170 11月 27 20:13 SRR11559267_1.fix.fq
drwxrwxr-x 4 user user      4096 11月 27 20:13 SRR11559267_1.org_filtered_fastqc
-rw-rw-r-- 1 user user    233889 11月 27 20:13 SRR11559267_1.org_filtered_fastqc.html
-rw-rw-r-- 1 user user    251358 11月 27 20:13 SRR11559267_1.org_filtered_fastqc.zip
-rw-rw-r-- 1 user user  53275949 11月 27 20:13 SRR11559267_1.org_filtered.fq
-rw-rw-r-- 1 user user      2676 11月 27 20:13 SRR11559267_1.org_reads.fq
-rw-rw-r-- 1 user user  50319461 11月 27 20:13 SRR11559267_1.overep_filtered.fq
-rw-rw-r-- 1 user user  53278625 11月 27 20:13 SRR11559267_1.paired.trim.fq
-rw-rw-r-- 1 user user     70912 11月 27 20:13 SRR11559267_1.unpaired.trim.fq
-rw-rw-r-- 1 user user  56126823 11月 27 20:13 SRR11559267_2.cor.fq
-rw-rw-r-- 1 user user  51650784 11月 27 20:13 SRR11559267_2.fix.fq
drwxrwxr-x 4 user user      4096 11月 27 20:13 SRR11559267_2.org_filtered_fastqc
-rw-rw-r-- 1 user user    241238 11月 27 20:13 SRR11559267_2.org_filtered_fastqc.html
-rw-rw-r-- 1 user user    265855 11月 27 20:13 SRR11559267_2.org_filtered_fastqc.zip
-rw-rw-r-- 1 user user  51574065 11月 27 20:13 SRR11559267_2.org_filtered.fq
-rw-rw-r-- 1 user user      2672 11月 27 20:13 SRR11559267_2.org_reads.fq
-rw-rw-r-- 1 user user  48706654 11月 27 20:13 SRR11559267_2.overep_filtered.fq
-rw-rw-r-- 1 user user  51576737 11月 27 20:13 SRR11559267_2.paired.trim.fq
-rw-rw-r-- 1 user user       328 11月 27 20:13 SRR11559267_2.unpaired.trim.fq
-rw-rw-r-- 1 user user       129 11月 27 20:13 SRR11559267_fix_pe.log
-rw-rw-r-- 1 user user        69 11月 27 20:13 SRR11559267_over_pe.log
-rw-rw-r-- 1 user user 106052315 11月 27 20:13 SRR11559267.sam

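XXX.overep_filtered.fq is the transcriptome data with the cp (chloroplast) reads filtered out and quality control already applied, while XXX.org_reads.fq is the file that holds the chloroplast reads.
Unfortunately, only after the whole pipeline had finished did I realize that I had used a bacterial transcriptome. Getting 6 cp reads out of it was already generous; had it returned none at all, I would probably have spent days debugging! Amazing.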
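To see how many reads ended up in a file such as XXX.org_reads.fq, you can count its records. This is a minimal sketch that assumes an uncompressed FASTQ file in which every record takes exactly four lines; `count_fastq_records` is a hypothetical helper name.

```python
def count_fastq_records(path):
    """Count the records in an uncompressed FASTQ file with 4 lines per record."""
    with open(path) as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError("%s has %d lines, which is not a multiple of 4" % (path, n_lines))
    return n_lines // 4
```

Running it on the _1 and _2 files of a pair is also a cheap sanity check: the two counts should match if the pairs are still in sync.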

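If you want to skip the quality control in Yang's pipeline, or prefer to do your own, and simply want to pull out the chloroplast reads, the following may help.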

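#My input files were too large, so I split them and kept a chunk of roughly 300 MB from each

#5000000 is a multiple of 4, so no 4-line FASTQ record gets cut in half
split -l 5000000 SRR11554880_1.fastq
#xaf is the sixth chunk; take the same chunk from both files so the read pairs stay in sync
mv xaf mollen300_1.fq
#careful: this deletes every file in the current directory whose name starts with x
rm x*
split -l 5000000 SRR11554880_2.fastq
mv xaf mollen300_2.fq
rm x*

#This gives two raw mollendorffi transcriptome files of roughly 300 MB each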
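The split-and-delete approach writes every chunk to disk only to remove most of them again. As an alternative, here is a sketch that copies just the wanted range of records. It assumes uncompressed FASTQ with four lines per record, and `copy_fastq_records` is a hypothetical helper, not part of Yang's scripts.

```python
from itertools import islice


def copy_fastq_records(src_path, dst_path, start_record, n_records):
    """Copy n_records 4-line FASTQ records, beginning at 0-based start_record.

    Returns the number of complete records actually written, which is
    smaller than n_records if the source file ends early.
    """
    first_line = start_record * 4
    last_line = first_line + n_records * 4
    written_lines = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in islice(src, first_line, last_line):
            dst.write(line)
            written_lines += 1
    return written_lines // 4
```

The xaf chunk from `split -l 5000000` holds lines 25,000,001 to 30,000,000, so `copy_fastq_records("SRR11554880_1.fastq", "mollen300_1.fq", 5 * 1250000, 1250000)` should select the same reads. Use identical numbers for the _2 file to keep the pairs aligned.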

#Filter out the chloroplast reads only, without quality control

python ./scripts/filter_organelle_reads.py ./databases/mollendorfi.fasta.txt  mollen300_1.fq mollen300_2.fq  5 ./results_mi/
cd results_mi/
#mollendorfi.fasta.txt: the mollendorfi chloroplast genome file downloaded from NCBI
more mollen300_1.org_reads.fq