Transcript assembly and quantification (stringtie) for transcriptome learning [easy-to-understand version of study notes]

date : 2023.07.25

recorder : CYH-BI

Special note: These are my own study notes and carry no authority; they are meant only as ideas and a reference for beginners.
Zhihu address of this article: https://zhuanlan.zhihu.com/p/645770755

stringtie tool for transcript assembly and quantification

Software introduction

StringTie is a fast and efficient assembler of RNA-Seq alignments into potential transcripts. Its input can include not only alignments of short reads, which other transcript assemblers also accept, but also alignments of longer sequences that have been assembled from those reads. To identify differentially expressed genes between experiments, the output of StringTie can be processed with Ballgown, Cuffdiff, or other specialized software (DESeq2, edgeR, etc.).

StringTie applies a network flow algorithm derived from optimization theory, together with an optional de novo assembly step, to assemble short reads into transcripts. Compared with other transcript assemblers, it is reported to reconstruct genes more accurately, to give better expression estimates, and to assemble more transcripts.

Useful address: https://phantom-aria.github.io/2022/04/17/a.html (this article solves many problems)

Stringtie tool installation

Method 1: Use the official website installation package to install

1. Download package

wget http://ccb.jhu.edu/software/stringtie/dl/stringtie-2.2.1.Linux_x86_64.tar.gz

2. Unzip

tar -zxvf stringtie-2.2.1.Linux_x86_64.tar.gz

3. Configure the environment

vim ~/.bashrc    # add the following export line at the end of the file
export PATH="$PATH:/home/cyh/biosoft/stringtie-2.2.1.Linux_x86_64"
source ~/.bashrc
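
Editing the rc file by hand is easy to get wrong (a stray space, or the same line added twice). As a minimal sketch, the helper below appends the export line only if it is not already there. The function name add_to_path_once is hypothetical and not part of StringTie.

```shell
# Hypothetical helper (not part of StringTie): append an export line
# to a shell rc file only if that exact line is not already present.
add_to_path_once() {
  dir=$1      # directory that contains the stringtie binary
  rcfile=$2   # rc file to edit, e.g. ~/.bashrc
  line="export PATH=\"\$PATH:$dir\""
  # -q quiet, -x match the whole line, -F fixed string (no regex)
  grep -qxF "$line" "$rcfile" 2>/dev/null || printf '%s\n' "$line" >> "$rcfile"
}
```

It could be called as: add_to_path_once /home/cyh/biosoft/stringtie-2.2.1.Linux_x86_64 ~/.bashrc, followed by source ~/.bashrc. Running it a second time leaves the file unchanged.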

Method 2: Install using conda

conda install -c bioconda stringtie

Using StringTie

Use -h or --help to view the parameters and usage

 --mix : both short and long read data alignments are provided
        (long read alignments must be the 2nd BAM/CRAM input file)
 --rf : assume stranded library fr-firststrand
 --fr : assume stranded library fr-secondstrand
 -G reference annotation to use for guiding the assembly process (GTF/GFF)
 --conservative : conservative transcript assembly, same as -t -c 1.5 -f 0.05
 --ptf : load point-features from a given 4 column feature file <f_tab>
 -o output path/file name for the assembled transcripts GTF (default: stdout)
 -l name prefix for output transcripts (default: STRG)
 -f minimum isoform fraction (default: 0.01)
 -L long reads processing; also enforces -s 1.5 -g 0 (default:false)
 -R if long reads are provided, just clean and collapse the reads but
    do not assemble
 -m minimum assembled transcript length (default: 200)
 -a minimum anchor length for junctions (default: 10)
 -j minimum junction coverage (default: 1)
 -t disable trimming of predicted transcripts based on coverage
    (default: coverage trimming is enabled)
 -c minimum reads per bp coverage to consider for multi-exon transcript
    (default: 1)
 -s minimum reads per bp coverage to consider for single-exon transcript
    (default: 4.75)
 -v verbose (log bundle processing details)
 -g maximum gap allowed between read mappings (default: 50)
 -M fraction of bundle allowed to be covered by multi-hit reads (default:1)
 -p number of threads (CPUs) to use (default: 1)
 -A gene abundance estimation output file
 -E define window around possibly erroneous splice sites from long reads to
    look out for correct splice sites (default: 25)
 -B enable output of Ballgown table files which will be created in the
    same directory as the output GTF (requires -G, -o recommended)
 -b enable output of Ballgown table files but these files will be 
    created under the directory path given as <dir_path>
 -e only estimate the abundance of given reference transcripts (requires -G)
 --viral : only relevant for long reads from viral data where splice sites
    do not follow consensus (default:false)
 -x do not assemble any transcripts on the given reference sequence(s)
 -u no multi-mapping correction (default: correction enabled)
 --ref/--cram-ref reference genome FASTA file for CRAM input

Transcript merge usage mode: 

  stringtie --merge [Options] { gtf_list | strg1.gtf ...}
With this option StringTie will assemble transcripts from multiple
input files generating a unified non-redundant set of isoforms. In this mode
the following options are available:
  -G <guide_gff>   reference annotation to include in the merging (GTF/GFF3)
  -o <out_gtf>     output file name for the merged transcripts GTF
                    (default: stdout)
  -m <min_len>     minimum input transcript length to include in the merge
                    (default: 50)
  -c <min_cov>     minimum input transcript coverage to include in the merge
                    (default: 0)
  -F <min_fpkm>    minimum input transcript FPKM to include in the merge
                    (default: 1.0)
  -T <min_tpm>     minimum input transcript TPM to include in the merge
                    (default: 1.0)
  -f <min_iso>     minimum isoform fraction (default: 0.01)
  -g <gap_len>     gap between transcripts to merge together (default: 250)
  -i               keep merged transcripts with retained introns; by default
                   these are not kept unless there is strong evidence for them
  -l <label>       name prefix for output transcripts (default: MSTRG)

single sample assembly

For each sample, take the sorted alignment file (.bam) together with the genome annotation file and generate a .gtf file for the subsequent merging step (note: the input .bam file must be sorted).

stringtie -p 3 -e -G /home/cyh/Desktop/hugene_dir/GCF_000001405.40_GRCh38.p14_genomic.gff -o ly1.gtf /home/cyh/Desktop/his_result_sample1/sample1_sorted.bam

-p 3 : use 3 threads
-G : genome annotation file (.gff or .gtf)
-o : output file for this sample (.gtf)
The sorted .bam file of the sample is given last, as a positional argument; the help text above lists no -i option for the input file.

-e : if you do not need novel transcripts, be sure to add the -e parameter.

  1. If the species being studied is poorly annotated (few people study it and the existing annotation is incomplete), transcripts have to be reconstructed for annotation. In this case, do not add -e.
  2. If the annotation is already very complete, as for a model organism such as Arabidopsis, there is no need to reconstruct new transcripts; the existing reference annotation file is sufficient. In this case, use -e so that no novel transcripts are predicted.

The -e parameter also matters for quantification: the prepDE.py3 script can only be run to obtain the read count matrix if -e was used.
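
Because the input .bam file must be sorted, it can help to check the file header before starting a long run. The sketch below assumes that samtools is installed and that the header follows the SAM convention of recording the sort order in the SO field of the @HD line. The function name is hypothetical.

```shell
# Hypothetical helper: succeed if the SAM header text on stdin declares
# coordinate sorting (the @HD line carries SO:coordinate).
header_is_coordinate_sorted() {
  grep -q '^@HD.*SO:coordinate'
}

# Demo with an inline header line. With a real file the header would
# come from: samtools view -H sample1_sorted.bam
if printf '@HD\tVN:1.6\tSO:coordinate\n' | header_is_coordinate_sorted; then
  echo "sorted by coordinate"
else
  echo "not coordinate-sorted: run samtools sort first"
fi
```

A header without SO:coordinate does not always mean the reads are unsorted, so treat a failure here as a prompt to re-sort rather than as proof.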

Recommended reading for this part:

1. Instructions for using StringTie, Jianshu (jianshu.com)

2. https://phantom-aria.github.io/2022/04/17/a.html

Assembly of multiple samples

Once every sample has been assembled on its own, the per-sample assemblies can be merged

stringtie --merge -p 3 \
ly1.gtf \
ly2.gtf \
...(omitted) \
lyn.gtf \
-G /home/cyh/Desktop/hugene_dir/GCF_000001405.40_GRCh38.p14_genomic.gff \
-o stringtied_merged.gtf

The input data are the .gtf files assembled from the individual samples

-G : genome annotation file

The output is a single merged .gtf file (named stringtied_merged.gtf here)

stringtie --merge [options] gtf.list : transcript merge mode. In this mode, StringTie takes a list of input GTF files and merges their transcripts into one non-redundant set. Because the transcriptome is specific in time and space, each RNA-seq sample may contain different transcripts; merging integrates the transcripts of all samples without redundancy. If a reference annotation is supplied with -G, it is merged in as well, and the result is a single complete GTF file that can be used for quantification.
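
With many samples, naming every .gtf on the command line becomes unwieldy, and the usage text allows a gtf_list file instead. The following is a minimal sketch of building such a list; the function name write_merge_list and the file name mergelist.txt are hypothetical.

```shell
# Hypothetical helper: write one GTF path per line into a list file
# that can be passed to stringtie --merge instead of naming every file.
write_merge_list() {
  listfile=$1
  shift
  : > "$listfile"                 # start from an empty list
  for gtf in "$@"; do
    # skip names that do not exist (e.g. an unmatched glob pattern)
    [ -f "$gtf" ] && printf '%s\n' "$gtf" >> "$listfile"
  done
  return 0
}
```

It could be used as: write_merge_list mergelist.txt ly*.gtf, and then stringtie --merge -p 3 -G /home/cyh/Desktop/hugene_dir/GCF_000001405.40_GRCh38.p14_genomic.gff -o stringtied_merged.gtf mergelist.txt.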

The resulting stringtied_merged.gtf can be used to generate results for the Ballgown package; see the quantification section

Quantification

There are several ways to quantify

The first: ( not recommended )

The results of this method are used by the Ballgown package. The -B parameter generates *.ctab files for differential expression analysis with Ballgown. Taking the sample1 data as an example, 6 files are generated (one .gtf and five *.ctab). It is recommended to store the results of each sample in its own folder; otherwise the .ctab files, which have the same names for every sample, overwrite each other. Then read the results with the Ballgown package (the RStudio part is not explained here)

stringtie -e -B -p 4 -G stringtied_merged.gtf -o sample1-ballgown.gtf /home/cyh/Desktop/his_result_sample1/sample1_sorted.bam

Specify the gtf or gff file after -G . It is recommended to use the stringtied_merged.gtf file produced by --merge above

-o output .gtf file

In the output GTF file, the following three expression measures are given for each transcript:
1. coverage
2. TPM
3. FPKM
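
To look at these three values without opening the whole file, the transcript lines can be reduced to a small table. This is a sketch under the assumption that the values are stored in the ninth GTF column as the attributes cov, FPKM and TPM next to transcript_id; the function name is hypothetical.

```shell
# Hypothetical helper: print transcript_id, cov, FPKM and TPM
# (tab-separated) for every "transcript" line of a StringTie GTF.
extract_expression() {
  awk -F'\t' '$3 == "transcript" {
    id = cov = fpkm = tpm = "NA"          # NA if an attribute is missing
    n = split($9, attrs, ";")             # attributes look like: key "value";
    for (i = 1; i <= n; i++) {
      gsub(/^ +/, "", attrs[i])           # drop leading spaces
      split(attrs[i], kv, " ")            # kv[1] = key, kv[2] = quoted value
      gsub(/"/, "", kv[2])                # strip the quotes
      if (kv[1] == "transcript_id") id = kv[2]
      else if (kv[1] == "cov") cov = kv[2]
      else if (kv[1] == "FPKM") fpkm = kv[2]
      else if (kv[1] == "TPM") tpm = kv[2]
    }
    print id "\t" cov "\t" fpkm "\t" tpm
  }' "$1"
}
```

For example, extract_expression sample1-ballgown.gtf | head would show the first few transcripts. Attribute values that contain spaces or semicolons are not handled by this simple split.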

For example, my script creates one folder per sample, because apart from the .gtf file every sample produces files with the same names, and the results would otherwise be overwritten. I have three .bam files and quantify them against the stringtie_merged.gtf obtained from the multi-sample merge. For each sample, one .gtf file and 5 .ctab files are generated. The .ctab files are read with the Ballgown package.

for i in {1,2,3}
do
mkdir sample_ly${i}
cd ./sample_ly${i}
stringtie -e -B -p 20 -G /home/chenyh/ly_NT_RNAseq/stringtie_result/stringtie_merged.gtf -o ly${i}-ballgown.gtf /home/chenyh/ly_NT_RNAseq/samtools_result/ly${i}.bam
cd ../
done

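With the -B parameter, StringTie generates five *.ctab files for each sample:

e_data.ctab: exon-level expression values
i_data.ctab: intron-level expression values
t_data.ctab: transcript-level expression values
e2t.ctab: a table with two columns, e_id and t_id, showing which exons belong to which transcripts. These ids match the ids in the e_data and t_data tables.
i2t.ctab: a table with two columns, i_id and t_id, showing which introns belong to which transcripts. These ids match the ids in the i_data and t_data tables.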
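
The .ctab files can also be inspected from the shell before loading them in R. The sketch below assumes they are tab-separated text with a header row, and it looks a column up by its header name rather than by position. The function name and the column names used in the example (such as FPKM in t_data.ctab) are assumptions to check against your own files.

```shell
# Hypothetical helper: print one column of a tab-separated .ctab file,
# selected by its header name.
ctab_column() {
  awk -F'\t' -v col="$2" '
    NR == 1 {
      for (i = 1; i <= NF; i++) if ($i == col) c = i   # find the column
      if (!c) { print "column not found: " col > "/dev/stderr"; exit 1 }
      next                                              # skip the header
    }
    { print $c }' "$1"
}
```

For example, ctab_column sample_ly1/t_data.ctab FPKM | sort -g | tail would list the largest FPKM values of one sample.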

For how to use the Ballgown package for the subsequent analysis, please see other tutorials.

The second: (recommended)

Use the Python script that comes with StringTie to quantify

prepDE.py

Essentially, StringTie only reports expression at the transcript level, as TPM and FPKM values. To obtain raw counts, the developers provide prepDE.py, a script that calculates raw count expression values. The usage is as follows

python prepDE.py \
-i sample_list.txt \
-g gene_count_matrix.csv \
-t transcript_count_matrix.csv

The input file is sample_list.txt, which has two columns separated by a tab (\t). The first column is the sample name, and the second column is the path of that sample's quantified gtf file. An example is as follows

sampleA A.stringtie.gtf
sampleB B.stringtie.gtf
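
Writing this list by hand is error-prone when there are many samples. As a minimal sketch, the helper below builds it from the per-sample folders, using each folder name as the sample name. The function name make_sample_list is hypothetical, and the folder layout is assumed to be the one created by the loop in the Ballgown section (sample_ly1/ly1-ballgown.gtf and so on).

```shell
# Hypothetical helper: write "sample<TAB>path" lines for prepDE.py,
# taking the name of the folder that holds each GTF as the sample name.
make_sample_list() {
  listfile=$1
  shift
  : > "$listfile"                              # start from an empty file
  for gtf in "$@"; do
    sample=$(basename "$(dirname "$gtf")")     # folder name = sample name
    printf '%s\t%s\n' "$sample" "$gtf" >> "$listfile"
  done
}
```

It could be called as: make_sample_list sample_list.txt sample_ly*/ly*-ballgown.gtf. The resulting file can then be passed to prepDE.py with -i.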

The .gtf files listed here can be the per-sample results generated in the single-sample step (run with -e).

The script outputs raw count expression values at both the gene level and the transcript level, in two files: gene_count_matrix.csv and transcript_count_matrix.csv . Follow-up analysis can be performed using DESeq2 .
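
Before moving on to DESeq2, the count matrix can be checked quickly from the shell. The sketch below drops rows whose counts are zero in every sample. It assumes a plain comma-separated file with one header line, the id in the first column, and no commas inside the ids; the function name is hypothetical, and this filtering is an optional convenience, not a step of prepDE.py.

```shell
# Hypothetical helper: keep the header and every row of a count matrix
# whose counts (columns 2..NF) sum to more than zero.
drop_all_zero_rows() {
  awk -F',' '
    NR == 1 { print; next }                    # always keep the header
    {
      total = 0
      for (i = 2; i <= NF; i++) total += $i    # sum counts over samples
      if (total > 0) print
    }' "$1"
}
```

For example, drop_all_zero_rows gene_count_matrix.csv > gene_count_matrix.nonzero.csv writes a reduced copy and leaves the original file untouched.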


This is the end of the article. It is based on my own study and practice, and I consulted many sources along the way. If any expert can point out mistakes, I would be grateful.

Origin blog.csdn.net/qq_74093550/article/details/131915315