Sequence Alignment for Transcriptome Learning (Hisat2) [Easy-to-understand Version of Study Notes]

Date: 2023.7.25

Recorder: CYH-BI

Special note: This article is a record of my own study and carries no authority; it can only offer ideas and a reference for beginners.
Zhihu address of this article: https://zhuanlan.zhihu.com/p/645765898


Hisat2 introduction:

Hisat2 is a short-read alignment tool used mainly for aligning transcriptome data, and it is the upgraded version of the Hisat aligner. Hisat2 optimizes the index-building strategy and adopts a new alignment strategy, making it more sensitive and faster than Bowtie/TopHat2 and similar software. Hisat2 supports splice site identification and works with transcript reconstruction: it can use known or newly discovered splice site information for spliced alignment to improve the alignment rate and accuracy, and it can be combined with software such as StringTie to perform transcript reconstruction and quantification, giving more comprehensive and accurate transcriptome information.

Principle (documentation) :

HISAT: a fast spliced aligner with low memory requirements | Nature Methods

1. Download and install Hisat2 (for RNA sequence alignment)

Method 1: Download and install the installation package from the official website

Use wget to download the Hisat2 installation package (this address may change; for reference only)

 wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/downloads/hisat2-2.1.0-Linux_x86_64.zip

Decompression (the compressed package is a .zip file, use unzip to decompress)

unzip /home/cyh/biosoft/hisat2-2.1.0-Linux_x86_64.zip

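Configure the environment by adding the install directory to PATH in ~/.bashrc or /etc/profile

vim ~/.bashrc   # edit this file
export PATH="/home/cyh/biosoft/hisat2-2.1.0:$PATH"  # add this line at the end; use your own path (no space after the colon)
source ~/.bashrc  # reload the file so the change takes effect immediately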
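
To confirm that the PATH change really took effect, a small check like the one below can help. This is only a minimal sketch: check_tool is a helper name I made up for illustration and is not part of Hisat2.

```shell
#!/usr/bin/env bash
# check_tool NAME: report whether NAME can be found on the current PATH.
# (check_tool is a hypothetical helper name, not part of Hisat2.)
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}
```

For example, running check_tool hisat2 && hisat2 --version prints the installed version only when the program is reachable; if it reports "missing", re-check the path written in ~/.bashrc.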

Method 2: Install using conda

conda install -c bioconda hisat2

Note: The premise of using this method is that you have pre-installed miniconda3 or anaconda

2. Download the genome file (.fna format) (taking human as an example)

Download address: Genome List - Genome - NCBI (nih.gov). For human, search for "human" on this page.

Click on Homo sapiens

wget -c https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.fna.gz

-c: resume an interrupted download. If the download stops because of a dropped connection or another machine problem, re-running the same command after the problem is fixed continues from where it stopped instead of starting over.

Download the gene annotation file

wget -c https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.gff.gz

Decompress

gunzip GCF_000001405.40_GRCh38.p14_genomic.fna.gz
gunzip GCF_000001405.40_GRCh38.p14_genomic.gff.gz
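
After decompressing, it can be worth checking that the genome file looks like a complete FASTA file before spending hours on indexing. The sketch below simply counts the sequence header lines; count_fasta_records is a name I chose for illustration.

```shell
#!/usr/bin/env bash
# count_fasta_records FILE: print the number of sequences in a FASTA file,
# i.e. the number of header lines beginning with ">".
# (count_fasta_records is a hypothetical helper name.)
count_fasta_records() {
    grep -c '^>' "$1"
}
```

For example, count_fasta_records GCF_000001405.40_GRCh38.p14_genomic.fna prints how many chromosomes, scaffolds and other sequences the file holds. A count of 0 suggests the download or decompression went wrong.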

3. Build the index manually

extract_exons.py /home/chenyh/ly_NT_RNAseq/reference_GRCh38/GCF_000001405.40_GRCh38.p14_genomic.gff > GRCh38.p14_genome.exon    # (optional)

When constructing the HISAT2 index, you need a genome sequence file (usually with the extension .fa, .fasta or .fna) as the main input. This file contains the DNA sequence of the genome.

The .gff file is typically used to annotate functional elements of the genome, such as genes, exons and splice sites. If you have such annotation, you can use helper scripts (such as hisat2_extract_splice_sites.py and hisat2_extract_exons.py) to convert it into the exon and splice site information files that HISAT2 needs, and then pass those files as parameters to the hisat2-build command. As far as I know these scripts read GTF, so a GFF3 file may first need to be converted to GTF.

Therefore, building a HISAT2 index requires at least a genome sequence file as input, while .gff files need extra processing and conversion before they can be used for index building. (Only the following command is required.)

Contents of my script:

hisat2-build ./GCF_000001405.40_GRCh38.p14_genomic.fna GRCh38_index # contents of index.sh

Run the script in the background (the output messages are written to index-log.txt):

nohup bash index.sh > index-log.txt 2>&1 &
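
Because the build runs in the background, it is easy to start aligning before the index is actually finished. hisat2-build writes eight files named PREFIX.1.ht2 to PREFIX.8.ht2 (very large genomes get the .ht2l suffix instead). The sketch below checks that all eight exist; index_complete is a helper name I made up.

```shell
#!/usr/bin/env bash
# index_complete PREFIX: succeed only if all eight index files exist.
# Accepts either the small-index suffix (.ht2) or the large-index one (.ht2l).
# (index_complete is a hypothetical helper name, not part of Hisat2.)
index_complete() {
    local prefix="$1" i
    for i in 1 2 3 4 5 6 7 8; do
        if [ ! -e "$prefix.$i.ht2" ] && [ ! -e "$prefix.$i.ht2l" ]; then
            echo "missing: $prefix.$i.ht2"
            return 1
        fi
    done
    echo "index complete: $prefix"
}
```

For example, index_complete GRCh38_index can be run in the directory where index.sh was started. This only checks that the files exist, so the end of index-log.txt is still worth reading for errors.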

4. Sequence Alignment

hisat2 -p 4 --dta -x /home/cyh/Desktop/hugene_dir/GRCh38_index -1 /home/cyh/Desktop/fastq_dir/ly1_1.fa -2 /home/cyh/Desktop/fastq_dir/ly1_2.fa -S /home/cyh/Desktop/ly1_seq_mached.sam # core content of the script (hisat2.sh)
nohup bash hisat2.sh > hisat2-log.txt 2>&1 &  # long-running, so run it in the background
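
When the background job finishes, hisat2 writes an alignment summary to the log, and its last line has the form "96.70% overall alignment rate". The sketch below pulls that number out of the log; alignment_rate is a helper name I chose, and the log file name follows the command above.

```shell
#!/usr/bin/env bash
# alignment_rate LOGFILE: print the overall alignment rate (e.g. "96.70%")
# from a hisat2 log. The summary line starts with the percentage, so the
# first whitespace-separated field is the value we want.
# (alignment_rate is a hypothetical helper name.)
alignment_rate() {
    awk '/overall alignment rate/ { print $1 }' "$1"
}
```

For example, alignment_rate hisat2-log.txt prints one rate per hisat2 run recorded in the log. If nothing is printed, the run most likely has not finished or ended with an error.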

Code explanation:

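-p 4: use 4 threads;

--dta: short for --downstream-transcriptome-assembly. It makes Hisat2 report alignments tailored for transcript assemblers such as StringTie. With this option Hisat2 requires longer anchor lengths when discovering new splice sites, which gives fewer alignments with short anchors and helps the assembler. It does not create a separate assembly file; the alignments are still written to the SAM output.

-x: the prefix of the index built in step 3;

-1, the first read file; -2, the second read file (the two mates of paired-end data). Hisat2 expects FASTQ input by default; reads in FASTA format need the -f option;

-S: the output file, in .sam format;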
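
With several samples, the same command is repeated with only the sample name changed. The sketch below prints one hisat2 command per sample so they can be collected into a script like hisat2.sh. It is only an illustration under assumptions: build_hisat2_cmd is a name I made up, and the read files are assumed to be named SAMPLE_1.fq and SAMPLE_2.fq.

```shell
#!/usr/bin/env bash
# build_hisat2_cmd INDEX SAMPLE FASTQ_DIR OUT_DIR [THREADS]
# Print (not run) the paired-end hisat2 command for one sample.
# Assumption: reads are named <SAMPLE>_1.fq and <SAMPLE>_2.fq in FASTQ_DIR.
# (build_hisat2_cmd is a hypothetical helper name.)
build_hisat2_cmd() {
    local index="$1" sample="$2" fastq_dir="$3" out_dir="$4" threads="${5:-4}"
    echo "hisat2 -p $threads --dta -x $index" \
         "-1 $fastq_dir/${sample}_1.fq -2 $fastq_dir/${sample}_2.fq" \
         "-S $out_dir/${sample}.sam"
}
```

For example, for s in ly1 ly2; do build_hisat2_cmd GRCh38_index "$s" fastq_dir out_dir; done > hisat2.sh writes two commands into hisat2.sh (ly2 is an invented sample name). Printing the commands first makes it easy to look them over before starting a long run.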


If you need to check other functions of hisat2 , please refer to the user manual or blogs of other bloggers

Official website address: Manual | HISAT2 (daehwankimlab.github.io)

Recommended blogger blog posts:

1. RNA-seq process study notes (7) - using Hisat2 for sequence alignment_Yaoyao Dad Love Learning Blog-CSDN Blog

2. Transcriptome Analysis | Using Hisat2 for Sequence Alignment - Tencent Cloud Developer Community - Tencent Cloud (tencent.com)


Extra content:

1. extract_splice_sites.py

extract_splice_sites.py genemo.gtf > genemo.ss

This command uses the extract_splice_sites.py script to extract splice site information from a gene annotation file. The parameters are:

genemo.gtf: the path and file name of the gene annotation file; here the gene annotation file of the genemo genome is used.

genemo.ss: the file that receives the extracted splice site information (the script prints to standard output, which is redirected with >).

In summary, this command extracts the splice site information from the gene annotation file of the genemo genome and writes the result to genemo.ss, which provides splice site information for later RNA-Seq analysis.

extract_splice_sites.py is a Python script for extracting splice site information from a gene annotation file in GTF format. The .ss file it produces is a tab-separated list with one splice site per line: the chromosome, the zero-based positions of the last base of the exon to the left of the intron and the first base of the exon to the right, and the strand. This file is passed to hisat2-build with --ss so that known splice sites are taken into account during alignment, which helps identify different splicing events for quantitative analysis.
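
A quick way to sanity-check a splice site file is to count the entries on each strand. The sketch below assumes the four-column, tab-separated layout (chromosome, left position, right position, strand); count_by_strand is a helper name I made up.

```shell
#!/usr/bin/env bash
# count_by_strand SS_FILE: count the splice sites on each strand.
# Assumes a tab-separated file whose 4th column is the strand ("+" or "-").
# (count_by_strand is a hypothetical helper name.)
count_by_strand() {
    awk -F'\t' '{ n[$4]++ } END { printf "+ %d\n- %d\n", n["+"], n["-"] }' "$1"
}
```

For example, count_by_strand genemo.ss prints two lines, one per strand. If both counts are 0, the annotation file was probably not in the format the script expects.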

2. extract_exons.py

extract_exons.py genemo_data/genes/genemo.gtf >genemo.exon

This command uses the extract_exons.py script to extract exon information from a gene annotation file. The parameters are:

genemo_data/genes/genemo.gtf: the path and file name of the gene annotation file; here it is the gene annotation file of the genemo genome.

genemo.exon: the file that receives the extracted exon information.

The .exon file is the exon information file extracted from the gene annotation file. It lists the exons of the genes on the genome, including the chromosome and the position of each exon. It is used together with the .ss file when building the index, and can help identify exon boundaries and perform quantitative analysis.

In summary, this command extracts exon information from the gene annotation file of the genemo genome and writes the result to genemo.exon, which provides exon information for later RNA-Seq analysis.

3. hisat2-build

hisat2-build --ss genemo.ss --exon genemo.exon genemo_data/genome/genemo.fa genemo_tran

This is a command to build a genome index using Hisat2 software. The following is a detailed explanation of each parameter:

--ss genemo.ss: the path and file name of the file containing the splice site information. Here, the splice site file genemo.ss extracted from the gene annotation file of the genemo genome is used.

--exon genemo.exon: the path and file name of the file containing the exon information. Here, the exon file genemo.exon extracted from the gene annotation file of the genemo genome is used.

genemo_data/genome/genemo.fa: the path and file name of the genome sequence file; here the FASTA sequence file of the genemo genome is used.

genemo_tran: the name (prefix) of the output index. This index includes the transcript information, so that splicing events and exons can be taken into account when aligning RNA sequencing data.

In summary, this command uses Hisat2 to build a genome index, taking the splice site file and the exon file as input together with the genome sequence file, and produces the index genemo_tran, which contains transcript information. This index can then be used for aligning and analysing RNA sequencing data.
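
The three steps above depend on each other, so the order matters: splice sites and exons first, then the index. The sketch below only prints the three commands in that order so they can be saved as a script. print_index_steps is a name I made up, and the script names follow the ones used in this article (some Hisat2 versions name them hisat2_extract_splice_sites.py and hisat2_extract_exons.py).

```shell
#!/usr/bin/env bash
# print_index_steps GTF GENOME_FA INDEX_NAME
# Print (not run) the commands for building a splice-aware index,
# in the order they have to be executed.
# (print_index_steps is a hypothetical helper name.)
print_index_steps() {
    local gtf="$1" fasta="$2" index="$3"
    echo "extract_splice_sites.py $gtf > $index.ss"
    echo "extract_exons.py $gtf > $index.exon"
    echo "hisat2-build --ss $index.ss --exon $index.exon $fasta $index"
}
```

For example, print_index_steps genemo.gtf genemo.fa genemo_tran > build_tran.sh saves the commands (build_tran.sh is an invented file name), and the script can then be started with nohup in the same way as index.sh.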


That is the end of this article. It is based on my own study and practice, and I referred to a lot of material along the way. If any expert can point out mistakes, I would be grateful.

Origin blog.csdn.net/qq_74093550/article/details/131915068