2020 Transcriptome RNA-SEQ Upstream Analysis

Install and configure conda

Download the installer script from the Tsinghua mirror and run it

# Download the installer script from the Tsinghua mirror
wget -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/miniconda/Miniconda3-latest-Linux-x86_64.sh

# Or download the latest Miniconda3 installer from the official site (slower)
wget -c https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

After downloading, run the script directly with bash Miniconda3-latest-Linux-x86_64.sh. Enter yes when prompted and wait for the installation to complete. conda cannot be used immediately after installation; you first need to source bashrc.

# Reload bashrc
source ~/.bashrc

NOTE⚠️:

  • The installer writes an initialization block to bashrc, so every new shell (for example after connecting over ssh) automatically enters the conda base environment. If you do not need this, turn it off with: conda config --set auto_activate_base false
  • If you use another shell such as zsh and the block was not written to zshrc automatically, you can add it to that file manually.
  • If the conda command is not found, you can set the environment variable manually: export PATH="/home/super/miniconda3/bin:$PATH"
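
The manual PATH fix in the last note can be made safe to run more than once. The following is a minimal sketch, not part of the original setup: the function name path_prepend_once is illustrative, and the install directory $HOME/miniconda3 is an assumption (the Miniconda default); adjust it to your own location.

```shell
# Add a directory to the front of PATH only if it is not already listed.
# path_prepend_once is an illustrative helper name, not a conda command.
path_prepend_once() {
    case ":$PATH:" in
        *":$1:"*) ;;               # already present, nothing to do
        *) PATH="$1:$PATH" ;;      # otherwise put it first
    esac
}

# Assumption: Miniconda was installed to its default location under $HOME.
path_prepend_once "$HOME/miniconda3/bin"
export PATH
```

Placing these lines in bashrc or zshrc keeps PATH from growing each time the file is sourced.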

Set mirror source

# The next four lines add the Tsinghua University mirror channels, including bioconda; recommended for users in mainland China
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/bioconda/

## Official default channels
conda config --add channels r 
conda config --add channels conda-forge 
conda config --add channels bioconda

After you set the mirrors or disable automatic activation of base, the settings are written automatically to the .condarc file in your home directory, as follows:

$ cat .condarc 

auto_activate_base: false
channels:
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/bioconda/
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge/
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
  - defaults

conda environment creation

Create a Python 3 environment:

conda create -y -n rna_seq python=3

# -y        confirm automatically
# -n        name of the new environment
# python=3  use Python 3 in the new environment

Activate and exit environments

conda activate <conda_name>     # activate an environment
conda deactivate                # deactivate the current environment

Installing software with conda

Install software with conda install inside the activated environment

conda install -y sra-tools      # install sra-tools; several packages can be installed at once, separated by spaces
conda install -y sra-tools fastqc trim-galore hisat2 subread multiqc samtools salmon fastp

conda installs software in a different location from ordinary software. Use which <softname> to check where a conda-installed program is located.

Quality Assessment@fastQC

fastq format

FastQ format description: https://mp.weixin.qq.com/s/8g-oUjiEhV4cGMJNuhmISQ
FastQ format wiki: https://en.wikipedia.org/wiki/FASTQ_format
FastQ format literature: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2847217/

Concept
FastQ is a common sequence format that stores biological sequences together with their quality scores. Each sequence letter and each quality score is encoded as a single ASCII character. It was originally developed at the Sanger Institute to bundle a FASTA sequence with its quality data, and it has since become the de facto standard for high-throughput sequencing results.

Format description
Each sequence in a FASTQ file usually has four lines:

  • 1. The first line: must start with "@", followed by a unique sequence ID identifier, and then optional sequence description content. The identifier and description content are separated by spaces;
  • 2. The second line: sequence characters (nucleic acid is [AGCTN]+, protein is amino acid characters);
  • 3. The third line: must start with "+", followed by an optional ID identifier and optional description content. If there is content after "+", it must be the same as the content after "@" in the first line;
  • 4. The fourth line: base quality characters. Each character corresponds to the quality of the base or amino acid at the same position in the second line. The character can be converted into a base quality score according to fixed rules, and the score reflects the error rate of that base. The number of characters in this line must equal the number of characters in the second line.
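
The conversion rule mentioned in the fourth item can be sketched as follows. This assumes the Phred+33 encoding (score = ASCII code minus 33), which the description above does not specify; some older Illumina files use an offset of 64 instead. The function names phred_scores and phred_error_prob and the sample quality string are illustrative.

```shell
# Decode a FASTQ quality string into per-base Phred scores.
# Assumption: Phred+33 encoding unless another offset is given as $2.
phred_scores() {
    # od dumps each character as its decimal ASCII code;
    # awk subtracts the offset and joins the scores with spaces.
    printf '%s' "$1" |
        od -An -v -tu1 |
        awk -v off="${2:-33}" '
            { for (i = 1; i <= NF; i++) { printf "%s%d", sep, $i - off; sep = " " } }
            END { print "" }'
}

# Convert a Phred score Q into an error probability P = 10^(-Q/10).
phred_error_prob() {
    awk -v q="$1" 'BEGIN { printf "%.4f\n", exp(-q / 10 * log(10)) }'
}

phred_scores 'II5+!'      # prints: 40 40 20 10 0
phred_error_prob 20       # prints: 0.0100
```

A score of 20 therefore means a 1 in 100 chance that the base call is wrong, and a score of 40 means 1 in 10,000.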

FastQC software

FastQC quality assessment software official website: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Attention ⚠️

  • fastqc can perform quality assessment on *.bam, *.sam, *.fq and *.fq.gz files.
  • fastqc can run multi-threaded via the -t option. Each thread processes one input file, so several threads handle several files at the same time; using multiple threads on a single file seems to bring no benefit.
  • There seems to be no difference between the fastqc results for a bam file and those for the filtered files.
  • Batch processing is simple in bash. zsh behaves differently because it does not split an unquoted variable into separate words, so the file list has to be passed through a command substitution such as echo $list.

Commonly used parameters:

# Common parameters
fastqc -o <out.dir> -t <thread_num> -f <input_format>  <input_file_1> <input_file_2> ...

# -o    set the output directory
# -t    set the number of threads
# -f    set the input file format

Batch processing

# The output directory must already exist; fastqc does not create it
mkdir -p ./fastqc_raw

# In bash
a=`ls *.fq`
fastqc -o ./fastqc_raw -t 10 $a

# In zsh
b=`ls -C *.fq`
fastqc -o ./fastqc_raw -t 10 `echo $b`
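
The bash and zsh variants above differ only in how the file list gets split into words. One way to sidestep the difference is to let the shell expand the glob straight into function arguments. The following is a minimal sketch under that assumption: fastqc_batch and the DRY_RUN variable are illustrative names, not part of FastQC, and input files are assumed to end in .fq.

```shell
# Run fastqc on many files without relying on word splitting of a variable.
# fastqc_batch and DRY_RUN are illustrative names, not FastQC features.
fastqc_batch() {
    outdir="$1"; threads="$2"; shift 2
    # If the glob matched nothing, the first argument is missing or not a file.
    if [ "$#" -eq 0 ] || [ ! -e "$1" ]; then
        echo "no input files" >&2
        return 1
    fi
    if [ -n "${DRY_RUN:-}" ]; then
        # Only show the command that would be executed.
        echo fastqc -o "$outdir" -t "$threads" "$@"
    else
        mkdir -p "$outdir"    # fastqc does not create the output directory
        fastqc -o "$outdir" -t "$threads" "$@"
    fi
}

# Usage (the shell expands *.fq into one argument per file):
# fastqc_batch ./fastqc_raw 10 *.fq
```

Because "$@" keeps one argument per file, the same call works in both shells and also copes with file names that contain spaces. Setting DRY_RUN=1 prints the command without running it.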

Origin blog.csdn.net/weixin_44452187/article/details/108422252