RNA-seq: 【FastQC】

Analysis of commonly flagged modules

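Distribution of the bases A, T, G, and C across read positions (the per base sequence content module). FastQC tallies the four bases at every position of the reads; in a normal library the four lines run parallel and close together. A bias over part of the read suggests overrepresented sequences.
For example, if the base frequencies differ slightly over the first 10 positions, contamination is one possible cause 【why do the base proportions fluctuate at the start of reads?】.
If the difference between A and T, or between G and C, exceeds 10% at any position, the module reports "WARN"; above 20% it reports "FAIL". AT content is typically higher than GC, with A and T at roughly 28% each and G and C at roughly 22% each.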
It’s worth noting that some types of library will always produce biased sequence composition, normally at the start of the read. Libraries produced by priming using random hexamers (including nearly all RNA-Seq libraries) and those which were fragmented using transposases inherit an intrinsic bias in the positions at which reads start. This bias does not concern an absolute sequence, but instead provides enrichment of a number of different K-mers at the 5' end of the reads. Whilst this is a true technical bias, it isn't something which can be corrected by trimming and in 【most cases doesn't seem to adversely affect the downstream analysis】. It will however produce a warning or error in this module.
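The WARN/FAIL rule above can be sketched in a few lines of Python. This is only an illustration of the thresholds described in these notes, not FastQC's actual code: the function names (`per_base_content`, `flag_position`, `module_status`) are made up for this example, and ignoring N bases when computing percentages is an assumption.

```python
from collections import Counter

BASES = "ACGT"

def per_base_content(reads):
    """Return one {base: percent} dict per read position."""
    if not reads:
        return []
    counts = [Counter() for _ in range(max(len(r) for r in reads))]
    for read in reads:
        for pos, base in enumerate(read.upper()):
            if base in BASES:  # assumption: N and other codes are ignored
                counts[pos][base] += 1
    content = []
    for c in counts:
        total = sum(c.values())
        content.append({b: 100.0 * c[b] / total if total else 0.0 for b in BASES})
    return content

def flag_position(pct, warn=10.0, fail=20.0):
    """Compare A with T and G with C at one position (values in percent)."""
    diff = max(abs(pct["A"] - pct["T"]), abs(pct["G"] - pct["C"]))
    if diff > fail:
        return "FAIL"
    if diff > warn:
        return "WARN"
    return "PASS"

def module_status(reads):
    """The worst flag over all positions decides the module result."""
    rank = {"PASS": 0, "WARN": 1, "FAIL": 2}
    flags = [flag_position(p) for p in per_base_content(reads)]
    return max(flags, key=rank.get, default="PASS")
```

With a random-hexamer primed RNA-seq library, a check like this would be expected to flag the first dozen or so positions, which matches the note above that the warning is usually harmless for expression analysis.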

Overrepresented sequences
Biased fragmentation: Any library which is generated based on the ligation of random hexamers or through tagmentation should theoretically have good diversity through the sequence, but experience has shown that these libraries always have a selection bias in around the first 【12bp】 of each run. This is due to a biased selection of random primers, but doesn’t represent any individually biased sequences. 【Nearly all RNA-Seq libraries will fail this module because of this bias, but this is not a problem which can be fixed by processing, and it doesn't seem to adversely affect the ability to measure expression.】
Biased composition libraries

Per Sequence GC Content Error

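Sequence duplication levels. The deeper the sequencing, the more likely it is that some duplication appears, and this is normal; a very high duplication level, however, suggests that some bias is present. The x-axis is the duplication level (number of copies) and the y-axis is the number of duplicated reads, with the total number of unique reads taken as 100%.
Raw data files are large and a full count would be very slow, so FastQC takes the sequences seen in the first 200,000 reads of the FASTQ file and counts how often they recur in the whole file (the cutoff is version dependent; recent FastQC releases document 100,000). Reads duplicated 10 or more times are pooled into a single bin, which is why the curve often rises slightly at the far right. When non-unique reads make up more than 20% of the total the module reports "WARN"; above 50% it reports "FAIL".
This module helps judge library complexity: too many PCR cycles or too little starting material will both reduce the complexity of the library.
A large number of duplicated sequences means low library complexity, which may also be related to very high expression of a particular gene.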
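The counting scheme described above can be sketched as follows. This is a simplified illustration under stated assumptions, not FastQC's implementation: `duplication_summary` and its parameters are hypothetical names, the non-unique percentage is taken here as (tracked reads minus distinct sequences) / tracked reads, and reads longer than 75 bp are truncated to 50 bp before comparison.

```python
from collections import Counter

def duplication_summary(reads, track_first=200_000, pool_at=10):
    """Estimate duplication from sequences first seen in the leading reads."""
    counts = Counter()
    for i, read in enumerate(reads):
        # Long reads are truncated so sequencing errors near the 3' end
        # do not hide otherwise identical sequences.
        key = read[:50] if len(read) > 75 else read
        # New sequences are only admitted within the first `track_first`
        # reads; later reads only increment sequences already tracked.
        if i < track_first or key in counts:
            counts[key] += 1

    # Number of distinct sequences per duplication level, pooled at 10+.
    levels = Counter(min(n, pool_at) for n in counts.values())
    total = sum(counts.values())
    distinct = len(counts)
    pct_non_unique = 100.0 * (total - distinct) / total if total else 0.0

    # Plot values: each level relative to the count of unique sequences.
    unique = levels[1]
    relative = (
        {lvl: 100.0 * n / unique for lvl, n in sorted(levels.items())}
        if unique else {}
    )
    if pct_non_unique > 50:
        status = "FAIL"
    elif pct_non_unique > 20:
        status = "WARN"
    else:
        status = "PASS"
    return {
        "levels": dict(levels),
        "relative_to_unique": relative,
        "pct_non_unique": pct_non_unique,
        "status": status,
    }
```

For example, `duplication_summary(["AAAA", "AAAA", "CCCC", "GGGG"])` tracks three distinct sequences among four reads, so 25% of the reads are non-unique and the sketch returns "WARN".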

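Overrepresented sequences: how many times a single sequence occurs. A transcriptome contains a very large number of transcripts, so however abundant one sequence is, it should not normally account for even a small fraction (say 1%) of all reads. When it does, either that transcript is expressed at an enormous level or the sample is contaminated. The module lists every sequence above 0.1% of all reads and fails when any sequence exceeds 1%.
A sequence that appears in large numbers is called over-represented; FastQC's criterion is more than 0.1% of all reads. As in the duplication analysis, to keep the computation manageable only the first 200,000 reads of the FASTQ file are tracked, so an over-represented read may be missed if it does not appear among them. Reads longer than 75bp are again truncated to 【50bp】. If a contaminant file is supplied on the command line with -c, each over-represented sequence is searched against the entries in that file for a matching hit (at least 20bp long with at most one mismatch).
The table shows sequences that are at least 20bp long and make up more than 0.1% of all reads, which helps identify contamination (for example vector or adapter sequences).
If the GC content distribution plot has failed, this table helps identify the source: known vectors or adapters are named in the table; if the sequence is not one of them, copy it and run BLAST.
If BLAST shows that the sequence is a gene, that supports the hypothesis that the gene is very highly expressed.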
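The overrepresented sequence check and the contaminant lookup described above can be sketched like this. It is a rough illustration, not FastQC's code: all function names are hypothetical, the contaminant dictionary stands in for a parsed contaminant file, and the matching rule is simplified to "an ungapped aligned stretch of at least 20 bp containing at most one mismatch".

```python
from collections import Counter

def longest_match(query, target, max_mismatch=1):
    """Longest ungapped aligned stretch with at most `max_mismatch` mismatches."""
    best = 0
    # `offset` is the position of query[0] relative to target[0].
    for offset in range(-(len(query) - 1), len(target)):
        q_start = max(0, -offset)
        t_start = max(0, offset)
        span = min(len(query) - q_start, len(target) - t_start)
        left = 0       # start of the current window inside the overlap
        mismatches = 0
        for right in range(span):
            if query[q_start + right] != target[t_start + right]:
                mismatches += 1
            while mismatches > max_mismatch:  # shrink window from the left
                if query[q_start + left] != target[t_start + left]:
                    mismatches -= 1
                left += 1
            best = max(best, right - left + 1)
    return best

def find_contaminant(seq, contaminants, min_len=20):
    """Return the name of the first contaminant that matches, else 'No Hit'."""
    for name, contaminant_seq in contaminants.items():
        if longest_match(seq, contaminant_seq) >= min_len:
            return name
    return "No Hit"

def overrepresented(reads, contaminants=None, track_first=200_000,
                    warn_pct=0.1, fail_pct=1.0):
    """List sequences above `warn_pct` percent of all reads, with a status."""
    counts = Counter()
    total = 0
    for i, read in enumerate(reads):
        total += 1
        key = read[:50] if len(read) > 75 else read  # truncate long reads
        if i < track_first or key in counts:
            counts[key] += 1
    hits = []
    for seq, n in counts.most_common():
        pct = 100.0 * n / total
        if pct > warn_pct:
            hits.append((seq, n, pct, find_contaminant(seq, contaminants or {})))
    if any(hit[2] > fail_pct for hit in hits):
        status = "FAIL"
    elif hits:
        status = "WARN"
    else:
        status = "PASS"
    return status, hits
```

A sequence reported with "No Hit" is the case where copying it into BLAST, as suggested above, is the next step.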
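Data from the Illumina NovaSeq and NextSeq platforms are prone to poly-G sequences, because these two platforms use two-colour chemistry in which the absence of signal is called as G. Use fastp to trim the poly-G tails during quality filtering.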
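To make the idea of poly-G tail trimming concrete, here is a minimal Python sketch. It is not how fastp is implemented: `trim_poly_g` is a hypothetical helper, it only removes an exact run of G at the 3' end, and the 10-base minimum is an assumed threshold chosen for illustration.

```python
def trim_poly_g(seq, qual, min_len=10):
    """Remove a trailing run of G (and its qualities) if it is long enough."""
    run = len(seq) - len(seq.rstrip("G"))  # length of the 3' G run
    if run >= min_len:
        keep = len(seq) - run
        return seq[:keep], qual[:keep]
    return seq, qual  # short G runs are left alone, they may be real bases

# Example: a read whose last 12 cycles produced no signal.
seq, qual = trim_poly_g("ACGTACGT" + "G" * 12, "I" * 20)
print(seq, qual)  # ACGTACGT IIIIIIII
```

In practice this step is left to fastp itself, which provides the `--trim_poly_g` option for this purpose; the sketch only shows what the trimming does to a single read.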