Course Address : Bioinformatics, Shandong University
5. Introduction to high-throughput sequencing technology (overview only)
5.1 Application of high-throughput sequencing technology in precision medicine
- Identifying new disease genes (disease gene screening)
- Genomics-based diagnosis and screening of genetic diseases
- Precision treatment
5.2 Challenges facing bioinformatics
- Data scale is huge
- Complex data types
- Immature methodology
- High technical barrier to entry
- Poor reproducibility
● Sequencing bias/errors
- 454 sequencing: the length of homopolymer runs (consecutive identical bases) cannot be determined reliably.
- Illumina: when the number of molecules in a cluster falls short of the ideal, the signal is weak and base identification is inaccurate; when the sequencing reactions within a cluster fall out of sync, the signals conflict and cause base-calling errors; regions with high GC content often have low sequencing coverage, which also introduces bias.
- PacBio: long reads (5-10 kb), but low accuracy.
● Correcting bias: possible solutions
- Deep sequencing
- Statistical evaluation
- Error correction
● Computation speed and memory (RAM)
- Computing on and mining massive data sets has become the main bottleneck
- CPU-intensive jobs (read mapping, metagenomics)
- RAM-intensive jobs (genome assembly)
- Compute clusters (shared computing platforms)
5.3 De novo sequencing
De novo sequencing: the reads obtained are short fragments (hundreds of bp), and repeated sequences are assembled poorly.
5.4 Resequencing (no video)
5.5 Transcriptome sequencing mRNA-seq
5.6 Epigenomics ChIP-seq
5.7 Mammoth Genome Sequencing Project
5.8 Challenges faced by ancient genomics : DNA damage, genome is highly unstable
5.9 Bioinformatics in Paleogenomics Research
- De novo assembly
Find overlapping reads
Merge good pairs of reads into longer contigs
Link contigs to form supercontigs
Generate consensus sequences
- Comparative assembly
Use a reference genome (here, the existing elephant genome) to assemble (lay out) the reads or contigs of the target genome.
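The first two de novo steps above (find overlapping reads, merge them into longer contigs) can be illustrated with a toy greedy merger. This is only a sketch under simplifying assumptions: reads are error-free, all on the same strand, and the function names overlap and greedy_assemble and the min_len parameter are hypothetical. Real assemblers use far more sophisticated methods.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len characters exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining pair overlaps enough; stop merging
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]
    return contigs


# Made-up toy reads, not real data.
reads = ["ATGGCG", "GCGTAC", "TACGGA"]
print(greedy_assemble(reads))  # ['ATGGCGTACGGA']
```

The greedy choice can go wrong on repeated sequences, which is one reason repeats are hard to assemble.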
6. Statistical basis and sequence algorithm (principle)
6.1 Bayesian formula and its biological application
Bayes formula
- In general, the probability of event A given event B is not the same as the probability of event B given event A. However, the two are related in a definite way, and the Bayes formula describes that relationship.
- Let A and B be two events.
The probability of A given that B has occurred is P(A|B) = P(A∩B) / P(B);
similarly, the probability of B given that A has occurred is P(B|A) = P(A∩B) / P(A).
The joint probability of A and B is therefore P(A∩B) = P(A|B) P(B) = P(B|A) P(A).
Dividing both sides of this equation by P(B), provided P(B) is non-zero, gives the Bayes formula: P(A|B) = P(B|A) P(A) / P(B)
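A small numerical sketch may make the formula concrete. Here A is "the person carries a disease variant" and B is "the test is positive". All numbers are invented for illustration, and the function name posterior is hypothetical. P(B) is expanded with the law of total probability, P(B) = P(B|A) P(A) + P(B|not A) P(not A), which is an added assumption about how P(B) is obtained.

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """Return P(A|B) from P(A), P(B|A) and P(B|not A) using the Bayes formula."""
    # Law of total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
    p_b = p_b_given_a * prior_a + p_b_given_not_a * (1.0 - prior_a)
    if p_b == 0:
        raise ValueError("P(B) must be non-zero")
    return p_b_given_a * prior_a / p_b


# Hypothetical values: 1% prevalence, 99% of carriers test positive,
# 5% of non-carriers test positive.
print(posterior(0.01, 0.99, 0.05))  # about 0.167
```

Even with an accurate test, a positive result here corresponds to only about a one-in-six chance of being a carrier, because the prior P(A) is small.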
-
Bayesian formula extension :
Applications of Bayesian formula
Biological Application of Bayesian Formula
- Reference video: Bayesian formula and its application in biology-03 P115
6.2 Sensitivity and specificity of binary predictions
- Sensitivity = TP / (TP + FN), the true positive rate (a sensitive predictor would rather select wrongly than miss a true case)
- Specificity = TN / (TN + FP), the true negative rate (a specific predictor would rather miss a true case than select wrongly)
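The two definitions above translate directly into code. The following is a minimal sketch; the function names and the counts in the example are made up for illustration.

```python
def sensitivity(tp, fn):
    """True positive rate: fraction of real positives that are predicted positive."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negative rate: fraction of real negatives that are predicted negative."""
    return tn / (tn + fp)


# Hypothetical counts: 100 real positives and 100 real negatives.
tp, fn, tn, fp = 90, 10, 80, 20
print(sensitivity(tp, fn))  # 0.9
print(specificity(tn, fp))  # 0.8
```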
Examples of Sensitivity and Specificity in Biology
● Prediction of leucine-rich repeat sequences
- A leucine-rich repeat (LRR) is an amino acid segment found in tens of thousands of known proteins from viruses, prokaryotes and eukaryotes. LRRs often take part in protein-protein or protein-non-protein interactions and play key roles in cell adhesion, signal transduction, platelet aggregation, extracellular matrix assembly, nervous system development, RNA processing, virus invasion and immune response. An LRR is usually repeated end to end several or even dozens of times within one protein molecule, although the copies are not exactly identical.
- LRRs have a characteristic sequence template: LxxLxLxxNxL.
- More than 50,000 individual LRRs were delineated semi-manually from all known Toll-like receptor protein sequences (more than 2,500). Using these LRRs as a standard data set, a prediction model is built that describes the sequence features of an LRR in detail and predicts whether a protein sequence contains LRRs and, if so, where each LRR starts.
- Build the prediction model as a position-specific weight matrix (PSWM): stack the more than 50,000 LRR sequences vertically and compute, for each position of the LRR, the frequency of each amino acid across all sequences. The resulting frequencies are consistent with the characteristic LRR template LxxLxLxxNxL.
- Predict whether a sequence contains an LRR:
Score = the sum, over all positions, of the frequency of the amino acid observed at that position. The higher the score, the more likely the segment is an LRR.
The cutoff score is chosen according to the sensitivity and specificity of the binary prediction.
Try candidate cutoff scores one by one within a certain range and calculate the model's sensitivity and specificity at each (for example, take the crossing point of the sensitivity and specificity curves as the cutoff score).
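The scoring and cutoff scan described above can be sketched in a few lines of Python. This is a toy illustration, not the course's actual model: the three "LRR" sequences and two negative sequences are invented, the function names are hypothetical, and the matrix is evaluated on the same sequences it was built from, which a real study would avoid.

```python
from collections import Counter


def build_pswm(aligned):
    """One dict per position, mapping amino acid -> frequency at that position."""
    n = len(aligned)
    length = len(aligned[0])
    return [
        {aa: count / n for aa, count in Counter(seq[i] for seq in aligned).items()}
        for i in range(length)
    ]


def score(pswm, window):
    """Sum of the frequencies of the observed amino acid at each position."""
    return sum(column.get(aa, 0.0) for column, aa in zip(pswm, window))


def sens_spec(pswm, positives, negatives, cutoff):
    """Sensitivity and specificity when 'score >= cutoff' means 'predicted LRR'."""
    tp = sum(score(pswm, s) >= cutoff for s in positives)
    fn = len(positives) - tp
    tn = sum(score(pswm, s) < cutoff for s in negatives)
    fp = len(negatives) - tn
    return tp / (tp + fn), tn / (tn + fp)


# Invented 11-residue examples that follow the template LxxLxLxxNxL.
positives = ["LAALSLAGNAL", "LKELDLSNNKL", "LTSLNLSYNQL"]
# Invented non-LRR segments of the same length.
negatives = ["MKTAYIAKQRQ", "GSHMASMTGGQ"]

pswm = build_pswm(positives)
for cutoff in (1.0, 3.0, 5.0, 7.0, 9.0):
    print(cutoff, sens_spec(pswm, positives, negatives, cutoff))
```

Raising the cutoff trades sensitivity for specificity, so scanning a range of cutoffs shows where the two curves cross.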
6.3 Basic Sequence Algorithms
- Sequence algorithms: algorithms developed for studying biological sequences, designed to have the lowest possible computational complexity. A typical task is finding repeated sequences within a sequence quickly and accurately.
- Biological sequences: nucleic acid sequences, protein sequences, or other strings of digits or characters derived from biological problems.
Suffix tree
- A suffix is a substring that contains the last character. Append $ after the last character to mark the end.
- The suffix $ is the shortest suffix of sequence S.
- The number of suffixes of a sequence equals the length of the sequence including $.
- Suffix tree : A tree composed of all suffixes contained in a sequence .
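As a small illustration of the suffix definitions above, the following Python sketch lists every suffix of the example string SDSDFSDFG that is used later in this section. The function name suffixes is hypothetical.

```python
def suffixes(s):
    """Return all suffixes of s after appending the end marker '$'."""
    t = s + "$"
    return [t[i:] for i in range(len(t))]


for number, suffix in enumerate(suffixes("SDSDFSDFG"), start=1):
    print(number, suffix)
```

The string has 9 characters, so with $ there are 10 suffixes, from SDSDFSDFG$ (No. 1) down to $ (No. 10).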
● Draw the suffix tree :
- 1. Draw suffix No. 1 first: draw one branch from the root to a leaf and write the suffix along the branch.
- 2. Draw suffix No. 2: check whether a branch already starts with its first letter, D. If not, create a new branch.
- 3. Draw suffix No. 3: a branch starting with SD already exists, so split off a new branch after SD and write the rest of the suffix on it.
- 4. Continue in the same way until all suffixes of the sequence have been drawn.
Functions of the suffix tree
String S=SDSDFSDFG
- Function 1: find whether the string s occurs in the string S (that is, determine whether s is a substring of S).
Method: start from the root of the tree and compare against the characters of s one by one. (The answer needs only as many comparisons as the length of s.)
s1 = DFSD (present!)
s2 = SDFD (present or not?)
- Function 2: find how many times the string s occurs in the string S
Method: start from the root of the tree and locate s as in function 1, then count the leaves below s. The number of leaves is the number of occurrences.
- Function 3: find the longest repeated substring in the string S
Method: collect the substrings spelled out from the root to every internal (non-leaf) node, and take the longest one.
- The role of $: if one suffix is a prefix of another suffix, $ is needed so that the shorter suffix still ends at its own leaf.
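The three functions can be tried out with a simplified structure. The sketch below builds an uncompressed suffix trie (one node per character) rather than a true compressed suffix tree, so it needs quadratic space and is only suitable for short strings; the answers to the three queries are the same. All class and function names are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}  # character -> child Node


def build_suffix_trie(s):
    """Insert every suffix of s + '$' into a trie, one character per edge."""
    t = s + "$"
    root = Node()
    for i in range(len(t)):
        node = root
        for ch in t[i:]:
            node = node.children.setdefault(ch, Node())
    return root


def find(root, query):
    """Walk down from the root along query; return the node reached or None."""
    node = root
    for ch in query:
        node = node.children.get(ch)
        if node is None:
            return None
    return node


def contains(root, query):
    """Function 1: is query a substring of S?"""
    return find(root, query) is not None


def count_leaves(node):
    if not node.children:
        return 1
    return sum(count_leaves(child) for child in node.children.values())


def count_occurrences(root, query):
    """Function 2: number of leaves below query = number of occurrences."""
    node = find(root, query)
    return 0 if node is None else count_leaves(node)


def longest_repeat(root):
    """Function 3: deepest node with at least two children (a branching point)."""
    best = ""
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        if len(node.children) >= 2 and len(path) > len(best):
            best = path
        for ch, child in node.children.items():
            stack.append((child, path + ch))
    return best


root = build_suffix_trie("SDSDFSDFG")
print(contains(root, "DFSD"), contains(root, "SDFD"))
print(count_occurrences(root, "SD"))
print(longest_repeat(root))
```

In the trie, a node with two or more children plays the role of an internal node of the suffix tree, which is why function 3 looks only at branching nodes.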
Highest-scoring subsequence
● Shortest principle: when several subsequences share the highest score and one is completely contained in another, only the contained (shorter) one is returned. For example, the sequence in the figure below has 2 highest-scoring subsequences.
- Biological applications:
(1) Predicting transmembrane regions (hydrophobic segments) of protein sequences. The character sequence is converted into a sequence of real numbers according to how hydrophilic or hydrophobic each amino acid is: hydrophobic amino acids map to [0,5] and hydrophilic amino acids to [-5,0].
(2) Predicting GC-rich regions in DNA sequences, for example when looking for CpG islands.
- Naive algorithm: following its definition directly, computing the score f(i,j) for every pair of positions i and j takes on the order of n^3 steps.
◆ As a rule, the computational complexity of an algorithm must be no worse than about n^2 before it can be used in practice. Otherwise, as n grows, the amount of computation exceeds the available computing capacity and any acceptable running time. The naive algorithm is therefore not usable for the highest-scoring subsequence problem.
- More efficient algorithms:
Dynamic programming algorithm, total operation steps: O(n^2)
Divide and conquer method, total operation steps: O(n log n)
Smart algorithm, total operation steps: O(n)
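To make the complexity gap concrete, the sketch below compares a cubic brute-force search with a linear-time scan on a short list of scores. The linear version is the standard running-sum technique (often called Kadane's algorithm); whether it is the same "smart algorithm" the course has in mind is an assumption. The function names and the example scores are hypothetical, and ties are broken by whichever segment is found first rather than by a full implementation of the shortest principle.

```python
def max_subseq_naive(x):
    """Try every segment x[i..j] and sum it from scratch: about n^3 steps."""
    best = (x[0], 0, 0)  # (score, start index, end index), inclusive
    n = len(x)
    for i in range(n):
        for j in range(i, n):
            total = sum(x[i:j + 1])
            if total > best[0]:
                best = (total, i, j)
    return best


def max_subseq_linear(x):
    """Single pass keeping the best segment that ends at the current position."""
    best_sum, best_i, best_j = x[0], 0, 0
    cur_sum, cur_i = 0, 0
    for j, value in enumerate(x):
        if cur_sum <= 0:
            # A non-positive running sum cannot help; start a new segment here.
            cur_sum, cur_i = value, j
        else:
            cur_sum += value
        if cur_sum > best_sum:
            best_sum, best_i, best_j = cur_sum, cur_i, j
    return best_sum, best_i, best_j


# Made-up scores, e.g. positive for hydrophobic residues, negative for hydrophilic.
scores = [2, -3, 4, -1, 2, -5, 3]
print(max_subseq_naive(scores))   # (5, 2, 4)
print(max_subseq_linear(scores))  # (5, 2, 4)
```

Both functions report the segment from index 2 to index 4 (values 4, -1, 2) with score 5, but the linear scan touches each element only once.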