[Study Notes] Shandong University Bioinformatics-05 Introduction to High-throughput Sequencing Technology + 06 Statistical Basis and Sequence Algorithm (Principle)

Course address: Bioinformatics, Shandong University


5. Introduction to high-throughput sequencing technology (overview only, no in-depth material)

5.1 Application of high-throughput sequencing technology in precision medicine

  1. Screening for new disease genes
  2. Genomics-based diagnosis and screening of genetic diseases
  3. Precision treatment

5.2 Challenges facing bioinformatics

  • Huge data scale
  • Complex data types
  • Immature methodology
  • High technical threshold
  • Poor reproducibility

Sequencing bias/errors

  • 454 sequencing: the length of homopolymer runs (the same base repeated consecutively) is read with uncertainty.

  • Illumina: when the number of clusters falls short of the ideal, the signal is weak and base identification is inaccurate; when the sequencing reactions within a cluster are not synchronized, the signals conflict and cause base-calling errors; regions of high GC content often have low sequencing coverage, which also leads to sequencing bias.

  • PacBio: long-read sequencing (reads of 5k-10k bp), but with low accuracy.

● Possible solutions for correcting bias

  1. Deep sequencing
  2. Statistical evaluation
  3. Error correction

● Computation speed and memory (RAM)

  • The calculation and mining of massive data has become the main bottleneck
  • CPU-intensive jobs (read mapping, metagenomics)
  • RAM-intensive jobs (genome assembly)
  • Computer cluster (public computing platform)

5.3 De novo sequencing
De novo sequencing: the sequences obtained are short fragments (hundreds of bp), and repetitive regions are difficult to assemble.
5.4 Resequencing (no video)

5.5 Transcriptome sequencing mRNA-seq

5.6 Epigenomics ChIP-seq

5.7 Mammoth Genome Sequencing Project

5.8 Challenges in paleogenomics: DNA damage; the genome is highly unstable

5.9 Bioinformatics in Paleogenomics Research

  • De novo assembly
    Find overlapping reads
    Merge good pairs of reads into longer contigs
    Link contigs to form supercontigs
    Generate consensus sequences
  • Comparative assembly
    Use a reference genome (here, the existing elephant genome) to lay out and assemble the reads (or contigs) of the target genome.
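
The de novo steps above (find overlapping reads, merge them into longer contigs) can be sketched with a toy greedy assembler. This is only an illustration under strong assumptions: the reads are error-free, come from one strand, and contain no repeats. The function names `overlap` and `greedy_assemble`, the `min_len` parameter and the reads themselves are made up for this example; real assemblers build overlap or de Bruijn graphs and then link contigs and call a consensus.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(reads)
    while True:
        best_k, best_i, best_j = 0, None, None
        # Step 1: find the best-overlapping ordered pair
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left to merge
        # Step 2: merge that pair into one longer contig
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)


# Made-up reads for illustration only
reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
print(greedy_assemble(reads))  # ['ATGGCGTACCTGA']
```

With repeats in the genome, the greedy choice can merge the wrong pair, which is one reason repetitive regions assemble poorly.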

6. Statistical basis and sequence algorithm (principle)

6.1 Bayesian formula and its biological application

Bayes formula

  • In general, the probability of event A given that event B has occurred is not the same as the probability of event B given that event A has occurred. There is, however, a definite relationship between the two, and the Bayes formula states that relationship between the two conditional probabilities.

  • Let A and B be two events.
    The probability of event A given that event B has occurred is P(A|B) = P(A∩B)/P(B);
    similarly, the probability of event B given that event A has occurred is P(B|A) = P(A∩B)/P(A).
    The joint probability of A and B is therefore P(A∩B) = P(A|B)P(B) = P(B|A)P(A).
    Dividing both sides by P(B), provided P(B) is non-zero, gives the Bayes formula: P(A|B) = P(B|A)P(A)/P(B)

  • Extension of the Bayes formula
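
A common way to extend the formula, and presumably the one meant here, uses the law of total probability to expand P(B). The symbols A_1, ..., A_n are an assumption of this sketch: they stand for events that are mutually exclusive and together cover every possible outcome.

```latex
P(A_i \mid B) = \frac{P(B \mid A_i)\,P(A_i)}{\sum_{j=1}^{n} P(B \mid A_j)\,P(A_j)}
```

With n = 2, A_1 = A and A_2 = "not A", the denominator is just P(B), which recovers P(A|B) = P(B|A)P(A)/P(B).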

Applications of Bayesian formula

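As one hypothetical illustration of how the formula is applied, consider a diagnostic test: A is "the person has the disease" and B is "the test is positive". The numbers below are invented for the example, and `posterior` is just an illustrative helper name.

```python
def posterior(prior, p_pos_given_disease, p_pos_given_healthy):
    """P(disease | positive) from the Bayes formula.

    prior                = P(disease)
    p_pos_given_disease  = P(positive | disease)
    p_pos_given_healthy  = P(positive | no disease)
    """
    # P(positive), expanded over the two cases "disease" and "no disease"
    p_positive = p_pos_given_disease * prior + p_pos_given_healthy * (1.0 - prior)
    return p_pos_given_disease * prior / p_positive


# Hypothetical numbers, chosen only for illustration
p = posterior(prior=0.01, p_pos_given_disease=0.99, p_pos_given_healthy=0.05)
print(f"P(disease | positive) = {p:.3f}")  # P(disease | positive) = 0.167
```

Even with a test that detects 99% of true cases, a positive result here means only about a 1 in 6 chance of disease, because the disease is rare: P(A|B) and P(B|A) can differ widely.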

Biological Application of Bayesian Formula

6.2 Sensitivity and specificity of binary predictions

  • Sensitivity = TP / (TP + FN): the true positive rate (favours catching every true case, even at the cost of some wrong calls)
  • Specificity = TN / (TN + FP): the true negative rate (favours avoiding wrong calls, even at the cost of missing some true cases)
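
A minimal sketch of the two definitions, assuming labels are encoded as 1 (positive) and 0 (negative). The function name and the example labels are made up for illustration.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels 1/0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    return tp / (tp + fn), tn / (tn + fp)


# Made-up labels: 4 real positives, 6 real negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")  # sensitivity=0.75 specificity=0.83
```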

Examples of Sensitivity and Specificity in Biology

Prediction of leucine-rich repeat sequences

  • A leucine-rich repeat (LRR) is an amino acid segment found in tens of thousands of known proteins from viruses, prokaryotes and eukaryotes. It often takes part in protein-protein or protein-non-protein interactions and plays a key role in cell adhesion, signal transduction, platelet aggregation, extracellular matrix assembly, nervous system development, RNA processing, virus invasion and immune response. Within one protein molecule it is often repeated end to end several times or even dozens of times, although the repeated sequence is not exactly the same each time.

  • The LRR has a characteristic sequence template: LxxLxLxxNxL.

  • More than 50,000 individual LRRs were delineated precisely, by a semi-manual procedure, from all known Toll-like receptor protein sequences (>2,500). With these LRRs as the standard data set, a prediction model is built that describes the sequence features of an LRR in detail and predicts whether a protein sequence contains LRRs and, if so, where each LRR starts.

  • The prediction model is a position-specific weight matrix (PSWM): stack the more than 50,000 LRR sequences one above another and, for each position of the LRR, compute how often each amino acid occurs across all sequences. The resulting frequencies agree with the LRR characteristic sequence template LxxLxLxxNxL.

  • Predicting whether a sequence contains an LRR:
    Score = the sum, over all positions, of the frequency of the amino acid observed at that position. The higher the score, the more likely the segment is an LRR.
    The cutoff score is chosen using the sensitivity and specificity of the binary prediction.
    Try cutoff scores one by one within a certain range and compute the sensitivity and specificity of the model at each (for example, take the intersection of the sensitivity and specificity curves as the cutoff score).
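
The matrix, the score and the cutoff scan described above can be sketched as follows. Everything concrete here is an assumption for illustration: the 4-residue toy motifs stand in for the real 11-residue LRR segments, the positive and negative test sets are invented, and `build_pswm`, `score` and `scan_cutoffs` are hypothetical helper names rather than the course's actual code.

```python
from collections import Counter


def build_pswm(motifs):
    """Per-position amino acid frequencies from aligned, equal-length motifs."""
    length = len(motifs[0])
    pswm = []
    for pos in range(length):
        counts = Counter(m[pos] for m in motifs)
        pswm.append({aa: c / len(motifs) for aa, c in counts.items()})
    return pswm


def score(pswm, segment):
    """Sum of the frequencies of the residue observed at each position."""
    return sum(col.get(aa, 0.0) for col, aa in zip(pswm, segment))


def scan_cutoffs(pswm, positives, negatives, cutoffs):
    """Sensitivity and specificity of the rule 'score >= cutoff' for each cutoff."""
    rows = []
    for c in cutoffs:
        tp = sum(1 for s in positives if score(pswm, s) >= c)
        tn = sum(1 for s in negatives if score(pswm, s) < c)
        rows.append((c, tp / len(positives), tn / len(negatives)))
    return rows


# Toy, made-up training motifs (not real LRR data)
training = ["LAAL", "LKEL", "LSDL", "IAAL"]
pswm = build_pswm(training)

positives = ["LAEL", "LKDL", "IKAL"]  # invented "true" motifs
negatives = ["GAAG", "KLLK", "LAAG"]  # invented non-motifs
for cutoff, sens, spec in scan_cutoffs(pswm, positives, negatives, [1.0, 2.0, 3.0]):
    print(f"cutoff={cutoff:.1f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

A low cutoff keeps sensitivity high but lets negatives through; a high cutoff does the opposite. The cutoff is picked where the two are balanced.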

6.3 Basic Sequence Algorithms

  • Sequence algorithms: algorithms, with computational complexity as low as possible, developed for the study of biological sequences. For example, how to find repeated sequences in a sequence quickly and accurately.
  • Biological sequence: a nucleic acid sequence, a protein sequence, or any other numeric or character string derived from a biological problem.

Suffix tree

  • A suffix is a substring that ends with the last character of the sequence. A terminator $ is appended after the last character to mark the end.
  • The suffix $ is the shortest suffix of sequence S.
  • The number of suffixes of a sequence equals the length of the sequence including $.
  • Suffix tree: a tree built from all the suffixes of a sequence.
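
As a small sketch of these definitions, the suffixes of the string SDSDFSDFG (the example string used later in these notes) can be listed directly. The helper name `suffixes` is made up for this example.

```python
def suffixes(s, terminator="$"):
    """All suffixes of s with the terminator appended, longest first."""
    text = s + terminator
    return [text[i:] for i in range(len(text))]


for number, suffix in enumerate(suffixes("SDSDFSDFG"), start=1):
    print(number, suffix)
```

The list has 10 entries for the 9-letter string plus $, and the last entry is the suffix $ alone.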

Drawing the suffix tree:

  • 1. Draw suffix No. 1 first: draw a branch from the root to a leaf and label the branch with the suffix.
  • 2. Draw suffix No. 2: check whether a branch already starts with D, the first letter of suffix No. 2; if not, create a new branch.
  • 3. Draw suffix No. 3: a branch starting with SD already exists, so follow it, then branch off to write the rest of the suffix.
  • 4. Continue in the same way until all suffixes of the sequence are drawn.
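
The drawing procedure above maps onto a simple insertion loop. The sketch below is a simplification and not an efficient construction: it stores one character per edge (a suffix trie, i.e. an uncompressed suffix tree) in nested dictionaries and takes on the order of n^2 steps. The names `build_suffix_trie` and `count_leaves` are assumptions of this example.

```python
def build_suffix_trie(s, terminator="$"):
    """Insert every suffix of s + terminator into a nested-dict trie."""
    text = s + terminator
    root = {}
    for start in range(len(text)):      # suffix No. 1, No. 2, ...
        node = root
        for ch in text[start:]:
            # Follow the existing branch if there is one, otherwise create it
            node = node.setdefault(ch, {})
    return root


def count_leaves(node):
    """A leaf is a node without children; every suffix ends in its own leaf."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


trie = build_suffix_trie("SDSDFSDFG")
print(sorted(trie))        # ['$', 'D', 'F', 'G', 'S'], the branches leaving the root
print(count_leaves(trie))  # 10, one leaf per suffix
```

A true suffix tree merges each chain of single-child nodes into one labelled edge, which is what the hand-drawn version does when a whole suffix is written along one branch.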

Functions of the suffix tree

String S = SDSDFSDFG

  • Function 1: Determine whether string s occurs in string S (that is, whether s is a substring of S).
    Method: starting from the root, compare against the characters of s one by one. (The number of comparisons needed is only the length of s.)
    s1 = DFSD (in S)
    s2 = SDFD (in S or not?)

  • Function 2: Count how many times string s occurs in string S.
    Method: starting from the root, locate s as in function 1, then count the leaves below the point where s ends; that number is the number of occurrences.

  • Function 3: Find the longest repeated substring of string S.
    Method: collect the strings spelled from the root to every internal (non-leaf) node and take the longest.

  • The role of $: if one suffix is a prefix of another suffix, $ is needed so that it still ends in a leaf of its own.
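
The three functions can be tried out on S = SDSDFSDFG with a small sketch. As an assumption of this example, the tree is again stored uncompressed (one character per edge), and each node records how many leaves lie below it; `Node`, `build`, `find`, `contains`, `count` and `longest_repeat` are illustrative names.

```python
class Node:
    def __init__(self):
        self.children = {}
        self.leaves = 0  # number of leaves (suffixes) below this node


def build(s, terminator="$"):
    """Insert all suffixes of s + terminator, counting suffixes per node."""
    text = s + terminator
    root = Node()
    for start in range(len(text)):
        node = root
        node.leaves += 1
        for ch in text[start:]:
            node = node.children.setdefault(ch, Node())
            node.leaves += 1
    return root


def find(root, query):
    """Walk down from the root along query; return the node reached or None."""
    node = root
    for ch in query:
        node = node.children.get(ch)
        if node is None:
            return None
    return node


def contains(root, query):   # function 1
    return find(root, query) is not None


def count(root, query):      # function 2
    node = find(root, query)
    return node.leaves if node is not None else 0


def longest_repeat(root):    # function 3
    """Longest path from the root to a node with at least two leaves below it."""
    best = ""
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        if len(path) > len(best):
            best = path
        for ch, child in node.children.items():
            if child.leaves >= 2:
                stack.append((child, path + ch))
    return best


tree = build("SDSDFSDFG")
print(contains(tree, "DFSD"), contains(tree, "SDFD"))  # True False
print(count(tree, "SD"))                               # 3
print(longest_repeat(tree))                            # SDF
```

Each query touches at most len(s) nodes, independent of the length of S, which is the point of function 1's remark about comparisons.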

Highest-scoring subsequence

Shortest principle: when several subsequences share the highest score and one is completely contained in another, only the contained (shorter) one is returned.
[Figure: example sequence with 2 highest-scoring subsequences]

  • Biological applications:
    (1) Predicting transmembrane regions (hydrophobic segments) of protein sequences. Each amino acid is converted to a real number according to how hydrophilic or hydrophobic it is, turning the character string into a sequence of real numbers: hydrophobic amino acids score in [0, 5], hydrophilic amino acids in [-5, 0].
    (2) Predicting GC-rich regions in DNA sequences, for example when looking for CpG islands.
  • Naive algorithm: by its construction, computing f(i, j) for all segments takes on the order of n^3 steps.
    ◆ As a rule, an algorithm's computational complexity must be at most on the order of n^2 for it to be usable in practice. Otherwise, as n grows, the amount of computation exceeds the available computing capacity and any acceptable running time. The naive algorithm is therefore not practical for the highest-scoring subsequence problem.
  • More efficient algorithms:
    Dynamic programming algorithm: O(n^2) total steps
    Divide and conquer: O(n log n) total steps
    Smart algorithm: O(n) total steps
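
To make the gap between n^3 and n concrete, the sketch below compares a naive search with a standard single-pass method (Kadane's algorithm). It is not necessarily the "smart algorithm" from the lecture. The scoring scheme (G/C = +1, A/T = -1), the DNA string and the function names are assumptions made for this example, loosely following application (2).

```python
def naive_best_segment(scores):
    """Try every segment (i, j) and sum it from scratch: about n^3 steps."""
    best = (scores[0], 0, 0)
    n = len(scores)
    for i in range(n):
        for j in range(i, n):
            total = sum(scores[i:j + 1])
            if total > best[0]:
                best = (total, i, j)
    return best


def linear_best_segment(scores):
    """Single pass over the scores (Kadane's algorithm): O(n) steps."""
    best = (scores[0], 0, 0)
    running, start = 0, 0
    for j, x in enumerate(scores):
        if running <= 0:
            # A non-positive running total cannot help; restart at j
            running, start = x, j
        else:
            running += x
        if running > best[0]:
            best = (running, start, j)
    return best


def gc_scores(dna, gc=1, at=-1):
    """Turn a DNA string into numbers: G/C score gc, everything else scores at."""
    return [gc if base in "GC" else at for base in dna.upper()]


dna = "ATATGCGCGGATAT"  # made-up sequence
scores = gc_scores(dna)
total, i, j = linear_best_segment(scores)
print(total, dna[i:j + 1])  # 6 GCGCGG
```

This sketch returns a single best segment (total, start, end) and does not enumerate all highest-scoring subsequences. Restarting when the running total is not positive does keep the reported segment free of zero-score padding at its start, in the spirit of the shortest principle.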


Origin blog.csdn.net/zea408497299/article/details/125206786