[Study Notes] Shandong University Bioinformatics-02 Sequence Comparison

Course Address : Bioinformatics, Shandong University


2. Sequence comparison

2.1 Understanding sequence

A sequence is a string of characters.

FASTA format:
First line: a greater-than sign (>) followed by the name or other comments.
From the second line on: the sequence itself, usually 60 letters per line (80 is also common; the width is not fixed).
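
The FASTA layout described above can be read with a few lines of Python. This is only a minimal sketch: the function name `parse_fasta` and the example records are made up for illustration, and it assumes plain multi-record FASTA text with no other conventions.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Sequence lines may be 60, 80 or any other width; just join them.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical example input (not from the course material).
example = """>seq1 hypothetical example record
ACGTACGTAC
GTAC
>seq2
TTGA
"""
records = parse_fasta(example)
```

Because the sequence lines are simply concatenated, the line width (60, 80 or anything else) does not matter to the parser.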

2.2 Sequence similarity

  • Similar sequence → similar structure → similar function

  • This makes it possible to predict the structure and function of a protein whose structure and function are unknown.

  • Sequence identity and similarity:

    Identity: if two sequences have the same length, their identity is the number of identical residues at corresponding positions, as a percentage of the total length.

    Similarity: if two sequences have the same length, their similarity is the number of identical or similar residues at corresponding positions, as a percentage of the total length.

    How similar two residues are to each other is quantified by a substitution scoring matrix.

2.3 Substitution scoring matrix

Substitution scoring matrix (substitution matrix): a matrix that reflects the rate at which residues substitute for one another; it describes quantitatively how similar residues are to each other. There are DNA substitution scoring matrices and protein substitution scoring matrices.

Three common substitution scoring matrices for DNA sequences

  • Identity matrix (unitary matrix): the simplest substitution scoring matrix. The match score between identical nucleotides is 1, and the substitution score between different nucleotides is 0. Because it contains no physical or chemical information about the bases and does not treat different substitutions differently, it is rarely used in real sequence comparison.
  • Transition-transversion matrix: the bases of nucleic acids fall into two classes according to their ring structure. Purines (A/G) have two rings; pyrimidines (C/T) have only one ring. A substitution that keeps the number of rings unchanged is a transition; a substitution that changes the number of rings is a transversion. During evolution, transitions occur much more frequently than transversions. To reflect this, the score for a transition in this matrix is usually -1, and the score for a transversion is -5.
  • BLAST matrix: a large number of real comparisons have shown that alignments work better when two identical nucleotides score +5 and two different nucleotides score -4. This matrix is widely used for DNA sequence comparison.
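
The three DNA scoring schemes can be written as small scoring functions. This is a sketch under stated assumptions: the function names are made up, and the match score of 1 in the transition-transversion scheme is an assumption, since the notes only give the two substitution scores.

```python
PURINES = {"A", "G"}      # two rings
PYRIMIDINES = {"C", "T"}  # one ring


def identity_score(a, b):
    """Identity matrix: 1 for identical nucleotides, 0 otherwise."""
    return 1 if a == b else 0


def transition_transversion_score(a, b, match=1):
    """Transition scores -1, transversion scores -5.

    The match score is an assumption (not stated in the notes).
    """
    if a == b:
        return match
    same_ring_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return -1 if same_ring_class else -5


def blast_score(a, b):
    """BLAST matrix: +5 for identical nucleotides, -4 otherwise."""
    return 5 if a == b else -4


def score_ungapped(s, t, score_fn):
    """Sum the per-position scores of two equal-length sequences (no gaps)."""
    if len(s) != len(t):
        raise ValueError("sequences must have the same length")
    return sum(score_fn(a, b) for a, b in zip(s, t))
```

For instance, `score_ungapped("ACGT", "ACGA", blast_score)` adds three matches and one mismatch: 5 + 5 + 5 - 4.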

Three common substitution scoring matrices for protein sequences

  • Identity matrix (unitary matrix): as with the DNA identity matrix, the match score between identical amino acids is 1 and the substitution score between different amino acids is 0. It is rarely used in real sequence alignment.

  • PAM matrix (Dayhoff mutation data matrix): the PAM matrix is based on evolutionary principles. If substitutions between two amino acids are frequent, nature readily accepts this substitution, so this pair of amino acids should receive a high score. The PAM matrix is one of the most widely used scoring schemes in protein sequence comparison. The basic PAM-1 matrix reflects the amount of evolution that produces one mutation per hundred amino acids (obtained by statistical methods). Multiplying PAM-1 by itself n times gives PAM-n, which corresponds to more accumulated mutations. (Choose a suitable PAM matrix according to how closely related the sequences are: the more distant the relationship, that is, the more mutations, the larger n; otherwise the smaller n.)
    PAM-250 matrix: the values on the diagonal are the scores for matching amino acids; elsewhere, a score ≥ 0 means that the corresponding pair of amino acids is similar.

  • BLOSUM matrix (blocks substitution matrix): the BLOSUM matrices derive their entries directly from alignments of distantly related sequences. The PAM-1 matrix is calculated from alignments of highly similar sequences (>85%), and matrices for long evolutionary distances, such as PAM-250, are obtained by multiplying PAM-1 by itself. In other words, the BLOSUM values are generated from real data, while the PAM values are extrapolated by matrix self-multiplication. Like the PAM matrices, BLOSUM matrices carry different numbers: BLOSUM-80 means the matrix is calculated from sequences with identity ≥ 80%, and likewise BLOSUM-62 means the matrix is calculated from sequences with identity ≥ 62%.
    BLOSUM-62: the values on the diagonal are the scores for matching amino acids; elsewhere, a score ≥ 0 means that the corresponding pair of amino acids is similar.
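
The idea that PAM-n comes from multiplying PAM-1 by itself can be illustrated with a toy matrix. This is only a sketch: `TOY_PAM1` is an invented 3-residue mutation probability matrix (the real PAM-1 is 20 × 20 and is estimated from data), and the published PAM scoring tables are log-odds scores derived from such probabilities, a step that is not shown here.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """Multiply matrix m by itself n times (n = 0 gives the identity matrix)."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Invented toy "PAM-1": entry [i][j] is the probability that residue i is
# replaced by residue j after one unit of evolution; each row sums to 1.
TOY_PAM1 = [
    [0.990, 0.008, 0.002],
    [0.008, 0.990, 0.002],
    [0.002, 0.002, 0.996],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)
```

After 250 steps the diagonal entries (residue unchanged) are clearly smaller and the off-diagonal entries clearly larger than in `TOY_PAM1`, which is the sense in which a larger n represents more accumulated mutations.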

Q1: Choose PAM-1 or PAM-250?

  • It depends on how closely the sequences are related: a small n such as PAM-1 for closely related sequences, a large n such as PAM-250 for distantly related ones.

Q2: Choose PAM-? or BLOSUM-?

  • For comparisons between distantly related sequences, PAM-250 is obtained by extrapolation, so its accuracy is limited, and BLOSUM-45 has the advantage.
  • For comparisons between closely related sequences, alignments made with a PAM or a BLOSUM matrix differ very little.
  • Most commonly used: BLOSUM-62

★ Two other types of substitution scoring matrices for protein sequence alignment

  • Genetic code matrix (GCM): obtained by counting the number of codon base changes required to convert one amino acid into another; each value in the matrix is the corresponding cost.
    ◆ If changing one base turns a codon of one amino acid into a codon of another amino acid, the replacement cost for these two amino acids is 1;
    ◆ If two bases need to be changed, the replacement cost is 2;
    ◆ For example, going from Met to Tyr requires all three bases of the codon to change, so the cost is 3.
    ◆ The genetic code matrix is often used to calculate evolutionary distance. Its advantage is that the result can be used directly to draw an evolutionary tree, but it is rarely used for protein sequence alignment (especially for protein sequences with low similarity).

  • Hydrophobicity matrix: a scoring matrix based on the change in hydrophobicity caused by an amino acid substitution. If a substitution changes the hydrophobicity only slightly, its score is high; otherwise its score is low.
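
The genetic code matrix values can be recomputed from the standard codon table. The sketch below takes the cost of a pair of amino acids to be the smallest number of base differences between any codon of one and any codon of the other, which is my reading of the rule above; the function names are made up for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (DNA alphabet).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codons_for(amino_acid):
    """All codons that encode the given one-letter amino acid."""
    return [codon for codon, aa in CODON_TABLE.items() if aa == amino_acid]


def replacement_cost(aa1, aa2):
    """Minimum number of base changes needed to turn a codon of aa1 into a codon of aa2."""
    return min(
        sum(x != y for x, y in zip(c1, c2))
        for c1 in codons_for(aa1)
        for c2 in codons_for(aa2)
    )
```

With this table, `replacement_cost("M", "Y")` is 3 (ATG against TAT or TAC), matching the Met to Tyr example, while `replacement_cost("F", "L")` is 1 (TTT against TTA).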

2.4 Pairwise comparison of sequences: dot method

  • Dot plot method: place a dot wherever the residues of the two sequences are identical.
  • Continuous diagonal lines, and lines parallel to the diagonal, represent regions that are the same in both sequences.
  • A sequence can also be compared against itself, which reveals repeated segments within the sequence. Such a dot matrix is necessarily symmetric and has a main diagonal. Read along the horizontal or vertical axis, the segments corresponding to the short lines parallel to the main diagonal are the repeated parts; the number of parallel lines, including the main diagonal, is the number of repeats.
  • Finding tandem repeats:
    Example: Seq1: FASABCABCABCTHE
    Number of repeats: within one half of the matrix (on one side of the diagonal), the number of equidistant parallel lines, including the main diagonal.
    Repeat unit: the sequence corresponding to the shortest parallel line.
    A short tandem repeat (STR), also called microsatellite DNA, is a kind of DNA tandem repeat that is widespread in eukaryotic genomes. It consists of a core sequence of 2-6 bp, usually repeated 15-30 times. STRs are highly polymorphic, that is, the number of repeats differs between individuals, and this difference generally follows Mendelian co-dominant inheritance, so STRs are widely used in forensic individual identification and paternity testing.
  • Dotlet, an online dot plot tool: Dotlet requires Java to be installed.
    See the video for details: Pairwise comparison of sequences: dot method-02 P34
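
The self-comparison rule for tandem repeats can be checked on Seq1 with a short script. This is a sketch with made-up function names; it records, for every diagonal on one side of the matrix, the longest unbroken run of dots, and keeps only runs of at least `min_run` dots (a threshold chosen here to ignore isolated chance matches).

```python
def dot_matrix(s, t):
    """Boolean matrix with True wherever s[i] == t[j]."""
    return [[a == b for b in t] for a in s]


def diagonal_runs(seq, min_run=3):
    """Longest run of dots on each diagonal of the self-comparison.

    Offset 0 is the main diagonal; offset d compares seq[i] with seq[i + d].
    Only diagonals whose longest run has at least min_run dots are returned.
    """
    runs = {}
    n = len(seq)
    for offset in range(n):
        best = current = 0
        for i in range(n - offset):
            if seq[i] == seq[i + offset]:
                current += 1
                best = max(best, current)
            else:
                current = 0
        if best >= min_run:
            runs[offset] = best
    return runs


seq1 = "FASABCABCABCTHE"
runs = diagonal_runs(seq1)  # {0: 15, 3: 6, 6: 3}
```

For Seq1 there are three lines (offsets 0, 3 and 6), so the repeat count is 3, and the shortest line has length 3, the length of the repeat unit ABC.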

2.5 Pairwise comparison of sequences: sequence alignment method (quantitative)

  • Sequence alignment (alignment): a specific algorithm is used to find the arrangement of residues and inserted gaps that produces the maximum similarity score between two or more sequences.
  • Aligning sequences s and t: write the two strings s and t one above the other, insert gaps (spaces) at certain positions, and compare the characters at each position in turn, in order to find the arrangement and gap placement that gives the two sequences the maximum similarity score.
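
One standard way to find such a maximum-scoring arrangement is dynamic programming (the Needleman-Wunsch method). The notes do not spell out the algorithm, so the sketch below is only an illustration: the simple scores (match +1, mismatch -1, gap -2) are assumed values, and a single linear gap penalty is used instead of a substitution matrix.

```python
def global_align(s, t, match=1, mismatch=-1, gap=-2):
    """Global alignment by dynamic programming; returns (score, aligned_s, aligned_t)."""
    n, m = len(s), len(t)
    # score[i][j] = best score for aligning s[:i] with t[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
            match if s[i - 1] == t[j - 1] else mismatch
        ):
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return score[n][m], "".join(reversed(a)), "".join(reversed(b))
```

Aligning "ACGT" with "AGT" under these scores needs one gap, giving three matches and one gap penalty.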

Pairwise Alignment and Algorithms

2.6 Consistency and similarity

  • If the two sequences have the same length:
    Identity = (number of identical characters / global alignment length) × 100%
    Similarity = (number of identical and similar characters / global alignment length) × 100%
  • If the two sequences have different lengths:
    Identity = (number of identical characters / global alignment length) × 100%
    Similarity = (number of identical and similar characters / global alignment length) × 100%
  • Whether or not the two sequences have the same length, a global alignment of the two sequences must be performed first; identity and similarity are then calculated from the alignment result and the alignment length.
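
Given two already aligned (gapped) strings, the two formulas reduce to simple counting. In this sketch the set `SIMILAR_PAIRS` is a small hypothetical stand-in; in practice "similar" would be read off a substitution matrix such as BLOSUM-62 (score ≥ 0), as described in section 2.3.

```python
# Hypothetical set of "similar" residue pairs, for illustration only.
SIMILAR_PAIRS = {frozenset("IL"), frozenset("DE"), frozenset("KR"), frozenset("ST")}


def identity_and_similarity(aligned_a, aligned_b, similar_pairs=SIMILAR_PAIRS):
    """Return (identity %, similarity %) for two rows of a global alignment.

    Both rows must have the same length; '-' marks a gap. The denominator is
    the global alignment length, gaps included.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    identical = similar = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue  # a gap is neither identical nor similar
        if x == y:
            identical += 1
        elif frozenset((x, y)) in similar_pairs:
            similar += 1
    length = len(aligned_a)
    return 100.0 * identical / length, 100.0 * (identical + similar) / length


identity, similarity = identity_and_similarity("MKIDS-AG", "MRLDSTAG")
```

In this made-up alignment of length 8 there are 5 identical columns and 2 similar ones, so identity is 62.5% and similarity is 87.5%.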

2.7 Online pairwise sequence alignment tools

EMBL global pairwise sequence alignment tool

  • For details, see the video: Online Pair Sequence Alignment Tool-01 P40
    For details, see the video: Online Pair Sequence Alignment Tool-02 Gap Type and Score Setting P41

  • EMBL → Global Alignment → Needle → input/upload the 2 sequences to be aligned

  • Parameter settings (More options):

    • MATRIX: BLOSUM-62 is selected by default; otherwise choose according to how closely the sequences are related.
    • GAP OPEN: the penalty when a gap is first opened; by default it is larger than the GAP EXTEND penalty.
    • GAP EXTEND: the penalty for each further position in a run of consecutive gaps (every position except the first); by default it is smaller than the GAP OPEN penalty.
    • When the GAP OPEN penalty is larger than the GAP EXTEND penalty, gaps are concentrated: opening a gap is expensive, but long continuous gaps are encouraged.
      Case: the two sequences are known to be mostly similar, but a functional region of one sequence is missing in the other. To locate the missing functional region by sequence comparison, choose concentrated gaps.
    • When the GAP OPEN penalty is smaller than the GAP EXTEND penalty, gaps are scattered: continuous gaps are expensive, so short gaps are encouraged.
      Case: when comparing homologous sequences that are known to be very similar, with similar structures and functions, choose scattered gaps.
    • If you have no particular expectation about the result, just keep the default parameters.
    • END GAP PENALTY: whether gaps at the ends of the alignment are penalized; the default is false.
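
The effect of the two gap parameters can be seen by costing the same six gap positions either as one long gap or as six separate gaps. This is a sketch of the rule as stated above (GAP OPEN for the first position of a gap, GAP EXTEND for each further position); the numeric penalties are illustrative values chosen for this example, not taken from the notes.

```python
def gap_cost(length, gap_open, gap_extend):
    """Penalty for one gap of the given length: open once, extend length - 1 times."""
    if length <= 0:
        return 0.0
    return gap_open + (length - 1) * gap_extend


def total_gap_cost(gap_lengths, gap_open, gap_extend):
    """Total penalty for several separate gaps."""
    return sum(gap_cost(n, gap_open, gap_extend) for n in gap_lengths)


# GAP OPEN > GAP EXTEND: one long gap is much cheaper than six short ones.
concentrated = total_gap_cost([6], gap_open=10, gap_extend=0.5)      # 12.5
scattered = total_gap_cost([1] * 6, gap_open=10, gap_extend=0.5)     # 60

# GAP OPEN < GAP EXTEND: six short gaps are cheaper than one long gap.
concentrated_rev = total_gap_cost([6], gap_open=1, gap_extend=5)     # 26
scattered_rev = total_gap_cost([1] * 6, gap_open=1, gap_extend=5)    # 6
```

Since the aligner maximizes the total score, it will favor whichever arrangement of gaps is penalized less under the chosen parameters.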

EMBL local pairwise alignment tool

  • See the video for details: Online Dual Sequence Alignment Tool-03 P42
  • EMBL → Local Alignment → Water → input/upload the 2 sequences to be aligned → Submit
  • The non-matching parts (red) at both ends of sequence 1 are simply ignored in the alignment result; the extra part at the end of sequence 2 is ignored as well.
  • Global alignment vs. local alignment: a global alignment covers both sequences over their full length, whereas a local alignment reports only the best-matching region.
  • Other online pairwise alignment tools:

    Software   Alignment type
    EMBL       Global/Local
    PIR        Global
    Lalign     Global/Local
    LAGAN      Global
    AlignMe    Alignment of membrane proteins
    MCALIGN    Alignment of non-coding DNA sequences
    Biotools   Global/Local

2.8 BLAST search

  • BLAST (Basic Local Alignment Search Tool) is currently the most commonly used database search program.
  • The key concept in BLAST is the segment pair. A segment pair is a pair of subsequences, one from each of two given sequences, that have equal length and can be aligned against each other without gaps.
  • The basic principle of BLAST: BLAST first finds all segment pairs between the query sequence and the target sequence whose match score exceeds a certain threshold, then extends these segment pairs according to a given similarity threshold to obtain similar segments of a certain length, and finally reports the high-scoring segment pairs (HSPs). Later versions of BLAST allow gaps to be inserted.
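
The seed-and-extend idea can be sketched in a few lines. This is a heavily simplified illustration, not how BLAST itself is implemented: seeds here are exact k-letter matches, the scores (+1/-1) and the `x_drop` cutoff are assumed values, and the function names are made up.

```python
from collections import defaultdict


def find_seeds(query, target, k):
    """All (query_pos, target_pos) pairs where the same k-letter word occurs."""
    index = defaultdict(list)
    for j in range(len(target) - k + 1):
        index[target[j:j + k]].append(j)
    return [(i, j)
            for i in range(len(query) - k + 1)
            for j in index.get(query[i:i + k], [])]


def extend_seed(query, target, i, j, k, match=1, mismatch=-1, x_drop=2):
    """Extend an exact seed without gaps in both directions.

    Extension in a direction stops once the running score falls more than
    x_drop below the best score seen. Returns (query_start, target_start,
    length, score) of the best segment pair found.
    """
    best = cur = k * match
    best_right = step = 0
    while i + k + step < len(query) and j + k + step < len(target):
        cur += match if query[i + k + step] == target[j + k + step] else mismatch
        step += 1
        if cur > best:
            best, best_right = cur, step
        elif best - cur > x_drop:
            break
    cur = best
    best_left = step = 0
    while i - step - 1 >= 0 and j - step - 1 >= 0:
        cur += match if query[i - step - 1] == target[j - step - 1] else mismatch
        step += 1
        if cur > best:
            best, best_left = cur, step
        elif best - cur > x_drop:
            break
    return i - best_left, j - best_left, k + best_left + best_right, best


query, target = "GGACGTACGG", "TTACGTACTT"
seeds = find_seeds(query, target, k=4)
hsp = extend_seed(query, target, 2, 2, k=4)
```

Here the seed ACGT at position 2 of both sequences is extended to the right into the segment pair ACGTAC (length 6, score 6); the flanking mismatches lower the running score and are not included.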

Types of BLAST

  • BLAST is actually a general name for a group of tools integrated together. It can search protein sequence databases and nucleic acid sequence databases directly, and it can also translate the nucleic acid query or the nucleic acid database sequences into protein sequences before searching, in order to improve the search.
  • blastp: search a protein sequence database with a protein sequence (commonly used)
  • blastn: search a nucleic acid sequence database with a nucleic acid sequence (commonly used)
  • blastx: translate the nucleic acid query into protein sequences in all 6 reading frames, then search a protein sequence database
  • tblastn: search a nucleic acid sequence database with a protein sequence; the nucleic acid sequences in the database are translated into protein sequences in all 6 reading frames before searching
  • tblastx: translate the nucleic acid query into protein sequences in all 6 reading frames, and search a nucleic acid sequence database whose sequences are likewise translated in all 6 reading frames (for newly discovered sequences)
  • By search algorithm: standard BLAST, PSI-BLAST, PHI-BLAST, etc.
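
The six-frame translation used by blastx, tblastn and tblastx (three reading frames on each of the two strands) can be sketched as follows. The function names are made up, the standard genetic code is assumed, and the input is assumed to contain only A, C, G and T.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the start of dna; a trailing partial codon is dropped."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Frames +1, +2, +3 on the given strand, then -1, -2, -3 on the reverse complement."""
    dna = dna.upper()
    strands = (dna, reverse_complement(dna))
    return [translate(strand[frame:]) for strand in strands for frame in range(3)]


frames = six_frame_translations("ATGGCC")
```

For the toy input ATGGCC the six frames are MA, W, G on the forward strand and GH, A, P on the reverse complement GGCCAT.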

Standard BLAST

  • See the video for details: BLAST Search-03 Practical Operation P46
  • BLAST results:
  • Total score (match score) and Query cover (coverage) determine the color and the length of the matched sequence, respectively.
  • E value (Expect value): the number of hits with an equal or better score that would be expected by chance. The closer the E value is to zero, the more likely it is that the input sequence and the matched sequence are the same sequence.
  • The matching results are sorted by E value in ascending order. As the E value increases, the Total score decreases accordingly, but the identity (Ident) is not strictly inversely related to the E value. (To improve speed, BLAST does not perform a full pairwise sequence alignment during the search, sacrificing some accuracy. The Ident values in the table are obtained after the BLAST search is complete, from pairwise alignments with the 50 sequences found.)

PSI-BLAST (casting a wide net)

  • Sometimes a basic BLAST search just isn't enough. For example, you may want to collect a huge protein family starting from one protein sequence. A basic BLAST search will only find sequences that are very close to the query, and will miss the more distant ones.
  • PSI-BLAST (Position-Specific Iterated BLAST)
    The distinctive feature of PSI-BLAST is that each round searches the database with a position-specific scoring matrix (PSSM), then uses the search results to rebuild the PSSM, then searches the database again with the new PSSM, and so on (iteration) until no new results are produced. (Finding the friends of friends.)
  • See video for details: BLAST SEARCH-04 PSI BLAST P47
  • The results of the first round are the same as those of standard BLAST.
  • Click Go to run the second round of the search (you can specify how many of the top results to include).

PHI-BLAST (exact search)

  • See video for details: BLAST SEARCH-05 PHI-BLAST P48

  • PHI-BLAST (Pattern-Hit Initiated BLAST): finds sequences that are similar to the input sequence and also match a specific pattern.

  • For example, the N-glycosylation site motif always conforms to the following pattern: it starts with Asn (N), followed by any amino acid except Pro (P), followed by Ser (S) or Thr (T), followed by any amino acid except Pro.

    • Written as a regular expression: N{P}[ST]{P}
    • Another pattern written as a regular expression: {L}GEx[GAS][LIVM]x(3,7)
    • {} matches anything except what is inside the curly braces (anything but ...)
    • [] matches any one of the items inside the square brackets (one of ...)
    • x stands for any character
    • x(3,7) stands for 3 to 7 x characters
    • For example: VGEAAMPRI matches the second pattern; VGEAAYPRI does not.
  • Such a sequence pattern may represent a post-translational modification site, the active site of an enzyme, or a structural or functional domain of a protein family.

  • PHI-BLAST and PSI-BLAST can be used in combination.
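
The pattern notation above maps directly onto Python's `re` syntax, which makes the examples easy to check. The converter below is a sketch covering only the constructs listed in these notes ({}, [], x and x(m,n)); the function name is made up, and it is not a full parser for the pattern syntax that PHI-BLAST accepts.

```python
import re


def pattern_to_regex(pattern):
    """Convert a pattern such as 'N{P}[ST]{P}' into a Python regular expression."""
    pattern = pattern.replace(" ", "").replace("-", "")
    out = []
    i = 0
    while i < len(pattern):
        c = pattern[i]
        if c == "{":                      # {P}  -> [^P]  (anything but P)
            j = pattern.index("}", i)
            out.append("[^" + pattern[i + 1:j] + "]")
            i = j + 1
        elif c == "[":                    # [ST] -> [ST]  (one of S, T)
            j = pattern.index("]", i)
            out.append(pattern[i:j + 1])
            i = j + 1
        elif c == "(":                    # (3,7) -> {3,7} (repeat count)
            j = pattern.index(")", i)
            out.append("{" + pattern[i + 1:j] + "}")
            i = j + 1
        elif c == "x":                    # x    -> .     (any residue)
            out.append(".")
            i += 1
        else:                             # a literal residue letter
            out.append(c)
            i += 1
    return "".join(out)


glyco = pattern_to_regex("N{P}[ST]{P}")
second = pattern_to_regex("{L}GEx [GAS] [LIVM]x(3,7)")
matches = re.fullmatch(second, "VGEAAMPRI") is not None
no_match = re.fullmatch(second, "VGEAAYPRI") is not None
```

The second pattern becomes `[^L]GE.[GAS][LIVM].{3,7}`; VGEAAMPRI matches it, while VGEAAYPRI fails because Y is not one of L, I, V, M.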

Other BLAST

  • See the video for details: BLAST Search-06 Other BLAST P49
  • SmartBLAST: the condensed search results include the three sequences in the database most similar to the input sequence, plus the two most similar sequences from the best-studied species, which can show some evolutionary relationship.
  • Free search tools on the Internet (use time-zone differences to choose among the BLAST servers):

    Location   Server    Link
    USA        NCBI      http://www.ncbi.nlm.nih.gov/BLAST
    Europe     ExPASy    http://web.expasy.org/blast
    Europe     UniProt   http://www.uniprot.org/blast/
    Japan      DDBJ      http://blast.ddbj.nig.ac.jp

  • WU-BLAST: WU stands for Washington University. It is more sensitive than NCBI-BLAST and more flexible in the algorithm for inserting gaps.
  • Smith and Waterman (SSEARCH): somewhat slower, but more accurate than BLAST.
  • FASTA: somewhat slower, but more accurate than BLAST for comparing DNA sequences.
  • BLAT: used for searching small sequences (such as cDNA) in large genomes.

2.9 Introduction to Multiple Sequence Alignment

Multiple Sequence Alignment - Applications and Algorithms

  • A multiple alignment is a global alignment of two or more biological sequences.

  • The main uses of multiple sequence alignment:

    1. Confirmation: determine whether an unknown sequence belongs to a certain family.
    2. Construction: build a phylogenetic tree to view the relationships between species or sequences.
    3. Pattern recognition: particularly conserved sequence segments often correspond to important functional regions, and these conserved segments can be found through multiple sequence alignment.
    4. Inferring the unknown from the known: build a model of sequence segments known to have a special function through multiple sequence alignment, then use the model to infer whether an unknown sequence segment also has this function.
    5. Others: predicting protein/RNA secondary structure, etc.
  • Algorithms for multiple sequence alignment: none of the current multiple sequence alignment tools is perfect; they all use approximate algorithms. (A multiple sequence alignment shows the overall trend and approximate positions, at the cost of some accuracy.)

  • Notes on multiple sequence alignment:

    1. Too many sequences are a problem. 10-15 sequences is typical; preferably use no more than 50.
    2. Sequences that are too distantly related are a problem. For a set of sequences whose pairwise similarity is below 30%, multiple sequence alignment becomes troublesome.
    3. Sequences that are too closely related are a problem. Sequences with more than 90% similarity count as just one, no matter how many there are.
    4. Short sequences are a problem. Multiple sequence alignment works best on a set of sequences of roughly the same length; individual very short sequences cause trouble.
    5. Sequences with repeated domains are a problem. Most multiple sequence alignment programs go wrong or even crash if the sequences contain repeated domains.
  • Some suggestions for naming the sequences:

    1. Do not use spaces in a name; use "_" instead of a space.
    2. Do not use special characters (such as Chinese characters, @, #, &, ^, etc.).
    3. A name should not be longer than 15 characters.
    4. Within one set of sequences, no two sequences should have the same name.
    5. If you do not follow the points above, the multiple sequence alignment tool will modify your sequence names without notifying you.
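
The naming suggestions can be applied automatically before submitting sequences. This is a sketch only: the function name is made up, and the choices to keep just letters, digits and underscores and to make duplicates unique with a numeric suffix are assumptions of this example, not rules of any particular alignment tool.

```python
import re


def sanitize_names(names, max_len=15):
    """Return cleaned-up, unique sequence names following the suggestions above."""
    seen = set()
    result = []
    for name in names:
        clean = re.sub(r"\s+", "_", name.strip())       # spaces -> "_"
        clean = re.sub(r"[^A-Za-z0-9_]", "", clean)     # drop special characters
        clean = clean[:max_len] or "seq"                # at most max_len characters
        candidate, n = clean, 1
        while candidate in seen:                        # keep names unique
            n += 1
            suffix = f"_{n}"
            candidate = clean[:max_len - len(suffix)] + suffix
        seen.add(candidate)
        result.append(candidate)
    return result


cleaned = sanitize_names(["my protein 1", "my protein 1", "seq@#2"])
```

Renaming the sequences yourself in this way keeps control over the names, instead of leaving the changes to the alignment tool.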

2.10 Online Multiple Sequence Alignment Tool

  • Clustal: the most commonly used multiple sequence alignment tool

  • TCOFFEE: one of the newest multiple sequence alignment tools

  • MUSCLE: one of the fastest multiple sequence alignment tools

  • Some websites that provide multiple sequence alignment online:

    Website      Tool            Link
    EBI          Clustal-Omega   http://www.ebi.ac.uk/Tools/msa/clustalo/
    Expasy       Clustal W       http://www.ch.embnet.org/software/ClustalW.html
    Sf-Clustal   Clustal O/W2    http://www.clustal.org/ (download only)
    EBI          Tcoffee         http://www.ebi.ac.uk/Tools/msa/tcoffee
    TCOFFEE      Tcoffee         http://www.tcoffee.org/
    EBI          Muscle          http://www.ebi.ac.uk/Tools/msa/muscle/
    MUSCLE       Muscle          http://www.drive5.com/muscle/ (download only)

EMBL

  • See the video for details: Online Multiple Sequence Alignment Tool-01 EMBL P52

  • ORDER

    aligned: output the sequences in the order created automatically during the alignment; input: output the results in the original order of the input sequences.

  • Download Alignment File

  • Show Colors
    Red: hydrophobic
    Blue: acidic
    Magenta: basic
    Green: hydroxyl + amine + basic
    Gray: others

  • Below each block of the alignment result there is a row of marker symbols. Regions densely covered with markers are the conserved regions of these sequences.

    Symbol    Meaning
    *         A completely conserved column, i.e. the residues in this column are identical.
    :         The residues in this column have roughly the same molecular size and the same hydropathy, i.e. the residues in this column are either identical or similar.
    .         During evolution the molecular size and hydropathy of the residues in this column were preserved to some extent, but substitutions between dissimilar residues have occurred (partly similar, partly dissimilar).
    (blank)   A column that is not conserved at all (not similar at all).

  • Result Summary
  • Phylogenetic Tree. NOTE: this is not a true phylogenetic tree.
  • To get a true phylogenetic tree, use "Send to ClustalW2_Phylogeny" under Alignments to send the alignment result to software specialized in building phylogenetic trees.

Tcoffee

Format for saving multiple sequence alignments


2.11 Editing and Publishing of Multiple Sequence Alignments

  • To display multiple sequence alignment results in color and edit them manually, multiple sequence alignment editors have been developed.
  • Jalview is a particularly commonly used editor. http://www.jalview.org
  • See video for details: Editing and Publishing of Multiple Sequence Alignments-01-02 Jalview P55-56
  • Jalview can be launched quickly from the EMBL multiple sequence alignment results, but the quick-launch Jalview is not fully functional!
  • Download and install it locally (requires Java).
  • Import the multiple sequence alignment result (Clustal file).
  • Coloring: the Clustal color scheme is commonly used.
  • Fix local defects: adjust local regions manually.
  • Wrap lines automatically, set the font.
  • Turn annotation lines on/off.

Basic analysis function

  1. Sort according to various rules, and run a global pairwise alignment for any pair of sequences
  2. Create a phylogenetic tree for a selected set of sequences
  3. Predict the secondary structure of a protein sequence
  4. Save the sequence alignment as a picture
  • Multiple sequence alignment beautification tools:

    Name       URL                                               Features
    JalView    http://www.jalview.org                            Java, embeddable in web pages
    Boxshade   http://www.ch.embnet.org/software/BoX_form.html   Good at black-and-white figures
    ESPript    http://espript.ibcp.fr/ESPript/ESPript            Very powerful
    MView      http://bio-mview.sourceforge.net                  Good at converting to HTML source code
2.12 Finding Conserved Regions

Sequence logo

  • See the video for details: Finding Conserved Regions-01 P57
  • A sequence logo is a graphic that draws, position by position, the residues appearing at each position of a sequence alignment. The stack of residues at each position reflects how conserved that position is. The size of the glyph for each residue is proportional to the frequency of that residue at that position. However, glyph size is not simply the frequency percentage (otherwise every column would have the same total height); it is the result of a transformation after a simple statistical calculation.
  • The reason is that entropy enters the calculation of the letter height: the more disordered the letters in a column, the larger the entropy and the shorter the letters; the more regular the column, the smaller the entropy and the taller the letters.
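
One common way to turn this into numbers is the information-content formula: the total height of a column is log2(alphabet size) minus the column's Shannon entropy, and each letter gets a share of that height proportional to its frequency. The notes do not give the exact formula, so the sketch below is an assumption; it ignores the small-sample correction that logo software may apply, and it assumes an ungapped DNA alignment.

```python
import math
from collections import Counter


def column_heights(column, alphabet_size=4):
    """Letter heights (in bits) for one alignment column."""
    counts = Counter(column)
    total = len(column)
    freqs = {residue: count / total for residue, count in counts.items()}
    entropy = -sum(f * math.log2(f) for f in freqs.values())
    information = math.log2(alphabet_size) - entropy  # total column height
    return {residue: f * information for residue, f in freqs.items()}


def logo_heights(aligned_seqs, alphabet_size=4):
    """Letter heights for every column of an ungapped alignment."""
    return [column_heights(col, alphabet_size) for col in zip(*aligned_seqs)]


heights = logo_heights(["AAT", "ACT", "AGT", "ATT"])
```

In this made-up alignment the first and third columns are perfectly conserved (one letter of height 2 bits), while the second column contains all four bases equally often, so its entropy is maximal and all its letters have height 0.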

WebLogo 3

  • A popular program for creating sequence logos: WebLogo 3 http://weblogo.threeplusone.com/
  • Example: create a WebLogo by entering multiple promoter sequences.

Sequence motif: MEME

  • See the video for details: Finding Conserved Regions-02 MEME P58
  • Nucleic acid and protein sequences contain segments with specific patterns; these segments are called sequence motifs. Sequence motifs are closely related to biological function.
  • MEME is a program that automatically discovers sequence motifs in a set of related DNA or protein sequences. http://meme-suite.org
  • Upload the original sequences; there is no need to run a multiple sequence alignment in advance.
  • The results are returned in various formats.
  • Click the arrow under "more" to see the enlarged sequence logo and the detailed motif information.
  • The arrow on the right submits the motif to other software or databases for motif-based sequence similarity searches.

PRINTS fingerprint database

  • See the video for details: Finding Conserved Regions-03 PRINTS P59

  • A protein fingerprint is a group of conserved sequence motifs used to characterize a protein family. These motifs are derived from multiple sequence alignments; they are not adjacent in the amino acid sequence, but they may lie close together in the three-dimensional structure.

  • PRINTS http://www.bioinf.manchester.ac.uk/dbbrowser/PRINTS/ is a protein fingerprint database that stores the fingerprints of most of the protein families discovered so far. For an unfamiliar protein, simply checking whether its sequence matches a family's fingerprint makes it possible to classify the protein and predict its function.

  • Direct PRINTS access: there are many ways to look up a protein fingerprint

    • TRANSFERRIN: fingerprint information

    • View alignment: view the multiple sequence alignment used to create the fingerprint

    • View Structure: taking the structure of one protein in the family as an example, the positions of the motifs in the three-dimensional structure are displayed online

  • PRINTS search

    • FPScan fingerprint matching: search for fingerprints that match a sequence


Origin blog.csdn.net/zea408497299/article/details/125103668