PSP - Structure template (Template) search and feature logic in AlphaFold2 protein structure prediction

Welcome to follow my CSDN: https://spike.blog.csdn.net/
This article's address: https://spike.blog.csdn.net/article/details/132427617

Template
The structure template (Template) is a known protein structure that AlphaFold2 can use as a reference when predicting a protein structure. AlphaFold2 can search multiple databases and select the most appropriate templates, or it can use a customized template. It is a deep-learning method that combines template information with sequence alignment information to predict the three-dimensional structure of a protein. It can also predict the structure of multimeric proteins, which are complexes composed of multiple subunits.

The result file of the Monomer template search is pdb_hits.hhr, while the result file of the Multimer template search is in sto (Stockholm) format. The difference comes from the search tools: Monomer uses hhsearch, and Multimer uses hmmsearch.
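
As a rough illustration, the result file name can be thought of as pdb_hits plus the output format of the search tool in use. The helper below is hypothetical (it is not AlphaFold code), and it assumes the Monomer pipeline searches with hhsearch and the Multimer pipeline with hmmsearch.

```python
# Hypothetical helper, not AlphaFold code: maps the template search tool
# to the extension of the result file it is assumed to write.
OUTPUT_FORMAT_BY_TOOL = {
    "hhsearch": "hhr",   # Monomer pipeline
    "hmmsearch": "sto",  # Multimer pipeline
}


def template_hits_filename(tool: str) -> str:
    """Returns the expected result file name for a template search tool."""
    try:
        return f"pdb_hits.{OUTPUT_FORMAT_BY_TOOL[tool]}"
    except KeyError:
        raise ValueError(f"Unknown template search tool: {tool}") from None


print(template_hits_filename("hhsearch"))
print(template_hits_filename("hmmsearch"))
```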

The test case is T1104-D1_A117, taken from CASP15. The logic for processing the template search results is located in alphafold/data/pipeline.py:

  • templates_result is the template feature, one of the three major input features of AF2 (Sequence | MSA | Template).

That is:

# ...
# Load the saved template search result
if os.path.isfile(pdb_hits_out_path):  # skip the template search
  with open(pdb_hits_out_path, "r") as f:
    pdb_templates_result = f.read()
  logging.info("[CL] use saved template. %s", pdb_hits_out_path)
# ...
# Get the PDB hit results
pdb_template_hits = self.template_searcher.get_template_hits(
    output_string=pdb_templates_result, input_sequence=input_sequence)
# ...
# Parse the template results into features
templates_result = self.template_featurizer.get_templates(
    query_sequence=input_sequence,
    hits=pdb_template_hits)
# ...

log:

[CL] use saved template. mydata/T1104-D1_A117/msas/pdb_hits.hhr
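
The reuse-if-present pattern in the snippet above can be sketched in isolation as follows. This is a minimal sketch, not the pipeline's code: load_or_run and run_search are hypothetical names, and writing the result back to disk after a fresh search is an assumption made for the example.

```python
import os
from typing import Callable


def load_or_run(path: str, run_search: Callable[[], str]) -> str:
    """Returns the saved result at `path`, running the search only if needed."""
    if os.path.isfile(path):
        # A previous run already saved the result, so reuse it.
        with open(path, "r") as f:
            return f.read()
    # No saved result: run the (potentially slow) search and save its output.
    result = run_search()
    with open(path, "w") as f:
        f.write(result)
    return result
```

Calling load_or_run() twice with the same path runs the search at most once; the second call only reads the file.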

1. get_template_hits() — parse hhr file

Obtain the template hits in the PDB library, mainly by parsing the pdb_hits.hhr file. The logic is located in alphafold/data/tools/hhsearch.py, that is:

def get_template_hits(self,
                      output_string: str,
                      input_sequence: str) -> Sequence[parsers.TemplateHit]:
  """Gets parsed template hits from the raw string output by the tool."""
  del input_sequence  # Used by hmmsearch but not needed for hhsearch.
  return parsers.parse_hhr(output_string)

This calls parse_hhr(), the function that parses the hhr content. The logic is located in alphafold/data/parsers.py, that is:

def parse_hhr(hhr_string: str) -> Sequence[TemplateHit]:
  """Parses the content of an entire HHR file."""
  lines = hhr_string.splitlines()

  # Each .hhr file starts with a results table, then has a sequence of hit
  # "paragraphs", each paragraph starting with a line 'No <hit number>'. We
  # iterate through each paragraph to parse each hit.

  block_starts = [i for i, line in enumerate(lines) if line.startswith('No ')]

  hits = []
  if block_starts:
    block_starts.append(len(lines))  # Add the end of the final block.
    for i in range(len(block_starts) - 1):
      hits.append(_parse_hhr_hit(lines[block_starts[i]:block_starts[i + 1]]))
  return hits
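
To see the block splitting on its own, here is a small self-contained sketch applied to a shortened, hand-made hhr excerpt. split_hhr_blocks is a hypothetical name and only reproduces the splitting step, not the per-hit parsing. Note that the header line starting with "No_of_seqs" and the table header " No Hit" do not match the prefix 'No ', so only the hit paragraphs are picked up.

```python
from typing import List


def split_hhr_blocks(hhr_string: str) -> List[List[str]]:
    """Splits hhr text into one list of lines per 'No <hit number>' paragraph."""
    lines = hhr_string.splitlines()
    starts = [i for i, line in enumerate(lines) if line.startswith("No ")]
    if not starts:
        return []
    starts.append(len(lines))  # Sentinel marking the end of the last block.
    return [lines[a:b] for a, b in zip(starts, starts[1:])]


# Shortened excerpt for illustration only.
sample = """Query         A
Match_columns 117
No_of_seqs    22 out of 29

 No Hit                             Prob
  1 2HL7_A Cytochrome C-type bioge  31.3

No 1
>2HL7_A Cytochrome C-type biogenesis protein CcmH
Probab=31.28  E-value=41

No 2
>3D3M_A Eukaryotic translation initiation factor 4
Probab=26.65  E-value=56
"""

blocks = split_hhr_blocks(sample)
for block in blocks:
    print(block[0], "->", block[1])
```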

Insert logging logic, that is:

from absl import logging
logging.set_verbosity(logging.INFO)
for hit in hits:
  logging.info(f"[CL] hit: {hit}")

The log is as follows:

[CL] hit: TemplateHit(index=5, name='2KW0_A CcmH protein; oxidoreductase, cytochrome c maturation; NMR {Escherichia coli}', aligned_cols=38, sum_probs=30.8, query='AVAKGLEEMYANGVTEDNFKNYVKNNFAQQEISSVEEE', hit_sequence='DLRQKVYELMQEGKSKKEIVDYMVARYGNFVTYDPPLT', indices_query=[8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45], indices_hit=[43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80])
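
The indices_query and indices_hit lists are 0-based residue positions, one per alignment column. The sketch below shows how such a list could be derived from an aligned row and its 1-based start position in the hhr file. It is an illustrative reconstruction rather than the code in parsers.py; residue_indices is a hypothetical name, and marking gap columns with -1 is an assumption (the hit in the log has no gaps).

```python
from typing import List


def residue_indices(aligned_sequence: str, start: int) -> List[int]:
    """Maps each alignment column to a 0-based residue index, or -1 for a gap.

    `start` is the 1-based start position printed in the hhr alignment row.
    """
    indices = []
    counter = start - 1  # Convert the 1-based hhr position to 0-based.
    for symbol in aligned_sequence:
        if symbol == "-":
            indices.append(-1)  # Assumed gap marker.
        else:
            indices.append(counter)
            counter += 1
    return indices


# Query row of hit No 5, which covers query columns 9-46 in the hhr table.
query_row = "AVAKGLEEMYANGVTEDNFKNYVKNNFAQQEISSVEEE"
indices_query = residue_indices(query_row, start=9)
print(indices_query[0], indices_query[-1])
```

For hit No 5, the summary table lists query columns 9-46 and template columns 44-81, which correspond to the 0-based ranges 8..45 and 43..80 shown in the log.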

The content of pdb_hits.hhr is as follows:

Query         A 
Match_columns 117
No_of_seqs    22 out of 29
Neff          2.94667
Searched_HMMs 80799
Date          Tue May  9 10:08:54 2023
Command       /root/miniconda3/envs/alphafold/bin/hhsearch -i /tmp/tmp7qe78wbc/query.a3m -o /tmp/tmp7qe78wbc/output.hhr -maxseq 1000000 -d /nfs_baoding/chenlong/af2_data_dir/pdb70/pdb70 

 No Hit                             Prob E-value P-value  Score    SS Cols Query HMM  Template HMM
  1 2HL7_A Cytochrome C-type bioge  31.3      41 0.00051   24.1   0.0   34    9-42     47-80  (84)
  2 3D3M_A Eukaryotic translation   26.6      56 0.00069   24.7   0.0   33    9-41      3-35  (168)
  3 3JVN_A Acetyltransferase (E.C.  25.8      59 0.00073   21.0   0.0   31   70-100    94-124 (166)
  4 3OAO_A uncharacterized protein  24.7      64 0.00079   23.7   0.0   43   11-53     56-98  (147)
  5 2KW0_A CcmH protein; oxidoredu  23.4      70 0.00087   23.3   0.0   38    9-46     44-81  (90)
  6 3F0A_A N-ACETYLTRANSFERASE; N-  22.9      73 0.00091   20.4   0.0   39    2-40     11-53  (162)
  7 5LXU_A Transcription factor LU  22.4      76 0.00094   20.3   0.0   26   11-36     29-54  (57)
  8 2DXQ_A Acetyltransferase; acet  22.0      78 0.00097   19.6   0.0   25   75-99     92-116 (150)
  9 2DXQ_B Acetyltransferase; acet  22.0      78 0.00097   19.6   0.0   25   75-99     92-116 (150)
 10 5UN0_3 Proteasome Activator; A  19.9      92  0.0011   25.3   0.0   24   78-101    88-111 (251)

No 1
>2HL7_A Cytochrome C-type biogenesis protein CcmH; Three-helices bundle, OXIDOREDUCTASE; HET: PG4; 1.7A {Pseudomonas aeruginosa}
Probab=31.28  E-value=41  Score=24.06  Aligned_cols=34  Identities=9%  Similarity=0.188  Sum_probs=27.6  Template_Neff=6.200

Q A                 9 AVAKGLEEMYANGVTEDNFKNYVKNNFAQQEISS   42 (117)
Q Consensus         9 ~VA~~LEkMF~nGVse~Nf~~Yv~~Nfs~~EIs~   42 (117)
                      ..-.++.+|...|-|++-+.+|+.+.|++.=+..
T Consensus        47 ~~R~~I~~~l~~G~s~~eI~~~~v~~YG~~IL~~   80 (84)
T 2HL7_A           47 DLRKQIYGQLQQGKSDGEIVDYMVARYGDFVRYK   80 (84)
T ss_dssp             HHHHHHHHHHHHTCCHHHHHHHHHHHHTTTCEEC
T ss_pred             HHHHHHHHHHHCCCCHHHHHHHHHHHHCccceeC
Confidence            3445678899999999999999999998765443


No 2
>3D3M_A Eukaryotic translation initiation factor 4; HEAT repeat domain, Structural Genomics; 1.9A {Homo sapiens}
Probab=26.65  E-value=56  Score=24.68  Aligned_cols=33  Identities=12%  Similarity=0.157  Sum_probs=27.2  Template_Neff=8.500

Q A                 9 AVAKGLEEMYANGVTEDNFKNYVKNNFAQQEIS   41 (117)
Q Consensus         9 ~VA~~LEkMF~nGVse~Nf~~Yv~~Nfs~~EIs   41 (117)
                      .|-.+|++++..|-+.+.+.+|+.+|.+++...
T Consensus         3 ~~~~~L~~~l~~~~~~~~i~~wi~~~v~~~~~~   35 (168)
T 3D3M_A            3 KLEKELLKQIKLDPSPQTIYKWIKDNISPKLHV   35 (168)
T ss_pred             HHHHHHHHHHhhCCCHHHHHHHHHHhCCHHHcC
Confidence            355689999999999999999999998876543

# ...
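
Each hit paragraph carries a summary line of Key=value fields, including the Sum_probs and Aligned_cols values that later appear in TemplateHit. A minimal sketch of reading that line is shown below; parse_summary_line is a hypothetical helper, and it converts every field to float for simplicity, which is not necessarily how AlphaFold's parser treats them.

```python
from typing import Dict


def parse_summary_line(line: str) -> Dict[str, float]:
    """Parses the 'Key=value' fields of an hhr hit summary line into floats."""
    fields = {}
    for token in line.split():
        key, _, value = token.partition("=")
        # Identities is printed as a percentage, e.g. '9%'.
        fields[key] = float(value.rstrip("%"))
    return fields


# Summary line of hit No 1 from the file above.
summary = ("Probab=31.28  E-value=41  Score=24.06  Aligned_cols=34  "
           "Identities=9%  Similarity=0.188  Sum_probs=27.6  Template_Neff=6.200")
fields = parse_summary_line(summary)
print(fields["Sum_probs"], fields["Aligned_cols"])
```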

2. get_templates() — Extract template features

The core logic is located in alphafold/data/templates.py. get_templates() loops over the hits and calls _process_single_hit() for each one, that is:

def get_templates(
    self,
    query_sequence: str,
    hits: Sequence[parsers.TemplateHit]) -> TemplateSearchResult:
  """Computes the templates for given query sequence (more details above)."""
  # ...
  for hit in sorted(hits, key=lambda x: x.sum_probs, reverse=True):
    # We got all the templates we wanted, stop processing hits.
    if num_hits >= self._max_hits:
      break
    result = _process_single_hit(
        query_sequence=query_sequence,
        hit=hit,
        mmcif_dir=self._mmcif_dir,
        max_template_date=self._max_template_date,
        release_dates=self._release_dates,
        obsolete_pdbs=self._obsolete_pdbs,
        strict_error_check=self._strict_error_check,
        kalign_binary_path=self._kalign_binary_path)
#...
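
One consequence of this loop is that hits are visited in order of sum_probs, not in the Prob order of the hhr table. The sketch below illustrates this with the three sum_probs values visible in this article (hits No 1, No 2 and No 5). Hit, select_hits and is_usable are hypothetical names; in particular, is_usable only stands in for whatever filtering _process_single_hit() performs.

```python
from typing import Callable, List, NamedTuple


class Hit(NamedTuple):
    """Hypothetical stand-in for parsers.TemplateHit with only two fields."""
    index: int
    sum_probs: float


def select_hits(
    hits: List[Hit],
    max_hits: int,
    is_usable: Callable[[Hit], bool] = lambda hit: True,
) -> List[Hit]:
    """Keeps up to `max_hits` usable hits, highest sum_probs first."""
    selected = []
    for hit in sorted(hits, key=lambda h: h.sum_probs, reverse=True):
        if len(selected) >= max_hits:
            break  # We got all the templates we wanted.
        if is_usable(hit):
            selected.append(hit)
    return selected


# sum_probs values taken from the hhr file and the log above.
hits = [Hit(1, 27.6), Hit(2, 27.2), Hit(5, 30.8)]
print([h.index for h in select_hits(hits, max_hits=2)])
```

With max_hits=2, hit No 5 is selected ahead of hit No 1 even though it ranks fifth by Prob, because its sum_probs (30.8) is the highest of the three.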

_process_single_hit() generates the features for a single template. Specifically, the features are produced inside the _extract_template_features() function, that is:

# ...
mapping = _build_query_to_hit_index_mapping(
    hit.query, hit.hit_sequence, hit.indices_hit, hit.indices_query,
    query_sequence)
# ...
template_sequence = hit.hit_sequence.replace('-', '')
# ...
cif_path = os.path.join(mmcif_dir, hit_pdb_code + '.cif')
# ...
parsing_result = mmcif_parsing.parse(
    file_id=hit_pdb_code, mmcif_string=cif_string)
# ...
# Core logic
features, realign_warning = _extract_template_features(
    mmcif_object=parsing_result.mmcif_object,
    pdb_id=hit_pdb_code,
    mapping=mapping,
    template_sequence=template_sequence,
    query_sequence=query_sequence,
    template_chain_id=hit_chain_id,
    kalign_binary_path=kalign_binary_path)
# ...
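
The mapping passed to _extract_template_features() links positions in the full query sequence to positions in the aligned template segment. The following is a simplified sketch of how such a mapping could be built from the fields of a TemplateHit. It is not the implementation of _build_query_to_hit_index_mapping(): build_query_to_hit_mapping is a hypothetical name, the argument list is reduced, and the real function may apply additional checks.

```python
from typing import Dict, List


def build_query_to_hit_mapping(
    hit_query: str,
    indices_query: List[int],
    indices_hit: List[int],
    original_query: str,
) -> Dict[int, int]:
    """Maps 0-based positions in the full query to positions in the hit segment."""
    # Locate the aligned query segment inside the full query sequence.
    offset = original_query.find(hit_query.replace("-", ""))
    if offset < 0:
        return {}
    # Shift both index lists so that each aligned segment starts at 0.
    min_q = min(i for i in indices_query if i != -1)
    min_h = min(i for i in indices_hit if i != -1)
    mapping = {}
    for q, h in zip(indices_query, indices_hit):
        if q != -1 and h != -1:  # Skip gap columns (assumed to be -1).
            mapping[q - min_q + offset] = h - min_h
    return mapping


# Values taken from the TemplateHit log of hit No 5 and the query of T1104-D1_A117.
query_sequence = (
    "QLEDSEVEAVAKGLEEMYANGVTEDNFKNYVKNNFAQQEISSVEEELNVNISDSCVANKIKDEFFAMISIS"
    "AIVKAAQKKAWKELAVTVLRFAKANGLKTNAIIVAGQLALWAVQCG")
hit_query = "AVAKGLEEMYANGVTEDNFKNYVKNNFAQQEISSVEEE"
mapping = build_query_to_hit_mapping(
    hit_query, list(range(8, 46)), list(range(43, 81)), query_sequence)
print(min(mapping), max(mapping), len(mapping))
```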

Insert logging logic, that is:

from absl import logging
logging.set_verbosity(logging.INFO)
logging.info(
    f"[CL] hit_pdb_code: {hit_pdb_code}, hit_chain_id: {hit_chain_id}, "
    f"cif_path: {cif_path}, query_sequence: {query_sequence}")

The log is as follows:

[CL] hit_pdb_code: 2kw0, hit_chain_id: A, cif_path: af2_data_dir/pdb_mmcif/mmcif_files/2kw0.cif, query_sequence: QLEDSEVEAVAKGLEEMYANGVTEDNFKNYVKNNFAQQEISSVEEELNVNISDSCVANKIKDEFFAMISISAIVKAAQKKAWKELAVTVLRFAKANGLKTNAIIVAGQLALWAVQCG
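
The values in this log line can be traced back to the hit name. Below is a minimal sketch that derives the PDB code, the chain ID and the mmCIF path from the name of hit No 5. pdb_code_and_chain is a hypothetical helper and the regular expression is an assumption about the name format, not a copy of the code in templates.py.

```python
import os
import re
from typing import Tuple


def pdb_code_and_chain(hit_name: str) -> Tuple[str, str]:
    """Extracts a lower-case PDB code and the chain ID from a hit name."""
    # Assumed name format: 4-character PDB code, underscore, chain ID.
    match = re.match(r"([A-Za-z0-9]{4})_([A-Za-z0-9.]+)", hit_name)
    if match is None:
        raise ValueError(f"Cannot find a PDB code and chain in: {hit_name}")
    return match.group(1).lower(), match.group(2)


name = ("2KW0_A CcmH protein; oxidoreductase, cytochrome c maturation; "
        "NMR {Escherichia coli}")
hit_pdb_code, hit_chain_id = pdb_code_and_chain(name)
cif_path = os.path.join(
    "af2_data_dir/pdb_mmcif/mmcif_files", hit_pdb_code + ".cif")
print(hit_pdb_code, hit_chain_id, cif_path)
```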
