PSP - 蛋白质结构预测 AlphaFold2 的结构模版 (Template) 搜索与特征逻辑

欢迎关注我的CSDN：https://spike.blog.csdn.net/
本文地址：https://spike.blog.csdn.net/article/details/132427617

结构模版 (Template) 是一种已知的蛋白质结构，可以作为 AlphaFold2 蛋白质结构预测的参考，AlphaFold2 可以从多个数据库中搜索和选择最合适的模板，也可以使用自定义的模板。AlphaFold2 使用了一种基于深度学习的方法，利用模板和序列对齐信息来预测蛋白质的三维结构，还可以预测多聚体蛋白质的结构，这是一种由多个亚基组成的复杂蛋白质。

Monomer 模版搜索的结果文件是 pdb_hits.hhr，而 Multimer 是以 sto 格式结尾，搜索算法不同导致的。

测试 Case 是 T1104-D1_A117，来源于 CASP15。模版搜索结果的处理逻辑，位于 alphafold/data/pipeline.py 中：

templates_result 就是模版特征，是 AF2 的三大特征 (Sequence | MSA | Template)之一。

即：

# ...
# 加载模版文件
if os.path.isfile(pdb_hits_out_path):  # avoid to search templates
  with open(pdb_hits_out_path, "r") as f:
    pdb_templates_result = f.read()
  logging.info("[CL] use saved template. %s", pdb_hits_out_path)
# ...
# 获取pdb命中结果
pdb_template_hits = self.template_searcher.get_template_hits(
    output_string=pdb_templates_result, input_sequence=input_sequence)
# ...
# 解析模版结果
templates_result = self.template_featurizer.get_templates(
    query_sequence=input_sequence,
    hits=pdb_template_hits)
# ...

日志：

[CL] use saved template. mydata/T1104-D1_A117/msas/pdb_hits.hhr

1. get_template_hits() — 解析 hhr 文件

获取模版在 PDB 库的命中结果，主要是解析 pdb_hits.hhr 文件。逻辑位于 alphafold/data/tools/hhsearch.py，即：

def get_template_hits(self,
                      output_string: str,
                      input_sequence: str) -> Sequence[parsers.TemplateHit]:
  """Gets parsed template hits from the raw string output by the tool."""
  del input_sequence  # Used by hmmseach but not needed for hhsearch.
  return parsers.parse_hhr(output_string)

调用解析 hhr 的函数 parse_hhr()，逻辑位于alphafold/data/parsers.py，即

def parse_hhr(hhr_string: str) -> Sequence[TemplateHit]:
  """Parses the content of an entire HHR file."""
  lines = hhr_string.splitlines()

  # Each .hhr file starts with a results table, then has a sequence of hit
  # "paragraphs", each paragraph starting with a line 'No <hit number>'. We
  # iterate through each paragraph to parse each hit.

  block_starts = [i for i, line in enumerate(lines) if line.startswith('No ')]

  hits = []
  if block_starts:
    block_starts.append(len(lines))  # Add the end of the final block.
    for i in range(len(block_starts) - 1):
      hits.append(_parse_hhr_hit(lines[block_starts[i]:block_starts[i + 1]]))
  return hits

插入日志逻辑，即：

from absl import logging
logging.set_verbosity(logging.INFO)
for hit in hits:
  logging.info(f"[CL] hit: {
      
      hit}")

日志如下：

[CL] hit: TemplateHit(index=5, name='2KW0_A CcmH protein; oxidoreductase, cytochrome c maturation; NMR {Escherichia coli}', aligned_cols=38, sum_probs=30.8, query='AVAKGLEEMYANGVTEDNFKNYVKNNFAQQEISSVEEE', hit_sequence='DLRQKVYELMQEGKSKKEIVDYMVARYGNFVTYDPPLT', indices_query=[8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45], indices_hit=[43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80])

其中，pdb_hits.hhr 如下：

Query         A 
Match_columns 117
No_of_seqs    22 out of 29
Neff          2.94667
Searched_HMMs 80799
Date          Tue May  9 10:08:54 2023
Command       /root/miniconda3/envs/alphafold/bin/hhsearch -i /tmp/tmp7qe78wbc/query.a3m -o /tmp/tmp7qe78wbc/output.hhr -maxseq 1000000 -d /nfs_baoding/chenlong/af2_data_dir/pdb70/pdb70 

 No Hit                             Prob E-value P-value  Score    SS Cols Query HMM  Template HMM
  1 2HL7_A Cytochrome C-type bioge  31.3      41 0.00051   24.1   0.0   34    9-42     47-80  (84)
  2 3D3M_A Eukaryotic translation   26.6      56 0.00069   24.7   0.0   33    9-41      3-35  (168)
  3 3JVN_A Acetyltransferase (E.C.  25.8      59 0.00073   21.0   0.0   31   70-100    94-124 (166)
  4 3OAO_A uncharacterized protein  24.7      64 0.00079   23.7   0.0   43   11-53     56-98  (147)
  5 2KW0_A CcmH protein; oxidoredu  23.4      70 0.00087   23.3   0.0   38    9-46     44-81  (90)
  6 3F0A_A N-ACETYLTRANSFERASE; N-  22.9      73 0.00091   20.4   0.0   39    2-40     11-53  (162)
  7 5LXU_A Transcription factor LU  22.4      76 0.00094   20.3   0.0   26   11-36     29-54  (57)
  8 2DXQ_A Acetyltransferase; acet  22.0      78 0.00097   19.6   0.0   25   75-99     92-116 (150)
  9 2DXQ_B Acetyltransferase; acet  22.0      78 0.00097   19.6   0.0   25   75-99     92-116 (150)
 10 5UN0_3 Proteasome Activator; A  19.9      92  0.0011   25.3   0.0   24   78-101    88-111 (251)

No 1
>2HL7_A Cytochrome C-type biogenesis protein CcmH; Three-helices bundle, OXIDOREDUCTASE; HET: PG4; 1.7A {
    
    Pseudomonas aeruginosa}
Probab=31.28  E-value=41  Score=24.06  Aligned_cols=34  Identities=9%  Similarity=0.188  Sum_probs=27.6  Template_Neff=6.200

Q A                 9 AVAKGLEEMYANGVTEDNFKNYVKNNFAQQEISS   42 (117)
Q Consensus         9 ~VA~~LEkMF~nGVse~Nf~~Yv~~Nfs~~EIs~   42 (117)
                      ..-.++.+|...|-|++-+.+|+.+.|++.=+..
T Consensus        47 ~~R~~I~~~l~~G~s~~eI~~~~v~~YG~~IL~~   80 (84)
T 2HL7_A           47 DLRKQIYGQLQQGKSDGEIVDYMVARYGDFVRYK   80 (84)
T ss_dssp             HHHHHHHHHHHHTCCHHHHHHHHHHHHTTTCEEC
T ss_pred             HHHHHHHHHHHCCCCHHHHHHHHHHHHCccceeC
Confidence            3445678899999999999999999998765443


No 2
>3D3M_A Eukaryotic translation initiation factor 4; HEAT repeat domain, Structural Genomics; 1.9A {
    
    Homo sapiens}
Probab=26.65  E-value=56  Score=24.68  Aligned_cols=33  Identities=12%  Similarity=0.157  Sum_probs=27.2  Template_Neff=8.500

Q A                 9 AVAKGLEEMYANGVTEDNFKNYVKNNFAQQEIS   41 (117)
Q Consensus         9 ~VA~~LEkMF~nGVse~Nf~~Yv~~Nfs~~EIs   41 (117)
                      .|-.+|++++..|-+.+.+.+|+.+|.+++...
T Consensus         3 ~~~~~L~~~l~~~~~~~~i~~wi~~~v~~~~~~   35 (168)
T 3D3M_A            3 KLEKELLKQIKLDPSPQTIYKWIKDNISPKLHV   35 (168)
T ss_pred             HHHHHHHHHHhhCCCHHHHHHHHHHhCCHHHcC
Confidence            355689999999999999999999998876543

# ...

2. get_templates() — 提取模版特征

核心逻辑位于 alphafold/data/templates.py 中，get_templates() 循环处理 hit，调用 _process_single_hit()，即：

def get_templates(
    self,
    query_sequence: str,
    hits: Sequence[parsers.TemplateHit]) -> TemplateSearchResult:
  """Computes the templates for given query sequence (more details above)."""
	# ...
  for hit in sorted(hits, key=lambda x: x.sum_probs, reverse=True):
    # We got all the templates we wanted, stop processing hits.
    if num_hits >= self._max_hits:
      break
    result = _process_single_hit(
        query_sequence=query_sequence,
        hit=hit,
        mmcif_dir=self._mmcif_dir,
        max_template_date=self._max_template_date,
        release_dates=self._release_dates,
        obsolete_pdbs=self._obsolete_pdbs,
        strict_error_check=self._strict_error_check,
        kalign_binary_path=self._kalign_binary_path)
#...

核心逻辑位于 _process_single_hit() 中，生成单个模版特征，具体在 _extract_template_features() 函数中，生成特征，即：

# ...
mapping = _build_query_to_hit_index_mapping(
    hit.query, hit.hit_sequence, hit.indices_hit, hit.indices_query,
    query_sequence)
# ...
template_sequence = hit.hit_sequence.replace('-', '')
# ...
cif_path = os.path.join(mmcif_dir, hit_pdb_code + '.cif')
# ...
parsing_result = mmcif_parsing.parse(
    file_id=hit_pdb_code, mmcif_string=cif_string)
# ...
# 核心逻辑
features, realign_warning = _extract_template_features(
    mmcif_object=parsing_result.mmcif_object,
    pdb_id=hit_pdb_code,
    mapping=mapping,
    template_sequence=template_sequence,
    query_sequence=query_sequence,
    template_chain_id=hit_chain_id,
    kalign_binary_path=kalign_binary_path)
# ...

插入日志逻辑，即：

from absl import logging
logging.set_verbosity(logging.INFO)
logging.info(f"[CL] hit_pdb_code: {
      
      hit_pdb_code}, hit_chain_id: {
      
      hit_chain_id}, cif_path: {
      
      cif_path}, query_sequence: {
      
      query_sequence}")

日志如下：

[CL] hit_pdb_code: 2kw0, hit_chain_id: A, cif_path: af2_data_dir/pdb_mmcif/mmcif_files/2kw0.cif, query_sequence: QLEDSEVEAVAKGLEEMYANGVTEDNFKNYVKNNFAQQEISSVEEELNVNISDSCVANKIKDEFFAMISISAIVKAAQKKAWKELAVTVLRFAKANGLKTNAIIVAGQLALWAVQCG