PSP - Fixing the AlphaFold2 HHblits search error "maximum number of residues 32763 exceeded in sequence UniRef100"

Welcome to my CSDN: https://spike.blog.csdn.net/
Address of this article: https://blog.csdn.net/caroline_wendy/article/details/131229922

HHblits

HHblits is a fast protein sequence search tool that finds homologous sequences in large databases. It represents both the query and the database entries as profile Hidden Markov Models (HMMs) and aligns HMMs against HMMs, which improves the sensitivity and accuracy of the alignments compared with plain sequence-to-sequence search. Its advantages are that a search typically finishes within minutes and that the query may itself be an alignment of up to thousands of sequences. Its disadvantages are that it requires a pre-built HMM database, and that for very rare or novel protein sequences it may not find enough homologous sequences.

Error source:
When AlphaFold2 runs HHblits to search the BFD and UniRef30 databases, the search fails with the following error:

I0615 03:59:25.956802 140204622899008 hhblits.py:129] Launching subprocess "/root/miniconda3/envs/alphafold/bin/hhblits -i mydata/gly-fasta-211/7n28_X.fasta -cpu 128 -oa3m /tmp/tmp335fuu56/output.a3m -o /dev/null -n 3 -e 0.001 -maxseq 1000000 -realign_max 100000 -maxfilt 100000 -min_prefilter_hits 1000 -d af2_data_dir/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt -d af2_data_dir/uniref30/UniRef30_2021_03"
...
HHblits failed. HHblits stderr begin:
...
maximum number of residues 32763 exceeded in sequence UniRef100_UPI001401E5A6_consensus
...
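
When many targets are processed in a batch, it can be useful to recognize this particular failure in the captured log. The following is a minimal sketch, not part of AlphaFold or HH-suite: the helper name `parse_residue_limit_error` and the regular expression are assumptions derived only from the message shown above, and other HHblits versions may word the message differently.

```python
import re
from typing import Optional, Tuple

# Pattern based on the single error line quoted in this article (assumption:
# the wording is stable across the HHblits build being used).
_RESIDUE_LIMIT_RE = re.compile(
    r"maximum number of residues (\d+) exceeded in sequence (\S+)")


def parse_residue_limit_error(log_text: str) -> Optional[Tuple[int, str]]:
    """Returns (residue_limit, sequence_name) if the error is present, else None."""
    match = _RESIDUE_LIMIT_RE.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


# Example input copied from the log excerpt above.
log_text = (
    "HHblits failed. HHblits stderr begin:\n"
    "maximum number of residues 32763 exceeded in sequence "
    "UniRef100_UPI001401E5A6_consensus\n")
result = parse_residue_limit_error(log_text)
print(result)
```

A `None` result means the failure had a different cause, so the parameter change described in this article may not apply.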

The cause is that, in some cases, the HHblits search volume is too large, which triggers the error. The fix is to reduce the search volume by lowering the default parameters in alphafold/data/tools/hhblits.py:

...
class HHBlits:
  """Python wrapper of the HHblits binary."""

  # BUG: maximum number of residues 32763 exceeded in sequence UniRef100_UPI001401E5A6_consensus
  # Lower the search parameters to avoid the exception
  def __init__(self,
               *,
               binary_path: str,
               databases: Sequence[str],
               # n_cpu: int = 4,
               n_cpu: int = 64,  # set according to the server
               n_iter: int = 3,
               e_value: float = 0.001,
               # maxseq: int = 1_000_000,
               maxseq: int = 200_000,
               # realign_max: int = 100_000,
               realign_max: int = 50_000,
               # maxfilt: int = 100_000,
               maxfilt: int = 50_000,
               min_prefilter_hits: int = 1000,
               all_seqs: bool = False,
               alt: Optional[int] = None,
               p: int = _HHBLITS_DEFAULT_P,
               z: int = _HHBLITS_DEFAULT_Z):
...
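
To check the effect of the lowered defaults before running a full search, the sketch below rebuilds the command line using only the flags visible in the log above. The function `build_hhblits_cmd` is a hypothetical helper written for illustration, not AlphaFold's own code, and the paths in the usage lines are placeholders.

```python
from typing import List, Sequence


def build_hhblits_cmd(binary_path: str,
                      input_fasta: str,
                      output_a3m: str,
                      databases: Sequence[str],
                      n_cpu: int = 64,
                      n_iter: int = 3,
                      e_value: float = 0.001,
                      maxseq: int = 200_000,
                      realign_max: int = 50_000,
                      maxfilt: int = 50_000,
                      min_prefilter_hits: int = 1000) -> List[str]:
    """Builds an HHblits argument list with the reduced search limits."""
    cmd = [
        binary_path,
        "-i", input_fasta,
        "-cpu", str(n_cpu),
        "-oa3m", output_a3m,
        "-o", "/dev/null",
        "-n", str(n_iter),
        "-e", str(e_value),
        "-maxseq", str(maxseq),
        "-realign_max", str(realign_max),
        "-maxfilt", str(maxfilt),
        "-min_prefilter_hits", str(min_prefilter_hits),
    ]
    # Each database prefix gets its own -d flag, as in the logged command.
    for db_path in databases:
        cmd += ["-d", db_path]
    return cmd


# Placeholder paths, for illustration only.
cmd = build_hhblits_cmd("hhblits", "query.fasta", "output.a3m",
                        ["path/to/bfd_prefix", "path/to/uniref30_prefix"])
print(" ".join(cmd))
```

Comparing the printed command with the one in the log confirms that -maxseq, -realign_max and -maxfilt now carry the smaller values.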
