DeepAllo: First Use of Protein Language Models and Multi-Task Learning for Allosteric Site Prediction

DeepAllo integrates protein language models with multi-task learning to accurately predict allosteric sites, advancing drug discovery and protein engineering with innovative AI techniques.


Editor | Cabbage Leaf

Allostery refers to the modulation of protein activity by ligand binding at sites distant from the active site. This mechanism is increasingly important in drug development because allosteric drugs can offer fewer side effects.

Identifying allosteric pockets (sites) is highly challenging. Existing techniques include machine learning methods that predict allosteric pockets from static structural and pocket features.

Researchers from Koç University in Turkey developed DeepAllo, the first study to combine fine-tuned protein language models (pLMs) with FPocket features to improve the accuracy of allosteric site prediction.

The study, titled DeepAllo: allosteric site prediction using protein language model (pLM) with multitask learning, was published in Bioinformatics in June 2025.


Allosteric regulation involves ligand binding at sites outside the active site to modulate protein activity. Most drugs target active sites, but allosteric drugs can offer fewer side effects.

While all proteins have potential allosteric sites, many remain undiscovered due to a lack of data. Existing ML methods use pocket features from FPocket, which provides a feature vector for each pocket, and train binary classifiers on those vectors to identify allosteric pockets.

However, these methods do not fully leverage the power of pretrained protein language models (pLMs) or large protein language models (pLLMs).

Incorporating pLM ideas

In recent research, Koç University researchers fine-tuned ProtBERT-BFD (ProtBERT-Big Fantastic Database), a pLM from the ProtTrans series, on the Allosteric Site Database (ASD). They further extended it via multitask learning (MTL), using two prediction heads: (A) allosteric residue prediction, and (B) secondary structure prediction.

Task A is the primary focus, aiming to learn allosteric residues by leveraging secondary structure information, which helps compensate for limited allosteric data. The fine-tuned pLM features are combined with FPocket pocket features to train XGBoost and AutoML models.
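The two-head setup above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the sequence length, head shapes, stand-in embeddings, random targets, and the 0.5 auxiliary loss weight are all assumptions; only the 1024-dimensional per-residue embedding size (ProtBERT-BFD) and the two-task structure come from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

SEQ_LEN, EMB_DIM = 12, 1024   # per-residue pLM embedding size (ProtBERT-BFD: 1024)
N_SS_CLASSES = 3              # e.g. helix / strand / coil (assumed class count)

# Shared per-residue representations from the fine-tuned pLM (stand-in values).
residue_emb = rng.normal(size=(SEQ_LEN, EMB_DIM))

# Two task-specific linear heads on top of the shared encoder.
W_allo = rng.normal(size=(EMB_DIM, 1)) * 0.01           # head A: allosteric residue (binary)
W_ss = rng.normal(size=(EMB_DIM, N_SS_CLASSES)) * 0.01  # head B: secondary structure

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

p_allo = sigmoid(residue_emb @ W_allo).ravel()  # (SEQ_LEN,) allosteric probabilities
p_ss = softmax(residue_emb @ W_ss)              # (SEQ_LEN, N_SS_CLASSES)

# Illustrative targets and a weighted joint loss: task A is primary, while
# task B acts as an auxiliary signal that regularizes the shared encoder.
y_allo = rng.integers(0, 2, SEQ_LEN)
y_ss = rng.integers(0, N_SS_CLASSES, SEQ_LEN)

loss_a = -np.mean(y_allo * np.log(p_allo) + (1 - y_allo) * np.log(1 - p_allo))
loss_b = -np.mean(np.log(p_ss[np.arange(SEQ_LEN), y_ss]))
loss = loss_a + 0.5 * loss_b  # 0.5 is an assumed auxiliary weight
print(round(float(loss), 4))
```

In a real fine-tuning run the gradients of this joint loss would flow back into the shared pLM encoder, which is how the secondary-structure task compensates for scarce allosteric labels.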


Diagram: Architecture and methodology overview (Source: Paper)

Specifically, the protein structure and sequence are input into FPocket and the fine-tuned ProtBERT pLM. FPocket extracts pockets with PDB coordinates and a 19-dimensional feature vector. The pLM generates 1024-dimensional residue features, which are aggregated into a single vector per pocket.

These vectors are concatenated into a 1043-dimensional feature vector (19 + 1024) and fed into XGBoost and AutoML classifiers to predict whether a pocket is allosteric.

Evaluation

To assess how pLM features influence prediction performance, the authors compare results with and without pLM features alongside the pocket features. The XGBoost and AutoML predictions agree closely (correlation of 0.9768), and the MTL-based pLM features outperform the other methods.
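Agreement between the two classifiers can be measured as the Pearson correlation of their per-pocket probabilities. The values below are made-up illustrative numbers, not the paper's predictions; only the use of a correlation statistic mirrors the evaluation above.

```python
import numpy as np

# Hypothetical per-pocket allosteric probabilities from the two classifiers.
p_xgb = np.array([0.91, 0.12, 0.78, 0.05, 0.64])
p_automl = np.array([0.88, 0.15, 0.81, 0.07, 0.60])

r = np.corrcoef(p_xgb, p_automl)[0, 1]  # Pearson correlation coefficient
print(round(float(r), 4))
```

A correlation near 1, as reported in the paper (0.9768), indicates the two models rank pockets almost identically.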


Figure: Ranking of allosteric pockets based on AutoML with MTL-pLM features. The top 1%, 3%, 5%, and 10% are shown. 90% of the top-ranked pockets are within the top 10% of predictions.

The results demonstrate that the model effectively ranks true allosteric pockets higher, with the combined MTL-pLM features outperforming pocket-only models. Model confidence exceeds 90.5% for the top three predicted pockets, surpassing previous methods.

Case study: The team tested their model on a protein with a known allosteric pocket (TOXIN B, PDB ID: 3PEE) that was not in the training set. Using the AutoML model with MTL-pLM features, they ranked the pockets by predicted probability and selected the top 3.
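The selection step amounts to sorting pockets by predicted allosteric probability and keeping the top 3. The pocket IDs and probabilities below are made-up illustrative values, not the paper's numbers.

```python
# Rank a structure's FPocket pockets by predicted allosteric probability
# and keep the top 3, mirroring the case-study selection.
predictions = {
    "pocket_1": 0.97, "pocket_2": 0.21, "pocket_3": 0.88,
    "pocket_4": 0.64, "pocket_5": 0.09,
}
top3 = sorted(predictions, key=predictions.get, reverse=True)[:3]
print(top3)  # ['pocket_1', 'pocket_3', 'pocket_4']
```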


Illustration: Predicted allosteric sites: site 1 (red), site 2 (orange), site 3 (purple), and the regulator (green). The top-ranked pockets lie close to the regulator, while the third is farther away, suggesting the model captures the spatial relationship between candidate pockets and the regulator.

The MTL-pLM features provide geometric information that helps distinguish residues near the regulator from those farther away, improving prediction accuracy.

Conclusion

This study demonstrates that fine-tuning pLMs with multitask learning can improve allosteric site prediction beyond existing methods. The approach leverages secondary structure information and large-scale pretraining.

The case study confirms that the model can accurately identify relevant allosteric pockets with high confidence. Future work may involve using more advanced pLMs like ProtT5 to further enhance performance.

Source code: https://github.com/MoaazK/deepallo

Paper link: https://academic.oup.com/bioinformatics/article/41/6/btaf294/8132950
