By Insights Team in AI — 27 Jun 2025

University of Sheffield and AstraZeneca Introduce MapDiff: Precisely Capturing Protein Inverse Folding Structures and Residue Interactions

Researchers from the University of Sheffield and AstraZeneca developed MapDiff, a diffusion-based model for inverse protein folding that captures structural information and residue interactions and outperforms existing methods.

Editor: 萝卜皮

Inverse protein folding (IPF) generates amino acid sequences that can fold into desired 3D structures, enabling applications in therapeutic protein engineering, lead compound optimization, and antibody design.

Despite rapid AI-driven progress, challenges remain, such as predicting residues in regions of high structural uncertainty, for example disordered regions.

To address low-confidence residue predictions, researchers from the University of Sheffield and AstraZeneca proposed MapDiff, a mask-prior guided denoising diffusion framework that accurately captures structural information and residue interactions in proteins.

MapDiff is a discrete diffusion probabilistic model that iteratively generates low-noise amino acid sequences conditioned on given protein backbone structures. Evaluations on four challenging sequence design benchmarks show that MapDiff significantly outperforms state-of-the-art methods. Additionally, the sequences generated by MapDiff exhibit physicochemical and structural features similar to natural proteins across different families and architectures.

This study, titled “Mask-prior-guided denoising diffusion improves inverse protein folding”, was published on June 16, 2025, in Nature Machine Intelligence.

Proteins are complex 3D structures folded from linear amino acid sequences, playing vital roles in metabolism, immune response, and cell cycle regulation.

IPF is a fundamental problem in computational biology and medicine, but current physics-based methods are costly and limited in accuracy, while machine learning approaches struggle with high-uncertainty residues.

MapDiff Architecture

Researchers introduced MapDiff, a mask-prior guided denoising diffusion framework that models the structure-to-sequence mapping as a denoising process, gradually removing noise from amino acid sequences conditioned on structural information.

During denoising, the network iteratively refines noisy, randomly sampled amino acid sequences based on 3D structural data, capturing the complex distribution of native sequences and improving prediction accuracy.
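The iterative refinement described above can be illustrated with a deliberately simplified sketch. It is not the authors' implementation: `denoise_sequence` and the `denoiser` callable are hypothetical names, the structure conditioning is assumed to live inside `denoiser`, and a real discrete diffusion model samples from a learned posterior over the previous step rather than resampling directly from the prediction.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def denoise_sequence(length, denoiser, num_steps, rng):
    """Toy reverse process: start from random residues, refine step by step.

    `denoiser(seq, t)` is a hypothetical callable standing in for the
    structure-conditioned network. It must return, for each position,
    a dict mapping amino acid -> probability.
    """
    # Start from a uniformly random sequence (pure noise).
    seq = [rng.choice(AMINO_ACIDS) for _ in range(length)]
    for t in range(num_steps, 0, -1):
        probs = denoiser(seq, t)
        # Resample every position in parallel from the predicted distribution.
        seq = [
            rng.choices(list(p.keys()), weights=list(p.values()))[0]
            for p in probs
        ]
    return "".join(seq)
```

Passing the random generator in explicitly keeps the sketch reproducible, and all positions are updated at once, which mirrors the non-autoregressive decoding the article mentions.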

It employs an adaptive mask-prior strategy, where low-confidence residues are masked based on entropy and uncertainty, then predicted with a pre-trained masked sequence generator that leverages structural and sequence priors.
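As a rough illustration of entropy-based masking, the sketch below computes the Shannon entropy of each residue's predicted distribution and masks positions above a cutoff. The function names, the `X` mask symbol, and the fixed `threshold` are assumptions made for this example; the paper's exact masking criterion may differ.

```python
import math

MASK = "X"  # placeholder symbol for a masked residue (an assumption)


def residue_entropy(probs):
    """Shannon entropy (in nats) of one residue's predicted distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mask_low_confidence(sequence, per_residue_probs, threshold):
    """Replace residues whose predictive entropy exceeds `threshold` with MASK."""
    masked = []
    for aa, probs in zip(sequence, per_residue_probs):
        masked.append(MASK if residue_entropy(probs) > threshold else aa)
    return "".join(masked)
```

A confident prediction (probability mass on one amino acid) has entropy near zero and is kept, while a flat distribution has high entropy and is handed to the masked sequence generator for re-prediction.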

Figure: Performance comparison and sensitivity analysis across different scenarios on the CATH dataset (source: paper).

Structural and residue interaction information guides the denoising process, reducing errors in low-confidence predictions. The network uses non-autoregressive decoding for efficient sequence generation.

To speed up denoising and improve uncertainty handling, the authors combine the denoising diffusion implicit model (DDIM) with Monte Carlo dropout: DDIM skips multiple denoising steps for faster inference, while dropout kept active at inference yields multiple stochastic predictions, which reduces predictive uncertainty.
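The two ideas in this paragraph can be sketched independently of any model. Both helpers below are hypothetical: `skipped_timesteps` shows a strided timestep schedule of the kind DDIM-style samplers use, and `mc_dropout_average` assumes that stochastic predictions are combined by simple averaging, which is a common choice but not confirmed by the article.

```python
def skipped_timesteps(total_steps, skip):
    """Descending timesteps total_steps, total_steps - skip, ... (all >= 1)."""
    return list(range(total_steps, 0, -skip))


def mc_dropout_average(stochastic_predict, num_samples):
    """Average per-residue distributions over several stochastic passes.

    `stochastic_predict()` is a hypothetical callable that runs the model
    with dropout left active and returns a list of probability lists,
    one list per residue.
    """
    samples = [stochastic_predict() for _ in range(num_samples)]
    length, vocab = len(samples[0]), len(samples[0][0])
    return [
        [sum(s[i][k] for s in samples) / num_samples for k in range(vocab)]
        for i in range(length)
    ]
```

For example, `skipped_timesteps(500, 100)` visits only 5 of 500 steps. The averaged distributions can then be scored per residue to decide which positions are low-confidence.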

Performance Evaluation

MapDiff was compared with leading IPF methods across multiple benchmarks. Results show it consistently outperforms other models, even those incorporating external knowledge, with generated sequences highly similar to natural proteins.

When the generated sequences are folded with AlphaFold2, the resulting 3D structures closely resemble the native proteins, even when sequence recovery is lower.
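Sequence recovery, the metric mentioned here, is commonly defined as the fraction of positions where the designed residue matches the native one. A minimal sketch under that definition (the function name is illustrative, and the sequences are assumed to be aligned position by position):

```python
def sequence_recovery(native, designed):
    """Fraction of aligned positions where the designed residue matches."""
    if len(native) != len(designed):
        raise ValueError("sequences must have the same length")
    if not native:
        return 0.0
    matches = sum(a == b for a, b in zip(native, designed))
    return matches / len(native)
```

A designed sequence can score well below 1.0 on this metric and still fold into the target structure, which is why refolding checks complement recovery.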

Figure: Comparison of structures for proteins with PDB IDs 1NI8, 2HKY, and 2P0X, and their designed sequences (source: paper).

The authors also conducted ablation studies to analyze the importance of different components, demonstrating that MapDiff maintains transferability and robustness even with limited training data.

Future Directions

Future work will explore applying MapDiff to de novo antibody design and protein engineering, integrating structural predictions as external data, incorporating physical constraints, and leveraging sequence evolution knowledge for residue refinement. Folding simulations and molecular dynamics will further validate the designed sequences.

Paper link: https://www.nature.com/articles/s42256-025-01042-6
