--- headline: "TDFold Reimagines Protein Structure Prediction as 2D Diffusion Task" slug: tdfold-protein-structure-2d-diffusion category: research story_number: 13 date: 2026-05-18 ---

# TDFold Reimagines Protein Structure Prediction as 2D Diffusion Task

A team of Chinese researchers has reframed one of biology's grand computational challenges -- predicting the 3D shape of a protein from its amino acid sequence -- as something closer to generating an image. Their method, called TDFold, treats the problem as a two-dimensional diffusion task, and early benchmarks suggest it can outpace heavyweights like AlphaFold2, AlphaFold3 and ESMFold while running on a fraction of the hardware.

Published in April 2026 in Nature Machine Intelligence, the work arrives at a moment when the field of protein structure prediction is simultaneously mature and restless. AlphaFold2 cracked the CASP competition in 2020. AlphaFold3, ESMFold and OmegaFold have since pushed accuracy and usability further. But these models share a common constraint: they either demand expensive multiple sequence alignments (MSAs) drawn from evolutionary databases, or they sacrifice accuracy when forced to work from a single protein sequence alone.

TDFold, developed by Xudong Wang, Tong Zhang, Zhen Cui, Wenming Zheng and colleagues at Southeast University in Nanjing and Beijing Normal University, attacks precisely that gap. The model operates in two stages. First, a 2D geometric template diffusion module -- adapted from the Stable Diffusion architecture originally built for text-to-image generation -- produces high-quality pairwise distance and orientation maps between amino acid residues. These maps look, computationally speaking, like images. Second, a lightweight Sequence-template Co-evolved Learning (SCL) network refines those geometric templates into full three-dimensional coordinates.

The intellectual move is striking: rather than reasoning directly in 3D space, TDFold collapses the problem into a 2D representation that can exploit the enormous efficiency gains the computer vision community has already wrung out of diffusion models. The geometric templates capture inter-residue relationships -- distances and orientations -- in matrix form, making them natural candidates for image-style processing.

Benchmarks That Matter

On standard benchmarks, TDFold delivers competitive or superior results compared with leading single-sequence methods. On CASP14, TDFold achieved a TM-score of 0.73, edging out ESMFold at 0.71 -- a modest but meaningful gain in a field where incremental improvements translate to real structural insight. The model also posted strong results on CASP15 and CASP16 targets.

Where TDFold truly distinguishes itself is on so-called orphan proteins -- sequences with no detectable homologs in existing databases. These are the hardest cases for MSA-dependent tools like AlphaFold2 and RoseTTAFold, which rely on evolutionary signals that simply do not exist for orphan sequences. On the Orphan and Orphan25 benchmark datasets, TDFold achieved state-of-the-art performance, demonstrating that the diffusion-based geometric templates can compensate for the absence of evolutionary context.

Speed and Efficiency

The efficiency numbers are perhaps even more impressive than the accuracy gains. TDFold predicts the structure of a 500-residue protein in roughly 10 seconds. By comparison, ESMFold takes about 100 seconds, AlphaFold3 requires approximately 240 seconds, and AlphaFold2 and RoseTTAFold each need close to 1,000 seconds for the same task.

GPU memory consumption tells a similar story. TDFold occupies approximately 7 GB of GPU memory, compared with 12 GB for AlphaFold2, 16 GB for RoseTTAFold and 20 GB for ESMFold. Perhaps most remarkably, the entire TDFold system -- including fine-tuning the Stable Diffusion backbone and training the SCL network from scratch -- can be completed in one week on a single NVIDIA RTX 4090 GPU. That is consumer-grade hardware, a stark contrast to the cluster-scale resources typically associated with frontier protein folding models.

Why It Matters

The implications extend beyond benchmark tables. A model that requires only a single sequence as input, runs on modest hardware and delivers competitive accuracy could democratize structural biology research in resource-constrained labs. It could also accelerate high-throughput applications in drug discovery and protein engineering, where predicting structures for thousands of novel sequences is routine.

The approach also carries a broader lesson for computational biology: the architectural innovations flowing out of the generative AI boom -- diffusion models, in particular -- are not confined to art and text. By recognizing that pairwise residue geometries share structural properties with images, the TDFold team has opened a bridge between two seemingly distant domains.

"By combining geometric precision, collaborative learning, and efficient resource utilization, TDFold sets a new benchmark for single-sequence-based modeling," the authors wrote, noting that their approach makes "protein structure prediction faster, more accessible, and accurate."

The paper further states that TDFold "presents three key advantages compared with existing protein language models and homology-based methods: better single-sequence-based prediction performance, lower resource consumption and higher efficiency in inference."

Still, limitations remain. TDFold's accuracy on well-studied proteins with abundant homologs does not yet match the best MSA-powered methods. The model is a single-sequence tool by design, and for targets where rich evolutionary data exists, AlphaFold2 and its successors retain an edge. The question going forward is whether hybrid approaches -- combining diffusion-generated templates with evolutionary signals -- can capture the best of both worlds.

For now, TDFold stands as a compelling proof of concept: the language of image generation can speak fluently about the shapes of life's molecular machinery, and it can do so faster and cheaper than many thought possible.

“TDFold presents three key advantages compared with existing protein language models and homology-based methods: better prediction performance, lower resource consumption and higher efficiency.”
— Wang, Zhang, Cui, Zheng et al., Paper authors
~10 seconds
Inference time per protein
7 GB
GPU memory needed
0.73
TM-score on CASP14
1 week
Training on single RTX 4090