GhostFold: Predicting Protein Structure Beyond Evolution's Reach

GhostFold: AI-driven protein structure prediction from a single sequence, bypassing the need for evolutionary data.

Ailurus Press
October 27, 2025

The Evolutionary Bottleneck in a Golden Age of Biology

The field of structural biology has been revolutionized by deep learning, most notably with AlphaFold2 [2]. Its ability to predict protein structures with near-experimental accuracy has accelerated research across life sciences. The success of this and similar models, however, hinges on a critical input: a deep Multiple Sequence Alignment (MSA). By analyzing the co-evolutionary patterns among homologous sequences, these models infer the structural constraints that define a protein's fold.
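
To make "co-evolutionary patterns" concrete, here is a minimal sketch that scores pairs of alignment columns by mutual information, a classic covariation statistic. It is only an illustration: AlphaFold2 learns such couplings implicitly through its network rather than computing this quantity, and the four-sequence alignment and function names below are invented for the example.

```python
import math
from collections import Counter
from itertools import combinations


def column(msa, i):
    """Return the residues found at alignment position i."""
    return [seq[i] for seq in msa]


def mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(msa)
    count_i = Counter(column(msa, i))
    count_j = Counter(column(msa, j))
    count_ij = Counter(zip(column(msa, i), column(msa, j)))
    mi = 0.0
    for (a, b), c in count_ij.items():
        p_ab = c / n
        mi += p_ab * math.log2(p_ab / ((count_i[a] / n) * (count_j[b] / n)))
    return mi


# Toy alignment (invented): column 0 is conserved, column 2 varies on its
# own, and columns 1 and 3 always change together (K with E, R with D).
toy_msa = [
    "AKGE",
    "AKSE",
    "ARGD",
    "ARSD",
]

pairs = list(combinations(range(len(toy_msa[0])), 2))
top_pair = max(pairs, key=lambda p: mutual_information(toy_msa, *p))
print(top_pair, mutual_information(toy_msa, *top_pair))
```

In this toy case the only pair with a nonzero score is the one whose residues change together, which is the kind of signal that hints at two positions being in contact. With no homologs there are no columns to compare, which is the bottleneck discussed next.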

This reliance on evolutionary history creates a fundamental bottleneck. For vast classes of proteins, a meaningful MSA simply does not exist: "orphan" proteins with no known relatives, rapidly evolving regions like antibody loops, and entirely novel de novo designed proteins. Protein language models (pLMs) like ESMFold [3] have emerged as a powerful alternative. They learn structural information directly from massive, unaligned sequence databases, which enables single-sequence prediction. Yet a gap often remains: MSA-based methods are typically more accurate when sufficient homologs are available, while pLMs excel in their absence but may struggle with complex topologies. This dichotomy has left a crucial question unanswered: how can we achieve the highest accuracy for any protein, regardless of its evolutionary context?

A Key Breakthrough: Synthesizing Evolution from Structure

A recent paper from researchers at the Scripps Research Institute introduces GhostFold, a method designed to resolve this dilemma [1]. Instead of choosing between MSA-based and pLM-based approaches, GhostFold forges a third path: it generates a synthetic, structurally consistent pseudo-MSA from a single sequence, effectively providing the powerful AlphaFold2 architecture with the input it needs, even when nature provides none.

The innovation of GhostFold lies in its clever inversion of the typical prediction pipeline. The process can be broken down into three key steps:

  1. From Sequence to Structural Priors: The process begins with a single query sequence. This sequence is fed into a protein language model (ProstT5), which translates the amino acid information into a sequence of structural tokens from the 3Di alphabet, the structural alphabet introduced by Foldseek. This step distills the implicit structural propensities encoded within the language model's parameters into a concrete, albeit coarse, structural representation.
  2. Hallucinating a Structural Family: GhostFold then uses this structural representation as a blueprint. It iteratively performs a "reverse translation," generating a diverse set of new amino acid sequences that are all predicted to fold into the same underlying structure. This is the core of the "hallucination" process, creating a family of sequences that are not related by evolution, but by a shared structural destiny.
  3. Inference with a Pseudo-MSA: This collection of synthetic sequences forms a pseudo-MSA. Despite lacking any genuine evolutionary history, this alignment is rich with the co-evolutionary signals—the statistical couplings between residue positions—that are the hallmark of a stable fold. When this pseudo-MSA is supplied to a structure prediction model like AlphaFold2, the model can leverage its full inferential power, just as it would with a natural MSA.
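
The data flow of the three steps above can be sketched in a few lines of Python. This is a hedged illustration, not GhostFold's implementation: `encode_3di` and `decode_3di` stand in for the ProstT5 translation calls, and the toy versions below use an invented two-letter "structural alphabet" (hydrophobic vs. polar) purely so the example runs without a model. The function names and the default depth of 16 are assumptions for the sketch.

```python
import random

HYDROPHOBIC = "AVILMFWC"
POLAR = "STNQYGPHKRDE"


def toy_encode(seq):
    """Stand-in for sequence -> structure tokens (NOT real 3Di)."""
    return "".join("h" if aa in HYDROPHOBIC else "p" for aa in seq)


def make_toy_decoder(seed):
    """Stand-in for structure tokens -> sequence, sampling at random."""
    rng = random.Random(seed)

    def decode(tokens):
        return "".join(
            rng.choice(HYDROPHOBIC if t == "h" else POLAR) for t in tokens
        )

    return decode


def build_pseudo_msa(query, encode_3di, decode_3di, n_sequences=16):
    """Steps 1 and 2: encode once, then back-translate repeatedly."""
    structure_tokens = encode_3di(query)
    msa, seen = [query], {query}
    attempts = 0
    while len(msa) < n_sequences and attempts < 10 * n_sequences:
        attempts += 1
        candidate = decode_3di(structure_tokens)
        # Keep only new sequences that align position-by-position.
        if len(candidate) == len(query) and candidate not in seen:
            seen.add(candidate)
            msa.append(candidate)
    return msa


def to_a3m(msa):
    """Step 3 input: FASTA-like text with the query as the first record."""
    records = []
    for k, seq in enumerate(msa):
        name = "query" if k == 0 else f"synthetic_{k}"
        records.append(f">{name}\n{seq}")
    return "\n".join(records) + "\n"


query = "MKTAYIAKQRQISFVKSHFS"  # arbitrary example sequence
pseudo_msa = build_pseudo_msa(query, toy_encode, make_toy_decoder(seed=0))
a3m_text = to_a3m(pseudo_msa)
print(a3m_text)
```

Passing the two translators in as arguments keeps the pipeline logic separate from the model: swapping the toy functions for real ProstT5 inference would not change `build_pseudo_msa`. The resulting text would then be handed to an MSA-consuming predictor in place of a natural alignment.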

The results are remarkable. On benchmark datasets of orphan proteins, GhostFold consistently matches or outperforms both traditional MSA-based and leading pLM-based predictors. It shows exceptional strength in predicting notoriously difficult structures, such as complex beta-sheet topologies and amyloid proteins. One of the most compelling demonstrations is its application to antibody modeling, where it dramatically improves the accuracy and consistency of predicting the hypervariable CDRH3 loop—a task that has long been a major challenge in computational immunology.

A Paradigm Shift in Speed, Design, and Discovery

Beyond its accuracy, GhostFold represents a paradigm shift in both efficiency and conceptual approach. The computationally expensive, often hours-long search for homologous sequences is completely eliminated. The entire GhostFold process, from single sequence to final structure, can be completed in minutes. The generated pseudo-MSA is so information-dense that even a small number of sequences (e.g., 16) is sufficient for high-quality predictions, a testament to the quality of the synthesized structural information.

The deeper implication of GhostFold is that the critical information for protein folding is not evolutionary history itself, but the underlying structural and physicochemical constraints that evolution encodes. By learning to generate these constraints synthetically, we are moving from passive data mining to intelligent, active sequence generation.

This opens up exhilarating new possibilities for protein engineering and de novo design. Scientists can now design a completely novel protein sequence and, before committing to expensive and time-consuming wet-lab synthesis, use GhostFold to generate a pseudo-MSA and predict the structure. This allows high-confidence in silico validation of whether the designed sequence is likely to adopt the intended fold. This capability drastically accelerates the Design-Build-Test-Learn (DBTL) cycle that is central to modern synthetic biology. To fully capitalize on this, the next step requires bridging the gap between digital design and physical validation. High-throughput platforms that can rapidly test thousands of designs, such as self-selecting vector systems like Ailurus vec, will be crucial for closing this loop and turning AI-generated hypotheses into functional biomolecules at scale.
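
As a rough sketch of what such an in silico screen might look like, the snippet below reads the per-residue confidence (pLDDT) from a predicted structure and applies a cutoff. It assumes AlphaFold-style PDB output, where pLDDT is stored in the B-factor column. The 70-point threshold, the function names, and the synthetic coordinate lines are assumptions for illustration. Mean pLDDT is only a proxy: confirming the intended fold would also require comparing the prediction against the design target.

```python
from statistics import mean


def mean_plddt(pdb_text):
    """Average the B-factor column (pLDDT) over C-alpha atoms."""
    scores = []
    for line in pdb_text.splitlines():
        # PDB fixed columns: atom name in 13-16, B-factor in 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    if not scores:
        raise ValueError("no C-alpha atoms found")
    return mean(scores)


def passes_screen(pdb_text, threshold=70.0):
    """Assumed rule: keep designs whose mean pLDDT meets the cutoff."""
    return mean_plddt(pdb_text) >= threshold


def ca_line(resseq, plddt):
    """Build a synthetic C-alpha ATOM record (example data only)."""
    return (
        "ATOM  {:5d} {:<4s}{:1s}{:>3s} {:1s}{:4d}{:1s}   "
        "{:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}"
    ).format(resseq, " CA ", "", "ALA", "A", resseq, "",
             0.0, 0.0, 0.0, 1.0, plddt)


confident = "\n".join(
    ca_line(i, s) for i, s in enumerate([90.0, 85.0, 80.0], start=1)
)
uncertain = "\n".join(
    ca_line(i, s) for i, s in enumerate([40.0, 50.0, 60.0], start=1)
)

print(passes_screen(confident), passes_screen(uncertain))
```

Only designs that clear a screen like this would move on to synthesis, which is where the time savings in the DBTL cycle come from.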

GhostFold redefines how we encode structural priors for deep learning models, proving that we can learn the language of protein folding so well that we can begin to write our own evolutionary stories. As this technology matures, it promises to further democratize structural biology and unleash a new wave of innovation in protein design, therapeutics, and beyond.


References

  1. Ruffolo, J. A., Sulam, J., & Gray, J. J. (2024). GhostFold: A fast and accurate single-sequence protein structure predictor that learns a pseudo-multiple sequence alignment. bioRxiv.
  2. Jumper, J., et al. (2021). Highly accurate protein structure prediction with AlphaFold. Nature.
  3. Lin, Z., et al. (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model. Science.

About Ailurus

Ailurus Bio is a pioneering company building biological programs, genetic instructions that act as living software to orchestrate biology. We develop foundational DNAs and libraries, transforming lab-grown cells into living instruments that streamline complex research and production workflows. We empower scientists and developers worldwide with these bioprograms, accelerating discovery and diverse applications. Our mission is to make biology the truly general-purpose technology, as programmable and accessible as modern computers, by constructing a biocomputer architecture for all.

For more information, visit: ailurus.bio