The advent of AI has profoundly reshaped biological research, with models like AlphaFold2 widely credited with solving the 50-year-old protein folding problem and revolutionizing structural biology [2]. These tools have given scientists unprecedented power to predict the three-dimensional structures of proteins from their amino acid sequences. However, this revolution has largely operated within a fundamental constraint: the 20 canonical amino acids that make up nearly all natural proteins. This limitation is akin to writing with an alphabet of only 20 letters, which restricts the complexity and functionality of what can be created. The critical challenge, therefore, has been to expand this "alphabet" to unlock new biochemical functions.
The success of models like AlphaFold2 was built upon vast datasets of natural protein sequences and their experimentally determined structures. By design, their architectures were optimized to model the interplay between the 20 canonical amino acids. While transformative, this focus inadvertently created a barrier to incorporating noncanonical amino acids (NCAAs), a diverse group of over 300 molecules found in nature or synthesized in labs. NCAAs offer attractive properties for therapeutic design, such as enhanced stability against degradation, improved target specificity, and reduced immunogenicity [1]. Yet their unique chemical structures and interaction patterns were foreign to existing AI models, which made computational design and structure prediction for NCAA-containing proteins a near-impossible task. This gap between potential and capability has been a major bottleneck in next-generation protein and peptide engineering.
A recent paper published in Science by Qiuzhen Li, Patrick Bryant, and colleagues introduces RareFold, a deep learning model that directly confronts this limitation [1]. RareFold represents a pivotal step forward, enabling accurate structure prediction and design for proteins that incorporate both canonical amino acids and a diverse set of NCAAs.
The Core Problem: Existing models could not accurately predict the geometry and interactions of NCAAs because they were not part of the training data or the underlying architectural assumptions. This forced researchers into slow, low-throughput experimental screening, stifling innovation.
The Innovative Solution: RareFold's innovation lies in its "tokenization" approach. Drawing inspiration from large language models, it treats each of the 20 canonical amino acids and 29 selected NCAAs as a distinct "token." By building on the Evoformer architecture of AlphaFold2, RareFold learns the atomic interaction patterns and stereochemical constraints specific to each of these 49 residue types. This allows the model to interpret and place NCAAs within a protein structure according to their individual chemical properties rather than treating them as generic unknowns.
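To make the token idea concrete, here is a minimal sketch of a 49-entry residue vocabulary. It is an illustration only and not RareFold's code: the names `VOCAB`, `encode`, and `one_hot` are hypothetical, and because the 29 NCAAs are not listed here, they appear as placeholder names (`NCAA_01` to `NCAA_29`).

```python
# Toy illustration only: not RareFold's actual code or residue list.

# Standard three-letter codes for the 20 canonical amino acids.
CANONICAL = [
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
]

# Placeholder names (assumption): the article does not name the 29 NCAAs.
NCAA = [f"NCAA_{i:02d}" for i in range(1, 30)]

# One shared vocabulary: every residue type gets its own integer id.
VOCAB = CANONICAL + NCAA
TOKEN_TO_ID = {name: idx for idx, name in enumerate(VOCAB)}


def encode(residues):
    """Map residue names to integer token ids; reject unknown residues."""
    try:
        return [TOKEN_TO_ID[r] for r in residues]
    except KeyError as err:
        raise ValueError(
            f"residue {err.args[0]!r} is not in the vocabulary"
        ) from None


def one_hot(token_ids, vocab_size=len(VOCAB)):
    """Turn token ids into one-hot rows, a common input format for a model."""
    rows = []
    for token_id in token_ids:
        row = [0] * vocab_size
        row[token_id] = 1
        rows.append(row)
    return rows


if __name__ == "__main__":
    peptide = ["GLY", "NCAA_03", "LYS"]
    ids = encode(peptide)
    print(ids)
    print(len(one_hot(ids)), "rows of width", len(VOCAB))
```

The point of the sketch is that an NCAA gets its own id instead of collapsing into a single "unknown" class, so a model can learn a separate representation for it.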
Key Results and Validation: The power of RareFold extends beyond prediction. The authors developed EvoBindRare, a framework that uses RareFold for inverse design, that is, designing a sequence to fit a desired structure and function. To validate their approach, they designed novel linear and cyclic peptide binders incorporating NCAAs to target a ribonuclease. Subsequent experimental validation confirmed that these AI-generated peptides achieved micromolar (μM) binding affinity, demonstrating that the model can design functional, novel biomolecules with an expanded chemical vocabulary [1].
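Inverse design can be pictured as a search over sequences guided by a score. The sketch below uses a generic greedy mutate-and-score loop with a toy objective. It is not EvoBindRare's algorithm: `toy_score`, `design`, and the placeholder tokens in `ALPHABET` are all hypothetical, and in a real pipeline the score would come from a structure predictor rather than from comparison with a known target.

```python
import random

# 49 placeholder residue tokens (assumption: stand-ins for real residue types).
ALPHABET = [f"T{i:02d}" for i in range(49)]


def toy_score(seq, target):
    """Stand-in objective: fraction of positions that match a hidden target.

    A real design pipeline would instead score a predicted structure or
    predicted binding; this toy exists only to make the loop runnable.
    """
    return sum(a == b for a, b in zip(seq, target)) / len(target)


def design(length, score_fn, steps=2000, seed=0):
    """Greedy single-point mutation search.

    Start from a random sequence, mutate one position per step, and keep
    the mutation whenever the score does not decrease.
    """
    rng = random.Random(seed)
    seq = [rng.choice(ALPHABET) for _ in range(length)]
    best = score_fn(seq)
    for _ in range(steps):
        pos = rng.randrange(length)
        candidate = seq.copy()
        candidate[pos] = rng.choice(ALPHABET)
        score = score_fn(candidate)
        if score >= best:
            seq, best = candidate, score
    return seq, best


if __name__ == "__main__":
    target = ALPHABET[:8]  # hypothetical "ideal" sequence for the toy score
    seq, best = design(8, lambda s: toy_score(s, target))
    print(seq, best)
```

Because mutations are accepted only when the score holds or improves, the final score is never lower than the starting one. Swapping `toy_score` for a predictor-based score is where the expanded 49-residue alphabet would matter.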
RareFold is more than an incremental improvement; it signals a paradigm shift in protein engineering. By expanding the accessible chemical space for AI-driven design, it opens the door to creating a new generation of peptide therapeutics with superior drug-like properties. The ability to computationally design proteins with enhanced stability and reduced immunogenicity could overcome long-standing challenges in drug development.
However, translating these in silico designs into tangible, validated molecules at scale remains a significant hurdle. The process of synthesizing the corresponding DNA, optimizing expression for novel sequences, and testing thousands of candidates is a major bottleneck. This is where emerging platforms that integrate high-throughput DNA construction and self-selecting expression systems, such as Ailurus vec, could help bridge the gap, accelerating the Design-Build-Test-Learn cycle for these novel proteins.
The open-source release of RareFold democratizes access to this powerful technology, inviting the global research community to build upon it [3]. Future work will likely involve expanding the library of supported NCAAs, refining the model's accuracy, and integrating it into broader automated lab workflows. As these computational design tools mature and couple with high-throughput experimental platforms, we are moving closer to an era of truly programmable biology, where the language of life is no longer limited by nature's alphabet.
Ailurus Bio is a pioneering company building biological programs: genetic instructions that act as living software to orchestrate biology. We develop foundational DNAs and libraries, transforming lab-grown cells into living instruments that streamline complex research and production workflows. We empower scientists and developers worldwide with these bioprograms, accelerating discovery and diverse applications. Our mission is to make biology a truly general-purpose technology, as programmable and accessible as modern computers, by constructing a biocomputer architecture for all.