In modern biotechnology, protein and peptide engineering is a cornerstone for developing novel therapeutics, catalysts, and materials. Deep learning models such as AlphaFold have transformed our ability to predict, and increasingly to design, structures built from the 20 canonical amino acids that form the building blocks of life. Yet this very alphabet, honed by billions of years of evolution, is also a fundamental constraint. To unlock the next frontier of molecular function, we must look beyond nature's vocabulary to the vast, largely untapped chemical space of non-canonical amino acids (ncAAs).
The challenge, however, has been computational. Mainstream protein design tools are trained on and architected for the standard 20 amino acids. They lack the representational capacity and training data to model the unique geometries and chemical properties of hundreds of potential ncAAs. This has created a critical bottleneck: how can we rationally design proteins with components that our best models do not understand?
The journey to solve this problem has been incremental. The advent of deep generative models marked a paradigm shift, but initial diffusion and autoregressive models were still largely tied to the canonical residue space. A significant theoretical breakthrough came with Flow Matching, a framework for training continuous normalizing flows that proved highly effective for generating complex 3D structures like molecules [2]. This innovation paved the way for a new class of models.
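To make the idea concrete, here is a minimal sketch of a conditional flow matching training target, assuming the commonly used straight-line path between a noise sample and a data sample. This is an illustration of the general technique only; the function names are hypothetical, and the path actually used by the models cited here may differ.

```python
import random

def cfm_training_pair(x0, x1, t):
    """Return (x_t, u_t) for the straight-line path x_t = (1 - t) * x0 + t * x1.

    x0: noise coordinates, x1: data coordinates, both lists of [x, y, z].
    u_t = x1 - x0 is the target velocity the network is trained to predict.
    """
    x_t = [[(1.0 - t) * a + t * b for a, b in zip(p0, p1)]
           for p0, p1 in zip(x0, x1)]
    u_t = [[b - a for a, b in zip(p0, p1)] for p0, p1 in zip(x0, x1)]
    return x_t, u_t

def cfm_loss(v_pred, u_t):
    """Mean squared error between predicted and target velocities."""
    sq = [(v - u) ** 2 for pv, pu in zip(v_pred, u_t) for v, u in zip(pv, pu)]
    return sum(sq) / len(sq)

rng = random.Random(0)
x1 = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]]  # toy "data": two atoms, coordinates in Å
x0 = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in x1]  # Gaussian noise sample
t = rng.random()  # random time in [0, 1)
x_t, u_t = cfm_training_pair(x0, x1, t)

# Stand-ins for a network: a perfect predictor and one that always outputs zero.
perfect_loss = cfm_loss(u_t, u_t)
zero_loss = cfm_loss([[0.0, 0.0, 0.0] for _ in x1], u_t)
print(perfect_loss, zero_loss)
```

In a real model, `v_pred` would come from a neural network that sees `x_t`, `t`, and any conditioning such as the receptor structure. At sampling time, the learned velocity field is integrated from noise at t = 0 to a structure at t = 1.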
Early applications like PepFlow [3] and PPFlow [4] demonstrated the power of flow matching for designing all-atom peptides that target specific protein receptors. Concurrently, models such as FlowPacker [5] applied similar techniques to the nuanced task of protein side-chain packing. While powerful, these pioneering efforts primarily focused on the canonical alphabet, leaving the integration of arbitrary ncAAs as an open and formidable challenge. The core issue remained: with ncAA-containing structures representing a mere 0.02% of the Protein Data Bank (PDB), a data-driven approach seemed unfeasible.
A recent preprint by Lee and Kim introduces NCFlow, a generative model that directly confronts this data scarcity and representational challenge to enable the design of peptides with virtually any ncAA [1]. Their work provides a systematic solution by rethinking how the model learns and what it learns from.
At its core, NCFlow abandons the residue-level tokenization common to models like AlphaFold. Instead, it operates directly on an atomic graph representation. This allows it to process any molecular structure, defined by its atoms and bonds, without being restricted to a predefined vocabulary. The model's architecture, inspired by AlphaFold3's Pairformer and Atom Transformer blocks, is designed to learn the language of atomic interactions in 3D space.
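A small sketch can show why an atom-level graph sidesteps the vocabulary problem. The `AtomGraph` class and `node_features` helper below are hypothetical illustrations, not NCFlow's actual data structures, and the atom names are illustrative labels. The example encodes the heavy atoms of alanine and of the non-canonical residue 2-aminoisobutyric acid (Aib), which differs only by a second methyl group on the alpha carbon.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class AtomGraph:
    names: List[str]                   # illustrative atom labels
    elements: List[str]                # chemical element per atom
    bonds: List[Tuple[int, int, int]]  # (atom i, atom j, bond order)

    def neighbors(self, i: int) -> List[int]:
        """Indices of atoms bonded to atom i, in ascending order."""
        out = []
        for a, b, _order in self.bonds:
            if a == i:
                out.append(b)
            elif b == i:
                out.append(a)
        return sorted(out)

def node_features(graph, element_vocab=("C", "N", "O", "S")):
    """One-hot element encoding per atom; the vocabulary is over elements, not residue types."""
    return [[1 if e == v else 0 for v in element_vocab] for e in graph.elements]

# Alanine heavy atoms (canonical residue).
ala = AtomGraph(
    names=["N", "CA", "C", "O", "CB"],
    elements=["N", "C", "C", "O", "C"],
    bonds=[(0, 1, 1), (1, 2, 1), (2, 3, 2), (1, 4, 1)],
)

# Aib heavy atoms (non-canonical): one extra methyl carbon on CA.
aib = AtomGraph(
    names=["N", "CA", "C", "O", "CB1", "CB2"],
    elements=["N", "C", "C", "O", "C", "C"],
    bonds=[(0, 1, 1), (1, 2, 1), (2, 3, 2), (1, 4, 1), (1, 5, 1)],
)

print(ala.neighbors(1))  # atoms bonded to the alanine alpha carbon
print(aib.neighbors(1))  # the Aib alpha carbon has one more neighbor
```

Both residues pass through the same featurization code without any residue-type token, which is the property the atomic representation is meant to provide.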
The central innovation of NCFlow is its three-stage training strategy, which ingeniously circumvents the lack of ncAA data:
This hierarchical training regimen allows NCFlow to build a robust, generalizable understanding of molecular structure that is not over-fitted to the canonical amino acids.
NCFlow is more than just a structure generator; it is the engine of a complete in silico design-and-validation pipeline. The process begins with a known protein-peptide complex and proceeds through a high-throughput workflow:
The results are compelling. NCFlow significantly outperforms AlphaFold3-based methods at predicting the structure of unseen ncAAs, achieving an average RMSD of 1.43 Å compared to over 3.2 Å [1]. More importantly, when applied to four different protein-peptide test cases, the pipeline identified ncAA-containing variants with predicted binding energy changes as favorable as -7.0 kcal/mol, where more negative values indicate tighter predicted binding. Structural analysis of these top hits suggested that the gains were driven by chemically intuitive mechanisms, such as the formation of new polar contacts, π-π stacking interactions, or improved solvent interactions, none of which the original canonical residues could provide.
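For readers less familiar with the metric, the following sketch computes a root-mean-square deviation (RMSD) between two sets of atomic coordinates. It assumes the two structures are already superimposed and that atoms are listed in matching order; the exact alignment and atom selection behind the reported figures are not described here, so treat this as a generic illustration with made-up coordinates.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD in the input units (here Å) between two equally sized, pre-aligned coordinate lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and the same length")
    total = sum((a - b) ** 2
                for pa, pb in zip(coords_a, coords_b)
                for a, b in zip(pa, pb))
    return math.sqrt(total / len(coords_a))

# Toy data: a two-atom "reference" and two hypothetical predictions.
reference = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]]
shifted = [[1.0, 0.0, 0.0], [2.5, 0.0, 0.0]]    # every atom displaced by 1 Å
one_off = [[0.0, 0.0, 0.0], [1.5, 2.0, 0.0]]    # one atom displaced by 2 Å

rmsd_shifted = rmsd(reference, shifted)   # sqrt((1 + 1) / 2)
rmsd_one_off = rmsd(reference, one_off)   # sqrt((0 + 4) / 2)
print(rmsd_shifted, rmsd_one_off)
```

Because the squared deviations are averaged before the square root, a single badly placed atom can dominate the score, which is worth keeping in mind when comparing averages such as 1.43 Å and 3.2 Å.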
The NCFlow paper marks a significant step towards a new paradigm in protein engineering, one in which the design space is no longer confined to a fixed alphabet but is an open, continuous chemical universe. It provides a blueprint for building generative models that can reason about arbitrary molecular structures, a capability with implications far beyond peptide design.
However, challenges remain. The accuracy of scoring functions is still a major hurdle, and the current framework is limited to single-point mutations, leaving the vast combinatorial space of multi-ncAA variants unexplored. The ultimate validation, of course, will come from synthesizing and testing these in silico designs in the wet lab. Validating thousands of such designs at scale will require a new generation of platforms that integrate high-throughput DNA construction and screening, such as those employing self-selecting expression vectors to create a rapid design-build-test-learn cycle. Services that streamline the synthesis of AI-generated DNA constructs will also be critical in closing this loop.
By providing a generalizable and extensible framework, NCFlow paves the way for the computational co-design of peptides and their non-canonical components. It transforms peptide optimization from a trial-and-error process into a systematic engineering discipline, accelerating the discovery of next-generation therapeutics with enhanced potency, stability, and specificity.