Expanding the Proteomic Alphabet with AI and Flow Matching

NCFlow: A flow matching model for designing peptides with non-canonical amino acids, expanding protein engineering beyond nature's 20-amino-acid limit.

Ailurus Press
September 25, 2025

The Bottleneck in a Golden Age of Protein Design

In modern biotechnology, protein and peptide engineering is a cornerstone for developing novel therapeutics, catalysts, and materials. AI models such as AlphaFold have transformed our ability to predict, and increasingly to design, structures built from the 20 canonical amino acids that form the building blocks of life. Yet this very alphabet, honed by billions of years of evolution, is also a fundamental constraint. To unlock the next frontier of molecular function, we must look beyond nature’s vocabulary to the vast, largely untapped chemical space of non-canonical amino acids (ncAAs).

The challenge, however, has been computational. Mainstream protein design tools are trained on and architected for the standard 20 amino acids. They lack the representational capacity and training data to model the unique geometries and chemical properties of hundreds of potential ncAAs. This has created a critical bottleneck: how can we rationally design proteins with components that our best models do not understand?

The Rise of Flow Matching and the Path to NCFlow

The journey to solve this problem has been incremental. The advent of deep generative models marked a paradigm shift, but initial diffusion and autoregressive models were still largely tied to the canonical residue space. A significant theoretical breakthrough came with Flow Matching [2], a simulation-free framework for training continuous normalizing flows that has since proved effective for generating complex 3D structures such as molecules. This innovation paved the way for a new class of models.
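To make the idea concrete, here is a minimal sketch of the flow matching training target, assuming the common straight-line (linear interpolation) path between a noise sample and a data sample. The function names and the toy two-atom coordinates are illustrative assumptions, not NCFlow's actual code, and the "oracle" velocity field simply stands in for a trained network.

```python
import numpy as np

def cfm_training_pair(x0, x1, t):
    """Return the point at time t on the straight path x0 -> x1, and its velocity."""
    x_t = (1.0 - t) * x0 + t * x1
    target_velocity = x1 - x0  # constant along a straight-line path
    return x_t, target_velocity

def cfm_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    return float(np.mean((predicted_velocity - target_velocity) ** 2))

def euler_sample(velocity_fn, x0, n_steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = x0.copy()
    dt = 1.0 / n_steps
    for step in range(n_steps):
        x = x + dt * velocity_fn(x, step * dt)
    return x

rng = np.random.default_rng(0)
x1 = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]])  # toy "data": two atom positions
x0 = rng.normal(size=x1.shape)                      # matching noise sample

x_t, u = cfm_training_pair(x0, x1, t=0.25)

# Hypothetical oracle standing in for a trained model: it returns the true velocity.
oracle = lambda x, t: x1 - x0
generated = euler_sample(oracle, x0)
```

With a perfect velocity field, integrating from the noise sample lands on the data sample. A real model only approximates that field, which is what the regression loss trains it to do.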

Early applications like PepFlow [3] and PPFlow [4] demonstrated the power of flow matching for designing all-atom peptides that target specific protein receptors. Concurrently, models such as FlowPacker [5] applied similar techniques to the nuanced task of protein side-chain packing. While powerful, these pioneering efforts focused primarily on the canonical alphabet, leaving the integration of arbitrary ncAAs as an open and formidable challenge. The core issue remained: with ncAA-containing structures representing a mere 0.02% of the Protein Data Bank (PDB), a purely data-driven approach seemed infeasible.

A Key Breakthrough: NCFlow's Generalizable Design Framework

A recent preprint by Lee and Kim introduces NCFlow, a generative model that directly confronts this data scarcity and representational challenge to enable the design of peptides with virtually any ncAA [1]. Their work provides a systematic solution by rethinking how the model learns and what it learns from.

A Universal Atomic Representation

At its core, NCFlow abandons the residue-level tokenization common to models like AlphaFold. Instead, it operates directly on an atomic graph representation. This allows it to process any molecular structure, defined by its atoms and bonds, without being restricted to a predefined vocabulary. The model's architecture, inspired by AlphaFold3's Pairformer and Atom Transformer blocks, is designed to learn the language of atomic interactions in 3D space.
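As a rough illustration of what an atom-level representation looks like, the sketch below stores a molecule as element symbols, 3D coordinates, and a bond list. The `AtomGraph` class, its fields, and the placeholder coordinates are assumptions made for this example. The preprint's actual featurization is not described here and is likely richer.

```python
from dataclasses import dataclass

@dataclass
class AtomGraph:
    """A molecule as atoms (nodes) and bonds (edges), with no residue vocabulary."""
    elements: list[str]                        # one element symbol per atom
    coords: list[tuple[float, float, float]]   # one (x, y, z) per atom, in angstroms
    bonds: list[tuple[int, int, int]]          # (atom_i, atom_j, bond_order)

    def __post_init__(self):
        if len(self.elements) != len(self.coords):
            raise ValueError("one coordinate triple is required per atom")
        for i, j, _order in self.bonds:
            if not (0 <= i < self.num_atoms and 0 <= j < self.num_atoms):
                raise ValueError(f"bond ({i}, {j}) references a missing atom")

    @property
    def num_atoms(self):
        return len(self.elements)

    def neighbors(self, atom_index):
        """Indices of atoms bonded to atom_index, in ascending order."""
        bonded = []
        for i, j, _order in self.bonds:
            if i == atom_index:
                bonded.append(j)
            elif j == atom_index:
                bonded.append(i)
        return sorted(bonded)

# Toy backbone-like fragment: N, C-alpha, C, O, plus one side-chain carbon.
# Coordinates are illustrative placeholders, not a real geometry.
fragment = AtomGraph(
    elements=["N", "C", "C", "O", "C"],
    coords=[(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.0, 1.4, 0.0),
            (1.2, 2.3, 0.0), (2.0, -0.8, 1.2)],
    bonds=[(0, 1, 1), (1, 2, 1), (2, 3, 2), (1, 4, 1)],
)
```

Because nothing in this structure refers to a residue type, swapping the side-chain carbon for any other chemistry is just a different set of atoms and bonds.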

Overcoming Data Scarcity with Multi-Stage Training

The central innovation of NCFlow is its three-stage training strategy, which works around the lack of ncAA data by learning general chemistry first and ncAA-specific behavior last:

  1. Pre-training on Small Molecules: The model is first trained on millions of small molecule structures from the PubChem3D database. This phase teaches the model the fundamental rules of chemistry and 3D conformational preferences, independent of a protein context.
  2. Pre-training on Protein-Ligand Complexes: Next, NCFlow is trained further on a large dataset of protein-ligand complexes. This step teaches the model how to place a molecular entity within the complex environment of a protein binding pocket, learning the nuances of intermolecular interactions.
  3. Fine-tuning on Native ncAAs: Only in the final stage is the model fine-tuned on the sparse set of native ncAA structures from the PDB. By this point, the model has already learned the general principles of chemistry and binding, allowing it to effectively generalize from the limited examples.

This hierarchical training regimen allows NCFlow to build a robust, generalizable understanding of molecular structure that is not over-fitted to the canonical amino acids.
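The staged schedule can be sketched as a simple curriculum in which each stage starts from the weights left by the previous one. Everything below is hypothetical scaffolding: the `Stage` record, `run_curriculum`, and the stub trainer are assumptions for illustration, and only the three dataset descriptions come from the text above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    dataset: str
    purpose: str

# Order matters: broad chemistry first, scarce ncAA data last.
CURRICULUM = [
    Stage("small_molecules", "PubChem3D small molecules",
          "general chemistry and 3D conformational preferences"),
    Stage("protein_ligand", "protein-ligand complexes",
          "placing a molecule inside a protein binding pocket"),
    Stage("native_ncaa", "native ncAA structures from the PDB",
          "specialising to ncAAs in a peptide context"),
]

def run_curriculum(stages, train_stage, initial_state):
    """Train stage by stage, passing the model state forward each time."""
    state = initial_state
    for stage in stages:
        state = train_stage(state, stage)
    return state

def stub_train_stage(state, stage):
    """Stand-in for a real training loop: it only records which stage ran."""
    return state + [stage.name]

history = run_curriculum(CURRICULUM, stub_train_stage, initial_state=[])
```

The design choice worth noting is that the scarce data is seen last, so it only has to adjust a model that already encodes chemistry and binding rather than teach both from scratch.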

A Systematic Pipeline for Peptide Design and Validation

NCFlow is more than just a structure generator; it is the engine of a complete in silico design-and-validation pipeline. The process begins with a known protein-peptide complex and proceeds through a high-throughput workflow:

  1. Candidate Generation: An in silico deep mutational scan is performed, where each position on the peptide is considered for replacement by a library of commercially available ncAAs.
  2. Conformer Generation: For each proposed mutation, NCFlow generates the most likely 3D conformation of the ncAA within the protein's binding pocket.
  3. Multi-Tiered Scoring: A rigorous filtering cascade is applied to the thousands of generated variants. This involves:
    • Structural Confidence (pLDDT): Filtering for high-confidence predictions from the model itself.
    • AI-based Affinity Prediction (AEV-PLIG): A machine learning model estimates the change in binding affinity.
    • Physics-based Validation (ATM): The most promising candidates are further evaluated using alchemical free energy calculations, a computationally intensive but more physically rigorous method.
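The workflow above can be sketched as candidate enumeration followed by a filtering cascade. This is a toy illustration under stated assumptions: the scoring functions are lookup-table stubs rather than NCFlow, AEV-PLIG, or ATM, the ncAA names `X1` and `X2` are placeholders, and the thresholds (pLDDT of at least 70 on a 0 to 100 scale, predicted change below 0 kcal/mol, top 2 kept) are invented for the example.

```python
def enumerate_single_mutants(peptide, ncaa_library):
    """All single-point substitutions: (position, wild-type residue, ncAA)."""
    return [(pos, wild_type, ncaa)
            for pos, wild_type in enumerate(peptide)
            for ncaa in ncaa_library]

def filter_cascade(candidates, plddt_fn, ddg_fn,
                   plddt_min=70.0, ddg_max=0.0, top_k=2):
    """Cheap filters first; return the top_k candidates for the costly final step."""
    confident = [c for c in candidates if plddt_fn(c) >= plddt_min]
    scored = [(ddg_fn(c), c) for c in confident]
    improved = [s for s in scored if s[0] < ddg_max]  # negative = predicted tighter binding
    improved.sort(key=lambda s: s[0])                 # most negative first
    return improved[:top_k]

# Hypothetical scores keyed by (position, ncAA); stand-ins for real model outputs.
toy_plddt = {(0, "X1"): 85.0, (0, "X2"): 60.0, (1, "X1"): 90.0,
             (1, "X2"): 75.0, (2, "X1"): 50.0, (2, "X2"): 88.0}
toy_ddg = {(0, "X1"): -1.2, (0, "X2"): -3.0, (1, "X1"): 0.4,
           (1, "X2"): -2.5, (2, "X1"): -0.1, (2, "X2"): -0.3}

candidates = enumerate_single_mutants("AKL", ["X1", "X2"])
shortlist = filter_cascade(
    candidates,
    plddt_fn=lambda c: toy_plddt[(c[0], c[2])],
    ddg_fn=lambda c: toy_ddg[(c[0], c[2])],
)
```

In this toy run the variant with the best raw score, `X2` at position 0, never reaches the shortlist because its structural confidence is too low. That is the purpose of ordering the filters from cheapest and broadest to most expensive.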

The results are compelling. NCFlow significantly outperforms AlphaFold3-based methods in predicting the structure of unseen ncAAs, achieving an average RMSD of 1.43 Å compared to over 3.2 Å [1]. More importantly, when applied to four different protein-peptide test cases, the pipeline identified ncAA-containing variants with predicted binding free energy changes of up to -7.0 kcal/mol, where a more negative value indicates tighter predicted binding. Structural analysis of these top hits revealed that the affinity gains were driven by chemically intuitive mechanisms, such as the formation of new polar contacts, π-π stacking interactions, or improved solvent interactions. These are features that the original canonical residues could not provide.
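For readers unfamiliar with the metric, RMSD is the root of the mean squared distance between matched atoms in a predicted and a reference structure. The sketch below computes it directly on the given coordinates, without any superposition step. Whether and how the preprint aligns structures before measuring is not stated here, so treat this as the basic definition only. The coordinates are made up.

```python
import numpy as np

def rmsd(predicted, reference):
    """Root-mean-square deviation over matched atoms, in the input units."""
    predicted = np.asarray(predicted, dtype=float)
    reference = np.asarray(reference, dtype=float)
    if predicted.shape != reference.shape:
        raise ValueError("structures must have the same atoms in the same order")
    squared_distances = np.sum((predicted - reference) ** 2, axis=1)
    return float(np.sqrt(np.mean(squared_distances)))

# Toy example: every atom is displaced by exactly 1 angstrom along x.
reference = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [1.5, 1.5, 0.0]])
predicted = reference + np.array([1.0, 0.0, 0.0])

value = rmsd(predicted, reference)
```

Here each atom is off by 1 Å, so the RMSD is 1 Å. On that scale, the gap between 1.43 Å and more than 3.2 Å corresponds to side-chain atoms sitting, on average, more than twice as far from their experimentally observed positions.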

Broader Impact and the Future of Programmable Peptides

The NCFlow paper marks a significant step towards a new paradigm in protein engineering—one where the design space is no longer confined to a fixed alphabet but is an open, continuous chemical universe. It provides a blueprint for how to build generative models that can reason about arbitrary molecular structures, a capability with implications far beyond peptide design.

However, challenges remain. The accuracy of scoring functions is still a major hurdle, and the current framework is limited to single-point mutations, leaving the vast combinatorial space of multi-ncAA variants unexplored. The ultimate validation, of course, will come from synthesizing and testing these in silico designs in the wet lab. Validating these thousands of designs at scale will require a new generation of platforms that integrate high-throughput DNA construction and screening, such as those employing self-selecting expression vectors to create a rapid design-build-test-learn cycle. Services that streamline the synthesis of AI-generated DNA constructs will also be critical in closing this loop.

By providing a generalizable and extensible framework, NCFlow paves the way for the computational co-design of peptides and their non-canonical components. It transforms peptide optimization from a trial-and-error process into a systematic engineering discipline, accelerating the discovery of next-generation therapeutics with enhanced potency, stability, and specificity.

References

  1. Lee, J.S., & Kim, P.M. (2025). Design of peptides with non-canonical amino acids using flow matching. bioRxiv. https://www.biorxiv.org/content/10.1101/2025.07.31.667780v1
  2. Lipman, Y., et al. (2022). Flow Matching for Generative Modeling. arXiv. https://arxiv.org/abs/2210.02747
  3. Bo-Kyeong, K., et al. (2024). PepFlow: A Multimodal Deep Generative Model for Target-Specific Peptide Design with Flow Matching. arXiv. https://arxiv.org/abs/2406.00735
  4. Zhang, Z., et al. (2024). Target-aware Peptide Design with Torus-based Flow Matching. bioRxiv. https://www.biorxiv.org/content/10.1101/2024.03.07.583831v5
  5. Yim, J., et al. (2024). Protein Side-Chain Packing with Torsional Flow Matching. bioRxiv. https://www.biorxiv.org/content/10.1101/2024.07.05.602280

About Ailurus

Ailurus Bio is a pioneering company building biological programs, genetic instructions that act as living software to orchestrate biology. We develop foundational DNAs and libraries, transforming lab-grown cells into living instruments that streamline complex research and production workflows. We empower scientists and developers worldwide with these bioprograms, accelerating discovery and diverse applications. Our mission is to make biology the truly general-purpose technology, as programmable and accessible as modern computers, by constructing a biocomputer architecture for all.

For more information, visit: ailurus.bio