Abstract
Formulation development of protein biopharmaceuticals has become increasingly challenging due to new modalities and higher target drug substance concentrations. The limited amount of drug substance available during development, coupled with extensive analytical requirements, restricts the number of excipients that can be empirically screened. There is a strong need for in silico tools to optimize excipient pre-selection before wet lab experiments. Here, we introduce Excipient Prediction Software (ExPreSo), a supervised machine learning algorithm that suggests excipients based on the properties of the protein drug substance and target product profile. ExPreSo was trained on a dataset comprising 335 regulatory-approved peptide and protein drug products. Predictive features included protein structural properties, protein language model embeddings, and drug product characteristics. ExPreSo showed good performance for the nine most prevalent excipients in biopharmaceutical formulations and minimal overfitting. A fast variant of ExPreSo using only sequence-based input features showed similar prediction power to slower variants that relied on molecular modeling. Notably, an ExPreSo variant using only protein-based input features also showed good performance, indicating resilience to the influence of platform formulations. To our knowledge, this is the first machine learning algorithm to suggest biopharmaceutical excipients based on the dataset of regulatory-approved drug products. Overall, ExPreSo shows great potential to reduce the time, costs, and risks associated with excipient screening during formulation development.
Introduction
Formulation development is the process of selecting inactive ingredients to add to a drug substance in order to stabilize it for manufacturing, distribution, storage, and patient use. For protein and peptide biopharmaceuticals, these inactive ingredients, known as excipients, must prevent protein degradation and conformational changes that might affect efficacy and safety [1]. From a regulatory and operational point of view, it is desirable that formulations have an assigned shelf-life of at least 12 months at the targeted storage temperature (e.g. 5°C ± 3°C or 25°C ± 2°C at 60 % ± 5 % relative humidity). This has to be enabled by the careful selection of the formulation pH and a limited number of excipients, typically no more than five. The large number of possible excipients and potential liabilities makes formulation development a highly complex field. Biologics license applications also need to outline qualitative and quantitative aspects regarding the use of each excipient contained in the drug product (EMEA/CHMP/QWP/396951/2006).
To simplify the challenges of formulation development, some companies utilize platform formulations as a starting point for their biopharmaceutical products [2], [3]. Typically, this comprises a standard buffer and a predefined set of excipients that are tested against each individual drug substance. This process is combined with pre-screening to select drug substance candidates with a low number of chemical liabilities and regions prone to aggregation, and presumably, compatibility with the preferred platform formulation. Incremental improvements are usually made by exchanging individual excipients with alternatives that have a similar mechanism of action, while keeping the remaining excipients constant. Although this approach saves time and effort [2], it limits the exploration of a broader range of excipients that might be better suited to the therapeutic protein.
There is currently a strong interest in the development of biologic drugs with high drug concentrations [4], [5], [6], [7]. This is driven by the increase in biologics with subcutaneous application [8], which offers many advantages to patients but usually requires a smaller injection volume when compared to intravenous application. Formulation development of high concentration protein drugs is extremely challenging due to problems with aggregation and viscosity/syringeability [5]. Furthermore, there are other market trends that offer challenges in formulation development, such as the increasing development of bispecific antibodies [9], [10] and antibody drug conjugates (ADCs) [11]. As a result, more than ever, formulation design must be tailored to the specific characteristics and liabilities of the drug substance under study. This applies to new drug substances under investigation and to the reformulation of existing products, for instance, transitioning from lyophilized to liquid formulations or changing the route of administration from intravenous to subcutaneous.
In the development of biopharmaceuticals, there are in silico prediction tools supporting many processes before and after formulation development, but few for the stage of formulation development itself. In early drug development of therapeutic proteins, there is a growing number of in silico prediction tools supporting lead development, particularly for antibodies [12], and for developability assessments aiming to select stable candidates [13], [14], [15]. In later stages of biopharmaceutical development, physics-based methods such as digital twins are well established for real-time process optimization [16]. However, within the domain of biopharmaceutical formulation development there is a lack of robust, validated in silico methods for excipient pre-screening. Molecular docking has been explored to identify excipients binding to a protein [17]; however, most docking algorithms are designed to identify strong binders that act as inhibitors, rather than transient interactions.
The long-term stabilization of a drug substance involves complex stochastic interactions of many elements, such as protein-protein interactions, excipient-protein interactions, and excipient-excipient interactions, among others. In order to predict stabilizing excipients using physics-based methods, all these elements need to be modeled simultaneously, over extremely long time scales. Taking all these interactions into account is a computationally intensive task. The computational costs are further increased by the large simulation sizes necessary: for example, most detergents in liquid formulations exist in solution as large micelles, and the simulation size should also be large enough to account for indirect excipient effects such as preferential exclusion (for example by sucrose) [18], [19]. So far, calculations of this type have only been published in an academic context for a handful of excipients [20], [21], [22], [23], [24]. Approaches to improve performance include coarse-grained molecular dynamics (MD) simulations [25], AI-powered MD simulations that utilize machine learning-refined force fields [26], and machine learning models trained to predict the outcome of MD simulations [27].
An alternative strategy to predict excipient binding in silico is to conduct short all-atom MD simulations with the protein surrounded by a high concentration of the desired excipient. This technique, known as fragment mapping, can be used to rank excipients according to their affinity to a protein [28], [29], [30], [31]. Every method that looks at overall protein-excipient interactions relies on the assumption that excipients with increased protein binding are also more likely to increase protein stability. However, it has been shown that stronger excipient binding to the whole protein does not correlate with improved stability [32]. A more targeted strategy would involve screening excipients predicted to bind the exact sites of protein-protein interaction or transient unfolding that lead to aggregation, but these critical regions are usually unknown.
An alternative to physics-based modeling is a knowledge-based approach, particularly the development of machine learning algorithms to predict stable formulations. The primary challenge with such methods lies in data availability, as effective machine learning models require diverse datasets encompassing a large number of excipients and drug substances. While several machine learning algorithms have been created to assist small molecule formulations, such as those involving solid dispersion and cyclodextrin formation [33], [34], no equivalent exists for biologics. Since formulation data generated during drug development are kept confidential, published datasets are very small, and few companies have a sufficiently large and diverse drug substance portfolio to support machine learning development. On the other hand, the final formulations of biopharmaceutical drugs approved by regulatory authorities are publicly accessible. This database of formulations is growing rapidly [7], [35] and has already enabled the first quantitative analyses of stabilizing excipients [7], [35], [36], [37] and of trends over time [3]. To our knowledge, these data remain an untapped resource for the prediction of stabilizing excipients.
This machine learning approach is further empowered by recent advances in computational tools such as AlphaFold2 and protein language models (pLMs). AlphaFold2 is a machine learning algorithm that provided a breakthrough in the de novo prediction of protein structures [38]. These improved structural predictions enable the extraction of more reliable protein properties, which serve as inputs for downstream machine learning applications. pLMs are large language models that have been trained explicitly to predict protein sequences. A byproduct of pLMs is the protein embedding: the vector representation, within the language model, of each amino acid in a sequence [39]. Once a pLM is trained, protein embeddings can be rapidly generated from any input protein sequence. These embeddings encode information about each amino acid and its surroundings in the sequence. Their use has provided a leap forward in predictive power for machine learning tasks such as the prediction of structure, function, and epitopes [40], [41], [42], [43].
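Because a pLM yields one vector per amino acid, proteins of different lengths produce outputs of different sizes, whereas most supervised models expect a fixed-length input. One common way to bridge this gap, shown here as a minimal sketch, is to average the per-residue vectors (mean pooling). This section does not state whether ExPreSo pools embeddings this way; the function name mean_pool and the toy 4-dimensional vectors are illustrative assumptions standing in for real pLM output.

```python
from typing import List


def mean_pool(per_residue: List[List[float]]) -> List[float]:
    """Average per-residue embedding vectors into one fixed-length vector.

    `per_residue` has one inner list per amino acid; every inner list
    must have the same length (the embedding dimension).
    """
    if not per_residue:
        raise ValueError("cannot pool an empty sequence")
    dim = len(per_residue[0])
    if any(len(vec) != dim for vec in per_residue):
        raise ValueError("inconsistent embedding dimension")
    n = len(per_residue)
    # Column-wise mean: one value per embedding dimension.
    return [sum(vec[j] for vec in per_residue) / n for j in range(dim)]


# Toy stand-in for pLM output (assumption): 3 residues, dimension 4.
toy_embeddings = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]
protein_vector = mean_pool(toy_embeddings)
print(protein_vector)
```

The pooled vector has the same length regardless of sequence length, so it can be concatenated with other features such as pH or concentration.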
In this study, we created the Excipient Prediction Software (ExPreSo), a set of machine learning models trained on a database of formulations from 335 approved biopharmaceutical products. Each model predicts the probability that a specific excipient is included in a stable formulation, given inputs such as the drug substance sequence, pH, stock keeping unit (liquid or lyophilized), and drug substance concentration. ExPreSo has predictive power for nine commonly used excipients, offers interpretability regarding the most important predictive features, and can generate results within seconds, enabling its use in early-stage formulation development.
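The "one model per excipient" setup described above can be sketched as a set of independent binary classifiers, each returning a probability for its own excipient. The example below is a minimal illustration under stated assumptions, not the authors' implementation: plain logistic regression stands in for whatever model type ExPreSo uses, the three input features (scaled pH, a lyophilized flag, scaled concentration) are simplified, and the excipient names and training rows are synthetic placeholders rather than data from the 335 approved products.

```python
import math
from typing import Dict, List, Tuple

Model = Tuple[List[float], float]  # (weights, bias)


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(model: Model, x: List[float]) -> float:
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def fit_logistic(X: List[List[float]], y: List[int],
                 lr: float = 0.5, epochs: int = 2000) -> Model:
    """Fit logistic regression by batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba((w, b), xi) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Synthetic placeholder rows (assumption):
# [pH / 10, is_lyophilized, concentration scaled to 0..1]
X = [
    [0.60, 1.0, 0.10],
    [0.55, 1.0, 0.20],
    [0.62, 0.0, 0.90],
    [0.58, 0.0, 0.80],
    [0.50, 1.0, 0.15],
    [0.65, 0.0, 0.95],
]
# One binary label vector per (hypothetical) excipient.
labels: Dict[str, List[int]] = {
    "excipient_A": [1, 1, 0, 0, 1, 0],
    "excipient_B": [0, 0, 1, 1, 0, 1],
}

# Train one independent model per excipient.
models = {name: fit_logistic(X, y) for name, y in labels.items()}

# Query: a new lyophilized, low-concentration product.
query = [0.60, 1.0, 0.12]
probs = {name: predict_proba(m, query) for name, m in models.items()}
for name, p in probs.items():
    print(f"{name}: {p:.2f}")
```

Keeping the models independent means the predicted probabilities need not sum to one, which suits formulations that contain several excipients at once.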
Estefania Vidal-Henriquez, Thomas Holder, Nicholas Franciss Lee, Cornelius Pompe, Mark George Teese, Machine learning driven acceleration of biopharmaceutical formulation development using Excipient Prediction Software (ExPreSo), Computational and Structural Biotechnology Journal, Volume 27, 2025, Pages 4517-4525, ISSN 2001-0370, https://doi.org/10.1016/j.csbj.2025.10.026.