A structured oral formulation database for machine learning: Uncovering data-informed design strategies to facilitate effective formulation development

Abstract

Oral dosage form development still relies heavily on empirical trial-and-error, while the high prevalence of poorly soluble drug candidates increases the need for structured data support. To address this need, we constructed the Computational Pharmaceutics Intelligent Manufacturing Database (CPIMD), a machine learning (ML)-oriented database that integrates physicochemical properties of active pharmaceutical ingredients (APIs), qualitative excipient compositions, release categories, and in vitro dissolution data from 683 marketed oral dosage forms approved by the Pharmaceuticals and Medical Devices Agency (PMDA) of Japan. A standardized workflow for data cleaning, feature encoding, and dissolution profile digitization was used to transform raw information into an ML-ready dataset. Using CPIMD, we applied unsupervised clustering to characterize four major formulation pattern clusters defined by drug properties, release categories, and associated excipient combinations. Together, these clusters form a formulation pattern matrix linking API physicochemical properties, release objectives, and associated functional excipients within the current dataset. A proof-of-concept random forest model further showed that binary excipient features could predict release type with 97.1% test accuracy, supporting the utility of CPIMD for downstream ML applications. CPIMD provides a structured data foundation for predictive modeling, preliminary excipient screening, and data-informed oral formulation development.

Highlights

CPIMD: an oral formulation database structured for ML-oriented drug development.
Unsupervised clustering on CPIMD characterized four major formulation pattern clusters associated with drug properties, release categories, and excipient usage.
Formulation design templates linking drug properties, release goals, and functional excipients were proposed.
CPIMD enables data-informed formulation development and model training.

Introduction

Oral administration remains one of the most common drug delivery routes due to its convenience, high patient compliance, and suitability for large-scale manufacturing (Alqahtani et al., 2021, Bannigan et al., 2020). As a core component of the pharmaceutical industry, the development of oral formulations plays a crucial role in transforming candidate compounds into safe, effective, and quality-controlled medicines. However, this process has long relied on empirical knowledge and iterative trial-and-error experimentation, resulting in prolonged development cycles, high costs, and limited success rates (Gao et al., 2021b, Treherne and Langley, 2021, Yang et al., 2019). Under such an experience-driven paradigm, drug development typically requires 10–15 years and investments of several billion US dollars (DiMasi et al., 2016, Kuentz et al., 2016, Schlander et al., 2021).

Approximately 40% of approved drugs and nearly 90% of drug candidates exhibit poor water solubility (Loftsson and Brewster, 2010, Xie et al., 2024). For Biopharmaceutics Classification System (BCS) class II drugs, dissolution and solubilization are often major determinants of oral absorption, whereas for BCS class IV drugs systemic exposure is influenced by both solubilization-related and permeability-related processes. In this context, dissolution behavior represents an important, but not exclusive, factor affecting the oral performance of poorly soluble compounds, adding complexity to oral formulation development (Tsume et al., 2020, ANayak et al., 2025). In vitro dissolution behavior therefore provides an important formulation-level readout for evaluating release characteristics and potential in vivo performance, while also reflecting the combined influence of drug physicochemical properties, excipient selection, and experimental conditions (Patel et al., 2025, ANayak et al., 2025). However, because dissolution behavior is shaped by multiple interdependent factors, its prediction and control remain major challenges in oral formulation development.

Machine learning (ML) refers to algorithms that learn patterns from data to make predictions or decisions. This capability is particularly valuable for deciphering the complex multivariate relationships inherent in pharmaceutical formulations (Bannigan et al., 2021, Bao et al., 2023, Yang et al., 2019). For example, one study employed multiple ML models on a dataset of nearly 2000 tablet formulations to predict disintegration time, using molecular, physical, and compositional features as inputs; the Sparse Bayesian Learning model achieved a test R² of 0.96 (Ghazwani and Hani, 2025). In another investigation, ML models were trained on 377 direct compression formulations comprising 20 APIs and 80 excipients to predict entire drug release profiles under dynamic dissolution conditions, with random forest achieving a five-fold cross-validation R² of 0.635 (Protopapa et al., 2025). A further study combined design of experiments with artificial neural networks to predict the dissolution kinetics of extended-release tablets, using formulation and process parameters as inputs to model a first-order release constant, achieving a root mean square error of prediction of 0.0011 s⁻¹ (Lourenço et al., 2025). These studies illustrate that ML can support concrete tasks such as formulation classification and performance-oriented prediction, but their broader application still depends on the availability of structured, standardized datasets.
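To make the notion of a first-order release constant more concrete, the following minimal Python sketch assumes the commonly used first-order model, in which the percentage released at time t is 100 × (1 − exp(−k·t)). The function names, the rate constant, and the sampling times are hypothetical values chosen for illustration only; they are not taken from the cited studies, and this is not the modeling procedure those studies used.

```python
import math


def first_order_release(t_seconds, k_per_second):
    """Percent released at time t under an assumed first-order model."""
    return 100.0 * (1.0 - math.exp(-k_per_second * t_seconds))


def rmse(predicted, observed):
    """Root mean square error between two equally long sequences."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical rate constant (s^-1) and sampling times (s), illustration only.
k = 0.0005
times = [0, 600, 1800, 3600, 7200]
profile = [first_order_release(t, k) for t in times]

for t, released in zip(times, profile):
    print(f"t = {t:5d} s  released = {released:6.2f} %")
```

The same `rmse` helper could be applied to predicted and observed rate constants, in which case the error carries the units of k (s⁻¹).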

One of the major bottlenecks in pharmaceutics, however, lies in the lack of high-quality, standardized, and reusable data resources (Bannigan et al., 2021, Bao et al., 2023). For formulation machine learning, robust datasets should ideally contain not only API descriptors and formulation composition, but also quantitative excipient levels, process parameters, batch-level variability, records of unsuccessful formulations, and independent external validation data. In practice, however, much of the currently available information, particularly dissolution profiles and formulation compositions, remains scattered in unstructured formats across regulatory documents, scientific publications, and patents, making it difficult to use directly for model training and validation (Dong et al., 2023, Gao et al., 2021a, Yanes et al., 2025). This fragmentation, together with the absence of unified digitization standards, severely constrains the application of ML to formulation and performance prediction.

To address this challenge, we established the Computational Pharmaceutics Intelligent Manufacturing Database (CPIMD), a structured resource integrating oral dosage forms approved by the Pharmaceuticals and Medical Devices Agency (PMDA) of Japan with active pharmaceutical ingredient physicochemical properties, qualitative excipient compositions, release categories, and in vitro dissolution profiles. The current version encompasses 683 marketed oral formulations, covering 83 active pharmaceutical ingredients, 127 excipients, and 3,419 digitized dissolution profiles measured under various experimental conditions. By organizing heterogeneous regulatory information into standardized formulation-level and profile-level records, CPIMD provides a computable basis for systematic formulation analysis and proof-of-concept machine learning applications. The present work focuses on the construction and characterization of this database, with the clustering analysis and predictive modeling serving as illustrative examples of its potential utility. CPIMD thus provides a useful data foundation for future studies in oral formulation analysis, predictive modeling, and effective formulation development.
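As a rough illustration of what formulation-level and profile-level records with binary excipient features might look like, the following Python sketch uses only the standard library. The class names, field names, product identifiers, and example entries are all hypothetical; the actual CPIMD schema and encoding procedure are not described in this excerpt and may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FormulationRecord:
    """Hypothetical formulation-level record (one per marketed product)."""
    product_id: str
    api_name: str
    release_category: str
    excipients: frozenset  # qualitative composition: names only, no amounts


@dataclass(frozen=True)
class DissolutionProfile:
    """Hypothetical profile-level record (one per product and test condition)."""
    product_id: str
    medium: str
    times_min: tuple
    percent_released: tuple


def encode_excipients(records):
    """Return a sorted excipient vocabulary and a 0/1 presence matrix."""
    vocabulary = sorted({name for r in records for name in r.excipients})
    matrix = [[1 if name in r.excipients else 0 for name in vocabulary]
              for r in records]
    return vocabulary, matrix


# Made-up example entries, not taken from the database.
records = [
    FormulationRecord("P001", "drug_a", "immediate",
                      frozenset({"lactose", "magnesium stearate"})),
    FormulationRecord("P002", "drug_b", "extended",
                      frozenset({"hypromellose", "magnesium stearate"})),
]
profile = DissolutionProfile("P001", "water", (5, 15, 30), (40.0, 75.0, 95.0))

vocabulary, matrix = encode_excipients(records)
print(vocabulary)
print(matrix)
```

Each row of the presence matrix could then serve as the input features for a classifier of release category, in the spirit of the proof-of-concept model mentioned in the abstract.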

Jie Zhou, Conghui Li, Peng Zan, Zengming Wang, Baoqing Wang, Yanpeng Zhao, Xiuli Gao, Aiping Zheng, A structured oral formulation database for machine learning: Uncovering data-informed design strategies to facilitate effective formulation development, International Journal of Pharmaceutics, 2026, 126957, ISSN 0378-5173, https://doi.org/10.1016/j.ijpharm.2026.126957.