Abstract
The development of robust oral tablet formulations remains time-consuming, often limited by scarce data and the difficulty of incorporating categorical formulation variables into predictive models. Traditional regression methods are interpretable but struggle with nonlinear interactions, whereas modern machine learning approaches offer higher predictive power at the expense of transparency.
Highlights
- Word embeddings enable neural networks to encode categorical formulation variables (APIs).
- Empirically-guided output functions ensure physically consistent prediction of tablet properties.
- Predicts tensile strength, density, ejection force, and dosing height from minimal inputs.
- Shapley analysis reveals that low-API formulations provide high informational value.
In this study, we present a neural network framework that employs word embedding layers to represent categorical formulation factors, such as active pharmaceutical ingredients (APIs), as trainable semantic vectors. These embeddings are integrated with empirically-guided output functions and a deep ensemble strategy to predict tablet quality attributes, including tensile strength, density, ejection force, and dosing height, based solely on formulation composition, compression pressure, and tablet weight. The model achieved predictive accuracy comparable to or exceeding that of classical regression while reliably avoiding physically implausible outputs.
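The idea of feeding a categorical factor to a network through an embedding can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the embedding dimension, the function names (`init_embeddings`, `build_input`, `bounded_output`), the input units, and the sigmoid bound are all assumptions chosen for clarity. The abstract does not specify the form of the empirically-guided output functions, so the bounded output below only stands in for the general idea of constraining predictions to a physically plausible range.

```python
import math
import random

# API abbreviations taken from Table 1 of the article.
API_VOCAB = ["ASA", "DCPAA12", "EV", "IBUP", "IBUG", "LPV", "MFM", "AAP", "SoBe"]
EMBED_DIM = 3  # assumption: the source does not state the embedding size


def init_embeddings(vocab, dim, seed=0):
    """Create one small random vector per category.

    In a real model these vectors are trainable parameters updated by
    backpropagation; here they are only initialized.
    """
    rng = random.Random(seed)
    return {name: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for name in vocab}


def build_input(embeddings, api, api_fraction, pressure_mpa, tablet_weight_mg):
    """Concatenate the API's embedding with the continuous inputs.

    The units (MPa, mg) are illustrative assumptions.
    """
    return embeddings[api] + [api_fraction, pressure_mpa, tablet_weight_mg]


def bounded_output(raw, lower=0.0, upper=1.0):
    """Map an unbounded network output into [lower, upper] with a sigmoid.

    Hypothetical stand-in for an output function that rules out
    implausible values, for example a relative density above 1.
    """
    if raw >= 0:
        s = 1.0 / (1.0 + math.exp(-raw))
    else:
        e = math.exp(raw)  # numerically stable branch for negative inputs
        s = e / (1.0 + e)
    return lower + (upper - lower) * s


embeddings = init_embeddings(API_VOCAB, EMBED_DIM)
features = build_input(embeddings, "IBUP", 0.2, 150.0, 200.0)
```

Here `features` holds the three embedding components followed by the three continuous inputs, which is the vector a downstream dense network would consume. Because the embedding is a lookup, adding a new API only adds one row of parameters rather than a new set of one-hot input weights.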
Analysis of the learned embedding vectors revealed meaningful clustering of APIs, enabling transfer learning across materials and robust predictions even for APIs with little or no training data. Furthermore, information gain analysis demonstrated that low-concentration formulations can substantially enhance predictive accuracy, supporting more material-efficient experimental designs. These results highlight embedding-based, empirically-guided neural networks as explainable and practical tools that could accelerate pharmaceutical formulation development.
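One common way to inspect whether learned embeddings cluster is to compare vectors by cosine similarity. The sketch below assumes that approach; the abstract does not say which similarity measure or clustering method the authors used. The two-dimensional vectors are made up for illustration and are not values from the study, and `nearest_neighbor` is a hypothetical helper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbor(name, embeddings):
    """Return the other category whose embedding is most similar to `name`."""
    query = embeddings[name]
    others = {k: v for k, v in embeddings.items() if k != name}
    return max(others, key=lambda k: cosine_similarity(query, others[k]))


# Hypothetical 2-D vectors for illustration only; not results from the article.
toy_embeddings = {
    "IBUP": [0.9, 0.1],
    "IBUP rep.": [0.85, 0.15],
    "MFM": [-0.2, 0.8],
}

closest = nearest_neighbor("IBUP", toy_embeddings)
```

With these toy values, `closest` is `"IBUP rep."`, which mirrors the intuition that two datasets of the same material should end up close together in embedding space. The same lookup suggests how a sparsely characterized API could borrow information from its nearest neighbors.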
Materials
The materials employed are listed in Table 1. Here, the term API refers not only to drug substances with actual pharmacological activity, but also to surrogate compounds that serve solely to introduce diversity in physicochemical properties and process behavior within the database. IBUG represents a pre-granulated grade of ibuprofen, specifically designed for direct compression and containing additional excipients. The material Ibuprofen 50 was included in the study twice (IBUP and IBUP rep.), with both datasets analyzed separately.
Table 1. Excipients and APIs used in tablet formulations.
| Brand name | Chemical Identity | Purpose in the formulation | Abbreviation |
|---|---|---|---|
| – | Acetylsalicylic acid | API | ASA |
| Kollidon® CL-F | Crospovidone | Disintegrant | – |
| DI-CAFOS A 12 | Dibasic calcium phosphate anhydrous | API surrogate | DCPAA12 |
| – | Efavirenz | API | EV |
| Ibuprofen 50 | Ibuprofen | API | IBUP and IBUP rep. |
| Ibuprofen DC85 W | Ibuprofen | API | IBUG |
| FlowLac® 100 | α-lactose monohydrate | Filler | – |
| – | Lopinavir | API | LPV |
| Magnesium stearate Pharma VEG | Magnesium stearate | Lubricant | – |
| – | Metformin | API | MFM |
| VIVAPUR® 101 | Microcrystalline cellulose | Binder | – |
| – | Paracetamol | API | AAP |
| AEROSIL® 150 V | Silicon dioxide | Glidant | – |
| – | Sodium benzoate | API | SoBe |
Najeeb Abdelrahman, Stefan Klinken-Uth, Data-efficient prediction in tableting using word embeddings and empirically-guided neural networks, International Journal of Pharmaceutics: X, 2025, 100458, ISSN 2590-1567, https://doi.org/10.1016/j.ijpx.2025.100458.