Abstract
Creating high-quality datasets for training machine learning models in specialized domains like pharmaceutical research is often constrained by the manual effort required to extract and compute critical parameters from heterogeneous literature. A novel deep prompt-engineering framework was developed to transform GPT-4 into a robust tool for automated and accelerated generation of structured datasets. Using a multi-prompt set strategy, GPT-4 analysed 70 full-text articles from the pharmaceutical inkjet printing literature to extract and compute 22 domain-relevant variables. These variables were organized into three main parameter groups: (i) printing parameters, (ii) rheological properties, and (iii) drug dose parameters, each analysed using dedicated prompts. The outputs were benchmarked against a human-curated dataset compiled over four months by four domain experts, previously used to train machine learning models for predicting inkjet printability. Iterative prompt engineering yielded an overall accuracy of 0.942 across 4,217 individual variable-level data points, with computed variables reaching 0.983 accuracy despite multi-step calculations and unit conversions. Inter-day reproducibility was stable, and sensitivity, specificity, and predictive values all exceeded 0.900. Remarkably, the workflow reduced processing time from hours of human effort to <3.5 min per article. This novel prompt-engineering approach enabled GPT-4 to generate reliable, high-quality, literature-derived datasets, dramatically reducing manual effort while maintaining expert-level accuracy. Ultimately, this strategy facilitates the scalability of machine learning in pharmaceutical and other data-intensive domains.
Introduction
Artificial intelligence (AI) has emerged as a transformative force across scientific disciplines, enabling major advances in diagnostics (Elemento et al., 2021, Ghaffar Nia et al., 2023), drug discovery (Chan et al., 2019, Chen et al., 2023), drug-target interaction (Ye et al., 2021), personalized medicine (Blasiak et al., 2020, Rezayi et al., 2022), pharmacokinetics prediction (Schneckener et al., 2019, Obrezanova et al., 2022) and critical decision-making processes (Loftus et al., 2020), such as drug repurposing (Liu et al., 2021) or clinical trial design (Kavalci and Hartshorn, 2023). Increasingly embedded within scientific workflows, AI systems are not only virtual—enhancing data analysis and computational modelling—but also physical, driving innovations through carebots (Palmer and Schwan, 2022, Yew, 2021), surgical robots (Panesar et al., 2019, Knudsen et al., 2024), and intelligent medical devices (Muehlematter et al., 2021, Benjamens et al., 2020).
Among the wide range of AI systems, Large Language Models (LLMs) have gained considerable attention in recent years. These models, trained on massive corpora of text, exhibit impressive capabilities in generating, interpreting, and reasoning with natural language (Chang et al., 2024). LLMs such as GPT-4 (OpenAI), Gemini (Google DeepMind), Copilot (Microsoft), and DeepSeek-R1 (DeepSeek) have demonstrated proficiency in a wide range of tasks including data summarization (Zhang et al., 2025, Zhang et al., 2024), information retrieval (Ram et al., 2023, Dagdelen et al., 2024), question answering (Jiang et al., 2021, Singhal et al., 2025), and code generation (Jiang et al., 2024). Notably, GPT-4 (and earlier versions) has distinguished itself through its superior contextual reasoning, mathematical competence, and prompt adaptability in domain-specific applications (Carou-Senra et al., 2025).
These capabilities have sparked growing interest in leveraging LLMs for scientific and medical research (Yang et al., 2023). Applications include automated clinical documentation (Mustafa et al., 2025), medical chatbots (Huo et al., 2025) or decision support tools (Giordano et al., 2021). A common thread across these use cases is the LLMs’ ability to transform unstructured information from the scientific literature into structured, actionable knowledge—tasks traditionally requiring extensive domain expertise and manual effort.
This capacity is particularly significant given a persistent and well-recognized bottleneck in the development of effective machine learning (ML) models in many specialized scientific fields: the limited availability of high-quality, domain-specific datasets. While several research areas, such as biological fields, benefit from mature, publicly available datasets and established data standards, others remain data-scarce and highly dependent on manual literature curation. In these domains, scientific datasets often require manual construction from a vast and heterogeneous body of literature rather than direct mining from public repositories or web-scale resources. This task is complicated by diverse reporting: information is often buried in narrative text, dispersed across tables, figures and supplementary files, and expressed in inconsistent units or jargon. Moreover, reporting conventions vary widely even within the same field, making standardization and interpretability more difficult. The lack of a unified data schema or machine-readable format further hinders systematic extraction, forcing researchers to rely on labour-intensive, expertise-dependent curation workflows that are difficult to scale and reproduce.
Pharmaceutical additive manufacturing (AM) exemplifies this challenge as an emerging research field that is progressively moving toward data-centric ML approaches. Numerous 2D and 3D printing technologies have emerged as promising tools in the pharmaceutical field, particularly in the context of personalized medicine. Techniques such as semisolid extrusion (SSE), fused deposition modelling (FDM), and inkjet printing enable the design of unique dosage forms (He et al., 2025, Wang et al., 2022), patient-specific dose adjustments (Carou-Senra et al., 2025, Arafat et al., 2018), improved treatment adherence (Rodríguez-Pombo et al., 2024, Alzoubi et al., 2023), and even the combination of multiple drugs within a single formulation (Khaled et al., 2015, Korsgaard Andreasen et al., 2025). AI systems, particularly ML models, have already been applied to the optimization of diverse AM technologies. Models have been trained and employed for predicting printing outcomes (Carou-Senra et al., 2023, Elbadawi et al., 2020), optimizing drug dosage form designs (Rezapour Sarabi et al., 2022, Shi et al., 2018), achieving desired release profiles (Castro et al., 2021, Mazur et al., 2023), tuning microstructures to reach target mechanical properties (Gan et al., 2019), and reducing energy consumption during the 3D printing process (Garg et al., 2016). However, developing such models requires structured datasets that capture a complex array of interdependent variables—including printing process parameters, calculated operational values, and physicochemical properties of formulations. These data points are essential for model accuracy yet are typically buried in unstructured and inconsistently reported formats across the literature. Their extraction, computation and normalization demand not only significant human effort, but also deep domain expertise, making large-scale dataset generation particularly challenging in this field.
This study aims to develop and validate a methodological framework for structured dataset generation through data extraction and computation from full-text scientific articles using LLMs, with a specific application to pharmaceutical inkjet printing literature. Rather than assessing GPT-4 as a standalone extraction tool, the study focused on the design, implementation, and evaluation of a structured multi-prompt set approach. Using 70 heterogeneous peer-reviewed articles, the framework was evaluated by comparing the LLM-generated dataset with a human-curated reference dataset previously developed by four domain experts over four months and used to train ML models predicting inkjet printing outcomes (Carou-Senra et al., 2023). A central objective was to quantify how iterative prompt refinement, variable grouping, and decision-tree-based logic influence extraction reliability, progressing from a single baseline prompt to a fully structured multi-prompt architecture.
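The study does not publish its prompts or code, but the grouped, multi-prompt architecture it describes—one dedicated prompt per parameter group, applied to each full-text article—can be sketched as follows. The group names mirror the three groups named in the abstract; the variable names inside each group and the prompt wording are purely illustrative, not the study's actual schema.

```python
def build_prompts(article_text: str) -> dict[str, str]:
    """Build one dedicated extraction prompt per parameter group.

    Mirrors the multi-prompt set approach described in the text: rather than
    asking for all 22 variables at once, each article is queried three times,
    once per parameter group. Variable names below are illustrative examples.
    """
    groups = {
        "printing_parameters": ["nozzle diameter", "drop volume", "print frequency"],
        "rheological_properties": ["viscosity", "surface tension", "density"],
        "drug_dose_parameters": ["drug loading", "dose per printed unit"],
    }
    prompts = {}
    for group, variables in groups.items():
        prompts[group] = (
            f"From the article below, extract or compute the following variables: "
            f"{', '.join(variables)}. Report each value with its unit, convert to "
            f"SI units where needed, and answer 'not reported' if absent.\n\n"
            f"ARTICLE:\n{article_text}"
        )
    return prompts
```

Each of the three prompt strings would then be sent to the LLM separately, and the three structured responses merged into one row of the dataset per article.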
The study was systematically assessed in terms of accuracy to quantify the impact of prompt refinement on extraction quality. Additionally, the evaluation included metrics of sensitivity, specificity, positive predictive value and negative predictive value. Time efficiency and inter-day reproducibility were also assessed to benchmark the consistency and practical feasibility of this new methodology for GPT-4 dataset generation. Together, these analyses aim to demonstrate that robust extraction and computation from full-text articles can be achieved with LLMs through methodological design to create high-quality datasets, while reducing time, effort and cost.
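The evaluation metrics listed above are standard confusion-matrix quantities computed over variable-level comparisons against the expert-curated reference. As a minimal sketch (the study's own evaluation code is not published, and the counts used below are hypothetical, not the study's data):

```python
def extraction_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Compute extraction-quality metrics from variable-level confusion counts.

    tp: variables extracted/computed correctly (match the expert dataset)
    fp: variables reported by the model but incorrect
    tn: variables correctly identified as not reported in the article
    fn: variables missed or wrongly marked as absent
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "ppv": tp / (tp + fp),           # positive predictive value
        "npv": tn / (tn + fn),           # negative predictive value
    }

# Hypothetical counts for illustration only:
metrics = extraction_metrics(tp=3500, fp=120, tn=480, fn=117)
```

In the study's setting, these counts would be tallied across the 4,217 variable-level data points before computing each metric.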
Paola Carou-Senra, Lucía Rodríguez-Pombo, Carmen Alvarez-Lorenzo, Alvaro Goyanes, Accelerating dataset generation for machine learning using large language models: a pharmaceutical additive manufacturing case, International Journal of Pharmaceutics, 2026, 126587, ISSN 0378-5173, https://doi.org/10.1016/j.ijpharm.2026.126587.