Automating Data Extraction from Scientific Literature and General PDF Files Using Large Language Models and KNIME: An Application in Toxicology

Authors: José T. Moreira-Filho, Dhruv Ranganath, Ricardo S. Tieghi, Robert Patton, Vicki Sutherland, Charles Schmitt, Andrew A. Rooney, Vickie Walker, Jennifer Fostel, Trey Saddler, David Reif, Kamel Mansouri, and Nicole Kleinstreuer

DOI: https://doi.org/10.22427/NTP-DATA-502-002-002-000-1


Publication


Abstract

The increasing volume of scientific publications and reports presents a challenge in accessing and utilizing data due to their unstructured nature. Toxicology, in particular, depends on structured data for study evaluation, weight-of-evidence chemical assessments, and validation of new approach methodologies (NAMs). Manual data extraction is labor-intensive and fails to meet the demand for structured information. This work presents an automated data extraction workflow using large language models (LLMs) within the KNIME platform. The workflow integrates document parsing tools with LLMs to extract variables from scientific publications and general PDF files. Two execution modes are available: text mode and image mode. Text mode applies tools for extracting text and tables, while image mode uses multimodal LLMs to process non-linear layouts and graphical content. The workflow achieves 81.14% accuracy in text mode for scientific publications and up to 98.54% in image mode for general PDF files. The KNIME platform ensures accessibility through a user-friendly interface, allowing non-experts to use advanced data extraction methods. This automated approach facilitates toxicological research by improving the retrieval of structured data. By democratizing access to LLM-powered workflows, this approach paves the way for significant advancements in knowledge synthesis to support biomedical research.

Supplementary Material


Supporting Information

Scientific Publications

General PDFs