Research Paper Volume 15, Issue 18 pp 9293—9309

Biomedical generative pre-trained based transformer language model for age-related disease target discovery

Diana Zagirova1, Stefan Pushkov1, Geoffrey Ho Duen Leung1, Bonnie Hei Man Liu1, Anatoly Urban1, Denis Sidorenko1, Aleksandr Kalashnikov2, Ekaterina Kozlova1, Vladimir Naumov1, Frank W. Pun1, Ivan V. Ozerov1, Alex Aliper1,2, Alex Zhavoronkov1,2

  • 1 Insilico Medicine Hong Kong Ltd., Hong Kong Science and Technology Park, New Territories, Hong Kong, China
  • 2 Insilico Medicine AI Limited, Level 6, Unit 08, Block A, IRENA HQ Building, Masdar City, Abu Dhabi, UAE

Received: June 15, 2023       Accepted: August 20, 2023       Published: September 22, 2023      

https://doi.org/10.18632/aging.205055
Copyright: © 2023 Zagirova et al. This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abstract

Target discovery is crucial for the development of innovative therapeutics and diagnostics. However, current approaches often face limitations in efficiency, specificity, and scalability, necessitating the exploration of novel strategies for identifying and validating disease-relevant targets. Advances in natural language processing have provided new avenues for predicting potential therapeutic targets for various diseases. Here, we present a novel approach for predicting therapeutic targets using a large language model (LLM). We trained a domain-specific BioGPT model on a large corpus of biomedical literature consisting of grant text and developed a pipeline for generating target predictions. Our study demonstrates that pre-training the LLM on task-specific texts improves its performance. Applying the developed pipeline, we retrieved prospective aging and age-related disease targets and showed that these proteins are consistent with existing database records. Moreover, we propose CCR5 and PTH as potential novel dual-purpose anti-aging and disease targets; neither was previously identified as age-related, but both were highly ranked by our approach. Overall, our work highlights the high potential of transformer models in novel target prediction and provides a roadmap for future integration of AI approaches to address the intricate challenges of the biomedical field.

Abbreviations

AI: Artificial Intelligence; GO: Gene Ontology; HGNC: HUGO Gene Nomenclature Committee; iPSC: Induced pluripotent stem cell; LLM: Large language model; NLTK: Natural Language Toolkit.