Identification of combined biomarkers for predicting the risk of osteoporosis using machine learning

Osteoporosis is a severe chronic skeletal disorder that affects older individuals, especially postmenopausal women. However, molecular biomarkers for predicting the risk of osteoporosis are not well characterized. The aim of this study was to identify combined biomarkers for predicting the risk of osteoporosis using machine learning methods. We merged three publicly available gene expression datasets (GSE56815, GSE13850, and GSE2208) to obtain expression data for 6354 unique genes in postmenopausal women (45 with high bone mineral density and 45 with low bone mineral density). All machine learning methods were implemented in R, with the GEOquery and limma packages, for dataset download and differentially expressed gene identification, and a nomogram for predicting the risk of osteoporosis was constructed. We detected 378 significant differentially expressed genes using the limma package, representing 15 major biological pathways. The performance of the predictive models based on combined biomarkers (two or three genes) was superior to that of models based on a single gene. The best predictive gene set among two-gene sets included PLA2G2A and WRAP73. The best predictive gene set among three-gene sets included LPN1, PFDN6, and DOHH. Overall, we demonstrated the advantages of using combined versus single biomarkers for predicting the risk of osteoporosis. Further, the predictive nomogram constructed using combined biomarkers could be used by clinicians to identify high-risk individuals and in the design of efficient clinical trials to reduce the incidence of osteoporosis.


INTRODUCTION
As a common health threat, osteoporosis is characterized by reduced bone mineral density (BMD) and bone architecture deterioration, consequently weakening the bones and conferring a higher fracture risk [1]. Currently, osteoporosis is not considered to be only a natural phenomenon occurring in older women, as it occurs throughout all stages of life, regardless of age or sex [2].
Various genetic components and environmental factors may contribute to the pathogenesis of osteoporosis [3,4]. Preventing low BMD during early menopause is a crucial concern for decreasing the risk of osteoporosis [5]. Age and obesity have been proposed as risk factors of fracture, and clinical nomograms have been constructed including these factors to predict the risk of osteoporosis and fracture [6,7]. Further, BMD is a significant factor of fracture risk, and is thus widely used in clinical practice as an indicator of osteoporosis [8][9][10]. However, the detailed pathogenesis of osteoporosis has yet to be elucidated, and there is still no effective therapeutic strategy. Identifying a novel therapeutic target for osteoporosis may help to establish a new therapeutic strategy [8]. Toward this end, microarray gene expression analysis could be used to identify essential targets and related signaling pathways involved in the pathogenesis of osteoporosis [11].
Artificial intelligence (AI) simulates human intelligence using machines, especially computer systems. AI can be used to analyze and improve the predictive performance of models in various research areas. Machine learning (ML), a major branch of AI, has been used in conjunction with bioinformatic functional analysis to identify predictive markers of osteoporosis [12,13]. Kim et al. [14] and Shim et al. [15] developed machine learning models to accurately identify the risk of osteoporosis in postmenopausal women.
In the current study, we analyzed public gene expression data related to osteoporosis to identify putative combined biomarkers for the prediction of osteoporosis risk. We also performed functional annotation of osteoporosis-related genes to establish a systematic approach to discover new molecular targets for the treatment of osteoporosis. Further, we constructed a nomogram for the prediction of osteoporosis risk in clinical practice, which can be used as an objective guideline for assessing a high risk of osteoporosis. We further anticipate that the identification of individuals at high risk of osteoporosis will allow for the design of more efficient therapeutic trials to ultimately reduce the incidence of osteoporosis.

RESULTS
We merged three microarray gene expression datasets from postmenopausal women with high and low BMD (GSE56815, GSE13850, and GSE2208) (see Materials and Methods for details). The merged dataset contained gene expression data for 6354 genes from 45 postmenopausal women with high BMD and 45 postmenopausal women with low BMD. We compared the predictive accuracies of the ML algorithms for predicting the risk of osteoporosis using the identified combined biomarkers. The study overview is schematically shown in Figure 1.

Identification of differentially expressed genes
We used the limma package (see Materials and Methods) to detect 378 differentially expressed genes between the high and low BMD groups. The expression patterns of 6354 genes and the identified 378 differentially expressed genes are shown in Figure 2A, 2B, respectively.

Gene ontology (GO) analysis of differentially expressed genes
We next performed GO term annotation and pathway enrichment analysis of the differentially expressed genes at the functional level using the Database for Annotation, Visualization and Integrated Discovery (DAVID) (https://david.ncifcrf.gov/) tool. The results are summarized in Table 1.

Identification of combined predictive markers of osteoporosis risk
To select the optimal number of combined biomarkers for risk prediction, we tested random sets of 1-5 genes, with 1000 replicates, and evaluated associations between the number of genes and predictive model accuracy ( Figure 3).
The prediction accuracy indicates the probability of concordance between predicted and observed responses. The accuracy increased with an increasing number of combined genes. We focused on identifying a prediction model based on the least number of genes and selected a two-gene set for further analysis. Among the various two-gene sets, the set including PLA2G2A and WRAP73 showed the highest accuracy, at almost 0.9. Ten combined biomarker sets of two or three genes each identified by simulations shown in Figure 3 are listed in Table 2.

Performance comparison of risk predictive models for osteoporosis
We then compared the performance of the risk predictive models for osteoporosis using different ML  algorithms. For this experiment, the dataset was randomly split into training (70% of data) and testing (30% of data) datasets. Random dataset spilt was processed repeatedly 100 times, and the model performance is summarized according to mean values and standard deviations calculated for all processing cycles in Table 3. We compared the performance of two model types: one predicting the risk probability of osteoporosis based on a single gene and the other predicting risk based on combined biomarkers (two or three genes). The performance of models based on combined biomarkers was superior to that of models based on single genes (Table 3). For single-gene models, the predictive accuracies were 0.667-0.999 with the training dataset and 0.603-0.662 with the testing dataset. For models based on combined biomarkers, random forest (RF) was the best-performing model with the training dataset (accuracy = 1.0). Performances with the test dataset tended to depend on the combined biomarkers used. Although RF exhibited the best performance with the training dataset, its performance with the testing dataset was worse than that of other models. When two genes were considered in a model, the best predictive gene set was PLA2G2A and WRAP73. When three genes were considered, the best predictive gene set was LPN1, PFDN6, and DOHH.

Nomogram construction
A nomogram was constructed using the gene set of PLA2G2A and WARP73 ( Figure 4A), as the best combination of two genes for predicting the risk of osteoporosis ( Table 3).
The risk probability of osteoporosis increased when the calculated point total decreased. For the point total of 95, the risk probability was 9.2% and for the point total of 40, the risk probability was 95% ( Figure 4A). For practical application, we constructed the nomogram in Table 2. Overview of the 10 sets of combined genes (two or three genes) tested. Hypertext Markup Language (HTML) format and populated it with the calculated total scores and probabilities ( Figure 4B). The calculated point total can also be used for stratification according to the risk probability of osteoporosis.

DISCUSSION
In this study, we analyzed merged microarray datasets of gene expression in postmenopausal women with high and low BMD. Using ML methods, we identified PLA2G2A and WRAP73 as the optimal combined biomarker set for predicting the risk of osteoporosis, and constructed a related nomogram for practical use. The devised nomogram will help clinicians to identify patients at high risk of osteoporosis, allowing timely treatment or a prevention strategy. Further, the obtained data provide insights into the development of osteoporosis.
In addition to biomarker set identification, the current study sheds light on the molecular processes involved in osteoporosis. We identified 378 genes that are significantly differentially expressed in postmenopausal women with high and low BMD. These include APOE, PRKAA1, and MAP3K1, which were previously reported to be associated with Alzheimer's disease (AD) [16][17][18]. As a common degenerative disease, AD and osteoporosis mainly occur in the elderly population [19,20]. Woodman [21] and Xiong et al. [22] reported that both decreased BMD and fractures are common phenomena in AD patients, and that AD target genes might be risk factors for osteoporosis. These observations coincide with the osteoporosis-related genes identified herein, which could also be AD biomarkers.
Among the combined gene sets identified in this study (Table 2), PLA2G2A and WRAP73 constituted the optimal identified biomarker set. PLA2G2A is associated with osteosarcopenia, and PLA2G2A overexpression is reported to be a valuable finding for the clinical management of sarcopenia in elderly women with osteoporosis [23,24]. Further, PLA2G2A influences osteoclastic bone resorption by facilitating the production of prostaglandin, a key modulator of bone remodeling [25,26]. WRAP73 encodes a member of the WD repeat protein family that is implicated in many  essential biological functions and pathological processes. Specifically, BIG-3, Wdr5, and Wdr8, members of the WD repeat protein family, have been implicated in osteoblast differentiation and osteogenesis in vivo [27][28][29][30]. We therefore propose that PLA2G2A and WRAP73 may influence the development of osteoporosis by regulating bone remodeling.
Considering the other identified genes, targeting OXTR, which encodes a protein that mediates anabolic skeletal recovery [31,32], has been suggested as a possible therapeutic target for osteoporosis and obesity. Indeed, Tamma et al. [33] reported that high levels of circulating oxytocin can activate osteoclast OXTR to prevent bone resorption by mature osteoclasts. Another differentially expressed gene, FURIN, was reported as a hub gene in postmenopausal women with low BMD, as indicated by the analysis of regulatory patterns of genes potentially associated with osteoporosis risk uncovered using Bayesian network analysis [34]. PDGFB, another gene identified herein, encodes a well-known growth factor required for various crucial biological processes such as embryonic development [35]. In mouse models, bone strength was increased under hematopoietic stem cellbased PDGFB therapy [36]. Moreover, as a homodimer of PDGFB, plasma PDGF-BB levels are maintained by estrogen in healthy young women and play a major role in postmenopausal osteoporosis [37]. Furthermore, several studies highlighted PDGFB as possible therapeutic target for osteoporosis [38][39][40]. Finally, Ye et al. [41,42] reported that PAFAH1B1 (LIS1), another gene identified to be differentially expressed in the current study, can promote osteoclastogenesis via regulating both the differentiation and survival of osteoclast progenitors.
In the current study, phosphorylation was the pathway that was the most significantly enriched in osteoporosisrelated genes, according to the GO DAVID annotation. Phosphorylation plays an essential role in bone metabolism in humans. For instance, phosphorylation of extracellular bone matrix proteins has been suggested as a risk factor of bone fragility [43]. Further, osteopontin, one of the key representative phosphoproteins in the bone matrix, is considered an early diagnostic biomarker of osteoporosis in postmenopausal women [44].
Although ML methods easily identify trends and patterns in a dataset, they require massive datasets to train on. Considering the small sample size, the first limitation of the current study is that the dataset used does not represent the entire population of individuals with osteoporosis. A model trained on a random sample of a dataset might have poor generalizability and perform poorly outside of that sample. Indeed, the use of large training and testing sets yields predictions that are more accurate and reliable than those obtained using small datasets [45]. The second limitation of the current study is that the subjects were all postmenopausal women. Osteoporosis and its major complication, osteoporotic fracture, affect both men and women, and cause substantial morbidity and mortality worldwide. Although the risk of osteoporosis is higher for women than for men, men suffer greater morbidity and mortality rates following osteoporotic fractures, especially at an advanced age, than women [46].
We presented the implemented a prediction model in the form of a nomogram, i.e., a graphical representation of a statistical model that indicates the probability of a particular clinical outcome. The nomogram constructed herein can serve as an objective guideline for the assessment of osteoporosis in high-risk individuals. The identification of individuals at high risk of osteoporosis would facilitate the design of efficient clinical trials to reduce the incidence of osteoporosis. The constructed nomogram could be used as a test version. As the next step, predictive model incorporating clinical variables and based on specific gene expression in a large dataset relevant to an aging population should be devised.

Data preparation
Three publicly available gene expression datasets for blood monocytes (GSE56815, GSE13850, and GSE2208) were used in the current study. These datasets are accessible from the public Gene Expression Omnibus (GEO) microarray database (https://www.ncbi.nlm.nih.gov/geo/). GSE56815 and GSE13850 each consist of data for 20 postmenopausal women with high BMD and 20 postmenopausal women with low BMD (for a total of 22,283 probes). GSE2208 includes data for five postmenopausal women with high BMD and five postmenopausal women with low BMD (for a total of 8623 probes). The three datasets were acquired using the same platform, GPL96 Affymetrix GeneChip Human Genome U133 Array Set. The GSE13850 and GSE56815 datasets contain expression data for 13,516 unique genes, and GSE2208 contains expression data for 8623 unique genes. Since these expression datasets contain duplicated gene symbols, mean expression values for duplicated genes in each dataset were combined and used for downstream analysis. The three datasets were merged according to gene name, to obtain a combined dataset for 45 postmenopausal women with high BMD and 45 postmenopausal women with low BMD with a total of 6354 gene symbols. The study design is shown in Figure 1.

GO functional annotation
The GO [47] project is a model structured to address the molecular function, biological process, and cellular component for individual genes by large-scale gene annotation. DAVID is a comprehensive functional annotation tool to provide a functional interpretation of a large gene set derived from genomic studies by a clustering algorithm [48]. In the current study, the DAVID online tool was used for pathway enrichment analysis of differentially expressed genes.
A modified Fisher's exact test p-value was used in this study. DAVID can be used to examine thousands of gene sets and test multiple hypotheses. DAVID provides a Benjamini-Hochberg false-discovery rate (FDR)-adjusted p-value, with a smaller p-value indicating more significant enrichment. The p-value and Benjamin-Hochberg FDR were both used to determine the significance of term enrichment for each annotation.

ML methods
The following ML algorithms were used in the current study [49].

Linear discriminant analysis (LDA)
LDA is a generalization of Fisher's linear discriminant, which is used to find a linear combination of features that characterize or separate two or more classes of objects or events. The resulting combination might be used as a linear classifier or for dimensionality reduction before classification. LDA is a dimensionreduction technique that is commonly applied to supervised classification problems. It is mainly used to model differences between groups (i.e., separating two or more groups from each other) [50].

k-Nearest neighbors (KNN) algorithm
The KNN algorithm is one of the simplest techniques used in ML [9], which is used for both classification and regression. The KNN algorithm works by finding the distance between data points based on the Euclidean distance. KNN computes the distance between each data point and the test data and then calculates the probability that the points are similar to the test data, finally classifying the data points based on the shared highest probabilities [51].

Support vector machine (SVM)
As one of the representative supervised learning models for pattern recognition and data analysis, SVM is mainly used for classification and regression analysis [52]. The objective of the SVM algorithm is to find a hyperplane in N-dimensional space (where N is the number of features) that distinctly classifies data points. Support vectors are data points that are closer to the hyperplane, which influence the position and orientation of the hyperplane. Using these support vectors, the margin of the classifier is maximized. Deleting the support vectors changes the position of the hyperplane.

RF
RF is an ensemble learning method for classification, regression, and other tasks. It operates by constructing a multitude of decision trees in the training phase. The output is the mode of classes (classification) or mean/ average prediction (regression) of individual trees [53]. The RF algorithm can be used to solve both regression and classification problems.
All ML models were implemented using the R programming language, version 4.1.0 (R Foundation for Statistical Computing, Vienna, Austria) [54], including the GEOquery and limma packages for downloading GEO datasets and identification of significant differentially expressed genes. The nomogram for predicting the risk of osteoporosis was created based on the selected genes.

AUTHOR CONTRIBUTIONS
Ki-Yeol Kim and Zhenlong Zheng designed the study and analyzed the data; Bong-Kyeong Oh contributed to the first draft of the manuscript; Xianglan Zhang and Ki-Yeo Kim supplied materials, created the figures, and prepared the final version of the manuscript.

CONFLICTS OF INTEREST
Zhenlong Zheng, Xianglan Zhang, Bong-Kyeong Oh, and Ki-Yeol Kim declare that they have no conflicts of interest related to this work.