Predictive accuracy of the models
We have created a DrugAge dataset specifically for studying the classification of compounds into the classes “increase lifespan” or “do not increase lifespan”, depending on each compound’s effect when administered to C. elegans. In this dataset, each compound to be classified belongs to one of the two just-mentioned classes, and is described by a large set of chemical descriptors and biological GO term features.
We use the random forest method as the classification algorithm to analyse this dataset. This type of method was chosen because it is particularly popular in bioinformatics [21,22], it is robust to overfitting in datasets where the number of features is much larger than the number of instances (as with our dataset) [22,23], it is relatively simple to understand and to use, and finally, in contrast to other state-of-the-art classification methods like support vector machines, random forests produce interpretable results based on a variable (feature) importance measure, an interpretation mechanism also exploited in this paper.
The predictive accuracy of the models was evaluated by the Area Under the ROC Curve (AUC). This measure ranges between 0 and 1, with 1 indicating perfect (no error) class predictions. The reported predictive accuracy is the median over the 10 test sets of the external cross-validation. We report the median accuracy, rather than the mean, because the former is more robust to outliers. The median AUC was computed for each of the different versions of the DrugAge dataset (using chemical descriptors, biological descriptors, or both), where for each dataset version we optimised the parameters ntrees and mtry of the random forest method as described in the Methods section.
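As an illustration of the evaluation measure described above, the following minimal sketch computes the AUC of one test fold as the fraction of (positive, negative) instance pairs that are ranked correctly, and then takes the median over folds. The function names and the three small folds are hypothetical and serve only as an example; they are not the data or the code used in this work.

```python
from statistics import median


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for "increase lifespan", 0 otherwise.
    scores: predicted probability of the positive class.
    Tied pairs count as half a correct ranking.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one instance of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def median_auc(folds):
    """Median AUC over a list of (labels, scores) test folds."""
    return median([auc(labels, scores) for labels, scores in folds])


# Hypothetical test folds (illustrative values only).
folds = [
    ([1, 0, 0, 1], [0.9, 0.2, 0.4, 0.7]),
    ([1, 0, 0, 1], [0.3, 0.2, 0.4, 0.7]),
    ([0, 1, 0, 0], [0.1, 0.5, 0.6, 0.2]),
]
print(median_auc(folds))
```

The pairwise formulation is quadratic in the fold size, which is acceptable for a sketch; a rank-based formulation would give the same value more efficiently.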
The AUC results are reported in Table 1. Comparing the AUC values across the dataset versions (last column in Table 1), it is clear that, in general, the set of chemical descriptors has a greater ability to predict a compound’s class than the set of GO terms. More precisely, the dataset using only chemical descriptors as features has a substantially larger AUC than the one using only GO terms as features (0.781 vs. 0.716, respectively). However, the GO term features still offer some positive contribution to the predictive accuracy of random forests, since the dataset version leading to the highest AUC value in Table 1 (0.800) was the one using both GO terms and chemical descriptors as features.
Table 1. Predictive accuracy (median AUC values on 10-fold cross validation) obtained by random forest with parameters optimized for each DrugAge dataset version (each with a different feature type combination).
Dataset features | ntrees (optimized) | mtry (optimized) | Median AUC |
GO terms only | 300 | 52 | 0.716 |
Chemical descriptors only | 100 | 16 | 0.781 |
GO terms and chemical descriptors | 900 | 210 | 0.800 |
Biological and chemical features for the prediction of longevity compounds in C. elegans
One of the benefits of the random forest method, in addition to its high predictive power, is that an importance measure can be calculated for each feature. This importance measure (often called variable importance) offers the opportunity to interpret the relevance of each feature in the model produced. In this work, using the Boruta and Ranger R packages [21,24] to compute the importance of features in the best model (built using both GO terms and chemical descriptors as features), 93 features (73 chemical descriptors and 20 GO terms) were selected as statistically significant (full table in Supplementary Material). Recall that the GO term features are derived from the proteins which are targeted by each compound.
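To make the notion of a statistically significant feature more concrete, the sketch below illustrates the general shadow-feature idea on which Boruta-style selection is based: in each run, a feature scores a "hit" if its importance exceeds the best importance obtained by randomly permuted (shadow) copies of the features, and the number of hits is then compared against chance with a binomial test. This is a deliberately simplified illustration and not the implementation of the Boruta package; the function names, the 0.01 threshold, the absence of a multiple-testing correction and the importance values are all assumptions made for the example.

```python
from math import comb


def binom_upper_tail(hits, runs, p=0.5):
    """P(X >= hits) for X ~ Binomial(runs, p)."""
    return sum(
        comb(runs, k) * p**k * (1 - p) ** (runs - k)
        for k in range(hits, runs + 1)
    )


def select_features(importances, shadow_max, alpha=0.01):
    """Keep features that beat the best shadow feature more often than chance.

    importances: dict mapping feature name -> importance in each run.
    shadow_max:  highest importance among the shadow features in each run.
    alpha:       significance threshold (an assumption of this sketch).
    """
    runs = len(shadow_max)
    selected = []
    for name, values in importances.items():
        hits = sum(v > s for v, s in zip(values, shadow_max))
        if binom_upper_tail(hits, runs) < alpha:
            selected.append(name)
    return selected


# Hypothetical importances over 10 runs (illustrative values only).
shadow_max = [3.0] * 10
importances = {
    "a_nN": [14.1, 14.6, 13.9, 14.4, 15.0, 14.2, 14.8, 14.3, 14.5, 14.4],
    "uninformative": [2.0, 4.0, 2.5, 3.5, 1.0, 3.2, 2.8, 3.1, 2.9, 3.4],
}
print(select_features(importances, shadow_max))
```

In this toy example the first feature beats the shadow maximum in all 10 runs and is retained, whereas the second does so in only half of the runs and is not.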
Of the 20 GO terms selected as significant, most are biological process terms (14 out of 20), five are molecular function terms and one is a cellular component term. Biological process GO terms describe broad series of events as well as specific biological processes such as macromitophagy and macroautophagy, which are among the features with the highest importance in this work. Molecular function GO terms describe specific activities that occur at the molecular level, such as isomerase activity and protein disulfide isomerase activity. Finally, cellular component GO terms describe locations in the cell, e.g. at the level of organelles or macromolecular complexes such as the mitochondrial proton-transporting ATP synthase complex, which is the only significant cellular component GO term feature in this work.
Chemical molecular descriptors are calculated from the chemical structure and are normally used to build predictive models to study the relationship between a compound’s chemical structure and its biological and pharmacokinetic properties such as drug distribution and absorption [25,26]. This paper is the first use of chemical molecular descriptors (as well as GO terms) to study the relationship between longevity and the chemical structure of compounds that may affect longevity.
Chemical molecular descriptors can be broadly categorized into three main groups, which describe a compound’s chemical structure and its main properties. These groups are: hydrophobic, electronic and steric (size and/or shape) descriptors. Hydrophobicity descriptors describe the hydrophobic character of a chemical compound and how easily it can cross cell membranes, and they may also be important for receptor interactions. Electronic molecular descriptors describe the electron distribution in a chemical compound and its electrostatic interactions, and therefore give an indication of how strongly (in terms of affinity) and how specifically a chemical compound binds to specific receptors. Finally, steric descriptors describe the size and shape of the chemical compound. The size and shape of a compound may influence its binding to enzyme or receptor binding sites and can also affect other physicochemical properties. Note that a chemical molecular descriptor can belong to more than one of the categories described above.
The top 20 selected features with the highest median variable importance are shown in Table 2. Considering just the top 20 features as shown in Table 2, there are slightly more GO terms (12 out of 20) than chemical molecular descriptors (8 out of 20). Those 12 GO terms include terms related to mitochondrial processes, terms related to enzymatic and immunological processes and terms related to metabolic and transport processes. Furthermore, the eight chemical molecular descriptors in the top 20 features contain descriptors related to electronic and steric (size and shape) effects, but not to hydrophobic effects directly.
Table 2. Top 20 selected features with the highest median variable importance.
Median Variable Importance | Feature | Feature type | Feature Description |
14.4 | a_nN | MD | Number of nitrogen atoms in the molecule |
12.8 | isomerase activity | GO | Catalysis of the geometric or structural changes within one molecule |
11.8 | macromitophagy | GO | Degradation of a mitochondrion by macroautophagy |
11.6 | macroautophagy | GO | Process in which cellular contents are degraded by lysosomes |
11.1 | protein disulfide isomerase activity | GO | Catalysis of the rearrangement of both intrachain and interchain disulfide bonds in proteins. |
11.0 | dipeptidase activity | GO | Catalysis of the hydrolysis of a dipeptide. |
9.72 | pyruvate metabolic process | GO | The chemical reactions and pathways involving pyruvate |
9.47 | PEOE_VSA+4 | MD | Total positive van der Waals surface area of atoms with atomic charge in the range of 0.20-0.25. |
9.31 | fatty acid transport | GO | The directed movement of fatty acids into, out of or within a cell, or between cells |
8.79 | mitochondrial electron transport, NADH to ubiquinone | GO | The transfer of electrons from NADH to ubiquinone mediated by the multisubunit enzyme known as complex I |
8.64 | vsurf_Wp2 | MD | Polar volume at -0.5, a descriptor reflecting the polarizability of a molecule |
8.57 | isotype switching | GO | The switching of activated B cells from IgM biosynthesis to biosynthesis of other isotypes |
8.40 | translation | GO | The cellular metabolic process in which a protein is formed |
8.18 | Q_RPC- | MD | Relative negative partial charge, defined as the most negative atomic charge divided by the sum of all negative atomic charges in the molecule. |
8.09 | aerobic respiration | GO | The enzymatic release of energy from inorganic and organic compounds |
7.98 | a_IC | MD | Atom information content (total), defined as the entropy of the element distribution in the molecule multiplied by the number of atoms. |
7.95 | PEOE_VSA_FPPOS | MD | Fractional polar positive vdw surface area |
7.86 | triglyceride mobilization | GO | The release of triglycerides from storage within cells or tissues, making them available for metabolism. |
7.79 | chi1v | MD | Valence corrected molecular connectivity index (order 1) |
7.70 | bpol | MD | Sum of the absolute value of the difference between atomic polarizabilities of all bonded atoms in the molecule |
GO: Gene ontology term; MD: Chemical Molecular descriptor |
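Two of the descriptors in Table 2 have definitions simple enough to illustrate directly. The sketch below computes a_nN (number of nitrogen atoms) and a_IC (entropy of the element distribution multiplied by the number of atoms) from a plain molecular formula. It is only an illustration of the definitions given in the table: in practice such descriptors are computed from the full chemical structure by cheminformatics software, and the use of a base-2 logarithm, the counting of all atoms written in the formula (including hydrogens) and the example formulas are assumptions of this sketch.

```python
import re
from math import log2


def element_counts(formula):
    """Parse a simple molecular formula such as 'C8H10N4O2' into element counts.

    Parentheses, charges and isotopes are not supported in this sketch.
    """
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def a_nN(formula):
    """Number of nitrogen atoms in the molecule."""
    return element_counts(formula).get("N", 0)


def a_IC(formula):
    """Entropy of the element distribution (in bits) times the number of atoms."""
    counts = element_counts(formula)
    n_atoms = sum(counts.values())
    entropy = -sum((c / n_atoms) * log2(c / n_atoms) for c in counts.values())
    return entropy * n_atoms


# Illustrative inputs: caffeine (C8H10N4O2) and acrolein (C3H4O).
for formula in ("C8H10N4O2", "C3H4O"):
    print(formula, a_nN(formula), round(a_IC(formula), 2))
```

A molecule made of a single element has zero entropy in its element distribution, so its a_IC is zero under this definition, while a molecule with many elements in similar proportions has a high value.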
It can be seen from the list of important features that the vast majority of the most important features are very specific molecular and biological processes. However, these specific processes are generic in their applicability and occur across many tissues and organs. For example “isomerase activity” covers a broad range of various enzymes that catalyze reactions across many biological processes, such as in glycolysis and carbohydrate metabolism. Although it is evident that isomerase activity is relevant to metabolism (amongst other processes) and hence ageing, this feature is not specific enough to suggest practical targets for pharmacological intervention. In spite of this, some of the specific features have been linked with longevity and ageing processes.
GO terms related to metabolism make up the majority of the GO term features listed in Table 2. These GO terms range from very general metabolism-related properties such as aerobic respiration to more specific processes such as dipeptidase activity, pyruvate metabolic process, fatty acid transport and mitochondrial electron transport from NADH to ubiquinone. Given the involvement of metabolic factors in several theories of ageing such as the free radical theory of ageing, as well as the well-established effect of calorie restriction on longevity, it is to be expected that the compounds that affect ageing do so by interacting with these pathways and processes, as evidenced also by the importance of such features in the random forest model.
One apparent group of features that can be related to longevity and ageing are the GO terms related to autophagy (macroautophagy and macromitophagy) and mitochondrial processes. Macroautophagy is the process where cellular contents are degraded by lysosomes or vacuoles and recycled, and this process controls cytosolic protein and organelle degradation [27,28]. Macromitophagy, in turn, is the degradation of mitochondria by macroautophagy and controls mitochondrial quality and quantity [29]. It is known that autophagy in general is associated with ageing processes. This is evidenced by the occurrence of degenerative changes in mammalian tissues, similar to changes seen with ageing, as a result of genetic inhibition of autophagy. Moreover, pharmacological or genetic manipulations that increase life span in model organisms often stimulate autophagy. Consistent with this, autophagy decreases with increasing age in organisms, which leads to an accumulation of damage [30] that is thought to be responsible for the functional loss in many biological and physiological processes as ageing occurs [31,32]. In addition to macroautophagy, mitophagy is specifically implicated in ageing. Mitophagy has been shown to be a selective, “non-random” process [33] that is governed by several biological pathways (see [34] for a review of the molecular mechanisms).
Mitochondrial respiration, and in particular the electron transport chain, is the main source of reactive oxygen species (ROS). As a result, mitochondrial homeostasis is particularly affected by ageing, as ROS generation in mitochondria leads to mitochondrial protein and mtDNA damage [34]. Therefore, mitophagy can be regarded as a defense against oxidative stress, mitochondrial dysfunction, and ageing. This is supported by findings that along with mitochondrial biogenesis pathways, a key mediator of mitophagy and longevity assurance under conditions of stress in C. elegans (DCT-1) is upregulated when mitophagy is impaired [35]. It is therefore not unexpected to find in this work that chemical compounds that modulate mitophagy are also important promoters of longevity. It is interesting to note that in model organisms such as C. elegans disruption of mitochondrial electron transport chain processes can lead to increases in longevity, through genetic [36] or pharmacological interventions [37]. Finally, a related property, aerobic respiration, was also selected by the random forest model. Although aerobic respiration is a very broad term encompassing many processes that lead to the production of cellular energy, it is strongly associated with ageing through the known impact of mitochondrial function and caloric restriction.
Other GO features with links to longevity and ageing processes are protein disulfide isomerase activity and translation. Protein disulfide isomerase activity refers to the activity of isomerases that are involved in protein folding via formation and breakage of disulfide bonds within proteins in the endoplasmic reticulum (ER) [38,39]. The activity of these enzymes is key to protein folding and quality control in the ER. A number of studies have demonstrated that the levels of disulfide isomerases and their catalytic activity diminish with age [40]. Misfolding of proteins and ER stress are alleviated by the signalling pathway known as the ER stress response or the unfolded protein response, which involves protective measures to limit the protein load. These include up-regulation of ER chaperones involved in the refolding of proteins, activation of pathways leading to reduction of protein translation and degradation of misfolded proteins. Where ER stress cannot be reversed, cellular functions deteriorate and apoptosis will occur [41]. There is evidence in the literature to suggest that disruption of protein disulfide isomerase activity leads to ER stress and accumulation of misfolded proteins, which can give rise to age-related disease pathology [42]. Finally, the GO term translation has a clear biological relevance, since it is well-known that translation inhibition extends lifespan in C. elegans [43]. Translation has also been highlighted as a prime category in age-related genes in C. elegans in a recent paper by Fernandes et al. (2016) [44]. It is therefore plausible that pathways involved in protein translation and folding may be a target of anti-ageing compounds, hence the significance of GO terms such as “translation” and “protein disulfide isomerase activity” in the random forest model.
The molecular descriptors in Table 2 indicate the molecular properties that impact the longevity effect of the compounds. Of the eight molecular descriptors listed in the table, the majority are electrostatic descriptors, namely PEOE_VSA+4, vsurf_Wp2, Q_RPC-, PEOE_VSA_FPPOS and bpol. These electrostatic parameters also carry information regarding the topology of the molecule and, along with steric parameters such as chi1v and a_IC, help explain the interaction and binding of the compounds with their target sites. These targets/processes are in addition to those already described in the model by the biological features (GO terms).
Overall, even though the dataset used (like any other biological dataset) is somewhat biased by the fact that some genes have been much more studied than others [44], some of the most important features shown in Table 2 can be related to important and known biological processes of ageing and longevity, such as those related to autophagy and mitochondrial processes. Furthermore, the other selected biological and chemical features are a good starting point for further investigation into the links between the chemical and biological features of compounds, longevity and the underlying biological ageing processes.
Predictions of novel potential life-extending compounds
The best model built from the DrugAge dataset (using GO terms and chemical descriptors) was used to predict the probability of the class “increase lifespan” for over 6,000 compounds from the DGIdb database v2 [45], where the class label of each compound is unknown. By using the predicted class probabilities we can rank and prioritise those compounds with the highest probability of increasing the lifespan of C. elegans. The list of all compounds predicted from the DGIdb dataset and their associated class probabilities can be found in the Supplementary Material, and the class probabilities for the top 20 compounds can be found in Table 3.
Table 3. Top 20 chemical compounds with the highest lifespan-increase class probability from the external screening dataset.
Chemical Compound Name | Predicted Probability |
acrolein | 0.691 |
valspodar | 0.683 |
ganirelix | 0.674 |
acetaldehyde | 0.669 |
mmk-1 | 0.667 |
rdp-58 | 0.665 |
cetrorelix | 0.657 |
gal-b5 | 0.656 |
m40 | 0.654 |
DB03393 | 0.650 |
bortezomib | 0.650 |
ro 25-1392 | 0.650 |
gv1001 | 0.650 |
lactose | 0.650 |
ergotamine | 0.650 |
cardiolipin | 0.642 |
dactinomycin | 0.642 |
abt-510 | 0.640 |
aplyronine a | 0.637 |
valinomycin | 0.637 |
As shown in Table 3 the highest predicted class probability for a compound in DGIdb was 0.69. Although not close to 1, this can be considered a relatively high probability, considering that the baseline probability (relative frequency) of the class “lifespan increase” in the DrugAge dataset used to build the model was only 0.20. In this section, we focus on the 50 “top hit” DGIdb compounds, with the highest values of probabilities for the predicted class “lifespan increase”. In general, the top hit compounds predicted to have longevity enhancing effects fall into four groups: compounds affecting mitochondria, compounds used in treatments for cancer, anti-inflammatories, and compounds used in gonadotropin-releasing hormone therapies.
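The ranking of compounds by predicted class probability, and the comparison of a predicted probability with the baseline class frequency, can be sketched as follows. The probabilities are taken from Table 3 and the baseline of 0.20 from the text above; the function names and the choice of expressing the comparison as a simple ratio are assumptions made for illustration.

```python
def top_hits(probabilities, k):
    """Return the k compounds with the highest predicted class probability."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


def ratio_to_baseline(probability, baseline):
    """How many times larger a predicted probability is than the class baseline."""
    return probability / baseline


# A few predicted probabilities from Table 3, deliberately out of order.
predicted = {
    "valspodar": 0.683,
    "lactose": 0.650,
    "acrolein": 0.691,
    "acetaldehyde": 0.669,
    "ganirelix": 0.674,
}
BASELINE = 0.20  # relative frequency of the "lifespan increase" class in DrugAge

for name, probability in top_hits(predicted, 3):
    ratio = ratio_to_baseline(probability, BASELINE)
    print(f"{name}: probability {probability:.3f}, {ratio:.2f} times the baseline")
```

Under this reading, the top hit has a predicted probability roughly three and a half times the baseline frequency of the positive class, which is why a value of 0.69 can be regarded as relatively high.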
Compounds related to mitochondrial processes
Acrolein (lifespan increase class probability = 0.69) was the top hit in our screening dataset. Acrolein is a highly reactive electrophile and a building block to many other chemical compounds, including the amino acid methionine. This compound has been shown to be an electron transport chain inhibitor, leading to mitochondrial dysfunction [46]. Acrolein is implicated in pathways such as p53 and the NF-κB inflammation pathway [47]. Acrolein is toxic at high concentrations [46], but at lower doses in vitro exposure to acrolein inhibits NF-κB activation, suggesting that inhibition of NF-κB gives rise to acrolein’s anti-inflammatory properties – however, the evidence is conflicting [48,49]. Therefore, the high probability of lifespan increase predicted by our model, despite the known toxicity of acrolein, may result from the contribution of a large diversity of the pathways affected by this compound, some of which are desirable for longevity.
Other compounds affecting mitochondrial processes include valinomycin and cardiolipin (both with lifespan increase class probability = 0.64). Valinomycin is a potassium ionophore and causes mitochondrial dysfunction by uncoupling oxidative phosphorylation in the electron transport chain [50]. Cardiolipin is a dimeric phospholipid found in the inner mitochondrial membrane (IMM), where it plays a major role in oxidative phosphorylation. Alterations in the content and composition of cardiolipin, as well as its peroxidation, lead to mitochondrial dysfunction [51,52]. A decrease in cardiolipin content has been observed in the ageing brain, and in several pathologies including myocardial ischemia, heart failure and Parkinson’s disease [53]. Therefore, it is to be expected that cardiolipin administration is predicted to promote longevity.
Anti-cancer drugs and longevity
Anti-cancer compounds from our top 50 hits in the DGIdb dataset include drugs such as temsirolimus, valspodar and bortezomib. Interestingly, temsirolimus (lifespan increase class probability = 0.62) is a derivative and pro-drug of sirolimus, also known as rapamycin. Rapamycin was the first pharmacological compound shown to extend lifespan in both sexes in mouse models [54,55], and it also extends lifespan in C. elegans [56] and D. melanogaster [57]. Numerous studies indicate that the TOR (Target of Rapamycin) kinase is implicated in lifespan control [58,59]. Temsirolimus also inhibits mTOR, and this compound has been shown to improve certain cellular phenotypes in accelerated ageing models via increasing autophagy [60].
Valspodar (lifespan increase probability = 0.68), the second top hit in our screening dataset, is an experimental chemosensitizer drug. Valspodar sensitizes tumor cells, making them more vulnerable to anti-cancer drugs, through its ability to inhibit P-glycoprotein (P-gp), which is overexpressed in many cancer cells. However, possibly of more relevance is the apoptotic effect of valspodar (and its structurally related compound, cyclosporine A), which stems from their disruption of the mitochondrial membrane potential, leading to mitochondrial dysfunction [61].
Bortezomib (lifespan increase probability = 0.65) is a proteasome inhibitor, and studies have shown that the inhibition of proteasome activity by bortezomib is associated with enhanced apoptosis due to inhibition of NF-κB activity [62,63]. However, this compound also leads to the accumulation of misfolded proteins and ER stress followed by unfolded protein response (UPR) and macroautophagy [64], which may potentially lead to longevity promotion.
Dactinomycin (lifespan increase probability = 0.64) interferes with ribosome biogenesis through the inhibition of RNA polymerase I [65], which leads to the activation of p53 [66]. Inhibition of the mTOR pathway leads to a reduction of ribosome biogenesis and increases lifespan in several species [54,57,67]. mTOR and p53 signalling pathways are connected by a number of different mechanisms, highlighting a complex relationship [66,68,69]. Considering that there are similar signaling molecules involved in both cancer and ageing [70,71], such as mTOR [72], p53 [69] and NF-κB [73], it is not unexpected to find anti-cancer drugs in our list of top hit compounds. However, this could be due to research bias, where anti-cancer drugs may be overrepresented in datasets (including DrugAge) due to the extensive study of cancer therapies.
Chemical compounds with anti-inflammageing effects
Ageing has been characterized by chronic, low-grade inflammation, also labeled as “inflammageing” [74]. Human studies have shown that suppression of chronic inflammation is a major determinant of successful longevity, over a very wide age range up to extreme old age [75,76].
The compound rdp-58 (lifespan increase class probability = 0.67), tested for the treatment of the inflammatory disorder ulcerative colitis [77,78], leads to a reduction of proinflammatory cytokines such as tumor necrosis factor alpha (TNF-α), interferon-γ, and the interleukins (ILs) IL-2, IL-6 and IL-12 [79].
Ergotamine (lifespan increase probability = 0.65), a vasoconstrictor used for the treatment of migraines, has also been shown to reduce the level of proinflammatory TNF-α [80]. Dihydroergotamine methanesulfonate increases longevity in C. elegans [18] and was used to build our models. Dihydroergotamine methanesulfonate is a derivative of ergotamine, which may explain the predicted pro-longevity effects for ergotamine.
The compound ro 25-1392 (lifespan increase probability = 0.65) is a type II vasoactive intestinal peptide receptor (VIPR2) agonist [81]. ro 25-1392 is an analogue of vasoactive intestinal peptide (VIP), which binds to both VIPR1 and VIPR2, leading to protection in models of inflammatory and autoimmune conditions [82,83].
Reproductive hormone factors and longevity
Gonadotropin-releasing hormone (GnRH) is responsible for the release of follicle-stimulating hormone (FSH) and luteinizing hormone (LH) in the pituitary gland, promoting the production of testosterone and estrogen. It is a part of the hypothalamic–pituitary–gonadal axis, which helps in the regulation of reproductive and immune systems [84].
In our list of top hit compounds there are examples of GnRH antagonists, such as ganirelix [85] and cetrorelix [86] (lifespan increase class probabilities 0.67 and 0.66, respectively); and agonists such as nafarelin [87] and histrelin [88,89] (lifespan increase class probabilities 0.63 and 0.62, respectively). Both antagonists and agonists (whose continued use leads to desensitisation of GnRH receptors) of GnRH receptors lead to the reduction of FSH and LH.
The decline in GnRH has been shown to contribute to ageing-related changes such as bone fragility and reduced neurogenesis in mice. Zhang [90] showed in mice that activation of NF-κB in the hypothalamus led to a reduced production of GnRH by neurons and that continued activation led to accelerated ageing, whereas GnRH treatment restored neurogenesis and decelerated ageing. These findings suggest a link between inflammation and ageing related to GnRH. However, whether this relationship involving GnRH applies to humans and primates is questionable, as it appears that female primates have higher levels of GnRH with increasing age [91], whereas in Norway rats GnRH levels decreased with increasing age [92]. Nevertheless, these findings suggest that GnRH may have some role in longevity independent of its role in reproduction.