Male-specific age estimation based on Y-chromosomal DNA methylation

Athina Vidaki, Diego Montiel González, Benjamin Planterose Jiménez, Manfred Kayser

  • 1 Department of Genetic Identification, Erasmus University Medical Center Rotterdam, Rotterdam 3000 CA, The Netherlands

Received: March 25, 2020; Accepted: February 25, 2021; Published: March 11, 2021

Although DNA methylation variation of autosomal CpGs provides robust age-predictive biomarkers, no male-specific age predictor based on Y-CpGs exists yet. Since sex chromosomes play an important role in aging, a Y-chromosome-based age predictor would allow male-specific aging effects to be studied and would also be useful in forensics. Here, we used blood-based DNA methylation microarray data of 1,057 males aged 15-87 years from six cohorts and identified 75 Y-CpGs with an interquartile range of ≥0.1. Of these, 22 were significantly hypermethylated and six significantly hypomethylated with age (p(cor)<0.05, Bonferroni). Amongst several machine learning algorithms, a model based on support vector machines with a radial kernel performed best in male-specific age prediction. We achieved a mean absolute deviation (MAD) between true and predicted age of 7.54 years (cor=0.81, validation) when using all 75 Y-CpGs, and a MAD of 8.46 years (cor=0.73, validation) based on the 19 most predictive Y-CpGs. The accuracy of both age predictors did not worsen with increasing age, in contrast to autosomal CpG-based age predictors, which are known to predict age less accurately in the elderly. Overall, we introduce the first male-specific epigenetic age predictor for future applications in aging research and forensics.
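
The quantities named in the abstract (the interquartile range filter at ≥0.1, the Bonferroni correction, and the MAD between true and predicted age) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the toy CpG identifiers, and all numbers below are hypothetical and chosen only for illustration. The quartile convention used by the study is not stated here, so the "inclusive" method is an assumption.

```python
from statistics import quantiles


def iqr(values):
    """Interquartile range (Q3 - Q1) of a list of methylation beta values.

    Assumption: the 'inclusive' quartile method; the source does not say
    which quartile convention was used.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


def variable_cpgs(beta_by_cpg, threshold=0.1):
    """Return the CpG names whose IQR across samples is >= threshold."""
    return [cpg for cpg, betas in beta_by_cpg.items() if iqr(betas) >= threshold]


def bonferroni(p_values):
    """Bonferroni-corrected p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def mean_absolute_deviation(true_ages, predicted_ages):
    """MAD as used in the abstract: mean of |true age - predicted age|."""
    errors = [abs(t - p) for t, p in zip(true_ages, predicted_ages)]
    return sum(errors) / len(errors)


# Toy data (hypothetical probe names and beta values, five samples each).
toy_betas = {
    "cg_toy_a": [0.10, 0.20, 0.30, 0.40, 0.50],  # variable across samples
    "cg_toy_b": [0.50, 0.51, 0.52, 0.53, 0.54],  # nearly constant
}
selected = variable_cpgs(toy_betas)

# Toy ages in years (hypothetical).
mad = mean_absolute_deviation([20, 40, 60], [25, 38, 66])
```

On the toy data, only `cg_toy_a` passes the IQR filter, and `mad` is the average of the absolute errors 5, 2 and 6 years.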


BIC: Bayesian information criterion; BMIQ: beta mixture quantile; CpG: cytosine-phosphate-guanine site; CV: cross-validation; DNA: deoxyribonucleic acid; DNAm: DNA methylation age (Horvath clock); EWAS: epigenome-wide association study; FDP: forensic DNA phenotyping; GEO: Gene Expression Omnibus database; HIV: human immunodeficiency virus; IGV: integrative genomics viewer; IQR: interquartile range; MAD: mean absolute deviation; MLR: multiple linear regression; MSE: mean square error; OLS: ordinary least squares; oob: out-of-bag; QC: quality control; RELIC: regression on logarithm of internal control probes; RFR: random forest regression; RMSE: root mean square error; RSS: residual sum of squares; SNP: single nucleotide polymorphism; SVM: support vector machine; Y-CpG: Y-chromosome-located CpG.