Abstract

This study was designed to identify lncRNA biomarker candidates using lung cancer data from RNA-Seq and microarray platforms separately.

Lung cancer datasets were obtained from the Gene Expression Omnibus (GEO, n = 287) and The Cancer Genome Atlas (TCGA, n = 216) repositories, only common lncRNAs were used. Differentially expressed (DE) lncRNAs in tumors with respect to normal were selected from the Affymetrix and TCGA datasets. A training model consisting of the top 20 DE Affymetrix lncRNAs was used for validation in the TCGA and Agilent datasets. A second similar training model was generated using the TCGA dataset.

First, a model using the top 20 DE lncRNAs from Affymetrix for training and validated using TCGA and Agilent, achieved high prediction accuracy for both training (98.5% AUC for Affymetrix) and validation (99.2% AUC for TCGA and 92.8% AUC for Agilent). A similar model using the top 20 DE lncRNAs from TCGA for training and validated using Affymetrix and Agilent, also achieved high prediction accuracy for both training (97.7% AUC for TCGA) and validation (96.5% AUC for Affymetrix and 80.9% AUC for Agilent). Eight lncRNAs were found to be overlapped from these two lists.