Identification of a seven-cell cycle signature predicting overall survival for gastric cancer

While genetic alterations in several regulators of the cell cycle have a significant impact on the gastric carcinogenesis process, the prognostic role of them remains to be further elucidated. The TCGA-STAD training set were downloaded and the mRNA expression matrix of cell cycle genes was extracted and corrected for further analysis after taking the intersection with GSE84437 dataset. Differentially expressed mRNAs were identified between tumor and normal tissue samples in TCGA-STAD. Univariate Cox regression analysis and lasso Cox regression model established a novel seven-gene cell cycle signature (including GADD45B, TFDP1, CDC6, CDC25A, CDC7, SMC1A and MCM3) for GC prognosis prediction. Patients in the high-risk group shown significantly poorer survival than patients in the low-risk group. The signature was found to be an independent prognostic factor for GC survival. Nomogram including the signature shown some clinical net benefit for overall survival prediction. The signature was further validated in the GSE84437 dataset. In tissue microarray, CDC6 and MCM3 protein expression were significant differences by the immunohistochemistry-based H-score between tumor tissues and adjacent tissues, and CDC6 is an independent prognostic factor for GC. Interestingly, our GSEA revealed that low-risk patients were more related to cell cycle pathways and might benefit more from therapies targeting cell cycle. Our study identified a novel robust seven-gene cell cycle signature for GC prognosis prediction that may serve as a beneficial complement to clinicopathological staging. The signature might provide potential biomarkers for the application of cell cycle regulators to therapies and treatment response prediction.


INTRODUCTION
The mammalian cell cycle is a highly structured and regulated process that ensures the duplication of the genetic material and cell division. Cell cycle regulation involves both mechanisms that govern growth regulation and genetic integrity. Cancer is characterized by abnormal cell cycle activity, possibly because of upstream signaling pathway gene mutations or mutations in the genes that encode cyclins. Genetic changes in several cell cycle regulators have a key influence on the pathogenetic process of gastric cancer (GC) [1]. Studies have shown that cell cycle regulators are potential GC prognosis biomarkers of great clinical value [2], and targeting cell cycle regulators in GC treatment has also increasingly attracted attention [3].
Among the five most common cancers worldwide, GC is the third major cause of cancer mortality [4]. Despite surgical and adjuvant therapies, the prognosis of GC patients remains poor, and patients'5-year overall survival rate is below 25% [5]. Until now, the prediction of prognosis has primarily depended on histopathologic diagnosis and neoplasm staging systems. However, many patients in the progressive stage who have similar clinicopathologic features show great differences in prognosis.
Although much effort has been spent developing optimum tools for the prediction of GC prognosis, no consensus has been reached as to the best method. In most of the existing literature, clinical baseline characteristics (e.g., tumor size, lymphonodus count, and lymphatic vascular space invasion) and unimolecular biomarkers (e.g., CD44 [6], PPAR γ [7], IL-13Rα2 [8], HDAC6 [9]) are utilized to construct prognostic models. Nevertheless, the predictive power of monogenic biomarkers is insufficient. Based on recent development singenomic sequencing, the integration of prognosisrelated genetic markers with traditional parameters has huge potential for the prediction of GC prognosis [10,11].
In this study, differentially expressed cell cycle genes were identified from tumors and normal tissue specimens from The Cancer Genome Atlas Stomach Adenocarcinoma (TCGA-STAD) dataset. Then, univariate Cox regression analysis and lasso regression analysis were conducted on survival data to identify cell cycle genes significantly correlated with the overall survival (OS) of GC patients. These genes were used to construct a prognostic model, which was further verified in the GSE84437 dataset from Gene Expression Omnibus (GEO). Additionally, tissue microarrays were used in Chinese patients with GC to explore the relationship between selected cell cycle genes and prognosis based on their protein expression levels.

Prognostic model building and validation based on cell cycle genes in TCGA-STAD
The expression levels of 55 cell cycle genes in 334 patients with OS longer than 30 days (Table 1) from TCGA-STAD were used to train a prognostic model. Based on a univariate Cox regression model, seven genes (i.e., GADD45B, TFDP1, CDC6, CDC25A, CDC7, SMC1A and MCM3) were correlated with survival. Lasso regression analysis, including GADD45B, TFDP1, CDC6, CDC25A, CDC7, SMC1A and MCM3, was conducted to construct the prognostic model. The resulting risk score is calculated by 0.0090×expression level of GADD45B −0.0116×expression level of TFDP1 +0.0053×expression level of CDC6 −0.0177×expression level of CDC25A −0.0127×expression level of CDC7 −0.0157×expression level of SMC1A −0.0018×expression level of MCM3. All patients were classified into either the high-or low-risk groups on the basis of the optimum cutoff value of the risk score, which was set at −0.474.
Then, 431 patients with OS longer than 30 days (Table 1) were selected from the GSE84437 dataset to validate the model. These patients were also divided into highand low-risk groups on the basis of the optimum cutoff value of 1.087. Next, the Kaplan-Meier method was used to determine the survival curves of the two groups. As can be seen in Figure 1, the survival curves of the high-risk patients are different from those of the lowrisk patients; in addition, this difference was statistically significant.

Prognostic model and prognostic value of clinicopathologic features
Clinical information (age, sex, T-staging, and N-staging) found in both the TCGA and GEO datasets were tested for their impact on prognosis. Through univariate and multivariate Cox regression analyses, we found that age, T-staging, and N-staging were associated with prognosis. Additionally, the lasso-derived prognostic model (risk score) was an independent prognostic factor for OS.
Next, we assessed the sensitivity and specificity of the prognostic model with receiver operator characteristic (ROC) curves ( Figure 2). The areas under the curve (AUCs) were 0.656 and 0.629, respectively. This indicates that the proposed model performs well, showing medium sensitivity and specificity.

Nomogram building and validation based on the genetic model and clinical data of patients
A nomogram was constructed on the basis of age, T-staging, and N-staging, as well as the proposed prognostic model. As indicated by the calibration chart (  our results, the nomogram model better predicted survival prognosis than the clinical model for GC.

External immunohistochemical validation based on protein levels
Protein levels of CDC6 and MCM3 from 234 Chinese patients with GC were obtained using immunohistochemistry. We found that the expression in tumor tissues of these proteins was significantly higher than that in nontumor tissues (P < 0.01). Among the patients, data from 232 Chinese GC patients with OS longer than 30 days (Table 1) were included in univariate and multivariate Cox regression analyses. The results showed that TNM staging were associated with prognosis (Table 1). Furthermore, CDC6 was an independent prognostic factor in GC patients with T1-3N1-2M0 staging and no vascular tumor thrombus as presented in the 7 th version of the ACJJ (P<0.05) ( Figure 4).

Gene set enrichment analysis
Through gene set enrichment analysis (GSEA) of the differentially expressed genes from the TCGA-STAD dataset, we found that KEGG cell cycle pathways were enriched in the low-risk group (P<0.001 and false discovery rate (FDR)<0.001) ( Figure 5).

DISCUSSION
In this study, we included cell cycle-related genes associated with GC to explore their potential as prognostic factors for GC. We identified seven prognosis-related cell cycle genes and established a risk prediction model. For GC patients with high-and low-risk scores, we observed different survival rates. Additionally, a nomogram created by combining clinical factors with our prognostic model performed even better in predicting the survival prognosis of GC patients, with verification in a second dataset.
As a heterogeneous disease, the development of GC is a long and multistep process. Because of the gradual accumulation of genetic mutations, carcinogenesis and anti-cancer pathway imbalances eventually give rise to GC [12]. With progress in medical technology, the morbidity and mortality of GC have both declined. However, the current prognosis of GC patients is not optimistic [13].
In cancer, abnormal activity of various cyclins leads to uncontrollable tumor cell proliferation. In the literature, it has been reported that cell cycle regulators are related to the prognosis of GC patients. For instance, in gastric adenocarcinoma tissues, downregulated expression of the protein p27 is correlated with advanced tumors [14]. Among poorly differentiated tumors, the expression levels of p27 and CCND1 were elevated and were shown to be negative prognostic factors for patient survival [15]. Overexpression of the proteins CCND1 and CCND2 is associated with the short OS of GC patients [16]. Furthermore, binding of cyclin E and cyclin-dependent kinase 2 promotes the transition of the cell cycle from stage G1 to stage S. The prognosis of GC patients positive for cyclin E is poor; combining cyclin E overexpression with p53 expression was able to differentiate patients with poor prognosis [17,18]. Recently, bioinformatics methods were used to confirm that five G2/M checkpoint-related genes (i.e., MARCKS, CCNF, MAPK14, INSENP, and CHAF1A) were related to the OS of GC patients [19]. Nevertheless, these studies have not established a prognostic model for GC prediction based on the genetic characterization of genes associated with the cell cycle.
In recent years, mRNA expression has been investigated on the basis of whole-genome sequencing data [20].
Using these data, studies have investigated TP53 mutations and mutations in autophagy-related genes to construct prognostic models, showing that these models can successfully predict GC prognosis [10,11]. In this study, CDC6 and MCM3 showed the greatest differences and were most strongly correlated with prognosis in the TCGA dataset. MCM3 falls into a family of six highly conserved minichromosomal maintenance proteins, MCM2-MCM7. These proteins are critical for ensuring that eucaryon DNA replication takes place only once in each cell cycle. Additionally, MCM3 also serves as a helicase to facilitate replication extension. At an advanced stage of M1, CDT1 and CDC6 recruit iso-hexamer MCM2-MCM7 complexes and load them onto an origin of replication, producing prereplication complexes [21]. Additionally, CDC6 is an important cell cycle regulator [22]. By assembling prereplication complexes, it plays an essential role in maintaining chromosome integrity [23,24]. Moreover, CDC6 is also closely related to tumorigenesis. Besides  Figures (E, G, I) are derived from TCGA data, whereas those in Figures (F, H, J) are from the GSE84437 array. AGING  playing a role in cancer-related pathways by regulating the expression of certain oncogenes and cancer suppressor genes such as KRAS [25], CDC6 promotes cancer progression as it is positively regulated by associated noncoding RNAs (e.g., lncRNA-CDC6) [26]. Furthermore, the expression of CDC6 is upregulated in multiple tumor types [27,28]; CDC6 serves as a prognostic biomarker of breast carcinoma [29], colorectal cancer [30], pancreatic cancer [31], and other cancers. CDC6 knockdown has been shown to inhibit the growth and invasion of GC cells. Transfection of pS-CDC6 knocked down CDC6 expression in the GC cell lines BGC823 and SGC7901. MTT assay showed that transfection of pS-CDC6 markedly inhibited cell proliferation in BGC823 and SGC7901 cells. Transwell assay revealed that pS-CDC6 transfection inhibited the invasive capacities of BGC823 and SGC7901 cells. This likely be due to CDC6 knockdown promoting the apoptosis of these cells [32]. Among 232 Chinese GC patients, validation through immunohistochemical methods showed that the protein levels of CDC6 and MCM3 in tumor and nontumor tissues had significantly different histochemistry scores (H-scores; see Methods).
Additionally, we showed that CDC6 is an independent prognostic factor for GC with stratified analysis. Finally, via KEGG enrichment analysis, we showed that the gene expression in the low-risk group was related to cell cycle pathways, and patients in this group may benefit more from cell cycle therapies.
In summary, cell cycle genes play an important role in cancer progression. This study shows the value of cell cycle genes as prognostic biomarkers for GC. However, as a retrospective study, certain limitations are inevitable. To understand the specific mechanism and biological functions of cell cycle genes on prognosis, further investigation is needed.

Conclusions
We constructed a prognostic model involving seven cell cycle genes based on GC data from TCGA and GEO.
Combining the proposed prognostic model with a nomogram of clinical pathology, a higher survival prognosis prediction ability can be achieved. According to our findings, the proposed prognostic model may promote individualized diagnosis and treatment of GC patients and thus further improve the prognosis of these patients.

Gene expression data
On the TCGA website, we downloaded RNA sequencing and clinical data involving 375 GC tissues and 32 nontumor tissues from the STAD dataset. Subsequently, 125 cell cycle-related genes were obtained from the KEGG database, and GSE84437 data were downloaded from the GEO database. These datasets were used for further analysis after they were corrected by the "batchType" correction model in the SVA4.04R software package.

Differentially expressed cell cycle genes in GC
Using the "limma" package in R, 375 tumor samples and 32 normal tissue samples from TCGA were subjected to differential expression analysis for cell cycle genes. Differentially expressed genes were selected with an FDR of <0.05 and log fold-change of >0.5.

Prognostic gene model
Univariate Cox proportional hazards regression analysis was conducted to select cell cycle genes significantly correlated with the OS of individuals included in the TCGA-STAD dataset. According to the hazard ratio of each gene and P-values obtained using the Wald test, P<0.05 was adopted as the threshold to select genes that were significantly correlated with the survival of GC patients. Based on the "glmnet" package in R [33,34], a multivariate model of cell cycle genes was constructed via lasso regression. In the lasso regression model, genes with nonzero coefficients were selected to calculate risk scores by the following equation: risk score=∑ n j=1 Coefj*Xj, where Coef j stands for the coefficient and Xj for the relative expression level of each cell cycle gene [35]. All TCGA patients were divided into two groups (i.e., a high-risk group and a low-risk group) depending on the median risk score. Similar to the TCGA dataset, the same formula was used to determine the risk scores in the GEO dataset. We then calculated Kaplan-Meyer survival curves and used the log-rank test to clarify the statistical significance of the difference in OS between the two groups. Clinical information (e.g., age, sex, Tstaging, and N-staging) was then extracted from the TCGA-STAD and GSE84437 datasets. These factors were tested for their impact on prognosis with univariate and multivariate Cox regression analyses. Any factor with a P<0.05 was deemed statistically significant. Moreover, ROC curves were also calculated to verify the predictive value of the model.

Nomogram-based prediction model
Survival data from the patients were combined with their age, T-staging, N-staging, and risk scores to build a nomogram using the "rms" package in R. Subsequently, calibration curves were calculated to evaluate the consistency between the actual and predicted survival rates. Furthermore, a C-index ranging from 0.5 to 1.0 was calculated to assess the performance of the prognosis prediction model. The AUC was used to evaluate the sensitivity and specificity of the model.

Immunohistochemical evaluation of protein expression levels of cell cycle genes
Two hundred fifty tumor tissues and 144 nontumor tissues were derived from 250 Chinese patients affected by GC. From the TCGA database, CDC6 and MCM3, which were most closely correlated with prognosis, were selected for validation based on their protein levels. The H-score is used to evaluate immunohistochemical conditions. In detail, the number of positive cells on each slide and their staining intensity are converted into values as a semiquantitative measure of the amount of histological staining. H-Score (H-SCORE=∑pi × i)=(percentage of weak intensity cells ×1)+(percentage of moderate intensity cells ×2)+(percentage of strong intensity cells ×3), where pi represents the proportion of positive cells and i is the staining intensity. The H-score has a range of 0-300. The greater the value is, the greater the staining intensity of positive cells will be [36,37].

Gene set enrichment analyses
Regarding the differentially expressed genes in the TCGA-STAD array between the high-and low-risk groups, GSEA was used to identify enriched pathways [38,39]. P<0.05 and FDR<0.25 were considered statistically significant.

Statistical analysis
All statistical analyses were conducted in R software (Version 4.0.4; https://www.R-project.org). The correlation between the risk scores and clinical features was investigated using the χ 2 test. Besides drawing Kaplan-Meier curves, a log-rank test was conducted to test intergroup differences in OS. Univariate and multivariate Cox proportional risk regression analyses were conducted to determine the association between the risk score and OS. The AUC was calculated as a measure of the prediction accuracy of the prognostic models. P<0.05 was used to signify statistically significant results.

AUTHOR CONTRIBUTIONS
L-Q Z, L-D W and F-Y Z contributed design and idea of the study. S-L Z, J-K L, X-K Z and L-Q Z contributed collection and procession of clinical data. S-L Z and J-K L performed experiments. L-Q Z, P-N C and S-L Z performed the data analysis and interpreted the results. S-L Z, L-Q Z, X-L L and F-Y Z wrote the manuscript. All authors contributed to the article and approved the submitted version.