Tumor microenvironment-related multigene prognostic prediction model for breast cancer

Background: Breast cancer is an invasive disease with complex molecular mechanisms. Prognosis-related biomarkers are still urgently needed to predict outcomes of breast cancer patients. Methods: Original data were download from The Cancer Genome Atlas (TCGA) and the Gene Expression Omnibus (GEO). The analyses were performed using perl-5.32 and R-x64-4.1.1. Results: In this study, 1086 differentially expressed genes (DEGs) were identified in the TCGA cohort; 523 shared DEGs were identified in the TCGA and GSE10886 cohorts. Eight subtypes were estimated using non-negative matrix factorization clustering with significant differences seen in overall survival (OS) and progression-free survival (PFS) (P < 0.01). Univariate Cox analysis and least absolute shrinkage and selection operator (LASSO) regression analysis were performed to develop a related risk score related to the 17 DEGs; this score separated breast cancer into low- and high-risk groups with significant differences in survival (P < 0.01) and showed powerful effectiveness (TCGA all group: 1-year area under the curve [AUC] = 0.729, 3-year AUC = 0.778, 5-year AUC = 0.781). A nomogram prediction model was constructed using non-negative matrix factorization clustering, the risk score, and clinical characteristics. Our model was confirmed to be related with tumor microenvironment. Furthermore, DEGs in high-risk breast cancer were enriched in histidine metabolism (normalized enrichment score [NES] = 1.49, P < 0.05), protein export (NES = 1.58, P < 0.05), and steroid hormone biosynthesis signaling pathways (NES = 1.56, P < 0.05). Conclusions: We established a comprehensive model that can predict prognosis and guide treatment.

AGING molecular types have been presented, such as luminal A, luminal B, HER2-enriched, basal-like, normal-like, and claudin-low [5]. Different subtypes have been confirmed to have different prognoses and drug responses. With advances in statistical analysis, multigene signatures are widely used to predict patient prognosis and drug response [6,7]. Some multigene prediction models, such as the Oncotype DX 21-gene test, Prediction Analysis of Microarray 50, and 70-gene signature (MammaPrint), have been applied in the clinic [8][9][10]. The Oncotype DX 21-gene test can evaluate the tumor recurrence and predict chemotherapy responses in patients with ERpositive breast cancer. MammaPrint signature and PAM50 can improve prognostic prediction in breast cancer patients. Nonetheless, despite their high power, these tools only consider gene status, and thus, a model with comprehensive consideration of additional factors is urgently needed.
Breast cancer outcomes are significantly related to factors, such as tumor size, tumor stage, lymph node status, age, tumor tissue receptor status, and gene status. To effectively evaluate prognoses of breast cancer and predict drug responses to guide treatment, we constructed a comprehensive prediction model with varied factors. In this study, we identified differentially expressed genes (DEGs) from The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO) and performed non-negative matrix factorization (NMF) clustering and a least absolute shrinkage and selection operator (LASSO) regression analysis to construct a nomogram prediction model. We also explored the correlations and potential signaling pathways and discuss a possibly internal mechanism of breast cancer.

Data acquisition
Gene expression, tumor mutation burden, and clinical information datasets of breast cancer were obtained from The Cancer Genome Atlas (TCGA, https://portal.gdc.cancer.gov/) and Gene Expression Omnibus (GEO, https://www.ncbi.nlm.nih.gov/geo/) databases in October 2021. We selected 1208 samples, including 1096 breast cancer samples and 112 normal samples from The Cancer Genome Atlas Breast Invasive Carcinoma (TCGA-BRCA) program. After searching the datasets with more than 150 human breast cancer samples with complete expression profile data, we selected the GSE10886 dataset from the GEO [9].

Protein-protein interaction network and enrichment analysis
The protein-protein interaction (PPI) network for DEGs was constructed using the STRING website tool (https://string-db.org/); the high confidence genes were conserved (interaction score ≥ 0.7), and hub nodes were visualized by R-x64-4.1.1 (Supplementary File 1). Gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analyses were conducted using the clusterprofiler R package, where P < 0.05 [16][17][18] (Supplementary File 1).

Non-negative matrix factorization clustering
First, we intersected the TCGA-BRCA and GSE10886 data to obtain the shared DEGs; then, we analyzed the DEGs using the survival and NMF R packages (Supplementary File 1), through which eight distinct subtypes were identified according to the cophenetic correlation coefficient for the cluster number from 2-10 [19]. Second, the overall survival (OS) and progressionfree survival (PFS) differences among these subtypes, as observed by Kaplan-Meier (K-M) analysis and logrank test, were analyzed using the survival and survminer R packages [20] (Supplementary File 1). Third, the analysis of microenvironment cell populations (MCP) differences among these subtypes was performed using the limma, ggpubr, and MCPcounter R packages [21] (Supplementary File 1).

Nomogram model establishment
Based on the DEG data, a univariate Cox analysis was performed to identify the survival related genes (P < 0.05). A modified LASSO regression analysis was conducted to find the genes most relevant to the OS of breast cancer patients (P < 0.05) [22][23][24] (Supplementary File 1). To verify the accuracy of our risk score predictor, analyses of training, and test groups were performed using data that were randomly obtained from the whole group. The predictive ability of the risk score was evaluated by survival probability curve, receiver operating characteristic (ROC) curve, and the area under the curve (AUC) [25] (Supplementary File 1). The risk computing formula is as follows: Risk score In addition, univariate and multivariate analyses were performed to demonstrate the independent predictive ability of the risk score. A nomogram prediction model was established using clinical characteristics, NMF clustering, and risk score to predict the survival of breast cancer patients; the nomogram-predicted probability of the 1-, 3-, and 5-year OS is shown by the calibration curve [26]. To identify the superiority of the nomogram, a decision curve analysis (DCA) was conducted [27] (Supplementary File 1). The R packages survival, survminer, caret, glmnet, timeROC, ggDCA, regplot, and rms were used for these analyses (Supplementary File 1).

Clinical characteristics, genes, immune cells, and tumor mutation burden correlation analysis
The correlations between risk score and clinical characteristics, known breast cancer genes, immune cells, and tumor mutation burden (TMB) are shown through box plots, correlation matrix, and circular plot generated by ggpubr, corrplot, and circlize R packages, respectively [29] (Supplementary File 1).

Relevance analysis between NMF clustering and risk score
To further explore the relevance between our novel typing mode and independent predictive factors, a Sankey diagram was plotted using ggalluvial, ggplot2, and dplyr R packages [30][31][32] (Supplementary File 1).

Immunohistochemistry
Immunohistochemistry results for risk score related DEGs in breast cancer were obtained from The Human Protein Atlasdatabase (THPA, https://www. proteinatlas.org/).

Chemotherapeutic and immunotherapeutic response prediction
We predicted the drug response in breast cancer patients based on the Genomics of Drug Sensitivity in Cancer database [33]. Six common chemotherapeutic and immunotherapeutic drugs (paclitaxel, cytarabine, camptothecin, lapatinib, erlotinib, and gefitinib) were selected for the analysis. The R package pRRophetic was used to conduct the analysis [34] (Supplementary File 1). The inhibitory concentration (IC50) was assessed to determine the drug sensitivities. ComBat was used to adjust for batch effects, and the average expression was calculated for repeated genes. AGING

Statistical analysis
Statistical analysis was performed using R-x64-4.1.1 and perl-5.32 (Supplementary File 1). Data are presented as means ± standard deviation. Differences with P < 0.05 and FDR < 0.05 were considered statistically significant.

Ethical statement
TCGA, GEO, THPA, and GSEA belong to public databases. Patients involved in them all had ethical approval. All analyses of us were based on them, therefore, no ethical issues existed.

Differentially expressed genes in breast cancer
The expression of 1086 DEGs was founded in the TCGA-BRCA cohort by comparing breast cancer tissue (1096 tumor samples) with normal tissue (112 normal breast samples) (Figure 2A, 2B).

Non-negative matrix factorization clustering analysis
First, a univariate Cox regression analysis was performed using the data of the DEGs to increase the robustness of our cluster (P < 0.01). The NMF algorithm was then used to cluster the breast cancer cases based on gene expression, survival time, and survival status. We identified the optimal k value of 8 with the cophenetic correlation coefficients, and confirmed that the optimal cluster number was 8 ( Figure 4A). As shown in   Figure 4B, the boundaries of the eight subtypes (C1-C8) are clear, which indicates that the clustering is relatively reliable. According to the OS and PFS curves ( Figure 4C, 4D), C4 and C5 breast cancer is associated with the best prognosis, and C6 breast cancer patients have the worst (P < 0.01). MCP counting for eight tissue-infiltrating immune cell types (B lineage, monocytic lineage, cytotoxic lymphocytes, neutrophils, myeloid dendritic cells, NK cells, T cells, and CD8+ T cells) and two stromal cell types (fibroblasts and endothelial cells) was performing using the R package MCPcounter (Supplementary File 1). We found that C1 had the greatest abundance of these cells, except neutrophils, in the microenvironment (Figures 4, 5). Moreover, fibroblasts, endothelial cells, and neutrophils were significantly high in C5, while tissueinfiltrating immune cells were significantly low in C6 ( Figure 5B, 5C, 5F).

Construction of the prognostic model
Based on the 523 DEGs, a univariate Cox analysis was first performed to increase the stability of the results. Subsequently, LASSO regression analysis was performed to identify the 17 DEGs included in the risk assessment model ( Figure 6A, 6B). The risk score was calculated according to the following formula: The cut-off value used to divide patients into the low-and high-risk breast cancer was −2.025. To further verify the accuracy of the risk score predictor, ROC analysis was performed ( Figure 6C-6E). The 1-, 2-, and 3-year AUC values for the whole TCGA cohort were 0.729, 0.778, and 0.781, respectively, while the 1-, 2-, and 3-year AUC values for the TCGA training cohort were 0.805, 0.782, and 0.793, respectively, and the 1-, 2-, and 3-year AUC values for the TCGA test cohort were 0.627, 0.770, and 0.765, respectively. According to the K-M plotter, survival differences between the low-and high-risk breast cancer groups in the whole TCGA cohort, TCGA training cohort, and TCGA test cohort were significant (P<0.01) ( Figure 6F-6H). In addition, significant differences were also observed in the age >60, age ≤60, stage I-II, and stage III-IV subsets ( Figure 7). Moreover, the univariate and multivariate Cox analyses of the relationship between the risk score and clinical characteristics demonstrated that the risk score was an independent predictor of breast cancer (univariate Cox regression: hazard ratio [HR] = 1.14, 95% confidence interval [CI] = 1.10 − 1.17, P < 0.01; multivariate Cox regression: hazard ratio = 1.13, 95% confidence interval = 1.09 − 1.17, P < 0.01) ( Table 1). Although the risk score is an independent factor, compared with other clinical characteristics, the AUC value of the risk score was not the highest ( Figure 8A). To further improve predictive ability, a nomogram model was constructed using the clinical characteristics, NMF clustering, and risk score. According to the nomogram model, each patient has a total score that can predict the 1-, 3-, and 5-year survival rates ( Figure 8B). The calibration curves showed that the nomogram-predicted 1-, 3-, and 5-year OS probabilities were close to the actual OS ( Figure 8C). Moreover, the results of DCA showed that our nomogram was the best predictor ( Figure 8D).

Factors correlated with the risk score
We researched the correlations between the risk score and clinical characteristics and found that age ( Figure  9C), M stage ( Figure 9D), N stage ( Figure 9E), T stage ( Figure 9F), and clinical stage ( Figure 10A) were significantly associated with the risk score. We then selected 12 known breast cancer-related genes and analyzed their correlations with the risk score. As shown in Figure 10B, significant negative correlations existed between the risk score and TP53, KIT, MCL1, MAP3K1, JAK1, PDCD1, CTLA4, and CD274. Infiltrating T cells, CD8 + T cells, cytotoxic  lymphocytes, B lineage cells, NK cells, monocytic lineage cells, and myeloid dendritic cells were significantly negatively correlated with the risk score, while the fibroblasts were significantly positive correlated to the risk score ( Figure 10C). In addition, we did not find a significant correlation between the risk score and TMB ( Figure 10D).

Correlation between NMF clustering and the risk score
To connect the NMF clustering and risk score, we generated a Sankey diagram ( Figure 10E). The plot showed that more dead patients had high-risk scores, which further demonstrated the reliability of the risk score. Furthermore, the C2 and C6 subtypes mainly contained high-risk scores breast cancer and had the worse prognoses according to the survival curves, while the C4 and C5 subtypes included more low-risk breast cancer had better prognoses. These results demonstrated the accuracy of both the novel typing mode and the predictive factor.

Differences in drug response between the low-and high-risk breast cancer groups
As shown in Figure 12, the high-risk breast cancer cohort showed a significantly higher response to paclitaxel, cytarabine, camptothecin, erlotinib, and gefitinib, while the low-risk cohort showed a significantly better response to lapatinib.

DISCUSSION
Breast cancer is a heterogeneous disease with a high incidence rate and poor prognosis [35]. As the molecular mechanism of breast cancer is complex, continuous studies aim to identify better molecular typing of breast cancer. Wang et al. [7] developed a five-gene (EDN2, CLEC3B, SV2C, WT1, and MUC2) prognostic signature using LASSO Cox regression analysis. Gao et al. [6] developed a pyroptosis-related lncRNA-associated (AC121761.2, AC027307.2, LINC01871, U73166.1, AL513477.2, AC005034.5 and AL451085.2) predictive model using LASSO Cox regression analysis. However, these risk models only considered the molecular status. Indeed, breast cancer is a complicated disease that requires additional considerations. Our nomogram is a comprehensive prognostic prediction tool that includes clinical characteristics, NMF clustering-based typing, and the risk score. Many multigene analysis-based models have been published in the last decade [36][37][38]. NMF clustering is a novel typing method that is rarely used in breast cancer and with which we can achieve a more detailed typing to predict more accurate prognoses for breast cancer patients.
Based on the shared DEG expression and survival data of breast cancer patients, eight novel subtypes were identified. According to the K-M survival plots, C4 and C5 breast cancer had a better OS and PFS, whereas C6 had an obviously poor prognosis. Notably, C4 and C5 breast cancer had high neutrophil infiltration, while C6 breast cancer had lower levels of neutrophil infiltration. As one of the most important immune cell types, neutrophils play a vital role in cancer progression, such as by directly eliminating cancer cells, releasing factors that affect the tumor microenvironment (TME), and producing reactive oxygen and nitrogen species [39]. These anti-tumor effects might account for the better prognosis of C4 and C5 breast cancer. Moreover, we found that C4 and C5 breast cancer had a low number of monocytes and that C6 breast cancer had highest monocyte infiltration. Mononuclear cells are precursors of tumor-associated macrophages (TAMs) which comprise the most abundant proportion of tumorinfiltrating immune cells [40]. Substantial evidence showed that TAMs are highly associated with poor prognosis in cancer [41,42]. The potential mechanism of TAMs is complicated and includes tumor promotion, an increase in cancer resistance, and promotion of cancer cell migration [43][44][45][46]. Undoubtedly, mononuclear cells are important in the breast cancer microenvironment, as they have the    To construct a stable predictive model, we then conducted LASSO Cox regression analysis with DEGs to divide the breast cancer cases into low-and highrisk subsets. Finally, 17 DEGs were included in the risk score calculation. Next, we reviewed previous studies of these 17 DEGs (Table 2). Although the functions of some of these 17 DEGs were unclear and even controversial, and not all of them were reported to be related to breast cancer, we plan to perform additional research on these DEGs. To explore the potential signaling pathways among these 17 DEGs in the low-and high-risk groups, we performed the GSEA, from which we found that high-risk breast cancer was associated with histidine metabolism, protein export, and steroid hormone biosynthesis. Matboli et al. [47] demonstrated that histidine-rich glycoprotein expression was higher in basal-like breast cancer than in the normal-like subtype, while other evidence demonstrated that the basal-like breast cancer subtype had a worse prognosis [48]. Furthermore, Saha et al. [49] reported that steroid hormone receptors can drive cell cycle regulation and breast cancer progression, thereby controlling tumor proliferation. These enriched pathways suggested that the disease included in the high-risk group was more invasive and was associated with poorer survival than that in the low-risk group.
After further analysis, we found that the risk score was significantly correlated with clinical characteristics. Patients ≤60 years of age with stage I-II, M0, and N0-1 breast cancer had significantly lower risk score than those >60 years with stage III-IV, M1, and N2-3 breast cancer. Moreover, we found that T4 breast cancer had the highest risk score, followed by T2-3 and T1. Notably, no significant differences were identified between stages I and II, stages III and IV, T2 and T3, N0 and N1, or N2 and N3. We performed a correlation analysis for the risk score and several known breast cancer-associated genes, after which we found that TP53, KIT, MCL1, MAP3K1, JAK1, PDCD1, CTLA4, and CD274 were significantly negatively correlated with the risk score. The risk score was also negatively correlated with T cells, CD8+ T cells, cytotoxic lymphocytes, B lineage cells, NK cells, monocytic lineage cells, and myeloid dendritic cells, and was positively correlated with fibroblasts. From this analysis, we believed that the risk score was strongly associated with the TME. In breast cancer, many immune-related pathways are abnormally regulated, thereby influencing the microenvironment, such as through immune cell infiltration [50]. Based on current literature, some believe that the TME is a potential treatment target in breast cancer [51,52]. To further demonstrate this conclusion, a drug sensitivity analysis was performed through which we identified significantly higher responses to agents, with the exception of lapatinib, in high-risk breast cancer. This special phenomenon was similar to what was observed in triple-negative breast cancer, which exhibited good chemotherapeutic sensitivity but poor prognosis [53][54][55]. Obviously, the risk score was strongly associated with the TME and was shown to predict chemotherapeutic and immunotherapeutic responses in breast cancer. However, no significant correlation was found between the risk score and breast cancer TMB, which requires additional research.
Based on these analyses, we believe that the risk score is valuable, but nevertheless, as a clinical predictive model, it should be better. Therefore, to further refine our model, we considered the risk score, clinical characteristics, and NMF clustering and constructed a nomogram. By summing the scores of these terms, we can predict the 1-, 3-, 5-year survival for every breast cancer patient, which might be a more precise and stable method. However, our analysis still has some limitations. First, our analysis only uses data from online databases, more real-word data are needed to further confirm our findings. Second, the detection of 17 DEGs would cost more than a model with fewer genes. However, our final prediction model has been confirmed to be effective, and since there are several other clinical prediction tools that require detection of more than 17 genes, our model is acceptable. Since breast cancer has varied subtypes, more detailed prediction models for the different subtypes should be constructed. We believe that these prediction models will be more powerful and cost-effective. Finally, although the HR of a single risk score was not very high, significant differences in OS and PFS as well as in AUC values were observed between the low-and high-risk score groups. To further improve the power of prediction model, we constructed a model using NMF clustering, the risk score, and clinical characteristics. Fortunately, our final nomogram model is shown to be better than any single factor.
In this study, we developed a novel predictive model using NMF clustering, clinical characteristics of breast cancer, and risk score based on 17 DEGs. The model was verified by randomly dividing the TCGA cohort into training and test cohorts, and separately analyzing the survival differences between low-and high-risk groups in these cohorts. Differences were observed in immune cell infiltration, clinical correlation, potential signaling pathways, and drug sensitivity. AGING NFE2 [78] Nuclear factor, erythroid 2 CR450284 NFE2 promotes breast cancer cell growth in the bone microenvironment, which leads to bone metastasis; enhances expression of Wnt-related molecules.
MMP25 [80] Matrix metallopeptidase 25 HF584190 High expression of MMP25 in head and neck cancer is associated with a worse prognosis; MMP25 is related to apoptosis, the KRS signaling pathway, the PI3K/AKT/mTOR signaling pathway, and the JAK/STAT signaling pathway. DTX1 [81,82] Deltex E3 ubiquitin ligase 1 KT584324 DTX1 is a regulator of the Notch signaling pathway and acts as an E3 ubiquitin ligase that can repress Notch gene expression and inhibit earlystage non-small cell lung carcinoma growth. AGING

AUTHOR CONTRIBUTIONS
Kai Hong and Yingjue Zhang designed this study. Lingli Yao and Jiabo Zhang completed the data acquisition. Yu Guo and Xianneng Sheng performed work involving software. Kai Hong and Yingjue Zhang conducted the data analyses and manuscript writing. Lingli Yao and Yu Guo edited the manuscript. The final paper was checked by all authors.