Research Paper Volume 15, Issue 21 pp 11782–11810

Hub gene identification and molecular subtype construction for Helicobacter pylori in gastric cancer via machine learning methods and NMF algorithm

Lianghua Luo1,2,*, Ahao Wu1,2,*, Xufeng Shu1,2, Li Liu1,2, Zongfeng Feng1,2, Qingwen Zeng1,2, Zhonghao Wang1,2, Tengcheng Hu1,2, Yi Cao1, Yi Tu3, Zhengrong Li1

  • 1 Department of General Surgery, The First Affiliated Hospital of Nanchang University, Nanchang, Jiangxi, China
  • 2 Medical Innovation Center, The First Affiliated Hospital of Nanchang University, Nanchang, Jiangxi, China
  • 3 Department of Pathology, The First Affiliated Hospital of Nanchang University, Nanchang, Jiangxi, China
* Equal contribution

Received: February 28, 2023       Accepted: July 19, 2023       Published: September 26, 2023

Copyright: © 2023 Luo et al. This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY .0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.


Helicobacter pylori (HP) is a gram-negative, spiral-shaped bacterium that colonizes the human stomach and is a recognized risk factor for gastritis, peptic ulcer disease, and gastric cancer (GC). It has also been classified as a class I carcinogen that affects the occurrence and progression of GC by inducing various oncogenic pathways. Identifying the key HP-related genes is therefore crucial for understanding these oncogenic mechanisms and improving the outcomes of GC patients. We retrieved HP-related gene sets from the Molecular Signatures Database. Based on these HP-related genes, unsupervised non-negative matrix factorization (NMF) clustering was used to stratify the TCGA-STAD, GSE15459, and GSE84433 samples into two clusters with distinct clinical outcomes and immune infiltration characteristics. Subsequently, two machine learning (ML) strategies, support vector machine-recursive feature elimination (SVM-RFE) and random forest (RF), were employed to identify twelve hub HP-related genes. Receiver operating characteristic and Kaplan-Meier curves further confirmed the diagnostic value and prognostic significance of the hub genes, and their expression was verified by qRT-PCR array and immunohistochemical images. In addition, functional pathway enrichment analysis indicated that these hub genes are implicated in the genesis and progression of GC by activating or inhibiting classical cancer-associated pathways, such as epithelial-mesenchymal transition, cell cycle, apoptosis, and RAS/MAPK. In summary, we constructed a novel HP-related tumor classification across different datasets and screened out twelve hub genes using ML algorithms, which may contribute to the molecular diagnosis and personalized therapy of GC.
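
As an illustration of the NMF-based stratification step, the following is a minimal sketch and not the authors' pipeline. It assumes NumPy, a synthetic non-negative genes × samples matrix standing in for HP-related gene expression, and a hand-written multiplicative-update solver; the function names `nmf` and `assign_clusters`, the rank of 2, and all numeric settings are hypothetical choices made for the example.

```python
import numpy as np

def nmf(V, rank, n_iter=500, seed=0, eps=1e-9):
    """Approximate a non-negative matrix V (genes x samples) as W @ H
    using Lee-Seung multiplicative updates (Frobenius objective)."""
    rng = np.random.default_rng(seed)
    n_genes, n_samples = V.shape
    W = rng.random((n_genes, rank)) + eps    # gene loadings per component
    H = rng.random((rank, n_samples)) + eps  # component weights per sample
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def assign_clusters(W, H):
    """Assign each sample to the component with the largest weight,
    after moving the column scale of W into H so rows are comparable."""
    scale = W.sum(axis=0)
    return (H * scale[:, None]).argmax(axis=0)

# Synthetic stand-in data (an assumption, not real expression values):
# 40 "genes", 60 "samples", two groups with different high-expression genes.
rng = np.random.default_rng(42)
n_genes, n_per_group = 40, 30
V = rng.random((n_genes, 2 * n_per_group)) * 0.1
V[:20, :n_per_group] += 1.0   # genes 0-19 elevated in the first group
V[20:, n_per_group:] += 1.0   # genes 20-39 elevated in the second group

W, H = nmf(V, rank=2)
labels = assign_clusters(W, H)
print("cluster sizes:", np.bincount(labels, minlength=2))
```

In practice the number of clusters would be chosen from stability measures over several ranks and repeated runs rather than fixed in advance, and the resulting labels would then be compared on survival and immune infiltration.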


HP: Helicobacter pylori; GC: Gastric cancer; TCGA: The Cancer Genome Atlas; GEO: Gene Expression Omnibus; MSigDB: Molecular Signatures Database; NMF: Non-negative matrix factorization; STAD: Stomach adenocarcinoma; SVM-RFE: Support vector machine-recursive feature elimination; RF: Random forest; ML: Machine learning; qRT-PCR: Quantitative reverse transcription polymerase chain reaction; IHC: Immunohistochemical; ROC: Receiver operating characteristic; K-M: Kaplan-Meier; EMT: Epithelial-mesenchymal transition; IARC: International Agency for Research on Cancer; AI: Artificial intelligence; CNV: Copy-number variant; FC: Fold change; PCA: Principal component analysis; IC50: Half-maximal inhibitory concentration; AUC: Area under the ROC curve; GSCA: Gene Set Cancer Analysis; GDSC: Genomics of Drug Sensitivity in Cancer; PPI: Protein-protein interaction; GAPDH: Glyceraldehyde 3-phosphate dehydrogenase; GO: Gene Ontology; KEGG: Kyoto Encyclopedia of Genes and Genomes; TME: Tumor microenvironment; TMB: Tumor mutation burden; SNV: Single-nucleotide variant; SNP: Single-nucleotide polymorphism; RTK: Receptor tyrosine kinase; CAMs: Cell adhesion molecules.