Predicting the predisposition to colorectal cancer based on SNP profiles of immune phenotypes using supervised learning models

Çakmak, Ali; Ayaz, Huzeyfe; Arikan, Soykan; Ibrahimzada, Ali; Demirkol, Seyda; Sonmez, Dilara; Hakan, Mehmet; Surmen, Saime; Horozoglu, Cem; Dogan, Mehmet; Kucukhuseyin, Özlem; Cacina, Canan; KIRAN, BAYRAM; Zeybek, Şakir; Baysan, Mehmet; Yaylim, İlhan

doi:10.1007/s11517-022-02707-9

Çakmak A., Ayaz H., Arikan S., Ibrahimzada A. R., Demirkol S., Sonmez D., ...

MEDICAL & BIOLOGICAL ENGINEERING & COMPUTING, vol.61, no.1, pp.243-258, 2023 (SCI-Expanded)

Publication Type: Article / Full Article
Volume: 61 Issue: 1
Publication Date: 2023
DOI: 10.1007/s11517-022-02707-9
Journal Name: MEDICAL & BIOLOGICAL ENGINEERING & COMPUTING
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Academic Search Premier, ABI/INFORM, Applied Science & Technology Source, BIOSIS, Biotechnology Research Abstracts, Business Source Elite, Business Source Premier, CINAHL, Compendex, Computer & Applied Sciences, INSPEC
Page Numbers: pp.243-258
Keywords: Colorectal cancer, Machine learning, Classification, Cancer screening, Immune checkpoints, VALIDATION
İstanbul University Affiliated: Yes

Abstract

This study explores machine learning-based assessment of predisposition to colorectal cancer from single nucleotide polymorphisms (SNPs). Such a computational approach may serve as a risk indicator and as an auxiliary diagnostic method that complements traditional methods such as biopsy and CT scans. It may also be used to develop a low-cost screening test for the early detection of colorectal cancers to improve public health. We employ several supervised classification algorithms and apply data imputation to fill in missing genotype values. The dataset comprises SNPs observed at colorectal cancer-associated genomic loci within the DNA regions of 11 selected genes, obtained from 115 individuals. We make the following observations: (i) A random forest-based classifier using one-hot encoding and K-nearest neighbor (KNN)-based imputation performs best among the studied classifiers, with an F1 score of 89% and an area under the curve (AUC) score of 0.96. (ii) One-hot encoding together with KNN-based data imputation increases F1 scores by around 26% compared with the baseline approach that employs neither. (iii) The proposed model outperforms a commonly employed state-of-the-art approach, ColonFlag, under all evaluated settings by up to 24% in terms of AUC score. Given the high accuracy of the constructed predictive models, the 11 studied genes may be considered a candidate gene panel for colon cancer risk screening.
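
The preprocessing named in the abstract (KNN-based imputation of missing genotypes, followed by one-hot encoding) can be illustrated with a minimal, dependency-free sketch. Everything below is an assumption made for illustration and is not taken from the paper: the toy genotype strings, the use of `None` for a missing call, the Hamming-style distance, the majority vote among neighbors, the choice to impute before encoding, and the function names `hamming`, `knn_impute` and `one_hot`.

```python
from collections import Counter

MISSING = None  # assumption: a missing genotype call is stored as None


def hamming(a, b):
    """Mismatch rate over SNPs observed in both profiles (1.0 if none overlap)."""
    shared = [(x, y) for x, y in zip(a, b) if x is not MISSING and y is not MISSING]
    if not shared:
        return 1.0
    return sum(x != y for x, y in shared) / len(shared)


def knn_impute(profiles, k=3):
    """Fill each missing genotype with the majority call among the k nearest
    profiles that have that SNP observed (hypothetical simplification)."""
    imputed = [list(p) for p in profiles]
    for i, row in enumerate(profiles):
        for j, value in enumerate(row):
            if value is not MISSING:
                continue
            # Candidate donors: every other individual with SNP j observed.
            donors = sorted(
                (hamming(row, other), idx)
                for idx, other in enumerate(profiles)
                if idx != i and other[j] is not MISSING
            )
            votes = Counter(profiles[idx][j] for _, idx in donors[:k])
            if votes:  # leave the value missing if no donor exists
                imputed[i][j] = votes.most_common(1)[0][0]
    return imputed


def one_hot(profiles):
    """Expand each SNP column into one 0/1 indicator per genotype seen in it."""
    n_snps = len(profiles[0])
    categories = [
        sorted({row[j] for row in profiles if row[j] is not MISSING})
        for j in range(n_snps)
    ]
    encoded = []
    for row in profiles:
        vec = []
        for j, cats in enumerate(categories):
            vec.extend(1 if row[j] == c else 0 for c in cats)
        encoded.append(vec)
    return encoded, categories


# Toy data: 4 individuals x 3 SNPs, with one missing call (made-up values).
profiles = [
    ["AA", "CT", "GG"],
    ["AA", "CT", None],
    ["AG", "TT", "GA"],
    ["AA", "CT", "GG"],
]

imputed = knn_impute(profiles, k=2)
encoded, categories = one_hot(imputed)
print(imputed[1])
print(encoded[0])
```

The resulting 0/1 vectors are what a supervised classifier such as a random forest would be trained on. The abstract does not say which tools the authors used; in practice, library components such as scikit-learn's `KNNImputer`, `OneHotEncoder` and `RandomForestClassifier` would be a common choice.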