A machine-learning approach for nonalcoholic steatohepatitis susceptibility estimation

Ghadiri F., Husseini A. A., ÖZTAŞ O.

INDIAN JOURNAL OF GASTROENTEROLOGY, vol.41, no.5, pp.475-482, 2022 (ESCI)

  • Publication Type: Article
  • Volume: 41 Issue: 5
  • Publication Date: 2022
  • Doi Number: 10.1007/s12664-022-01263-2
  • Journal Indexes: Emerging Sources Citation Index (ESCI), Scopus, CAB Abstracts, EMBASE, MEDLINE
  • Page Numbers: pp.475-482
  • Keywords: Algorithm, Artificial intelligence, Disease susceptibility, Fatty liver, Gene, Machine learning, Neural network model, Nonalcoholic fatty liver disease, Nonalcoholic steatohepatitis, Single nucleotide polymorphism, Support vector machine, THAN-G POLYMORPHISM, RISK, NAFLD
  • Istanbul University Affiliated: No


Background Nonalcoholic steatohepatitis (NASH), a severe form of nonalcoholic fatty liver disease, can lead to advanced liver damage and has become an increasingly prominent health problem worldwide. Predictive models that identify high-risk individuals early could help target preventive and interventional measures. Traditional epidemiological models, which are based on statistical analysis, have limited predictive power. In the current study, a novel machine-learning approach was developed to predict individual NASH susceptibility from candidate single nucleotide polymorphisms (SNPs).
Methods A total of 245 NASH patients and 120 healthy individuals were included in the study. Genotypes of five candidate SNPs, namely two in the cytochrome P450 family 2 subfamily E member 1 (CYP2E1) gene (rs6413432, rs3813867), two in the glucokinase regulator (GCKR) gene (rs780094, rs1260326), and rs738409 in the patatin-like phospholipase domain-containing 3 (PNPLA3) gene, together with gender, were used to develop models for identifying at-risk individuals. Nine machine-learning models were constructed to predict an individual's susceptibility to NASH. These models applied two feature-selection methods, Chi-square and support vector machine recursive feature elimination (SVM-RFE), alongside an all-features input, to three classification algorithms: k-nearest neighbor (KNN), multi-layer perceptron (MLP), and random forest (RF). All nine models were trained on 80% of the data from both the NASH patients and the healthy controls and then tested on the remaining 20% of both groups. Model performance was compared in terms of accuracy, precision, sensitivity, and F measure.
Results Among the nine models, the KNN classifier with all features as input showed the highest performance, with an F measure of 86% and an accuracy of 79%.
Conclusions Machine learning based on genomic variation may be applicable for estimating an individual's susceptibility to developing NASH among high-risk groups with a high degree of accuracy, precision, and sensitivity.
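
The experimental design described in the Methods can be sketched in a few lines of standard-library Python. This is only an illustration, not the authors' code: it enumerates the nine model configurations, performs a per-class 80/20 split of the reported cohort sizes, and defines the four reported metrics. The function names (`model_grid`, `stratified_split`, `binary_metrics`), the random seed, the treatment of "all features" as a third feature input, and the per-class rounding of the split are assumptions; the abstract does not report the exact train and test counts.

```python
from itertools import product
import random

# Cohort sizes reported in the abstract.
N_NASH, N_CONTROL = 245, 120

# Feature inputs and classifiers named in the abstract. Treating "all features"
# as a third input alongside the two selection methods is an assumption.
FEATURE_INPUTS = ["all features", "chi-square", "SVM-RFE"]
CLASSIFIERS = ["KNN", "MLP", "RF"]


def model_grid():
    """Return every (feature input, classifier) pair: 3 x 3 = 9 models."""
    return list(product(FEATURE_INPUTS, CLASSIFIERS))


def stratified_split(labels, train_frac=0.8, seed=0):
    """Split sample indices into train/test, keeping train_frac of each class."""
    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(train_frac * len(idx))
        train.extend(idx[:cut])
        test.extend(idx[cut:])
    return train, test


def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, sensitivity (recall), and F measure for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + sensitivity
    f_measure = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "sensitivity": sensitivity,
        "f_measure": f_measure,
    }


labels = [1] * N_NASH + [0] * N_CONTROL  # 1 = NASH, 0 = healthy control
train_idx, test_idx = stratified_split(labels)
```

Under these assumptions the split yields 196 NASH patients and 96 controls for training, and 49 NASH patients and 24 controls for testing. Calling `binary_metrics` with the test labels and a model's predictions would give the four quantities on which the abstract compares the models.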