Artificial intelligence in health data analysis: The Darwinian evolution theory suggests an extremely simple and zero-cost large-scale screening tool for prediabetes and type 2 diabetes

https://doi.org/10.1016/j.diabres.2021.108722

Abstract

Aims

The effective identification of individuals with early dysglycemia is key to reducing the incidence of type 2 diabetes. We develop and validate a novel zero-cost tool that significantly simplifies the screening of undiagnosed dysglycemia.

Methods

We use NHANES cross-sectional data over 10 years (2007–2016) to derive an equation that links non-laboratory exposure variables to the possible presence of undetected dysglycemia. For the first time, we adopt a novel artificial intelligence approach based on Darwinian evolutionary theory to analyze health data. We collected data for 47 variables.

Results

Age and waist circumference are the only variables required to use the model. To identify undetected dysglycemia, we obtain an area under the curve (AUC) of 75.3%. Using the optimal threshold value determined from external validation data, sensitivity and specificity are 0.65 and 0.73, respectively.
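As a minimal sketch of how figures of this kind are computed, the Python snippet below evaluates the sensitivity and specificity of a continuous risk score at a given threshold, estimates the AUC as the probability that a dysglycemic individual scores higher than a normoglycemic one, and selects a threshold by maximizing Youden's J (sensitivity + specificity − 1). The toy scores and labels, the function names, and the choice of Youden's J are illustrative assumptions; the abstract does not state which criterion was used to define the optimal threshold.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(scores: Sequence[float], labels: Sequence[int],
                            threshold: float) -> Tuple[float, float]:
    """Treat score >= threshold as a positive call (label 1 = dysglycemia)."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if y == 1 and s >= threshold)
    fn = sum(1 for s, y in pairs if y == 1 and s < threshold)
    tn = sum(1 for s, y in pairs if y == 0 and s < threshold)
    fp = sum(1 for s, y in pairs if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def youden_threshold(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Assumed criterion: the observed score that maximizes Youden's J."""
    def youden_j(t: float) -> float:
        se, sp = sensitivity_specificity(scores, labels, t)
        return se + sp - 1.0
    return max(sorted(set(scores)), key=youden_j)


# Hypothetical toy data, not taken from NHANES or from the paper.
toy_scores = [0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.90]
toy_labels = [0, 0, 1, 0, 0, 1, 1, 1]

best_t = youden_threshold(toy_scores, toy_labels)
print(best_t, sensitivity_specificity(toy_scores, toy_labels, best_t))
print(auc(toy_scores, toy_labels))
```

In practice the threshold would be chosen on one dataset and the sensitivity and specificity reported on another, as the abstract indicates with its reference to external validation data.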

Conclusions

Using only two variables yields a zero-cost screening tool whose precision is comparable to that of more complex tools widely adopted in the literature. The newly developed tool is clinically useful because it significantly simplifies the screening of dysglycemia. Furthermore, we suggest that defining an age-related waist circumference cut-off might help to improve existing diabetes risk factors.

Introduction

Risk-based screening tools for identifying individuals at increased risk of developing diabetes have become very popular in recent decades [1], [2], [3], thanks in part to the development of health data analysis methods based on artificial intelligence algorithms [4], [5], [6]. The early identification of high-risk individuals is a crucial task for public health administrations in order to reduce the incidence of several fatal and nonfatal complications among the population [1]. For example, it is known that individuals with prediabetes have an increased future risk of cardiovascular disease [7], [8] and a 3–12 times higher annual diabetes incidence than individuals with normoglycemia [9], [10], [11]. Furthermore, type 2 diabetes, which currently represents over 95% of diabetes cases and affects about 26.8 million people in the USA [12], [13], [14], is a major risk factor for death.

Recent studies have shown that lifestyle modification or early pharmacological intervention can reduce the risk of developing dysglycemia, i.e. prediabetes or diabetes [15], [16], [17]. To this end, the clinical guidelines of several countries, including the USA [12], [18], often recommend directing suitable treatments towards high-risk individuals identified by models that evaluate risk factors. However, despite the accuracy reached by state-of-the-art risk-based prediction models and the effectiveness of primary prevention approaches in reducing the risk of dysglycemia, in 2018, 88 million Americans (34.5% of the entire US population) were affected by prediabetes while 7.3 million (2.8%) had undiagnosed diabetes [14].

Identifying individuals with undetected dysglycemia is anything but trivial, as it usually involves invasive and costly procedures [1], [19]. For this reason, alongside models based on longitudinal cohort studies for the prediction of the future risk associated with diabetes [2], [20], tools capable of detecting the current dysglycemia status of an individual based on the evaluation of zero-cost variables are highly desirable. They can serve as a pre-screening step for more accurate diagnostic methods [12], [21]. To this end, detailed analyses of cross-sectional studies are required. In this framework, many recent works have focused on obtaining dysglycemia screening tools for the US national population via variables that do not involve laboratory tests or that involve non-invasive laboratory examinations [4], [5], [6], [22], [23], [24], [25], [26], [27], [28], [29], [30], [31]. However, the resulting tools are often impractical for large-scale screening, as they rely on a large number of exposure variables connected by non-trivial relations or are not even expressed by a mathematical equation.

To address the need for a practical tool for the screening of dysglycemia in the US population based exclusively on zero-cost non-laboratory variables, we performed an innovative analysis of US National Health and Nutrition Examination Survey (NHANES) data from 2007 to 2016 using a unique artificial intelligence approach. The methodology is novel in that it hybridizes state-of-the-art machine learning techniques: genetic programming, which implements a scheme derived from Darwinian evolutionary theory, and artificial neural networks [32], [33]. The combination of these techniques makes it possible, for the first time, to perform a global and totally unbiased search for the optimal mathematical expression that connects non-laboratory variables, previously suggested as good dysglycemia predictors, with the undetected presence of prediabetes or undiagnosed type 2 diabetes.
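To give a feel for the genetic programming ingredient alone, the following Python sketch evolves small arithmetic expressions so that they fit a toy target. It is not the authors' hybrid method: it contains no neural network, uses the prediction error as its only selection criterion, and every name in it (the inputs `x` and `y`, the constants, the toy target `x + 2*y`, the population settings) is a hypothetical choice made for illustration.

```python
import math
import operator
import random

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}
TERMINALS = ["x", "y", 1.0, 2.0]  # hypothetical inputs and constants
MAX_NODES = 15                    # size cap that keeps expressions simple


def random_tree(rng, depth):
    """Grow a random expression: a terminal or a tuple (op, left, right)."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(TERMINALS)
    op = rng.choice(sorted(OPS))
    return (op, random_tree(rng, depth - 1), random_tree(rng, depth - 1))


def evaluate(tree, x, y):
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPS[op](evaluate(left, x, y), evaluate(right, x, y))
    if tree == "x":
        return x
    if tree == "y":
        return y
    return tree  # numeric constant


def size(tree):
    if isinstance(tree, tuple):
        return 1 + size(tree[1]) + size(tree[2])
    return 1


def error(tree, data):
    """Mean squared error of the expression over (x, y, target) rows."""
    total = 0.0
    for x, y, target in data:
        diff = evaluate(tree, x, y) - target
        total += diff * diff
    mse = total / len(data)
    return mse if math.isfinite(mse) else math.inf


def random_subtree(rng, tree):
    while isinstance(tree, tuple) and rng.random() < 0.7:
        tree = rng.choice(tree[1:])
    return tree


def replace_random(rng, tree, new):
    """Return a copy of tree with one randomly chosen subtree replaced."""
    if not isinstance(tree, tuple) or rng.random() < 0.3:
        return new
    op, left, right = tree
    if rng.random() < 0.5:
        return (op, replace_random(rng, left, new), right)
    return (op, left, replace_random(rng, right, new))


def evolve(data, generations=30, pop_size=60, seed=0):
    rng = random.Random(seed)
    population = [random_tree(rng, 3) for _ in range(pop_size)]
    history = []  # best error seen at the start of each generation
    for _ in range(generations):
        population.sort(key=lambda t: error(t, data))
        history.append(error(population[0], data))
        survivors = population[: pop_size // 3]  # selection with elitism
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.choice(survivors), rng.choice(survivors)
            child = replace_random(rng, a, random_subtree(rng, b))   # crossover
            child = replace_random(rng, child, random_tree(rng, 2))  # mutation
            if size(child) <= MAX_NODES:
                children.append(child)
        population = survivors + children
    best = min(population, key=lambda t: error(t, data))
    history.append(error(best, data))
    return best, history


# Hypothetical toy target, unrelated to the published model.
toy_data = [(float(x), float(y), x + 2.0 * y) for x in range(4) for y in range(4)]
best, history = evolve(toy_data)
print(best, history[0], history[-1])
```

Because the fittest third of each generation is carried over unchanged, the best error in this sketch can never increase from one generation to the next. The functional form of the result is not fixed in advance, which is the sense in which a search of this kind makes no a priori assumption about the mathematical expression.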

Data source

To develop a model capable of identifying undetected dysglycemia in the US national population, we analyzed cross-sectional data from successive NHANES cycles over 10 years (2007–2016). Individuals were categorized as either normoglycemic or dysglycemic. The latter category included prediabetes and undetected diabetes. We adopted the criteria of the American Diabetes Association (ADA) to define dysglycemia [12]. We classified an individual as being diabetic using fasting plasma glucose

Results

We collected data on 50,588 participants from the 2007–2016 US NHANES. Of these, 5922 individuals aged more than 20 years were eligible for the preliminary dataset. The large number of ineligible patients was predominantly due to missing data in FPG, OGTT or HbA1c, while 5577 participants were excluded because of missing data in one of the variables listed in Table 1. The mean age of the participants was 49 years, and they were evenly distributed between male (2986) and female (2936) sex. A close inspection of

Discussion

Our novel artificial intelligence approach based on genetic programming and artificial neural networks allowed us to determine the most advantageous equation relating the presence of undetected dysglycemia to zero-cost non-laboratory variables using US NHANES data. No a priori assumptions were made regarding the mathematical expression of the model, thus enabling a completely unbiased search. Evolutionary criteria used to derive the model were the error associated with the model prediction and

Conclusions

The early identification of adults at increased risk of developing dysglycemia has been shown to be a crucial task for public health administrations in order to reduce the prevalence of type 2 diabetes among the population. Despite the large effort devoted to the development of risk prediction models, the prevalence of type 2 diabetes in the US population is currently increasing, making it fundamental to develop practical and effective tools for large-scale screening of undetected

Acknowledgments

Data used in this study were collected by the National Health and Nutrition Examination Survey (NHANES) and are freely and publicly available on the website of the National Center for Health Statistics of the Centers for Disease Control and Prevention (CDC). D.D. acknowledges funding support from the Italian Ministry of Education, University and Research (MIUR) through the “PON Ricerca e Innovazione 2014-2020, Azione I.2 A.I.M., D.D. 407/2018”.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (54)

  • M.D. Jensen et al. 2013 AHA/ACC/TOS guideline for the management of overweight and obesity in adults: a report of the American College of Cardiology/American Heart Association Task Force on Practice Guidelines and The Obesity Society. J Am Coll Cardiol (2014)
  • R.R. Kalyani et al. Age-related and disease-related muscle loss: the effect of diabetes, obesity, and other diseases. Lancet Diab Endocrinol (2014)
  • R. Taylor. Type 2 diabetes etiology and reversibility. Diab Care (2013)
  • J. Lindstrom et al. The diabetes risk score: a practical tool to predict type 2 diabetes risk. Diab Care (2003)
  • D. Noble et al. Risk models and scores for type 2 diabetes: systematic review. BMJ (2011)
  • B. Buijsse et al. Risk assessment tools for identifying individuals at risk of developing type 2 diabetes. Epidemiol Rev (2011)
  • K. De Silva et al. A combined strategy of feature selection and machine learning to identify predictors of prediabetes. J Am Med Inform Assoc (2019)
  • A. Dinh et al. A data-driven approach to predicting diabetes and cardiovascular disease with machine learning. BMC Med Inf Decis Making (2019)
  • W. Yu et al. Application of support vector machine modeling for prediction of common diseases: the case of diabetes and pre-diabetes. BMC Med Inf Decis Making (2010)
  • The Emerging Risk Factors Collaboration. Glycated hemoglobin measurement and prediction of cardiovascular disease. JAMA (2014)
  • D.H. Morris et al. Progression rates from HbA1c 6.0–6.4% and other prediabetes definitions to type 2 diabetes: a meta-analysis. Diabetologia (2013)
  • American Diabetes Association. 2. Classification and diagnosis of diabetes: standards of medical care in diabetes–2020. Diab Care (2020)
  • A. Menke et al. The prevalence of type 1 diabetes in the United States. Epidemiology (2013)
  • Centers for Disease Control and Prevention. National diabetes statistics report; 2020....
  • W.C. Knowler et al. Reduction in the incidence of type 2 diabetes with lifestyle intervention or metformin. N Engl J Med (2002)
  • J. Tuomilehto et al. Prevention of type 2 diabetes mellitus by changes in lifestyle among subjects with impaired glucose tolerance. N Engl J Med (2001)
  • A.L. Siu. U.S. Preventive Services Task Force. Screening for abnormal blood glucose and type 2 diabetes mellitus: U.S. Preventive Services Task Force Recommendation Statement. Ann Intern Med (2015)