University of Oulu

Angélica Atehortúa, Polyxeni Gkontra, Marina Camacho, Oliver Diaz, Maria Bulgheroni, Valentina Simonetti, Marc Chadeau-Hyam, Janine F. Felix, Sylvain Sebert, Karim Lekadir, Cardiometabolic risk estimation using exposome data and machine learning, International Journal of Medical Informatics, Volume 179, 2023, 105209, ISSN 1386-5056,

Cardiometabolic risk estimation using exposome data and machine learning

Author: Atehortúa, Angélica (1); Gkontra, Polyxeni (1); Camacho, Marina (1);
Organizations: (1) BCN-AIM laboratory, Facultat de Matemàtiques i Informàtica, Universitat de Barcelona, Barcelona, Spain
(2) R&D Ab.Acus s.r.l., Milano, Italy
(3) Department of Epidemiology and Biostatistics, MRC-HPA Centre for Environment and Health, School of Public Health, Imperial College London, London, United Kingdom
(4) The Generation R Study Group, Erasmus MC, University Medical Center Rotterdam, Rotterdam, the Netherlands
(5) Department of Pediatrics, Erasmus MC, University Medical Center Rotterdam, Rotterdam, the Netherlands
(6) Research Unit of Population Health, Faculty of Medicine, University of Oulu, Oulu, Finland
Format: article
Version: published version
Access: open
Online Access: PDF Full Text (PDF, 2.3 MB)
Persistent link:
Language: English
Published: Elsevier, 2023
Publish Date: 2023-11-02


Background: The human exposome encompasses all exposures that individuals encounter throughout their lifetime. It is now widely acknowledged that health outcomes are influenced not only by genetic factors but also by the interactions between these factors and various exposures. Consequently, the exposome has emerged as a significant contributor to the overall risk of developing major diseases, such as cardiovascular disease (CVD) and diabetes. Therefore, personalized early risk assessment based on exposome attributes might be a promising tool for identifying high-risk individuals and improving disease prevention.

Objective: To develop and evaluate a novel and fair machine learning (ML) model for CVD and type 2 diabetes (T2D) risk prediction based on a set of readily available exposome factors. The model was evaluated using internal and external validation groups from a multi-center cohort. To be considered fair, the model was required to demonstrate consistent performance across different sub-groups of the cohort.

Methods: From the UK Biobank, we identified 5,348 and 1,534 participants who were diagnosed with CVD and T2D, respectively, within 13 years of the baseline visit. An equal number of participants who did not develop these pathologies were randomly selected as the control group. A total of 109 readily available exposure variables from six categories (physical measures, environmental, lifestyle, mental health events, sociodemographics, and early-life factors), all recorded at the participants' baseline visit, were considered. We adopted the XGBoost ensemble model to predict individuals at risk of developing the diseases. The model's performance was compared to that of an integrative ML model that is based on a set of biological, clinical, physical, and sociodemographic variables, and, additionally for CVD, to the Framingham risk score. Moreover, we assessed the proposed model for potential bias related to sex, ethnicity, and age. Lastly, we interpreted the model's results using SHAP, a state-of-the-art explainability method.
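
The control-selection step described in the Methods (an equal number of randomly selected participants without the disease) can be illustrated with a minimal sketch. This is not the authors' code: the function name `sample_balanced_controls`, the integer participant IDs, and the fixed seed are assumptions made purely for illustration.

```python
import random


def sample_balanced_controls(case_ids, candidate_ids, seed=0):
    """Randomly pick as many controls as there are cases (1:1 design).

    Hypothetical helper: the IDs and the seed are illustrative assumptions.
    """
    cases = list(case_ids)
    case_set = set(cases)
    # Controls must be participants who are not among the cases.
    pool = [pid for pid in candidate_ids if pid not in case_set]
    if len(pool) < len(cases):
        raise ValueError("not enough candidate controls to match the cases")
    rng = random.Random(seed)  # seeded so the selection is reproducible
    controls = rng.sample(pool, len(cases))
    return cases, controls


# Toy cohort: participants 0..999, of whom 0..49 are treated as cases.
cases, controls = sample_balanced_controls(range(50), range(1000), seed=42)
```

Seeding the generator keeps the sampled control group identical across runs, which matters when the same split is reused for model comparison.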

Results: The proposed ML model performs comparably to the integrative ML model despite using solely exposome information, achieving a ROC-AUC of 0.78 ± 0.01 and 0.77 ± 0.01 for CVD and T2D, respectively. Additionally, for CVD risk prediction, the exposome-based model shows improved performance over the traditional Framingham risk score. No bias in terms of key sensitive variables was identified.
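
One simple way to check for consistent performance across sub-groups is to compute ROC-AUC separately for each level of a sensitive variable. The sketch below is an illustration under assumptions, not the evaluation procedure of the paper: the abstract does not specify how bias was measured, and the function names, toy labels, scores, and group codes are all hypothetical.

```python
from collections import defaultdict


def roc_auc(labels, scores):
    """ROC-AUC as the probability that a random positive case scores
    higher than a random negative case (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are needed to compute ROC-AUC")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auc_by_group(labels, scores, groups):
    """Compute ROC-AUC separately within each sub-group."""
    buckets = defaultdict(lambda: ([], []))
    for y, s, g in zip(labels, scores, groups):
        buckets[g][0].append(y)
        buckets[g][1].append(s)
    return {g: roc_auc(ys, ss) for g, (ys, ss) in buckets.items()}


# Toy data (assumed): 1 = developed the disease, 0 = control.
labels = [1, 0, 1, 0, 1, 0, 1, 0]
scores = [0.9, 0.2, 0.7, 0.4, 0.8, 0.3, 0.35, 0.6]
groups = ["F", "F", "F", "F", "M", "M", "M", "M"]

overall = roc_auc(labels, scores)
per_group = auc_by_group(labels, scores, groups)
```

A large gap between the per-group values and the overall value would flag a sub-group on which the model performs inconsistently. The pairwise comparison is quadratic in the number of samples, so a rank-based implementation would be preferable for a cohort-sized dataset.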

Conclusions: We identified exposome factors that play an important role in identifying patients at risk of CVD and T2D, such as naps during the day, age completed full-time education, past tobacco smoking, frequency of tiredness/unenthusiasm, and current work status. Overall, this work demonstrates the potential of exposome-based machine learning as a fair CVD and T2D risk assessment tool.


Series: International journal of medical informatics
ISSN: 1386-5056
ISSN-E: 1872-8243
ISSN-L: 1386-5056
Volume: 179
Article number: 105209
DOI: 10.1016/j.ijmedinf.2023.105209
Type of Publication: A1 Journal article – refereed
Field of Science: 3121 General medicine, internal medicine and other clinical medicine
3142 Public health care science, environmental and occupational health
Funding: This work has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 874739 (LongITools project). PG and KL have additionally received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 825903 (euCanSHare project).
EU Grant Number: (874739) LONGITOOLS - Dynamic longitudinal exposome trajectories in cardiovascular and metabolic non-communicable diseases
Copyright information: © 2023 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (