University of Oulu

Luis G Leal, Alessia David, Marjo-Riita Jarvelin, Sylvain Sebert, Minna Männikkö, Ville Karhunen, Eleanor Seaby, Clive Hoggart, Michael J E Sternberg, Identification of disease-associated loci using machine learning for genotype and network data integration, Bioinformatics, Volume 35, Issue 24, 15 December 2019, Pages 5182–5190.

Identification of disease-associated loci using machine learning for genotype and network data integration

Author: Leal, Luis G.1; David, Alessia1; Järvelin, Marjo-Riita2,3,4,5,6;
Organizations: 1Department of Life Sciences, Centre for Integrative Systems Biology and Bioinformatics, Imperial College London, London SW7 2AZ, UK
2Center for Life Course Health Research, Faculty of Medicine, University of Oulu, Oulu FI-90014, Finland
3Biocenter Oulu, University of Oulu, Oulu 90220, Finland
4Unit of Primary Health Care, Oulu University Hospital, Oulu 90220, Finland
5Department of Epidemiology and Biostatistics, MRC-PHE Centre for Environment and Health, School of Public Health, Imperial College London, London W2 1PG, UK
6Department of Life Sciences, College of Health and Life Sciences, Brunel University London, Middlesex UB8 3PH, UK
7Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
8Department of Medicine, Imperial College London, London W2 1PG, UK
Format: article
Version: published version
Access: open
Online Access: PDF Full Text (PDF, 0.5 MB)
Persistent link:
Language: English
Published: Oxford University Press, 2019
Publish Date: 2020-02-26


Motivation: Integration of different omics data could markedly help to identify biological signatures, understand the missing heritability of complex diseases and ultimately achieve personalized medicine. Standard regression models used in Genome-Wide Association Studies (GWAS) identify loci with a strong effect size, whereas GWAS meta-analyses are often needed to capture weak loci contributing to the missing heritability. Development of novel machine learning algorithms for merging genotype data with other omics data is highly needed as it could enhance the prioritization of weak loci.

Results: We developed cNMTF (corrected non-negative matrix tri-factorization), an integrative algorithm based on clustering techniques of biological data. This method assesses the inter-relatedness between genotypes, phenotypes, the damaging effect of the variants and gene networks in order to identify loci-trait associations. cNMTF was used to prioritize genes associated with lipid traits in two population cohorts. We replicated 129 genes reported in GWAS world-wide and provided evidence that supports 85% of our findings (226 out of 265 genes), including recent associations in literature (NLGN1), regulators of lipid metabolism (DAB1) and pleiotropic genes for lipid traits (CARM1). Moreover, cNMTF performed efficiently against strong population structures by accounting for the individuals’ ancestry. As the method is flexible in the incorporation of diverse omics data sources, it can be easily adapted to the user’s research needs.
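The abstract does not give the update rules of cNMTF, so the following is only a minimal sketch of plain non-negative matrix tri-factorization, the general technique the method's name refers to. It approximates a non-negative matrix X by a product F S Gᵀ using standard multiplicative updates for the Frobenius loss. It does not include the ancestry correction, variant scores or gene-network terms described above, and every name in it (`nmtf`, `frobenius_error`, the toy matrix) is a hypothetical choice for illustration rather than part of the cnmtf R package. Plain Python lists are used to keep the sketch dependency-free.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def mult_update(m, numer, denom, eps=1e-9):
    """Element-wise update m * numer / denom; entries stay non-negative."""
    return [[v * n / (d + eps) for v, n, d in zip(rm, rn, rd)]
            for rm, rn, rd in zip(m, numer, denom)]


def frobenius_error(x, f, s, g):
    """Frobenius norm of X - F S G^T."""
    approx = matmul(matmul(f, s), transpose(g))
    return sum((a - b) ** 2
               for ra, rb in zip(x, approx) for a, b in zip(ra, rb)) ** 0.5


def random_matrix(rows, cols, rng):
    """Strictly positive random initial factor."""
    return [[rng.random() + 0.1 for _ in range(cols)] for _ in range(rows)]


def nmtf(x, k1, k2, n_iter=200, seed=0):
    """Factorize a non-negative n x m matrix X as F (n x k1), S (k1 x k2), G (m x k2)."""
    rng = random.Random(seed)
    n, m = len(x), len(x[0])
    f = random_matrix(n, k1, rng)
    s = random_matrix(k1, k2, rng)
    g = random_matrix(m, k2, rng)
    xt = transpose(x)
    for _ in range(n_iter):
        # F <- F * (X G S^T) / (F S G^T G S^T)
        gst = matmul(g, transpose(s))
        f = mult_update(f, matmul(x, gst),
                        matmul(f, matmul(transpose(gst), gst)))
        # G <- G * (X^T F S) / (G S^T F^T F S)
        fs = matmul(f, s)
        g = mult_update(g, matmul(xt, fs),
                        matmul(g, matmul(transpose(fs), fs)))
        # S <- S * (F^T X G) / (F^T F S G^T G)
        ft = transpose(f)
        numer = matmul(matmul(ft, x), g)
        denom = matmul(matmul(matmul(ft, f), s), matmul(transpose(g), g))
        s = mult_update(s, numer, denom)
    return f, s, g


if __name__ == "__main__":
    # Toy matrix with two row blocks and two column blocks (made-up values).
    toy = [[5, 5, 5, 0, 0]] * 3 + [[0, 0, 0, 4, 4]] * 3
    f, s, g = nmtf(toy, k1=2, k2=2)
    print("reconstruction error:", frobenius_error(toy, f, s, g))
```

In this generic form, the rows of F and G can be read as soft cluster memberships for the rows and columns of X, and S links row clusters to column clusters. How cNMTF itself defines and constrains these factors is described in the article, not here.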

Availability and implementation: An R package (cnmtf) is available at


Series: Bioinformatics
ISSN: 1367-4803
ISSN-E: 1460-2059
ISSN-L: 1367-4803
Volume: 35
Issue: 24
Pages: 5182 - 5190
DOI: 10.1093/bioinformatics/btz310
Type of Publication: A1 Journal article – refereed
Field of Science: 1182 Biochemistry, cell and molecular biology
113 Computer and information sciences
Funding: Luis G. Leal is supported by the President’s PhD Scholarship Scheme from Imperial College London. Alessia David is supported by the Wellcome Trust (grant WT/104955/Z/14/Z). Clive Hoggart is supported by the European Union’s Horizon 2020 research and innovation programme (grant 668303). The NFBC1966 received financial support from the Academy of Finland (project grants 104781, 120315, 129269, 1114194, 24300796), University Hospital Oulu, Biocenter, University of Oulu, Finland (75617), National Heart, Lung and Blood Institute (5R01HL087679-02) through the STAMPEED program (1RL1MH083268-01), National Institutes of Health/The National Institute of Mental Health (5R01MH63706:02) and the Medical Research Council, UK (MR/M013138/1). The program is currently being funded by the DynaHEALTH action (H2020-633595) and the Academy of Finland EGEA project (285547). The DNA extractions, sample quality controls, biobank up-keeping and aliquoting were performed in the National Public Health Institute, Biomedicum Helsinki, Finland and supported financially by the Academy of Finland and Biocentrum Helsinki. The eMERGE Network was initiated and funded by the National Human Genome Research Institute, in conjunction with additional funding from the National Institute of General Medical Sciences through the following grants: (U01-HG-004610) Group Health Cooperative/University of Washington; (U01-HG-004608) Marshfield Clinic Research Foundation and Vanderbilt University Medical Center; (U01-HG-04599) Mayo Clinic; (U01HG004609) Northwestern University; (U01-HG-04603) Vanderbilt University Medical Center, also serving as the Administrative Coordinating Center; (U01HG004438) Center for Inherited Disease Research and (U01HG004424) the Broad Institute serving as Genotyping Centers.
EU Grant Number: (633595) DYNAHEALTH - Understanding the dynamic determinants of glucose homeostasis and social capability to promote Healthy and active aging
Academy of Finland Grant Number: 285547
Detailed Information: 285547 (Academy of Finland Funding decision)
129269 (Academy of Finland Funding decision)
114194 (Academy of Finland Funding decision)
Copyright information: © The Author(s) 2019. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.