Statistical analysis of the impact of class imbalance on model performance |
|
Author: | Nguyen, Lien (1) |
Organizations: |
(1) University of Oulu, Faculty of Science, Statistics |
Format: | ebook |
Version: | published version |
Access: | open |
Online Access: | PDF Full Text (PDF, 2.3 MB) |
Pages: | 44 |
Persistent link: | http://urn.fi/URN:NBN:fi:oulu-202305302046 |
Language: | English |
Published: |
Oulu : L. Nguyen,
2023
|
Publish Date: | 2023-05-30 |
Thesis type: | Master's thesis |
Tutor: |
Sillanpää, Mikko |
Reviewer: |
Waldmann, Patrik; Sillanpää, Mikko |
Description: |
Abstract: This work studies the impact of class imbalance in the data distribution on models and performance metrics. With the increasing amount of data available, new challenges arise, and one of them is unbalanced data. This problem occurs when one class in a dataset is underrepresented compared to the others. The underrepresented class is called the minority, while the dominant class is called the majority. In an unbalanced data problem, the minority class has far fewer instances than the majority class, with a ratio such as 1:99. The thesis begins with a comprehensive overview of the challenges and solutions associated with unbalanced data in machine learning. It defines different types of class imbalance and discusses the challenges they raise. Next, the thesis discusses various techniques for dealing with unbalanced data, including resampling and more complex algorithms such as XGBoost. The thesis also presents an empirical study and its results on the Caravan insurance dataset. This dataset was provided in the CoIL Challenge in 2000, where participants competed to develop the best model to predict whether a customer would buy insurance for their caravan. In the Caravan dataset, only around six percent of customers buy the insurance, making it heavily unbalanced. With this distribution, the dataset is a good candidate for studying the impact of imbalance on model performance and performance metrics. Unbalanced data can cause problems in various industries, from fraud detection to credit risk management. Understanding how to deal with this issue is therefore crucial for developing accurate machine learning models that can be used effectively across different domains. This work provides insights into how to address this challenge by presenting empirical evidence and discussing commonly applied solutions. It is a helpful resource for anyone interested in machine learning and data science.
An empirical test on a real business dataset yields three findings. First, performance metrics such as accuracy may not be suitable for heavily unbalanced data. Second, the most common modelling method, logistic regression, may not give the best results for the minority class, since the performance metrics are dominated by the majority class. Finally, to address the imbalance issue, both resampling techniques that balance the data before modelling and ensemble modelling methods such as XGBoost can produce good results.
Finnish title: Tilastollisen aineiston todennäköisyysjakaumien luokittaisen epätasapainon vaikutuksia tilastollisten mallien ennustekykyyn (Effects of class imbalance in the probability distributions of statistical data on the predictive performance of statistical models)
Abstract (translated from Finnish): The aim of my Master's thesis is to examine the effects of class imbalance in the probability distributions of statistical data on the predictive performance of statistical models. The empirical study, based on business data, produced three key results. First, when evaluating statistical models, measures of predictive performance are not suited to class-imbalanced data. Second, logistic regression analysis, one of the most commonly used statistical models, underestimates the probability of occurrence of the subsets with the fewest observations. Third, the error in a statistical model's predictive accuracy caused by class imbalance can be reduced significantly with the XGBoost modelling technique and resampling methods.
|
Subjects: | |
Copyright information: |
© Lien Nguyen, 2023. Except otherwise noted, the reuse of this document is authorised under a Creative Commons Attribution 4.0 International (CC-BY 4.0) licence (https://creativecommons.org/licenses/by/4.0/). This means that reuse is allowed provided appropriate credit is given and any changes are indicated. For any use or reproduction of elements that are not owned by the author(s), permission may need to be sought directly from the respective right holders. |
https://creativecommons.org/licenses/by/4.0/ |