Extracting Predictor Variables to Construct Breast Cancer Survivability Model with Class Imbalance Problem
Authors
Abstract:
Application of data mining methods as a decision support system has a great benefit to predict survival of new patients. It also has a great potential for health researchers to investigate the relationship between risk factors and cancer survival. But due to the imbalanced nature of datasets associated with breast cancer survival, the accuracy of survival prognosis models is a challenging issue for researchers. This study aims to develop a predictive model for 5-year survivability of breast cancer patients and discover relationships between certain predictive variables and survival. The dataset was obtained from SEER database. First, the effectiveness of two synthetic oversampling methods Borderline SMOTE and Density based Synthetic Oversampling method (DSO) is investigated to solve the class imbalance problem. Then a combination of particle swarm optimization (PSO) and Correlation-based feature selection (CFS) is used to identify most important predictive variables. Finally, in order to build a predictive model three classifiers decision tree (C4.5), Bayesian Network, and Logistic Regression are applied to the cleaned dataset. Some assessment metrics such as accuracy, sensitivity, specificity, and G-mean are used to evaluate the performance of the proposed hybrid approach. Also, the area under ROC curve (AUC) is used to evaluate performance of feature selection method. Results show that among all combinations, DSO + PSO_CFS + C4.5 presents the best efficiency in criteria of accuracy, sensitivity, G-mean and AUC with values of 94.33%, 0.930, 0.939 and 0.939, respectively.
similar resources
Breast Cancer Diagnosis from Perspective of Class Imbalance
Introduction: Breast cancer is the second cause of mortality among women. Early detection is the only rescue to reduce the risk of breast cancer mortality. Traditional methods cannot effectively diagnose tumor since they are based on the assumption of well-balanced dataset.. However, a hybrid method can help to alleviate the two-class imbalance problem existing in the ...
full textHandling class imbalance problem in miRNA dataset associated with cancer
MiRNAs are small (~22nt long) non-coding RNA sequences; binds to the complementarity target sites in 3' Untranslated Region (UTR) of mRNA sequences but not restricted to other mRNA regions viz., 5' UTR and Coding sequences (CDS). Complementarity binding of miRNA to mRNA target sites either results in complete degradation of the mRNA itself or it may regulate the mRNA as an oncogene or as a tumo...
full textClassification with class imbalance problem: A Review
Most existing classification approaches assume the underlying training set is evenly distributed. In class imbalanced classification, the training set for one class (majority) far surpassed the training set of the other class (minority), in which, the minority class is often the more interesting class. In this paper, we review the issues that come with learning from imbalanced class data sets a...
full textClass Imbalance Problem in Data Mining Review
In last few years there are major changes and evolution has been done on classification of data. As the application area of technology is increases the size of data also increases. Classification of data becomes difficult because of unbounded size and imbalance nature of data. Class imbalance problem become greatest issue in data mining. Imbalance problem occur where one of the two classes havi...
full textThe Class Imbalance Problem: Signiicance and Strategies
Although the majority of concept-learning systems previously designed usually assume that their training sets are well-balanced, this assumption is not necessarily correct. Indeed, there exist many domains for which one class is represented by a large number of examples while the other is represented by only a few. The purpose of this paper is 1) to demonstrate experimentally that, at least in ...
full textHandling Class Imbalance Problem Using Feature Selection
1 Introduction The class imbalance problem is a challenge to machine learning and data mining, and it has attracted significant research recent years. A classifier affected by the class imbalance problem for a specific data set would see strong accuracy overall but very poor performance on the minority class. The imbalance data sets are pervasive in real-world applications. Examples of these ki...
full textMy Resources
Journal title
volume 6 issue 2
pages 263- 276
publication date 2018-07-01
By following a journal you will be notified via email when a new issue of this journal is published.
Hosted on Doprax cloud platform doprax.com
copyright © 2015-2023