Predicting disease risks from highly imbalanced data using random forest
Authors
Abstract
BACKGROUND We present a method that uses the Healthcare Cost and Utilization Project (HCUP) dataset to predict individuals' disease risk from their medical diagnosis history. The methodology may be incorporated into a variety of applications, such as risk management, tailored health communication and decision support systems in healthcare.

METHODS We used the National Inpatient Sample (NIS) data, which is publicly available through HCUP, to train random forest (RF) classifiers for disease prediction. Because the HCUP data is highly imbalanced, we employed an ensemble learning approach based on repeated random sub-sampling. This technique divides the training data into multiple sub-samples while ensuring that each sub-sample is fully balanced. We compared the performance of support vector machine (SVM), bagging, boosting and RF classifiers in predicting the risk of eight chronic diseases.

RESULTS We predicted eight disease categories. Overall, the RF ensemble learning method outperformed SVM, bagging and boosting in terms of the area under the receiver operating characteristic (ROC) curve (AUC). In addition, RF has the advantage of computing the importance of each variable in the classification process.

CONCLUSIONS By combining repeated random sub-sampling with RF, we were able to overcome the class imbalance problem and achieve promising results. Using the national HCUP data set, we predicted eight disease categories with an average AUC of 88.79%.
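The repeated random sub-sampling step described in the abstract can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation. The abstract does not say how the sub-samples are built, so the code assumes that every sub-sample reuses all minority-class rows and pairs them with a disjoint, equally sized chunk of shuffled majority-class rows. The names `balanced_subsamples` and `ensemble_score` are hypothetical.

```python
import random
from typing import Callable, List, Sequence


def balanced_subsamples(labels: Sequence[int], seed: int = 0) -> List[List[int]]:
    """Split row indices into class-balanced sub-samples.

    Assumption (not stated in the abstract): each sub-sample contains all
    minority-class rows plus a disjoint chunk of majority-class rows of
    the same size, drawn after a random shuffle.
    """
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = sorted((positives, negatives), key=len)
    if not minority:
        raise ValueError("both classes must be present")

    rng = random.Random(seed)
    shuffled = majority[:]
    rng.shuffle(shuffled)

    size = len(minority)
    subsamples = []
    # Majority rows left over after the last full chunk are dropped.
    for start in range(0, len(shuffled) - size + 1, size):
        subsamples.append(minority + shuffled[start:start + size])
    return subsamples


def ensemble_score(models: Sequence[Callable[[object], float]], x: object) -> float:
    """Average the risk scores returned by the per-sub-sample models."""
    return sum(model(x) for model in models) / len(models)


if __name__ == "__main__":
    # Toy labels: 3 positive (diseased) rows and 10 negative rows.
    labels = [1] * 3 + [0] * 10
    for rows in balanced_subsamples(labels, seed=1):
        print(sorted(rows))
```

In a fuller pipeline, one classifier (for example scikit-learn's `RandomForestClassifier`) would presumably be trained on each sub-sample and the predicted probabilities averaged, which is the role `ensemble_score` plays here with plain callables standing in for trained models.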
Related articles
Application of Random Forest in Various Fields as the Most Reliable and Effective Data Mining Technique
This paper reviews the work done in various papers in the field of random forest to emphasize its importance as an important data analysis algorithm while comparing it with other algorithm at some places and sometimes using the algorithm with other methods for better accuracy at tree learning. It also studies the practical experiments done to establish the superiority of random forest over othe...
Comparison of Random Forest and Logistic Regression Methods in Predicting Mortality in Colorectal Cancer Patients and its Related Factors
Background and Objectives: The purpose of this study was to predict the mortality rate of colorectal cancer in Iranian patients and determine the effective factors on the mortality of patients with colorectal cancer using random forest and logistic regression methods. Methods: Data from 304 patients with colorectal cancer registry from the Gastroenterology and Liver Research Center of Shah...
Comparison of Random Survival Forests for Competing Risks and Regression Models in Determining Mortality Risk Factors in Breast Cancer Patients in Mahdieh Center, Hamedan, Iran
Introduction: Breast cancer is one of the most common cancers among women worldwide. Patients with cancer may die due to disease progression or other types of events. These different event types are called competing risks. This study aimed to determine the factors affecting the survival of patients with breast cancer using three different approaches: cause-specific hazards regression, subdistri...
Prediction of DNA-binding residues in proteins from amino acid sequences using a random forest model with a hybrid feature
MOTIVATION In this work, we aim to develop a computational approach for predicting DNA-binding sites in proteins from amino acid sequences. To avoid overfitting with this method, all available DNA-binding proteins from the Protein Data Bank (PDB) are used to construct the models. The random forest (RF) algorithm is used because it is fast and has robust performance for different parameter value...