synthetic minority over sampling technique

A novel approach to generate robust classification models to predict developmental toxicity from imbalanced datasets.

Journal: SAR and QSAR in environmental research 2014

S B Gunturi, N Ramamurthi

Computational models to predict the developmental toxicity of compounds are built on imbalanced datasets wherein the toxicants outnumber the non-toxicants. Consequently, the results are biased towards the majority class (toxicants). To overcome this problem and to obtain sensitive but also accurate classifiers, we followed an integrated approach wherein (i) Synthetic Minority Over Sampling (SMO...

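The entry above names synthetic minority over-sampling as one step of its approach. As a rough illustration of the general idea only (not the authors' implementation, whose details are truncated here), a new minority sample can be created by interpolating between an existing minority sample and one of its nearest minority neighbours. The function name `smote_sketch`, its parameters, and the toy data are all hypothetical.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating between minority samples.

    Hypothetical helper for illustration; not taken from any paper listed here.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of base (Euclidean), excluding base itself
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Random position along the segment from base to the chosen neighbour
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Toy minority class with two features (made-up values)
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=6)
```

Because each synthetic point lies on a segment between two real minority samples, it stays inside the region the minority class already occupies. Synthetic samples should be generated from the training split only, so that no information from held-out data leaks into them.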

Forecasting Cyber Attacks with Imbalanced Data Sets and Different Time Granularities

2018

Ahmet Okutan, Shanchieh Jay Yang, Katie McConky

If cyber incidents are predicted a reasonable amount of time before they occur, defensive actions to prevent their destructive effects can be planned. Unfortunately, most of the time we do not have enough observables of the malicious activities before they are already under way. Therefore, this work suggests using unconventional signals extracted from various data sources with different time...


Support Vector Machines for Class Imbalance Rail Data Classification with Bootstrapping-Based Over-Sampling and Under-Sampling

2014

Ali Zughrat

Support Vector Machines (SVMs) are a popular machine learning technique that has proven very effective in solving many classical problems with balanced data sets in various application areas. However, the technique is also reported to perform poorly when applied to learning from heavily imbalanced data sets where the majority classes significantly outnumber the minority...


GS4: Generating Synthetic Samples for Semi-Supervised Nearest Neighbor Classification

2014

Panagiotis Moutafis, Ioannis A. Kakadiaris

In this paper, we propose a method to improve nearest neighbor classification accuracy under a semi-supervised setting. We call our approach GS4 (i.e., Generating Synthetic Samples Semi-Supervised). Existing self-training approaches classify unlabeled samples by exploiting local information. These samples are then incorporated into the training set of labeled data. However, errors are propagate...


An improved survivability prognosis of breast cancer by using sampling and feature selection technique to solve imbalanced patient classification data

2013

Kung-Jeng Wang, Bunjira Makond, Kung-Min Wang

BACKGROUND: Breast cancer is one of the most critical cancers and is a major cause of cancer death among women. It is essential to know the survivability of the patients in order to ease the decision-making process regarding medical treatment and financial preparation. Recently, the breast cancer data sets have been imbalanced (i.e., the number of survival patients outnumbers the number of non-s...


A comparative assessment of the performance of ensemble learning in customer churn prediction

Journal: Int. Arab J. Inf. Technol. 2014

Hossein Abbasimehr, Mostafa Setak, Mohammad Jafar Tarokh

Customer churn is a main concern of most firms in all industries. The aim of customer churn prediction is to detect customers with a high tendency to leave a company. Although many modeling techniques have been used in the field of churn prediction, the performance of ensemble methods has not yet been thoroughly investigated. Therefore, in this paper, we perform a comparative assessment of the perfo...


From machine learning to deep learning: A comprehensive study of alcohol and drug use disorder

Journal: Healthcare analytics 2022

This study aims to train and validate machine learning and deep learning models to identify patients with risky alcohol and drug misuse in a Screening, Brief Intervention, and Referral to Treatment (SBIRT) program. An observational cohort of 6978 adults was admitted in the western region of Alabama at three medical facilities between January and December 2019. Data were cleaned and pre-processed using data imputation techniques an aug...


Evaluation of Classifiers in Software Fault-Proneness Prediction

Journal: Journal of Artificial Intelligence and Data Mining 2017

F. Karimian, S. M. Babamir

The reliability of software depends on its fault-prone modules: the fewer fault-prone units the software contains, the more we may trust it. Therefore, if we are able to predict the number of fault-prone modules of software, it will be possible to judge the software reliability. In predicting software fault-prone modules, one of the contributing features is software metric by which one ...


A Bayesian beta kernel model for binary classification and online learning problems

Journal: Statistical Analysis and Data Mining 2014

Cameron A. MacKenzie, Theodore B. Trafalis, Kash Barker

Recent advances in data mining have integrated kernel functions with Bayesian probabilistic analysis of Gaussian distributions. These machine learning approaches can incorporate prior information with new data to calculate probabilistic rather than deterministic values for unknown parameters. This paper extensively analyzes a specific Bayesian kernel model that uses a kernel function to calcula...


Predictive Model Using a Machine Learning Approach for Enhancing the Retention Rate of Students At-Risk

Journal: International Journal on Semantic Web and Information Systems 2022

Student retention is a widely recognized challenge in the educational community, and addressing it assists institutes in the formation of appropriate and effective pedagogical interventions. This study intends to predict students at risk of low performance during an ongoing course and those graduating later than the tentative timeline, and to predict the capacity of the campus. The data constitutes demographics, learning, academic related attrib...
