Approx-SMOTE: Fast SMOTE for Big Data on Apache Spark

نویسندگان

چکیده

One of the main goals Big Data research, is to find new data mining methods that are able process large amounts in acceptable times. In classification, as traditional class imbalance a common problem must be addressed, case also looking for solution can applied an execution time. this paper we present Approx-SMOTE, parallel implementation SMOTE algorithm Apache Spark framework. The key difference with original SMOTE, besides parallelism, it uses approximated version k-Nearest Neighbor which makes highly scalable. Although already exists (SMOTE-BD), exact Nearest search, does not make entirely Approx-SMOTE on other hand achieve up 30 times faster run without sacrificing improved classification performance offered by SMOTE.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Data Preprocessing for Liver Dataset Using SMOTE

-The class imbalanced problem occurs in various disciplines when one of target classes has a small number of instances compare to other classes. A classifier normally ignores or neglects to detect a minority class due to the small number of class instances. It poses a challenge to any classifier as it becomes hard to learn the minority class samples. Most of the oversampling methods may generat...

متن کامل

SMOTE for Regression

Several real world prediction problems involve forecasting rare values of a target variable. When this variable is nominal we have a problem of class imbalance that was already studied thoroughly within machine learning. For regression tasks, where the target variable is continuous, few works exist addressing this type of problem. Still, important application areas involve forecasting rare extr...

متن کامل

A comparison on scalability for batch big data processing on Apache Spark and Apache Flink

*Correspondence: [email protected] 1Department of Computer Science and Artificial Intelligence, CITIC-UGR (Research Center on Information and Communications Technology), University of Granada, Calle Periodista Daniel Saucedo Aranda, 18071 Granada, Spain Full list of author information is available at the end of the article Abstract The large amounts of data have created a need for new fram...

متن کامل

Static and Dynamic Big Data Partitioning on Apache Spark

Many of today’s large datasets are organized as a graph. Due to their size it is often infeasible to process these graphs using a single machine. Therefore, many software frameworks and tools have been proposed to process graph on top of distributed infrastructures. This software is often bundled with generic data decomposition strategies that are not optimised for specific algorithms. In this ...

متن کامل

Geometric SMOTE: Effective oversampling for imbalanced learning through a geometric extension of SMOTE

Classification of imbalanced datasets is a challenging task for standard algorithms. Although many methods exist to address this problem in different ways, generating artificial data for the minority class is a more general approach compared to algorithmic modifications. SMOTE algorithm and its variations generate synthetic samples along a line segment that joins minority class instances. In th...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

ژورنال

عنوان ژورنال: Neurocomputing

سال: 2021

ISSN: ['0925-2312', '1872-8286']

DOI: https://doi.org/10.1016/j.neucom.2021.08.086