Enhancing Learning from Imbalanced Classes via Data Preprocessing: A Data-Driven Application in Metabolomics Data Mining

Abstract:

This paper presents a data mining application in metabolomics. It aims to build an enhanced machine learning classifier that can be used to diagnose cachexia syndrome and identify the biomarkers involved. To achieve this goal, a data-driven analysis is carried out on a public dataset of 1H-NMR metabolite profiles. This dataset suffers from class imbalance, which is known to degrade classifier performance and to undermine the validity and generalizability of the resulting models. The classification models in this study were built using five machine learning algorithms: PLS-DA, MLP, SVM, C4.5 and ID3. The models were built after a series of intensive data preprocessing procedures intended to tackle the class imbalance and improve classifier performance. These procedures involve data transformation, normalization, standardization, re-sampling and data reduction using a number of variable importance scorers. The best performance was achieved by an MLP model trained and tested with five-fold cross-validation on datasets that were re-sampled using the SMOTE method and then reduced using an SVM variable importance scorer. This model classified samples with excellent accuracy and also identified the potential disease biomarkers. The results confirm the validity of metabolomics data mining for the diagnosis of cachexia. They also emphasize the importance of data preprocessing procedures such as sampling and data reduction for improving data mining results, particularly when the data suffer from class imbalance.
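The re-sampling step named in the abstract, SMOTE, can be illustrated with a minimal standard-library sketch. This is not the authors' implementation: the function names `smote` and `balance`, the toy data, the class labels and the choice of `k` are all assumptions made for illustration. It shows only the core idea of creating synthetic minority samples by interpolating between a minority sample and one of its nearest minority-class neighbours.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority samples
    and their k nearest minority-class neighbours (the core SMOTE idea)."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def balance(X, y, minority_label, k=5, seed=0):
    """Oversample the minority class of a two-class dataset to equal size."""
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_needed = (len(y) - len(minority)) - len(minority)
    new_rows = smote(minority, max(n_needed, 0), k=k, seed=seed)
    return X + new_rows, y + [minority_label] * len(new_rows)


# Hypothetical toy data: 6 majority and 3 minority samples, 2 features each.
X = [[1.0, 1.2], [0.9, 1.0], [1.1, 0.8], [1.3, 1.1], [0.8, 0.9], [1.2, 1.3],
     [3.0, 3.1], [3.2, 2.9], [2.8, 3.3]]
y = ["majority"] * 6 + ["minority"] * 3

X_bal, y_bal = balance(X, y, minority_label="minority", k=2)
print(len(X_bal), y_bal.count("majority"), y_bal.count("minority"))  # 12 6 6
```

When oversampling is combined with cross-validation, it is commonly applied to the training folds only, so that synthetic points derived from a test sample do not leak into training.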

Similar resources

On Mining Fuzzy Classification Rules for Imbalanced Data

The fuzzy rule-based classification system (FRBCS) is a popular machine learning technique for classification. One of the major issues when applying it to imbalanced data sets is its bias toward the majority class, which makes it perform poorly on the minority class. However, in many cases the minority classes are more important than the majority ones. In this paper, we have extended ...


Data in support of enhancing metabolomics research through data mining

Metabolomics research has evolved considerably, particularly during the last decade. Over the course of this evolution, interest in this 'omic' discipline has become more evident than ever. However, the future of metabolomics will depend on its capability to find biomarkers. For that reason, data mining constitutes a challenging task in the metabolomics workflow. This work has been designed in supp...


Learning FCM by Data Mining in a Purchase System

Fuzzy Cognitive Maps (FCMs) have been applied successfully in numerous domains to show the relations between essential components in complex systems. In this paper, a novel learning method is proposed to construct FCMs from historical data using three meta-heuristics: Genetic Algorithm (GA), Simulated Annealing (SA), and Tabu Search (TS). Implementation of the proposed method has demonstrat...

Preprocessing of NMR metabolomics data.

Metabolomics involves the large-scale analysis of metabolites and thus provides information about cellular processes in a biological sample. Regardless of the analytical technique used, a vast amount of data is always acquired in metabolomics studies; this results in complex datasets with large numbers of variables. This type of data requires multivariate statistical anal...


Mining Imbalanced Data with Learning Classifier Systems

This chapter investigates the capabilities of XCS for mining imbalanced datasets. Initial experiments show that, for moderate and high class imbalances, XCS tends to evolve a large proportion of overgeneral classifiers. Theoretical analyses are developed, deriving an imbalance bound up to which XCS should be able to differentiate between accurate and overgeneral classifiers. Some relevant param...


Volume 11, Special Issue

Pages 79-89

Publication date: 2019-07-01
