J3.9 Multiple Imputation through Machine Learning Algorithms

Authors

  • Michael B. Richman
  • Theodore B. Trafalis
  • Indra Adrianto
Abstract

A problem common to meteorological and climatological datasets is how to address missing data. Most multivariate analysis techniques require that all variables be present for each observation; hence, some action is required when data are missing. Where individual observations are not thought to be important, deleting every observation that is missing one or more values (complete case deletion) is common. As the amount of missing data increases, such deletion can bias the first two statistical moments of the remaining data as population estimators and introduce inaccuracies into subsequent analyses. What is desired is a principled method that uses the information available in the remaining data to predict the missing values. Such techniques include substituting nearby data, interpolation, and linear regression using nearby sites as predictors. One class of techniques that uses the available information iteratively is known as multiple imputation. In this work, several machine learning techniques, such as support vector machines (SVMs) and artificial neural networks (ANNs), are tested against standard imputation methods (e.g., multiple regression), simple regression, mean substitution, and casewise deletion. All methods are used to predict known values of climatological data that have been altered to produce missing data. These datasets contain on the order of 400 variables (data station sites) and a large number of observations. Both precipitation and air temperature data are used to provide a range of the inherent spatial coherence seen by analysts. The mean squared error (MSE) of the prediction and the mean absolute error (MAE) of the variance are presented to assess the efficacy of each technique. Results indicate that the non-iterative methods, such as casewise deletion and mean substitution, lead to the largest errors, and that iterative imputation has considerably lower errors. Within the iterative techniques, SVMs are the most promising in reducing error.
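The evaluation design described in the abstract (remove known values, impute them, score against the truth) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the data are synthetic, the function names are hypothetical, and a single iterative least-squares regression stands in for the SVM, ANN, and full multiple-imputation methods tested in the paper. The variance score is one plausible reading of "MAE of the variance".

```python
import numpy as np

def make_station_data(n_obs=500, n_sites=8, seed=0):
    """Synthetic, spatially coherent 'station' data: shared signal plus noise (assumption)."""
    rng = np.random.default_rng(seed)
    shared = rng.normal(size=(n_obs, 1))
    loadings = rng.uniform(0.7, 1.0, size=(1, n_sites))
    return shared * loadings + 0.3 * rng.normal(size=(n_obs, n_sites))

def mask_at_random(data, frac=0.1, seed=1):
    """Replace a random fraction of known values with NaN; return the holed data and the mask."""
    rng = np.random.default_rng(seed)
    mask = rng.random(data.shape) < frac
    holed = data.copy()
    holed[mask] = np.nan
    return holed, mask

def mean_substitution(holed):
    """Non-iterative baseline: fill each gap with its site (column) mean."""
    filled = holed.copy()
    col_means = np.nanmean(holed, axis=0)
    rows, cols = np.where(np.isnan(holed))
    filled[rows, cols] = col_means[cols]
    return filled

def iterative_regression(holed, n_iter=10):
    """Iteratively re-predict each site's gaps from all other sites by least squares."""
    missing = np.isnan(holed)
    filled = mean_substitution(holed)  # starting guess
    n_obs, n_sites = holed.shape
    for _ in range(n_iter):
        for j in range(n_sites):
            miss_j = missing[:, j]
            if not miss_j.any():
                continue
            others = np.delete(filled, j, axis=1)
            X = np.column_stack([np.ones(n_obs), others])
            # Fit on rows where site j was observed, then predict its gaps.
            coef, *_ = np.linalg.lstsq(X[~miss_j], filled[~miss_j, j], rcond=None)
            filled[miss_j, j] = X[miss_j] @ coef
    return filled

def mse_on_masked(truth, filled, mask):
    """Mean squared error over the artificially removed entries only."""
    return float(np.mean((truth[mask] - filled[mask]) ** 2))

def mae_of_variance(truth, filled):
    """Mean absolute difference between per-site variances (one reading of the metric)."""
    return float(np.mean(np.abs(truth.var(axis=0) - filled.var(axis=0))))

truth = make_station_data()
holed, mask = mask_at_random(truth)
filled_mean = mean_substitution(holed)
filled_iter = iterative_regression(holed)
mse_mean = mse_on_masked(truth, filled_mean, mask)
mse_iter = mse_on_masked(truth, filled_iter, mask)
print(f"mean substitution:    MSE={mse_mean:.3f}  var MAE={mae_of_variance(truth, filled_mean):.3f}")
print(f"iterative regression: MSE={mse_iter:.3f}  var MAE={mae_of_variance(truth, filled_iter):.3f}")
```

Scoring only the masked entries keeps the comparison fair, since every method reproduces the observed values exactly. A proper multiple-imputation run would additionally draw several completed datasets and pool the results, which this sketch omits.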

Similar resources

Experimental analysis of methods for imputation of missing values in databases

A very important issue faced by researchers and practitioners who use industrial and research databases is incompleteness of data, usually in the form of missing or erroneous values. While some data analysis algorithms can work with incomplete data, a large portion of them require complete data. Therefore, different strategies, such as deletion of incomplete examples, and imputation (filling) o...

The machine learning process in applying spatial relations of residential plans based on samples and adjacency matrix

The world is moving towards a hardware or software presence of artificial intelligence in all fields of human work, and architecture is no exception. This research seeks to present a theoretical and practical model of intuitive design intelligence that poses the problem of learning layout and spatial relationships to artificial intelligence algorithms; therefore, th...

A Hybrid Optimization Algorithm for Learning Deep Models

Deep learning is a subset of machine learning that is widely used in artificial intelligence (AI) fields such as natural language processing and machine vision. The learning algorithms require optimization in multiple aspects. Generally, model-based inferences need to solve an optimization problem. In deep learning, the most important problem that can be solved by optimization is neural n...

Comparative Analysis of Machine Learning Algorithms with Optimization Purposes

The fields of optimization and machine learning increasingly interact, and optimization in different problems leads to the use of machine learning approaches. Machine learning algorithms work in reasonable computational time for specific classes of problems and have an important role in extracting knowledge from large amounts of data. In this paper, a methodology has been employed to opt...

Missing data imputation using statistical and machine learning methods in a real breast cancer problem

OBJECTIVES: Missing data imputation is an important task in cases where it is crucial to use all available data and not discard records with missing values. This work evaluates the performance of several statistical and machine learning imputation methods that were used to predict recurrence in patients in an extensive real breast cancer data set. MATERIALS AND METHODS: Imputation methods based...

Publication date: 2006