Bringing Elastic MapReduce to Scientific Clouds
Abstract
The MapReduce programming model, proposed by Google, offers a simple and efficient way to perform distributed computation over large data sets. The Apache Hadoop framework is a free and open-source implementation of MapReduce. To simplify the use of Hadoop, Amazon Web Services provides Elastic MapReduce, a web service that lets users submit MapReduce jobs. Elastic MapReduce takes care of resource provisioning, Hadoop configuration and performance tuning, data staging, and fault tolerance, among other tasks. This service drastically lowers the barrier to entry for performing MapReduce computations in the cloud. However, Elastic MapReduce is limited to Amazon EC2 resources and incurs an extra fee. In this paper, we present our work towards an implementation of Elastic MapReduce that can use resources from clouds other than Amazon EC2, such as scientific clouds. This work will also serve as a foundation for more advanced experiments, such as performing MapReduce computations over multiple distributed clouds.
Similar resources
Resilin: Elastic MapReduce for Private and Community Clouds
The MapReduce programming model, introduced by Google, offers a simple and efficient way of performing distributed computation over large data sets. Although Google’s implementation is proprietary, MapReduce can be leveraged by anyone using the free and open source Apache Hadoop framework. To simplify the usage of Hadoop in the cloud, Amazon Web Services offers Elastic MapReduce, a web service ...
Having a ChuQL at XML on the Cloud
MapReduce/Hadoop has gained acceptance as a framework to process, transform, integrate, and analyze massive amounts of Web data on the Cloud. The MapReduce model (simple, fault tolerant, data parallelism on elastic clouds of commodity servers) is also attractive for processing enterprise and scientific data. Despite XML ubiquity, there is yet little support for XML processing on top of MapReduc...
Iterative MapReduce for Large Scale Machine Learning
Large datasets (“Big Data”) are becoming ubiquitous because the potential value in deriving insights from data, across a wide range of business and scientific applications, is increasingly recognized. The data growth has been accompanied by rapid adoption of large, elastic, multi-tenanted computing clusters (“compute clouds”), leading to a virtuous cycle: the scalability of cloud computing make...
Data Mining Using Clouds: An Experimental Implementation of Apriori over MapReduce
Cloud computing has become a viable mainstream solution for data processing, storage and distribution. It promises on demand, scalable, pay-as-you-go compute and storage capacity. To analyze “big data” on clouds, it is very important to research data mining strategies based on cloud computing paradigm from both theoretical and practical views. For this purpose, we study a strategy of data minin...
High Performance Parallel Computing with Clouds and Cloud Technologies
Infrastructure services (Infrastructure-as-a-service), provided by cloud vendors, allow any user to provision a large number of compute instances fairly easily. Whether leased from public clouds or allocated from private clouds, utilizing these virtual resources to perform data/compute intensive analyses requires employing different parallel runtimes to implement such applications. Among many p...