Building a High-Level Dataflow System on top of MapReduce: The Pig Experience
Authors
Abstract
Increasingly, organizations capture, transform and analyze enormous data sets. Prominent examples include internet companies and e-science. The Map-Reduce scalable dataflow paradigm has become popular for these applications. Its simple, explicit dataflow programming model is favored by some over the traditional high-level declarative approach: SQL. On the other hand, the extreme simplicity of Map-Reduce leads to much low-level hacking to deal with the many-step, branching dataflows that arise in practice. Moreover, users must repeatedly code standard operations such as join by hand. These practices waste time, introduce bugs, harm readability, and impede optimizations. Pig is a high-level dataflow system that aims at a sweet spot between SQL and Map-Reduce. Pig offers SQL-style high-level data manipulation constructs, which can be assembled in an explicit dataflow and interleaved with custom Map- and Reduce-style functions or executables. Pig programs are compiled into sequences of Map-Reduce jobs, and executed in the Hadoop Map-Reduce environment. Both Pig and Hadoop are open-source projects administered by the Apache Software Foundation. This paper describes the challenges we faced in developing Pig, and reports performance comparisons between Pig execution and raw Map-Reduce execution.
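As a rough illustration of this style (not taken from the paper; the file names, schema, and delimiter below are hypothetical), a Pig Latin program names each intermediate result and chains SQL-like operators into an explicit dataflow, which Pig then compiles into a sequence of Map-Reduce jobs:

    -- Hypothetical example: join user records with page-view logs and count clicks per URL.
    users   = LOAD 'users.txt' USING PigStorage(',') AS (name:chararray, age:int);
    young   = FILTER users BY age >= 18 AND age <= 25;
    views   = LOAD 'page_views.txt' USING PigStorage(',') AS (user:chararray, url:chararray);
    joined  = JOIN young BY name, views BY user;
    grouped = GROUP joined BY url;
    counts  = FOREACH grouped GENERATE group AS url, COUNT(joined) AS clicks;
    STORE counts INTO 'popular_urls';

Because every step is a named relation, the dataflow can branch, and any step can be interleaved with user-defined Map- or Reduce-style functions where the built-in operators are not enough.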
Similar resources
PonIC: Using Stratosphere to Speed Up Pig Analytics
Pig, a high-level dataflow system built on top of Hadoop MapReduce, has greatly facilitated the implementation of data-intensive applications. Pig conceals Hadoop's inflexible single-input, two-stage pipeline by translating scripts into MapReduce jobs. However, these limitations are still present in the backend, often resulting in inefficient execution. Strat...
ReStore: Reusing Results of MapReduce Jobs
Analyzing large-scale data has emerged as an important activity for many organizations in the past few years. This large-scale data analysis is facilitated by the MapReduce programming and execution model and its implementations, most notably Hadoop. Users of MapReduce often have analysis tasks that are too complex to express as individual MapReduce jobs. Instead, they use high-level query lang...
SPARQling Pig - Processing Linked Data with Pig Latin
In recent years, dataflow languages such as Pig Latin have emerged as flexible and powerful tools for handling complex analysis tasks on big data. These languages support schema flexibility as well as common programming patterns such as iteration. They offer extensibility through user-defined functions while running on top of scalable distributed platforms. In doing so, these languages enable a...
Automatic Optimization of Parallel Dataflow Programs
Large-scale parallel dataflow systems, e.g., Dryad and Map-Reduce, have attracted significant attention recently. High-level dataflow languages such as Pig Latin and Sawzall are being layered on top of these systems, to enable faster program development and more maintainable code. These languages engender greater transparency in program structure, and open up opportunities for automatic optimiz...
Nesting Strategies for Enabling Nimble MapReduce Dataflows for Large RDF Data
Graph and semi-structured data are usually modeled in relational processing frameworks as "thin" relations (node, edge, node), and processing such data involves many join operations. Intermediate results of joins with multi-valued attributes or relationships contain redundant subtuples due to repetition of single-valued attributes. The amount of redundant content is high for real-world mult...
Journal: PVLDB
Volume: 2, Issue: -
Pages: -
Publication year: 2009