Tutorial: SQL-on-Hadoop Systems
Authors
Abstract
Enterprises are increasingly using Apache Hadoop, and more specifically HDFS, as a central repository for all their data, which comes from various sources including operational systems, social media and the web, sensors and smart devices, as well as their applications. At the same time, many enterprise data management tools (e.g., from SAP ERP and SAS to Tableau) rely on SQL, and many enterprise users are familiar and comfortable with SQL. As a result, SQL processing over Hadoop data has gained significant traction in recent years, and the number of systems that provide such capability has increased significantly. In this tutorial we use the term SQL-on-Hadoop to refer to systems that provide some level of declarative SQL(-like) processing over HDFS and NoSQL data sources, using architectures that include computational or storage engines compatible with Apache Hadoop. This emerging ecosystem has important characteristics that distinguish it from traditional relational warehouses. First, in the world of Hadoop and HDFS data, complex data types such as arrays, maps, and structs, as well as JSON data, are more prevalent. Second, users rely heavily on UDFs (user-defined functions) to express business logic that is sometimes very awkward to express in SQL itself. Third, there is often little control over HDFS: files can be added or modified outside the tight control of a query engine, making statistics maintenance a challenge. These factors further complicate query optimization in Hadoop systems. There is a wide variety of solutions, system architectures, and capabilities in this space, with varying degrees of SQL support. The purpose of this tutorial is to provide an overview of these options, discuss the different approaches, and compare them to gain insights into open research problems. In this tutorial, we will examine the SQL-on-Hadoop systems along various dimensions.
One important aspect is their data storage. Some of these systems support all native Hadoop formats, and do not impose any proprietary data for-
Similar resources
A Study of SQL-on-Hadoop Systems
Hadoop is now the de facto standard for storing and processing big data, not only for unstructured data but also for some structured data. As a result, providing SQL analysis functionality for the big data residing in HDFS becomes more and more important. Hive is a pioneering system that supports SQL-like analysis of the data in HDFS. However, the performance of Hive is not satisfactory for many appl...
Benchmarking SQL-on-Hadoop Systems: TPC or Not TPC?
Benchmarks are important tools to evaluate systems, as long as their results are transparent, reproducible and they are conducted with due diligence. Today, many SQL-on-Hadoop vendors use the data generators and the queries of existing TPC benchmarks, but fail to adhere to the rules, producing results that are not transparent. As the SQL-on-Hadoop movement continues to gain more traction, it is...
SQL-on-Hadoop: Full Circle Back to Shared-Nothing Database Architectures
SQL query processing for analytics over Hadoop data has recently gained significant traction. Among many systems providing some SQL support over Hadoop, Hive is the first native Hadoop system that uses an underlying framework such as MapReduce or Tez to process SQL-like statements. Impala, on the other hand, represents the new emerging class of SQL-on-Hadoop systems that exploit a shared-nothin...
Impala: A Modern, Open-Source SQL Engine for Hadoop
Cloudera Impala is a modern, open-source MPP SQL engine architected from the ground up for the Hadoop data processing environment. Impala provides low latency and high concurrency for BI/analytic read-mostly queries on Hadoop, not delivered by batch frameworks such as Apache Hive. This paper presents Impala from a user’s perspective, gives an overview of its architecture and main components and...
A Generic Solution to Integrate SQL and Analytics for Big Data
There is a need to integrate SQL processing with more advanced machine learning (ML) analytics to drive actionable insights from large volumes of data. As a first step towards this integration, we study how to efficiently connect big SQL systems (either MPP databases or new-generation SQL-on-Hadoop systems) with distributed big ML systems. We identify two important challenges to address in the ...
Journal title:
- PVLDB
Volume 8, Issue
Pages -
Publication date 2015