Benchmarking SQL-on-Hadoop Systems: TPC or Not TPC?
Abstract
Benchmarks are important tools for evaluating systems, as long as their results are transparent and reproducible and they are conducted with due diligence. Today, many SQL-on-Hadoop vendors use the data generators and the queries of existing TPC benchmarks, but fail to adhere to the rules, producing results that are not transparent. As the SQL-on-Hadoop movement continues to gain traction, it is important to bring some order to this “wild west” of benchmarking. First, new rules and policies should be defined to satisfy the demands of the new generation of SQL systems. The new benchmark evaluation schemes should be inexpensive, effective and open enough to embrace the variety of SQL-on-Hadoop systems and their corresponding vendors. Second, adhering to the new standards requires industry commitment and collaboration. In this paper, we discuss the problems we observe in current benchmarking practices, and present our proposal for bringing standardization to the SQL-on-Hadoop space.
Similar resources
SQL-on-Hadoop: Full Circle Back to Shared-Nothing Database Architectures
SQL query processing for analytics over Hadoop data has recently gained significant traction. Among many systems providing some SQL support over Hadoop, Hive is the first native Hadoop system that uses an underlying framework such as MapReduce or Tez to process SQL-like statements. Impala, on the other hand, represents the new emerging class of SQL-on-Hadoop systems that exploit a shared-nothin...
Benchmarking Infrastructure for Big Data
This paper discusses the applicability of 20 years of experience benchmarking transactional systems (TPC) and storage system benchmarking (SPC) to the design of systems suitable for use in comparing different analytic systems. One of the most challenging problems is finding tools to make meaningful comparisons between substantially different architectures working on solving similar problems. Fo...
A Study of SQL-on-Hadoop Systems
Hadoop is now the de facto standard for storing and processing big data, not only for unstructured data but also for some structured data. As a result, providing SQL analysis functionality for the big data residing in HDFS becomes more and more important. Hive is a pioneer system that supports SQL-like analysis of the data in HDFS. However, the performance of Hive is not satisfactory for many appl...
Pig vs Hive: Benchmarking High Level Query Languages
This article presents benchmarking results of two benchmarking sets (run on small clusters of 6 and 9 nodes) applied to Hive and Pig running on Hadoop 0.14.1. The first set of results was obtained by replicating the Apache Pig benchmark published by the Apache Foundation on 11/07/07 (which served as a baseline to compare major Pig Latin releases). The second results were obtained by applying ...
Availability Benchmarking of a Database System
We present the results of an availability benchmarking study of a three-tier transaction-processing-oriented database system based on Microsoft SQL Server 2000. Following the general availability benchmarking methodology introduced by our previous work on software RAID systems, we carried out a set of fault-injection experiments in which we measured the effects of 14 different types of realistic...