Optimal parameters for bloom-filtered joins in Spark

Author

  • Ophir Lojkine
Abstract

Star and snowflake database schemas use one large fact table and several small dimension tables. Queries on such schemas often filter on the dimension tables, so they must process many records even when the query result is small. Our work is not restricted to such schemas. In this article we assume that there are only two tables, one of which is larger than the other. One of the tables is small enough (the article makes the notion of "enough" precise). The other is referred to as the large table. Both tables are distributed and reside on the same cluster. The goal of this study is to execute the following query:

Similar resources

Lightning Fast and Space Efficient Inequality Joins

Inequality joins, which join relational tables on inequality conditions, are used in various applications. While there have been a wide range of optimization methods for joins in database systems, from algorithms such as sort-merge join and band join, to various indices such as B-tree, R*-tree and Bitmap, inequality joins have received little attention and queries containing such joins are usua...

The STARK Framework for Spatio-Temporal Data Analytics on Spark

Big Data sets can contain all types of information: from server log files to tracking information of mobile users with their location at a point in time. Apache Spark has been widely accepted for Big Data analytics because of its very fast processing model. However, Spark has no native support for spatial or spatio-temporal data. Spatial filters or joins using, e.g., a contains predicate are no...

Bloom Filters in Distributed Query Execution

The MapReduce framework [5] has emerged as a successful parallel computation model in large-scale data analytics, mostly due to its simple interface and its scalability over thousands of nodes. However, while various primitives, such as aggregations, are performed efficiently in this framework, more complicated relational algebra operations such as joins and multiway joins are still implemented...

Efficient Skew Handling for Outer Joins in a Cloud Computing Environment

Outer joins are ubiquitous in many workloads and Big Data systems. The question of how to best execute outer joins in large parallel systems is particularly challenging, as real world datasets are characterized by data skew leading to performance issues. Although skew handling techniques have been extensively studied for inner joins, there is little published work solving the corresponding prob...

Cuttlefish: A Lightweight Primitive for Adaptive Query Processing

Modern data processing applications execute increasingly sophisticated analysis that requires operations beyond traditional relational algebra. As a result, operators in query plans grow in diversity and complexity. Designing query optimizer rules and cost models to choose physical operators for all of these novel logical operators is impractical. To address this challenge, we develop Cuttlefis...

Journal:
  • CoRR

Volume: abs/1706.02785

Publication date: 2017