Research Article

Exploiting Sharing Join Opportunities in Big Data Multiquery Optimization with Flink

Table 1

The comparison of the J-MOTH system with other proposed solutions.

| No. | Reference | Functionalities description | Query type | Storing type | Data distribution | Methodologies | Difference relative to J-MOTH system |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | MRShare: sharing across multiple queries in MapReduce [18] | A concurrent-sharing framework that exploits sharing opportunities among multiple jobs by grouping them into a single job | Aggregation | Temporary | Uniform | Grouping shared queries | The sharing optimizations of similar work in MRShare and of overlapped work in relaxed MRShare are considered fine-grained sharing |
| 2 | Multiquery optimization in MapReduce framework [31] | Relaxes and generalizes MRShare to overlapping queries to increase the opportunities for sharing within a single job | Aggregation | Temporary | Uniform | Grouping materialization of overlapped queries | Same as No. 1 |
| 3 | ReStore: reusing results of MapReduce jobs [19] | A nonconcurrent sharing system that optimizes query evaluation using materialized results produced by Pig workflows of MapReduce jobs | N/A | Permanent | Uniform | Materialization | In the worst case for both the HOME and ReStore systems, the output of a full job or subjob is never reused by future queries; these systems therefore suffer from big data storage limitations, which incur a high cost for buying additional physical storage or renting extra virtual storage in a big data environment |
| 4 | HOME: HiveQL optimization in multisession environment [51] | The HOME system improves Hadoop performance by storing the results of previous runs and reusing them in the next run of the same session or of a different session | Selection, aggregation, join | Permanent | Uniform | Materialization | Same as No. 3 |
| 5 | Improving the performance of Hadoop Hive by sharing scan and computation tasks [52] | An MQO framework, SharedHive, that improves the overall performance of Hadoop by grouping correlated HiveQL queries into a new set of queries | N/A | Temporary | Uniform | Grouping | It exploits temporary fine-grained correlated queries to optimize multiquery for only one session over Hive |
| 6 | Reuse-based optimization for Pig Latin [53] | The PigReuse system operates on common subexpressions, using the algebraic representations of Pig Latin scripts and reuse-based algorithms to share their results | N/A | Temporary | Uniform | Reuse-based | It exploits fine-grained reuse-based opportunities to optimize multiquery over Pig without considering data distribution |
| 7 | Exploiting Soft and Hard Correlations in big data query optimization [54] | The EXORD system exploits the granularity of data correlations (i.e., soft and hard) to improve big data query optimization | N/A | Temporary | Uniform | Materialization | It deals with multiquery sharing and reuses fine-grained computed results based on sharing correlations |
| 8 | JOUM: an indexing methodology for improving join in Hive star schema [24] | JOUM (join once use many) speeds up Hive join query tasks by pipelined materialization of the fully joined star schema | Join | Permanent | Uniform | Materialization | It does not exploit sharing among join queries; it optimizes each individual incoming join query based on the prejoined schema |
| 9 | Optimizing join in Hive star schema using Key/Facts indexing [41] | Proposes key/facts indexing, which materializes the star schema data and builds an index over the materialized data to optimize JOIN on Hive | Join | Permanent | Uniform | Materialization | It does not exploit sharing among join queries; it optimizes each individual incoming join query based on the indexed prejoined schema |
| 10 | BlockJoin: efficient matrix partitioning through joins [42] | Proposes an optimized distributed join algorithm, BlockJoin, that reduces the shuffling cost of intermediate data by merging relational and linear algebra operations into specialized physical operators | Join | Temporary | Uniform | Materialization and index join | It does not exploit sharing among join queries; it optimizes each individual incoming join query using an index and a materialization strategy (i.e., early or late) chosen according to the shape of the tables |
| 11 | Wide table layout optimization based on column ordering and duplication [55] | Proposes a fine-grained cost model for column accesses and column duplication to optimize the HDFS I/O cost of a query workload | Join | Temporary | Uniform | N/A | It does not exploit sharing among join queries; it implements the fine-grained cost model with a simulated annealing-based column ordering algorithm that finds approximately optimal column orders, combined with a storage-constrained column duplication algorithm, to optimize I/O throughput |
| 12 | Selecting subexpressions to materialize at datacenter scale [36] | Proposes a vertex-centric graph algorithm called BIGSUBS that iteratively selects, in parallel, the subexpressions to be materialized over very large workloads | Join | N/A | Uniform | Materialization | Subexpression selection is mapped to a bipartite graph labeling problem, which is then solved iteratively |
| 13 | In-memory caching for multiquery optimization of data-intensive scalable computing workloads [48] | Proposes a multiquery optimization method that is combined with in-memory cache primitives to improve the efficiency of data-intensive frameworks | Join | Temporary | Uniform | N/A | It exploits sharing opportunities by caching distributed relations |
| 14 | A parallel query processing system based on graph-based database partitioning [49] | Proposes a novel graph-based database partitioning method called GPT, which considers the trade-off between data redundancy and the number of opportunities for processing joins without shuffling | Join | N/A | Uniform | Grouping, sorting, and partitioning | It exploits sharing by using a partitioning method (i.e., hash-based multicolumn (HMC)), while our work considers data granularities and implicit sort to reduce shuffling in multiquery |
| 15 | An executable specification of map-join-reduce using Haskell [50] | Proposes an executable Map-Join-Reduce programming model based on Haskell; the key idea of the model is to add a new join module to MapReduce, based on filtering-join-aggregation MapReduce, to optimize multiway join | Join | N/A | Uniform | N/A | Map-Join-Reduce optimizes a single multiway join, while J-MOTH considers sharing opportunities among multiple multiway join queries on MapReduce and Flink |
| 16 | Exploiting coarse-grained reused-based opportunities in big data multiquery optimization [21] | Proposes the MOTH system to exploit the coarse granularity of full and partial sharing in multiquery on slow storage | Selection, projection | Temporary | Nonuniform | Materialization | The MOTH system is our previous work; it tackles sharing data within big data multiquery |
| 17 | SOOM: sort-based optimizer for big data multiquery [22] | Proposes the SOOM system to exploit sort-sharing opportunities, including the explicit sorts of sort queries and the implicit sorts of aggregation queries | Aggregation and sort queries | Temporary | Nonuniform | Materialization | The SOOM system is our previous work; it tackles sharing data and work within big data multiquery, including the explicit sorts of sort queries and the implicit sorts of aggregation queries |
| 18 | J-MOTH: exploiting sharing work for big data multiquery optimization | Proposes the J-MOTH system to exploit the coarse granularity of sharing data, in addition to the implicit sorts, in two-way join and pipelined multiway join queries | Join (two-way and multiway) | Temporary | Nonuniform | Materialization | Our contribution in this work |
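
The distinction the table draws between optimizing each join query individually and sharing join work across queries can be illustrated with a small sketch. The Python code below is not the J-MOTH algorithm or any of the surveyed systems; it only assumes that several queries join the same two relations on the same key, so the join can be computed once and each query answered from the shared result. All names (`hash_join`, `run_shared`, the sample relations, and the two queries) are hypothetical.

```python
from collections import defaultdict

def hash_join(left, right, key):
    """Equi-join two lists of dicts on `key` using an in-memory hash index."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    # Merge every left row with each matching right row.
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

def run_shared(queries, left, right, key):
    """Compute the join once, then answer every query from the shared result."""
    joined = hash_join(left, right, key)  # shared work, performed a single time
    return {
        name: [project(row) for row in joined if predicate(row)]
        for name, (predicate, project) in queries.items()
    }

# Hypothetical sample relations.
orders = [
    {"cust_id": 1, "amount": 50},
    {"cust_id": 2, "amount": 200},
    {"cust_id": 1, "amount": 300},
]
customers = [
    {"cust_id": 1, "country": "EG"},
    {"cust_id": 2, "country": "FR"},
]

# Two hypothetical queries over the same join: (selection, projection) pairs.
queries = {
    "q1": (lambda r: r["amount"] > 100, lambda r: (r["cust_id"], r["amount"])),
    "q2": (lambda r: r["country"] == "EG", lambda r: r["amount"]),
}

results = run_shared(queries, orders, customers, "cust_id")
```

Evaluating the two queries separately would build the hash index and probe it twice; here that cost is paid once and only the cheap per-query selection and projection are repeated. A real system must also decide where the shared result is stored and for how long, which is what the storing-type column of the table captures.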