Complexity

Research Article

Exploiting Sharing Join Opportunities in Big Data Multiquery Optimization with Flink

Table 1

The comparison of the J-MOTH system with other proposed solutions.


No.	Reference	Functionalities description	Query type	Storing type	Data distribution	Methodologies	Difference relative to J-MOTH system

1	MRShare: sharing across multiple queries in MapReduce [18]	A concurrent-sharing framework that exploits sharing opportunities among multiple jobs by grouping them into a single job	Aggregation	Temporary	Uniform	Grouping shared queries	The sharing optimizations of similar work in MRShare and of overlapped work in relaxed MRShare are both considered fine-grained sharing
2	Multiquery optimization in MapReduce framework [31]	Relaxing and generalizing MRShare overlapping queries to increase sharing opportunity into a single job	Aggregation	Temporary	Uniform	Grouping materialization of overlapped queries
3	ReStore: reusing results of MapReduce jobs [19]	Nonconcurrent sharing system that optimizes query evaluation using materialized results produced by Pig workflows of MapReduce jobs	N/A	Permanent	Uniform	Materialization	In the worst case for both the HOME and ReStore systems, the output of a full job or subjob is never reused by future queries; these systems therefore suffer from big data storage limitations, which incur a high cost to buy additional physical storage or rent extra virtual storage in a big data environment
4	HOME: HiveQL optimization in multisession environment [51]	The HOME system improves Hadoop performance by storing the data of previous results and reusing it in the next run, for the same session or for a different session	Selection, aggregation, join	Permanent	Uniform	Materialization
5	Improving the performance of Hadoop Hive by sharing scan and computation tasks [52]	MQO framework, SharedHive, improves the overall performance of Hadoop by grouping correlated HiveQL queries into a new set of queries	N/A	Temporary	Uniform	Grouping	It exploits the temporary fine-grained correlated queries to optimize multiquery for only one session over Hive
6	Reuse-based optimization for Pig Latin [53]	The PigReuse system operates on common subexpressions, using the algebraic representations of Pig Latin scripts and reuse-based algorithms to share their results	N/A	Temporary	Uniform	Reuse-based	It exploits fine-grained reuse-based opportunities to optimize multiquery over Pig without considering data distribution
7	Exploiting Soft and Hard Correlations in big data query optimization [54]	EXORD system has been proposed to exploit the granularity of data correlations (i.e., soft and hard) to improve big data query optimization	N/A	Temporary	Uniform	Materialization	It deals with multiquery sharing and reuses fine-grained computed results based on sharing correlations
8	JOUM: an indexing methodology for improving join in Hive star schema [24]	JOUM (join once use many) speeds up Hive join query tasks by using pipelined materialization of the full joined star schema	Join	Permanent	Uniform	Materialization	It does not exploit sharing among join queries; it optimizes each individual incoming join query based on the prejoined schema
9	Optimizing join in Hive star schema using Key/Facts indexing [41]	Proposing key/facts indexing to materialize the star schema data and build an index for this materialized data to optimize JOIN on Hive	Join	Permanent	Uniform	Materialization	It does not exploit sharing among join queries; it optimizes each individual incoming join query based on the indexed prejoined schema
10	BlockJoin: efficient matrix partitioning through joins [42]	Proposing an optimized distributed join algorithm, namely, BlockJoin, to reduce shuffling costs of intermediate data by merging relational and linear algebra operations into specialized physical operators	Join	Temporary	Uniform	Materialization and index join	It does not exploit sharing among join queries; it optimizes each individual incoming join query using indexing and a materialization strategy (i.e., early or late) chosen according to the shape of the tables
11	Wide table layout optimization based on column ordering and duplication [55]	Proposing a fine-grained cost model for column accesses and column duplication to optimize HDFS I/O cost of a query workload	Join	Temporary	Uniform	N/A	It does not exploit sharing among join queries; it implements the fine-grained cost model using a simulated annealing-based column ordering algorithm to find approximately optimal column orders and combines it with a storage-constrained column duplication algorithm to optimize I/O throughput
12	Selecting subexpressions to materialize at datacenter scale [36]	Proposing a vertex-centric graph algorithm called BIGSUBS to iteratively select subexpressions in parallel to be materialized over very large workloads	Join	N/A	Uniform	Materialization	Subexpression selection is mapped to a bipartite graph labeling problem, which is then solved iteratively
13	In-memory caching for multiquery optimization of data-intensive scalable computing workloads [48]	Proposing a method for multiquery optimization which is combined with in-memory cache primitives to improve the efficiency of data-intensive frameworks	Join	Temporary	Uniform	N/A	It exploits sharing opportunities by caching distributed relations
14	A parallel query processing system based on graph-based database partitioning [49]	Proposing a novel graph-based database partitioning method called GPT which considers the trade-off between data redundancy and the number of opportunities for join processing without shuffling	Join	N/A	Uniform	Grouping, sorting, and partitioning	It exploits sharing by using the partitioning method (i.e., hash-based multicolumn (HMC)), while our work considers data granularities and implicit sort to reduce shuffling in multiquery
15	An executable specification of map-join-reduce using Haskell [50]	Proposing an executable Map-Join-Reduce programming model based on Haskell; the key idea of the Map-Join-Reduce model is adding a new join module to MapReduce, based on filtering-join-aggregation MapReduce, to optimize multiway join	Join	N/A	Uniform	N/A	Map-Join-Reduce optimizes a single multiway join, while J-MOTH considers sharing opportunities among multiple multiway join queries on MapReduce and Flink
16	Exploiting coarse-grained reused-based opportunities in big data multiquery optimization [21]	Proposing the MOTH system to exploit the coarse granularity of full and partial sharing in multiquery on slow storages	Selection, projection	Temporary	Nonuniform	Materialization	MOTH system is our previous work, and it tackles sharing data within big data multiquery
17	SOOM: sort-based optimizer for big data multiquery [22]	Proposing the SOOM system to exploit the sharing sort opportunities, including explicit sorts of sort queries and implicit sorts of aggregation queries	Aggregation and sort queries	Temporary	Nonuniform	Materialization	SOOM system is our previous work, and it tackles sharing data and work within big data multiquery, including explicit sorts of sort queries and implicit sorts of aggregation queries
18	J-MOTH: exploiting sharing work for big data multiquery optimization	Proposing the J-MOTH system to exploit the coarse granularity of sharing data besides the implicit sorts in two-way join and pipelined multiway join queries	Join (two-way and multiway)	Temporary	Nonuniform	Materialization	Our contribution in this work