|
No. | Reference | Functionalities description | Query type | Storing type | Data distribution | Methodologies | Difference relative to J-MOTH system |
|
1 | MRShare: sharing across multiple queries in MapReduce [18] | The concurrent-sharing framework that exploits sharing opportunities among multiple jobs by grouping them into a single job | Aggregation | Temporary | Uniform | Grouping shared queries | The sharing optimization of similar work and overlapped work in MRShare and relaxed MRShare, respectively, are considered fine-grained sharing |
2 | Multiquery optimization in MapReduce framework [31] | Relaxing and generalizing MRShare's handling of overlapping queries to increase the sharing opportunities within a single job | Aggregation | Temporary | Uniform | Grouping materialization of overlapped queries | Same as No. 1
3 | ReStore: reusing results of MapReduce jobs [19] | Nonconcurrent sharing system that optimizes query evaluation using materialized results produced by Pig workflows of MapReduce jobs | N/A | Permanent | Uniform | Materialization | In the worst case for both the HOME and ReStore systems, the output of a full job or subjob is never reused by future queries; these systems therefore suffer from a big data storage limitation, which incurs a high cost to buy additional physical storage or rent extra virtual storage in a big data environment
4 | HOME: HiveQL optimization in multisession environment [51] | The HOME system improves Hadoop performance by storing the data of previous results and reusing it in the next run, in the same session or in a different session | Selection, aggregation, join | Permanent | Uniform | Materialization | Same as No. 3
5 | Improving the performance of Hadoop Hive by sharing scan and computation tasks [52] | MQO framework, SharedHive, improves the overall performance of Hadoop by grouping correlated HiveQL queries into a new set of queries | N/A | Temporary | Uniform | Grouping | It exploits the temporary fine-grained correlated queries to optimize multiquery for only one session over Hive |
6 | Reuse-based optimization for Pig Latin [53] | The PigReuse system operates on common subexpressions, using the algebraic representations of Pig Latin scripts and reuse-based algorithms, to share their results | N/A | Temporary | Uniform | Reuse-based | It exploits fine-grained reuse-based opportunities to optimize multiquery over Pig without considering data distribution
7 | Exploiting soft and hard correlations in big data query optimization [54] | The EXORD system exploits the granularity of data correlations (i.e., soft and hard) to improve big data query optimization | N/A | Temporary | Uniform | Materialization | It deals with multiquery sharing and reuses fine-grained computed results based on sharing correlations
8 | JOUM: an indexing methodology for improving join in Hive star schema [24] | JOUM (join once use many) speeds up Hive join query tasks by using pipelined materialization of the full joined star schema | Join | Permanent | Uniform | Materialization | It does not exploit sharing among join queries; it optimizes each individual incoming join query based on the prejoined schema
9 | Optimizing join in Hive star schema using Key/Facts indexing [41] | Proposing key/facts indexing to materialize the star schema data and build an index for this materialized data to optimize JOIN on Hive | Join | Permanent | Uniform | Materialization | It does not exploit sharing among join queries; it optimizes individual incoming join query based on the indexed prejoined schema |
10 | BlockJoin: efficient matrix partitioning through joins [42] | Proposing an optimized distributed join algorithm, namely, BlockJoin, to reduce the shuffling costs of intermediate data by merging relational and linear algebra operations into specialized physical operators | Join | Temporary | Uniform | Materialization and index join | It does not exploit sharing among join queries; it optimizes each individual incoming join query using an index and a materialization strategy (i.e., early or late) chosen according to the shape of the tables
11 | Wide table layout optimization based on column ordering and duplication [55] | Proposing a fine-grained cost model for column accesses and column duplication to optimize the HDFS I/O cost of a query workload | Join | Temporary | Uniform | N/A | It does not exploit sharing among join queries; it implements the fine-grained cost model using a simulated annealing-based column ordering algorithm to find approximately optimal column orders and combines it with a storage-constrained column duplication algorithm to optimize I/O throughput
12 | Selecting subexpressions to materialize at datacenter scale [36] | Proposing a vertex-centric graph algorithm called BIGSUBS to iteratively select subexpressions in parallel to be materialized over very large workloads | Join | N/A | Uniform | Materialization | A subexpression selection is mapped to a bipartite graph labeling problem and then it is solved in an iterative manner |
13 | In-memory caching for multiquery optimization of data-intensive scalable computing workloads [48] | Proposing a method for multiquery optimization that is combined with in-memory cache primitives to improve the efficiency of data-intensive frameworks | Join | Temporary | Uniform | N/A | It exploits sharing opportunities by caching distributed relations
14 | A parallel query processing system based on graph-based database partitioning [49] | Proposing a novel graph-based database partitioning method called GPT, which considers the trade-off between data redundancy and the number of opportunities for join processing without shuffling | Join | N/A | Uniform | Grouping, sorting, and partitioning | It exploits sharing by using a partitioning method (i.e., hash-based multicolumn (HMC)), while our work considers data granularities and implicit sorts to reduce shuffling in multiquery
15 | An executable specification of map-join-reduce using Haskell [50] | Proposing an executable Map-Join-Reduce programming model based on Haskell; the key idea of the Map-Join-Reduce model is to add a new join module to MapReduce, based on filtering-join-aggregation MapReduce, to optimize multiway joins | Join | N/A | Uniform | N/A | Map-Join-Reduce optimizes a single multiway join, while J-MOTH considers sharing opportunities among multiple multiway join queries on MapReduce and Flink
16 | Exploiting coarse-grained reused-based opportunities in big data multiquery optimization [21] | Proposing the MOTH system to exploit the coarse granularity of full and partial sharing in multiquery on slow storages | Selection, projection | Temporary | Nonuniform | Materialization | The MOTH system is our previous work, and it tackles sharing data within big data multiquery
17 | SOOM: sort-based optimizer for big data multiquery [22] | Proposing the SOOM system to exploit sort-sharing opportunities, including explicit sorts of sort queries and implicit sorts of aggregation queries | Aggregation and sort queries | Temporary | Nonuniform | Materialization | The SOOM system is our previous work, and it tackles sharing data and work within big data multiquery, including explicit sorts of sort queries and implicit sorts of aggregation queries
18 | J-MOTH: exploiting sharing work for big data multiquery optimization | Proposing the J-MOTH system to exploit the coarse granularity of sharing data, in addition to the implicit sorts, in two-way join and pipelined multiway join queries | Join (two-way and multiway) | Temporary | Nonuniform | Materialization | Our contribution in this work
|
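The grouping idea summarized in row 1 (merging several queries that read the same input into a single job) can be illustrated with a toy, single-process sketch. This is not the implementation of MRShare or of any other system in the table; the `Query` class, the `run_shared` function, and the sample rows are hypothetical names and data introduced only for illustration, and the sketch assumes simple filtered SUM ... GROUP BY queries over in-memory rows.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]


@dataclass
class Query:
    """Hypothetical query: SUM(value_col) GROUP BY key_col WHERE predicate."""
    name: str
    key_col: str
    value_col: str
    predicate: Callable[[Row], bool]


def run_shared(rows: Iterable[Row], queries: List[Query]) -> Dict[str, Dict[object, float]]:
    """Evaluate every query in one pass, so the input is scanned only once."""
    results = {q.name: defaultdict(float) for q in queries}
    for row in rows:              # single shared scan of the input
        for q in queries:         # each grouped query inspects the same row
            if q.predicate(row):
                results[q.name][row[q.key_col]] += row[q.value_col]
    return {name: dict(groups) for name, groups in results.items()}


# Illustrative data and queries (assumed, not taken from any cited system).
rows = [
    {"region": "eu", "product": "a", "amount": 10.0},
    {"region": "eu", "product": "b", "amount": 5.0},
    {"region": "us", "product": "a", "amount": 7.0},
]
queries = [
    Query("by_region", "region", "amount", lambda r: True),
    Query("by_product_large", "product", "amount", lambda r: r["amount"] > 6.0),
]
shared = run_shared(rows, queries)
```

Running the two queries separately would read `rows` twice; the grouped version reads it once and keeps one accumulator per query, which is the kind of saving that scan-sharing approaches aim for on much larger inputs.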