Due to the rapid development of massively parallel data processing systems such as MapReduce and Spark, there have been revived interests in the theoretical computer science community to study algorithms in a massively parallel computational model. Join evaluation, as one of the central algorithmic problems in database theory, has received much more attention. In this thesis, we study how to compute join queries efficiently (with theoretical guarantees) in today’s massively parallel systems. We give an almost complete characterization of acyclic join queries on which instance-optimality, output-optimality and worst-case optimality can be achieved respectively. We first give an instance-optimal algorithm for r-hierarchical joins and prove that instance-optimality cannot be achieved for non-r-hierarchical joins. Then we give a new output-sensitive algorithm for arbitrary acyclic joins, which is optimal if the output size is not too large. At last, we present a worst-case optimal algorithm for arbitrary acyclic joins, whose complexity is characterized by the fractional edge cover number of the join hypergraph. On the other hand, we give evidence that cyclic joins are inherently more difficult than acyclic ones. In particular, we give the first output-sensitive lower bound for the triangle join, as well as a worst-case lower bound for the so-called ⊟-join. Beyond natural joins, we also investigate similarity joins under l
1/l
2/l
∞ distance metrics by designing three algorithms: (1) a deterministic output-optimal algorithm for l
1/l
∞ metric in constant dimensions; (2) a randomized output-sensitive algorithm for l
2 metric in constant dimensions; (3) an LSH-based approximation algorithm for arbitrary metric in high dimensions. Although this thesis is theoretical in nature, some of the developed algorithms are also quite practical. We have implemented them in Spark and the experimental results have demonstrated their efficiency in practice.
| Date of Award | 2019 |
|---|
| Original language | English |
|---|
| Awarding Institution | - The Hong Kong University of Science and Technology
|
|---|
Massively parallel algorithms for acyclic joins
HU, X. (Author). 2019
Student thesis: Doctoral thesis