Relational queries 101
Themes: Dataflow and Dataflow Operators
A quick primer on the relational algebra
- selection
- projection
- cartesian product
- union
- difference
- join (in particular, theta-join)
Dataflow Architecture: Iterators
Basic question: how do you turn the algebraic operator abstraction into a
programming construct? Dataflow.
An OO view of iterators:
An iterator is a superclass for all dataflow operators. It has three
methods:
- open(): initialize state to set up dataflow.
- next(): produce a record as output
- close(): clean up state
The output type of next() is well typed. In a SQL system, this is a
SQL record in some schema, e.g. (integer, numeric(10,2), varchar, float).
Iterators are strung together into a dataflow by parameterizing them to
have other iterators as inputs, and then having their methods invoke the
input iterators' methods. Usually parent.open() invokes child.open()
and sometimes child.next(). Usually parent.next() invokes child.next()
or nothing. Usually parent.close() invokes child.close().
An iterator also has a number of private data members, including all the
"input" iterators it invokes (usually one or two of these), and any state
it maintains. The types of the records from input iterators need to
match the operator semantics in generating the output type. In a DBMS,
this is all set up automatically at query parsing/optimization time.
Extensible "optimizer generators" have been developed that let you declare
these rules in a high-level way (not as optimizer code).
Dataflow Design Space
Typically, iterators in a single-site query processor make synchronous calls
to their children. This is a "pull" model, like sucking data through
a straw. For now, it's fine to think of iterators that way.
As we've seen with P2 and we'll see again with Click, in some
contexts pull is not the best choice -- particularly if there's
asynchrony in your system (e.g. unpredictable networks). Some
kind of buffering and/or rate-matching has to happen to keep
operators in sync.
Typical DB Dataflow Operators
Unary Operations
- sorting: see Knuth, or Shapiro paper. Then follow up at Jim Gray's Sort Benchmark homepageUses?
- hashing: Two-level partitioning: scan and form coarse hash
partitions on disk, then for each partition hash into memory.
Recurse as needed. Uses?
Binary Operations: Union
Without duplicate elimination, a very simple iterator (SQL's "UNION ALL").
Only issue is that both inputs have to have identical schemas.
With duplicate elimination (SQL's plain "UNION"), you need to eliminate
duplicates too. Simple implementation: do UNION ALL as one iterator,
and dup-elim as another.
Binary Operations: Join
Join algorithms apply to almost any form of combining multiple collections,
except for UNION.
Some commonly used join variants (alternative logical algebra operators):
- semi-join: R semi-join S ~= R join remove-dups(S) projected to the
columns of R
- S basically serves as a filter
- logically, selection is a semijoin with an infinite relation (!!)
- outer join: R outer-join S: compute R join S, and for each tuple of
R that has no match send it to the output with the S columns filled in with
NULLs. Left, Right, and Full Outer Joins exist. Very common for web
apps. Some logical subtleties here (no longer commutative/associative).
- intersection & difference
These logical algebra operators can be implemented as minor variations on
the same join algorithms!
The "Guy Lohman Test" for join operators (a.k.a. pipelining of intermediate
results):
- does the operator work for joining 3 inputs without storing the output
of 2 of the inputs?
The "Joe Hellerstein Test" for join operators (a.k.a. full pipelining):
- does the operator produce outputs before it finishes reading any of
its inputs?
- how quickly -- and are they the juicy outputs?
Nested Loops Join
for each tuple r of R
for each tuple s of S
if rs satisfies join predicate
output rs
R is the outer relation (left)
S is the inner relation (right)
- works for any join predicate
- inner input must be stored
Refinement 1: Block Nested Loops Join
for each block BR of R
for each tuple s of S
for each tuple r of BR such that rs satisfies join predicate
output rs
Further refinements to nested loops:
- load R in chunks of M-K pages to amortize seek cost, "pin" K pages
of the inner into memory
- alternate scan direction on inner ("boustrophedonism")
Refinement 2: Index Nested Loops Join
for each tuple r of R
probe index over S;
output all s s.t. rs satisfies join predicate;
Notes:
- Still called "nested loops", S is still referred to as the "inner"
relation
- any join predicate can be supported if the index supports the predicate!
SELECT cities.name
FROM cities, forests
WHERE cities.boundary overlaps forests.boundary;
- can convert all nested-loops joins to index-nested-loops by indexing
inner on the fly. usually not worth it.
(Sort-)Merge Join
Works for equijoin, "band" joins
we will assume here you know how to do a 2-pass sort [see Knuth or Shapiro]
idea: if R & S are sorted on the join column(s), we can "simply" merge
them
But duplicates complicate things: little embedded nested loop on dups.
Refinement: do merging of R & S while merging sort runs from each.
- Requires enough buffers for merging both R and S simultaneously.
Note: Sort-merge is symmetric, so "outer", "inner", "left", "right" are arbitrary
Grace Hash Join
Phase 1 is repeated with S in place of R
THEN
Advantages:
- works well for big R, esp. when RAM is just big enough for output buffers
of R
Disadvantages:
Hybrid Hash
Original paper: DeWitt, Katz, Olken, Shapiro, Stonebraker, Wood, SIGMOD '84.
Phase 2 as in Grace Join
Hybrid Hash Advantages:
- As good as simple hash for small R
- As good as Grace for big R
- If RAM is bigger than # of output buffers, improves on Grace
Disadvantages:
Handling Partition Overflow:
- If a partition of R overflows, recursively partition it, along with
the corresponding bucket of S
- Note that size of S does not affect level of recursion!
- Makes hash-join particularly effective if |R| << |S| (compare
to sort-merge)
Additional Tricks: Filters
Idea: build a filter based on R so you stage less of S to disk
Parallel DB 101
Performance metrics:
- Speedup: fix job,
measure small-system-elapsed/large-system-elapsed (grows)
- Scaleup
- Transaction scaleup: N times as many TPC-C's for N machines
- Batch scaleup: N times as big a query for N machines
2 kinds of parallelism
- pipelined (most are short)
- partition
3 barriers to linearity:
- startup overheads
- interference
- skew
3 basic architectures
- shared-memory
- shared-disk
- shared-nothing
data placement
- round-robin
- range partitioning: good for range queries, but results in skew
- hash partitioning
note the history
- Teradata had 1,000-node data-parallel clusters in 1992!
- Gamma and Bubba were university research projects running 32
and 40-processor data-parallel in 1992.
The Exchange Operator: Parallelizing Iterators
Annoyance: "parallelizing" a query executor is a hassle.
Solution: encapsulate the parallelism in a query operator, not
in the QP infrastructure.
- another classic level-of-indirection scheme, this
time for dataflows.
We want to enable partition and pipelined parallel dataflows over
iterators -- including setup, teardown, and runtime logic -- in a
clean encapsulated way.
The exchange operator: an operator you pop into any single-site
dataflow graph as desired -- anonymous to the other operators.
Implementation:
-
Note: Volcano was done with processes, but today you'd use threads
-
splits the graph into two threads. The lower thread has an X-OUT
iterator at the top. The upper thread has an X-IN iterator at the
botton.
-
X-OUT is a driver for the lower iterator. Says next() a bunch of
times, constructs a packet, pushes that packet via IPC (or network
comm) onto a queue in X-IN's "port". X-IN responds to next() when
it has tuples in its queue.
-
Why is push beneficial?
-
Flow control: semaphore on the port dictates the maximum degree to which
the producer can get ahead of the consumer.
Benefits of exchange:
-
opaquely handles setup and teardown of clones (in an SMP...for shared-nothing,
would need to have daemons at each site, and a protocol to request clone
spawning)
-
at the top of a local subplan, allows pipeline parallelism: turns iterator-based,
unithreaded "pull" into network-based, cross-thread "push".
-
at the top of a local subplan, allows decoupling of children's scheduling.
-
inside a subplan, can mix pull and push to get the best of both
"Extensibility" features of Volcano and exchange:
-
operators don't interpret records, support functions do; goes for
partitioning as well
-
basically just a level of indirection for anything data-dependent
-
Volcano had a much more aggressive extensibility story for query optimization
-
rule-based query optimizer generator -- evolved from Graefe's thesis work
on the Exodus optimizer generator. Later work on the Cascades optimizer
evolved this further
-
contrasts with Starburst's rule-based optimizer, which is more like an
extensible version of the System R optimizer (more standard).
-
if you're interested in using query optimization ideas for building up
dataflows of arbitrary operators (not just relational operators), you should
know this material