Distributed DBMS: Overview and Query Processing

Goal: a small-scale (dozens of sites) distributed DBMS

1.     Location transparency: users don’t know where the data is

2.     Performance transparency: performance independent of submission site

3.     Copy transparency: objects can be copied, and copies are maintained automatically.  Critical for availability

4.     Transaction transparency: looks like single-site xacts

5.     Fragment “transparency”: tables can be fragmented to different sites

6.     Schema change transparency: schema updates at a single site affect global schema

7.     Local DBMS transparency: shouldn’t matter what DBMS is running at each site (ha!)


Nobody provides all of this (#7 & #3 are still major research field)

IBM Almaden’s R* was the most “real” DDBMS prototype.

·       VTAM/CICS provided communication

·       Good optimizer, concurrency control


·       SDD-1: done at CCA, never really ran.  PDP-10’s on Arpanet.  Strange design decisions based on unreasonably slow network.

·       Distributed INGRES: UCB, also never ran.  Lack of UNIX networking software (pre TCP/IP!).

·       Commercial vendors beginning to provide this functionality


The R* Distributed DBMS


·       location transparency

·       local privacy & control

·       local performance on local queries

·       no central dependencies: site autonomy

·       no central catalogs

·       no central scheduler

·       no central deadlock detector


Tables can be

·       dispersed

·       replicated

·       partitioned:

·       horizontal partitioning (by predicate)

·       vertical partitioning (but a key at each partition)

·       snapshots


Table names: user_name@user_site.object_name@birth_site


Catalogs store info on local objects, and objects born locally (along with pointers to current sites)

Remote catalog info is cached as hints:

·       invalid cache entries are detected easily and updated

·       catalog entries have version numbers

a catalog entry for a table:

·       System-Wide Name

·       type info

·       access paths

·       view definition (if it’s a view)

·       optimizer stats


To find a catalog entry:

1.     check local catalog & cache

2.     check birth site

3.     check the place the birth site points to


An overview of Distributed CC:

Every transaction gets an xactno: site_name.time (or SN)

deadlocks: shoot the youngest, largest-numbered xact


Commit Protocol – you can’t just decide to commit:

·       some site may not be able to

·       you can’t distingush inability from slowness!

Two Phase Commit (2PC):

1.     coordinator says to participants: want to commit? (N-1 messages)

2.     Participants all vote yea/nay (N-1 messages)

3.     coordinator tells everyone to commit/abort (N-1 messages)

4.     participants acknowledge message (N-1 messages)

            This does the right thing, but it’s pretty costly.  We will  read about improving this next time.


Distributed Deadlock Detection:

·       find and resolve local deadlocks

·       pass your waits-for graph around to detect global deadlocks

·       we will read more about how to do this efficiently next time


Mackert & Lohman: Distributed Query Optimization & Processing

·       Motivation for reading this paper: understand issues/alternatives/tradeoffs in distributed query processing.

·       Goal is not really to study the validation of the R* optimizer.

·       Basic idea in cost evaluation: cost is IOCost + CPUCost + MsgCost + Bit-ShipCost

·       New costs for sending messages.  Note breakdown into msg overhead & data

·       Simple case: non-replicated, horizontally partitioned single table:


Madison: EMP where employer = ‘UW’

Berkeley: EMP where employer = ‘UC’

NYC: EMP where employer = ‘NYU’

Optimal plan is Q(T) = where n is # partitions and Q(Ti) is best local plan.

(Can often “semantically”' detect that some partitions will yield null result.)


·       Replicated data: just choose best of all plans: Q(T) = Choose-best(Q(T1), … ,Q(Tn)).

·       More complicated: n-way join queries.  Options include:

1. “Ship Outer”

·       Retrieve qualified tuples at outer site.

·       ship to inner site.

·       join arriving tuples with inner relation (packet by packet.)

·       Note: this is pipelined.

2. “Ship Inner”

·       Retrieve qualifying (inner) tuples at inner site.

·       Send them to outer site.

·       Accumulate temp relation at outer site.

·       join outer with temp at outer site.

·       Note: this is sequential.

3. “Fetch Matches”

·       retrieve qualified outer tuples.

·       send their join column values to inner site.

·       retrieve matching inner tuples.

·       return those inner tuples to outer site.

·       join them on arrival w/waiting outer tuples.

·       repeat until done.

·       Mackert & Lohman argue that this never wins

4.“Ship Both”

·       Retrieve qualified inner tuples.

·       Send to join site.

·       accumulate in temp realation.

·       retrieve qualified outer tuples.

·       send to join site.

·       as outer tuples arrive, join with temp.

·       (1/2 pipelined, 1/2 sequential.)

·       Might actually make sense if next join is at the join site!

5. “Semijoin strategy''

·       (Outer) sent to inner site.

·       Join inner with this projection, send result to outer site.

·       At outer’s site, join outer with result of semi-join.

·       SDD-1 used semi-joins exclusively (slow network)

6. “Bloom join”

·       Scan outer, producing bit vector.

·       send bit vector to inner site.

·       use bit vector filter to reduce inner.

·       send reduced inner to outer, do join.

·       Works pretty well!


Miscellaneous Observations

·       They exploit parallelism in various ways

·       queries pipeline some of the time

·       distributing the work reduces contention at buffers, making optimizer “more accurate”

·       Optimizer must incorporate all costs – SDD-1-style assumptions lead to short-sighted design

·       No work on heterogeneity, not even different machine speeds or network links!

·       No fragmentation or replication