Concurrency Control and Recovery for Search Trees

Concurrency in B+-Trees

Challenge: Develop a concurrency protocol that allows highly concurrent access to a search tree.  (2PL is unacceptable!)

Solution 1: Optimistic CC.  Kung and Robinson said we can do optimistic locking on B-trees, since each lookup touches only h pages, and the tree will have fh pages.

Solution 2: release locks early (non-2PL!)  "Latches". Solution 2.1: Extra latch modes + "Safety analysis" (Bayer/Schkolnick)
Solution 2.2: Page latches, Rightlinks and compensating traversal (Lehman/Yao)
Solution 2.3: Latch coupling during traversal. "Repositioning" logic ocassionally required. (ARIES/IM)
 

B-Link Trees (Lehman/Yao)

A super-high concurrency solution, at the expense of a little extra complexity in the data structure. Search
current = root;
A = get(current);
while (current is not a leaf) {
        current = scannode1(v, A);
        A = get(current);
}
while ((t = scannode(v,A)) == link pointer of A) {
        current = t;
        A = get(current);
}
if (v is in A)
        return(success);
else return(failure);
Simple! Only trick is to have scannode know about high-keys and right-links.

(Footnote: The scannode(v,A) routine examines memorypage A and finds the appropriate pointer for value v .  Note that itmay return a right-link pointer instead of a child pointer.)

Insert

First, we find a leaf node, and keep a stack of the rightmost node we visited at eachlevel:

initialize stack;
current = root;
A = get(current);
while (current is not a leaf) {
        t = current;
        current = scannode(v,A);
        if (current not link pointer in A)
                push t;
        A = get(current);
}
When we get to the leaf level, we may need to search right for the appropriate leaf.  The move_right procedure scans right across the bottom, with lock coupling (i.e.if you have to move right, first lock right neighbor, then release lock on current).
lock(current);
A = get(current);
move_right();
Now, assuming the key/ptr pair is not already in the tree, we proceed to insert &possibly split:
Doinsertion:
if A is safe {
        insert new key/ptr pair on A;
        put(A, current);
        unlock(current);
}
else { // gonna have to split
        u = allocate(1 new page for B);
        redistribute A over A and B;
        y = max value on A now;
        make high key of B equal old high key of A;
        make right-link of B equal old right-link of A;
        make high key of A equal y;
        make right-link of A point to B;
        put (B, u);
        put (A, current);
        oldnode = current;
        new key/ptr pair = (y, u); // high key of new page, new page
        current = pop(stack);
        lock(current);
        A = get(current);
        move_right(); // at this point we may have 3 locks: oldnode,
                      // and two at the parent level while moving right
        unlock(oldnode);
        goto Doinsertion;
}
Note the worst-case multiple locking here. (Improvement later proposed by Sagiv, and byLanin & Shasha: unlock current after put(A, current) ).

Delete:

Just remove from the leaf. They punt on underflow -- just let leaves get empty, never delete them (hence never do deletion from internal nodes.) If you think your tree is too empty, then reorganize it offline.   In practice, people don't deal with underflow in real systems, but do reclaim empty pages periodically.

Why does this all work?

  1. Deadlock-free: there is a total order on tree nodes (bottom-up, left-to-right) that is followed by locking protocol
  2. Correctness of Tree Modifications: By doing put(B, u) before put(A, current) in insert, we make node splitting atomic at each level. This means at any time the tree will appear consistent (though it may have "big" nodes)
  3. Correct Interaction: it is somewhat tricky to show that reader/writer and writer/writer pairs don’t step on each other. The worrisome case is when a reader reads node N, which is subsequently updated by a writer. In this case, the reader will subsequently detect relevant inserts by seeing (possibly lower in the tree) that she needs to keep looking right.


Alternative techniques are used in ARIES: ARIES/KVL & ARIES/IM. Don;t require right-links.  Instead, do lock-coupling during traversal. On occasion need to"reposition" (i.e. find the appropriate spot on a level to continue.) Papers handle lots more details than Lehman/Yao, including degree-3 consistency, deletion, logging/recovery, savepoints.
 

Degree-3 Consistency in B+-Trees

The Phantom Problem:  Assume you do correct tree concurrency and 2PL of data records.  Suppose you start a transaction that queries an index for <name = "Dennis the Menace">, and the result is empty.  Now suppose another transaction inserts a person with name Dennis the Menace, and you re-run your query.  Now you'll see a Phantom Menace!

Problem: 2PL locks the data, but not the "holes" in the data domain that overlap the query.

Solutions:

Concurrency Control in GiSTs (and R-trees, etc.)

A CS286 class project in ’94, published VLDB ’95 (Kornacker & Banks). Improved for GiST with concurrency, degree-3 consistency, deletion, savepoints in SIGMOD ’97 (Kornacker, Mohan & Hellerstein).

Main differences to focus on:

Raises 3 questions:
  1. how do we detect a node has split?
  2. how do we limit extent to which we move right?
  3. how can we find a physical surrogate for phantom protection?
Idea for first 2 questions: impose an ordering that has nothing to do with the data.

Each page gets a Node Sequence Number (NSN), like a timestamp.  On page split, the new right sibling gets the original NSN, and the left sibling gets a new NSN, and parent’s NSN is updated on insertion of pointer to new sibling.

Split detection: Traversal remembers the global NSN when remembering a child pointer.  If upon arrival at the child, the child’s NSN is greater than the global NSN remembered while in the parent entry, child has since split.

Limiting right-traversal: only scan until a lower NSN.

Phantom Protection in GiST:

Paper also shows that ARIES-IM doesn't apply to GiST.  If a sibling splits, need to "reposition" in the parent.  Requires you to understand what siblings you've already seen based on keys, which implies that keys must partition the data domain (or you can't tell what you've already seen!)

GiST (and B-link Tree) Recovery

Goals of recovery: Note that the former goal is not a transactional statement!  Leads to two main tricks: A detail: Note: There's nothing here that has anything to do with the data -- it's all extensible.  The contribution of the paper here is to figure out recovery for B-link and GiST, which actually improves upon ARIES/IM.

Lessons

CC&R: Extensibility