Ideas for CS286 Projects

Indexing

Big Systems

Interactive Data Analysis

  • Using tools from the CONTROL group, develop an interesting application that provides interactivity during data delivery
  • Data Mining

  • Extend libGiST to implement clustering algorithms from data mining, e.g. BIRCH/BUBBLE.  Does libGiST's extensibility allow you to improve/enhance earlier work?
  • implement an interactive tool that leverages the CARMA algorithm for online association rule data mining.
  • develop interactive algorithms for clustering or classification problems in data mining.

  • Some suggestions from Mike Carey


    And here are some more, courtesy Paul Aoki

    [Straightforward] = This basically has to work.
    [Speculative] = It really seems like this ought to work.
    [Wacky] = I have no idea what I'm talking about and want you to tell me if it works or not.

    1. support vector trees. [Wacky]

    Support vector machines are a classification technique that map a multidimensional input space into a parameter space in which it is supposed to be easier to separate (predetermined) classes using hyperplanes.  This principle seems like it might be usefully applied to geometric search structures, since we often find that we can compute good record clusterings using heuristics but then cannot represent them effectively using (e.g.)  bounding rectangles in theinput space.

    - Can we figure out a way to support input space queries (window queries, NN search) rather than just classification queries  (point-contained-in)?  This might involve some restriction on the SVM mappings we can use, e.g., we might have to limit the SVM to affine mappings to make window queries work.
    - Does SVM work for a large number of classes (think one class per disk page)?
    - Can SVM representations be updated dynamically?
    - Can SVM representations be page-externalized effectively (and hierarchically)?
    - How on earth does this really work?  Why should we believe that our clustering will map "nicely" into any of the "usual" SVM spaces?

    2. wavelet trees. [Speculative]

    Several access methods, exemplified by the arc-tree (Guenther, CVGIP90), are based on successive refinement (multiresolution) techniques.  Wavelets provide another way to construct a multiresolution decomposition.

    - Can wavelet representations be updated dynamically?
    - Can they be page-externalized effectively (and hierarchically)?
    - What problems can the resulting structure solve, and is there any reason why this will work better than a quadtree? (2-d Haar wavelets have similarities to a quadtree decomposition.)

    3. space-filling curves. [Straightforward]

    Many people, including many in the database community, have proposed clustering data using Z-ordering and Hilbert curves.  A few have suggested using Sierpinski curves (none in the database community, yet), and some recent theoretical results show that Sierpinski curve variants may be provably better than Hilbert curves by certain locality metrics (Niedermeier, FCT97).

    - Are these metrics useful?
    - Is this effect experimentally significant?  (Are the differences large enough so we care?)

    4. extensibility in libgist. [Straightforward]

    Implement a bunch of trees (meaning, at least three) in libgist, including some that are not simple variants of the R-tree.  Write up the lessons learned (in terms of perceived weaknesses in the existing interfaces and abstractions, not the bugginess of the code or the lack of documentation).

    5. NVRAM main-memory databases. [Straightforward]

    OS-types keep coming up with systems to provide persistent memory.  Most (e.g., RVM) apply standard database recovery techniques.  A few try to use non-volatile memory (see Chen, VLDB97).  Critique the failure model(s) of the proposed NVRAM techniques.

    - Can you trust the models?
    - What services and capabilities do NVRAM techniques miss out on relative to full RVM?
      - "One-safeness" wrt media failure?
      - Support for transaction rollback?
      - Bugs in various storage subsystems?

    6. WAL main-memory databases. [Straightforward]

    Various aspects of database systems can be restructured if all of the base data fits in main memory.  But do any of them change fundamentally?  Read the papers carefully.

    - Are any of the MMDB vendors (Lucent DataBlitz, TimesTen) really getting better performance because they've done anything  interesting, or is it really just because the database is in memory?
    - Is it really because they're providing less functionality and generality?  (What's the cost-per-tuple for DataBlitz vs. Informix?)

    7. WAL almost-main-memory databases. [Speculative]

    What if we let our "MMDB" grow beyond main memory?

    - Are the optimal design choices still essentially the conservative ones resulting from small main memory?
    - What really changes?  Do we just end up with (e.g.) very large pages?

    8. concurrency control in libgist. [Straightforward]

    Implement Marcel's CC algorithms (see Marcel for a couple of known bugs in the paper, or see Ravi Kanth, IPPS98) in libgist.  Measure and evaluate.

    9. what is this VC-dimension, thing, anyway. [Speculative]

    Some database researchers have been working on the sample complexity of various problems, as well as the "index complexity" (indexability) of others, etc.  The Vapnik-Chervonenkis dimension is a concept from
    machine learning that ties the storage requirements of a learning technique to its ability to the quality of its answers to
    classification problems.  I've only seen it applied in one database paper (Roy, PODS91).

    - Does the VC-dimension "port" directly into estimation and search problems, like sampling and indexability?  If not, why not?


    Joe Hellerstein, jmh@cs.berkeley.edu
    Modified: $Date: 1999/02/23 10:21:08 $ by $Author: aoki $