Home
 Basic Info
 Lecture Notes
 Syllabus
 Project Ideas

You are encouraged to design your own project. To get you thinking, here are some ideas; not surprisingly, they tend to focus on ongoing projects related to work here at Berkeley. However, you are by no means required to work on things we're interested in!

Mashups for the Masses

Courtesy Minos Garofalakis and Rob Ennals, Intel Research

Normal people have lots of data that they would like to explore in an easy, intuitive manner, posing queries like: What is the weather like in these 10 holiday destinations? Which of my friends have I not emailed in over a month? Which of these recipes have incredients I can buy in Safeway? Typically, the bulk of this data has little or no structure and not stored database systems; so, while it is possible to extract the relevant information, it is definitely not "easy".

"Mashups for the Masses" is a project that attempts to make it easy for unskilled, non-DB users to: (a) Create loosely structured databases of information (think Notepad or Excel, not Oracle). (b) Use dashboard-like "query widgets" to generate extra information, based on the information they already have and information available on the web. (c) Share data and queries with friends and with the rest of the world. (d) Browse through data without having to think about the underlying queries and data-source specifics.

Specific sub-projects of interest (with both conceptual and implementation components) include:

  1. Recognizing Data and Structure: Given loosely structured data, how do we identify interesting structure and information that can be extracted from it. For instance, one possibility is to work out which query widgets might be able to do something useful with it, based on whether other users have applied similar query widgets to similar data in the past.
  2. Interesting Query Widget Construction: Query widgets can get data in a variety of ways, including from web services and web sites. In addition, query widgets should be easy to publish and compose in order to accomplish "interesting" things; for example, the system could learn or suggest potentially interesting query widgets to users based on the actions of their friends/collaborators. A possible implementation task would be to build an API/toolkit that allows easy construction and composition of specific query widget types (e.g., ones that scrape data off a web site).
  3. Mobile Data: Write a PDA- or smartphone-based browser for personal mashup data. This should allow one to access and edit the same data that is accessible on the PC app. New issues may arrise about synchronisation of data, and making use of limited storage, bandwidth, and screen space.
Related references include work on the Semex project (http://data.cs.washington.edu/semex/semex.html) the MIT work on goal-oriented user interfaces (http://agents.media.mit.edu/projects/semanticsearch/) and the very recent efforts on personal dataspaces (http://www.cs.berkeley.edu/~franklin/Papers/dataspaceSR.pdf).

Parallel Computing

Courtesy Jim Demmel and the BeBOP group

  1. A previous CS262 project by Rajesh Nishtala involved tuning some collective communications in MPI, and obtained excellent performance improvements. This work didn't include many commonly-used MPI collective communication calls; one project is to extend the work to implement and explore optimization strategies for the other collective operations.
  2. Parallel computing platforms like grid typically have high network latencies (compared to their computational power). Latency avoiding algorithms try to overcome this by reducing the number of remote fetches at the cost of possibly more computation. Furthermore, jitter (uncertainty in the communication time) can also affect the performance adversely. It would be interesting to compare and contrast a latency avoiding implementation (e.g., Cholesky factorization algorithm in the LAPACK) with a non latency avoiding implementation (e.g., LU factorization) in the above context (e.g., on a grid). Noise can be injected in the implementation to study how much the two implementations are sensitive to jitter.

SW Engineering and Security

Courtesy Prof. Koushik Sen

  • Build a light-weight dynamic analysis tool to detect Web application security vulnerabilities such as SQL injections and cross-site scripting attacks. The idea behind such a tool would be to generate long random user input strings and to feed those inputs to the application. At the time of execution, check if those randomly generated strings are contained in an argument of a vulnerable function call (such as executeQuery). As such the tool will be able to detect security vulnerabilities with very high probability with minimal overhead.
  • Build an automated tool to generate test drivers for various modules of a program. Suppose, we have written an implementation of a data structure L (say red black tree) and an application program P that uses L. Now we want to unit test the module L. For this we need to write a test driver that tests the code of L. An example of such a test driver for a red black tree implementation can be as follows:
    void testme(){
      int i, x, toss;
      RedBlackTree t = new RedBlackTree();
    
      for(i=0; i<10; i++){
        x = input();
        toss = input();
        if(toss > 0){
          t.insert(x);
        } else if(t.contains(x)) {
          t.remove(x);
        }
      }
    }
    
    These test drivers are usually written manually. Build a tool that can automatically generate such test drivers for L by analyzing the usage of L in P.
Courtesy Prof. David Wagner

  • Build an extension to Java that supports "checked" arithmetic. For instance, whenever you add two int's, if the sum overflows the precision available, it will throw an IntegerOverflowException. (This might be accomplished by a source-to-source or bytecode-to-bytecode transformation, for instance.) How low can you make the performance overhead?
  • Build an extension to Java that adds a convenience type "bigint", which is just like "int" except that it has infinite precision. Put another way, "bigint" is to java.math.BigInteger as "int" is to java.lang.Integer. The "bigint" type should have all the usual operators, including "+", "-", "/", "*", and so on, as well as a promotion that promotes integer constants to "bigint" type when needed, so that "bigint"'s can be conveniently manipulated without any method calls. (This might be accomplished by a source-to-source transformation, for instance.)
  • Extend any JML formal verification tool in a way that allows you to prove that a Java application will never throw any RuntimeException during execution. JML includes tools for formal verification of Java code, which allows the programmer to supply annotations regarding method preconditions and postconditions and object invariants. For instance, if I call a method f() that has a precondition, then the tool will emit a proof obligation requiring me to prove that the precondition holds at the call site to f(). These formal verification tools include ESC/Java2, Jack, LOOP, and others. Extend one of these tools to emit additional proof obligations (which do not need to be specified by the programmer explicitly) to ensure that no statement in the program throws a RuntimeException. This includes proving the absence of NullPointerExceptions, ArrayStoreExceptions, ClassCastExceptions, and so on. This could be useful for critical systems where you want to be sure that the program will not crash at runtime (e.g., embedded systems, safety-critical systems).

RFID Cleaning

Courtesy Shawn Jeffrey

RFID data is notoriously dirty: readings are frequently dropped and multiple nearby readers produce duplicate readings. This project would investigate schemes for spatio-temporal cleaning for RFID data using statistical/probabilistic mechanisms. Previous work (http://www.cs.berkeley.edu/~jeffery/pubs/smurf-vldb06.pdf) introduced a statistical model for RFID data cleaning and explored adaptive temporal smoothing. This project will extend this work to the spatial domain and explore more advanced RFID data processing. Part of the project will be to implement and experiment with different methods for spatio-temporal RFID smoothing over synthetic and real-life data by working closely with RFID technology and extending an existing synthetic data generator for RFID data (open source release possible). This project is a collaboration with grad student Shawn Jeffery, intel research scientist Minos Garofalakis, and Professor Michael Franklin.

Distributed Biomedical Signal Classification in Sensor Networks

Courtesy Roozbeh Jafari and Ruzena Bajcsy

Sensor networks have not been an ideal outfit for high performance computing due to their constrained resources. Limitations in processing power, battery life, communication bandwidth, and memory constrain the applicability of existing complex medical analysis algorithms such as the Electrocardiogram (ECG) analysis. Among various limitations, battery lifetime and memory have been a major key technological constraint. The other problem with in-situ biological sensing is extensive variation in the morphology of bio-signals of different patients and under various conditions. It is not quite clear how classification can be in done in distributed manner in this architecture considering the memory and the battery power constraints. The preliminary study can found at:
Adaptive Electrocardiogram Feature Extraction on Distributed Embedded Systems, Roozbeh Jafari, Hyduke Noshadi, Soheil Ghiasi, Majid Sarrafzadeh, IEEE Transactions on Parallel and Distributed Systems special issue on High Performance Computational Biology (TPDS), vol. 17, no. 8, pp 1-11, August 2006. http://www.eecs.berkeley.edu/~rjafari/publications/journals/tpdssi-0332-0705-corrected-2.pdf
Contact: Roozbeh Jafari (rjafari@eecs.berkeley.edu)

Applications over Declarative Networks

If you missed the lecture on P2, slides are available.
One of the as-yet-unrealized promises of declarative networking is that distributed, data-intensive applications will benefit from the system -- both because of the declarative language, and because of the opportunity to co-design the app and its overlay(s). Even more exciting is the possibility that cross-layer optimizations spanning the app logic and the overlay logic could be found automatically.
Some juicy applications for P2 in this context include:
  • Distributed intrusion detection. Enhance an endpoint intrusion detection system like SNORT to do real-time, peer-to-peer correlation of events across multiple machines. This could be used to produce a visualization "cockpit" of global behavior, and/or could be used to improve the quality of classifying packets as malicious, as in a recent paper at AAAI. Compare with centralized schemes like DSHIELD.
  • Collaborative spam filtering. Build an extension to your favorite mail client (Thunderbird, Opera, etc.) or mail server (e.g. imapd or Cyrus) that uses P2 as an infrastructure to do peer-to-peer spam identification with other people running the extension. You may want to look at the recent SIGCOMM paper on the topic.
  • P2P filesharing monitors. Before P2 was built, Boon Thau Loo implemented a tool based on the PIER system for monitoring the global Gnutella p2p filesharing network, described in a VLDB paper some years back. This kind of tool should be much easier to define and deploy using P2. Implement a similar system using P2, and try to derive interesting measurement results on filesharing today. You may want to choose another filesharing network, e.g. BitTorrent, and/or see if you can focus on a different aspect of the system to measure than Loo did..
  • Onion Routing. One of the more interesting challenges in overlay networks is to make them privacy-preserving. Onion routing is one proposal for that. The TOR project implements this kind of scheme. What can the flexibility of P2 bring to TOR?

P2 Language Issues

  • Program checking for P2. A number of results are known for checking the "safety" (finite execution) of Datalog rules (references on request). How do these tests change in the face of P2's soft state semantics? Given that P2 allows function invocation outside of pure datalog, how can safety and other desirable properties be ensured -- is there an extensibility API for such functions that helps? If safety cannot be guaranteed at compile-time, can programs be automatically augmented with minimalist runtime checks?
  • Temporal logic extensions to P2. In P2's simple logic language, there is no way to express composite events that capture the order of actions. Such temporal extensions can be quite useful in network security, as demonstrated by our new faculty member Koushik Sen in his FORTE 04 paper. Can such language features be comfortable added to P2? Is a distributed implementation of such intrusion detection feasible? Another important point of reference is Lamport's TLA language.

Distributed AI techniques

  • Distributed Inference. I have a partial implementation of Paskin and Guestrin's Distributed Junction Tree algorithm expressed in P2. I have also added support for vectors and 2-d matrices in P2 -- enough to represent Gaussian distributions. The goal of this project would be to push that effort to completion, and then more deeply study the overlay networking issues relevant to distributed junction trees. As a possible expansion of the project, an approximate distributed inference algorithm, e.g. Wainright's tree reweighted loopy approach, could also be implemented and compared in a distributed context. See also the AAAI 06 paper mentioned in the discussion of intrusion detection above.
  • Distributed Network Anomaly Detection via Triggers. Recent local work spearheaded by Ling Huang has focused on detecting network traffic anomalies via approximating the Principle Components of a traffic matrix; this is described in recent papers in MineNet and NIPS. Bringing this technique into P2 could have a number of research ramifications.
  • Distributed End-Host Clustering. As part of a possible distributed intrusion detection system, design a distributed clustering algorithm to automatically group end-hosts in a P2P by various features (traffic patterns, software versions, firewall traces, etc.) Be able to continuously track the clusters over time, to enable end-hosts to determine when they drift from their usual cluster, or when their entire cluster experiences changes.

Web Infrastructure and Parallel Query Processing

  • Parallel Dataflow with Overlays. Partitioned-parallel dataflow execution is a good way to scale up cluster-based data processing. It has been used successfully in databases since the 1980's (see the DeWitt/Gray paper on "Parallel Database Systems" in the red book, the Graefe "Encapsulation" paper in the book, as well as the work on FLuX [1, 2]). Recently this idea has proven very important in web search engines; Google calls it MapReduce, and it underlies much of their batch processing. Yahoo! is using an open-souce implementation called hadoop. Traditionally, parallel dataflow assumes a very reliable set of "fate-shared" machines. As clusters scale out, they are beginning to look more and more like distributed systems. Investigate the use of overlay technologies in a high-performance parallel dataflow environment for clusters.

Networking

  • Declarative Protocol Implementation. P2 was design as a system for constructing overlay networks; however, one can certainly imagine using it to implement the core functionality of Internet routing. Consider a declarative implementation of Internet protocols like BGP, or even TCP/IP. Can these be made efficient in P2? Does the declarative perspective help to prove properties about correctness? Does the flexibility of the language and the dataflow system enable new optimizations to be defined and exploited? Some related work here includes Click, XORP, and logics for routing.
  • Ephemeral, Customized Overlay Networks. P2 allows overlay networks to be concisely specified, and constructed on the fly. This raises the possibility of customized, "ephemeral" overlays that are set up for a specific task (e.g. disseminating a command with a randomized gossip algorithm, constructing a multicast/aggregation tree with controlled fanout/in, producing a random-neighbor DHT for secure routing, etc.) This project would investigate the feasibility and motivation for such networks: how quickly can overlays be brought up and torn down, and can you demonstrate the benefit of a specific customized overlay for a particular task?

Sensor Networks

  • Source-routed tours in sensornets. The BBQ system proposed using source-routed "tours" of network nodes for data collection. More recently, we developed protocols to optimize such tours and tolerate failures (see our IPSN paper for details.) These protocols have not been tested in the field, and many of the recovery issues are open for future research. This project would pursue these ideas, possibly taking advantage of the Declarative Sensor Nets implementation (think of P2 on motes) done in last year's edition of CS262.
  • Sketch-based approximate aggregates. Evaluate the practical use of approximation techniques like distributed sketches in a real sensornet query implementation. Consider the routing opportunities of duplicate-insensitive sketches as well (e.g. Tributaries and Deltas.)
  • Mixing push and pull in probabilistic querying. BBQ proposed a "pull-oriented" paradigm for probabilistically querying sensors. A CS262 project in 2004 proposed a "push-based" paradigm called Ken. These two approaches are different; it is unclear how to integrate them intelligently, and what third approaches might be sensible.

Security and Privacy in Distributed Query Processing

  • Proof Sketches. We have developed a new technique called Proof Sketches to prove that the result to a distributed aggregate query (e.g. a vote, an average computation, etc.) is within epsilon of the correct result, despite tampering by adversarial aggregators. The idea suggests a number of interesting directions worth exploring.
  • Privacy and Verifiability in Distributed Querying. Privacy-preserving multi-party querying has been a hot topic in recent years (see Rakesh Agrawal's various papers for overviews -- this is a good starting point). This work has typically not considered wide-area environments with multi-hop collaborative execution models, in which not only is the privacy at issue, but also the veracity of the answer. The rather open-ended idea here is to revisit one of the key themes from this line of work in a distributed context, and consider the new issues that arise in verifying the correctness of the query answers while preserving privacy.

Miscellany

  • The Meta-DBMS. In some web service settings, a large volume of data can arise due to the accumulation of many, many small databases. What is the best design for providing SQL-based access to a large number of users in such an environment? Database Management Systems are typically not designed for such workloads. Adapt an open-source database engine like PostgreSQL or MySQL to perform well in such an environment, or demonstrate that an alternate scheme, possibly based on flat files loaded "just in time" into a DBMS, can work well.
  • The Ultimate Test Harness. Typical Quality Assurance (QA) tools for software development include a "test harness" to allow the definition of nightly tests of the software. However, such tests are often fairly restricted in scope, and do not try out very many possible combinations of system inputs (settings, workloads, etc.) The goal of this project is to design a massively parallel testing facility that would harness a large cluster to generate enormous numbers of tests, autogenerating the tests from high-level descriptions.