|
|
|
You are encouraged to design your own project. To get you
thinking, here are some ideas; not surprisingly, they tend to
focus on ongoing projects related to work here at Berkeley. However, you are
by no means required to work on things we're interested in!
Mashups for the Masses
Courtesy Minos Garofalakis and Rob Ennals, Intel Research
Normal people have lots of data that they would like to explore in
an easy, intuitive manner, posing queries like: What is the weather
like in these 10 holiday destinations? Which of my friends have I not
emailed in over a month? Which of these recipes have incredients I
can buy in Safeway? Typically, the bulk of this data has little or no
structure and not stored database systems; so, while it is possible
to extract the relevant information, it is definitely not "easy".
"Mashups for the Masses" is a project that attempts to make it easy for
unskilled, non-DB users to: (a) Create loosely structured databases of
information (think Notepad or Excel, not Oracle). (b) Use dashboard-like
"query widgets" to generate extra information, based on the information
they already have and information available on the web. (c) Share data
and queries with friends and with the rest of the world. (d) Browse
through data without having to think about the underlying queries and
data-source specifics.
Specific sub-projects of interest (with both conceptual and implementation
components) include:
- Recognizing Data and Structure: Given loosely structured data, how do we
identify interesting structure and information that can be extracted from
it. For instance, one possibility is to work out which query widgets might
be able to do something useful with it, based on whether other users have
applied similar query widgets to similar data in the past.
-
Interesting Query Widget Construction: Query widgets can get data in a
variety of ways, including from web services and web sites. In addition,
query widgets should be easy to publish and compose in order to accomplish
"interesting" things; for example, the system could learn or suggest
potentially interesting query widgets to users based on the actions of
their friends/collaborators. A possible implementation task would be to
build an API/toolkit that allows easy construction and composition of
specific query widget types (e.g., ones that scrape data off a web site).
-
Mobile Data: Write a PDA- or smartphone-based browser for personal mashup
data. This should allow one to access and edit the same data that is
accessible on the PC app. New issues may arrise about synchronisation of
data, and making use of limited storage, bandwidth, and screen space.
Related references include work on the Semex project
(http://data.cs.washington.edu/semex/semex.html)
the MIT work on goal-oriented user interfaces
(http://agents.media.mit.edu/projects/semanticsearch/)
and the very recent efforts on personal dataspaces
(http://www.cs.berkeley.edu/~franklin/Papers/dataspaceSR.pdf).
Parallel Computing
Courtesy Jim Demmel and the BeBOP group
-
A previous CS262 project by Rajesh Nishtala involved tuning some collective communications in MPI, and obtained excellent performance improvements. This work didn't include many commonly-used MPI collective communication calls; one project is to extend the work to implement and explore optimization strategies for the other collective operations.
-
Parallel computing platforms like grid typically have high network latencies (compared to their computational power). Latency avoiding algorithms try to overcome this by reducing the number of remote fetches at the cost of possibly more computation. Furthermore, jitter (uncertainty in the communication time) can also affect the performance adversely. It would be interesting to compare and contrast a latency avoiding implementation (e.g., Cholesky factorization algorithm in the LAPACK) with a non latency avoiding implementation (e.g., LU factorization) in the above context (e.g., on a grid). Noise can be injected in the implementation to study how much the two implementations are sensitive to jitter.
SW Engineering and Security
Courtesy Prof. Koushik Sen
Courtesy Prof. David Wagner
- Build an extension to Java that supports "checked" arithmetic.
For instance, whenever you add two int's, if the sum overflows
the precision available, it will throw an IntegerOverflowException.
(This might be accomplished by a source-to-source or
bytecode-to-bytecode
transformation, for instance.) How low can you make the performance
overhead?
- Build an extension to Java that adds a convenience type "bigint",
which is just like "int" except that it has infinite precision.
Put another way, "bigint" is to java.math.BigInteger as "int" is
to java.lang.Integer. The "bigint" type should have all the usual
operators, including "+", "-", "/", "*", and so on, as well as a
promotion that promotes integer constants to "bigint" type when
needed, so that "bigint"'s can be conveniently manipulated without
any method calls. (This might be accomplished by a source-to-source
transformation, for instance.)
- Extend any JML formal verification tool in a way that allows you to
prove that a Java application will never throw any RuntimeException
during execution. JML includes tools for formal verification of
Java
code, which allows the programmer to supply annotations regarding
method
preconditions and postconditions and object invariants. For
instance,
if I call a method f() that has a precondition, then the tool will
emit a proof obligation requiring me to prove that the precondition
holds at the call site to f(). These formal verification tools
include
ESC/Java2, Jack, LOOP, and others. Extend one of these tools to
emit
additional proof obligations (which do not need to be specified by
the programmer explicitly) to ensure that no statement in the
program
throws a RuntimeException. This includes proving the absence of
NullPointerExceptions, ArrayStoreExceptions, ClassCastExceptions,
and so on. This could be useful for critical systems where you want
to be sure that the program will not crash at runtime (e.g.,
embedded
systems, safety-critical systems).
RFID Cleaning
Courtesy Shawn Jeffrey
RFID data is notoriously dirty: readings are frequently dropped and
multiple nearby readers produce duplicate readings. This project
would investigate schemes for spatio-temporal cleaning for RFID data
using statistical/probabilistic mechanisms. Previous work
(http://www.cs.berkeley.edu/~jeffery/pubs/smurf-vldb06.pdf) introduced
a statistical model for RFID data cleaning and explored adaptive
temporal smoothing. This project will extend this work to the spatial
domain and explore more advanced RFID data processing. Part of the
project will be to implement and experiment with different methods for
spatio-temporal RFID smoothing over synthetic and real-life data by
working closely with RFID technology and extending an existing
synthetic data generator for RFID data (open source release possible).
This project is a
collaboration with grad student Shawn Jeffery, intel research
scientist Minos Garofalakis, and Professor Michael Franklin.
Distributed Biomedical Signal Classification in Sensor Networks
Courtesy Roozbeh Jafari and Ruzena Bajcsy
Sensor networks have not been an ideal outfit for high performance computing due to their constrained resources. Limitations in processing power, battery life, communication bandwidth, and memory constrain the applicability of existing complex medical analysis algorithms such as the Electrocardiogram (ECG) analysis. Among various limitations, battery lifetime and memory have been a major key technological constraint. The other problem with in-situ biological sensing is extensive variation in the morphology of bio-signals of different patients and under various conditions. It is not quite clear how classification can be in done in distributed manner in this architecture considering the memory and the battery power constraints. The preliminary study can found at:
Adaptive Electrocardiogram Feature Extraction on Distributed Embedded Systems,
Roozbeh Jafari, Hyduke Noshadi, Soheil Ghiasi, Majid Sarrafzadeh,
IEEE Transactions on Parallel and Distributed Systems special issue on High Performance Computational Biology (TPDS), vol. 17, no. 8, pp 1-11, August 2006.
http://www.eecs.berkeley.edu/~rjafari/publications/journals/tpdssi-0332-0705-corrected-2.pdf
Contact: Roozbeh Jafari (rjafari@eecs.berkeley.edu)
Applications over Declarative Networks
If you missed the lecture on P2, slides are available.
One of the as-yet-unrealized promises of declarative networking is that distributed,
data-intensive applications will benefit from the system --
both because of the declarative language, and because of the
opportunity to co-design the app and its overlay(s). Even more
exciting is the possibility that cross-layer optimizations spanning the
app logic and the overlay logic could be found automatically.
Some juicy applications for P2 in this context include:
- Distributed intrusion detection. Enhance an endpoint
intrusion detection system like SNORT to do
real-time, peer-to-peer correlation of events across multiple
machines. This could be used to produce a visualization "cockpit" of
global behavior, and/or could be used to improve the quality of
classifying packets as malicious, as in a recent
paper at AAAI. Compare with centralized schemes like DSHIELD.
- Collaborative spam filtering. Build an extension to your favorite
mail client (Thunderbird, Opera, etc.) or mail server (e.g. imapd or Cyrus) that uses P2 as an
infrastructure to do peer-to-peer spam identification with other
people running the extension. You may want to look at the recent
SIGCOMM paper on the topic.
- P2P filesharing monitors. Before P2 was built, Boon Thau
Loo implemented a tool based on the PIER system for monitoring the
global Gnutella p2p filesharing network,
described in a VLDB
paper some years back. This kind of tool should be much easier
to define and deploy using P2. Implement a similar system using P2, and try to
derive interesting measurement results on filesharing today. You may want to
choose another filesharing network, e.g. BitTorrent, and/or see if you
can focus on a different aspect of the system to measure than Loo did..
- Onion Routing. One of the more interesting challenges in overlay
networks is to make them privacy-preserving. Onion routing is one
proposal for that. The TOR project
implements this kind of scheme. What can the flexibility of P2 bring
to TOR?
P2 Language Issues
- Program checking for P2. A number of results are known for
checking the "safety" (finite execution) of Datalog rules (references
on request). How do these tests change in the face of P2's soft state
semantics? Given that P2 allows function invocation outside of pure
datalog, how can safety and other desirable properties be ensured --
is there an extensibility API for such functions that helps? If
safety cannot be guaranteed at compile-time, can programs be automatically
augmented with minimalist runtime checks?
- Temporal logic extensions to P2.
In P2's simple logic language, there is no way to express composite
events that capture the order of actions. Such temporal extensions
can be quite useful in network security, as demonstrated by our new
faculty member Koushik Sen in his FORTE
04 paper. Can such language features be comfortable added to P2?
Is a distributed implementation of such intrusion detection feasible?
Another important point of reference is Lamport's TLA
language.
Distributed AI techniques
- Distributed Inference. I have a partial implementation of
Paskin and Guestrin's
Distributed
Junction Tree algorithm expressed in P2. I have also added
support for vectors and 2-d matrices in P2 -- enough to represent Gaussian
distributions. The goal of this project would be to push that
effort to completion, and then more deeply study the overlay
networking issues relevant to distributed junction trees. As a
possible expansion of the project, an approximate distributed
inference algorithm, e.g. Wainright's tree
reweighted loopy approach, could also be implemented and compared
in a distributed context. See also the AAAI 06
paper mentioned in the discussion of intrusion detection above.
- Distributed Network Anomaly Detection via Triggers.
Recent local work spearheaded by Ling Huang has focused on
detecting network traffic anomalies via approximating the Principle
Components of a traffic matrix; this is described in recent papers in
MineNet
and NIPS.
Bringing this technique into P2 could have a number of research
ramifications.
- Distributed End-Host Clustering. As part of a possible
distributed intrusion detection system, design a distributed
clustering algorithm to
automatically group end-hosts in a P2P by various features (traffic
patterns, software versions, firewall traces, etc.) Be able to
continuously track the clusters over time, to enable end-hosts to
determine when they drift from their usual cluster, or when their
entire cluster experiences changes.
Web Infrastructure and Parallel Query Processing
- Parallel Dataflow with Overlays. Partitioned-parallel
dataflow execution is a good way to scale up cluster-based data
processing. It has been used successfully in databases since the
1980's (see the DeWitt/Gray paper on "Parallel Database Systems" in
the red book, the Graefe "Encapsulation" paper in the book, as well as the work
on FLuX
[1,
2]).
Recently this idea has proven very important in web search engines;
Google calls
it MapReduce,
and it underlies much of their batch processing. Yahoo! is using an
open-souce implementation called hadoop. Traditionally,
parallel dataflow assumes a very reliable set of "fate-shared"
machines. As clusters scale out, they are beginning to look more and
more like distributed systems. Investigate the use of overlay
technologies in a high-performance parallel dataflow environment for
clusters.
Networking
- Declarative Protocol Implementation. P2 was design as a
system for constructing overlay networks; however, one can certainly
imagine using it to implement the core functionality of Internet
routing. Consider a declarative implementation of Internet protocols
like BGP, or even TCP/IP. Can these be made efficient in P2? Does
the declarative perspective help to prove properties about
correctness? Does the flexibility of the language and the dataflow
system enable new optimizations to be defined and exploited? Some
related work here
includes Click, XORP,
and logics
for routing.
- Ephemeral, Customized Overlay Networks. P2 allows overlay
networks to be concisely specified, and constructed on the fly. This
raises the possibility of customized, "ephemeral" overlays that are
set up for a specific task (e.g. disseminating a command with a
randomized gossip algorithm, constructing a multicast/aggregation tree
with controlled fanout/in, producing a random-neighbor DHT for
secure routing, etc.) This project would investigate the
feasibility and motivation for such networks: how quickly can
overlays be brought up and torn down, and can you demonstrate the
benefit of a specific customized overlay for a particular task?
Sensor Networks
- Source-routed tours in
sensornets. The BBQ
system proposed using source-routed "tours" of network nodes for data
collection. More recently, we developed protocols to optimize such
tours and tolerate failures (see our IPSN
paper for details.) These protocols have not been tested in the
field, and many of the recovery issues are open for future research.
This project would pursue these ideas, possibly taking advantage of
the Declarative Sensor Nets implementation (think of P2 on motes) done
in last year's edition of CS262.
- Sketch-based approximate aggregates. Evaluate the practical
use of approximation techniques
like distributed
sketches in a real sensornet query implementation. Consider the
routing opportunities of duplicate-insensitive sketches as well
(e.g. Tributaries
and Deltas.)
- Mixing push and pull in probabilistic
querying. BBQ
proposed a "pull-oriented" paradigm for probabilistically querying
sensors. A CS262 project in 2004 proposed a "push-based" paradigm
called Ken. These two
approaches are different; it is unclear how to integrate them
intelligently, and what third approaches might be sensible.
Security and Privacy in Distributed Query Processing
- Proof Sketches. We have developed a
new technique called Proof
Sketches to prove that the result to a distributed aggregate query
(e.g. a vote, an average computation, etc.) is within epsilon of the
correct result, despite tampering by adversarial aggregators. The
idea suggests a number of interesting directions worth exploring.
- Privacy and Verifiability in Distributed Querying.
Privacy-preserving multi-party querying has been a hot topic in recent
years (see Rakesh Agrawal's various papers for overviews -- this is a
good starting point). This work has typically not considered
wide-area environments with multi-hop collaborative execution models,
in which not only is the privacy at issue, but also the veracity of
the answer. The rather open-ended idea here is to revisit one of the
key themes from this line of work in a distributed context, and
consider the new issues that arise in verifying the correctness of the
query answers while preserving privacy.
Miscellany
- The Meta-DBMS. In some web service settings, a large volume of data
can arise due to the accumulation of many, many small databases. What
is the best design for providing SQL-based access to a large number of
users in such an
environment? Database Management Systems are typically not designed
for such workloads. Adapt an open-source database engine like PostgreSQL or
MySQL to perform well in such an environment, or demonstrate that an
alternate scheme, possibly based on flat files loaded "just in time"
into a DBMS, can work well.
- The Ultimate Test Harness. Typical Quality Assurance (QA)
tools for software development include a "test harness" to allow the
definition of nightly tests of the software.
However, such tests are often fairly restricted in scope, and do not try out
very many possible combinations of system inputs (settings, workloads,
etc.) The goal of this project is to design a massively parallel testing
facility that would harness a large cluster to generate enormous
numbers of tests, autogenerating the tests from high-level descriptions.
|
|