You are encouraged to design your own project. To get you thinking, here 
are some ideas; not surprisingly, they tend to focus on ongoing projects 
related to work here at Berkeley. However, you are by no means required 
to work on things we're interested in: The goal is to find something that
excites you!


-------------------------------------------------------------------
Mashups for the Masses: Easy collaborative querying of distributed
data by non-experts
--------------------------------------------------------------------
Outside Contact:
 Robert Ennals, robert.ennals@intel.com (Intel Research Berkeley)
--------------------------------------------------------------------

The Mashups for the Masses project is investigating ways that we can
make it easy for non-expert users to explore and query live data from
distributed data sources. We have built a prototype tool, MashMaker,
that allows users to easily create their own mashups from multiple
websites. Like a web browser, MashMaker allows users to find interesting
queries by "wandering around their data", without having to plan in
advance what information they are interested in. Mashmaker is written in
Java, and uses the Google Web Toolkit (GWT) to present an AJAX web
interface to users.

There are many potential projects that a student could undertake under
the Mashups for the Masses umbrella. In particular:

1) Suggesting query widgets: We would like the tool to automatically
suggest "query widgets" that the user can apply to the data they are
looking at. You could design and implement an infrastructure for doing
this.

2) Mobile data: We would like to have a mobile version of MashMaker
that can be used on a phone, taking advantage of the features and
limitations of this platform. You could design and implement such a
version.

3) Collaboration: How should users collaboratatively manipulate and
query a shared set of data? How should they share interesting queries
that others could apply to similar data?

All of the above sub-projects will include both an intellectual 
component (e.g., exploring relevant literature and proposing a novel
design or solution) as well as an implementation component that will
build on our current MashMaker codebase. 

A demo of the MashMaker system has been accepted to SIGMOD'2007, and
student(s) contributing to this project will be appearing as co-authors
on the SIGMOD demo paper.


Relevant references include:

MashMaker: Mashups for the Masses
       (SIGMOD Demo, 2007 - preliminary version at:
http://berkeley.intel-research.net/rennals/pubs/mashupdemo.pdf )
       - for an overview of the project
Pipes:
       http://pipes.yahoo.com
       - A similar project being undertaken by Yahoo
Swivel:
       http://www.swivel.com
       - A mashup creator targeted specifically at tabular numerical data
The Semex Project:
       http://data.cs.washington.edu/semex/semex.html)
       - A database of links between personal data
DataSpaces:
       http://www.cs.berkeley.edu/~franklin/Papers/dataspaceSR.pdf
       - A very relevant data-integration project
Goal-oriented user interfaces:
       http://agents.media.mit.edu/projects/semanticsearch/
       - An interesting approach that adds extra information to web content
Subtext: Uncovering the Simplicity of Programming
       OOPSLA 2005  and  http://subtextual.org
       - A programming environment that bears similarities to MashMaker
A user-centered approach to functions in excel
       ICFP 2003
       - Ways for normal users to do complex tasks in Excel.






--------------------------------------------------------------------
Approximate Query Processing in Sensornets
--------------------------------------------------------------------
** Sketch-based approximate aggregation.**
Evaluate the practical use of approximation techniques like 
distributed sketches (e.g., http://www.cs.berkeley.edu/~minos/Papers/debull05.pdf)
in a real sensornet query implementation. Consider the routing opportunities of 
duplicate-insensitive sketches as well (e.g., as in the "tributaries and deltas"
paper: http://portal.acm.org/citation.cfm?id=1066157.1066191 ).

** Mixing push and pull in approximate sensornet querying. ** 
BBQ
(http://www.cs.cmu.edu/~guestrin/Publications/BBQJournal05/vldbj-2005.pdf)
proposed a "pull-oriented" paradigm using a probabilistic model (resident at 
the base station) for approximately querying sensors. More recent work has suggested
the idea of models that are "shared" (across the querying and sensing sites), thus
enabling a "push-based" paradigm (e.g., see
http://www.cs.berkeley.edu/~davidchu/papers/ChuDeshpandeHellersteinHong-ApproximateDataCollection-IEEE_ICDE-2006.pdf
and
http://www.cs.berkeley.edu/~minos/Papers/vdb05.pdf)
These two approaches are different; it is unclear how to integrate them 
intelligently, and what third approaches might be sensible. 



--------------------------------------------------------------------
Security and Privacy in Distributed Query Processing
--------------------------------------------------------------------
** Proof Sketches.**
We have developed a new technique called Proof Sketches
(see 
http://www.cs.berkeley.edu/~minos/Papers/icde07amfm.pdf)
to prove that the result to a distributed aggregate query (e.g., a vote, an 
average computation, etc.) is within epsilon of the correct result, despite 
potential malicious tampering by adversarial aggregators. The idea suggests 
a number of interesting directions worth exploring in the area of verifiable
approximate query answering. For instance, would it be possible to use similar
ideas for verifying the results of more complex queries (e.g., distributed
joins)?

Other relevant references:
- http://www.ece.cmu.edu/~dawnsong/papers/sia-hierachical.pdf
- http://www.ece.cmu.edu/~dawnsong/papers/sia.pdf


** Privacy and Verifiability in Distributed Querying.** 
Privacy-preserving multi-party querying has been a hot topic in recent years 
(see Rakesh Agrawal's various papers for overviews -- this is a good starting
point). This work has typically not considered wide-area environments with 
multi-hop collaborative execution models, in which not only is the privacy 
at issue, but also the veracity of the answer. The rather open-ended idea here 
is to revisit one of the key themes from this line of work in a distributed 
context, and consider the new issues that arise in verifying the correctness 
of the query answers while preserving privacy.



--------------------------------------------------------------------
Distributed Triggers
--------------------------------------------------------------------
Outside Contacts: Ling Huang (hling@cs.berkeley.edu)
                  Nina Taft  (nina.taft@intel.com)
--------------------------------------------------------------------
Recent work has demonstrated the effectiveness of approximately tracking
a declarative trigger condition over physically-distributed data collections
using in-network processing/filtering of local information.  
For instance, approximately tracking the Principal Components of a global
traffic matrix allows real-time detection of network-wide anomalies.
See the 
twiki set up at the RadLab
(http://radlab.cs.berkeley.edu/wiki/Dtrigger) for more information and 
relevant papers.  The rather open-ended problem here is to design 
communication-efficient protocols for the approximate tracking of more
complex types of queries. See the distributed streaming tutorial at
http://www.cs.berkeley.edu/~minos/Talks/vldb06tut-slides.ppt
for more relevant references. A possible non-trivial implementation 
task here would involve the implementation of distributed triggering
tools on top of the CoMo monitoring platform developed within Intel
Research (see http://como.sourceforge.net/).





--------------------------------------------------------------------
RFID Data Cleaning
--------------------------------------------------------------------
Outside Contact: Shawn Jeffery (jeffery@cs.berkeley.edu)
--------------------------------------------------------------------
RFID data is notoriously dirty: readings are frequently dropped and 
multiple nearby readers produce duplicate readings. This project would 
investigate schemes for spatio-temporal cleaning for RFID data using 
statistical/probabilistic mechanisms. In a recent paper
(http://www.cs.berkeley.edu/~jeffery/pubs/smurf-vldb06.pdf) we 
introduced a statistical sampling-based model for RFID data cleaning 
and explored adaptive temporal smoothing. This project will extend 
this work to the spatial domain (e.g., arbitrating spurious readings 
across multiple RFID readers) and explore more advanced RFID data 
processing. Some initial ideas for the extension to spatial RFID data
cleaning are discussed in the journal version of the paper (ask Minos 
for a copy, if interested), but a lot more work needs to be done.

Part of the project will be to implement and experiment with different 
methods for spatio-temporal RFID smoothing over synthetic and real-life 
data by working closely with RFID technology and extending an existing 
synthetic data generator for RFID data (open source release possible). 
This project will be in collaboration with CS grad student Shawn Jeffery.