You are encouraged to design your own project. To get you thinking, here are some ideas; not surprisingly, they tend to focus on ongoing projects related to work here at Berkeley. However, you are by no means required to work on things we're interested in: The goal is to find something that excites you! ------------------------------------------------------------------- Mashups for the Masses: Easy collaborative querying of distributed data by non-experts -------------------------------------------------------------------- Outside Contact: Robert Ennals, robert.ennals@intel.com (Intel Research Berkeley) -------------------------------------------------------------------- The Mashups for the Masses project is investigating ways that we can make it easy for non-expert users to explore and query live data from distributed data sources. We have built a prototype tool, MashMaker, that allows users to easily create their own mashups from multiple websites. Like a web browser, MashMaker allows users to find interesting queries by "wandering around their data", without having to plan in advance what information they are interested in. Mashmaker is written in Java, and uses the Google Web Toolkit (GWT) to present an AJAX web interface to users. There are many potential projects that a student could undertake under the Mashups for the Masses umbrella. In particular: 1) Suggesting query widgets: We would like the tool to automatically suggest "query widgets" that the user can apply to the data they are looking at. You could design and implement an infrastructure for doing this. 2) Mobile data: We would like to have a mobile version of MashMaker that can be used on a phone, taking advantage of the features and limitations of this platform. You could design and implement such a version. 3) Collaboration: How should users collaboratatively manipulate and query a shared set of data? How should they share interesting queries that others could apply to similar data? All of the above sub-projects will include both an intellectual component (e.g., exploring relevant literature and proposing a novel design or solution) as well as an implementation component that will build on our current MashMaker codebase. A demo of the MashMaker system has been accepted to SIGMOD'2007, and student(s) contributing to this project will be appearing as co-authors on the SIGMOD demo paper. Relevant references include: MashMaker: Mashups for the Masses (SIGMOD Demo, 2007 - preliminary version at: http://berkeley.intel-research.net/rennals/pubs/mashupdemo.pdf ) - for an overview of the project Pipes: http://pipes.yahoo.com - A similar project being undertaken by Yahoo Swivel: http://www.swivel.com - A mashup creator targeted specifically at tabular numerical data The Semex Project: http://data.cs.washington.edu/semex/semex.html) - A database of links between personal data DataSpaces: http://www.cs.berkeley.edu/~franklin/Papers/dataspaceSR.pdf - A very relevant data-integration project Goal-oriented user interfaces: http://agents.media.mit.edu/projects/semanticsearch/ - An interesting approach that adds extra information to web content Subtext: Uncovering the Simplicity of Programming OOPSLA 2005 and http://subtextual.org - A programming environment that bears similarities to MashMaker A user-centered approach to functions in excel ICFP 2003 - Ways for normal users to do complex tasks in Excel. -------------------------------------------------------------------- Approximate Query Processing in Sensornets -------------------------------------------------------------------- ** Sketch-based approximate aggregation.** Evaluate the practical use of approximation techniques like distributed sketches (e.g., http://www.cs.berkeley.edu/~minos/Papers/debull05.pdf) in a real sensornet query implementation. Consider the routing opportunities of duplicate-insensitive sketches as well (e.g., as in the "tributaries and deltas" paper: http://portal.acm.org/citation.cfm?id=1066157.1066191 ). ** Mixing push and pull in approximate sensornet querying. ** BBQ (http://www.cs.cmu.edu/~guestrin/Publications/BBQJournal05/vldbj-2005.pdf) proposed a "pull-oriented" paradigm using a probabilistic model (resident at the base station) for approximately querying sensors. More recent work has suggested the idea of models that are "shared" (across the querying and sensing sites), thus enabling a "push-based" paradigm (e.g., see http://www.cs.berkeley.edu/~davidchu/papers/ChuDeshpandeHellersteinHong-ApproximateDataCollection-IEEE_ICDE-2006.pdf and http://www.cs.berkeley.edu/~minos/Papers/vdb05.pdf) These two approaches are different; it is unclear how to integrate them intelligently, and what third approaches might be sensible. -------------------------------------------------------------------- Security and Privacy in Distributed Query Processing -------------------------------------------------------------------- ** Proof Sketches.** We have developed a new technique called Proof Sketches (see http://www.cs.berkeley.edu/~minos/Papers/icde07amfm.pdf) to prove that the result to a distributed aggregate query (e.g., a vote, an average computation, etc.) is within epsilon of the correct result, despite potential malicious tampering by adversarial aggregators. The idea suggests a number of interesting directions worth exploring in the area of verifiable approximate query answering. For instance, would it be possible to use similar ideas for verifying the results of more complex queries (e.g., distributed joins)? Other relevant references: - http://www.ece.cmu.edu/~dawnsong/papers/sia-hierachical.pdf - http://www.ece.cmu.edu/~dawnsong/papers/sia.pdf ** Privacy and Verifiability in Distributed Querying.** Privacy-preserving multi-party querying has been a hot topic in recent years (see Rakesh Agrawal's various papers for overviews -- this is a good starting point). This work has typically not considered wide-area environments with multi-hop collaborative execution models, in which not only is the privacy at issue, but also the veracity of the answer. The rather open-ended idea here is to revisit one of the key themes from this line of work in a distributed context, and consider the new issues that arise in verifying the correctness of the query answers while preserving privacy. -------------------------------------------------------------------- Distributed Triggers -------------------------------------------------------------------- Outside Contacts: Ling Huang (hling@cs.berkeley.edu) Nina Taft (nina.taft@intel.com) -------------------------------------------------------------------- Recent work has demonstrated the effectiveness of approximately tracking a declarative trigger condition over physically-distributed data collections using in-network processing/filtering of local information. For instance, approximately tracking the Principal Components of a global traffic matrix allows real-time detection of network-wide anomalies. See the twiki set up at the RadLab (http://radlab.cs.berkeley.edu/wiki/Dtrigger) for more information and relevant papers. The rather open-ended problem here is to design communication-efficient protocols for the approximate tracking of more complex types of queries. See the distributed streaming tutorial at http://www.cs.berkeley.edu/~minos/Talks/vldb06tut-slides.ppt for more relevant references. A possible non-trivial implementation task here would involve the implementation of distributed triggering tools on top of the CoMo monitoring platform developed within Intel Research (see http://como.sourceforge.net/). -------------------------------------------------------------------- RFID Data Cleaning -------------------------------------------------------------------- Outside Contact: Shawn Jeffery (jeffery@cs.berkeley.edu) -------------------------------------------------------------------- RFID data is notoriously dirty: readings are frequently dropped and multiple nearby readers produce duplicate readings. This project would investigate schemes for spatio-temporal cleaning for RFID data using statistical/probabilistic mechanisms. In a recent paper (http://www.cs.berkeley.edu/~jeffery/pubs/smurf-vldb06.pdf) we introduced a statistical sampling-based model for RFID data cleaning and explored adaptive temporal smoothing. This project will extend this work to the spatial domain (e.g., arbitrating spurious readings across multiple RFID readers) and explore more advanced RFID data processing. Some initial ideas for the extension to spatial RFID data cleaning are discussed in the journal version of the paper (ask Minos for a copy, if interested), but a lot more work needs to be done. Part of the project will be to implement and experiment with different methods for spatio-temporal RFID smoothing over synthetic and real-life data by working closely with RFID technology and extending an existing synthetic data generator for RFID data (open source release possible). This project will be in collaboration with CS grad student Shawn Jeffery.