All talks are held in 606 Soda Hall from 1:00 to 2:30pm unless otherwise specified. On this site, you will find upcoming talk announcements and abstracts from previous talks in Berkeley's database seminar. You can also receive talk announcements on our mailing list. If you are interested, please see the instructions for adding yourself to the mailing list.
» Fall 2007 Calendar
09/07 |
On Privacy of Users in Web Search Query Logs |
More Info |
» Abstract
There is great appetite to study query logs as a rich window into
human intent, but as the AOL incident shows, the privacy concerns are
broad and well-founded. It is important to anonymize the query logs
before attempting any public release.
We start by studying the privacy preservation properties of a specific
technique for query log anonymization: token-based hashing, where each
query is tokenized, and then a secure hash function is applied to each
token. We show that statistical techniques may be applied to partially
compromise the anonymization, and sensitive information about users can
be revealed from the reconstructed query log. We then investigate the
risk of revealing user identity from queries in two different scenarios.
In the first setting, we study the application of simple classifiers to
map a sequence of queries into a gender, age, and location of the user
issuing the queries; based on this information, we examine how we can
map the queries into a set of candidate users and how to identify a
known user from a large query log. In the second setting, we examine
an anonymization approach to "bundle" logs of multiple users together.
We investigate the risk of recovering queries from a given user by first
locating the bundle for that user via vanity search, followed by an
analysis of the structural and analytical vulnerabilities inside the
bundles.
(Joint work with Rosie Jones, Ravi Kumar, Jasmine Novak, and Andrew Tomkins.)
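To make the token-based hashing scheme being analyzed concrete, here is a minimal sketch of that anonymization step; the salt and truncation length are illustrative choices, not taken from the work described above.

```python
# Hedged sketch of token-based hashing for query-log anonymization.
# The salt and truncation length are illustrative, not the paper's choices.
import hashlib

def anonymize_query(query, salt=b"illustrative-salt"):
    """Tokenize a query and replace each token with a keyed secure hash."""
    hashed = []
    for token in query.lower().split():
        digest = hashlib.sha256(salt + token.encode("utf-8")).hexdigest()
        hashed.append(digest[:12])  # truncated only for readability
    return " ".join(hashed)

# The same token always maps to the same hash across the log, so token
# frequency and co-occurrence statistics survive hashing, which is exactly
# what the statistical attack described above exploits to invert the mapping.
print(anonymize_query("cheap flights to berkeley"))
```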
» Abstract
Data synopses are an essential ingredient of methods for fast approximate analytical processing, interactive data exploration, auditing, and automated metadata discovery. We consider the problem of maintaining a warehouse of synopses that "shadows" a full-scale data warehouse. Incoming data is decomposed into partitions, and a synopsis is created for each partition. As the data partitions are rolled in and out of the full-scale warehouse, the corresponding synopses are rolled in and out of the synopsis warehouse. Synopses are combined as needed to yield synopses of the corresponding combination of partitions. This approach is both efficient, allowing parallel processing, and flexible. We discuss some recent work aimed at supporting a warehouse of synopses. Our focus is on two types of synopses: uniform random samples and synopses for estimating the number of distinct data values in a partition. Our algorithms correct, improve, and extend techniques such as classical reservoir and Bernoulli sampling, the "concise" and "sample counting" schemes of Gibbons and Matias, and various probabilistic-counting methods.
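As a concrete illustration of one of the synopsis types mentioned above, the sketch below implements classical reservoir sampling for a single partition, assuming a one-pass stream of items; the roll-in/roll-out and merging of partition synopses in the synopsis warehouse are not shown.

```python
# Hedged sketch: classical reservoir sampling over a single data partition.
# After the pass, every item has appeared in the sample with probability k/n.
import random

def reservoir_sample(stream, k):
    """One-pass uniform random sample of size k from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)        # fill the reservoir first
        else:
            j = random.randint(0, i)      # inclusive: keep item with probability k/(i+1)
            if j < k:
                reservoir[j] = item       # evict a random current member
    return reservoir

print(reservoir_sample(range(1_000_000), k=10))
```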
» Bio
Peter Haas has been a Research Staff Member at the IBM Almaden Research Center since 1987, and is also a Consulting Associate Professor in the Department of Management Science and Engineering at Stanford University. He has received a number of awards from both ACM SIGMOD and the IBM Research Division for his work on sampling-based exploration of massive datasets, automated relationship discovery in databases, query optimization methods, and technology for autonomic computing. Many of his techniques have been incorporated into IBM's DB2 database product. He has also developed theory and methods for modelling and simulation of complex discrete-event stochastic systems, and his book on stochastic Petri nets (Springer, 2002) received an Outstanding Publication Award from the INFORMS College on Simulation. He has served on numerous editorial boards and program committees, and is the author of over 100 conference publications, journal articles, and books.
09/21 |
Putting Context into Schema Matching --- What's up with 'Purple SOX'? |
More Info |
» Abstract
Title: Putting Context into Schema Matching
Attribute-level schema matching has proven to be an important first
step in developing mappings for data exchange, integration,
restructuring and schema evolution. We investigate
"contextual" schema matching, in which selection conditions are
associated with matches by the schema matching process in order to
improve overall match quality. We define a general space of matching
techniques, and within this framework we identify a variety of novel,
concrete algorithms for contextual schema matching. Furthermore, we
show how common schema mapping techniques can be generalized to
take more effective advantage of contextual matches, enabling
automatic construction of mappings across certain forms of schema
heterogeneity.
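To give a rough feel for what a contextual match looks like, the toy sketch below scores candidate matches of a source column against a target column under different selection conditions; the schemas, data, and scoring function are made up for illustration and are not the algorithms from the talk.

```python
# Toy sketch of the "contextual matching" idea: a source attribute may match a
# target attribute only under a selection condition on the target. Candidates
# are scored here by Jaccard overlap of instance values. Illustrative only.

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

source_home_phone = ["555-1111", "555-2222", "555-3333"]   # hypothetical source column
target_rows = [                                             # hypothetical target table
    {"phone": "555-1111", "type": "home"},
    {"phone": "555-2222", "type": "home"},
    {"phone": "555-9999", "type": "work"},
]

candidates = {
    "phone (unconditional)":     lambda r: True,
    "phone WHERE type = 'home'": lambda r: r["type"] == "home",
    "phone WHERE type = 'work'": lambda r: r["type"] == "work",
}
for label, condition in candidates.items():
    selected = [r["phone"] for r in target_rows if condition(r)]
    print(label, round(jaccard(source_home_phone, selected), 2))
```

On this toy data the conditional match "phone WHERE type = 'home'" scores highest, which is the kind of selection condition a contextual matcher would attach to the attribute-level match.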
Title: What's Up with 'Purple SOX'?
I will give a high-level overview of the 'Purple SOX' Extraction Management
System. This project, an outgrowth of the Cimple project with the
University of Wisconsin, is an effort to build a 'data management surround'
for information extraction that supports extensibility and explainability, and maximizes the use of social feedback on information quality.
» Bio
Philip Bohannon received a B.S. in Computer Science from
Birmingham-Southern College in 1986, an MS in Computer Science
from Rutgers University in 1998 and a PhD in 1999. He is
currently a Principal Research Scientist with the Community Systems
Research Group at Yahoo! Research. From 1996 to 2006 he was a
Member of Technical Staff at Bell Labs. His research
interests include information extraction, data integration and
cleaning, XML, and anything to do with building high-performance
or scalable data management systems.
09/28 |
The Development of an Internet Application Platform |
More Info |
» Abstract
Salesforce.com is an on-demand platform for building data-focused applications on the Internet. This talk discusses our multi-tenant development framework that includes our virtual database model, an on-demand programming language (Apex Code), and the operational techniques of supporting the full lifecycle of on-demand applications. Our service currently serves over one hundred million transactions a day. Apex Code is one of the first languages designed to be developed and operated over the Internet in a hosted, shared environment. This limits the type of operations that can be performed and the interaction pattern with the system. Requests are governed and designed to operate efficiently with batch idioms. They are cached and integrated with our view and persistence framework to create a single model for page navigation and data interaction. Applications are designed to be robust through upgrades. This talk will describe the trade-offs required in creating a hosted language and deploying versioned applications, give an overview of the Salesforce.com application platform and talk about future services required to complete the model.
» Bio
Craig Weissman is Chief Software Architect at Salesforce.com, the leading provider of on-demand multi-tenant enterprise software and the force.com application development platform. Craig has designed and built many areas of the Salesforce.com product including the underlying schema design and metadata layers that support the virtual database. Areas of Craig's focus include a multi-tenant set-based Object/Relational data modification engine and query optimizer, the Salesforce.com API (one of the world's most popular web services), and the Apex on-demand programming language and development environment. Craig's designs focus on relational databases at scale for both transactional and analytic processing and the blending of declarative and procedural concepts for rich, managed programming models.
Previously, Craig was an Engineering Fellow and Vice President of Development at E.piphany, Inc., a leading provider of enterprise software for data warehousing and analytical tools. Craig has a master's degree in Computer Science and a bachelor's degree in Applied Mathematics, both from Harvard University.
10/05 |
PNUTS: A Massively Scalable Data Management Service |
More Info |
» Abstract
The goal of the PNUTS project is to build a data management service that provides back-end support for Yahoo!'s web applications. To obtain acceptable latency and throughput while operating at Yahoo!'s scale, PNUTS uses massive parallelism and distribution: data is partitioned and replicated over thousands of servers. At the same time, PNUTS provides clean abstractions for data access that hide all this system complexity from the application programmer.
In contrast to traditional database solutions, PNUTS is a centrally hosted and managed data service. Such a shared service model frees applications from the burden of having to set up, maintain and scale their own data store, and also amortizes the operational cost across all of Yahoo!'s applications.
While designing such a large, distributed data management system, there is an inherent tradeoff between performance and consistency. One of the key design decisions in PNUTS is to provide higher performance by providing weaker forms of consistency than the ACID guarantees provided by a database system. Instead, we provide a carefully-chosen, minimal set of primitives that allow most applications to express and enforce their consistency requirements.
In this talk, I will describe the architecture of PNUTS, focusing especially on how it addresses challenges such as high performance, availability, data consistency, fault tolerance, and ease of operation.
PNUTS is a joint project between the Platform Engineering and Community Systems Research groups at Yahoo!
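As a rough illustration of the "partitioned and replicated over thousands of servers" idea, the sketch below hash-partitions record keys and assigns replica servers; the constants and placement policy are made up for the example, and this is a toy stand-in rather than PNUTS's actual routing, replication, or consistency machinery.

```python
# Toy sketch of hash partitioning with replication. All constants and the
# consecutive-server placement policy are hypothetical.
import hashlib

NUM_PARTITIONS = 1024
REPLICATION_FACTOR = 3

def partition_for_key(key):
    """Map a record key to a partition via a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

def replica_servers(partition, num_servers):
    """Place replicas of a partition on consecutive servers (simplistic policy)."""
    return [(partition + r) % num_servers for r in range(REPLICATION_FACTOR)]

p = partition_for_key("user:12345")
print(p, replica_servers(p, num_servers=200))
```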
10/12 |
Trio: A System for Data, Uncertainty, and Lineage |
More Info |
» Abstract
Trio is a new kind of database system that supports DBMS-style data management, uncertainty, and lineage in a fully integrated manner. The talk presents the ULDB (for Uncertainty-Lineage Databases) data model, an extension of relational databases with simple yet expressive constructs for representing and manipulating both lineage and uncertainty. Uncertain data and data lineage are two important areas of data management that have been considered extensively in isolation; however, many applications require both features in tandem. Fundamentally, lineage enables a simple and consistent representation of uncertain data; it correlates uncertainty in query results with uncertainty in the input data; and query processing with lineage and uncertainty together offers computational benefits over treating them separately. We show that the ULDB representation is complete, and that it permits straightforward implementation of many relational operations. We also show how ULDBs enable a new approach to query processing in probabilistic databases.
Finally, we’ll have a look at the current state of our first Trio prototype system, dubbed Trio-One, currently being developed at Stanford. Trio-One, our implementation of ULDBs, is built on top of a conventional DBMS using data and query translation techniques together with a small number of stored procedures.
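To give a flavor of the ULDB constructs described above, here is a toy rendering of an uncertain tuple with alternatives, confidences, and lineage; the class names and fields are illustrative and are not Trio's actual interface.

```python
# Toy sketch of the ULDB idea: an uncertain tuple ("x-tuple") holds mutually
# exclusive alternatives, and each alternative can carry a confidence and the
# lineage (ids of the base alternatives it was derived from). Illustrative only.
from dataclasses import dataclass, field

@dataclass
class Alternative:
    values: tuple                 # attribute values of this alternative
    confidence: float = 1.0       # probability this alternative is the true one
    lineage: frozenset = field(default_factory=frozenset)  # source alternative ids

@dataclass
class XTuple:
    alternatives: list            # mutually exclusive alternatives

# A derived alternative (e.g., produced by a join) records lineage to its
# inputs, so its confidence can be recomputed from the inputs rather than
# stored blindly.
saw = XTuple([Alternative(("Amy", "Honda"), 0.6, frozenset({"saw:1a"})),
              Alternative(("Amy", "Toyota"), 0.4, frozenset({"saw:1b"}))])
print(saw)
```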
10/19 |
Web Categorization: Applications and Challenges |
More Info |
» Abstract
We will discuss several real-world applications of web categorization, including vertical search, topic home pages, and others. We will also discuss the challenges in categorizing the web into millions of categories, drawing on our experience at Kosmix building a scalable web categorization platform.
» Bio
Srinivasan "Sesh" Seshadri has straddled academia and industry over a career spanning Kosmix, Yahoo!, Strand Genomics, Bell Labs and IIT Bombay. Over his career, Sesh's roles have varied from research and teaching to building hi-tech companies. He is currently the CTO of Kosmix.
10/26 |
Experiment-Driven Processing of System-Management Queries |
More Info |
» Abstract
Database-backed Web services (e.g., Amazon, eBay, Yahoo!) play an important role in our daily lives. The performance P (e.g., throughput) of a Web service S is a complex function of its workload W, resource allocation R, and the large number of configuration parameters C that affect S. Furthermore, P may be dictated by unknown interactions among W, R, and C. We have developed a systematic approach based on statistical design of experiments and active machine learning to discover these dependencies and interactions accurately and comprehensively. Our approach plans a small set of experiments, where each experiment observes P for a selected combination of W, R, and C settings. In this talk, I will describe (i) how we use the experiment-driven approach to process four basic queries in Web-service management; (ii) a harness that leverages virtualization to conduct experiments with specified combinations; and (iii) an empirical evaluation using two multitier Web services that demonstrates the feasibility and usefulness of our approach. I will conclude by describing how we applied the same experiment-driven approach to tackle challenges in managing scientific applications in a utility computing setting, an NFS file server, and a database management system.
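As a hedged sketch of what planning a small set of experiments can look like, the snippet below generates a Latin hypercube design over two hypothetical configuration parameters; the actual system described above combines statistical design of experiments with active learning to choose combinations adaptively, which is not reproduced here.

```python
# Hedged sketch: a Latin hypercube design over configuration parameters, one
# simple way to cover the space with few experiments. Parameter names and
# ranges are hypothetical.
import random

def latin_hypercube(param_ranges, num_experiments):
    """param_ranges: {name: (low, high)}. Returns num_experiments config dicts."""
    columns = {}
    for name, (low, high) in param_ranges.items():
        width = (high - low) / num_experiments
        # one sample per stratum, then shuffle strata independently per parameter
        samples = [low + (i + random.random()) * width for i in range(num_experiments)]
        random.shuffle(samples)
        columns[name] = samples
    return [{name: columns[name][i] for name in param_ranges}
            for i in range(num_experiments)]

for cfg in latin_hypercube({"buffer_pool_mb": (64, 4096), "num_threads": (1, 64)}, 5):
    print(cfg)  # each configuration would then be run in the virtualized harness
```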
» Bio
Shivnath Babu is an Assistant Professor of Computer Science at Duke University. He received his Ph.D. from Stanford University in 2005. He was awarded a National Science Foundation Early CAREER Award in 2007 for his work on the Ques project on Querying and Controlling Systems.
He is also the recipient of two IBM Faculty Awards. His current research focuses on making large-scale databases and systems easier to manage.
11/02 |
Statistical Analysis for Approximate Query Processing |
More Info |
» Abstract
The biggest obstacle in designing and analyzing randomized techniques like sampling and sketching for complex aggregate queries is the fact that the statistical analysis becomes extremely complicated. The formulas literally explode on paper and are very hard to control. In this talk I will show a number of "tricks" that can be used to keep the analysis under control for sampling estimators for SELECT-FROM-WHERE aggregate queries. This is essentially the statistical analysis that needs to be performed in order to characterize the sampling estimates produced by the DBO engine (developed in collaboration with Chris Jermaine). A number of surprising facts will surface about the statistical analysis:
- a general template can be developed that applies straightforwardly to the analysis of all uniform sampling methods
- significant savings in the computation of estimator variances can be obtained with such statistical analysis
- the analysis of estimators such as sampling-based ones can be mechanized to a large extent if the algebraic structure is exploited
- while the analysis is somewhat sophisticated, the results are surprisingly useful for designing efficient algorithms and understanding sampling estimators
Surprisingly, the work on statistical estimators for DBO can be extended to the computation of aggregates over probabilistic databases. As I will explain, without such analysis, the computation of the variance of probabilistic aggregates for queries involving two or more relations is hopeless. The same kind of observations that allow efficient estimation for DBO estimates can be used to efficiently compute expectations and variances of probabilistic aggregates, which immediately gives useful confidence bounds. Interestingly, the database queries can be translated into a functional form, simplified using the method we developed, and translated back into database queries of just slightly higher complexity (some GROUP BY clauses have to be added).
While the theme of the talk is rather mathematical, I will focus on a basic understanding of the techniques, give a history of how my collaborators and I discovered them, and discuss the database implications of the theoretical results.
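For readers who want the flavor of the kind of sampling estimator being analyzed, here is a textbook sketch of a uniform-sample estimate of a single-table SELECT SUM(...) WHERE ... query with an approximate confidence interval; it illustrates standard sampling theory rather than the talk's specific derivations.

```python
# Hedged sketch: uniform-sampling estimator for SUM over one table with a
# selection predicate, plus an approximate 95% confidence interval (the
# finite-population correction is omitted for simplicity). Illustrative only.
import math, random

def estimate_sum(table, predicate, value_of, sample_size):
    N = len(table)
    sample = random.sample(table, sample_size)
    contribs = [value_of(r) if predicate(r) else 0.0 for r in sample]
    mean = sum(contribs) / sample_size
    estimate = N * mean                                    # unbiased estimate of the SUM
    var = sum((c - mean) ** 2 for c in contribs) / (sample_size - 1)
    stderr = N * math.sqrt(var / sample_size)
    return estimate, (estimate - 1.96 * stderr, estimate + 1.96 * stderr)

rows = [{"price": random.uniform(1, 100), "cat": random.choice("ab")} for _ in range(100_000)]
print(estimate_sum(rows, lambda r: r["cat"] == "a", lambda r: r["price"], sample_size=1000))
```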
11/09 |
Beauty and the Beast: The Theory and Practice of Information Integration |
More Info |
» Abstract
Information integration is becoming a critical problem for businesses and individuals alike. Data volumes are skyrocketing, and new sources and types of information are proliferating. This talk briefly reviews some of the key research accomplishments in information integration (theory and systems), then describes the current state of the art in commercial practice and the challenges (still) faced by CIOs and application developers. One critical challenge is choosing the right combination of tools and technologies to do the integration. Although each has been studied separately, we lack a unified (and certainly, a unifying) understanding of these various approaches to integration. Experience with a variety of integration projects suggests that we need a broader framework, perhaps even a theory, which explicitly takes into account requirements on the result of the integration, and considers the entire end-to-end integration process.
11/23 |
Thanksgiving Holiday |
(no seminar) |
11/30 |
Self-Managing DBMS Technology: The AutoAdmin Experience |
More Info |
» Abstract
The AutoAdmin project at Microsoft Research was started in late 1996. Our goal was to make it easier to monitor the server and develop self-tuning techniques for performance management. The technology from this project has been incorporated in the Microsoft SQL Server 2005 (and earlier releases - SQL Server 7.0 and SQL Server 2000). This talk will take a look at some of the past research results and discuss challenges and opportunities in self-tuning DBMS research.
» To subscribe to the mailing list:
To subscribe to a list, send e-mail to
majordomo@db.cs.berkeley.edu,
with a message (not a Subject line!) containing only the words:
subscribe <list name>
As an example, one database wannabe might send the message:
To: majordomo@db.cs.berkeley.edu
From: turing@acm.org (Alan Turing)
Subject: I wannabe!
subscribe dblunch
Unsubscribing is similar:
To: majordomo@db.cs.berkeley.edu
From: turing@acm.org (Alan Turing)
Subject: can't make it any more
unsubscribe dblunch