All talks are held in 606 Soda Hall from 1:00 to 2:30pm unless otherwise specified. On this site, you will find upcoming talk announcements and abstracts from previous talks in Berkeley's database seminar. You can also receive talk announcements on our mailing list. If you are interested, please see the instructions for adding yourself to the mailing list.
» Fall 2007 Calendar
09/07 |
On Privacy of Users in Web Search Query Logs |
More Info |
» Abstract
There is great appetite to study query logs as a rich window into
human intent, but as the AOL incident shows, the privacy concerns are
broad and well-founded. It is important to anonymize the query logs
before attempting any public release.
We start by studying the privacy preservation properties of a specific
technique for query log anonymization: token-based hashing, where each
query is tokenized, and then a secure hash function is applied to each
token. We show that statistical techniques may be applied to partially
compromise the anonymization, and sensitive information about users can
be revealed from the reconstructed query log. We then investigate the
risk of revealing user identity from queries in two different scenarios.
In the first setting, we study the application of simple classifiers to
map a sequence of queries into a gender, age, and location of the user
issuing the queries; based on this information, we examine how we can
map the queries into a set of candidate users and how to identify a
known user from a large query log. In the second setting, we examine
an anonymization approach to "bundle" logs of multiple users together.
We investigate the risk of recovering queries from a given user by first
locating the bundle for that user via vanity search, followed by an
analysis of the structural and analytical vulnerabilities inside the
bundles.
(Joint work with Rosie Jones, Ravi Kumar, Jasmine Novak, and Andrew Tomkins.)
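To make the token-based hashing scheme being analyzed concrete, here is a minimal sketch of that anonymization step; the salt and truncation length are illustrative choices, not taken from the work described above.

```python
# Hedged sketch of token-based hashing for query-log anonymization.
# The salt and truncation length are illustrative, not the paper's choices.
import hashlib

def anonymize_query(query, salt=b"illustrative-salt"):
    """Tokenize a query and replace each token with a keyed secure hash."""
    hashed = []
    for token in query.lower().split():
        digest = hashlib.sha256(salt + token.encode("utf-8")).hexdigest()
        hashed.append(digest[:12])  # truncated only for readability
    return " ".join(hashed)

# The same token always maps to the same hash across the log, so token
# frequency and co-occurrence statistics survive hashing, which is exactly
# what the statistical attack described above exploits to invert the mapping.
print(anonymize_query("cheap flights to berkeley"))
```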
» Abstract
Data synopses are an essential ingredient of methods for fast approximate analytical processing, interactive data exploration, auditing, and automated metadata discovery. We consider the problem of maintaining a warehouse of synopses that "shadows" a full-scale data warehouse. Incoming data is decomposed into partitions, and a synopsis is created for each partition. As the data partitions are rolled in and out of the full-scale warehouse, the corresponding synopses are rolled in and out of the synopsis warehouse. Synopses are combined as needed to yield synopses of the corresponding combination of partitions. This approach is both efficient, allowing parallel processing, and flexible. We discuss some recent work aimed at supporting a warehouse of synopses. Our focus is on two types of synopses: uniform random samples and synopses for estimating the number of distinct data values in a partition. Our algorithms correct, improve, and extend techniques such as classical reservoir and Bernoulli sampling, the "concise" and "sample counting" schemes of Gibbons and Matias, and various probabilistic-counting methods.
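As a concrete illustration of one of the synopsis types mentioned above, the sketch below implements classical reservoir sampling for a single partition, assuming a one-pass stream of items; the roll-in/roll-out and merging of partition synopses in the synopsis warehouse are not shown.

```python
# Hedged sketch: classical reservoir sampling over a single data partition.
# After the pass, every item has appeared in the sample with probability k/n.
import random

def reservoir_sample(stream, k):
    """One-pass uniform random sample of size k from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)        # fill the reservoir first
        else:
            j = random.randint(0, i)      # inclusive: keep item with probability k/(i+1)
            if j < k:
                reservoir[j] = item       # evict a random current member
    return reservoir

print(reservoir_sample(range(1_000_000), k=10))
```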
» Bio
Peter Haas has been a Research Staff Member at the IBM Almaden Research Center since 1987, and is also a Consulting Associate Professor in the Department of Management Science and Engineering at Stanford University. He has received a number of awards from both ACM SIGMOD and the IBM Research Division for his work on sampling-based exploration of massive datasets, automated relationship discovery in databases, query optimization methods, and technology for autonomic computing. Many of his techniques have been incorporated into IBM's DB2 database product. He has also developed theory and methods for modelling and simulation of complex discrete-event stochastic systems, and his book on stochastic Petri nets (Springer, 2002) received an Outstanding Publication Award from the INFORMS College on Simulation. He has served on numerous editorial boards and program committees, and is the author of over 100 conference publications, journal articles, and books.
09/21 |
Putting Context into Schema Matching --- What's up with 'Purple SOX'? |
More Info |
» Abstract
Title: Putting Context into Schema Matching
Attribute-level schema matching has proven to be an important first
step in developing mappings for data exchange, integration,
restructuring and schema evolution. We investigate
"contextual" schema matching, in which selection conditions are
associated with matches by the schema matching process in order to
improve overall match quality. We define a general space of matching
techniques, and within this framework we identify a variety of novel,
concrete algorithms for contextual schema matching. Furthermore, we
show how common schema mapping techniques can be generalized to
take more effective advantage of contextual matches, enabling
automatic construction of mappings across certain forms of schema
heterogeneity.
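To give a rough feel for what a contextual match looks like, the toy sketch below scores candidate matches of a source column against a target column under different selection conditions; the schemas, data, and scoring function are made up for illustration and are not the algorithms from the talk.

```python
# Toy sketch of the "contextual matching" idea: a source attribute may match a
# target attribute only under a selection condition on the target. Candidates
# are scored here by Jaccard overlap of instance values. Illustrative only.

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

source_home_phone = ["555-1111", "555-2222", "555-3333"]   # hypothetical source column
target_rows = [                                             # hypothetical target table
    {"phone": "555-1111", "type": "home"},
    {"phone": "555-2222", "type": "home"},
    {"phone": "555-9999", "type": "work"},
]

candidates = {
    "phone (unconditional)":     lambda r: True,
    "phone WHERE type = 'home'": lambda r: r["type"] == "home",
    "phone WHERE type = 'work'": lambda r: r["type"] == "work",
}
for label, condition in candidates.items():
    selected = [r["phone"] for r in target_rows if condition(r)]
    print(label, round(jaccard(source_home_phone, selected), 2))
```

On this toy data the conditional match "phone WHERE type = 'home'" scores highest, which is the kind of selection condition a contextual matcher would attach to the attribute-level match.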
Title: What's Up with 'Purple SOX'?
I will give a high-level overview of the 'Purple SOX' Extraction Management
System. This project, an outgrowth of the Cimple project with the
University of Wisconsin, is an effort to build a 'data management surround'
for information extraction that supports extensibility and explainability, and maximizes the use of social feedback on information quality.
» Bio
Philip Bohannon received a B.S. in Computer Science from
Birmingham-Southern College in 1986, an MS in Computer Science
from Rutgers University in 1998 and a PhD in 1999. He is
currently a Principal Research Scientist with the Community Systems
Research Group at Yahoo! Research. From 1996 to 2006 he was a
Member of Technical Staff at Bell Labs. His research
interests include information extraction, data integration and
cleaning, XML, and anything to do with building high-performance
or scalable data management systems.
09/28 |
The Development of an Internet Application Platform |
More Info |
» Abstract
Salesforce.com is an on-demand platform for building data-focused applications on the Internet. This talk discusses our multi-tenant development framework that includes our virtual database model, an on-demand programming language (Apex Code), and the operational techniques of supporting the full lifecycle of on-demand applications. Our service currently serves over one hundred million transactions a day. Apex Code is one of the first languages designed to be developed and operated over the Internet in a hosted, shared environment. This limits the type of operations that can be performed and the interaction pattern with the system. Requests are governed and designed to operate efficiently with batch idioms. They are cached and integrated with our view and persistence framework to create a single model for page navigation and data interaction. Applications are designed to be robust through upgrades. This talk will describe the trade-offs required in creating a hosted language and deploying versioned applications, give an overview of the Salesforce.com application platform and talk about future services required to complete the model.
» Bio
Craig Weissman is Chief Software Architect at Salesforce.com, the leading provider of on-demand multi-tenant enterprise software and the force.com application development platform. Craig has designed and built many areas of the Salesforce.com product including the underlying schema design and metadata layers that support the virtual database. Areas of Craig's focus include a multi-tenant set-based Object/Relational data modification engine and query optimizer, the Salesforce.com API (one of the world's most popular web services), and the Apex on-demand programming language and development environment. Craig's designs focus on relational databases at scale for both transactional and analytic processing and the blending of declarative and procedural concepts for rich, managed programming models.
Previously, Craig was an Engineering Fellow and Vice President of Development at E.piphany, Inc., a leading provider of enterprise software for data warehousing and analytical tools. Craig has a master's degree in Computer Science and a bachelor's degree in Applied Mathematics, both from Harvard University.
10/05 |
PNUTS: A Massively Scalable Data Management Service |
More Info |
» Abstract
The goal of the PNUTS project is to build a data management service that provides back-end support for Yahoo!'s web applications. To obtain acceptable latency and throughput while operating at Yahoo!'s scale, PNUTS uses massive parallelism and distribution: data is partitioned and replicated over thousands of servers. At the same time, PNUTS provides clean abstractions for data access that hide all this system complexity from the application programmer.
In contrast to traditional database solutions, PNUTS is a centrally hosted and managed data service. Such a shared service model frees applications from the burden of having to set up, maintain and scale their own data store, and also amortizes the operational cost across all of Yahoo!'s applications.
While designing such a large, distributed data management system, there is an inherent tradeoff between performance and consistency. One of the key design decisions in PNUTS is to provide higher performance by providing weaker forms of consistency than the ACID guarantees provided by a database system. Instead, we provide a carefully-chosen, minimal set of primitives that allow most applications to express and enforce their consistency requirements.
In this talk, I will describe the architecture of PNUTS, focusing especially on how it addresses challenges such as high performance, availability, data consistency, fault tolerance, and ease of operation.
PNUTS is a joint project between the Platform Engineering and Community Systems Research groups at Yahoo!
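As a rough illustration of the "partitioned and replicated over thousands of servers" idea, the sketch below hash-partitions record keys and assigns replica servers; the constants and placement policy are made up for the example, and this is a toy stand-in rather than PNUTS's actual routing, replication, or consistency machinery.

```python
# Toy sketch of hash partitioning with replication. All constants and the
# consecutive-server placement policy are hypothetical.
import hashlib

NUM_PARTITIONS = 1024
REPLICATION_FACTOR = 3

def partition_for_key(key):
    """Map a record key to a partition via a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

def replica_servers(partition, num_servers):
    """Place replicas of a partition on consecutive servers (simplistic policy)."""
    return [(partition + r) % num_servers for r in range(REPLICATION_FACTOR)]

p = partition_for_key("user:12345")
print(p, replica_servers(p, num_servers=200))
```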
10/12 |
Trio: A System for Data, Uncertainty, and Lineage |
More Info |
» Abstract
Trio is a new kind of database system that supports DBMS-style data management, uncertainty, and lineage in a fully integrated manner. The talk presents the ULDB (for Uncertainty-Lineage Databases) data model, an extension of relational databases with simple yet expressive constructs for representing and manipulating both lineage and uncertainty. Uncertain data and data lineage are two important areas of data management that have been considered extensively in isolation; however, many applications require both features in tandem. Fundamentally, lineage enables a simple and consistent representation of uncertain data; it correlates uncertainty in query results with uncertainty in the input data; and query processing with lineage and uncertainty together offers computational benefits over treating them separately. We show that the ULDB representation is complete, and that it permits straightforward implementation of many relational operations. We also show how ULDBs enable a new approach to query processing in probabilistic databases.
Finally, we’ll have a look at the current state of our first Trio prototype system, dubbed Trio-One, currently being developed at Stanford. Trio-One, our implementation of ULDBs, is built on top of a conventional DBMS using data and query translation techniques together with a small number of stored procedures.
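To give a flavor of the ULDB constructs described above, here is a toy rendering of an uncertain tuple with alternatives, confidences, and lineage; the class names and fields are illustrative and are not Trio's actual interface.

```python
# Toy sketch of the ULDB idea: an uncertain tuple ("x-tuple") holds mutually
# exclusive alternatives, and each alternative can carry a confidence and the
# lineage (ids of the base alternatives it was derived from). Illustrative only.
from dataclasses import dataclass, field

@dataclass
class Alternative:
    values: tuple                 # attribute values of this alternative
    confidence: float = 1.0       # probability this alternative is the true one
    lineage: frozenset = field(default_factory=frozenset)  # source alternative ids

@dataclass
class XTuple:
    alternatives: list            # mutually exclusive alternatives

# A derived alternative (e.g., produced by a join) records lineage to its
# inputs, so its confidence can be recomputed from the inputs rather than
# stored blindly.
saw = XTuple([Alternative(("Amy", "Honda"), 0.6, frozenset({"saw:1a"})),
              Alternative(("Amy", "Toyota"), 0.4, frozenset({"saw:1b"}))])
print(saw)
```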
10/19 |
Web Categorization: Applications and Challenges |
More Info |
» Abstract
We will discuss several real-world applications of web categorization, including vertical search, topic home pages, and others. We will also discuss the challenges in categorizing the web into millions of categories, drawing on our experience at Kosmix building a scalable web categorization platform.
» Bio
Srinivasan "Sesh" Seshadri has straddled academia and industry over a career spanning Kosmix, Yahoo!, Strand Genomics, Bell Labs and IIT Bombay. Over his career, Sesh's roles have varied from research and teaching to building hi-tech companies. He is currently the CTO of Kosmix.
10/26 |
Experiment-Driven Processing of System-Management Queries |
More Info |
» Abstract
Database-backed Web services (e.g., Amazon, eBay, Yahoo!) play an important role in our daily lives. The performance P (e.g., throughput) of a Web service S is a complex function of its workload W, resource allocation R, and the large number of configuration parameters C that affect S. Furthermore, P may be dictated by unknown interactions among W, R, and C. We have developed a systematic approach based on statistical design of experiments and active machine learning to discover these dependencies and interactions accurately and comprehensively. Our approach plans a small set of experiments, where each experiment observes P for a selected combination of W, R, and C settings. In this talk, I will describe (i) how we use the experiment-driven approach to process four basic queries in Web-service management; (ii) a harness that leverages virtualization to conduct experiments with specified combinations; and (iii) an empirical evaluation using two multitier Web services that demonstrates the feasibility and usefulness of our approach. I will conclude by describing how we applied the same experiment-driven approach to tackle challenges in managing scientific applications in a utility computing setting, an NFS file server, and a database management system.
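As a hedged sketch of what planning a small set of experiments can look like, the snippet below generates a Latin hypercube design over two hypothetical configuration parameters; the actual system described above combines statistical design of experiments with active learning to choose combinations adaptively, which is not reproduced here.

```python
# Hedged sketch: a Latin hypercube design over configuration parameters, one
# simple way to cover the space with few experiments. Parameter names and
# ranges are hypothetical.
import random

def latin_hypercube(param_ranges, num_experiments):
    """param_ranges: {name: (low, high)}. Returns num_experiments config dicts."""
    columns = {}
    for name, (low, high) in param_ranges.items():
        width = (high - low) / num_experiments
        # one sample per stratum, then shuffle strata independently per parameter
        samples = [low + (i + random.random()) * width for i in range(num_experiments)]
        random.shuffle(samples)
        columns[name] = samples
    return [{name: columns[name][i] for name in param_ranges}
            for i in range(num_experiments)]

for cfg in latin_hypercube({"buffer_pool_mb": (64, 4096), "num_threads": (1, 64)}, 5):
    print(cfg)  # each configuration would then be run in the virtualized harness
```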
» Bio
Shivnath Babu is an Assistant Professor of Computer Science at Duke University. He received his Ph.D. from Stanford University in 2005. He was awarded a National Science Foundation Early CAREER Award in 2007 for his work on the Ques project on Querying and Controlling Systems.
He is also the recipient of two IBM Faculty Awards. His current research focuses on making large-scale databases and systems easier to manage.
11/02 |
Statistical Analysis for Approximate Query Processing |
More Info |
» Abstract
The biggest obstacle in designing and analyzing randomized techniques like sampling and sketching for complex aggregate queries is the fact that the statistical analysis becomes extremely complicated. The formulas literally explode on paper and are very hard to control. In this talk I will show a number of "tricks" that can be used to keep the analysis under control for sampling estimators for SELECT-FROM-WHERE aggregate queries. This is essentially the statistical analysis that needs to be performed in order to characterize the sampling estimates produced by the DBO engine (developed in collaboration with Chris Jermaine). A number of surprising facts will surface about the statistical analysis:
- a general template can be developed that applies straightforwardly to the analysis of all uniform sampling methods
- significant savings in the computation of estimator variances can be obtained with such statistical analysis
- the analysis of estimators such as sampling-based ones can be mechanized to a large extent if the algebraic structure is exploited
- while the analysis is somewhat sophisticated, the results are surprisingly useful for designing efficient algorithms and understanding sampling estimators
Surprisingly, the work on statistical estimators for DBO can be extended to the computation of aggregates over probabilistic databases. As I will explain, without such analysis, the computation of the variance of probabilistic aggregates for queries involving two or more relations is hopeless. The same kind of observations that allow efficient estimation for DBO estimates can be used to efficiently compute expectations and variances of probabilistic aggregates, which immediately gives useful confidence bounds. Interestingly, the database queries can be translated into a functional form, simplified using the method we developed, and translated back into database queries of just slightly higher complexity (some GROUP BY clauses have to be added).
While the theme of the talk is rather mathematical, I will focus on a basic understanding of the techniques, give a history of how my collaborators and I discovered them, and discuss the database implications of the theoretical results.
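For readers who want the flavor of the kind of sampling estimator being analyzed, here is a textbook sketch of a uniform-sample estimate of a single-table SELECT SUM(...) WHERE ... query with an approximate confidence interval; it illustrates standard sampling theory rather than the talk's specific derivations.

```python
# Hedged sketch: uniform-sampling estimator for SUM over one table with a
# selection predicate, plus an approximate 95% confidence interval (the
# finite-population correction is omitted for simplicity). Illustrative only.
import math, random

def estimate_sum(table, predicate, value_of, sample_size):
    N = len(table)
    sample = random.sample(table, sample_size)
    contribs = [value_of(r) if predicate(r) else 0.0 for r in sample]
    mean = sum(contribs) / sample_size
    estimate = N * mean                                    # unbiased estimate of the SUM
    var = sum((c - mean) ** 2 for c in contribs) / (sample_size - 1)
    stderr = N * math.sqrt(var / sample_size)
    return estimate, (estimate - 1.96 * stderr, estimate + 1.96 * stderr)

rows = [{"price": random.uniform(1, 100), "cat": random.choice("ab")} for _ in range(100_000)]
print(estimate_sum(rows, lambda r: r["cat"] == "a", lambda r: r["price"], sample_size=1000))
```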
11/09 |
Beauty and the Beast: The Theory and Practice of Information Integration |
More Info |
» Abstract
Information integration is becoming a critical problem for businesses and individuals alike. Data volumes are skyrocketing, and new sources and types of information are proliferating. This talk briefly reviews some of the key research accomplishments in information integration (theory and systems), then describes the current state of the art in commercial practice and the challenges (still) faced by CIOs and application developers. One critical challenge is choosing the right combination of tools and technologies to do the integration. Although each has been studied separately, we lack a unified (and certainly, a unifying) understanding of these various approaches to integration. Experience with a variety of integration projects suggests that we need a broader framework, perhaps even a theory, which explicitly takes into account requirements on the result of the integration, and considers the entire end-to-end integration process.
11/23 |
Thanksgiving Holiday |
(no seminar) |
11/30 |
Self-Managing DBMS Technology: The AutoAdmin Experience |
More Info |
» Abstract
The AutoAdmin project at Microsoft Research was started in late 1996. Our goal was to make it easier to monitor the server and develop self-tuning techniques for performance management. The technology from this project has been incorporated in the Microsoft SQL Server 2005 (and earlier releases - SQL Server 7.0 and SQL Server 2000). This talk will take a look at some of the past research results and discuss challenges and opportunities in self-tuning DBMS research.
» To subscribe to the mailing list:
To subscribe to a list, send e-mail to
majordomo@db.cs.berkeley.edu,
with a message (not a Subject line!) containing only the words:
subscribe <list name>
As an example, one database wannabe might send the message:
To: majordomo@db.cs.berkeley.edu
From: turing@acm.org (Alan Turing)
Subject: I wannabe!
subscribe dblunch
Unsubscribing is similar:
To: majordomo@db.cs.berkeley.edu
From: turing@acm.org (Alan Turing)
Subject: can't make it any more
unsubscribe dblunch