The DB Seminar this semester has a Scalable Data Analytics theme, with people from Systems, Languages, Algorithms/Machine Learning and Visualization.
All talks are held in 606 Soda Hall from 1:00 to 2:00pm, preceded by lunch from 12:30pm to 1:00pm, unless otherwise specified. On this site, you will find upcoming talk announcements and abstracts from previous talks in Berkeley's database seminar.
You can also receive talk announcements on our mailing list. If you are interested,
please see the instructions for adding yourself to the mailing list.
Speaker: TBA
Title: TBA
Time/Venue: 1:00-2:00pm / 606 Soda
» Fall 2009 Calendar
09/18 | Democratizing The Cloud With The .NET Reactive Framework Rx | Erik Meijer, Microsoft Research
» Abstract
Current languages and libraries provide limited support for asynchronous and event-based programming. This forces developers to use explicit continuation-passing style, breaking their code into many disjoint event handlers. The lack of a simple programming model for asynchronous programming is quickly becoming problematic for developers because of the inevitable advent of many-core computers and distributed and Cloud computing. Remember that the "A" in AJAX stands for "Asynchronous".
This talk shows that the familiar Subject/Observer design pattern is the mathematical dual of the even more familiar Iterator design pattern. From this abstract duality, we derive a concrete implementation of the LINQ standard sequence operators for "push"-based, observable collections, based on a pair of IObservable and IObserver interfaces that are the mirror images of the familiar "pull"-based, enumerable collections defined using the IEnumerable and IEnumerator pair of interfaces. As a consequence, we can elegantly formulate many concurrent, asynchronous, and event-driven programs using standard LINQ query comprehensions.
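The pull/push duality the abstract describes can be sketched in a few lines of plain Python. The class and method names below (`Observable`, `select`, `where`, `from_iterable`) are illustrative stand-ins, not the actual Rx or LINQ API:

```python
# Pull: the consumer calls next() on an iterator to request each value.
# Push: the producer calls the observer's callback to deliver each value.

class Observable:
    """A push-based collection: subscribing wires a callback to a producer."""
    def __init__(self, subscribe_fn):
        self._subscribe_fn = subscribe_fn

    def subscribe(self, on_next):
        self._subscribe_fn(on_next)

    def select(self, f):
        # Push-based mirror of LINQ's Select (map) over IEnumerable.
        return Observable(lambda on_next: self.subscribe(lambda x: on_next(f(x))))

    def where(self, pred):
        # Push-based mirror of LINQ's Where (filter).
        return Observable(lambda on_next: self.subscribe(
            lambda x: on_next(x) if pred(x) else None))

def from_iterable(xs):
    # Bridge from a pull-based sequence to a push-based one.
    return Observable(lambda on_next: [on_next(x) for x in xs])

results = []
from_iterable(range(10)).where(lambda x: x % 2 == 0) \
    .select(lambda x: x * x).subscribe(results.append)
# results == [0, 4, 16, 36, 64]
```

Note how each operator composes by wrapping the upstream `subscribe`, just as the pull-based operators compose by wrapping the upstream iterator.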
» Bio
Erik Meijer is an accomplished programming-language designer who has worked on a wide range of languages, including Haskell, Mondrian, X#, Cω, C#, and Visual Basic. He runs the Cloud Programmability Team in the Business Platform Division at Microsoft, where his primary focus has been to democratize the Cloud. One of the fruits of these efforts is LINQ, which not only adds a native querying syntax to .NET languages, such as C# and Visual Basic, but also allows developers to query data sources other than tables, such as objects or XML. Most recently, Erik has been preaching the virtues of fundamentalist functional programming in the new age of concurrency and many-core, and pushing the .NET Reactive Framework Rx. Some people might recognize him from his brief stint as the "Head in the Box" on Microsoft VBTV. These days, you can regularly watch Erik's interviews on the "Expert-to-Expert" and "Going Deep" series on Channel 9.
10/02 | Parallax: a time-oriented progress indicator for MapReduce pipelines | Magdalena Balazinska, U. of Washington
» Abstract
The ability to analyze massive-scale datasets has become a critical requirement for industry and sciences alike. Because of the magnitude of the data involved in today's applications, users are increasingly turning toward parallel data processing systems running in shared-nothing clusters. These systems provide efficient query
processing facilities, but the magnitude of input data sets still causes most queries to take from minutes to several hours to complete. At this scale, users need more than efficient processing. They also need effective tools for managing their queries at runtime.
In this talk, we will present our ongoing work developing Parallax, the first non-trivial, time-oriented progress indicator for parallel queries. We developed the current version of Parallax for Pig queries running on Hadoop, a popular open-source parallel data-processing engine under active development. As an initial step, we focused on Pig Latin queries that compile into a series of MapReduce jobs. In this talk, we will present the techniques behind Parallax and our preliminary experimental results.
This work is part of the Nuage project at the University of Washington: http://nuage.cs.washington.edu. We will touch upon the larger vision behind the Nuage project and the other problems that we are currently tackling within this project.
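To make concrete what a time-oriented (rather than percentage-oriented) indicator reports, here is a toy estimator for a pipeline of sequential jobs. The actual Parallax techniques are considerably more sophisticated; the function, its inputs, and the constant-rate assumption below are purely illustrative:

```python
# Toy time-remaining estimate for a pipeline of sequential MapReduce jobs:
# sum the unprocessed tuples across all stages and divide by an observed
# processing rate. (Parallax itself handles per-phase rates, concurrency,
# and more; this only illustrates the "time remaining" output.)

def remaining_seconds(jobs, tuples_per_sec):
    """jobs: list of (tuples_processed, tuples_total), one per pipeline stage."""
    remaining_tuples = sum(total - done for done, total in jobs)
    return remaining_tuples / tuples_per_sec

# A 2-job pipeline: job 1 is 75% done, job 2 has not started.
jobs = [(750, 1000), (0, 500)]
secs = remaining_seconds(jobs, tuples_per_sec=50.0)
print(secs)  # 750 tuples left at 50 tuples/s -> 15.0
```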
» Bio
Magdalena Balazinska is an Assistant Professor in the Department of Computer Science and Engineering at the University of Washington. Magdalena's research interests are broadly in the fields of databases and distributed systems. Her current research focuses on distributed stream processing, sensor and scientific data management, and cloud computing. Magdalena holds a PhD from the Massachusetts Institute of Technology (2006). She is the recipient of an NSF CAREER Award (2009), a Microsoft Research Faculty Fellowship (2007), the Rogel Faculty Support Award (2006), and a Microsoft Research Graduate Fellowship (2003-2005).
10/16 | MCDB: The Monte Carlo Database System | Chris Jermaine, Rice University
» Abstract
Analysts working with large data sets often use statistical models to "guess" at unknown, inaccurate, or missing information associated with
the data stored in a database. For example, an analyst for a manufacturer may wish to know, "What would my profits have been if I'd increased my margins by 5% last year?" The answer to this question naturally depends upon the extent to which the higher prices would have affected each customer's demand, which is undoubtedly guessed via
the application of some statistical model.
In this talk, I'll describe MCDB, which is a prototype database system that is designed for just such a scenario. MCDB allows an analyst to
attach arbitrary stochastic models to the database data in order to "guess" the values for unknown or inaccurate data, such as each customer's unseen demand function. These stochastic models are used to produce multiple possible database instances in Monte Carlo fashion (a.k.a. "possible worlds"), and the underlying database query is run over each instance. In this way, fine-grained stochastic models become first-class citizens within the database. This is in contrast to the "classical" paradigm, where high-level summary data are first extracted from the database and then fed into a separate statistical model for subsequent analysis.
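The possible-worlds loop the abstract describes can be sketched outside the database. Everything here (the demand model, its parameters, the profit query) is an invented illustration; MCDB itself runs this process inside the query engine over relational data:

```python
# Monte Carlo "possible worlds": a stochastic model fills in an unknown
# attribute (each customer's demand under a 5% margin increase), the query
# runs over each generated database instance, and the per-world answers
# form a distribution over the true answer.
import random

random.seed(42)

customers = [{"id": 1, "base_demand": 100}, {"id": 2, "base_demand": 60}]

def demand_model(cust, margin_increase):
    # Assumed model: demand drops by a noisy elasticity per unit of margin.
    elasticity = random.gauss(mu=2.0, sigma=0.5)
    return max(0.0, cust["base_demand"] * (1 - elasticity * margin_increase))

def profit_query(world, margin):
    # The "underlying database query", run unchanged over every world.
    return sum(d * margin for d in world)

worlds = 1000
answers = []
for _ in range(worlds):
    world = [demand_model(c, margin_increase=0.05) for c in customers]
    answers.append(profit_query(world, margin=0.30))

mean_profit = sum(answers) / worlds
print(mean_profit)  # expected profit across the possible worlds
```

The analyst gets back not a single number but the whole `answers` distribution, from which variance or quantiles of the "what if" profit can be read off.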
» Bio
Chris Jermaine is an associate professor in the CS Department at Rice University, where he studies data management and applied statistics. He is the recipient of a 2008 Alfred P. Sloan Foundation Research Fellowship, a National Science Foundation CAREER award, and a 2007 ACM SIGMOD Best Paper Award. He received a BA from the Mathematics Department at UCSD, an MSc from the Computer Science and Engineering Department at OSU, and a PhD from the College of Computing at Georgia
Tech. Chris grew up in Southern California. In his spare time, he enjoys running, gardening, and outdoor activities such as hiking, climbing, and whitewater boating. In one particular exploit, he and his wife floated a whitewater raft (home-made from scratch using a sewing machine, glue, and plastic) over 100 miles down the Nizina River (and beyond) in Alaska.
10/23 | Why all the fuss with Distributed Key-Value Stores? | Philip Zeyliger, Cloudera
» Abstract
Bigtable, Dynamo, Dynomite, Cassandra, HBase, Voldemort, Hypertable, MongoDB, Neptune, ... the list goes on. All of a sudden, it seems like everyone (including Google, Amazon, Facebook, Powerset, LinkedIn, ...) is building distributed key-value stores. Why all the fuss? In
this open-ended talk, we'll discuss the APIs, data model, and use cases (including schema design); then we'll foray into how these systems do and don't handle consistency, indexing, replication, and availability. Finally, we'll talk about how this all relates to frameworks like Apache Hadoop.
» Bio
Philip Zeyliger is a software engineer at Hadoop-centered startup Cloudera. Before that, he worked on storage infrastructure for user-facing applications at Google, and as a programmer at D.E. Shaw.
10/30 | Safely Analyzing Sensitive Network Data | Gerome Miklau, U. Massachusetts Amherst
» Abstract
Social and communication networks are formed by entities (such as individuals or computer hosts) and their connections (which may be contacts, relationships, or flows of information). Such networks are analyzed to understand the influence of individuals in organizations, the transmission of disease in communities, and the operation of computer networks, among many other topics. While network data can now be recorded at unprecedented scale, releasing it can result in unacceptable disclosures about participants and their relationships. As a result, privacy concerns are severely constraining the dissemination of network data and disrupting the emerging field of network science.
Our recent work investigates the properties of a network that can be accurately studied without threatening the privacy of individuals and their connections. We adopt the rigorous condition of differential privacy, and develop algorithms for releasing randomly perturbed statistics about the topology of a sensitive network. This talk will focus on two basic analysis tasks: the estimation of the degree distribution of a network and the study of small structural patterns that occur in a network (sometimes called motif analysis). We show that the degree distribution of a network can be very accurately estimated by a novel technique in which constraints are applied to the noisy output to improve utility. This technique is of general interest, and can be used to boost the accuracy of differentially private output in other tasks as well. We show that studying motifs is fundamentally harder, but can be done with acceptable accuracy if the privacy condition is relaxed.
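The "constraints applied to the noisy output" idea can be illustrated for a degree sequence: add Laplace noise to satisfy differential privacy, then exploit the fact that a sorted degree sequence is non-decreasing. The noise scale, epsilon, and the post-processing step below are illustrative assumptions; the paper's inference step is a constrained least-squares fit, approximated here by a simple running maximum:

```python
# Differentially private degree sequence with constraint-based post-processing.
import math
import random

random.seed(0)

def laplace(scale):
    # Inverse-CDF sampling of the Laplace(0, scale) distribution.
    u = random.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def private_degree_sequence(degrees, epsilon):
    # Step 1: perturb each element of the sorted sequence with Laplace noise
    # (the noise scale here is an assumed sensitivity/epsilon ratio).
    noisy = [d + laplace(2.0 / epsilon) for d in sorted(degrees)]
    # Step 2: post-process using the known constraint that the true sorted
    # sequence is non-decreasing -- enforce monotonicity on the noisy output.
    out, cur = [], float("-inf")
    for x in noisy:
        cur = max(cur, x)
        out.append(cur)
    return out

true_degs = [1, 1, 2, 3, 3, 5, 8]
priv = private_degree_sequence(true_degs, epsilon=1.0)
print(priv)  # noisy, but non-decreasing by construction
```

Because the constraint step only uses already-released noisy values, it costs no additional privacy budget while reducing error.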
» Bio
Gerome Miklau is an Assistant Professor at the University of Massachusetts, Amherst. His primary research interest is the secure management of large-scale data. This includes evaluating threats to privacy in published data, devising techniques for the safe publication of social networks, network traces, and audit logs, designing database management systems to implement security policies, and theoretically analyzing information disclosure. He received an NSF CAREER Award in 2007 and won the 2006 ACM SIGMOD Dissertation Award. He received his Ph.D. in Computer Science from the University of Washington in 2005. He earned Bachelor's degrees in Mathematics and in Rhetoric from the University of California, Berkeley, in 1995.
11/06 | ACDC - Analytics over Continuous and DisContinuous Streams | Sailesh Krishnamurthy, Truviso
» Abstract
Streaming continuous analytics systems have emerged as key solutions for dealing with massive data volumes and demands for low latency. These systems have been heavily influenced by an assumption that data streams can be viewed as sequences of ordered data. The reality, however, is that streams are not continuous, and disruptions of various sorts, in the form of either big chunks of late-arriving data or arbitrary failures, are endemic. We argue, therefore, that stream processing needs a fundamental rethink, and advocate a unified approach providing Analytics over Continuous and DisContinuous (ACDC) streams of data. Our approach is based on a simple insight: partially process independent runs of data and defer the consolidation of the associated partial results until the results are actually needed. Not only does our approach provide the first real solution to the problem of data that arrives arbitrarily late, it also lets us solve a host of hard problems, such as parallelism, recovery, transactional consistency, and high availability, that have been neglected by streaming systems. In this talk we describe the Truviso ACDC approach and outline some of the key technical arguments and insights behind it.
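The "partially process independent runs, consolidate on demand" insight can be sketched for a simple SUM aggregate. The class, run identifiers, and data below are illustrative, not Truviso's actual design:

```python
# Independent runs of (possibly late-arriving) data are aggregated
# separately; consolidation of the partial results is deferred until a
# result is actually requested, so a late chunk never forces reprocessing
# of the already-aggregated live stream.
from collections import defaultdict

class RunningAggregates:
    def __init__(self):
        # One partial SUM per independent run of data.
        self.partials = defaultdict(float)

    def ingest(self, run_id, value):
        self.partials[run_id] += value  # each run is processed independently

    def result(self):
        # Consolidate on demand, only when the answer is actually needed.
        return sum(self.partials.values())

agg = RunningAggregates()
for v in [1.0, 2.0, 3.0]:
    agg.ingest("live", v)
agg.ingest("late-batch", 10.0)  # a big chunk of late-arriving data
total = agg.result()
print(total)  # 16.0
```

The same separation of per-run partials from on-demand consolidation is what makes parallelism and recovery tractable: runs can be computed on different nodes or replayed independently.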
» Bio
Sailesh Krishnamurthy, Vice President of Technology and Founder, Truviso, Inc.
Sailesh Krishnamurthy is responsible for setting and driving the overall technical strategy and direction for the Truviso product and solution portfolio. In addition, he works in close collaboration with the marketing, sales, and engineering teams in managing the product and solution roadmap, performance engineering, and technology evangelism. Previously, he built and managed the initial engineering, services, and support teams at Truviso. Sailesh is a leading authority in the field of enterprise data management, with over a dozen published academic papers and several U.S. patents. Sailesh investigated the technical ideas at the heart of Truviso's products as part of his doctoral research on stream query processing, earning a PhD in Computer Science from UC Berkeley in 2006. Prior to graduate work at Berkeley, he worked at the Database Technology Institute at IBM Corporation, where he designed and developed advanced features in IBM database products. Earlier, he worked on a Java virtual machine implementation at Netscape Communications. Sailesh has a Master's degree in Computer Science from Purdue University and a Bachelor's degree in Electrical Engineering from the Birla Institute of Technology and Science in Pilani, India.
11/13 | Addressing New Challenges in Data Stream Processing | Yanlei Diao, U. Massachusetts Amherst
» Abstract
Data stream processing has found application in many areas including environmental monitoring, object tracking and monitoring, and business analytics. While the foundation for data stream processing has been developed in prior work, recent real-world deployments are raising a host of new challenges.
The first challenge that we address regards uncertain data streams, where data can be incomplete, imprecise, and even misleading. Feeding such data streams to existing stream systems produces results of unknown quality. In the main part of the talk, I present the design of a data stream system that captures data uncertainty from data collection to query processing to final result generation, with a focus on its data model and processing algorithms for complex relational operators. Other challenges to data stream systems include the need to extend the data model from set-based to sequence-based and the need to archive and index data streams to answer continuous queries. In the rest of the talk, I survey two other projects that address these challenges.
» Bio
Yanlei Diao is an Assistant Professor at the Department of Computer Science, University of Massachusetts Amherst. Her research interests are in information architectures and data management systems, with a focus on data streams, uncertain data management, flash memory databases, and XML query processing. She received her PhD in Computer Science from the University of California, Berkeley in 2005, her M.S. in Computer Science from the Hong Kong University of Science and Technology in 2000, and her B.S. in Computer Science from Fudan University in China in 1998.
Yanlei Diao is a recipient of the NSF CAREER Award and a finalist for the Microsoft Research New Faculty Fellowship. She spoke at the Distinguished Faculty Lecture Series at the University of Texas at Austin in December 2005. Her PhD dissertation "Query Processing for Large-Scale XML Message Brokering" won the 2006 ACM-SIGMOD Dissertation Award Honorable Mention. She has served on the program committees for many international conferences and the organization committees for SIGMOD and DMSN. She is a main contributor to YFilter 1.0 (http://yfilter.cs.umass.edu/code_release.htm), a high-performance filtering system over XML message streams.
12/04 | A Faustian Bargain: Terabytes of Inscrutable Data | Brian Dolan, Fox Interactive Media
» Abstract
In this talk I will discuss the practical challenge of data analysis at the terabyte and even petabyte scale. Using examples from MySpace and Fox Audience Network, I will share methods which yield insight with minimal management overhead.
» Bio
I am currently the Director of Research Analytics at Fox Audience Network (FAN). In this role I support research scientists, marketing analysts and financial strategists. I hold Master's Degrees in both Pure Mathematics and Biomathematics. As a professional research scientist I have developed dozens of methods to statistically analyze massive data sets. Also I tell a lot of good jokes.
12/11 | Declarative Secure Distributed Systems | Boon Thau Loo, U. of Pennsylvania
» Abstract
In this talk, I present our recent work on using declarative languages to specify, implement, and analyze secure distributed systems. In the first half of the talk, I first describe Secure Network Datalog (SeNDlog), a declarative language that unifies declarative networking and logic-based access control languages. SeNDlog enables network routing, distributed systems, and their security policies to be specified and implemented within a common declarative framework. I will focus on a specific use-case, based on an extensible platform for Application-Aware Anonymity (A3) that we have developed, and describe ongoing collaborative efforts at integrating our work with LogicBlox, an emerging commercial Datalog-based platform for enterprise software systems.
In the second half of the talk, I introduce the notion of network provenance naturally captured within our declarative framework, and demonstrate its applicability in the areas of network debugging, analysis and trust management. I further discuss ongoing work at optimizing distributed query processors in order to process and maintain network provenance efficiently and securely. Details of this project and our research group are available at http://netdb.cis.upenn.edu/.
» Bio
Boon Thau Loo is an Assistant Professor in the Computer and Information Science department at the University of Pennsylvania. He received his Ph.D. degree in Computer Science from the University of California at Berkeley in 2006. Prior to his Ph.D, he received his M.S. degree from Stanford University in 2000, and his B.S. degree with highest honors from UC Berkeley in 1999. His research focuses on distributed data management systems, Internet-scale query processing, and the application of data-centric techniques and formal methods to the design, analysis and implementation of networked systems. He was awarded the 2006 David J. Sakrison Memorial Prize for the most outstanding dissertation research in the Department of EECS at UC Berkeley, and the 2007 ACM SIGMOD Dissertation Award. He is a recipient of the NSF CAREER award (2009). He is also the program co-chair for the CoNEXT 2008 Student Workshop and the NetDB 2009 workshop co-located with SOSP.
» Previous Semester Schedules:
» To subscribe to the mailing list:
To subscribe to a list, send e-mail to
majordomo@db.cs.berkeley.edu,
with a message (not a Subject line!) containing only the words:
subscribe <list name>
As an example, one database wannabe might send the message:
To: majordomo@db.cs.berkeley.edu
From: turing@acm.org (Alan Turing)
Subject: I wannabe!
subscribe dblunch
Unsubscribing is similar:
To: majordomo@db.cs.berkeley.edu
From: turing@acm.org (Alan Turing)
Subject: can't make it any more
unsubscribe dblunch