Title: Parallel Index and Query for Large Scale Data Analysis
Speaker: John Wu, Lawrence Berkeley Lab
Date&Time: Friday, October 21, 12:30pm-1:30pm (there will be lunch from 12:15)
Location: Big RADLab meeting room (465H)  

Abstract:
Modern scientific datasets present numerous data management and analysis challenges. State-of-the-art index and query technologies are critical for facilitating interactive exploration of large datasets, but numerous challenges remain in designing a system for processing general scientific datasets. The system needs to be able to run on distributed multi-core platforms, efficiently utilize underlying I/O infrastructure, and scale to massive datasets. We present FastQuery, a novel software framework that addresses these challenges. FastQuery utilizes a state-of-the-art index and query technology (FastBit) and is designed to process massive datasets on modern supercomputing platforms. We apply FastQuery to the processing of a massive 50TB dataset generated by a large-scale accelerator modeling code. We demonstrate the scalability of the tool to 11,520 cores. Motivated by the scientific need to search for interesting particles in this dataset, we use our framework to reduce search time from hours to seconds.

Short bio:
Dr. Wu works on a broad range of topics in scientific data management, data analysis, and distributed computing. He has developed bitmap indexing techniques for searching large datasets, restarting strategies for computing extreme eigenvalues, and algorithms for image analysis. He has developed a number of open-source software packages, such as FastBit for indexing large datasets and TRLan for computing eigenvalues. The FastBit software has received an R&D 100 Award and is used by many organizations. For example, the University of Hamburg uses FastBit in a drug discovery project, and Yahoo! uses it to sift through terabytes of advertisement-related data daily.

 
