[Increase the Throughput of Non-Relational Databases through Theoretical Modeling and Optimization]:
The explosive growth of data is driving the rapid evolution of massive data-storage systems. These systems are widely used, not only in large-scale Internet services, but also in scientific projects in diverse areas such as astronomy, geography, and genetics. This project will increase the efficiency of these data-storage systems, allowing more data to be processed at lower cost. The potential societal impact is large, as science and engineering research becomes more cost-effective.
More specifically, this project will work on improving non-relational databases with log-structured merge-tree storage architectures. One main focus will be on improving a key component of such systems, namely, compaction policies. Compaction policies are not yet well understood, but are crucial for system performance. To date, compaction policies have been designed by trial and error, guided mainly by empirical experience. The project will develop analytical models for compaction, validate and refine the models with empirical testing, design improved policies that are optimal according to the models, and deploy these policies in live systems. Further, the developed theoretical models will be leveraged to optimize non-relational database systems in handling high volumes of dynamic continuous queries, which arrive and expire rapidly.
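As an illustration of the kind of compaction policy the project models, the sketch below shows a simple size-tiered policy that merges groups of similarly sized runs. This is a hypothetical, minimal example: the names (`Run`, `pick_compaction`) and the parameters (`min_merge`, `size_ratio`) are illustrative, not taken from any particular system or from the project's own models.

```python
# Hypothetical sketch of a size-tiered compaction policy for an
# LSM-tree. Names and parameters are illustrative, not from a real system.
from dataclasses import dataclass

@dataclass
class Run:
    """An immutable sorted run on disk, characterized by its size in bytes."""
    size: int

def pick_compaction(runs, min_merge=4, size_ratio=1.5):
    """Return a group of runs to merge: the largest set of similarly
    sized runs (each within size_ratio of the smallest in the group),
    or an empty list if no group reaches min_merge runs."""
    runs = sorted(runs, key=lambda r: r.size)
    best = []
    i = 0
    while i < len(runs):
        # Grow a group of runs whose sizes stay within the ratio bound.
        j = i
        while j < len(runs) and runs[j].size <= runs[i].size * size_ratio:
            j += 1
        group = runs[i:j]
        if len(group) >= min_merge and len(group) > len(best):
            best = group
        i += 1
    return best
```

The write-amplification versus read-amplification trade-off made by parameters like `size_ratio` is exactly the kind of choice the project's analytical models aim to optimize, rather than tune by trial and error.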
[Information Discovery on Domain Data Graphs]:
An increasing amount of data is stored in an interconnected manner. Such data range from the Web (hyperlinked pages), to bibliographic data (citation graphs), to biological data (associations between proteins, genes, and publications), to clinical data (associations between patients, hospitalizations, exams, and diagnoses).
A critical need in leveraging the available data is enabling information discovery: given a question (query), find the pieces of data or the associations between them in the data graph that are "good" (relevant, authoritative, and specific) for the query, and rank them according to their "goodness".
Submitting such queries should not require knowledge of a complex query language (e.g., SQL) or of the details of the data (e.g., schema). Unfortunately, little has
been done to provide high-quality information discovery on data graphs in domains other than the Web, where search engines have been successful.
This project will facilitate effective information discovery on domain data (biological, clinical, patent, e-commerce, spatial), which can lead to cost savings and increased research productivity in these domains.
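A common way to realize this kind of keyword search over a data graph is to rank candidate answer roots by how closely they connect nodes matching each query keyword. The sketch below is a minimal, hypothetical version of that idea (all names are illustrative); real systems replace the brute-force BFS with graph indexes and richer "goodness" measures.

```python
# Minimal sketch of keyword search on a data graph: score each candidate
# root by the total distance to the nearest node matching each keyword.
# Smaller totals correspond to more specific (tighter) answers.
from collections import deque

def bfs_dist(graph, src):
    """Unweighted shortest-path distances from src over an adjacency dict."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in graph.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def rank_answers(graph, node_text, keywords, top_k=3):
    """Return up to top_k (score, root) pairs; a root qualifies only if it
    can reach some node containing every keyword."""
    matches = {k: [n for n, t in node_text.items() if k in t]
               for k in keywords}
    dists = {n: bfs_dist(graph, n) for n in graph}
    scored = []
    for root in graph:
        total, ok = 0, True
        for k in keywords:
            ds = [dists[root][m] for m in matches[k] if m in dists[root]]
            if not ds:
                ok = False
                break
            total += min(ds)
        if ok:
            scored.append((total, root))
    return sorted(scored)[:top_k]
```

Note that the user supplies only keywords, not a structured query: no knowledge of SQL or of the graph's schema is required, which is the usability goal stated above.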
[A Collaborative Adaptive Data Sharing Platform]:
The increased popularity of domain social networking and blogs is creating a huge amount of shared data. Properly annotating this data would
allow its effective searching and analysis. Consider as a specific motivating application a disaster mitigation collaboration network for businesses.
Using keyword search to find open child care locations after a hurricane would require sifting through hundreds of shared documents. Current data sharing platforms
provide little help to users in annotating their data effectively and effortlessly in a way that serves the information demands of other users. The long-term goal of this project
is to leverage the collective knowledge of communities to increase the utility of shared information.
The objective of this project is to create the knowledge and techniques to allow the users of an application domain to effectively and effortlessly annotate, share and query data,
by exploiting past user interactions, i.e., data annotations, the query workload, and user relevance feedback. A key novelty of the proposed Collaborative Adaptive Data Sharing Platform (CADS)
is that past user interactions are leveraged to effectively annotate the data at insertion time.
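One simple instantiation of insertion-time annotation is to rank candidate annotation attributes by how often past queries asked for them, restricted to attributes whose known values actually occur in the new document. The sketch below is a hypothetical illustration of this idea (all names and the scoring rule are assumptions, not the CADS design).

```python
# Hypothetical sketch: suggest annotation attributes for a new document
# by combining document content with the past query workload (demand).
from collections import Counter

def suggest_attributes(doc_text, query_log, known_values, top_k=2):
    """An attribute is a candidate if one of its known values occurs in
    the document text; candidates are ranked by how often the attribute
    appears in past queries (the community's information demand)."""
    demand = Counter(attr for q in query_log for attr in q)
    candidates = {attr for attr, vals in known_values.items()
                  if any(v in doc_text for v in vals)}
    return sorted(candidates, key=lambda a: -demand[a])[:top_k]
```

In the disaster-mitigation example above, a document mentioning "child care" and a city name would be prompted for the attributes that past queries demanded most, instead of being left unannotated.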
[Biomedical Data Management]:
An increasing amount of data is generated to support health care, ranging from Electronic Health Records (EHRs) to biomedical literature. The goal of this group is to create tools and methods to maximize the utility of
this data in terms of quality of care and cost savings.
[ASTERIX: A Highly Scalable Parallel Platform for Semistructured Data Management
and Analysis]: Over the past 10-15 years, the evolution of the human side of the Web (powered by HTML and HTTP) has revolutionized the way that
most of us find things, buy things, and interact with our friends and colleagues, both within and across organizations. Behind the scenes,
semistructured data formats and Web services are having a similar impact on the machine side of the Web. In semistructured data formats, of which
XML is the de facto standard, information normally contained in a database schema or type definition is contained within the data, making it
self-describing. XML is enriching the information on the Web and our ability to find it and interchange it meaningfully, as are RDF and JSON. Many
industry verticals have created XML-based standards to support inter-organization data exchange and processes, and XML-based backbones such as
enterprise service buses (ESBs) have gained significant adoption in industry in support of Service-Oriented Architecture (SOA) initiatives. XML is
increasingly being used for document markup as well, which was its original purpose, and the Web-service-driven Software as a Service (SaaS) trend
is changing the way that many organizations will access and use large software applications in the future. As a result, current indications are that
the IT world will soon be awash in a sea of semistructured data, much of it XML data, and that semistructured data and services will likely play
an increasingly prominent role in the IT landscape for many years to come.
In anticipation of the semistructured information explosion, this proposal targets the problems of ingesting, storing, indexing, processing,
managing, and monitoring vast quantities of semistructured data with the emphasis being on vastness, i.e., scale. The project involves challenges
related to parallel databases, semistructured data management, and data-intensive computing. To that end, the proposal brings together a team of
five researchers, drawn from three UC campuses, with expertise spanning structured, semistructured, and unstructured data.
[Tools to Mine and Index Trajectories of Physical Artifacts]: The project proposes
to develop computational methods and tools for the discovery of spatio-temporal patterns in the distribution and historical development of physical
artifacts important to anthropology, including a shape recognition system that allows researchers to compare numerous projectile
points and petroglyphs according to several criteria. This will involve the creation of a set of definitions, data representations, predicates, and algorithms,
together with intuitive and usable software tools, to enable the study of the spatio-temporal spread of physical objects. The proposal applies innovative
technology to questions central to archaeology, but the technology also has broad applicability to research in other, diverse domains.
[Access Methods for Bitemporal Databases]: Traditional databases capture only the
most current data (snapshot) of the modeled reality. While snapshot information is enough for a number of applications, it is not sufficient for
applications that require past and/or future data. Instead, a bitemporal database is needed, i.e., a database that supports both valid time (the
time when an event was valid in the modeled reality) and transaction time (the time when the database was updated). Much research has been performed
recently on access methods that support transaction time; however, not much has been done for bitemporal indexes, i.e., methods that support both
transaction and valid time on the same index structure. The objective of this project is to design efficient access methods for bitemporal
databases. A novel approach is used that reduces bitemporal queries to problems of partial persistence for which efficient access methods are then
designed. Various basic bitemporal queries are addressed, such as the bitemporal pure- and range-timeslice queries. We have also examined the problem of
temporal hashing, i.e., membership queries over time-evolving sets. Currently we are examining efficient ways to perform bitemporal joins. We are
also looking into indexing spatiotemporal databases. The results of this project aim at a more efficient implementation of temporal DBMSs.
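The two time dimensions and the pure-timeslice query can be made concrete with a small sketch. The record layout and linear scan below are purely illustrative (an efficient index, e.g., one based on partial persistence as described above, would replace the scan); the names are assumptions for this example.

```python
# Sketch of a bitemporal record and a pure-timeslice query: each record
# carries a valid-time interval (when the fact held in the modeled
# reality) and a transaction-time interval (when the database stored it).
from dataclasses import dataclass

INF = float('inf')

@dataclass
class BitemporalRecord:
    key: str
    value: str
    valid_start: int   # valid time: [valid_start, valid_end)
    valid_end: int
    tx_start: int      # transaction time: [tx_start, tx_end)
    tx_end: int

def timeslice(records, tx_time, valid_time):
    """Return the records the database believed, as of transaction time
    tx_time, to be valid at valid_time. A real bitemporal index answers
    this without scanning all records."""
    return [r for r in records
            if r.tx_start <= tx_time < r.tx_end
            and r.valid_start <= valid_time < r.valid_end]
```

A correction issued at transaction time 5 closes the old record's transaction interval and opens a new one, so queries "as of" earlier transaction times still see the superseded value.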
[An Adaptive and Scalable Architecture for Dynamic Sensor Networks]: The purpose of
this research is to develop a robust, adaptive and scalable infrastructure for a self-organizing and highly dynamic sensor network. The
distinguishing characteristic of our project is that we take a holistic approach that addresses multiple levels of the sensor network (namely the
network communication, the operating system and the analysis of the data) in an integrated way. The network communication and the operating system
should be managed in an integrated manner so as to provide a robust and adaptive infrastructure for the development of computing applications. The
inferences made by the higher layer applications and data analysis functions will determine future trajectories of the mobile agents, and may invoke
the tuning of certain parameters that determine the extent to which data is being collected and fused. These events would require the network to
re-organize itself. On the other hand, the network itself may impose constraints on where the mobile agents can move, where data may be fused, and
how different entities coordinate in order to make operations efficient. The broader impact of this work will be solutions with wide applicability,
from civil applications (such as disaster recovery management) to military applications (situation awareness in battlefield management). The
educational component of the project aims to develop a strong curriculum and activities that will increase educational awareness in sensor networks.
[Data Mining Techniques for Geospatial Applications]: The goal of this
research is to develop fundamental techniques to allow efficient and interactive knowledge discovery from large multidimensional datasets with
spatial and temporal attributes. There are two general research goals. The first involves the investigation of effective density approximation
techniques to approximate very large geospatial datasets. The proposed techniques are used both to facilitate simple exploratory data mining tasks
on large geospatial datasets, and to efficiently provide accurate approximate solutions to general data mining tasks, such as clustering,
classification and outlier detection. The second involves designing and implementing new algorithms and techniques for similarity queries in large
datasets. The educational component of the project aims to develop courses that emphasize the fundamental ideas in data mining, and introduce
students to real data mining problems. The results of this project will have a significant impact on how large multidimensional datasets are
analyzed, with applications in the fields of Geographic Information Systems, Epidemiology, and Environmental research.
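As a toy illustration of density approximation supporting a mining task, the sketch below uses a one-dimensional Gaussian kernel density estimate to flag low-density points as outliers. This is a textbook baseline, not the project's proposed technique, and all names and thresholds are illustrative; the project's contribution lies in making such estimates efficient on very large geospatial datasets.

```python
# Illustrative baseline: kernel density estimation used for outlier
# detection. Points whose estimated local density is low are flagged.
import math

def kde(points, x, bandwidth=1.0):
    """Gaussian kernel density estimate at x over 1-D points."""
    n = len(points)
    return sum(math.exp(-((x - p) / bandwidth) ** 2 / 2)
               for p in points) / (n * bandwidth * math.sqrt(2 * math.pi))

def outliers(points, threshold=0.1, bandwidth=1.0):
    """Flag points whose estimated density falls below threshold."""
    return [p for p in points if kde(points, p, bandwidth) < threshold]
```

The exact computation here costs O(n) per query point; the density-approximation techniques described above aim to answer such queries approximately from a compact summary of the dataset instead.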
[Indexing Spatiotemporal Data]: Indexing spatiotemporal data is an
important problem for many applications (global change, transportation, social and multimedia applications). The goal of this project is to provide
efficient access methods for data whose geometry changes over time. Two time-varying spatial attributes are considered, the object position and
extent. Based on the rate by which these spatial attributes change, the discrete and continuous spatiotemporal environments are identified. In the
discrete environment, spatiotemporal data changes in discrete steps. Efficient ways to answer historical queries on any past state of such
spatiotemporal data are examined. In particular, selection, neighbor, aggregate, join and similarity queries are addressed using a "partial
persistence" methodology. In the continuous spatiotemporal environment, data changes continuously. Instead of keeping the data position/extent at
discrete times (which would result in enormous update/storage requirements) the functions by which this data changes are stored. This introduces the
novel problem of indexing functions. Using this approach, selection, neighbor and aggregation queries about future locations of moving objects in
one and two dimensions are addressed. The methods used in this project are expected to achieve at least 30% improvement over traditional access
methods. The applicability of the completed work reaches multiple settings, including Geographic Information Systems and multimedia databases.
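The function-storage idea for the continuous environment can be sketched as follows: rather than recording sampled positions, each object stores a linear motion function, and future-time queries are evaluated from the functions. The class and query below are illustrative assumptions (a real system indexes the functions rather than scanning them).

```python
# Sketch of the continuous spatiotemporal model: store each object's
# motion function x(t) = x0 + v * (t - t0) instead of sampled positions,
# avoiding per-tick updates and storage.
from dataclasses import dataclass

@dataclass
class MovingObject:
    oid: int
    x0: float   # position at reference time t0
    v: float    # velocity
    t0: float   # reference time

    def position(self, t):
        return self.x0 + self.v * (t - self.t0)

def range_query(objects, t, lo, hi):
    """Selection query about a (possibly future) time t: which objects
    will lie in [lo, hi]? Evaluated directly from the motion functions;
    an index over the functions would avoid this linear scan."""
    return [o.oid for o in objects if lo <= o.position(t) <= hi]
```

An update is needed only when an object's velocity changes, which is what keeps update and storage costs low in this model.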
[Knowledge Management of Time-Varying Geospatial Data]: Geospatial datasets are
collected and processed by a variety of Federal Agencies. Such data and the information contained therein are of use to a practically limitless
array of Federal and State Agencies, and private companies. Advancements in sensor technology, computer hardware and software have resulted in the
availability of huge amounts of diverse types of geospatial datasets. Our objective in this project is to facilitate the integration of those
datasets across space and time, and to improve knowledge management over such time-varying geospatial datasets. In doing so, we will improve
accessibility to the information they contain, making it more useful to groups of users that are constantly increasing and diversifying.
[Support for Design of Evolving Information Systems]: A perennial problem in
designing information systems is enabling them to cope with continuous evolution. This problem manifests itself in many settings, including software
management, configuration management and web-page evolution. The problem is particularly important as information systems are being designed (or
re-designed) with a web-centered (typically XML-based) focus.
[Support of Historical References in Databases]: Supporting historical references
is an important problem in several areas of computer science and engineering. The significance of the database versioning problem (i.e., keeping and
accessing old versions) has been highlighted in recent database research. In the rapidly growing area of object-oriented
programming, historical references enable programmers to create new objects based on previous object versions. Traditional approaches to supporting
historical references require either large space or long reconstruction times, making the extensive use of such references prohibitive. This project
provides efficient support of historical references, i.e., fast reconstruction to any past state without sacrificing a large amount of storage
space. New, optimal ways are investigated to compress temporal data, while still providing practically random access to any past reference on these
data. Moreover, the problems of distributing the history, keeping the history of a system that evolves like a graph, keeping histories in limited
space and other general historical queries (not indexed by time but instead using historical correlation) are analyzed. The results of this project
will greatly impact the use of recorded past history by different programmers and database users. Other applications of these results to be
considered include network management tools, animation storage, and archiving of the UNIX directory system.
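A baseline that makes the space/reconstruction-time trade-off concrete is snapshot-plus-delta versioning: keep a base snapshot and replay per-version deltas to rebuild any past state. The sketch below is this naive baseline, given only for contrast; all names are illustrative, and the project's compression and persistence techniques aim to beat precisely its linear replay cost without storing full copies.

```python
# Naive historical reconstruction: base snapshot plus per-version deltas.
# Reconstruction time grows with the target version, which is the cost
# the temporal compression techniques described above aim to avoid.
def reconstruct(snapshot, deltas, target_version):
    """Rebuild the state at target_version from the version-0 snapshot.
    Delta i maps keys to new values (None = deletion) and transforms
    version i into version i + 1."""
    state = dict(snapshot)
    for delta in deltas[:target_version]:
        for key, value in delta.items():
            if value is None:
                state.pop(key, None)
            else:
                state[key] = value
    return state
```

Storing periodic snapshots shortens the replay but costs space; achieving fast access to any past state without that space penalty is the stated goal of the project.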
[Understanding Change in Spatiotemporal Data]: Spatiotemporal data appears in many
real-life applications (global change, surveillance, transportation, etc.). Together with regular attributes, such data contains topological as well as
temporal attributes. This combination creates novel, interesting problems. Moreover, spatiotemporal data is usually presented in 'streams', which
drastically affects the data processing methods. We propose general exploratory techniques that will allow the user not only to verify specific
hypotheses, but more importantly, to understand the underlying process that controls the changes recorded in spatiotemporal datasets.