[Increase the Throughput of Non-Relational Databases through Theoretical Modeling and Optimization]:
The explosive growth of data is driving the rapid evolution of massive data-storage systems. These systems are widely used, not only in large-scale Internet services, but also in scientific projects in diverse areas such as astronomy, geography, and genetics. This project will increase the efficiency of these data-storage systems, allowing more data to be processed at lower cost. The potential societal impact is large, as science and engineering research becomes more cost-effective.
More specifically, this project will work on improving non-relational databases with log-structured merge-tree storage architectures. One main focus will be on improving a key component of such systems, namely, compaction policies. Compaction policies are not yet well understood, but are crucial for system performance. To date, compaction policies have been designed by trial and error, guided mainly by empirical experience. The project will develop analytical models for compaction, validate and refine the models with empirical testing, design improved policies that are optimal according to the models, and deploy these policies in live systems. Further, the developed theoretical models will be leveraged to optimize non-relational database systems in handling high volumes of dynamic continuous queries, which arrive and expire rapidly.
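As an illustration of the kind of compaction policy the project models, the sketch below shows a simple size-tiered policy that merges groups of similarly sized runs. This is a hypothetical, minimal example: the names (`Run`, `pick_compaction`) and the parameters (`min_merge`, `size_ratio`) are illustrative, not taken from any particular system or from the project's own models.

```python
# Hypothetical sketch of a size-tiered compaction policy for an
# LSM-tree. Names and parameters are illustrative, not from a real system.
from dataclasses import dataclass

@dataclass
class Run:
    """An immutable sorted run on disk, characterized by its size in bytes."""
    size: int

def pick_compaction(runs, min_merge=4, size_ratio=1.5):
    """Return a group of runs to merge: the largest set of similarly
    sized runs (each within size_ratio of the smallest in the group),
    or an empty list if no group reaches min_merge runs."""
    runs = sorted(runs, key=lambda r: r.size)
    best = []
    i = 0
    while i < len(runs):
        # Grow a group of runs whose sizes stay within the ratio bound.
        j = i
        while j < len(runs) and runs[j].size <= runs[i].size * size_ratio:
            j += 1
        group = runs[i:j]
        if len(group) >= min_merge and len(group) > len(best):
            best = group
        i += 1
    return best
```

The write-amplification versus read-amplification trade-off made by parameters like `size_ratio` is exactly the kind of choice the project's analytical models aim to optimize, rather than tune by trial and error.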
[Information Discovery on Domain Data Graphs]:
An increasing amount of data is stored in an interconnected manner. Such data range from the Web (hyperlinked pages), to bibliographic data (citation graphs), to biological data (associations between proteins, genes, and publications), to clinical data (associations between patients, hospitalizations, exams, and diagnoses).
A critical need in leveraging the available data is enabling information discovery: given a question (query), find the pieces of data or the associations between them in the data graph that are "good" (relevant, authoritative, and specific) for the query, and rank them according to their "goodness".
Submitting such queries should not require knowledge of a complex query language (e.g., SQL) or of the details of the data (e.g., schema). Unfortunately, little has
been done to provide high-quality information discovery on data graphs in domains other than the Web, where search engines have been successful.
This project will facilitate effective information discovery on domain data (biological, clinical, patent, e-commerce, spatial), which can lead to cost savings and increased research productivity in these domains.
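A common way to realize this kind of keyword search over a data graph is to rank candidate answer roots by how closely they connect nodes matching each query keyword. The sketch below is a minimal, hypothetical version of that idea (all names are illustrative); real systems replace the brute-force BFS with graph indexes and richer "goodness" measures.

```python
# Minimal sketch of keyword search on a data graph: score each candidate
# root by the total distance to the nearest node matching each keyword.
# Smaller totals correspond to more specific (tighter) answers.
from collections import deque

def bfs_dist(graph, src):
    """Unweighted shortest-path distances from src over an adjacency dict."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in graph.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def rank_answers(graph, node_text, keywords, top_k=3):
    """Return up to top_k (score, root) pairs; a root qualifies only if it
    can reach some node containing every keyword."""
    matches = {k: [n for n, t in node_text.items() if k in t]
               for k in keywords}
    dists = {n: bfs_dist(graph, n) for n in graph}
    scored = []
    for root in graph:
        total, ok = 0, True
        for k in keywords:
            ds = [dists[root][m] for m in matches[k] if m in dists[root]]
            if not ds:
                ok = False
                break
            total += min(ds)
        if ok:
            scored.append((total, root))
    return sorted(scored)[:top_k]
```

Note that the user supplies only keywords, not a structured query: no knowledge of SQL or of the graph's schema is required, which is the usability goal stated above.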
[A Collaborative Adaptive Data Sharing Platform]:
The increased popularity of domain social networking and blogs is creating a huge amount of shared data. Properly annotating this data would
allow its effective searching and analysis. Consider as a specific motivating application a disaster mitigation collaboration network for businesses.
Using keyword search to find open child care locations after a hurricane would require sifting through hundreds of shared documents. Current data sharing platforms
provide little help to users in annotating their data effectively and effortlessly in a way that serves the information demands of other users. The long-term goal of this project
is to leverage the collective knowledge of communities to increase the utility of shared information.
The objective of this project is to create the knowledge and techniques to allow the users of an application domain to effectively and effortlessly annotate, share and query data,
by exploiting past user interactions, i.e., data annotations, the query workload, and user relevance feedback. A key novelty of the proposed Collaborative Adaptive Data Sharing Platform (CADS)
is that past user interactions are leveraged to effectively annotate the data at insertion time.
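One simple instantiation of insertion-time annotation is to rank candidate annotation attributes by how often past queries asked for them, restricted to attributes whose known values actually occur in the new document. The sketch below is a hypothetical illustration of this idea (all names and the scoring rule are assumptions, not the CADS design).

```python
# Hypothetical sketch: suggest annotation attributes for a new document
# by combining document content with the past query workload (demand).
from collections import Counter

def suggest_attributes(doc_text, query_log, known_values, top_k=2):
    """An attribute is a candidate if one of its known values occurs in
    the document text; candidates are ranked by how often the attribute
    appears in past queries (the community's information demand)."""
    demand = Counter(attr for q in query_log for attr in q)
    candidates = {attr for attr, vals in known_values.items()
                  if any(v in doc_text for v in vals)}
    return sorted(candidates, key=lambda a: -demand[a])[:top_k]
```

In the disaster-mitigation example above, a document mentioning "child care" and a city name would be prompted for the attributes that past queries demanded most, instead of being left unannotated.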
[Biomedical Data Management]:
An increasing amount of data is generated to support health care, ranging from Electronic Health Records (EHRs) to biomedical literature. The goal of this group is to create tools and methods to maximize the utility of
this data in terms of quality of care and cost savings.
[ASTERIX: A Highly Scalable Parallel Platform for Semistructured Data Management
and Analysis]: Over the past 10-15 years, the evolution of the human side of the Web (powered by HTML and HTTP) has revolutionized the way that
most of us find things, buy things, and interact with our friends and colleagues, both within and across organizations. Behind the scenes,
semistructured data formats and Web services are having a similar impact on the machine side of the Web. In semistructured data formats, of which
XML is the de facto standard, information normally contained in a database schema or type definition is contained within the data, making it
self-describing. XML is enriching the information on the Web and our ability to find it and interchange it meaningfully, as are RDF and JSON. Many
industry verticals have created XML-based standards to support inter-organization data exchange and processes, and XML-based backbones such as
enterprise service buses (ESBs) have gained significant adoption in industry in support of Service-Oriented Architecture (SOA) initiatives. XML is
increasingly being used for document markup as well, which was its original purpose, and the Web-service-driven Software as a Service (SaaS) trend
is changing the way that many organizations will access and use large software applications in the future. As a result, current indications are that
the IT world will soon be awash in a sea of semistructured data, much of it XML data, and that semistructured data and services will likely play
an increasingly prominent role in the IT landscape for many years to come.
In anticipation of the semistructured information explosion, this proposal targets the problems of ingesting, storing, indexing, processing,
managing, and monitoring vast quantities of semistructured data with the emphasis being on vastness, i.e., scale. The project involves challenges
related to parallel databases, semistructured data management, and data-intensive computing. To that end, the proposal brings together a team of
five researchers, drawn from three UC campuses, with expertise spanning structured, semistructured, and unstructured data.
[Tools to Mine and Index Trajectories of Physical Artifacts]: The project proposes
to develop computational methods and tools for the discovery of spatio-temporal patterns in the distribution and historical development of physical
artifacts important to anthropology, including a shape recognition system that allows researchers to compare numerous projectile
points and petroglyphs according to several criteria. This will involve the creation of a set of definitions, data representations, predicates, and algorithms,
together with intuitive and usable software tools, to enable the study of the spatio-temporal spread of physical objects. The proposal applies innovative
technology to questions central to archaeology, but the technology also has broad applicability to research in other, diverse domains.
[Access Methods for Bitemporal Databases]: Traditional databases capture only the
most current data (snapshot) of the modeled reality. While snapshot information is enough for a number of applications, it is not sufficient for
applications that require past and/or future data. Instead, a bitemporal database is needed, i.e., a database that supports both valid time (the
time when an event was valid in the modeled reality) and transaction time (the time when the database was updated). Much research has been performed
recently on access methods that support transaction time; however, not much has been done for bitemporal indexes, i.e., methods that support both
transaction and valid time on the same index structure. The objective of this project is to design efficient access methods for bitemporal
databases. A novel approach is used that reduces bitemporal queries to problems of partial persistence for which efficient access methods are then
designed. Various basic bitemporal queries are addressed, such as the bitemporal pure- and range-timeslice queries. We have also examined the problem of
temporal hashing, i.e., membership queries over time-evolving sets. Currently we are examining efficient ways to perform bitemporal joins. We are
also looking into indexing spatiotemporal databases. The results of this project aim at a more efficient implementation of temporal DBMSs.
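The two time dimensions and the pure-timeslice query can be made concrete with a small sketch. The record layout and linear scan below are purely illustrative (an efficient index, e.g., one based on partial persistence as described above, would replace the scan); the names are assumptions for this example.

```python
# Sketch of a bitemporal record and a pure-timeslice query: each record
# carries a valid-time interval (when the fact held in the modeled
# reality) and a transaction-time interval (when the database stored it).
from dataclasses import dataclass

INF = float('inf')

@dataclass
class BitemporalRecord:
    key: str
    value: str
    valid_start: int   # valid time: [valid_start, valid_end)
    valid_end: int
    tx_start: int      # transaction time: [tx_start, tx_end)
    tx_end: int

def timeslice(records, tx_time, valid_time):
    """Return the records the database believed, as of transaction time
    tx_time, to be valid at valid_time. A real bitemporal index answers
    this without scanning all records."""
    return [r for r in records
            if r.tx_start <= tx_time < r.tx_end
            and r.valid_start <= valid_time < r.valid_end]
```

A correction issued at transaction time 5 closes the old record's transaction interval and opens a new one, so queries "as of" earlier transaction times still see the superseded value.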
[An Adaptive and Scalable Architecture for Dynamic Sensor Networks]: The purpose of
this research is to develop a robust, adaptive and scalable infrastructure for a self-organizing and highly dynamic sensor network. The
distinguishing characteristic of our project is that we take a holistic approach that addresses multiple levels of the sensor network (namely the
network communication, the operating system and the analysis of the data) in an integrated way. The network communication and the operating system
should be managed in an integrated manner so as to provide a robust and adaptive infrastructure for the development of computing applications. The
inferences made by the higher layer applications and data analysis functions will determine future trajectories of the mobile agents, and may invoke
the tuning of certain parameters that determine the extent to which data is being collected and fused. These events would require the network to
re-organize itself. On the other hand, the network itself may impose constraints on where the mobile agents can move, where data may be fused, and
how different entities coordinate in order to make operations efficient. The broader impact of this work will be solutions with wide applicability,
from civil applications (such as disaster recovery management) to military applications (situation awareness in battlefield management). The
educational component of the project aims to develop a strong curriculum and activities that will increase educational awareness in sensor networks.
[Data Mining Techniques for Geospatial Applications]: The goal of this
research is to develop fundamental techniques to allow efficient and interactive knowledge discovery from large multidimensional datasets with
spatial and temporal attributes. There are two general research goals. The first involves the investigation of effective density approximation
techniques to approximate very large geospatial datasets. The proposed techniques are used both to facilitate simple exploratory data mining tasks
on large geospatial datasets, and to efficiently provide accurate approximate solutions to general data mining tasks, such as clustering,
classification and outlier detection. The second involves designing and implementing new algorithms and techniques for similarity queries in large
datasets. The educational component of the project aims to develop courses that emphasize the fundamental ideas in data mining, and introduce
students to real data mining problems. The results of this project will have a significant impact on how large multidimensional datasets are
analyzed, with applications in the fields of Geographic Information Systems, Epidemiology, and Environmental research.
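As a toy illustration of density approximation supporting a mining task, the sketch below uses a one-dimensional Gaussian kernel density estimate to flag low-density points as outliers. This is a textbook baseline, not the project's proposed technique, and all names and thresholds are illustrative; the project's contribution lies in making such estimates efficient on very large geospatial datasets.

```python
# Illustrative baseline: kernel density estimation used for outlier
# detection. Points whose estimated local density is low are flagged.
import math

def kde(points, x, bandwidth=1.0):
    """Gaussian kernel density estimate at x over 1-D points."""
    n = len(points)
    return sum(math.exp(-((x - p) / bandwidth) ** 2 / 2)
               for p in points) / (n * bandwidth * math.sqrt(2 * math.pi))

def outliers(points, threshold=0.1, bandwidth=1.0):
    """Flag points whose estimated density falls below threshold."""
    return [p for p in points if kde(points, p, bandwidth) < threshold]
```

The exact computation here costs O(n) per query point; the density-approximation techniques described above aim to answer such queries approximately from a compact summary of the dataset instead.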
[Indexing Spatiotemporal Data]: Indexing spatiotemporal data is an
important problem for many applications (global change, transportation, social and multimedia applications). The goal of this project is to provide
efficient access methods for data whose geometry changes over time. Two time-varying spatial attributes are considered, the object position and
extent. Based on the rate by which these spatial attributes change, the discrete and continuous spatiotemporal environments are identified. In the
discrete environment, spatiotemporal data changes in discrete steps. Efficient ways to answer historical queries on any past state of such
spatiotemporal data are examined. In particular, selection, neighbor, aggregate, join and similarity queries are addressed using a "partial
persistence" methodology. In the continuous spatiotemporal environment, data changes continuously. Instead of keeping the data position/extent at
discrete times (which would result in enormous update/storage requirements) the functions by which this data changes are stored. This introduces the
novel problem of indexing functions. Using this approach, selection, neighbor and aggregation queries about future locations of moving objects in
one and two dimensions are addressed. The methods used in this project are expected to achieve at least 30% improvement over traditional access
methods. The applicability of the completed work reaches multiple settings, including Geographic Information Systems and multimedia databases.
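The function-storage idea for the continuous environment can be sketched as follows: rather than recording sampled positions, each object stores a linear motion function, and future-time queries are evaluated from the functions. The class and query below are illustrative assumptions (a real system indexes the functions rather than scanning them).

```python
# Sketch of the continuous spatiotemporal model: store each object's
# motion function x(t) = x0 + v * (t - t0) instead of sampled positions,
# avoiding per-tick updates and storage.
from dataclasses import dataclass

@dataclass
class MovingObject:
    oid: int
    x0: float   # position at reference time t0
    v: float    # velocity
    t0: float   # reference time

    def position(self, t):
        return self.x0 + self.v * (t - self.t0)

def range_query(objects, t, lo, hi):
    """Selection query about a (possibly future) time t: which objects
    will lie in [lo, hi]? Evaluated directly from the motion functions;
    an index over the functions would avoid this linear scan."""
    return [o.oid for o in objects if lo <= o.position(t) <= hi]
```

An update is needed only when an object's velocity changes, which is what keeps update and storage costs low in this model.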
[Knowledge Management of Time-Varying Geospatial Data]: Geospatial datasets are
collected and processed by a variety of Federal Agencies. Such data and the information contained therein are of use to a practically limitless
array of Federal and State Agencies, and private companies. Advancements in sensor technology, computer hardware and software have resulted in the
availability of huge amounts of diverse types of geospatial datasets. Our objective in this project is to facilitate the integration of those
datasets across space and time, and to improve knowledge management over such time-varying geospatial datasets. In doing so, we will improve
accessibility to the information they contain, making it more useful to groups of users that are constantly increasing and diversifying.
[Support for Design of Evolving Information Systems]: A perennial problem in
designing information systems is enabling them to cope with continuous evolution. This problem manifests itself in many settings, including software
management, configuration management and web-page evolution. The problem is particularly important as information systems are being designed (or
re-designed) with a web-centered (typically XML-based) focus.
[Support of Historical References in Databases]: Supporting historical references
is an important problem in several areas of computer science and engineering. The significance of the database versioning problem (i.e., keeping and
accessing old versions) has been highlighted in recent database research. In the rapidly growing area of object-oriented
programming, historical references enable programmers to create new objects based on previous object versions. Traditional approaches to supporting
historical references require either large space or long reconstruction times, making the extensive use of such references prohibitive. This project
provides efficient support of historical references, i.e., fast reconstruction to any past state without sacrificing a large amount of storage
space. New, optimal ways are investigated to compress temporal data, while still providing practically random access to any past reference on these
data. Moreover, the problems of distributing the history, keeping the history of a system that evolves like a graph, keeping histories in limited
space and other general historical queries (not indexed by time but instead using historical correlation) are analyzed. The results of this project
will greatly impact the use of recorded past history by different programmers and database users. Other applications of these results to be
considered include network management tools, animation storage, and archiving of the UNIX directory system.
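A baseline that makes the space/reconstruction-time trade-off concrete is snapshot-plus-delta versioning: keep a base snapshot and replay per-version deltas to rebuild any past state. The sketch below is this naive baseline, given only for contrast; all names are illustrative, and the project's compression and persistence techniques aim to beat precisely its linear replay cost without storing full copies.

```python
# Naive historical reconstruction: base snapshot plus per-version deltas.
# Reconstruction time grows with the target version, which is the cost
# the temporal compression techniques described above aim to avoid.
def reconstruct(snapshot, deltas, target_version):
    """Rebuild the state at target_version from the version-0 snapshot.
    Delta i maps keys to new values (None = deletion) and transforms
    version i into version i + 1."""
    state = dict(snapshot)
    for delta in deltas[:target_version]:
        for key, value in delta.items():
            if value is None:
                state.pop(key, None)
            else:
                state[key] = value
    return state
```

Storing periodic snapshots shortens the replay but costs space; achieving fast access to any past state without that space penalty is the stated goal of the project.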
[Understanding Change in Spatiotemporal Data]: Spatiotemporal data appears in many
real-life applications (global change, surveillance, transportation, etc.). Together with regular attributes, such data contains topological as well as
temporal attributes. This combination creates novel, interesting problems. Moreover, spatiotemporal data is usually presented in 'streams', which
drastically affects the data processing methods. We propose general exploratory techniques that will allow the user not only to verify specific
hypotheses, but more importantly, to understand the underlying process that controls the changes recorded in spatiotemporal datasets.