Graduate Information Science and Technology Program

    School of Information Sciences

    University of Pittsburgh

Network-Aware Data Management Group

    

 

Hi-DFusion: Historical Data Fusion


Data integration systems provide users with uniform data access and efficient information sharing. The ability to share information is particularly important for interdisciplinary research, where a comprehensive picture of the subject requires large amounts of historical data from disparate sources across a variety of disciplines. For example, epidemiological data analysis often relies upon knowledge of population dynamics, climate change, migration of biological species, drug development, etc. As another example, consider the task of exploring long-term and short-term social changes, which requires consolidating a comprehensive set of data on social-scientific, health, and environmental dynamics. In this project, we address the challenges in developing a global integrated repository of historical data to support a wide range of interdisciplinary research. The major tasks of the project are:

Task 1: Scalable architectures for historical data integration

Task 2: Reliable fusion of historical data

 

Task 1: Scalable architectures for historical data integration

Numerous historical data sources are now available from various groups worldwide. Such data sources, however, cannot be easily consolidated, metadata-indexed, and maintained by small groups of developers. The solution that we propose is to engage a large community of researchers to share their data, collectively resolve data heterogeneities, and harmonize their efforts in data reliability assessment and data fusion. We propose an approach, based on the collective intelligence of research communities, which supports efficient “crowdsourcing” of the large-scale historical data integration task. This research is undertaken in conjunction with the World-Historical Dataverse project (www.dataverse.pitt.edu) of the World History Center at the University of Pittsburgh and the Collaborative for Historical Information and Analysis (CHIA) initiative (http://chia.pitt.edu). CHIA currently involves nine different research groups throughout the U.S. and Europe; it aims to create a major repository of consolidated global historical data from the past several centuries. In particular, our group is developing an advanced Col*Fusion infrastructure for systematic accumulation, integration, and utilization of historical data.

Task 2: Reliable fusion of historical data

Historical data sources may have different levels of reliability for many reasons, e.g., issues with the primary sources of information, faulty data collection methodology, etc. Integration of historical data sources may also face severe data conflicts. It is common to have multiple reports about the same event within overlapping time intervals. We may also have multiple reports on historical statistics for overlapping locations. Another challenge is overlapping names: evolving concepts may be reported under different names and categories co-existing at different time intervals. Note that historical data conflicts do not necessarily imply data inconsistency. If the overlapping historical reports are accurate, the conflicts reflect data redundancy, which prevents researchers from obtaining reliable aggregate query results. Data inconsistency, by contrast, is caused by inaccurate reports. In many cases, such inconsistency can be discovered through analysis of relationships between existing reports in the integrated database. In this task we devise a systematic and efficient approach to address the problem of large-scale historical data fusion to ensure data reliability. We develop integrated data reliability analysis methods to explore data conflicts and data inconsistencies so as to provide automatic information reliability assessment.
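To make the notion of a data conflict concrete, the following is a minimal Python sketch of how reports with overlapping time intervals for the same location might be detected, and why summing them naively gives an unreliable aggregate. It is an illustration only, not the fusion method developed in this project: the Report record, its field names, the helper functions, and all values are hypothetical and invented for the example.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Report:
    """A hypothetical historical report (all fields are assumptions)."""
    source: str    # label of the reporting source
    location: str  # location the report refers to
    start: int     # first year covered (inclusive)
    end: int       # last year covered (inclusive)
    count: int     # number of events reported for the interval


def overlaps(a: Report, b: Report) -> bool:
    """True when two reports cover the same location over intersecting years."""
    return a.location == b.location and a.start <= b.end and b.start <= a.end


def find_conflicts(reports):
    """Return every pair of reports whose location and time interval overlap."""
    return [(a, b) for a, b in combinations(reports, 2) if overlaps(a, b)]


# Invented toy data: sources A and B both cover the years 1903-1904.
reports = [
    Report("A", "X", 1900, 1904, 50),
    Report("B", "X", 1903, 1909, 70),
    Report("C", "X", 1910, 1914, 30),
]

conflicts = find_conflicts(reports)

# A naive aggregate adds every report, so events from the overlapping
# years may be counted twice even if both reports are accurate.
naive_total = sum(r.count for r in reports)

for a, b in conflicts:
    print(f"{a.source} and {b.source} overlap on {max(a.start, b.start)}-{min(a.end, b.end)}")
print("naive total:", naive_total)
```

The sketch only flags the overlap; deciding how to apportion the reported counts across the shared years, or whether one of the reports is inaccurate, is the harder fusion and reliability question that this task addresses.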

PhD Students:

Ying-Feng Hsu

Evgeny Karataev

Julian Lee

Fatimah Radwan

Selected References:

  1. van Panhuis, W.G., Grefenstette, J., Jung, S.Y., Chok, N.S., Cross, A., Eng, H., Lee, B., Zadorozhny, V., Brown, S., Cummings, D., Burke, D. Surveillance and control of contagious diseases in the United States from 1888 to the present. To appear in The New England Journal of Medicine, 2013.

  2. Zadorozhny, V., Manning, P., Bain, D., Mostern, R. Collaborative for Historical Information and Analysis: Vision and Work Plan. Journal of World-Historical Information, v. 1, N.1, 2013.

  3. Zadorozhny, V., Hsu, Y.-F. Conflict-Aware Fusion of Historical Data. Proc. of the 5th International Conference on Scalable Uncertainty Management (SUM'11), 2011.

  4. Pelechrinis, K., Zadorozhny, V., Oleshchuk, V. Collaborative Assessment of Information Provider's Reliability and Expertise using Subjective Logic. Proc. of the 7th International Conference on Collaborative Computing (CollaborateCom'11), 2011.
