Safely Searching Among Sensitive Content

Today's search engines are designed principally to help people find what they want to see. Paradoxically, the fact that search engines do this well means that there are many collections that can't be searched. Citizens cannot yet search some government records because of intermixed information that may need to be protected. Scholars are not yet allowed to see much of the growing backlog of unprocessed archival collections for similar reasons. These limitations, and many more, are direct consequences of the fact that current search engines can only protect sensitive content if that content has been marked in advance. As the volume of digital content continues to increase, current approaches based on manually finding and marking all of the sensitive content in a collection cannot affordably keep up with the scale of the challenge. This project will address that challenge by creating a new class of search algorithms designed to balance the searcher's interest in finding relevant content with the content provider's interest in protecting sensitive content. This technology will benefit society by changing the way we approach challenges such as government transparency, personal and enterprise information management, civil litigation and regulatory investigations, and scholarly access to archival materials.

The project will leverage evaluation-driven information retrieval techniques to optimize a unified objective function that balances the value of finding relevant content with the imperative to protect sensitive information. This will require developing a new class of evaluation measures that are sensitive to both value (relevance) and cost (sensitivity). Factorial vignette survey techniques will be used to elicit the context-appropriate balance of access and restriction for representative applications. The survey results will then be used to inform the design of the feature sets on which evaluation-driven information retrieval techniques depend. Initial experiments will be conducted in protected environments, both locally and as shared-task evaluations on collections that can be licensed for research use under terms that preclude inappropriate disclosure. Ultimately, the project seeks to develop a process for evaluating algorithms for search among sensitive content using an algorithm deposit model, in which the executable search algorithm is sent to the protected data and only manually vetted evaluation results are returned to participants.
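To make the idea of an evaluation measure that is sensitive to both value and cost more concrete, here is a minimal sketch in Python. It is an illustration only, not the project's actual measure: the function name `cost_sensitive_gain`, the `gain` and `penalty` weights, the logarithmic rank discount, and the rule that sensitivity takes precedence over relevance are all assumptions made for this example.

```python
import math


def cost_sensitive_gain(ranking, relevant, sensitive, gain=1.0, penalty=2.0):
    """Score a ranked list by rewarding relevant documents and
    penalizing sensitive documents that are shown to the searcher.

    All parameters are hypothetical: `gain` is the value of showing a
    relevant document, `penalty` is the cost of showing a sensitive one.
    """
    score = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        # Documents near the top of the list count more (assumed discount).
        discount = 1.0 / math.log2(rank + 1)
        if doc_id in sensitive:
            # Assumption: a sensitive document incurs the cost even if
            # it is also relevant.
            score -= penalty * discount
        elif doc_id in relevant:
            score += gain * discount
    return score


relevant = {"a"}
sensitive = {"b"}

# A ranking that withholds the sensitive document...
safe_score = cost_sensitive_gain(["a"], relevant, sensitive)
# ...versus one that returns it alongside the relevant document.
leaky_score = cost_sensitive_gain(["a", "b"], relevant, sensitive)
```

Under these assumed weights, the ranking that withholds the sensitive document scores higher than the one that reveals it. In practice, the relative size of `gain` and `penalty` is exactly the kind of context-dependent balance that the survey work described above is meant to elicit.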

Duration: September 2016 - August 2019
Funder: National Science Foundation
Total Award Amount: $532,000

Principal Investigator:

Additional Investigators