iSchool's "Brown Dog" Project Develops Cloud Services for Un-curated Big Data

Institutions of all types—government agencies, research centers, corporations and cultural heritage institutions—are dealing with a deluge of data, leaving a wealth of knowledge trapped in legacy file formats that are no longer accessible. Researchers at the University of Maryland's College of Information Studies, Maryland’s iSchool, have teamed up with the National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign (UIUC) to combat this problem by providing scalable infrastructure services in the cloud that re-open existing big data collections for appraisal, analysis and reuse.

This partnership between two Big 10 / Committee on Institutional Cooperation (CIC) universities focuses on file format conversion and information extraction services. At the University of Maryland, it is led by Richard Marciano, professor and director of the newly formed Digital Curation Innovation Center (DCIC), with Michael Kurtz, visiting professor and associate director, and Greg Jansen, research software architect. At NCSA, it is led by Kenton McHenry with his team of developers and domain scientists. Named "Brown Dog", the project is creating key information extraction services that unlock un-curated data and make it useful again. These services open up the intellectual content trapped in legacy data so that it is available to scientists and cultural institutions, giving scientists meaningful access to data that can be used to reproduce results and conduct new data-driven research.

The UMD iSchool team is assembling a unique data training and testing facility called the Cyber Infrastructure for Billions of Electronic Records (CI-BER) testbed. The CI-BER testbed currently represents 72 million records and 57 terabytes of data from decades of legacy records across 135 government agencies. The team's goal is to enable new forms of archival analytics at scale.

“We hope to provide access to big data training sets, accelerate the development of digital curation algorithms and services, and teach students practical digital curation skills,” says Richard Marciano.  

This project is supported by a five-year, $10.5 million award from the National Science Foundation (NSF) through its new Data Infrastructure Building Blocks (DIBBs) program. Out of a total of 25 projects and $63 million in funding, it is the largest of the implementation awards to date. The University of Maryland portion of the project represents $1.4 million of the funding.

The University of Maryland team is partnering with Archive Analytics Solutions and NetApp to develop technology solutions for the project. Archive Analytics is an archival software solutions provider and the developer of the standards-based Alloy network appliance for controlled archiving of file-based content. It has been brought into the project as a vendor-partner to help set up a petabyte-class archive and object hosting for the CI-BER testbed collection. The team is calling this new system the “DataCave”. It is located at the UMD Cyberinfrastructure Center in the Rivertech Building.

"Many of us strive to create something that will live on to have the broad impact that the NCSA Mosaic Web browser did," says Kenton McHenry of the University of Illinois, referring to the first widely popular graphical Web browser, also developed at NCSA. "It is our hope that Brown Dog will serve as the beginnings of yet another such indispensable component for the internet of tomorrow."

For more information, visit browndog.ncsa.illinois.edu or dcic.umd.edu.