Maryland Today: The National Archives Has Over 10B Undigitized Pages (ft. Doug Oard, Diana Marsh, and Katrina Fenlon)

Maryland Today Staff | UMD - October 22, 2024

These UMD experts are working to make finding records easier


Boxes of documents fill a room of the National Archives in College Park. A UMD researcher and colleagues plan to use AI to begin getting a handle on billions of undigitized documents so retrieval is easier. Photo via Wikimedia Commons

Imagine the frustration of a scholar who, after traveling thousands of miles to the National Archives in College Park, must wait hours to access a single cardboard box that may or may not contain the specific declassified meeting notes, diplomatic messages, obscure staff reports or other materials they might be seeking.

These boxes of the nation’s records fill the vast storehouses of 43 National Archives and Records Administration (NARA) facilities, holding well over 10 billion pages in all. Each box can easily contain 100 documents, and in most boxes few or none of the hard-copy documents have been digitized. Finding a specific record can feel like searching for a historical needle in a very large haystack.

Researchers at the University of Maryland, with assistance from information experts from Japan, are working to offer digital solutions to this dilemma, developing information retrieval tools that will rely on machine learning algorithms and sophisticated cameras to better identify and categorize important historical records.

Douglas W. Oard, a professor in the College of Information and the University of Maryland Institute for Advanced Computer Studies (UMIACS), is leading the project. An expert in information retrieval, Oard envisions an advanced system where even before arriving at the Archives, scholars can input their requests using natural language. With the help of the system, they will then be able to work with NARA staff to more easily identify the boxes most likely to contain content relevant to their query.

This is a massive undertaking, Oard said, with only about 2% of the vast collection currently digitized. The key, he explained, will be combining his team’s collective expertise in information retrieval, digital curation, data mining, computer vision, pattern recognition and artificial intelligence (AI) to develop an efficient, accurate and user-friendly archival query platform.

First, Oard will seek to use machine learning algorithms to assist in predicting the contents of undigitized boxes based on what is known about neighboring records in NARA’s giant College Park facility.

“We need to develop a system capable of understanding how the Archives is organized,” Oard explained. “This way, for example, we might infer that the box in the middle is likely to contain materials related to both the box on its left and the box on its right.”
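The neighboring-box idea Oard describes can be illustrated with a small sketch. This is not the team's actual method, which the article does not detail; it is a simplified stand-in that assumes boxes sit in a known shelf order, that some boxes already carry topic labels, and that closer neighbors deserve more weight. The function name `infer_box_topics`, the topic labels and the inverse-distance weighting are all hypothetical choices made for illustration.

```python
from __future__ import annotations

from collections import Counter
from typing import Optional


def infer_box_topics(shelf: list[Optional[set[str]]]) -> list[dict[str, float]]:
    """Score likely topics for every box on a shelf (illustrative only).

    `shelf` lists boxes in physical order. A described box is a set of
    topic labels; an undescribed box is None. Described boxes keep their
    own topics with score 1.0. Undescribed boxes borrow topics from the
    nearest described box on each side, weighted by 1 / distance
    (an assumed weighting, not one taken from the project).
    """
    results: list[dict[str, float]] = []
    for i, topics in enumerate(shelf):
        if topics is not None:
            results.append({t: 1.0 for t in topics})
            continue

        scores: Counter = Counter()
        for step in (-1, 1):  # look left, then right
            j = i + step
            # Skip over other undescribed boxes until a described one is found.
            while 0 <= j < len(shelf) and shelf[j] is None:
                j += step
            if 0 <= j < len(shelf):
                weight = 1.0 / abs(j - i)
                for t in shelf[j]:
                    scores[t] += weight
        results.append(dict(scores))
    return results


# Hypothetical shelf: the middle box has no description yet.
shelf = [{"trade", "Japan"}, None, {"Japan", "treaties"}]
middle = infer_box_topics(shelf)[1]
```

In this toy shelf, the topic shared by both neighbors ends up with the highest score for the middle box, which mirrors the intuition in Oard's example. A real system would presumably learn such relationships from NARA's finding aids and arrangement rather than rely on a fixed rule.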

Other assistance on the project comes from College of Information faculty members Diana Marsh, an assistant professor of archives and digital curation, and Katrina Fenlon, an assistant professor specializing in digital collections.

Additional input will come from Tokinori Suzuki, an assistant professor of informatics at Kyushu University in Japan, who currently has a visiting appointment at UMIACS. Supported by a $132,000 grant from the Japan Society for the Promotion of Science, Suzuki’s novel approach to this challenge is based on mining academic literature that has been published online for references to materials held by archival institutions. By analyzing these references, Suzuki and two colleagues from Kyushu University will gather relevant information about documents that historians have already seen but that the holding archive has not yet individually described.

David Doermann, a longtime UMIACS research scientist who is now chair of the Department of Computer Science and Engineering at the University at Buffalo, is developing a sophisticated camera system that will expand knowledge of what’s where at the Archives by leveraging the way scholars sift through the collection.

The computer vision software in the cameras will register the content of historical documents being viewed by scholars, literally looking over their shoulders (with permission) as they work. The camera system works similarly to the human eye, shifting its gaze and adjusting its focus to see documents with the acuity needed to digitize them in real time. The software will then add that data to the repository of information, from which Oard’s team can predict where unseen documents might be found.

Despite the hurdles ahead—due in no small part to the sheer volume of data involved—the research team believes its work holds profound promise.

“The project poses new challenges, but the potential to transform access to historical documents is what drives us,” Oard said. “We’re not just building technology—we’re working to enrich the future of historical research, one box at a time.”


The original article was written by Melissa Brachfeld and published by Maryland Today on October 17, 2024.