English was the founding language of the internet, and in the late 1990s, it’s estimated that 80% of web content was in English. Today, with the growth of the web around the globe, we’re seeing a boom of content in other languages. It’s now estimated that only a little over half of web content is in English. Of the non-English content, we see amazing diversity – Russian 6%, German 5.9%, Spanish 5%, French 4%, Japanese 3.4%, and the remaining 22% is divided among another 34 languages.
For English speakers, this growing language diversity creates a challenge, from booking a local hotel in Croatia to looking up facts about a Norwegian brewery. There are very few online English translation tools available for languages that have a small digital footprint. As it stands, analysts with academic or national security translation needs often must wade through multilingual content manually.
The research team, including project lead Dr. Kathy McKeown of Columbia University and UMD project lead Dr. Douglas Oard of the UMD College of Information Studies, is tasked with creating a powerful online search engine tool with built-in multilingual-to-English translation. The $13.4M project is funded by the U.S. Intelligence Advanced Research Projects Activity (IARPA), which invests in high-risk, high-payoff research that addresses some of the most difficult scientific challenges faced by the U.S. intelligence community.
The system they are building, called SCRIPTS (System for Cross Language Information Processing, Translation and Summarization), uses machine-learning algorithms to sift through large amounts of human language. SCRIPTS incorporates four key areas of language processing into one robust platform: speech recognition, machine translation, cross-language retrieval, and information summarization.
When completed, SCRIPTS will allow users to search through foreign-language web content, including videos, news broadcasts, newspapers, social media posts, and text documents, using English search terms. SCRIPTS will also translate search result summaries into English.
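To make the idea of searching foreign-language content with English terms more concrete, here is a minimal Python sketch of one common approach to cross-language retrieval: translate the query, then match it against documents in their original language. This is not how SCRIPTS itself is implemented, which the article does not describe. The glossary, the sample documents, and the function names are all invented for illustration, and a real system would rely on machine translation and far more sophisticated ranking.

```python
from collections import Counter

# Hypothetical English-to-Croatian glossary (invented for this sketch).
# Words are kept uninflected so that simple exact matching works.
GLOSSARY = {
    "hotel": ["hotel"],
    "beach": ["plaža"],
    "brewery": ["pivovara"],
    "beer": ["pivo"],
}

# Toy document collection (invented for this sketch).
DOCUMENTS = {
    "doc1": "hotel uz plaža s pogledom na more",
    "doc2": "pivovara nudi domaće pivo",
    "doc3": "hotel u centru grada",
}


def translate_query(english_query):
    """Map each English query word to its glossary translations, if any."""
    terms = []
    for word in english_query.lower().split():
        terms.extend(GLOSSARY.get(word, []))
    return terms


def search(english_query):
    """Return IDs of matching documents, best match first."""
    query_terms = translate_query(english_query)
    scores = {}
    for doc_id, text in DOCUMENTS.items():
        counts = Counter(text.lower().split())
        # Score is the number of occurrences of translated query terms.
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            scores[doc_id] = score
    # Sort by descending score, breaking ties by document ID.
    return sorted(scores, key=lambda d: (-scores[d], d))


if __name__ == "__main__":
    print(search("beach hotel"))
```

In this toy setup, an English query such as "beach hotel" is turned into Croatian terms before matching, so the user never has to type in the document language. Translating the summaries of the results back into English, as the article describes, would be a separate step.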
Although translation into English is the current focus, the team imagines that future evolutions could be used to translate web content from any language into any language – opening up the web to the 80% of non-English-speaking users who have limited web content available in their native language.