One of the world’s largest digital libraries opens doors to text-mining scholars

May 18, 2016

Who influenced Charles Darwin when he was writing his pioneering work on evolution, "On the Origin of Species"? IU professor Colin Allen wants to know, and the HathiTrust Research Center may now hold the answer.

The HathiTrust Research Center, a cooperative service of IU, the University of Illinois, and HathiTrust, has expanded its services to support computational research on the entire collection held by HathiTrust, one of the world’s largest digital libraries. HathiTrust’s collections hold over 14 million digitized volumes, including more than 7 million books, more than 725,000 US federal government documents, and more than 350,000 serial publications. They are drawn from some of the largest research libraries in North America, including Indiana University and the University of Illinois.

Previously, the HathiTrust Research Center supported analysis of only the public domain subset of the HathiTrust collection. The center is now the only place where scholars like Allen can perform text mining on the entire collection: they can run algorithms against all 14 million volumes and make new connections and discoveries in the process.

Text mining is crucial to Allen’s research. As a member of the IU Department of History and Philosophy of Science and Medicine and IU’s cognitive science program, he is collaborating with informatics professor Simon DeDeo and graduate student Jaimie Murdock to study how Darwin’s reading influenced his theory of evolution. They can now use the HathiTrust collection, developing algorithms to analyze the books and journals Darwin himself read in the 1800s.

"We have only scratched the surface of what is possible," Allen said. "Using advanced computing, scholars will be able to analyze patterns in millions of books and understand how individual authors, who are limited to selectively reading just a few thousand of them, nevertheless manage to make creative and innovative contributions that ripple throughout the entire culture." 

"Supporting innovative uses of the collections we are preserving is a vital part of our mission," said Mike Furlough, executive director of HathiTrust. "The HathiTrust Research Center is an essential part of the HathiTrust partnership. Its secure environment for computational analysis, coupled with the expanded services, is an absolute game changer for science and scholarship." 

Staff members of the IU Pervasive Technology Institute and the Data to Insight Center have helped expand the service to support 14 million volumes.

"The big data infrastructure of HTRC ensures that researchers will retain access to the collection even as it grows in size," said Beth Plale, Indiana co-director of the HathiTrust Research Center and professor of informatics and computing at IU. "A researcher carrying out text mining on millions of texts needs both tools and the help of HTRC experts in high performance mining techniques. HTRC research staff bridge the gap between the researcher and the data."

At first, researchers will be able to access the center's collection through its Advanced Collaborative Services grants. This peer-reviewed grant process gives awardees dedicated HathiTrust staff time.

The center expects to make the full collection available through its secure data capsules in spring 2017. A features data set, derived from the full collection at both volume level and page level, will be released in fall 2016.

"The upcoming release of the extracted features data derived from the full collection will enable researchers to have hands-on access to HathiTrust materials, allowing scholars to refine their research questions for the corpus in the comfort of their own labs. Another game-changing breakthrough for HTRC," said J. Stephen Downie, the Illinois co-director of the center and a professor at the Graduate School of Library and Information Science at the University of Illinois.

"This step exemplifies how researchers combine computer science, informatics, humanities, and cyberinfrastructure in ways that enable new forms of scholarship," said Brad Wheeler, IU vice president for information technology and interim dean of the IU School of Informatics and Computing. "IU is proud to be a co-founder, operator, and research partner in all that the HathiTrust has accomplished as one of the world’s foremost digital libraries."
