

3.5 Million Books 1800-2015: GDELT Processes Internet Archive and HathiTrust Book Archives and Available In Google BigQuery


“Today we are enormously excited to announce that more than 3.5 million digitized books stretching back two centuries, encompassing the complete English-language public domain collections of the Internet Archive (1.3 million volumes) and HathiTrust (2.2 million volumes), have been processed using the GDELT Global Knowledge Graph and are now available in Google BigQuery. More than a billion pages stretching back 215 years have been examined to compile a list of all people, organizations, and other names; the fulltext has been geocoded to render the books fully mappable; and more than 4,500 emotions and themes have been compiled. All of this computed metadata is combined with all available book-level metadata, including title, author, publisher, and subject tags as provided by the contributing libraries. Even more excitingly, the complete fulltext of all Internet Archive books published 1800-1922 is included to allow you to perform your own near-realtime analyses. All of this is housed in Google BigQuery, making it possible to perform sophisticated analyses across 122 years of history in just seconds. A single line of SQL can execute even the most complex regular expression or complete JavaScript algorithm over nearly half a terabyte of fulltext in just 11 seconds and combine it with all of the extracted data above. Track emotions or themes over time or map the geography of the world as seen through books. The sky is the limit!

The Internet Archive books include all books in the Archive’s American Libraries collection for which English-language fulltext was available using the search “collection:(americana)”. Note that for 1800-1922, a few hundred books per year met these criteria but are not included here, because the combined size of their extracted metadata, library-provided metadata, and fulltext exceeded 2MB, which is the maximum record size for Google BigQuery. For HathiTrust, all English-language public domain books 1800-2015 were provided by HathiTrust as part of a special research extract. Only public domain volumes were requested.”
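
To give a feel for the "single line of SQL" idea in the announcement, here is a minimal Python sketch that builds a parameterized BigQuery query counting, per year, the books whose fulltext matches a regular expression. The announcement does not give table or column names, so `BOOKS_TABLE`, `YEAR_COLUMN`, and `FULLTEXT_COLUMN` are hypothetical placeholders, as is the helper `build_regex_count_query`. It only assembles the query text and its parameters; actually running it would need a BigQuery client and the real schema.

```python
from typing import Dict, Tuple

# Hypothetical identifiers: the announcement does not name the tables or
# columns, so replace these with the real schema before running anything.
BOOKS_TABLE = "my-project.my_dataset.books"
YEAR_COLUMN = "publication_year"
FULLTEXT_COLUMN = "fulltext"


def build_regex_count_query(
    pattern: str, first_year: int, last_year: int
) -> Tuple[str, Dict[str, object]]:
    """Build SQL (plus named parameters) counting matching books per year.

    The regular expression and the year bounds are passed as named query
    parameters (@pattern, @first_year, @last_year) rather than pasted into
    the SQL text, so no manual quoting or escaping is needed.
    """
    if first_year > last_year:
        raise ValueError("first_year must not be greater than last_year")

    sql = (
        f"SELECT {YEAR_COLUMN} AS year, COUNT(*) AS matching_books\n"
        f"FROM `{BOOKS_TABLE}`\n"
        f"WHERE {YEAR_COLUMN} BETWEEN @first_year AND @last_year\n"
        f"  AND REGEXP_CONTAINS({FULLTEXT_COLUMN}, @pattern)\n"
        "GROUP BY year\n"
        "ORDER BY year"
    )
    params = {
        "pattern": pattern,
        "first_year": first_year,
        "last_year": last_year,
    }
    return sql, params


if __name__ == "__main__":
    # Example: books mentioning "telegraph", "telegraphs", or "telegraphic".
    query, query_params = build_regex_count_query(
        r"\btelegraph(s|ic)?\b", 1800, 1922
    )
    print(query)
    print(query_params)
```

The year range 1800-1922 in the usage example mirrors the span for which the announcement says Internet Archive fulltext is included; everything else about the schema is an assumption.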

Stephen

Posted on: September 16, 2015, 6:51 am Category: Uncategorized
