The HathiTrust Research Center (HTRC) provides computational research access to the HathiTrust Digital Library, a shared digital library of more than 14 million volumes. It is similar to Google Books, but focused on scholarly materials.
There are three primary modes of access to texts in HTRC:
Note: if you are familiar with Jupyter and Python, see the UC Berkeley Data Science/Library HTRC Module in GitHub for a more detailed walk-through of the information below.
You can export Extracted Features for the volumes in your worksets. These include volume metadata, tokens (unigrams) and sentence counts per page, an unordered list of all tokens with their frequencies, and more. This data does not allow you to analyze the text at the level of syntax, but it does enable "bag-of-words" methods such as topic modeling. For the steps below, you will need to be comfortable with the command line, the HT Feature Reader, and rsync. To work with Extracted Features:
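As a rough illustration of what "bag-of-words" means here, the sketch below merges per-page token counts into a single volume-level frequency table. It does not use the HT Feature Reader or the real Extracted Features file format: the `sample_pages` data and the `volume_bag_of_words` helper are hypothetical stand-ins, assuming only that each page can be reduced to a mapping from token to count.

```python
from collections import Counter


def volume_bag_of_words(pages):
    """Merge per-page token counts into one volume-level bag of words.

    `pages` is assumed to be a list of dicts mapping token -> count,
    one dict per page. Word order within a page is not available,
    which is why syntax-level analysis is not possible with this data.
    """
    totals = Counter()
    for page in pages:
        for token, count in page.items():
            # Fold case so "The" and "the" are counted together.
            totals[token.lower()] += count
    return totals


# Hypothetical per-page counts, standing in for an Extracted Features export.
sample_pages = [
    {"The": 2, "whale": 1},
    {"the": 1, "sea": 3, "whale": 2},
]

bag = volume_bag_of_words(sample_pages)
for token, count in sorted(bag.items()):
    print(token, count)
```

A table like `bag`, built for every volume in a workset, is the kind of input that topic modeling and other frequency-based methods work from.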
The HTRC Data Capsule gives a researcher a secure, virtual computer for non-consumptive analytical access to the full OCR text of public works (and eventually all works) in the HathiTrust Digital Library. Data Capsules are restricted environments, particularly in how and when the products of analysis tools can leave the capsule: any data product must undergo results review before it is released. To get started with the Data Capsule (via the HTRC data capsule tutorial):