
Text Mining & Computational Text Analysis: HathiTrust Research Center

Working with HTRC

The HathiTrust Research Center (HTRC) provides computational research access to the HathiTrust Digital Library, a shared digital library of over 14 million volumes. It is similar to Google Books but focused on scholarly materials.

To get started, sign up for a free account at analytics.hathitrust.org.

There are three primary modes of access to texts in HTRC:

  1. Extracted Features - You can download ngrams from over 13 million volumes in the HT library, including in-copyright works, to analyze in the computing environment of your choice. Software exists to help you use HTRC Features in Python and R. If you are familiar with Jupyter and Python, see the UC Berkeley Data Science/Library HTRC Modules in GitHub for a detailed walk-through.
  2. Data Capsule - A secure, virtual computer for non-consumptive analytical access to the full OCR text of works in the HT Library.
  3. Workset Builder - You can use the HTRC interface to select public domain volumes and run preset algorithms for quick analysis. This tool is under development and does not yet include in-copyright works.

Extracted Features
Note: if you are familiar with Jupyter and Python, see the UC Berkeley Data Science/Library HTRC Modules in GitHub for a more detailed walk-through of the information below.

You can export Extracted Features for the volumes in your worksets. The export includes volume metadata, token (unigram) and sentence counts per page, an unordered list of all tokens with their frequencies, and more. This data does not allow you to analyze the text at the level of syntax, but it does enable "bag-of-words" methods such as topic modeling. For the steps below, you will need to be comfortable with the command line, Rsync, and the HT Feature Reader. To work with Extracted Features:

  1. To build your corpus, go to the HathiTrust Library, and find the volume ID for each volume you'd like to include:
    • Search for your book, and copy the URL from the Limited (Search Only) or Full View links under the work.
    • The final string of characters after the final / is your volume ID:
      • For https://hdl.handle.net/2027/mdp.39015070698322
        mdp.39015070698322 is the volume ID.
  2. On the command line, use Rsync to download the Extracted Features for each volume:
    htid2rsync mdp.39015070698322 | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ local-folder/
  3. To begin working with these files, see the Programming Historian's tutorial.
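
As a rough illustration of steps 1 and 3 above, the Python sketch below pulls a volume ID out of a handle URL and tallies token frequencies for one downloaded volume. It is not the HT Feature Reader and uses only the standard library. The function names are hypothetical. The file layout is also an assumption: a bz2-compressed JSON file with a features / pages / body / tokenPosCount structure. Check it against the files you actually download.

```python
import bz2
import json
from collections import Counter
from urllib.parse import urlparse


def volume_id_from_url(url):
    """Return the text after the final '/' of a HathiTrust handle URL.

    This follows the rule given in step 1. IDs that themselves contain
    a '/' (for example ark-style IDs) would need different handling.
    """
    path = urlparse(url).path  # e.g. "/2027/mdp.39015070698322"
    return path.rstrip("/").rsplit("/", 1)[-1]


def load_volume(path):
    """Read one downloaded file (assumed to be bz2-compressed JSON)."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


def count_tokens(volume):
    """Sum token frequencies over every page of one parsed volume.

    Assumed layout: each page has body -> tokenPosCount, which maps a
    token to a dict of {part-of-speech tag: count}.
    """
    totals = Counter()
    for page in volume.get("features", {}).get("pages", []):
        token_pos = page.get("body", {}).get("tokenPosCount", {})
        for token, pos_counts in token_pos.items():
            # Fold case so "The" and "the" are counted together.
            totals[token.lower()] += sum(pos_counts.values())
    return totals


if __name__ == "__main__":
    url = "https://hdl.handle.net/2027/mdp.39015070698322"
    print(volume_id_from_url(url))  # prints mdp.39015070698322
```

A per-volume Counter of this kind is a bag of words: word order is gone, but the counts are enough input for frequency comparisons or topic modeling. For real work, the HT Feature Reader mentioned above is likely the better starting point.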

Data Capsule
The HTRC Data Capsule gives a researcher a secure, virtual computer for non-consumptive analytical access to the full OCR text of public domain works (and eventually all works) in the HathiTrust Digital Library. Data Capsules are restricted, particularly in how and when the products created by analysis tools can leave the capsule. Data products leaving a Data Capsule must undergo results review prior to release. To get started with the Data Capsule (via the HTRC Data Capsule tutorial):

  1. Install a VNC client (such as VNC Viewer for Chrome) on your computer to enable communication between your computer and the capsule.
  2. Create a capsule.
  3. Start the capsule.
  4. Connect to the capsule using your VNC Client.
  5. Navigate between maintenance and secure modes in the capsule. 
  6. Run experiments and release your results.

Workset Builder

  1. You can select public domain volumes to analyze using the Workset Builder or you can upload your own workset.
  2. Use HTRC-designed preset algorithms to explore your workset corpora.

HTRC Help Files

Copyright © 2014-2016 The Regents of the University of California. All rights reserved. Except where otherwise noted, this work is subject to a Creative Commons Attribution-Noncommercial 4.0 License.