Text Mining & Computational Text Analysis

Working with HTRC

The HathiTrust Research Center (HTRC) provides computational research access to the HathiTrust Digital Library, a shared digital library with over 17 million volumes that's similar to Google Books, but focused on scholarly materials.

There are three primary modes of access to texts in HTRC:

  1. Datasets (Extracted Features and more) - You can download page-level features, such as token counts, for over 17 million volumes in the HT library, including in-copyright works, and analyze them in the computing environment of your choice.
  2. Data Capsules - A secure, virtual computer for non-consumptive analytical access to the full OCR text of works in the HT Library.
  3. Text Analysis Algorithms and Worksets - Web-based, click-and-run tools that perform computational text analysis on a set of texts you choose (worksets). No programming required. 

Datasets (Extracted Features)

Datasets for Download 

The HTRC Extracted Features Dataset is composed of page-level features for 17.1 million volumes in the HathiTrust Digital Library. This version contains non-consumptive features for both public-domain and in-copyright books. Features include part-of-speech-tagged token counts, header and footer identification, marginal character counts, and more. You download the data to your own computer and run whatever analyses you wish.

Additionally, HTRC has partnered with advanced researchers to release derived datasets: 

  • Word Frequencies in English-Language Literature, 1700-1922
  • Geographic Locations in English-Language Literature, 1701-2011

The steps below assume you are comfortable with the command line and use two tools: Rsync and the HT Feature Reader. To work with Extracted Features:

  1. To build your corpus, go to the HathiTrust Library, and find the volume ID for each volume you'd like to include:
    • Search for your book, and copy the URL from the Limited (Search Only) or Full View links under the work.
    • The string of characters that follows /2027/ in the URL is your volume ID:
      • For https://hdl.handle.net/2027/mdp.39015070698322
        mdp.39015070698322 is the volume ID.
  2. On the command line, use Rsync to download the Extracted Features file for each volume (the htid2rsync utility is installed with the HT Feature Reader):
    htid2rsync mdp.39015070698322 | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ local-folder/
  3. To begin working with these files, see the Programming Historian's tutorial on the HTRC Feature Reader.

Note: If you are familiar with Jupyter and Python, see the UC Berkeley Data Science/Library HTRC Module on GitHub for a more detailed walk-through of these steps.
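
If you are collecting many volumes, step 1 can be scripted. The following is a minimal sketch, not part of any HTRC tool: the function names volume_id_from_url and volume_ids_from_urls are hypothetical, and it assumes every URL is a handle URL of the form shown in the example above. It takes everything after the /2027/ handle prefix, which for the example URL is the same as the text after the final slash, and which also keeps intact any ID that itself contains a slash.

```python
from urllib.parse import urlparse

# Handle prefix seen in the example URL above (assumed to apply to every URL).
HANDLE_PREFIX = "/2027/"


def volume_id_from_url(url: str) -> str:
    """Return the volume ID embedded in a HathiTrust handle URL."""
    # urlparse separates the path from any query string or fragment.
    path = urlparse(url.strip()).path
    if not path.startswith(HANDLE_PREFIX):
        raise ValueError(f"Not a HathiTrust handle URL: {url!r}")
    volume_id = path[len(HANDLE_PREFIX):]
    if not volume_id:
        raise ValueError(f"No volume ID found in: {url!r}")
    return volume_id


def volume_ids_from_urls(urls):
    """Collect unique volume IDs, keeping the order in which they appear."""
    seen = {}
    for url in urls:
        if url.strip():  # skip blank lines
            seen.setdefault(volume_id_from_url(url), None)
    return list(seen)


if __name__ == "__main__":
    urls = ["https://hdl.handle.net/2027/mdp.39015070698322"]
    for volume_id in volume_ids_from_urls(urls):
        print(volume_id)  # prints mdp.39015070698322
```

Each ID printed this way can then be passed to htid2rsync as in step 2.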

Data Capsules

HathiTrust Data Capsules are secure virtual environments for non-consumptive text analysis, where researchers can implement their own data analysis and visualization tools.

In other words, you log into a virtual machine where you will have access to OCRed texts from the HathiTrust Digital Library. You can run your own analyses on this data. You export your results, but not the corpus itself.

Anyone can use a Data Capsule to work with public-domain materials. In addition, because UC Berkeley is a HathiTrust member, UC Berkeley researchers can include in-copyright material in their corpus.

Text Analysis Algorithms

Text Analysis Algorithms and Worksets

Web-based, click-and-run tools that perform computational text analysis on worksets, which are user-created collections of volumes. No programming required.