Library Guides: Text Mining & Computational Text Analysis: HathiTrust Research Center

Working with HTRC

NOTE: The HathiTrust Research Center (HTRC) will be suspended at the end of December 2026 as HathiTrust reallocates resources.

The HathiTrust Research Center (HTRC) provides computational research access to the HathiTrust Digital Library, a shared digital library with over 17 million volumes that's similar to Google Books, but focused on scholarly materials.

There are three primary modes of access to texts in HTRC:

Datasets (Extracted Features and more) - You can download ngrams from over 17 million volumes in the HT library, including in-copyright works, to analyze in the computing environment of your choice.
Data Capsules - A secure, virtual computer for non-consumptive analytical access to the full OCR text of works in the HT Library.
Text Analysis Algorithms and Worksets - Web-based, click-and-run tools that perform computational text analysis on a set of texts you choose (worksets). No programming required.

Datasets (Extracted Features)

Datasets for Download

The HTRC Extracted Features Dataset is composed of page-level features for 17.1 million volumes in the HathiTrust Digital Library. This version contains non-consumptive features for both public-domain and in-copyright books. Features include part-of-speech tagged term token counts, header/footer identification, marginal character counts, and much more. You download the data to your own computer and run analyses as you wish.
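
To make "page-level features" concrete, here is a minimal sketch of rolling page-level token counts up into volume-level counts using only the Python standard library. The `volume` dictionary is a hand-made stand-in, not real HTRC data, and its field names (`pages`, `body`, `tokenPosCount`) are assumptions loosely modeled on the description above; check the actual file layout before relying on them.

```python
from collections import Counter

# Hypothetical, hand-made stand-in for one volume's page-level features.
# Each token maps to its part-of-speech tags and their counts on that page.
volume = {
    "pages": [
        {"body": {"tokenPosCount": {"whale": {"NN": 2}, "the": {"DT": 5}}}},
        {"body": {"tokenPosCount": {"whale": {"NN": 1, "VB": 1}, "sea": {"NN": 3}}}},
    ]
}


def volume_token_counts(vol):
    """Sum page-level counts into one Counter, ignoring part-of-speech tags."""
    totals = Counter()
    for page in vol["pages"]:
        for token, pos_counts in page["body"]["tokenPosCount"].items():
            totals[token.lower()] += sum(pos_counts.values())
    return totals


counts = volume_token_counts(volume)
print(counts.most_common(3))
```

Because only counts are involved, this kind of analysis works on in-copyright volumes without ever reconstructing the running text.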

Additionally, HTRC has partnered with advanced researchers to release derived datasets:

Word Frequencies in English-Language Literature, 1700-1922
Geographic Locations in English-Language Literature, 1701-2011

The steps below assume you are comfortable with the command line and can use Rsync and the HTRC Feature Reader. To work with Extracted Features:

To build your corpus, go to the HathiTrust Library, and find the volume ID for each volume you'd like to include:
- Search for your book, and copy the URL from the Limited (Search Only) or Full View links under the work.
- The final string of characters after the final / is your volume ID:
  - For https://hdl.handle.net/2027/mdp.39015070698322
    mdp.39015070698322 is the volume ID.
In the command line, use Rsync to pull down the Extracted Features for each volume:
htid2rsync mdp.39015070698322 | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ local-folder/
- More information on using Rsync with HTRC.
To begin to work with these files, check out the Programming Historian's tutorial:
- Text Mining in Python through the HTRC Feature Reader (Nov 2016). This tutorial relies on the HTRC Feature Reader (GitHub), which is a Python library that makes heavy use of Pandas.
- An alternative tool is HTRC Book Models (GitHub), which combines Python + Mallet + R for "within-book topic modeling."
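
If you are collecting many volumes, the first step above (copying the volume ID out of each URL) can be scripted. The following is a minimal sketch using only the Python standard library. The function name `volume_id_from_url` is hypothetical, and the code assumes handle URLs of the form shown above, with the `2027` handle prefix; for the example URL, the result is the same as taking the text after the final slash.

```python
from urllib.parse import urlparse

# Assumption: HathiTrust handle URLs look like https://hdl.handle.net/2027/<volume ID>
HANDLE_PREFIX = "/2027/"


def volume_id_from_url(url):
    """Return the volume ID that follows the /2027/ handle prefix."""
    path = urlparse(url).path
    if not path.startswith(HANDLE_PREFIX):
        raise ValueError(f"Not a recognized HathiTrust handle URL: {url}")
    return path[len(HANDLE_PREFIX):]


urls = ["https://hdl.handle.net/2027/mdp.39015070698322"]
volume_ids = [volume_id_from_url(u) for u in urls]
print(volume_ids)
```

Each resulting ID can then be passed to the `htid2rsync` command shown in the Rsync step.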

Note: if you are familiar with Jupyter and Python, see the UC Berkeley Data Science/Library HTRC Module in GitHub for a more detailed walk-through of the information.

Data Capsules

HathiTrust Data Capsules are secure virtual environments for non-consumptive text analysis, where researchers can implement their own data analysis and visualization tools.

In other words, you log into a virtual machine where you will have access to OCRed texts from the HathiTrust Digital Library. You can run your own analyses on this data. You export your results, but not the corpus itself.

Anyone can use a Data Capsule to work with public domain materials. In addition, because UC Berkeley is a HathiTrust member, UC Berkeley researchers can also include in-copyright material in their corpus.

Text Analysis Algorithms

Text Analysis Algorithms and Worksets

Web-based, click-and-run tools that perform computational text analysis on worksets, which are user-created collections of volumes. No programming required.
