The HathiTrust Research Center (HTRC) provides computational research access to the HathiTrust Digital Library, a shared digital library with over 17 million volumes that's similar to Google Books, but focused on scholarly materials.
There are three primary modes of access to texts in HTRC:
The HTRC Extracted Features Dataset is composed of page-level features for 17.1 million volumes in the HathiTrust Digital Library. This version contains non-consumptive features for both public-domain and in-copyright books. Features include part-of-speech tagged term token counts, header/footer identification, marginal character counts, and much more. You download the data to your own computer and run analyses as you wish.
Additionally, HTRC has partnered with advanced researchers to release derived datasets:
Note: if you are familiar with Jupyter and Python, see the UC Berkeley Data Science/Library HTRC Module in GitHub for a more detailed walk-through of the information.
HathiTrust Data Capsules are secure virtual environments for non-consumptive text analysis, where researchers can implement their own data analysis and visualization tools.
In other words, you log into a virtual machine where you will have access to OCRed texts from the HathiTrust Digital Library. You can run your own analyses on this data. You export your results, but not the corpus itself.
Anyone can use the data capsule and work with public domain materials. In addition, since UC Berkeley is a HathiTrust member, UC Berkeley researchers can include in their corpus material still in copyright.