Data H195: Library Sources for Text Mining

Guide for Data Science Honors Thesis Seminar

Library Resources for Text Analysis: Large collections and platform-wide access

Many library-licensed online resources do not support text mining applications. The publishers, vendors, or collections listed here are exceptions, and offer some mode of access to texts for large-scale analysis. 


Library Resources for Text Analysis: Smaller collections and individual publications

Library Resources: Available by request

Text mining access to the following resources requires mediation by the Library and the vendors involved. Researchers should expect to provide a description of their research, and there may be associated costs depending on the scope of the request. Please contact tdm-access@berkeley.edu for more information.

ProQuest Historical Newspapers
Researchers may request OCR full text from any of the following specific newspapers for a specific time period:

  • Chicago Defender (1910-1975)
  • Chicago Tribune (1849-1930*)
  • Los Angeles Times (1881-1930*)
  • The New York Times (1851-1933*)
  • The Wall Street Journal (1889-1932*)
  • The Washington Post (1877-1932*)
  • The Baltimore Afro-American (1893-1998)
  • The Times of India (1838-2005)
  • The Guardian (1821-1906)
  • The Observer (1791-1906) 

Gale Digital Collections
Request access to Gale content for text analysis purposes, including access to OCR text from databases like the Eighteenth Century and Nineteenth Century Collections Online, as well as content from Gale’s newspaper archives. See Gale's FAQ (pdf) or brief description for more information.

Adam Matthew Digital
Contact History Librarian Jennifer Dorner (dorner@berkeley.edu) to request access to OCR text and full metadata from any of Adam Matthew Digital's primary source databases.

Please scrape responsibly

Using Python, Selenium, or other programmatic tools to scrape database search results (even cleverly) can result in access being shut down for the entire campus. :(

For legal and access issues:

What is text mining?

"Text Mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources... The difference between regular data mining and text mining is that in text mining the patterns are extracted from natural language text rather than from structured databases of facts."

- from What is Text Mining? by Marti Hearst (2003)
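
As a rough illustration of the definition above, the sketch below pulls a simple pattern (the most frequent longer words) out of unstructured text. It is a minimal example rather than a recommended method: the function name top_terms, its parameters, and the sample sentence are all hypothetical and do not come from any of the collections listed in this guide.

```python
import re
from collections import Counter


def top_terms(text, n=2, min_length=4):
    """Return the n most frequent words with at least min_length letters.

    Hypothetical helper for illustration only.
    """
    # Lowercase the text and keep runs of ASCII letters as words.
    words = re.findall(r"[a-z]+", text.lower())
    # Drop short words as a crude stand-in for stop-word removal.
    counts = Counter(w for w in words if len(w) >= min_length)
    return counts.most_common(n)


# Made-up sample text, standing in for OCR output from a newspaper page.
sample = (
    "The strike continued on Monday. Strike leaders met city officials, "
    "and the city council debated the strike late into the night."
)

print(top_terms(sample))  # [('strike', 3), ('city', 2)]
```

Real OCR text usually needs more cleanup (hyphenation, misread characters, proper stop-word lists) before counts like these are meaningful.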