Text Mining & Computational Text Analysis

What Is Text Mining?

"Text Mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources... The difference between regular data mining and text mining is that in text mining the patterns are extracted from natural language text rather than from structured databases of facts."

- from What is Text Mining? by Marti Hearst (2003)

Where Can I Find Data & Texts for Analysis?

We've compiled sources for data, texts, and more available through the UC Berkeley Library and on the open web. Use the menu to navigate our listings by category. To suggest or request data not listed on this guide, please email tdm-access@berkeley.edu.

Please Scrape Responsibly

Help us keep library databases available for everyone. Before you programmatically scrape a library database or website, check its terms of service, look for an official API, or contact the library for help. Using Python, Selenium, or other programmatic tools to scrape database search results, no matter how carefully, can result in access being shut down for the entire campus.

Software & Tools

OCR: Tools for Making PDFs and Images of Text Usable

Simple Jobs:

Complex Jobs: OCR Desktop

  • ABBYY FineReader: accurate, supports 190 languages
  • Tesseract OCR: process large corpora in bulk
  • Supported by Research-IT

Why OCR?

  • As a humanities researcher, you may have a large number of PDFs, photos of archival documents, or other images of text that are not yet machine readable.
  • Optical Character Recognition (OCR) converts images of scanned text into machine-readable text, so you can copy and paste, search, or edit.
  • Research-level Optical Character Recognition requires accuracy, multiple language support, and bulk processing.
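
To make the bulk-processing point concrete, here is a minimal Python sketch of batch OCR with Tesseract. It is an illustration under stated assumptions, not a workflow endorsed by this guide: it assumes the Tesseract engine plus the third-party `pytesseract` and `Pillow` packages are installed, and the helper names (`tesseract_ocr`, `ocr_folder`), the folder names, and the list of image suffixes are hypothetical choices made for the example.

```python
from pathlib import Path
from typing import Callable, List

# Assumption: which file extensions count as scanned page images.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def tesseract_ocr(image_path: Path, lang: str = "eng") -> str:
    """OCR a single image with Tesseract through pytesseract.

    Assumes the Tesseract binary, pytesseract, and Pillow are installed.
    The imports are lazy so ocr_folder() can be used with another engine.
    """
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, lang=lang)


def ocr_folder(
    in_dir,
    out_dir,
    ocr: Callable[[Path], str] = tesseract_ocr,
) -> List[Path]:
    """Run `ocr` on every image in in_dir and write one .txt file per image.

    Returns the paths of the text files written, in sorted input order.
    """
    in_dir, out_dir = Path(in_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for image_path in sorted(in_dir.iterdir()):
        # Skip anything that is not a recognised image file.
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        text = ocr(image_path)
        out_path = out_dir / (image_path.stem + ".txt")
        out_path.write_text(text, encoding="utf-8")
        written.append(out_path)
    return written
```

Calling `ocr_folder("scans", "text")` would, under these assumptions, write one UTF-8 text file per page image in a hypothetical `scans` folder. The `ocr` parameter is kept pluggable so a different engine, or a different Tesseract language code, can be swapped in without changing the loop.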

Learn more

Campus resources