Text Mining & Computational Text Analysis

What Is Text Mining?

"Text Mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources... The difference between regular data mining and text mining is that in text mining the patterns are extracted from natural language text rather than from structured databases of facts."

- from What is Text Mining? by Marti Hearst (2003)

Where Can I Find Data & Texts for Analysis?

  • TDM platforms like TDM Studio and Gale Digital Scholar Lab
  • Library datasets (listed in this guide)
  • Web sources (some are listed in this guide)
  • To suggest or request data not listed on this guide, please email

Please Scrape Responsibly

  • Help us keep library databases available for everyone.
  • Before you programmatically scrape a library database or website, check its terms of service, look for an official API, or contact the library for help.
  • Using Python, Selenium, or other programmatic tools to scrape database search results, no matter how carefully, can result in access being shut down for the entire campus.
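For open web sources (not library databases, where the warning above applies regardless), one small pre-check is to read the site's robots.txt before requesting any page. The sketch below uses Python's standard urllib.robotparser. The sample rules, the user-agent name, and the example.org URLs are made up for illustration. A robots.txt check does not replace reading the terms of service or asking the library.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS = """User-agent: *
Disallow: /search
"""

if __name__ == "__main__":
    for url in ("https://example.org/about", "https://example.org/search?q=ocr"):
        verdict = "allowed" if is_allowed(SAMPLE_ROBOTS, "my-research-bot", url) else "disallowed"
        print(url, "->", verdict)
```

In practice you would download the real robots.txt first (RobotFileParser also offers set_url() and read() for that) and stop if the answer is "disallowed" or if the terms of service prohibit automated access.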

Software & Tools

OCR: Tools for Making PDFs and Images of Text Usable

Simple Jobs:

Complex Jobs: OCR Desktop

  • ABBYY FineReader: accurate, supports 190 languages
  • Tesseract OCR: process large corpora in bulk
  • Supported by Research-IT

Why OCR?

  • As a humanities researcher, you may have a large number of PDFs, photos of archival documents, or other images of text that are not yet machine readable.
  • Optical Character Recognition (OCR) converts images of scanned text into machine-readable text, so you can copy and paste, search, or edit.
  • Research-level OCR requires high accuracy, support for multiple languages, and bulk processing.
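As a rough illustration of bulk processing, the sketch below loops over a folder of page images and calls the Tesseract command-line program on each one. It assumes Tesseract is installed and on your PATH. The folder names and the helper functions are hypothetical, not part of any campus-supported workflow. Tesseract reads image files, so PDFs would need to be converted to images first.

```python
import subprocess
from pathlib import Path

# Image types to pick up; adjust to match your scans.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def build_tesseract_command(image_path, output_dir, lang="eng"):
    """Return the argument list for one Tesseract run.

    The Tesseract CLI takes an input image, an output base name (it adds
    the .txt extension itself), and an optional -l language code.
    """
    image_path = Path(image_path)
    output_base = Path(output_dir) / image_path.stem
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def find_images(input_dir):
    """List image files in input_dir, sorted so runs are reproducible."""
    return sorted(
        p for p in Path(input_dir).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def ocr_folder(input_dir, output_dir, lang="eng"):
    """Run Tesseract on every image in input_dir; return the expected text paths."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    written = []
    for image in find_images(input_dir):
        # check=True raises an error if Tesseract fails on a page.
        subprocess.run(build_tesseract_command(image, output_dir, lang), check=True)
        written.append(Path(output_dir) / f"{image.stem}.txt")
    return written
```

For example, ocr_folder("scans", "text", lang="eng") would write one plain-text file per image into a text folder. The language code must match a language pack installed with Tesseract.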

Learn more

Campus resources