
Text Mining & Computational Text Analysis

Where Can I Find Data & Texts?

Right here! The Library offers a wealth of texts and data for your TDM research. This guide also lists datasets that are freely available on the web. Use the navigation menu to browse by type of data.

To suggest or request data not listed on this guide, please email tdm-access at berkeley.edu.

What about Copyright?

Can I Web Scrape?

  • Help us keep library databases available for everyone.
  • Using Python, Selenium, or other programmatic tools to scrape database search results, no matter how carefully, can result in access being shut down for the entire campus. That’s because license agreements and applicable laws affect what you can mine and how you can mine it.
  • Before you programmatically scrape a library database or website, make sure you’re following the tips in this guide, which walks you through everything you need to know. And if you’re wondering whether you should use an API, you can check out this flowchart.
  • Have more questions? Contact tdm-access at berkeley.edu for help. 

Can I Do TDM on Material under Copyright?

Our Copyright and Text Mining Guide explains everything you need to know about doing TDM on material under copyright or with licensing restrictions, both when you are running your analyses and when you are publishing your results. 

Can I Do TDM on eBooks & DVDs Protected by DRM?

Some materials may have an added technological protection layer of "digital rights management" (DRM). There are some situations in which it’s permitted to “break” eBook and DVD DRM to conduct TDM, but there are very specific rules you must follow. Check out the DRM parameters in our TDM law and policy guide. And if you have any questions, please get in touch at tdm-access at berkeley.edu.

How Are Legal Aspects of TDM Related to Using or Training Artificial Intelligence (AI)?

Scholars have relied upon non-generative (sometimes called "analytical") AI for many years to extract information from copyrighted works as part of text and data mining processes. Scholars should be able to rely on fair use to perform the component acts of computational research with or without AI, in the same way they have for TDM. Licensing agreements, privacy concerns, and ethical considerations are non-copyright issues that can affect scholars' use of AI. Check out the Artificial Intelligence guide to explore these issues in greater detail. 

Before you use any Library-licensed databases or data sources for TDM and AI, see the Library’s web page on use and licensing restrictions for electronic resources and contact tdm-access at berkeley.edu with any questions.

Software & Tools

Programming

Cloud-Based Tools

OCR: Tools for Making PDFs and Images of Text Usable

Simple Jobs

  • Library scanners will output OCR for most English-language documents. Results with other languages may vary.

Complex Jobs

  • ABBYY FineReader: accurate, supports 190 languages
  • Tesseract OCR: process large corpora in bulk
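As a rough illustration of bulk processing with Tesseract, the sketch below builds one command per image in a folder, using the standard command-line form `tesseract <image> <output base> -l <language>`. It assumes the `tesseract` executable is installed and on your PATH; the function names, folder names, and list of image extensions are hypothetical choices for this example, not part of any Library workflow.

```python
import shutil
import subprocess
from pathlib import Path

# Assumption: these are the image types in your corpus; adjust as needed.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def build_ocr_commands(input_dir, output_dir, lang="eng"):
    """Return one tesseract command (as an argument list) per image in input_dir."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    commands = []
    for image in sorted(input_dir.iterdir()):
        if image.suffix.lower() in IMAGE_SUFFIXES:
            # tesseract appends ".txt" to the output base name on its own
            output_base = output_dir / image.stem
            commands.append(["tesseract", str(image), str(output_base), "-l", lang])
    return commands


def run_ocr(input_dir, output_dir, lang="eng"):
    """Run tesseract on every image in input_dir, writing text files to output_dir."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    for command in build_ocr_commands(input_dir, output_dir, lang):
        # check=True stops the batch at the first image that fails
        subprocess.run(command, check=True)
```

A call such as `run_ocr("scans", "ocr_text", lang="eng")` would then write one `.txt` file per image. Separating command construction from execution makes it easy to inspect the batch before running it on a large corpus.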

Why OCR?

  • As a humanities researcher, you may have a large number of PDFs, photos of archival documents, or other images of text that are not yet machine readable.
  • Optical Character Recognition (OCR) converts images of scanned text into machine-readable text, so you can copy and paste, search, or edit.
  • Research-level OCR requires high accuracy, support for multiple languages, and bulk processing.

Learn about TDM

What Is Text Data Mining (TDM)?

"Text Mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources... The difference between regular data mining and text mining is that in text mining the patterns are extracted from natural language text rather than from structured databases of facts."

- from What is Text Mining? by Marti Hearst
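To make the idea of extracting patterns from natural language text concrete, here is a minimal sketch of one of the simplest text mining steps: counting how often each term appears. The function name, the tokenizing rule, and the stopword handling are assumptions made for illustration; real projects usually rely on a dedicated text analysis library.

```python
import re
from collections import Counter


def term_frequencies(text, stopwords=frozenset()):
    """Count lowercase word tokens in text, skipping any listed stopwords."""
    # Assumption: a "word" is a run of ASCII letters or apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(token for token in tokens if token not in stopwords)
```

Calling `term_frequencies(text).most_common(10)` on a document would list its ten most frequent terms, a common first look at a corpus before more sophisticated analysis.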

Guide Books and Online Tutorials

Workshops and Training On Campus

DH on Campus