Library Guides: Text Mining & Computational Text Analysis: Home

Where Can I Find Data & Texts?

Right here! The Library offers a wealth of texts and data for your TDM research. We have also included datasets available freely on the web. Use the navigation menu to browse by type of data. Data include:

Library datasets
Web sources
TDM platforms like TDM Studio and Gale Digital Scholar Lab

To suggest or request data not listed on this guide, please email tdm-access at berkeley.edu.

What about Copyright?

Can I Web Scrape?

Help us keep library databases available for everyone.
Using Python, Selenium, or other programmatic tools to scrape database search results, no matter how carefully, can result in access being shut down for the entire campus. That’s because license agreements and laws affect what and how you can mine.
Before you programmatically scrape a library database or website, make sure you’re following the tips in this guide, which walks you through everything you need to know. And if you’re wondering whether you should use an API, you can check out this flowchart.
Have more questions? Contact tdm-access at berkeley.edu for help.

Can I Do TDM on Material under Copyright?

Our Copyright and Text Mining Guide explains everything you need to know about doing TDM on material under copyright or with licensing restrictions, both when you are running your analyses and when you are publishing your results.

Can I Do TDM on eBooks & DVDs Protected by DRM?

Some materials may have an added technological protection layer of "digital rights management" (DRM). There are some situations in which it’s permitted to “break” eBook and DVD DRM to conduct TDM, but there are very specific rules you must follow. Check out the DRM parameters in our TDM law and policy guide. And if you have any questions, please get in touch at tdm-access at berkeley.edu.

How are legal aspects of TDM related to using or training artificial intelligence (AI)?

Scholars have relied upon non-generative (sometimes called "analytical") AI for many years to extract information from copyrighted works as part of text and data mining processes. Scholars should be able to rely on fair use to perform the component acts of computational research with or without AI, in the same way they have for TDM. Licensing agreements, privacy concerns, and ethical considerations are non-copyright issues that can affect scholars' use of AI. Check out the Artificial Intelligence guide to explore these issues in greater detail.

Before you use any Library-licensed databases or data sources for TDM and AI, see the Library’s web page on use and licensing restrictions for electronic resources and contact tdm-access at berkeley.edu with any questions.

Software & Tools

Programming

Python
Python is a free, open-source, general-purpose programming language that often serves as a foundation for text analysis projects.
R
R is a free software environment for statistical computing and graphics. RStudio is a development interface for R featuring a console and a syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging, and workspace management.
MALLET
MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
Natural Language Toolkit (Python)
NLTK is a free, open-source platform for building Python programs to work with human language data.
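
To give a sense of what a first text analysis step in Python can look like, here is a minimal sketch that counts word frequencies using only the standard library. The sample text and the function names `tokenize` and `top_terms` are illustrative assumptions, not part of any tool listed above; toolkits such as NLTK offer more robust tokenizers.

```python
import re
from collections import Counter

# Hypothetical sample text; replace with the contents of your own document.
SAMPLE = "The library offers texts. The texts support text mining research."


def tokenize(text):
    """Lowercase the text and split it into runs of the letters a-z.

    This is a deliberate simplification: accented letters, digits, and
    punctuation are treated as separators.
    """
    return re.findall(r"[a-z]+", text.lower())


def top_terms(text, n=3):
    """Return the n most frequent tokens with their counts."""
    return Counter(tokenize(text)).most_common(n)


print(top_terms(SAMPLE))
```

For real corpora you would likely read text from files and remove very common words before counting.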

Cloud-Based Tools

Digital Scholar Lab (Gale)
The Digital Scholar Lab has cloud-based tools for automatically performing common computational queries on and creating visualizations from content sets built with Gale primary source collections.
Primary source collections include: 17th and 18th Century Burney Collection; American Civil Liberties Union Papers, 1912-1990; American Fiction; Archives Unbound; Archives of Sexuality & Gender; British Library Newspapers; The Economist Historical Archive; Eighteenth Century Collections Online; Indigenous Peoples: North America; The Making of Modern Law; The Making of the Modern World; Nineteenth Century Collections Online; Nineteenth Century U.S. Newspapers; Sabin Americana, 1500-1926; The Times Digital Archive; The Times Literary Supplement Historical Archive; U.S. Declassified Documents Online

Voyant
A cloud-based tool for performing some automated computational text analysis on documents that you upload.

OCR: Tools for Making PDFs and Images of Text Usable

Simple Jobs

Library scanners will output OCR for most English-language documents. Results with other languages may vary.

Complex Jobs

ABBYY FineReader: accurate, supports 190 languages
Tesseract OCR: process large corpora in bulk

Why OCR?

As a humanities researcher, you may have a large number of PDFs, photos of archival documents, or other images of text that are not yet machine readable.
Optical Character Recognition (OCR) converts images of scanned text into machine-readable text, so you can copy and paste, search, or edit.
Research-level Optical Character Recognition requires accuracy, multiple language support, and bulk processing.
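
As one way to picture bulk processing, the sketch below loops over a folder of page images and calls the Tesseract command-line program on each one. It assumes Tesseract is already installed and on your PATH. The folder names, the list of image suffixes, and the helper names `build_ocr_command` and `ocr_folder` are hypothetical. The command form used here, `tesseract <image> <output base> -l <language>`, writes a `.txt` file next to the output base name.

```python
import shutil
import subprocess
from pathlib import Path

# Assumed set of image types to process; adjust to match your scans.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def build_ocr_command(image_path, out_dir, lang="eng"):
    """Build the Tesseract call for one image as an argument list."""
    out_base = Path(out_dir) / Path(image_path).stem
    return ["tesseract", str(image_path), str(out_base), "-l", lang]


def ocr_folder(in_dir, out_dir, lang="eng"):
    """Run Tesseract on every image in in_dir; return the commands that ran."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("Tesseract is not installed or not on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    commands = []
    for image in sorted(Path(in_dir).iterdir()):
        if image.suffix.lower() in IMAGE_SUFFIXES:
            cmd = build_ocr_command(image, out_dir, lang)
            subprocess.run(cmd, check=True)  # stops on the first failed page
            commands.append(cmd)
    return commands
```

A call such as `ocr_folder("scans", "ocr_out", lang="fra")` would process a hypothetical `scans` folder with the French language model, provided that language data is installed.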

Learn about TDM

What Is Text Data Mining (TDM)?

"Text Mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources... The difference between regular data mining and text mining is that in text mining the patterns are extracted from natural language text rather than from structured databases of facts."

- from What is Text Mining? by Marti Hearst

Guide Books and Online Tutorials

Natural Language Processing with Python by Steven Bird; Ewan Klein; Edward Loper
This book offers a highly accessible introduction to natural language processing, the field that supports a variety of language technologies, from predictive text and email filtering to automatic summarization and translation.
Text Analysis with R for Students of Literature by Matthew L. Jockers
Text Analysis with R for Students of Literature is written with students and scholars of literature in mind but will be applicable to other humanists and social scientists wishing to extend their methodological tool kit to include quantitative and computational approaches to the study of text.

Introduction to the tm Package: Text Mining in R
A short 2015 introduction to text mining in R by Ingo Feinerer. Includes data import, corpus handling, preprocessing, metadata management, and creation of term-document matrices.
Extracting Data from the Internet in Python
Course materials from Rochelle Terman's May 2016 D-Lab workshop.
Text Analysis in Python Intensive for Digital Humanities
Course materials from Teddy Roland's May 2016 D-Lab workshop.
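
The term-document matrix mentioned in the tm package introduction above can be illustrated in a few lines. This is a minimal Python sketch of the general idea, not the tm package's implementation; the function name `term_document_matrix` and the sample documents are assumptions made for illustration.

```python
import re
from collections import Counter


def term_document_matrix(docs):
    """Return (vocab, matrix), where matrix[i][j] counts term i in document j."""
    # Count lowercase a-z word tokens in each document (a simplification).
    counts = [Counter(re.findall(r"[a-z]+", d.lower())) for d in docs]
    # The vocabulary is every distinct term across all documents, sorted.
    vocab = sorted(set().union(*counts))
    # A Counter returns 0 for terms a document does not contain.
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix


# Hypothetical two-document corpus.
vocab, matrix = term_document_matrix(["text mining", "mining data data"])
for term, row in zip(vocab, matrix):
    print(term, row)
```

Each row is a term and each column is a document, so comparing columns gives a rough sense of how similar two documents' vocabularies are.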

Workshops and Training On Campus

D-Lab Trainings
D-Lab helps Berkeley faculty, staff, and graduate students move forward with world-class research in data-intensive social science.
Data + Digital Library Workshops
Workshops on data, programming, legal issues in TDM research, and more.

DH on Campus

DH Working Group
The Berkeley Digital Humanities Working Group is a research community founded to facilitate interdisciplinary conversations in the digital humanities and cultural analytics. Our biweekly meetings are participant driven and provide a place for sharing research ideas (including brainstorming new ideas and receiving feedback from others), learning about the intersection of computational methods and humanistic inquiry, and connecting with others working in this space at Berkeley.
Computational Text Analysis Working Group (D-Lab)
This group of graduate students, faculty, and researchers meets twice a month during the semester to share research, provide technical workshops, and offer space to support ongoing analysis.
Digital Humanities Listserv
The Digital Humanities Listserv is open to members of the UC Berkeley community interested in DH.

UC Berkeley Library Dataverse

To discover data that has been licensed for research, teaching, and learning at UC Berkeley, please visit UC Berkeley Library's Dataverse. All data in the UC Berkeley Library Dataverse has been acquired and licensed by the library for use by UC Berkeley and Lawrence Berkeley National Lab students, faculty, and staff. To view data files, log in using CalNet or LBNL authentication.

Dataverse content includes all acquired data that has a license signed by UC Berkeley Library AND:

Is available to all users without restriction to specific user groups (note: there are some exceptions with data provided by the Business Library)
Is stored locally OR is a subscription (must be data-only content, e.g., Sage Data, ICPSR, Nielsen marketing data, WRDS data)
Can be manipulated either in the platform or after download

Not included in UC Berkeley Library Dataverse:

GIS data and data resources. Please visit the UC Berkeley Library GeoData Portal.
Databases that have data and additional content in the form of reports, articles, etc.
Research data (data, code, and other outputs generated during the research process). For more information on publishing and sharing research data, please visit publishing data or email librarydataservices@berkeley.edu.

If you cannot find the data you are looking for, please contact your subject librarian to identify a suitable alternative or suggest a purchase.
