Computational Text Analysis Working Group (D-Lab)
This group of graduate students, faculty, and researchers meets twice a month during the semester to share research, provide technical workshops, and offer space to support ongoing analysis.
D-Lab helps Berkeley faculty, staff, and graduate students move forward with world-class research in data intensive social science.
Data Acquisition and Access Program
Recommend a dataset for purchase! This program is focused on datasets that require license or user agreements to access.
Digital Humanities Consulting
The Research IT group in the Office of the CIO offers digital humanities consulting at no cost to UCB faculty, students, and staff. See also their Text Analysis resource guide.
Lit+DH Working Group
A working group for researchers of all levels who hope to familiarize themselves with some of the discourse, ideas, debates and tools relevant to current work in digital humanities.
- As a humanities researcher, you may have a large number of PDFs, photos of archival documents, or other images of text that are not yet machine readable.
- Optical Character Recognition (OCR) converts images of scanned text into machine-readable text, so you can copy and paste, search, or edit.
- Research-level Optical Character Recognition requires accuracy, multiple language support, and bulk processing.
OCR Desktop Program
- ABBYY FineReader: accurate, supports 190 languages
- Tesseract OCR: process large corpora in bulk
- Supported by Research-IT
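Bulk processing with Tesseract can be scripted. The following is a minimal sketch, not an official Research IT workflow: it assumes the `tesseract` command-line program is installed and on your PATH, and the function names, folder names, and list of image extensions are hypothetical choices made for illustration. It relies only on the basic command form `tesseract IMAGE OUTPUTBASE -l LANG`, which writes the recognized text to `OUTPUTBASE.txt`.

```python
import shutil
import subprocess
from pathlib import Path

# Image types to pick up; an illustrative assumption, adjust for your corpus.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def build_command(image, out_dir, lang="eng"):
    """Return the tesseract command line for a single image."""
    image = Path(image)
    # Tesseract appends ".txt" to the output base name itself.
    out_base = Path(out_dir) / image.stem
    return ["tesseract", str(image), str(out_base), "-l", lang]


def ocr_directory(in_dir, out_dir, lang="eng"):
    """Run tesseract on every image in in_dir; return the text files written."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for image in sorted(Path(in_dir).iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip non-image files
        subprocess.run(build_command(image, out_dir, lang), check=True)
        written.append(out_dir / f"{image.stem}.txt")
    return written
```

Calling `ocr_directory("scans", "text", lang="fra")` would, under these assumptions, produce one `.txt` file per French-language page image in a hypothetical `scans` folder. The language code must match a language pack installed alongside Tesseract.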
Python is a free, open-source, general-purpose programming language that often serves as a foundation for text analysis projects.
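As a small illustration of the kind of task Python handles in a text analysis project, the sketch below counts word frequencies using only the standard library. The function name, the simple tokenizing pattern, and the sample sentence are assumptions made for this example; a real project would likely use a dedicated tokenizer.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Lowercase the text, split it into words, and count each word."""
    # Match runs of letters, allowing one internal apostrophe (e.g. "don't").
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(tokens)


sample = "The whale, the white whale! The sea."
freqs = word_frequencies(sample)
print(freqs.most_common(2))  # [('the', 3), ('whale', 2)]
```

The same function can be applied to the contents of a text file, such as the output of an OCR run, by reading the file into a string first.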
R is a free software environment for statistical computing and graphics.
RStudio is a development interface for R featuring a console, a syntax-highlighting editor that supports direct code execution, and tools for plotting, history, debugging, and workspace management.
MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
Natural Language Toolkit (Python)
NLTK is a free, open-source platform for building Python programs to work with human language data.
Natural Language Processing with Python
This book offers a highly accessible introduction to natural language processing, the field that supports a variety of language technologies, from predictive text and email filtering to automatic summarization and translation.
Text Analysis with R for Students of Literature
Text Analysis with R for Students of Literature is written with students and scholars of literature in mind but will be applicable to other humanists and social scientists wishing to extend their methodological tool kit to include quantitative and computational approaches to the study of text.
For help with TDM access:
Send questions about text and data mining (TDM) access to library resources to the shared email address above, which brings together librarians and campus partners with subject, copyright, technical, and licensing expertise.
- For help with text mining tools and software, check out the D-Lab.
- Questions and suggestions related to this guide can go to Cody Hennesy.