Text Mining & Computational Text Analysis: Tools & Support

Campus resources

Tools for Making PDFs and Images of Text Usable

Library scanners will output OCR for most English-language documents. For more complex jobs or documents in other languages, you may want to use the OCR Desktop program.

OCR Desktop

  • ABBYY FineReader: accurate, supports 190 languages
  • Tesseract OCR: process large corpora in bulk
  • Supported by Research-IT


Why OCR?

  • As a humanities researcher, you may have a large number of PDFs, photos of archival documents, or other images of text that are not yet machine readable.
  • Optical Character Recognition (OCR) converts images of scanned text into machine-readable text, so you can copy and paste, search, or edit.
  • Research-level Optical Character Recognition requires accuracy, multiple language support, and bulk processing.
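
As a rough illustration of bulk processing, the sketch below loops over a folder of page images and calls the Tesseract command-line program on each one. It assumes Tesseract is installed and on your PATH, and that the language data you request (for example "fra" for French) is installed. The function names and the folder names "scans" and "text" are hypothetical, chosen only for this example.

```python
import subprocess
from pathlib import Path

# Image types this sketch will pick up; adjust to match your own scans.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def build_tesseract_command(image_path, output_dir, lang="eng"):
    """Return the argument list for one Tesseract run (does not execute it)."""
    image_path = Path(image_path)
    # Tesseract adds the ".txt" extension to the output base name itself.
    output_base = Path(output_dir) / image_path.stem
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_directory(input_dir, output_dir, lang="eng"):
    """Run Tesseract on every image in input_dir; return the expected text files."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    expected = []
    for image_path in sorted(Path(input_dir).iterdir()):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip anything that is not a recognized image file
        command = build_tesseract_command(image_path, output_dir, lang)
        subprocess.run(command, check=True)  # raises an error if Tesseract fails
        expected.append(output_dir / (image_path.stem + ".txt"))
    return expected
```

Calling `ocr_directory("scans", "text", lang="fra")` would then write one plain-text file per image into the "text" folder, ready to search or load into a text analysis tool. Accuracy depends on scan quality, so spot-check a sample of the output before relying on it.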

For help with TDM access:
Send questions about text and data mining access to library resources to the shared email address, which brings together librarians and campus partners with subject, copyright, technical, and licensing expertise.

  • For help with text mining tools and software, check out the D-Lab.
  • Questions and suggestions related to this guide can go to Stacy Reardon.
Copyright © 2014-2019 The Regents of the University of California. All rights reserved. Except where otherwise noted, this work is subject to a Creative Commons Attribution-Noncommercial 4.0 License.