Text Mining & Computational Text Analysis: Tools & Support

Campus resources

OCR

Library scanners will output OCR text for most English-language documents. For more complex jobs or documents in other languages, however, you might want to use the OCR Desktop Program.


Why OCR?

  • As a humanities researcher, you may have a large number of PDFs, photos of archival documents, or other images of text that are not yet machine readable.
  • Optical Character Recognition (OCR) converts images of scanned text into machine-readable text, so you can copy and paste, search, or edit.
  • Research-level Optical Character Recognition requires accuracy, multiple language support, and bulk processing.

OCR Desktop Program

  • ABBYY FineReader: accurate, supports 190 languages
  • Tesseract OCR: process large corpora in bulk
  • Supported by Research-IT
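
As a rough illustration of what bulk processing with Tesseract can look like, here is a minimal Python sketch that builds one `tesseract` command per image in a folder and runs them in sequence. It assumes the `tesseract` command-line program is installed and on your PATH, along with the language data you request. The function names, folder names, and list of image extensions are hypothetical choices made for this example, not part of the OCR Desktop Program setup described here.

```python
import shutil
import subprocess
from pathlib import Path

# Assumed set of image types to process; adjust for your own corpus.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def build_commands(image_dir, out_dir, lang="eng"):
    """Return one tesseract command (as an argument list) per image in image_dir."""
    commands = []
    for image in sorted(Path(image_dir).iterdir()):
        if image.suffix.lower() in IMAGE_SUFFIXES:
            # Tesseract adds ".txt" to the output base name on its own.
            out_base = Path(out_dir) / image.stem
            commands.append(["tesseract", str(image), str(out_base), "-l", lang])
    return commands


def run_batch(image_dir, out_dir, lang="eng"):
    """Run OCR on every image in image_dir, writing text files into out_dir."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for cmd in build_commands(image_dir, out_dir, lang):
        # check=True stops the batch if Tesseract reports an error.
        subprocess.run(cmd, check=True)
```

For example, `run_batch("scans", "ocr_text", lang="fra")` would attempt to write one `.txt` file per image in a hypothetical `scans` folder, using Tesseract's French language data. Keeping command construction separate from execution makes it easy to inspect the planned commands before starting a long job.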


For help with TDM access:

tdm-access@berkeley.edu
Send questions about text and data mining access to library resources to the shared email address above, which brings together librarians and campus partners with subject, copyright, technical, and licensing expertise.

  • For help with text mining tools and software, check out the D-Lab.
  • Questions and suggestions related to this guide can go to Cody Hennesy.
Copyright © 2014-2016 The Regents of the University of California. All rights reserved. Except where otherwise noted, this work is subject to a Creative Commons Attribution-Noncommercial 4.0 License.