
Digital Humanities

This guide is a sandbox for exploring digital humanities samples and tasks supported by the UC Berkeley libraries.

What is OCR?

Optical Character Recognition (OCR) is a technology that converts images of text into a machine-readable format. This guide will introduce several OCR tools and compare their relative strengths and use cases.

OCR Tools Comparison

| Tool | Cost | File size limit | Speed | Input file types | Output file types |
| --- | --- | --- | --- | --- | --- |
| Abbyy Finereader | Free trial (7 days), $ | 2 GB | Moderate | pdf, doc, jpg, png, gif, pptx, epub, and more | json, pdf, text, docx, xlsx, jpeg, csv, and more |
| Amazon Textract | $ | 5 MB, < 11 pages | Fast | png, jpeg, tiff, pdf | png, jpeg, tiff, pdf, json, csv, txt |
| ChatGPT OCR | Free, $ | Free version: 100 MB, 3 files/day; paid version: 1 GB, 50 files/day | Fast | png, jpg, pdf | pdf, docx, csv, xlsx, md, txt, html, json, and more? |
| SensusAccess | Free to UCB users | 64 MB | Slow | doc, pdf, jpg, png, gif, html, and more | mp3, daisy, epub, Braille, pdf, doc, docx, xml, xls, xlsx, csv, text, rtf, html |
| Transkribus | Free trial with credits, $ | 200 MB, 3,000 pages | Moderate | jpeg, png, pdf, docx, txt, and more | jpeg, docx, pdf, txt, xml |
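
The file size limits in the comparison table above can be turned into a quick lookup for deciding which tools will accept a given scan. The following is a minimal sketch, not an official utility: the names `SIZE_LIMITS` and `tools_accepting` are made up for illustration, 1 MB is assumed to mean 1,000,000 bytes, and page-count and files-per-day limits are ignored.

```python
# Size limits transcribed from the comparison table.
# Assumption: 1 MB = 10**6 bytes and 1 GB = 10**9 bytes.
MB = 10**6
GB = 10**9

# Hypothetical lookup table; page-count and daily-upload limits are not modeled.
SIZE_LIMITS = {
    "Abbyy Finereader": 2 * GB,
    "Amazon Textract": 5 * MB,
    "ChatGPT OCR (free)": 100 * MB,
    "ChatGPT OCR (paid)": 1 * GB,
    "SensusAccess": 64 * MB,
    "Transkribus": 200 * MB,
}


def tools_accepting(size_bytes):
    """Return the tools whose file size limit is at least size_bytes."""
    return [tool for tool, limit in SIZE_LIMITS.items() if size_bytes <= limit]


if __name__ == "__main__":
    # An 80 MB scan is too large for Amazon Textract and SensusAccess.
    for tool in tools_accepting(80 * MB):
        print(tool)
```

Size is only the first filter; speed, cost, and the supported input and output file types in the table still need to be weighed for each project.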

 

Abbyy Finereader

Features

  • Lays out original document and editable text side by side for ease of comparison
  • Can generate searchable PDFs

Limitations

  • May struggle with maps, tables, and handwriting
  • Can only detect one language per document

Getting Started

ChatGPT

Features

  • Customizable output that is human-readable by default
  • User-friendly chatbot interface that takes little learning to use
  • Excels at transcribing from tables, and can detect most handwriting

Limitations

  • Draws information from external contexts when it cannot detect text clearly (possible hallucinations/biases)
  • Output can vary based on prompt (not standardized/reproducible)
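
Because output can vary from one prompt or run to the next, it can help to transcribe the same image twice and compare the results before trusting either one. The sketch below uses Python's standard `difflib` module to do that; the function names and the two sample transcriptions are invented for illustration and are not real ChatGPT output.

```python
import difflib


def similarity(a, b):
    """Return a ratio between 0 and 1 describing how alike two transcriptions are."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def changed_lines(a, b):
    """Return the lines that differ, prefixed with '- ' (first run) or '+ ' (second run)."""
    diff = difflib.ndiff(a.splitlines(), b.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]


if __name__ == "__main__":
    # Made-up transcriptions of the same page from two separate runs.
    run_1 = "Invoice 1024\nTotal: 18.50"
    run_2 = "Invoice 1024\nTotal: 16.50"

    print(f"similarity: {similarity(run_1, run_2):.2f}")
    for line in changed_lines(run_1, run_2):
        print(line)
```

Lines flagged as different are good candidates for checking against the original image, since a disagreement between runs may point to text the model could not read clearly.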

Getting Started

SensusAccess

Features

  • Specializes in generating output for conversion to audio files

Limitations

  • Raw output is not very human-readable
  • Larger files can take longer to process

Getting Started

Additional Resources