Library Guides: Text Mining & Computational Text Analysis: Historical & Archival

Historical & Archival Collections

Chronicling America (Library of Congress)
The Chronicling America API provides access to information about historic U.S. newspapers and to millions of digitized newspaper pages. The OCR data for those pages is also available for bulk download. See the full list of digitized newspaper titles (1836-1922) for more information.
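OCR text from digitized newspaper pages usually needs light cleanup before analysis. The following is a minimal sketch of one such step and is not part of the Chronicling America API: `clean_ocr` is a hypothetical helper, and the sample string is invented for illustration. It rejoins words that were hyphenated across line breaks and collapses runs of whitespace.

```python
import re


def clean_ocr(text: str) -> str:
    """Rejoin words split by end-of-line hyphens and collapse whitespace."""
    # "news-\npaper" -> "newspaper". Assumption: a hyphen directly before a
    # line break marks a word that was split to fit the column.
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Collapse every remaining run of whitespace (including newlines).
    return re.sub(r"\s+", " ", text).strip()


sample = "The news-\npaper  reported\nthe fire."  # invented sample text
print(clean_ocr(sample))  # -> The newspaper reported the fire.
```

Note that this heuristic also removes the hyphen from genuine compounds that happen to break at the end of a line, so it may be worth checking results against a word list.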
Constellate (JSTOR and Portico)
Constellate is both a data source and a Jupyter-based platform for TDM analysis. Content includes JSTOR content, Portico content, Chronicling America, and more. You can download up to 25k documents at a time. There are some restrictions related to copyright. The Jupyter-based platform is only available to participating institutions. Note: UC Berkeley is not a participating institution at this time, and the platform will be sunset on July 1, 2025.

Digital Scholar Lab (Gale)
Build textual content sets from Gale primary source collections for data visualization and text data mining. Users will need to click Sign In and log in with their @berkeley.edu account.
Primary source collections include: 17th and 18th Century Burney Collection, American Civil Liberties Union Papers, 1912-1990, American Fiction, Archives Unbound, Archives of Sexuality & Gender, British Library Newspapers, The Economist Historical Archive, Eighteenth Century Collections Online, Indigenous Peoples: North America, The Making of Modern Law, The Making of the Modern World, Nineteenth Century Collections Online, Nineteenth Century U.S. Newspapers, Sabin Americana, 1500-1926, The Times Digital Archive, The Times Literary Supplement Historical Archive, U.S. Declassified Documents Online

Documenting the American South Digital Collections
Multiple collections of digitized primary sources related to southern history, literature, and culture. Some collections offer plain-text downloads in their entirety: The Church in the Southern Black Community, First-Person Narratives of the American South, Library of Southern Literature, North American Slave Narratives.
HathiTrust Research Center (HTRC)
Nearly 14 million books from the HathiTrust Library are currently available for analysis, offering various levels of immediate access. Check the HTRC tab on this guide for more information to help you get started.
Los Angeles Sentinel (1934-2005)
ProQuest Historical Newspaper data for the Los Angeles Sentinel (1934-2005). The content is OCR text (the output of automated Optical Character Recognition), and its quality varies.
Old Bailey Online
The Proceedings of the Old Bailey (1674-1913) and the Ordinary of Newgate's Accounts (1676-1772), containing records of 197,745 criminal trials held at London's central criminal court and biographical details of approximately 2,500 men and women executed at Tyburn. Use the site API or download XML files.
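If you work from the downloaded XML files, a first step is usually to split each file into individual trials. The sketch below uses Python's standard `xml.etree.ElementTree`. It assumes that each trial is a `div1` element with `type="trialAccount"` and an `id` attribute; check this against the files you download. The sample document, its IDs, and the `trial_texts` helper are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented, heavily simplified sample. Real files are far larger and the
# markup may differ from what is assumed here.
SAMPLE = """<TEI.2><text><body>
<div1 type="trialAccount" id="t-example-1"><p>First example trial text.</p></div1>
<div1 type="frontMatter" id="f-example"><p>Front matter.</p></div1>
<div1 type="trialAccount" id="t-example-2"><p>Second example <rs>trial</rs> text.</p></div1>
</body></text></TEI.2>"""


def trial_texts(xml_string: str) -> dict[str, str]:
    """Map each trial id to its plain text, ignoring non-trial sections."""
    root = ET.fromstring(xml_string)
    trials = {}
    for div in root.iter("div1"):
        # Assumption: trials are marked with type="trialAccount".
        if div.get("type") == "trialAccount":
            # itertext() yields the text of the element and all descendants;
            # split/join normalizes the whitespace.
            trials[div.get("id")] = " ".join("".join(div.itertext()).split())
    return trials


for trial_id, text in trial_texts(SAMPLE).items():
    print(trial_id, "|", text)
```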
Project Gutenberg (Mirror sites)
Project Gutenberg hosts over 50k ebooks, most of which are older books in the public domain. If you want to download more than about 100 books/day, use one of the mirror sites listed from the link above.
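Plain-text Project Gutenberg files wrap each book in a header and a license footer, which are usually removed before text mining. The sketch below is one way to do that. It assumes the file contains marker lines of the form `*** START OF THE PROJECT GUTENBERG EBOOK ... ***` and `*** END OF THE PROJECT GUTENBERG EBOOK ... ***` (some older files say `THIS` instead of `THE`). Files with other marker wording are returned unchanged. `strip_gutenberg_boilerplate` is a hypothetical helper and the sample text is invented.

```python
import re

# Marker lines as assumed above; "THE" or "THIS" depending on the file's age.
START = re.compile(r"^\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.M)
END = re.compile(r"^\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK", re.M)


def strip_gutenberg_boilerplate(text: str) -> str:
    """Return the text between the START and END marker lines, if present."""
    start = START.search(text)
    end = END.search(text)
    begin = start.end() if start else 0
    finish = end.start() if end else len(text)
    return text[begin:finish].strip()


sample = (  # invented sample file
    "Header and license notes\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Chapter 1\nIt was a dark night.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License text\n"
)
print(strip_gutenberg_boilerplate(sample))
```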
San Francisco Chronicle Archive (1865 - 1922)
Request access from the D-Lab for ProQuest Historical Newspaper data for the San Francisco Chronicle (1865-1922). Note that the quality of the OCR (results from automated Optical Character Recognition) is quite low and varies from paper to paper.
Text Creation Partnership
The Text Creation Partnership (TCP) produces standardized, accurate, and faithful XML/SGML-encoded electronic text editions of early printed books. The texts were transcribed and marked up through manual keying, rather than optical character recognition (OCR), from millions of static page images in ProQuest's Early English Books Online, Gale Cengage's Eighteenth Century Collections Online, and Readex's Evans Early American Imprints. Raw transcripts are available for bulk download as zipped files for those wishing to do text mining or similar projects. https://textcreationpartnership.org/faq/#faq05
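For text mining, the encoded transcripts generally need to be reduced to plain text with the metadata header set aside. The sketch below assumes a well-formed XML file whose transcription sits inside a `text` element, as in TEI-style encodings. It matches element names without regard to namespace or case, because the exact schema of a given download is not assumed here. The sample document and the `body_text` helper are invented for illustration, and SGML files would need converting to XML first.

```python
import xml.etree.ElementTree as ET

# Invented sample in a TEI-like shape; real transcripts are much richer.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader><fileDesc><titleStmt><title>Example</title></titleStmt></fileDesc></teiHeader>
<text><body><div>
<head>To the Reader</head>
<p>Gentle reader, <hi>peruse</hi> this book.</p>
</div></body></text>
</TEI>"""


def local_name(tag: str) -> str:
    """Drop a '{namespace}' prefix from an ElementTree tag, if there is one."""
    return tag.rsplit("}", 1)[-1]


def body_text(xml_string: str) -> str:
    """Return the plain text of the first <text> element, skipping the header."""
    root = ET.fromstring(xml_string)
    for element in root.iter():
        if local_name(element.tag).lower() == "text":
            # Concatenate all nested text, then normalize whitespace.
            return " ".join("".join(element.itertext()).split())
    return ""


print(body_text(SAMPLE))  # -> To the Reader Gentle reader, peruse this book.
```

Joining the text fragments without a separator keeps words intact when markup falls inside a word, such as a decorated initial letter. The cost is that adjacent elements with no whitespace between them run together.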

Available by Request:

Adam Matthew Digital (AM)
Contact the Library (tdm-access@berkeley.edu) to facilitate access to OCR text, metadata, or media files from any of Adam Matthew Digital's (AM) primary source databases. You can also directly request access via AM's Text and data mining form, though the Library is available to help answer many of the data security questions on the form.
