ArXiv Bulk Data
Full-text and metadata bulk downloads for open access scholarship in Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance, Statistics, Electrical Engineering and Systems Science, and Economics.
Awesome Public Datasets
Check out the Natural Language category for a list of text corpora and ngrams for text analysis.
Caselaw Access Project & CourtListener
The Caselaw Access Project (CAP) expands public access to U.S. law and contains over 360 years (going back to 1658) of published U.S. court decisions, digitized from the collection of the Harvard Law Library.
Full-page images and article images from the Chicago Defender under all its title variants from 1910-1975.
Chronicling America (Library of Congress)
The Chronicling America Historic American Newspapers collection provides access to select digitized newspaper pages produced by the National Digital Newspaper Program (NDNP), a partnership between the National Endowment for the Humanities (NEH) and the Library of Congress.
The Congress.gov API includes bills, amendments, summaries, Congress, members, the Congressional Record, committee reports, nominations, treaties, and House Communications. Over time we will be adding hearing transcripts and Senate Communications.
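As a rough sketch of how a request to this API might be assembled, the snippet below builds a URL for a bill listing. The base URL, the `/bill` path, and the `api_key`, `format`, `limit`, and `offset` query parameters are assumptions drawn from the public API documentation and should be checked against it. `DEMO_KEY` is a placeholder and `build_bill_list_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumed base URL for version 3 of the Congress.gov API; verify against the official docs.
BASE_URL = "https://api.congress.gov/v3"

def build_bill_list_url(api_key, congress=None, limit=20, offset=0):
    """Return a URL listing bills, optionally restricted to one Congress (hypothetical helper)."""
    path = "/bill" if congress is None else f"/bill/{congress}"
    query = urlencode(
        {"api_key": api_key, "format": "json", "limit": limit, "offset": offset}
    )
    return f"{BASE_URL}{path}?{query}"

# Placeholder key; a real key has to be requested from the API provider.
url = build_bill_list_url("DEMO_KEY", congress=118, limit=5)
print(url)
```

The URL can then be fetched with any HTTP client. Paging through results would mean increasing `offset` on each request.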
CORE: Open Access Research Papers
CORE provides a central API to access full content from tens of thousands of openly available scientific publications from thousands of OA repositories. They also provide full datasets by request.
Corpus Resource Database (CoRD)
CoRD provides links to and descriptions of a large number of linguistic corpora, subcorpora, and related databases.
Dataverse (UC Berkeley Library)
Discover data for research, teaching, and learning licensed by UC Berkeley. Find data across disciplines, preview metadata, and download files.
DeepMind Q&A Dataset (CNN & Daily Mail)
The datasets used for a deep learning project include 90,000 CNN articles and over 190,000 Daily Mail articles, downloaded from the Wayback Machine and available for bulk download.
The platform offers licensed access to datasets provided by corporate partners, covering a range of data types and subject areas.
Gale Digital Scholar Lab allows you to run simple text and data mining (TDM) analyses in your web browser using a wide range of primary source collections on various topics. These collections are licensed from Gale by the Library.
Documenting the American South Digital Collections
Multiple collections of digitized primary sources related to southern history, literature, and culture.
Encyclopaedia Britannica (1768-1860)
The complete digital edition of the Encyclopaedia Britannica from 1768-1860, available for bulk download in XML, image files, and/or plain-text.
UC Berkeley has licensed access to the following English Corpora datasets, which are available for download:
Elsevier (ScienceDirect, Scopus)
ScienceDirect eBooks, Reaxys, Compendex:
ScienceDirect eJournals:
SCOPUS:
FRASER API (U.S. economy, banking...)
Use this REST API to access full-text and metadata from FRASER, a digital library of U.S. economic, financial, and banking history—particularly the history of the Federal Reserve System.
Bulk data downloads of major US Government publications including Congressional Bills, Commerce Business Daily, Federal Register, Public Papers of the Presidents of the United States, Supreme Court Decisions 1937-1975 (FLITE) and more.
The General Index (Internet Archive / Public Resource)
The General Index is an open-access collection of n-grams (words and phrases) and metadata from over 100 million journal articles to support text mining. Articles include both open access and paywalled content, available here as derived data (not human-readable text).
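To make the idea of n-grams concrete, here is a minimal sketch that counts word n-grams in a short string. The tokenization rule (lowercased runs of letters, digits, and apostrophes) is an assumption chosen for illustration and is not necessarily how the General Index was built.

```python
import re
from collections import Counter

def ngrams(text, n):
    """Return the word n-grams in text as a list of tuples."""
    # Simple illustrative tokenizer: lowercase runs of letters, digits, apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sample = "Text mining supports research, and text mining scales."
counts = Counter(ngrams(sample, 2))

# The only bigram that occurs twice in the sample is ("text", "mining").
print(counts.most_common(1))
```

Counts like these are derived data: they support frequency analysis without reproducing the readable text of an article.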
The Guardian and The Observer Archive (1791-1909)
Access to facts, firsthand accounts, and opinions of the day about the most significant and fascinating political, business, sports, literary, and entertainment events from the Guardian and the Observer newspapers from 1791-1909.
The HathiTrust Research Center (HTRC)
HTRC provides computational research access to the HathiTrust Digital Library, a shared digital library with over 17 million volumes that's similar to Google Books, but focused on scholarly materials. Note: HTRC will cease operations at the end of 2026.
The Internet Archive encourages users to consume and repurpose metadata and media from their online library.
Inter-university Consortium for Political and Social Research (ICPSR)
ICPSR receives, processes, and distributes data on social phenomena in various countries. ICPSR maintains a data archive on topics in the social and behavioral sciences, including specialized collections from a wide range of fields.
JSTOR Text Analysis Support
JSTOR text analysis support accommodates text analysis and digital humanities research by providing full-text datasets for journals, books, research reports, and pamphlets on JSTOR.
The LexisNexis Web Services API (WSAPI) enables researchers to download and build text corpora from Nexis Uni (including many major world news sources) for further analysis.
Library of Congress: 25 million bibliographic metadata records
The LoC release of 25 million open access MARC records for free bulk download. MARC (Machine Readable Cataloging Records) is an international metadata standard for the representation and communication of bibliographic and related information.
Linguistic Data Consortium
Language data from written texts and transcriptions of speech, in various languages, to support corpus linguistics. If you don't see an LDC dataset in UC Berkeley Library's Dataverse, search the LDC catalog.
Los Angeles Sentinel (1934-2005)
ProQuest Historical Newspaper data for the Los Angeles Sentinel, 1934-2005.
API service that allows you to query online news sources from the past month including major publications such as the New York Times, ABC News, and Al Jazeera. Register for a free API key to get started.
NY Times APIs
The Article Search API provides access to headlines, abstracts, lead paragraphs and more (but NOT full-text articles) from the New York Times, 1851 to present.
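The sketch below shows one way a query to the Article Search API might be built and how the returned headline, abstract, and date fields could be read. The endpoint URL, the `q`, `begin_date`, `end_date`, `page`, and `api-key` parameters, and the `response.docs` layout are assumptions based on the public developer documentation. The helper names are hypothetical and the sample payload is made up for illustration.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint for the Article Search API; check the NYT developer docs.
SEARCH_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"

def build_search_url(query, api_key, begin_date=None, end_date=None, page=0):
    """Build a search URL; dates use the YYYYMMDD form (hypothetical helper)."""
    params = {"q": query, "page": page, "api-key": api_key}
    if begin_date:
        params["begin_date"] = begin_date
    if end_date:
        params["end_date"] = end_date
    return f"{SEARCH_URL}?{urlencode(params)}"

def summarize(payload):
    """Pull headline, abstract, and date out of a decoded JSON response."""
    docs = payload.get("response", {}).get("docs", [])
    return [
        {
            "headline": doc.get("headline", {}).get("main"),
            "abstract": doc.get("abstract"),
            "pub_date": doc.get("pub_date"),
        }
        for doc in docs
    ]

# Made-up payload in the assumed response shape, for illustration only.
sample = json.loads("""
{"status": "OK", "response": {"docs": [
  {"headline": {"main": "Example Headline"},
   "abstract": "Example abstract.",
   "pub_date": "1920-08-19T00:00:00+0000"}
]}}
""")

print(build_search_url("suffrage", "YOUR_KEY", begin_date="19200101"))
print(summarize(sample))
```

Because the API returns metadata rather than full text, a corpus built this way is limited to fields such as headlines, abstracts, and lead paragraphs.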
Old Bailey Online
The Proceedings of the Old Bailey (1674-1913) and the Ordinary of Newgate's Accounts (1676-1772), containing records of 197,745 criminal trials held at London's central criminal court, along with biographical details of approximately 2,500 men and women executed at Tyburn.
Online Congressional Record (Bound Edition)
The Congressional Record is the official record of the proceedings and debates of the United States Congress, 1873-present. Source: U.S. Government Publishing Office.
Downloadable datasets for citations drawn from two large academic graphs: Microsoft Academic Graph (MAG) and AMiner.
OpenAlex is a map of the world's research ecosystem, linking components (like papers, institutions, journals, topics, SDGs, authors, etc.) to one another.
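To illustrate what "linking components" can look like in practice, the sketch below walks one work record and lists its work, author, and institution connections. The field names (`display_name`, `authorships`, `author`, `institutions`) follow the work object as described in the OpenAlex documentation, which is an assumption to verify. The record itself is made up and `link_edges` is a hypothetical helper.

```python
def link_edges(work):
    """Return (work title, author name, institution name) triples for one work record."""
    edges = []
    for authorship in work.get("authorships", []):
        author = authorship.get("author", {}).get("display_name")
        for institution in authorship.get("institutions", []):
            edges.append(
                (work.get("display_name"), author, institution.get("display_name"))
            )
    return edges

# Made-up record in the assumed shape of an OpenAlex work object.
sample_work = {
    "display_name": "An Example Paper",
    "publication_year": 2020,
    "authorships": [
        {
            "author": {"display_name": "A. Author"},
            "institutions": [{"display_name": "Example University"}],
        },
        {
            "author": {"display_name": "B. Author"},
            "institutions": [
                {"display_name": "Example University"},
                {"display_name": "Example Institute"},
            ],
        },
    ],
}

for edge in link_edges(sample_work):
    print(edge)
```

Triples like these can be loaded into a graph library or a table to study collaboration patterns across institutions.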
PLOS (Public Library of Science) - allofplos Python package
Python package for downloading, updating, and maintaining a repository of all PLOS XML article files.
Project Gutenberg (Robot Access)
Project Gutenberg hosts over 50,000 ebooks, most of which are older books in the public domain.
ProQuest Congressional Record (1789-2005)
Congressional data derived from the Annals of Congress (1789-1824), Register of Debates (1824-1837), Congressional Globe (1833-1873), and Congressional Record (1873-2005).
Pubmed Article Datasets
Over four million full-text biomedical and life sciences journal articles from PubMed Central.
Many (but not all) of the digital archives that we subscribe to via Readex are available to explore using Voyant Tools.
San Francisco Chronicle (1865-1922)
Downloadable full-text corpus of the San Francisco Chronicle and its predecessor titles, the Daily Dramatic Chronicle and the Daily Morning Chronicle, covering 1865 to 1922.
Scottish Corpus of Text & Speech (1945 - Present)
The Scottish Corpora project has created large electronic corpora (over 4.5 million words) of written and spoken texts for the languages of Scotland. See also the Corpus of Modern Scottish Writing (1700-1945).
Springer eBooks, Springer Nature Protocols, Scientific American
Stanford Large Network Dataset Collection
The SNAP library has collected data on large social and information networks since 2004.
Write queries that compute the amount of time people appear and the amount of time words are heard in cable TV news. Data is compiled from the Internet Archive's collection of 24-7 recordings of CNN, Fox News, and MSNBC from January 1, 2010 to the present and is updated daily (with a lag of 24-36 hours after the original air date).
Text Creation Partnership (Early print books)
The Text Creation Partnership includes full texts of the following: Early English Books Online (ProQuest), Eighteenth Century Collections Online (Gale Cengage), and Evans Early American Imprints (Readex/Newsbank).
ProQuest Historical Newspaper data covering the study of colonial and post-colonial times, class and gender issues, religion, as well as international economics, international relations and cultural studies from 1838 to 2005.
LDC's TIPSTER corpus was compiled to advance the state of the art in effective document detection (information retrieval) and data extraction from large, real-world data collections. A list of sources included can be found at TIPSTER Complete.
TDM Studio (ProQuest)
TDM Studio includes (1) a virtual Workbench environment and (2) a browser-based Visualization dashboard to run text data mining analyses using ProQuest materials licensed by UC Berkeley. Includes ProQuest Historical Newspapers, Dissertations and Theses, many scholarly journals, as well as the Web of Science XML citation data.
TDM Studio Workbench
TDM Studio Visualization
Vogue Archive
Contains the entire run of Vogue magazine (US edition), from the first issue in 1892 to the current month, reproduced in high-resolution color page images. Every page, advertisement, cover and fold-out has been included, with rich indexing enabling you to find images by garment type, designer and brand names.
Web of Science XML Data
The Web of Science XML Data includes metadata from over 12,500 journals spanning over 250 science, social science and humanities disciplines. Data are available back to 1900 and include over 63 million article records and 1 billion cited references to date. Visit UC Berkeley Library's Dataverse to view the editions and date ranges included.
Monthly database backups of all Wikimedia wikis in various formats.
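One common way to work with such backups is to stream a compressed XML export rather than load it into memory. The sketch below assumes a bzip2-compressed MediaWiki XML export containing `<page>`, `<title>`, and `<text>` elements. The tiny sample document and its namespace URI are made up for illustration, and real dumps carry many more fields.

```python
import bz2
import io
import xml.etree.ElementTree as ET

def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]

def iter_pages(stream):
    """Yield (title, text) for each <page> element in an XML export stream."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "page":
            continue
        title, text = None, None
        for child in elem.iter():
            name = local(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text or ""
        yield title, text
        elem.clear()  # release the parsed page to keep memory use low

# Made-up two-page export, compressed in memory to stand in for a dump file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Example</title><revision><text>Hello, wiki.</text></revision></page>
  <page><title>Second</title><revision><text>More text.</text></revision></page>
</mediawiki>"""

compressed = io.BytesIO(bz2.compress(SAMPLE))
with bz2.open(compressed, "rb") as stream:
    pages = list(iter_pages(stream))

print(pages)
```

For a downloaded dump, the in-memory buffer would be replaced by the path to the `.bz2` file passed to `bz2.open`.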
Public streams provide access to public data flowing through Twitter. Suitable for following specific users or topics, and data mining. You can also access single-user streams, containing roughly all of the data corresponding with a single user’s view of Twitter.