Full-text and metadata bulk downloads for open access scholarship in Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance, Statistics, Electrical Engineering and Systems Science, and Economics.
Constellate is both a data source and a Jupyter-based platform for TDM analysis. Content includes JSTOR content, Portico content, Chronicling America, and more. You can download up to 25,000 documents at a time, subject to some copyright-related restrictions. The Jupyter-based platform is only available to participating institutions. Note: UC Berkeley is not a participating institution at this time and the platform will be sunset on July 1, 2025.
CORE provides a central API to access full content from tens of thousands of openly available scientific publications from thousands of OA repositories. They also provide full datasets by request.
Researchers can text mine UCB-subscribed journals and books published by Elsevier on the ScienceDirect full-text platform. Sign up for a developer account to use the Elsevier APIs for non-commercial purposes, and make sure to query the API from UC Berkeley IPs to ensure full access. You can also use the APIs to access citation data and abstracts from scholarly journals indexed by Scopus. For more information, see their Text Mining documentation.
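As a minimal sketch of what a full-text request might look like, the snippet below prepares (but does not send) a request for one article by DOI. The endpoint path, the `X-ELS-APIKey` header, and the `Accept` value are assumptions to verify against Elsevier's API documentation; the DOI and API key are placeholders, and `build_fulltext_request` is a hypothetical helper name.

```python
import urllib.parse
import urllib.request

# Assumed base URL for article retrieval by DOI; confirm in Elsevier's API docs.
ELSEVIER_ARTICLE_BASE = "https://api.elsevier.com/content/article/doi/"


def build_fulltext_request(doi: str, api_key: str) -> urllib.request.Request:
    """Prepare, without sending, a full-text request for a single DOI."""
    # Percent-encode the DOI but keep the slash between prefix and suffix.
    url = ELSEVIER_ARTICLE_BASE + urllib.parse.quote(doi, safe="/")
    headers = {
        "X-ELS-APIKey": api_key,  # key from your developer account (assumed header name)
        "Accept": "text/xml",     # ask for XML full text (assumed media type)
    }
    return urllib.request.Request(url, headers=headers)


if __name__ == "__main__":
    # Placeholder DOI and key, for illustration only.
    request = build_fulltext_request("10.1016/j.example.2020.01.001", "YOUR_API_KEY")
    print(request.full_url)
    # To fetch the article, send the request from a UC Berkeley IP address:
    # with urllib.request.urlopen(request) as response:
    #     xml_text = response.read().decode("utf-8")
```

Keeping request construction separate from sending makes it easy to inspect the URL and headers before making any calls that count against an API quota.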
Python tool for downloading/updating/maintaining a repository of all PLOS XML article files. Use this program to download all PLOS XML article files instead of doing web scraping. See also: PLOS APIs to query content from the seven open-access peer-reviewed journals from the Public Library of Science using any of the twenty-three terms in the PLOS Search.
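For targeted queries rather than a full download, a search URL can be assembled from a field name and a term. This is a sketch only: the endpoint URL and the `q`, `fl`, `wt`, and `rows` parameters are assumptions based on a Solr-style search interface and should be checked against the PLOS API documentation; `build_plos_query` is a hypothetical helper name.

```python
import urllib.parse

# Assumed search endpoint; confirm in the PLOS API documentation.
PLOS_SEARCH_ENDPOINT = "https://api.plos.org/search"


def build_plos_query(field, term, fields_to_return=("id", "title"), rows=10):
    """Build a search URL that looks for `term` within one search field."""
    params = {
        "q": f'{field}:"{term}"',           # fielded query, e.g. title:"text mining"
        "fl": ",".join(fields_to_return),   # which fields to return for each hit
        "wt": "json",                       # response format
        "rows": rows,                       # number of results per request
    }
    return PLOS_SEARCH_ENDPOINT + "?" + urllib.parse.urlencode(params)


if __name__ == "__main__":
    print(build_plos_query("title", "text mining"))
```

The `field` argument would be one of the search terms listed in the PLOS Search documentation; `title` is used here purely as an illustration.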
ProQuest provides limited access to an "XML Gateway/Z39.50 Federated Search interface" that allows individual researchers to program searches across some, but not all, ProQuest databases and get the result sets back as XML. Login and password access is available only by request at tdm-access@berkeley.edu.
"Individual researchers are encouraged to download subscription and open access content for TDM purposes directly from the SpringerLink platform. No registration or API key is required. Full-text content can be accessed easily and programmatically at friendly URLs based on the content’s Digital Object Identifier (DOI)." via Springer's Text and Data Mining Policy.
Data for Research allows you to download word frequencies, citations, key terms, and ngrams for up to 25,000 JSTOR articles at a time, or to easily submit requests for larger sets of articles. See also:
Request a Dataset: if you need more documents than Constellate can provide, you can request a custom dataset.
JSTOR DfR on GitHub: a number of Python and R packages for working with JSTOR DfR data.
JSTOR's Text Analyzer, a reverse search engine that analyzes documents that you upload (your own, or other articles) to find related materials in JSTOR.
Public domain and OA datasets include full OCR text from early journals and current academic press open access titles.