Library Guides: Text Mining and AI Research Resources: Resources

arXiv Bulk Data

arXiv Bulk Data
Full-text and metadata bulk downloads for open access scholarship in Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance, Statistics, Electrical Engineering and Systems Science, and Economics.

Format and access points: arXiv provides several API endpoints for metadata, and a number of full-text repositories for PDFs and LaTeX.
TDM/AI Use: The resource is not licensed by the UC Berkeley Library. Please check the provider's website for terms and conditions regarding TDM and AI.
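
As an illustration of the metadata API, the sketch below builds a query URL and parses an Atom response using only the Python standard library. The base URL and the `search_query`, `start`, and `max_results` parameters reflect arXiv's public API as understood here and should be checked against the provider's documentation; the helper names and the sample feed are hypothetical, and no network request is made.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Assumed base URL for the arXiv metadata API; verify against arXiv's documentation.
ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"


def build_query_url(search_query, start=0, max_results=10):
    """Return a query URL for the (assumed) arXiv API endpoint."""
    params = {"search_query": search_query, "start": start, "max_results": max_results}
    return ARXIV_API + "?" + urllib.parse.urlencode(params)


def parse_titles(atom_xml):
    """Extract entry titles from an Atom feed, collapsing internal whitespace."""
    root = ET.fromstring(atom_xml)
    return [
        " ".join(entry.findtext(ATOM + "title", default="").split())
        for entry in root.iter(ATOM + "entry")
    ]


# Hypothetical, heavily trimmed response used only to exercise the parser.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>An Example
   Title</title></entry>
  <entry><title>Second Paper</title></entry>
</feed>"""

if __name__ == "__main__":
    print(build_query_url("all:electron", max_results=5))
    print(parse_titles(SAMPLE_FEED))
```

For large harvests, the bulk full-text repositories mentioned above are likely a better fit than repeated API queries.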

Awesome Public Datasets

Awesome Public Datasets
Check out the Natural Language category for a list of text corpora and ngrams for text analysis.

TDM/AI Use: These resources are not licensed by the UC Berkeley Library. Please check the provider's websites for terms and conditions regarding TDM and AI.

Caselaw Access Project & CourtListener

Caselaw Access Project & CourtListener

The Caselaw Access Project (CAP) expands public access to U.S. law and contains over 360 years (going back to 1658) of published U.S. court decisions, digitized from the collection of the Harvard Law Library.

Access points: CourtListener provides APIs for accessing CAP data, and CAP provides bulk download options.
TDM/AI Use: These resources are not licensed by the UC Berkeley Library. Please check the provider's websites for terms and conditions regarding TDM and AI.

Chicago Defender (1910-1975)

Chicago Defender (1910-1975)

Full-page images and article images from the Chicago Defender under all its title variants from 1910-1975.

Format: The collection includes digital reproductions of every page from every issue in PDF format.
Access Point: Many newspapers and news sources can be accessed through UC Berkeley Library's TDM Platforms and APIs.
TDM/AI Use: TDM and information extraction are permitted for personal use, provided downloading is not systematic and does not create a comprehensive (or nearly comprehensive) collection. Use of AI within TDM is not prohibited.

Chronicling America (Library of Congress)

Chronicling America (Library of Congress)

The Chronicling America Historic American Newspapers collection provides access to select digitized newspaper pages produced by the National Digital Newspaper Program (NDNP), a partnership between the National Endowment for the Humanities (NEH) and the Library of Congress.

TDM/AI Use: These resources are not licensed by the UC Berkeley Library. Please check the provider's websites for terms and conditions regarding TDM and AI.
Help: For search tips, frequently asked questions, and more, please visit Chronicling America: A Guide for Researchers.

Congress.gov API

Congress.gov API

The Congress.gov API includes bills, amendments, summaries, Congress, members, the Congressional Record, committee reports, nominations, treaties, and House Communications. The provider plans to add hearing transcripts and Senate Communications over time.

Access Point: Sign up for an API key from api.data.gov that you can use to access web services provided by Congress.gov.
TDM/AI Use: These resources are not licensed by the UC Berkeley Library. Please check the provider's websites for terms and conditions regarding TDM and AI.
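
Once you have a key, requests are ordinary HTTPS calls. The sketch below only assembles a request URL; the base URL `https://api.congress.gov/v3`, the `api_key` and `format` parameters, and the example path `bill/118/hr` are assumptions to verify against the Congress.gov API documentation, and `congress_url` is a hypothetical helper.

```python
import urllib.parse

# Assumed base URL for version 3 of the Congress.gov API.
BASE = "https://api.congress.gov/v3"


def congress_url(path, api_key, **params):
    """Build a request URL for an (assumed) Congress.gov API path.

    The key is passed as a query parameter and JSON output is requested.
    """
    query = {"api_key": api_key, "format": "json", **params}
    return f"{BASE}/{path.strip('/')}?{urllib.parse.urlencode(query)}"


if __name__ == "__main__":
    # "YOUR_KEY" is a placeholder for the key issued by api.data.gov.
    print(congress_url("bill/118/hr", "YOUR_KEY", limit=5))
```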

CORE: Open Access Research Papers

CORE: Open Access Research Papers

CORE provides a central API to access full content from tens of thousands of openly available scientific publications from thousands of OA repositories. They also provide full datasets by request.

TDM/AI Use: These resources are not licensed by the UC Berkeley Library. Please check the provider's websites for terms and conditions regarding TDM and AI.

Corpus Resource Database (CoRD)

Corpus Resource Database (CoRD)

CoRD provides links to and descriptions of a large number of linguistic corpora, subcorpora, and related databases.

TDM/AI Use: These resources are not licensed by the UC Berkeley Library. Please check the provider's websites for terms and conditions regarding TDM and AI.

Dataverse (UC Berkeley Library)

Dataverse (UC Berkeley Library)

Discover data for research, teaching, and learning licensed by UC Berkeley. Find data across disciplines, preview metadata, and download files.

Access Point: To view data files, log in using Calnet or LBNL authentication.
TDM/AI Use: Dataverse content includes all acquired data that has a license signed by UC Berkeley Library. Visit the 'Terms' section of each data set to learn about TDM and AI terms and conditions.
Help: If you cannot find the data you are looking for, please contact your subject librarian to identify a suitable alternative or suggest a purchase.

DeepMind Q&A Dataset (CNN & Daily Mail)

DeepMind Q&A Dataset (CNN & Daily Mail)
The datasets, originally used for a deep learning project, include 90,000 CNN articles and over 190,000 Daily Mail articles downloaded from the Wayback Machine and available for bulk download.

TDM/AI Use: The resource is not licensed by the UC Berkeley Library. Please check the provider's website for terms and conditions regarding TDM and AI.

Dewey Data

Dewey Data

The platform offers licensed access to datasets provided by corporate partners, covering a range of data types and subject areas.

Access Point: Self-administered; create an account with your UC Berkeley credentials.
For more information: Read the following Dewey Data Guide for the dataset list, terms & conditions, and FAQ. Some datasets are small enough to be downloaded locally, but most will require use of the Dewey Data Bulk API.
TDM/AI: TDM is allowed for academic research and other non-commercial educational purposes without having to obtain the Licensor's prior written consent. Use of AI within TDM is not prohibited.

Digital Scholar Lab (Gale)

Digital Scholar Lab (Gale)

Gale Digital Scholar Lab allows you to run simple text and data mining (TDM) analyses in your web browser using a wide range of primary source collections on various topics. These collections are licensed from Gale by the Library.

Access: Log in with your UC Berkeley credentials, and create a Digital Scholar Lab account using your UC Berkeley email. The Lab includes different TDM access modes:
- Cloud-Based: Web app, with the option to download up to 5,000 documents at a time for local analysis.
- Research Product: You can download both your results and your corpus (there are limits on how many documents you can download at once).
TDM/AI uses: UC Berkeley users may perform text and data mining for non-commercial research purposes. Authorized Users may download no more than 1000 documents per content set per session. Robots, spiders or other automated downloading programs, algorithms or devices are prohibited. Any snippets or document metadata must be accompanied by a digital object identifier (DOI) that links back to the original resource. If images or text excerpts are used, users must secure intellectual property or other rights for reuse from the rights holder to the extent needed (i.e. beyond fair use). AI is not prohibited by agreement, but may be limited by operation within the API.
Content notes: Among other notable collections, the Lab includes TDM access to American Fiction, British Library Newspapers, The Economist Historical Archive, Nineteenth Century U.S. Newspapers, and the (London) Times Digital Archive.
Help: Log in to Digital Scholar Center, then click on Learning Center.

Documenting the American South Digital Collections

Documenting the American South Digital Collections

Multiple collections of digitized primary sources related to southern history, literature, and culture.

Access points & format: Some collections offer zip files with plain-text files of the complete works, including the Church in the Southern Black Community, First-Person Narratives of the American South, Library of Southern Literature, and North American Slave Narratives.
TDM/AI Use: The resource is not licensed by the UC Berkeley Library. Please check the provider's website for terms and conditions regarding TDM and AI.

Encyclopaedia Britannica (1768-1860)

Encyclopaedia Britannica (1768-1860)
The complete digital edition of the Encyclopaedia Britannica from 1768-1860, available for bulk download in XML, image files, and/or plain-text.

TDM/AI Use: The resource is not licensed by the UC Berkeley Library. Please check the provider's website for terms and conditions regarding TDM and AI.

English-Corpora.org

English-Corpora.org

UC Berkeley has licensed access to the following English-Corpora.org datasets, which are available for download:

Collections:
Access Point: Visit UC Berkeley Library's Dataverse.
TDM/AI Use: TDM and AI uses are permitted. Online access to any of the seventeen corpora from English-Corpora.org is available; however, data downloads are not included with the current subscription.

Elsevier (ScienceDirect, Scopus)

Elsevier (ScienceDirect, Scopus)

ScienceDirect eBooks, Reaxys, Compendex:

TDM/AI Use: TDM is permitted. You may use this content with AI tools that you have developed and non-generative third-party-developed AI tools, provided they don't create competing products, disrupt functionality, or redistribute the content to third parties. However, for third-party-developed generative AI tools (e.g. ChatGPT, Claude, Copilot), you may not use content with the tool unless the AI tool operates in a closed/self-hosted environment, isn't trained on the subscribed content (unless within a University enterprise environment), and the tool doesn't share the content with third parties.

ScienceDirect eJournals:

Access Point: Sign up for a developer account to use the Elsevier APIs for non-commercial purposes, and make sure to query the API from UC Berkeley IPs to ensure full access.
TDM/AI Use: You may access Elsevier's ScienceDirect journals TDM service via API for internal use only. You may distribute the TDM outputs externally, including limited excerpts ("snippets") and bibliographic metadata that don't substitute for the original articles, provided they include proper attribution and DOI links where feasible. However, you cannot use the API or TDM services to create competing products, perform mining for third parties, or store any TDM outputs or content on any non-UC server (except in summary form and as needed for research replication or publication purposes). AI is not prohibited by agreement, but may be limited by operation within the API.
Help: For more information, see their Text Mining documentation. Note that their policy documentation may differ from rights we expressly negotiated for you.
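
As a rough illustration of the API access point, the sketch below prepares (but does not send) a full-text request for a single DOI. The endpoint path, the `X-ELS-APIKey` header, and the `Accept` value are assumptions based on Elsevier's developer documentation as understood here; `article_request` and the example DOI are hypothetical. Per the access note above, an actual request should be sent from a UC Berkeley IP address.

```python
import urllib.parse
import urllib.request

# Assumed endpoint for retrieving an article by DOI; confirm in Elsevier's developer portal.
ARTICLE_BY_DOI = "https://api.elsevier.com/content/article/doi/"


def article_request(doi, api_key, accept="text/xml"):
    """Prepare a request object for one article. Nothing is sent here."""
    url = ARTICLE_BY_DOI + urllib.parse.quote(doi, safe="/")
    headers = {"X-ELS-APIKey": api_key, "Accept": accept}
    return urllib.request.Request(url, headers=headers)


if __name__ == "__main__":
    # Placeholder DOI and key, for illustration only.
    req = article_request("10.1016/j.example.2020.01.001", "YOUR_KEY")
    print(req.full_url)
    # To send it: urllib.request.urlopen(req), subject to the license terms above.
```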

SCOPUS:

TDM/AI Use: Use of SCOPUS differs for UC Berkeley vs. Lawrence Berkeley National Laboratory (LBNL). Please contact tdm-access@berkeley.edu.

FRASER API

FRASER API (U.S. economy, banking...)

Use this REST API to access full-text and metadata from FRASER, a digital library of U.S. economic, financial, and banking history—particularly the history of the Federal Reserve System.

TDM/AI Use: The resource is not licensed by the UC Berkeley Library. Please check the provider's website for terms and conditions regarding TDM and AI.

GovInfo: Bulk Data

GovInfo: Bulk Data

Bulk data downloads of major US Government publications including Congressional Bills, Commerce Business Daily, Federal Register, Public Papers of the Presidents of the United States, Supreme Court Decisions 1937-1975 (FLITE) and more.

Access point & format: XML files arranged into web directories by publication title.
TDM/AI Use: The resource is not licensed by the UC Berkeley Library. Please check the provider's website for terms and conditions regarding TDM and AI.

The General Index (Internet Archive)

The General Index (Internet Archive / Public Resource)

The General Index is an open-access collection of n-grams (words and phrases) and metadata from over 100 million journal articles to support text mining. Articles include both open access and paywalled content, available here as derived data (not human-readable text).

TDM/AI Use: The resource is not licensed by the UC Berkeley Library. Please check the provider's website for terms and conditions regarding TDM and AI.
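
For readers new to n-grams, the sketch below counts word n-grams in a short string. It uses a naive lowercase, whitespace-based tokenizer, which is an assumption made for illustration; the General Index's own tokenization and file format are not described here and may differ.

```python
from collections import Counter


def ngrams(text, n):
    """Count word n-grams in text using a naive lowercase/whitespace tokenizer."""
    tokens = text.lower().split()
    return Counter(
        " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


if __name__ == "__main__":
    counts = ngrams("the cat sat on the mat the cat", 2)
    print(counts.most_common(3))
```

Counts like these are derived data: they support frequency and trend analysis without reproducing the readable text of an article.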

The Guardian and The Observer Archive (1791-1909)

The Guardian and The Observer Archive (1791-1909)

Access to facts, firsthand accounts, and opinions of the day about the most significant and fascinating political, business, sports, literary, and entertainment events from the Guardian and the Observer newspapers from 1791-1909.

Format: Structured XML files of OCRed text
Access Point: Log in to UC Berkeley Library's Dataverse
TDM/AI Use: UC Berkeley users may perform text and data mining for non-commercial research purposes. Authorized Users may download no more than 1000 documents per content set per session. Robots, spiders or other automated downloading programs, algorithms or devices are prohibited. Any snippets or document metadata must be accompanied by a digital object identifier (DOI) that links back to the original resource. If images or text excerpts are used, users must secure intellectual property or other rights for reuse from the rights holder to the extent needed (i.e. beyond fair use). AI is not prohibited by agreement, but may be limited by operation within the API.

HathiTrust Research Center

The HathiTrust Research Center (HTRC)

HTRC provides computational research access to the HathiTrust Digital Library, a shared digital library with over 17 million volumes that's similar to Google Books, but focused on scholarly materials. Note: HTRC will cease operations at the end of 2026.

Format: The three primary modes of access to text in HTRC are datasets, data capsules, and text analysis algorithms & worksets.
- Text Analysis Algorithms and Worksets: Web-based, click-and-run tools that perform computational text analysis on worksets, which are user-created collections of volumes. No programming required.
- Data Capsules: HathiTrust Data Capsules are secure virtual environments for non-consumptive text analysis, where researchers can implement their own data analysis and visualization tools.
TDM/AI Use: These resources are not licensed by the UC Berkeley Library. Please check the provider's websites for terms and conditions regarding TDM and AI.

Internet Archive

Internet Archive Developers

The Internet Archive encourages users to consume and repurpose metadata and media from their online library.

Access point & format: Varies based on format and collection.
TDM/AI Use: These resources are not licensed by the UC Berkeley Library. Please check the provider's websites for terms and conditions regarding TDM and AI.
Help: See the Developers portal.

Inter-university Consortium for Political and Social Research (ICPSR)

Inter-university Consortium for Political and Social Research (ICPSR)

ICPSR receives, processes, and distributes data on social phenomena in various countries. ICPSR maintains a data archive on topics in the social and behavioral sciences, including specialized collections from a wide range of fields.

Access Point: Sign in through Google using your UC Berkeley email. Direct download access to data sets requires the creation of a personal account.
TDM/AI Use: These resources are not licensed by the UC Berkeley Library. Please check the provider's websites for terms and conditions regarding TDM and AI.
Help: For more information on this process, please consult the ICPSR Get Help page or email librarydataservices@berkeley.edu.

JSTOR Text Analysis Support

JSTOR Text Analysis Support
JSTOR text analysis support accommodates text analysis and digital humanities research by providing datasets of full-text for journals, books, research reports, and pamphlets on JSTOR.

Access points & format: Metadata for JSTOR content is available to download, and full text can be requested.
TDM/AI Use: Please consult JSTOR's Text Analysis Terms of Use. As of August 27, 2025: Terms of Use permit text and data mining for academic research, scholarship, and educational purposes, and allow you to utilize and share the results of your analysis in scholarly work. However, AI may not be used to the extent it is to build, train, fine-tune, or otherwise enhance an AI tool. Use of AI that does not improve, build, or fine-tune an AI tool is not prohibited.
Help: See JSTOR's text analysis guides for more information.

LexisNexis Web Services API

The LexisNexis Web Services API (WSAPI) enables researchers to download and build text corpora from Nexis Uni (including many major world news sources) for further analysis.

Access Point: Visit the following LexisNexis Web Services API Guide to learn how to gain access and for more information.
TDM/AI Uses: UC Berkeley users may conduct text and data mining, but they may not use any materials to train (or facilitate the training of) large language models, machine learning models, generative AI, or other similar technologies.
API limits:
- Searches will be scheduled to run over the weekend to minimize service disruption.
- Users may not initiate more than 749 searches per hour or retrieve more than 3,000 documents at a time.
Help: See the LexisNexis Web Services API guide. Contact consultants at the D-Lab if you would like assistance with R, Python, JSON, XML or API calls to be able to use this tool.
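
The limits above can be planned for before any request is made. The sketch below is plain arithmetic and does not call the LexisNexis API: it splits a result set into retrievals of at most 3,000 documents and estimates the minimum number of hours needed for a given number of searches. The function names are hypothetical.

```python
import math

MAX_SEARCHES_PER_HOUR = 749      # from the API limits listed above
MAX_DOCS_PER_RETRIEVAL = 3000    # from the API limits listed above


def retrieval_batches(total_docs, batch_size=MAX_DOCS_PER_RETRIEVAL):
    """Return (start, end) index pairs, end exclusive, each spanning at most batch_size documents."""
    return [
        (start, min(start + batch_size, total_docs))
        for start in range(0, total_docs, batch_size)
    ]


def min_hours_for_searches(n_searches, per_hour=MAX_SEARCHES_PER_HOUR):
    """Smallest whole number of hours over which n_searches stay within the hourly cap."""
    return math.ceil(n_searches / per_hour)


if __name__ == "__main__":
    print(retrieval_batches(7000))
    print(min_hours_for_searches(1500))
```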

Library of Congress: 25 million bibliographic metadata records

Library of Congress: 25 million bibliographic metadata records

The Library of Congress (LoC) has released 25 million open access MARC records for free bulk download. MARC (Machine-Readable Cataloging) is an international metadata standard for the representation and communication of bibliographic and related information.

Access point & format: Links to download UTF-8, MARC, and XML files are available on the site.
TDM/AI Use: These resources are not licensed by the UC Berkeley Library. Please check the provider's websites for terms and conditions regarding TDM and AI.
Help: Contact cdsinfo@loc.gov with questions.
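
If you work with the XML files, a record's title can be read from MARC field 245, subfield a. The sketch below assumes the files follow the MARCXML "slim" schema (namespace `http://www.loc.gov/MARC21/slim`); the sample record and function name are hypothetical, and real files are large enough that streaming with `xml.etree.ElementTree.iterparse` may be preferable.

```python
import xml.etree.ElementTree as ET

# Assumed namespace of the MARCXML "slim" schema.
MARC_URI = "http://www.loc.gov/MARC21/slim"
MARC_NS = {"marc": MARC_URI}


def titles_from_marcxml(xml_text):
    """Return the 245$a (title) value of each record in a MARCXML string."""
    root = ET.fromstring(xml_text)
    titles = []
    for record in root.iter(f"{{{MARC_URI}}}record"):
        sub = record.find(
            "marc:datafield[@tag='245']/marc:subfield[@code='a']", MARC_NS
        )
        if sub is not None and sub.text:
            titles.append(sub.text.strip())
    return titles


# Hypothetical minimal record used only to exercise the parser.
SAMPLE = """<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <datafield tag="245" ind1="1" ind2="0">
      <subfield code="a">Moby Dick /</subfield>
      <subfield code="c">Herman Melville.</subfield>
    </datafield>
  </record>
</collection>"""

if __name__ == "__main__":
    print(titles_from_marcxml(SAMPLE))
```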

Linguistic Data Consortium (LDC)

Linguistic Data Consortium
Language data from written texts and transcriptions of speech, in various languages, to support corpus linguistics. If you don't see an LDC dataset in UC Berkeley Library's Dataverse, search the LDC catalog.

Format: Varies
Access point: Visit UCB Library Dataverse
TDM/AI uses: TDM and AI use is permitted. All LDC datasets must be properly cited when used in scholarly outputs (presentations, papers, monographs, etc) and many LDC datasets have their own terms. Please check in Dataverse for specific terms.
Help: Email librarydataservices@berkeley.edu with any questions.

Los Angeles Sentinel (1934-2005)

Los Angeles Sentinel (1934-2005)

ProQuest Historical Newspaper data for the Los Angeles Sentinel, 1934-2005.

Access Point: Visit UCB Library Dataverse
TDM/AI Use: Authorized users may download and digitally copy reasonable portions of materials and may undertake extraction and manipulation of information for the purpose of academic, educational, or research uses within Fair Use. Use of AI is not prohibited. Note that some databases licensed from ProQuest are not available for TDM within ProQuest TDM Studio.

NewsAPI.org

NewsAPI.org

API service that allows you to query online news sources from the past month including major publications such as the New York Times, ABC News, and Al Jazeera. Register for a free API key to get started.

TDM/AI Use: The resource is not licensed by the UC Berkeley Library. Please check the provider's website for terms and conditions regarding TDM and AI.

NY Times APIs

NY Times APIs
The Article Search API provides access to headlines, abstracts, lead paragraphs and more (but NOT full-text articles) from the New York Times, 1851 to present.

Access Point: Visit Get-Started Steps.
TDM/AI Use: The UC Berkeley Library does not license the use of the resource. Please check on the provider's website for Terms of Use.
Help: Visit the FAQ
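
A minimal sketch of working with the Article Search API follows. The endpoint URL, the `q`, `begin_date`, `page`, and `api-key` parameters, and the `response.docs[].headline.main` and `lead_paragraph` fields are assumptions based on the NYT developer documentation as understood here; the helper names and sample response are hypothetical, and no request is sent.

```python
import json
import urllib.parse

# Assumed Article Search endpoint; confirm in the NYT developer portal.
NYT_SEARCH = "https://api.nytimes.com/svc/search/v2/articlesearch.json"


def search_url(query, api_key, begin_date=None, page=0):
    """Build an Article Search URL. begin_date is assumed to use YYYYMMDD."""
    params = {"q": query, "page": page, "api-key": api_key}
    if begin_date:
        params["begin_date"] = begin_date
    return NYT_SEARCH + "?" + urllib.parse.urlencode(params)


def summarize(response_json):
    """Return (headline, lead paragraph) pairs from a response body."""
    docs = json.loads(response_json).get("response", {}).get("docs", [])
    return [
        (doc.get("headline", {}).get("main"), doc.get("lead_paragraph"))
        for doc in docs
    ]


# Hypothetical response trimmed to the fields used above.
SAMPLE_RESPONSE = json.dumps({
    "response": {"docs": [
        {"headline": {"main": "Example Headline"}, "lead_paragraph": "Example lead."}
    ]}
})

if __name__ == "__main__":
    print(search_url("suffrage", "YOUR_KEY", begin_date="19200101"))
    print(summarize(SAMPLE_RESPONSE))
```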

Old Bailey Online

Old Bailey Online
The Proceedings of the Old Bailey (1674-1913) and the Ordinary of Newgate's Accounts (1676-1772) contain records of 197,745 criminal trials held at London's central criminal court, along with biographical details of approximately 2,500 men and women executed at Tyburn.

Access points and format: Use the Old Bailey API to access data, or bulk download the complete corpus XML files.
TDM/AI Use: The resource is not licensed by the UC Berkeley Library. Please check the provider's website for terms and conditions regarding TDM and AI.

Online Congressional Record (Bound Edition)

Online Congressional Record (Bound Edition)

The Congressional Record is the official record of the proceedings and debates of the United States Congress, 1873-present. Source: U.S. Government Publishing Office.

TDM/AI Use: The resource is not licensed by the UC Berkeley Library. Please check the provider's website for terms and conditions regarding TDM and AI.

Open Academic Graph

Open Academic Graph

Downloadable datasets for citations drawn from two large academic graphs: Microsoft Academic Graph (MAG) and AMiner.

TDM/AI Use: The resource is not licensed by the UC Berkeley Library. Please check the provider's website for terms and conditions regarding TDM and AI.

OpenAlex

OpenAlex

OpenAlex is a map of the world's research ecosystem, linking components (like papers, institutions, journals, topics, SDGs, authors, etc.) to one another.

Access point: Create a free account and use the OpenAlex API. Includes options for bulk data downloads.
TDM/AI Use: The resource is not licensed by the UC Berkeley Library. Please check the provider's website for terms and conditions regarding TDM and AI.
Help: See their technical documentation.
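
One detail that often surprises text miners: OpenAlex work records, as understood here, carry abstracts as an inverted index (each word mapped to its positions) in a field named `abstract_inverted_index`, rather than as plain text. The sketch below rebuilds readable text from such a mapping; the field name is an assumption to verify in the technical documentation, and the sample data is hypothetical.

```python
def abstract_from_inverted_index(inverted_index):
    """Rebuild text from a {word: [positions]} mapping."""
    by_position = {}
    for word, positions in inverted_index.items():
        for position in positions:
            by_position[position] = word
    # Emit words in position order, separated by single spaces.
    return " ".join(by_position[p] for p in sorted(by_position))


if __name__ == "__main__":
    sample = {"Text": [0], "mining": [1, 3], "is": [2]}
    print(abstract_from_inverted_index(sample))
```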

PLOS (Public Library of Science)

PLOS (Public Library of Science) - allofplos Python package

Python package for downloading, updating, and maintaining a repository of all PLOS XML article files.

Access point & format: Use this program to download all PLOS XML article files instead of web scraping.
TDM/AI Use: The resource is not licensed by the UC Berkeley Library. Please check the provider's website for terms and conditions regarding TDM and AI.
See also: PLOS APIs to query content from the seven open-access peer-reviewed journals from the Public Library of Science using any of the twenty-three terms in the PLOS Search.

Project Gutenberg

Project Gutenberg (Robot Access)

Project Gutenberg hosts over 50,000 ebooks, most of which are older books in the public domain.

Access point: To download more than about 100 books/day, you can set up a mirror site or use wget.
TDM/AI Use: The resource is not licensed by the UC Berkeley Library. Please check the provider's website for terms and conditions regarding TDM and AI.

ProQuest Congressional Record (1789-2005)

ProQuest Congressional Record (1789-2005)
Congressional data derived from the Annals of Congress (1789-1824), Register of Debates (1824-1837), Congressional Globe (1833-1873), and Congressional Record (1873-2005).

Format: These files were derived from one large, unstructured XML file obtained from ProQuest, which was parsed to create a single file for each day sorted into folders based on Congressional sessions, representing two-year spans. See Dataverse record for further documentation.
Access point: Visit UCB Library Dataverse
TDM/AI Use: Authorized users may download and digitally copy reasonable portions of materials and may undertake extraction and manipulation of information for the purpose of academic, educational, or research uses within Fair Use. Use of AI is not prohibited. Note that some databases licensed from ProQuest are not available for TDM within ProQuest TDM Studio.
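
Because the files are grouped by two-year Congresses, it can help to map a calendar year to a Congress number. The sketch below uses the common approximation that the 1st Congress began in 1789 and each Congress spans two years. It ignores the exact convening dates, so dates early in an odd-numbered year may belong to the previous Congress, and it does not reflect the actual folder names, which are documented in the Dataverse record.

```python
def congress_for_year(year):
    """Approximate Congress number in session for most of the given year."""
    if year < 1789:
        raise ValueError("The 1st Congress began in 1789.")
    return (year - 1789) // 2 + 1


def congress_span(number):
    """Approximate (start_year, end_year) covered by a Congress number."""
    start = 1789 + 2 * (number - 1)
    return start, start + 2


if __name__ == "__main__":
    print(congress_for_year(1873), congress_span(43))
```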

PubMed Article Datasets

PubMed Article Datasets
Over four million full-text biomedical and life sciences journal articles from PubMed Central.

Access point & format: Available in XML and plain text formats via Amazon Web Services.
TDM/AI Use: The resource is not licensed by the UC Berkeley Library. Please check the provider's website for terms and conditions regarding TDM and AI.

Readex Text Explorer

Readex Text Explorer (RTE)

Many (but not all) of the digital archives that we subscribe to via Readex are available to explore using Voyant Tools.

Access point: Choose the Text Explorer tab from any Readex database to launch an interface to allow you to analyze and visualize the texts.
TDM/AI uses: Readex databases embed Readex Text Explorer, which is the only permissible way to conduct TDM within Readex databases.

San Francisco Chronicle Archive (1865-1922)

San Francisco Chronicle (1865-1922)

Downloadable full-text corpus of the San Francisco Chronicle and its predecessor titles, the Daily Dramatic Chronicle and the Daily Morning Chronicle, covering 1865 to 1922.

Format: Each article is contained in a separate XML-encoded text file; quality of OCR text varies.
Access Point: Visit UCB Library Dataverse
TDM/AI Use: Authorized users may download and digitally copy reasonable portions of materials and may undertake extraction and manipulation of information for the purpose of academic, educational, or research uses within Fair Use. Use of AI is not prohibited.
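
With one XML file per article, a first processing step is usually to pull the plain text out of each file. The sketch below is deliberately schema-agnostic because the element names used in this corpus are not described here: it concatenates all text nodes in a file. The sample markup, the `*.xml` file pattern, and the function names are hypothetical.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def element_text(root):
    """Join every text node under root and collapse runs of whitespace."""
    return " ".join(" ".join(root.itertext()).split())


def article_text(xml_string):
    """Plain text of a single article supplied as an XML string."""
    return element_text(ET.fromstring(xml_string))


def iter_corpus(directory):
    """Yield (file name, plain text) for each .xml file directly inside directory."""
    for path in sorted(Path(directory).glob("*.xml")):
        yield path.name, element_text(ET.parse(path).getroot())


# Hypothetical markup; the real files may use different element names.
SAMPLE_ARTICLE = (
    "<article><title>City  News</title>"
    "<body><p>Fog over the bay.</p></body></article>"
)

if __name__ == "__main__":
    print(article_text(SAMPLE_ARTICLE))
```

Since OCR quality varies, expect to add cleanup steps (for example, filtering very short or garbled tokens) before analysis.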

Scottish Corpus of Text & Speech (1945-Present)

Scottish Corpus of Text & Speech (1945 - Present)

The Scottish Corpora project has created large electronic corpora (over 4.5 million words) of written and spoken texts for the languages of Scotland. See also the Corpus of Modern Scottish Writing (1700-1945).

TDM/AI Use: UC Berkeley Library does not license use of the resource. Please check on the provider's website for terms and conditions.

Springer Digital Content

Springer Digital Content

Access Point: No registration or API key is required. Full-text content can be accessed easily and programmatically at friendly URLs based on the content’s Digital Object Identifier (DOI).
TDM/AI Use: TDM is permitted and the content may be stored locally only for the duration of the TDM project and deleted thereafter. Use of licensed content with AI tools is not prohibited.
Help: To read more about text and data mining at Springer Nature, visit Springer's Text and Data Mining Policy. Note that we have negotiated express TDM terms for Authorized Users which control over any conflicting terms in Springer's general TDM Policy.

Springer eBooks, Springer Nature Protocols, Scientific American

TDM/AI Use: TDM is permitted. Use of licensed content with AI tools created by the Authorized User or with non-generative third-party AI tools is generally permitted provided that reasonable security measures are used and access to the underlying content is not disseminated; the resulting AI tool may be disseminated. For third-party generative AI, AI tools may be trained or fine-tuned only in self-hosted or closed environments where neither the AI tool nor the underlying content can be shared beyond the Authorized Users.

Stanford Large Network Dataset Collection

Stanford Large Network Dataset Collection

The SNAP library has collected data on large social and information networks since 2004.

TDM/AI Use: UC Berkeley Library does not license use of the resource. Please check on the provider's website for terms and conditions.

Stanford Cable TV Analyzer

Stanford Cable TV Analyzer

Write queries that compute the amount of time people appear and the amount of time words are heard in cable TV news. Data is compiled from the Internet Archive's collection of 24/7 recordings of CNN, Fox News, and MSNBC from January 1, 2010 to the present, and is updated daily (with a lag of 24-36 hours after the original air date).

TDM/AI Use: UC Berkeley Library does not license use of the resource. Please check on the provider's website for terms and conditions.

Text Creation Partnership

Text Creation Partnership (Early print books)

The Text Creation Partnership includes full texts of the following: Early English Books Online (ProQuest), Eighteenth Century Collections Online (Gale Cengage), and Evans Early American Imprints (Readex/Newsbank).

Access point & format:
TDM/AI Use: The resource is not licensed by the UC Berkeley Library. Please check the provider's website for terms and conditions regarding TDM and AI.

Times of India (1838-2005)

Times of India (1838-2005)

ProQuest Historical Newspaper data covering the study of colonial and post-colonial times, class and gender issues, religion, as well as international economics, international relations and cultural studies from 1838 to 2005.

Access Point: Visit UC Berkeley Library's Dataverse
TDM/AI Use: Authorized users may download and digitally copy reasonable portions of materials and may undertake extraction and manipulation of information for the purpose of academic, educational, or research uses within Fair Use. Use of AI is not prohibited. Note that some databases licensed from ProQuest are not available for TDM within ProQuest TDM Studio.

TIPSTER Complete

TIPSTER Complete

LDC's TIPSTER corpus was compiled to advance the state of the art in effective document detection (information retrieval) and data extraction from large, real-world data collections. A list of sources included can be found at TIPSTER Complete.

Format: The documents in the test collection are varied in style, size and subject domain.
Access Point: Log in to UC Berkeley Library's Dataverse to download files.
TDM/AI Use: Users may extract and compile from locally-loaded copies for teaching, learning, and research. Summaries, analyses, and interpretations of the linguistic properties of the Information may be derived and published in a scientific or technical context, and shall not infringe the rights of any third party including the authors and publishers.

TDM Studio (ProQuest)

TDM Studio (ProQuest)
TDM Studio includes (1) a virtual Workbench environment and (2) a browser-based Visualization dashboard for running text and data mining analyses on ProQuest materials licensed by UC Berkeley. Content includes ProQuest Historical Newspapers, Dissertations and Theses, many scholarly journals, and the Web of Science XML citation data.

Access point: Sign up for an account with your UC Berkeley email.
TDM/AI uses: UC Berkeley users may create derived data from the text content discovered within TDM studio. This includes summaries of, and extracts from, editorial content and metadata. Derived data cannot be considered a substitute for reading the full-text story and may only be used for teaching, learning, research and analysis. Note that some databases licensed from ProQuest are not available for TDM within ProQuest TDM Studio.
Help: https://proquest.libguides.com/tdmstudio

TDM Studio Workbench

Cloud-Based: Virtual environment equipped with Jupyter notebooks
Coding Skills: Requires Python or R knowledge
Research Product: You can download results and derived data in limited quantities, but you cannot download a full corpus for local analysis.

TDM Studio Visualization

Cloud-Based: Web-based tool
Coding Skills: None needed
Research Product: You can download your visualization results, but you cannot download the data.

Vogue Archive

Vogue Archive
Contains the entire run of Vogue magazine (US edition), from the first issue in 1892 to the current month, reproduced in high-resolution color page images. Every page, advertisement, cover and fold-out has been included, with rich indexing enabling you to find images by garment type, designer and brand names.

Format: XML and JPEG files
Access point: Log in to UCB Library Dataverse
TDM/AI Use: Authorized users may download and digitally copy reasonable portions of materials and may undertake extraction and manipulation of information for the purpose of academic, educational, or research uses within Fair Use. Use of AI is not prohibited. Note that some databases licensed from ProQuest are not available for TDM within ProQuest TDM Studio.

Web of Science XML Data

Web of Science XML Data

The Web of Science XML Data includes metadata from over 12,500 journals spanning over 250 science, social science and humanities disciplines. Data are available back to 1900 and include over 63 million article records and 1 billion cited references to date. Visit UC Berkeley Library's Dataverse to view the editions and date ranges included.

Access Point: Access the data by visiting UC Berkeley Library's Dataverse, ProQuest TDM Studio (files are directly exported to your computer), or Savio, the campus computing cluster.
TDM/AI Use: Please email librarydataservices@berkeley.edu for API access to download files and for Terms of Use.
Help: Web of Science Core Collection XML User Guide
- Clarivate GitHub
- University of Chicago Knowledge Lab WoS Builder: create and populate a MySQL database from Web of Science XML data

Wikipedia Data Dumps

Wikipedia Data Dumps

Monthly database backups of all Wikimedia wikis in various formats.

TDM/AI Use: The resource is not licensed by the UC Berkeley Library. Please check the provider's website for terms and conditions regarding TDM and AI.

X Developer Platform

X Developer Platform

Public streams provide access to public data flowing through Twitter. Suitable for following specific users or topics, and data mining. You can also access single-user streams, containing roughly all of the data corresponding with a single user’s view of Twitter.

TDM/AI Use: The resource is not licensed by the UC Berkeley Library. Please check the provider's website for terms and conditions regarding TDM and AI.
