Data Sources for MIDS Capstone Projects: Text & Data Mining Platforms

A guide to research data available to UC Berkeley students

Text & Data Mining at UC Berkeley

Below are the major text and data mining platforms we offer at UC Berkeley. Because automated scraping of subscription websites is always prohibited, these platforms provide an authorized way to analyze large amounts of text. Learn more on the Library's Text Mining and Computational Text Analysis Guide.

TDM Studio (ProQuest)

TDM Studio Workbench

  • Cloud-Based: virtual environment equipped with Jupyter notebooks
  • Coding Skills: Requires Python or R knowledge
  • Content Available: Most materials licensed by ProQuest that the Library subscribes to, including ProQuest Historical Newspapers, Dissertations and Theses, many scholarly journals, and more.
  • Research Product: You can download your results, but not your corpus.
  • Access: Sign up for an account
  • Help: https://proquest.libguides.com/tdmstudio

TDM Studio Visualization

  • Cloud-Based: web-based tool
  • Coding Skills: none needed
  • Content Available: Most materials licensed by ProQuest that the Library subscribes to, including ProQuest Historical Newspapers, Dissertations and Theses, many scholarly journals, and more.
  • Research Product: You can download your results, but not the underlying data.
  • Access: Create an account with your UC Berkeley email address here.
  • Help: https://proquest.libguides.com/tdmstudio

Digital Scholar Lab (Gale)

  • Cloud-Based: Web app, with the option to download up to 5,000 documents at a time for local analysis.
  • Coding Skills: None required. Out-of-the-box tools are: document clustering, named entity recognition, ngrams, parts of speech analysis, sentiment analysis, and topic modeling. 
  • Content Available: Most materials licensed by Gale that the Library subscribes to, including 17th and 18th Century Burney Collection, American Civil Liberties Union Papers, 1912-1990, American Fiction, Archives Unbound, Archives of Sexuality & Gender, British Library Newspapers, The Economist Historical Archive, Eighteenth Century Collections Online, Indigenous Peoples: North America, The Making of Modern Law, The Making of the Modern World, Nineteenth Century Collections Online, Nineteenth Century U.S. Newspapers, Sabin Americana, 1500-1926, The Times Digital Archive, The Times Literary Supplement Historical Archive, U.S. Declassified Documents Online.
  • Research Product: You can download both your results and your corpus (there are limits on how many documents you can download at once).
  • Access: Self-administered: simply log in with your UC Berkeley credentials, and then create a Digital Scholar Lab account using your @berkeley.edu address.
  • Help: Log in to Digital Scholar Lab, then click on Learning Center.

Note: If you prefer to run analyses on your own computer using your own code, you can download up to 5,000 documents at a time.
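
As a rough illustration of local analysis on a downloaded corpus, the sketch below counts word n-grams across a folder of plain-text files. It assumes the documents have been saved as .txt files in a local folder; the folder name "dsl_export" and the helper names are hypothetical and not part of any Gale tool, so adjust them to match your actual download.

```python
from collections import Counter
from pathlib import Path
import re


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def count_ngrams(texts, n=2):
    """Count word n-grams across an iterable of document strings."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        # Zipping n staggered views of the token list yields each n-gram.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


def read_corpus(folder):
    """Yield the contents of every .txt file in `folder` (assumed layout)."""
    for path in sorted(Path(folder).glob("*.txt")):
        yield path.read_text(encoding="utf-8", errors="replace")


if __name__ == "__main__":
    # "dsl_export" is a placeholder folder name, not a platform default.
    bigrams = count_ngrams(read_corpus("dsl_export"), n=2)
    for gram, freq in bigrams.most_common(10):
        print(" ".join(gram), freq)
```

This simple tokenizer keeps only unaccented Latin letters, which may be too coarse for historical or multilingual collections; a dedicated NLP library is likely a better fit for anything beyond a first look.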

About

What are TDM platforms?

  • Cloud-Based: Instead of downloading data, you work with data in the cloud or through a virtual environment. You don't have to store or manage data on your own computer. You run all analyses on the platform.
  • Coding Skills: Some platforms require you to know Python or R, while others do not require any coding knowledge.
  • Content Available: The data and texts available vary from one platform to another. Evaluate the content on each platform to determine if it fits your research needs.
  • Research Product: When you're done with your research, you download your results, but in most cases you do not download your corpus.