
Text Mining & Computational Text Analysis

About

What are TDM platforms?

  • Cloud-Based: Instead of downloading data, you work with data in the cloud or through a virtual environment. You don't have to store or manage data on your own computer. You run all analyses on the platform.
  • Coding Skills: Some platforms require you to know Python or R, while others do not require any coding knowledge.
  • Content Available: The data and texts available vary from one platform to another. Evaluate the content on each platform to determine whether it fits your research needs.
  • Research Product: When you're done with your research, you download your results, but in most cases you do not download your corpus.

TDM Studio (ProQuest)

TDM Studio Workbench

  • Cloud-Based: virtual environment equipped with Jupyter notebooks
  • Coding Skills: Requires Python or R knowledge
  • Content Available: Most materials licensed by ProQuest that the Library subscribes to, including ProQuest Historical Newspapers, Dissertations and Theses, many scholarly journals, and more.
  • Research Product: You can download your results, but not your corpus.
  • Access: Sign up for an account
  • Help: https://proquest.libguides.com/tdmstudio

TDM Studio Visualization

  • Cloud-Based: web-based tool
  • Coding Skills: none needed
  • Content Available: Most materials licensed by ProQuest that the Library subscribes to, including ProQuest Historical Newspapers, Dissertations and Theses, many scholarly journals, and more.
  • Research Product: You can download your results, but not the data.
  • Access: Create an account with your UC Berkeley email address.
  • Help: https://proquest.libguides.com/tdmstudio

Digital Scholar Lab (Gale)

  • Cloud-Based: Web app, with the option to download up to 5,000 documents at a time for local analysis.
  • Coding Skills: None required. Out-of-the-box tools are: document clustering, named entity recognition, ngrams, parts of speech analysis, sentiment analysis, and topic modeling. 
  • Content Available: Most materials licensed by Gale that the Library subscribes to, including 17th and 18th Century Burney Collection, American Civil Liberties Union Papers, 1912-1990, American Fiction, Archives Unbound, Archives of Sexuality & Gender, British Library Newspapers, The Economist Historical Archive, Eighteenth Century Collections Online, Indigenous Peoples: North America, The Making of Modern Law, The Making of the Modern World, Nineteenth Century Collections Online, Nineteenth Century U.S. Newspapers, Sabin Americana, 1500-1926, The Times Digital Archive, The Times Literary Supplement Historical Archive, U.S. Declassified Documents Online.
  • Research Product: You can download both your results and your corpus (there are limits on how many documents you can download at once).
  • Access: Self-administered: log in with your UC Berkeley credentials, then create a Digital Scholar Lab account using your @berkeley.edu address.
  • Help: Log in to Digital Scholar Center, then click on Learning Center.

Note: If you prefer to run analyses on your own computer using your own code, you can download up to 5,000 documents at a time.
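As one illustration of what local analysis might look like, the sketch below counts ngrams (one of the out-of-the-box tools listed above) using only Python's standard library. It is a minimal example, not Gale's implementation: the function names and the tokenization rule are assumptions chosen for brevity, and it expects the document texts as plain strings.

```python
import re
from collections import Counter
from typing import Iterable


def tokenize(text: str) -> list[str]:
    # Assumption: lowercase the text and keep runs of letters and apostrophes.
    # Real projects often need a more careful tokenizer.
    return re.findall(r"[a-z']+", text.lower())


def ngram_counts(texts: Iterable[str], n: int = 2) -> Counter:
    """Count n-grams (as tuples of tokens) across a collection of texts."""
    if n < 1:
        raise ValueError("n must be at least 1")
    counts: Counter = Counter()
    for text in texts:
        tokens = tokenize(text)
        # Zip n staggered copies of the token list to slide a window of length n.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts
```

To apply this to downloaded documents, you could read each file's text into a list of strings and pass that list to `ngram_counts`, then call `.most_common(20)` on the result to see the most frequent ngrams.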

LexisNexis Web Services API

The LexisNexis Web Services API (WSAPI) is a subscription service that enables researchers to download text from the subscribed Nexis Uni collection and build corpora for further analysis. The WSAPI is provided by the UC Berkeley Library.

 

Tool limitations:

  • Searches will be scheduled to run over the weekend to minimize service disruption.

  • You may not initiate more than 749 searches per hour or retrieve more than 3,000 documents at a time.
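A script that calls the API needs to stay within these limits. The sketch below shows one way to do that on the client side, assuming a sliding one-hour window. It does not call the WSAPI itself; `HourlyThrottle` and `page_ranges` are hypothetical helper names, and only the two limit values come from this guide.

```python
import time
from collections import deque
from typing import Callable, Deque, Iterator, Tuple

MAX_SEARCHES_PER_HOUR = 749   # limit stated in this guide
MAX_DOCS_PER_REQUEST = 3000   # limit stated in this guide


class HourlyThrottle:
    """Block before a call if it would exceed max_calls within the window."""

    def __init__(
        self,
        max_calls: int = MAX_SEARCHES_PER_HOUR,
        window_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._max_calls = max_calls
        self._window = window_seconds
        self._clock = clock
        self._sleep = sleep
        self._calls: Deque[float] = deque()  # timestamps of recent calls

    def wait(self) -> None:
        while True:
            now = self._clock()
            # Forget calls that have aged out of the window.
            while self._calls and now - self._calls[0] >= self._window:
                self._calls.popleft()
            if len(self._calls) < self._max_calls:
                break
            # Sleep until the oldest recorded call leaves the window.
            self._sleep(self._window - (now - self._calls[0]))
        self._calls.append(now)


def page_ranges(
    total: int, size: int = MAX_DOCS_PER_REQUEST
) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) index pairs covering `total` documents in chunks."""
    for start in range(0, total, size):
        yield start, min(start + size, total)
```

In use, you would call `throttle.wait()` immediately before each search request and loop over `page_ranges(total_hits)` to retrieve results in chunks of at most 3,000 documents. How the API actually pages through results is described in the LexisNexis Web Services API guide, not here.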

 

Skills required to use the API:

Please contact consultants at the D-Lab if you would like assistance with any of the requirements below.

  • Knowledge of R or Python, JSON, and XML

  • Familiarity with API calls

Please visit the LexisNexis Web Services API guide for more information. 

HathiTrust Data Capsules

Data Capsules

HathiTrust Data Capsules are secure virtual environments for non-consumptive text analysis, where researchers can implement their own data analysis and visualization tools.

In other words, you log into a virtual machine where you will have access to OCRed texts from the HathiTrust Digital Library. You can run your own analyses on this data. You export your results, but not the corpus itself.

Anyone can use a Data Capsule to work with public domain materials. In addition, because UC Berkeley is a HathiTrust member, UC Berkeley researchers can include in-copyright material in their corpus.