Text Mining & Computational Text Analysis

About

What are TDM platforms?

  • Cloud-Based: Instead of downloading data, you work with data in the cloud or through a virtual environment. You don't have to store or manage data on your own computer. You run all analyses on the platform.
  • Coding Skills: Some platforms require you to know Python or R, while others do not require any coding knowledge.
  • Content Available: The data and texts available vary from one platform to another. Evaluate the content on each platform to determine whether it fits your research needs.
  • Research Product: When you're done with your research, you download your results, but in most cases you do not download your corpus.

TDM Studio (ProQuest)

TDM Studio Workbench

  • Cloud-Based: virtual environment equipped with Jupyter notebooks
  • Coding Skills: Requires Python or R knowledge
  • Content Available: Most materials licensed by ProQuest that the Library subscribes to, including ProQuest Historical Newspapers, Dissertations and Theses, many scholarly journals, and more.
  • Research Product: You can download your results, but not your corpus.
  • Access: Sign up for an account
  • Help: https://proquest.libguides.com/tdmstudio

TDM Studio Visualization

  • Cloud-Based: web-based tool
  • Coding Skills: none needed
  • Content Available: Most materials licensed by ProQuest that the Library subscribes to, including ProQuest Historical Newspapers, Dissertations and Theses, many scholarly journals, and more.
  • Research Product: You can download your results, but not the data.
  • Access: Create an account with your UC Berkeley email address.
  • Help: https://proquest.libguides.com/tdmstudio

Digital Scholar Lab (Gale)

  • Cloud-Based: Web app, with the option to download up to 5,000 documents at a time for local analysis.
  • Coding Skills: None required. The out-of-the-box tools are document clustering, named entity recognition, ngrams, part-of-speech analysis, sentiment analysis, and topic modeling.
  • Content Available: Most materials licensed by Gale that the Library subscribes to, including 17th and 18th Century Burney Collection; American Civil Liberties Union Papers, 1912-1990; American Fiction; Archives Unbound; Archives of Sexuality & Gender; British Library Newspapers; The Economist Historical Archive; Eighteenth Century Collections Online; Indigenous Peoples: North America; The Making of Modern Law; The Making of the Modern World; Nineteenth Century Collections Online; Nineteenth Century U.S. Newspapers; Sabin Americana, 1500-1926; The Times Digital Archive; The Times Literary Supplement Historical Archive; U.S. Declassified Documents Online.
  • Research Product: You can download both your results and your corpus (there are limits on how many documents you can download at once).
  • Access: Self-administered: log in with your UC Berkeley credentials, and then create a Digital Scholar Lab account using your @berkeley.edu address.
  • Help: Log in to Digital Scholar Lab, then click on Learning Center.

Note: If you prefer to run analyses on your own computer using your own code, you can download up to 5,000 documents at a time.
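As an illustration of what local analysis with your own code might look like, here is a minimal Python sketch that counts ngrams across a folder of downloaded documents. It assumes the documents have been saved as plain-text .txt files in a folder called my_corpus; the folder name, the file format, and the function names are all hypothetical and do not describe how Digital Scholar Lab actually packages its downloads.

```python
from collections import Counter
from pathlib import Path
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def count_ngrams(texts, n=2):
    """Count n-grams across an iterable of document strings."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        # Slide a window of length n over the token list.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


def read_corpus(folder):
    """Yield the contents of every .txt file in a folder (assumed layout)."""
    for path in sorted(Path(folder).glob("*.txt")):
        yield path.read_text(encoding="utf-8", errors="replace")


if __name__ == "__main__":
    # "my_corpus" is a placeholder folder name, not a path the platform creates.
    bigrams = count_ngrams(read_corpus("my_corpus"), n=2)
    for gram, freq in bigrams.most_common(10):
        print(" ".join(gram), freq)
```

Because downloads are capped per batch, a larger corpus would need several downloads saved into the same folder before running a script like this one.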

Nexis Data Lab (LexisNexis)

Nexis Data Lab (LexisNexis) allows you to do TDM research on materials licensed by LexisNexis that the Library subscribes to. Analyses can be conducted using either Python or R in Jupyter notebooks. Nexis Data Lab is offered by the Library.

  • Cloud-Based: virtual environment (workspace) equipped with Jupyter notebooks. Your account can have up to 6 workspaces.
  • Coding Skills: Requires Python or R knowledge
  • Content Available: Materials licensed by LexisNexis that the Library subscribes to, including newspapers, transcripts of video/audio news, company & financial information, and legal content. See a partial list. Note that these publications are NOT available: The New York Times (NDL does include NYT International), The New York Times Blogs, Wall Street Journal Abstracts, Information Base Abstracts, and Jane’s Defence Weekly. Your corpus is limited to 100,000 documents.
  • Research Product: You can download your results, but not your corpus.
  • Access: Sign up for an account via http://ucblib.link/ndl-request (limited seats available)
  • Help: https://www.lexisnexis.com/en-us/professional/academic/nexis-data-lab.page or email librarydataservices@berkeley.edu

HathiTrust Data Capsules

Data Capsules

HathiTrust Data Capsules are secure virtual environments for non-consumptive text analysis, where researchers can implement their own data analysis and visualization tools.

In other words, you log into a virtual machine where you will have access to OCRed texts from the HathiTrust Digital Library. You can run your own analyses on this data. You export your results, but not the corpus itself.

Anyone can use a data capsule to work with public domain materials. In addition, because UC Berkeley is a HathiTrust member, UC Berkeley researchers can also include material that is still in copyright in their corpus.