Library Guides: Text Mining & Computational Text Analysis: TDM Platforms

About

What are TDM platforms?

Cloud-Based: Instead of downloading data, you work with data in the cloud or through a virtual environment. You don't have to store or manage data on your own computer. You run all analyses on the platform.
Coding Skills: Some platforms require you to know Python or R, while others do not require any coding knowledge.
Content Available: The data and texts available vary from one platform to another. Evaluate the content on each platform to determine whether it fits your research needs.
Research Product: When you're done with your research, you download your results, but in most cases you do not download your corpus.

TDM Studio (ProQuest)

TDM Studio (ProQuest) allows you to do TDM research on materials licensed by ProQuest that the Library subscribes to. It offers two options for researchers. TDM Studio Workbench is a virtual environment equipped with Jupyter notebooks, designed for experienced researchers who use their own coding methodologies. TDM Studio Visualization is a web-based visualization tool, designed for users of all levels to quickly spot trends and generate insights.
Content available in TDM Studio includes current and historical newspapers, dissertations and theses, scholarly articles, and primary source material that the Library has subscribed to through ProQuest. You must first request an account through ProQuest using your @berkeley.edu email address. Read more documentation.

TDM Studio Workbench

Cloud-Based: virtual environment equipped with Jupyter notebooks
Coding Skills: Requires Python or R knowledge
Content Available: Web of Science XML citation data and most materials licensed by ProQuest that the Library subscribes to, including ProQuest Historical Newspapers, Dissertations and Theses, many scholarly journals, and more.
Research Product: You can download your results, but not your corpus.
Access: Sign up for an account
Help: https://proquest.libguides.com/tdmstudio

TDM Studio Visualization

Cloud-Based: web-based tool
Coding Skills: none needed
Content Available: Web of Science XML citation data and most materials licensed by ProQuest that the Library subscribes to, including ProQuest Historical Newspapers, Dissertations and Theses, many scholarly journals, and more.
Research Product: You can download your results, but not the data.
Access: Create an account with your UC Berkeley email address.
Help: https://proquest.libguides.com/tdmstudio

Digital Scholar Lab (Gale)

Gale Digital Scholar Lab allows you to run simple TDM analyses in a web browser on materials licensed by Gale that the Library subscribes to.
Primary source collections include: 17th and 18th Century Burney Collection, American Civil Liberties Union Papers, 1912-1990, American Fiction, Archives Unbound, Archives of Sexuality & Gender, British Library Newspapers, The Economist Historical Archive, Eighteenth Century Collections Online, Indigenous Peoples: North America, The Making of Modern Law, The Making of the Modern World, Nineteenth Century Collections Online, Nineteenth Century U.S. Newspapers, Sabin Americana, 1500-1926, The Times Digital Archive, The Times Literary Supplement Historical Archive, U.S. Declassified Documents Online.

Cloud-Based: Web app, with the option to download up to 5,000 documents at a time for local analysis.
Coding Skills: None required. Out-of-the-box tools are: document clustering, named entity recognition, ngrams, parts of speech analysis, sentiment analysis, and topic modeling.
Content Available: Most materials licensed by Gale that the Library subscribes to, including 17th and 18th Century Burney Collection, American Civil Liberties Union Papers, 1912-1990, American Fiction, Archives Unbound, Archives of Sexuality & Gender, British Library Newspapers, The Economist Historical Archive, Eighteenth Century Collections Online, Indigenous Peoples: North America, The Making of Modern Law, The Making of the Modern World, Nineteenth Century Collections Online, Nineteenth Century U.S. Newspapers, Sabin Americana, 1500-1926, The Times Digital Archive, The Times Literary Supplement Historical Archive, U.S. Declassified Documents Online.
Research Product: You can download both your results and your corpus (up to 5,000 documents at a time).
Access: Self-administered. Log in with your UC Berkeley credentials, then create a Digital Scholar Lab account using your @berkeley.edu address.
Help: Log in to Digital Scholar Lab, then click on Learning Center.

Note: If you prefer to run analyses on your own computer using your own code, you can download up to 5,000 documents at a time.
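
As an illustration of what running your own code on a downloaded set might look like, here is a minimal Python sketch of ngram counting, one of the analyses the Lab also offers out of the box. It is not Gale's implementation: the function names `tokenize` and `ngram_counts` are hypothetical, the tokenizer is deliberately simple, and the documents are assumed to already be loaded as plain-text strings.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic words (with an optional inner apostrophe)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def ngram_counts(documents, n=2):
    """Count n-grams across an iterable of plain-text documents.

    N-grams never span two documents. A document with fewer than
    n tokens contributes nothing.
    """
    counts = Counter()
    for text in documents:
        tokens = tokenize(text)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


if __name__ == "__main__":
    # Placeholder strings standing in for downloaded documents.
    sample = ["The cat sat. The cat ran.", "A cat sat"]
    for ngram, count in ngram_counts(sample, n=2).most_common(3):
        print(" ".join(ngram), count)
```

For real OCR text you would likely want a more careful tokenizer and a stopword list; this sketch only shows the overall shape of the computation.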

LexisNexis Web Services API

The LexisNexis Web Services API (WSAPI) is a subscription service, provided by the UC Berkeley Library, that enables researchers to download texts from the Library's Nexis Uni subscription and build corpora for further analysis.

Tool limitations:

Searches will be scheduled to run over the weekend to minimize service disruption.
You may not initiate more than 749 searches per hour or retrieve more than 3,000 documents at a time.
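
The two numeric limits above can be enforced on the client side before any request is sent. The following Python sketch shows one way to do that. It is an assumption-laden illustration, not part of the WSAPI: `SearchThrottle` and `batch_ranges` are hypothetical helpers, and no LexisNexis endpoint is called.

```python
import time
from collections import deque

MAX_SEARCHES_PER_HOUR = 749    # limit stated in this guide
MAX_DOCS_PER_RETRIEVAL = 3000  # limit stated in this guide


class SearchThrottle:
    """Sliding-window counter for searches (hypothetical helper, not a LexisNexis class)."""

    def __init__(self, max_calls=MAX_SEARCHES_PER_HOUR,
                 window_seconds=3600.0, clock=time.monotonic):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self.clock = clock
        self._stamps = deque()  # times of searches inside the current window

    def seconds_until_allowed(self):
        """Return how long to wait before the next search is within the limit."""
        now = self.clock()
        # Forget searches that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window_seconds:
            self._stamps.popleft()
        if len(self._stamps) < self.max_calls:
            return 0.0
        return self.window_seconds - (now - self._stamps[0])

    def record(self):
        """Note that a search was just initiated."""
        self._stamps.append(self.clock())


def batch_ranges(total_docs, batch_size=MAX_DOCS_PER_RETRIEVAL):
    """Yield (start, end) index pairs, end exclusive, each covering at most batch_size documents."""
    for start in range(0, total_docs, batch_size):
        yield start, min(start + batch_size, total_docs)
```

In use, a script would call `time.sleep(throttle.seconds_until_allowed())` before each search, call `throttle.record()` right after sending it, and loop over `batch_ranges(total)` when retrieving a large result set.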

Skills required to use the API:

Please contact consultants at the D-Lab if you would like assistance with any of the requirements below.

Knowledge of R or Python, JSON, and XML
Familiarity with API calls

Please visit the LexisNexis Web Services API guide for more information.

HathiTrust Data Capsules


HathiTrust Data Capsules are secure virtual environments for non-consumptive text analysis, where researchers can implement their own data analysis and visualization tools.

In other words, you log into a virtual machine where you will have access to OCRed texts from the HathiTrust Digital Library. You can run your own analyses on this data. You export your results, but not the corpus itself.
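
To make the distinction between results and corpus concrete, the Python sketch below reduces each volume to a few aggregate numbers and writes only those to CSV. It is a hypothetical illustration: the volume ids, the function names `summarize_volumes` and `rows_to_csv`, and the choice of statistics are all assumptions, and what may actually be exported from a capsule is determined by HathiTrust's policies, not by this code.

```python
import csv
import io
from collections import Counter

FIELDS = ["volume_id", "token_count", "distinct_tokens", "top_terms"]


def summarize_volumes(volumes, top_n=3):
    """Reduce each volume to aggregate counts.

    `volumes` maps a volume id (hypothetical here) to its OCR text.
    The returned rows contain no running text from the volumes.
    """
    rows = []
    for vol_id, text in sorted(volumes.items()):
        tokens = text.lower().split()  # naive whitespace tokenizer
        top = Counter(tokens).most_common(top_n)
        rows.append({
            "volume_id": vol_id,
            "token_count": len(tokens),
            "distinct_tokens": len(set(tokens)),
            "top_terms": " ".join(term for term, _ in top),
        })
    return rows


def rows_to_csv(rows):
    """Serialize the summary rows as CSV text, ready to save as a results file."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

The point of the design is that the analysis reads the full texts inside the environment, while the file you take away holds only derived statistics.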

Anyone can use the data capsule and work with public domain materials. In addition, since UC Berkeley is a HathiTrust member, UC Berkeley researchers can include in their corpus material still in copyright.
