Library Resources for Text Analysis: Large collections and platform-wide access
Many library-licensed online resources do not support text mining applications. The publishers, vendors, or collections listed here are exceptions, and offer some mode of access to texts for large-scale analysis.
Researchers can text mine UCB-subscribed journals and books published by Elsevier on the ScienceDirect full-text platform. Sign up for a developer account to use the Elsevier APIs for non-commercial purposes. You can also use the APIs to access citation data and abstracts from scholarly journals indexed by Scopus. For more information, see Elsevier's Text Mining documentation.
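As a rough illustration of what an API call might look like once you have a developer key, the sketch below builds (but does not send) a full-text request for a single article. The endpoint path and the `X-ELS-APIKey` header name are assumptions based on Elsevier's developer documentation and should be checked there before use. The DOI and the key are hypothetical placeholders, and `build_article_request` is an illustrative helper, not part of any Elsevier library.

```python
from urllib.parse import quote
from urllib.request import Request

# Assumed endpoint for article retrieval by DOI; confirm in Elsevier's developer docs.
ARTICLE_ENDPOINT = "https://api.elsevier.com/content/article/doi/"


def build_article_request(doi, api_key, accept="text/xml"):
    """Build a GET request for one article's full text, without sending it."""
    # Keep the slash in the DOI; percent-encode anything unsafe for a URL path.
    url = ARTICLE_ENDPOINT + quote(doi.strip(), safe="/")
    headers = {
        "X-ELS-APIKey": api_key,  # assumed header name for the developer key
        "Accept": accept,
    }
    return Request(url, headers=headers)


# Hypothetical DOI and key, for illustration only.
req = build_article_request("10.1016/j.example.2020.01.001", "YOUR_API_KEY")
print(req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send the request. For anything beyond a handful of articles, follow the rate limits and usage terms attached to your developer account.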
Nearly 14 million books from the HathiTrust Library are currently available for analysis, with varying levels of immediate access. Check the HTRC tab on this guide for information to help you get started.
Data for Research allows you to download word frequencies, citations, key terms, and ngrams for up to 1,000 JSTOR articles at a time, or to easily submit requests for larger sets of articles. See also:
JSTORr, a package of simple functions in R to work with DFR output.
JSTOR's Text Analyzer, a reverse search engine that analyzes documents that you upload (your own, or other articles) to find related materials in JSTOR.
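For researchers working in Python rather than R, a minimal sketch of combining per-article word frequencies into corpus-wide totals is shown here. It assumes each article's counts arrive as tab-separated `term<TAB>count` lines. Check the layout of your own Data for Research download, since this format is an assumption and the two sample articles are made up.

```python
from collections import Counter


def merge_term_counts(tsv_texts):
    """Sum per-article term counts into one corpus-wide Counter."""
    totals = Counter()
    for text in tsv_texts:
        for line in text.splitlines():
            fields = line.split("\t")
            if len(fields) != 2:
                continue  # skip blank or malformed lines
            term, count = fields
            try:
                totals[term] += int(count)
            except ValueError:
                continue  # skip header rows or non-numeric counts
    return totals


# Two made-up articles in the assumed term<TAB>count layout.
article_a = "library\t3\ntext\t2\n"
article_b = "text\t5\nmining\t1\n"

totals = merge_term_counts([article_a, article_b])
print(totals.most_common(2))  # [('text', 7), ('library', 3)]
```

In practice, each string would come from reading one downloaded file, for example with `pathlib.Path(name).read_text(encoding="utf-8")`.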
The Linguistic Data Consortium (LDC) collects language data from both written texts and transcriptions of speech, in various languages, to support corpus linguistics. The Library's subscription begins in 2016, and the Library is currently working to migrate legacy collections from the Berkeley Language Center. If you don't see an LDC dataset in OskiCat, search the LDC catalog and email email@example.com with any questions.
"Individual researchers are encouraged to download subscription and open access content for TDM purposes directly from the SpringerLink platform. No registration or API key is required. Full-text content can be accessed easily and programmatically at friendly URLs based on the content’s Digital Object Identifier (DOI)." via Springer's Text and Data Mining Policy.
Library Resources for Text Analysis: Smaller collections and individual publications
Digital "genetic" editions of the original manuscripts for five major works by Samuel Beckett, featuring full TEI/XML downloads.
The purpose of the Beckett Digital Manuscript Project is to reunite the manuscripts of Samuel Beckett's works in a digital way, and to facilitate genetic research: the project brings together digital facsimiles of documents that are now preserved in different holding libraries, and adds transcriptions of Beckett's manuscripts, tools for bilingual and genetic version comparison, a search engine, and an analysis of the textual genesis of his works.
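To give a sense of how a TEI/XML download might be turned into plain text for analysis, here is a minimal sketch using Python's standard library. The sample document is invented, and it assumes the transcription marks deletions with `<del>` and additions with `<add>` in the standard TEI namespace. The project's actual encoding may be richer, so inspect a real file before relying on this.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"  # standard TEI P5 namespace

# Invented sample; real project files will be far more detailed.
SAMPLE = f"""<TEI xmlns="{TEI}">
  <teiHeader><fileDesc><titleStmt><title>Sample draft</title></titleStmt></fileDesc></teiHeader>
  <text><body>
    <p>A first <del>rough</del><add>short</add> sentence.</p>
    <p>A second sentence.</p>
  </body></text>
</TEI>"""


def _text(elem, skip_tags):
    """Concatenate an element's text, omitting any child whose tag is in skip_tags."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag not in skip_tags:
            parts.append(_text(child, skip_tags))
        parts.append(child.tail or "")  # text following the child is always kept
    return "".join(parts)


def final_paragraphs(tei_xml):
    """Return body paragraphs as plain text, leaving out deleted passages."""
    root = ET.fromstring(tei_xml)
    body = root.find(f".//{{{TEI}}}body")
    if body is None:
        return []
    skip = {f"{{{TEI}}}del"}
    return [" ".join(_text(p, skip).split()) for p in body.iter(f"{{{TEI}}}p")]


print(final_paragraphs(SAMPLE))  # ['A first short sentence.', 'A second sentence.']
```

Skipping `<add>` instead of `<del>` would approximate an earlier state of the text, which is one simple way to compare versions.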
UC Berkeley has licensed access to the full-text corpus data from BYU's COCA: Corpus of Contemporary American English, COHA: Corpus of Historical American English, and GloWbE: Global Web-based English.
Some ICPSR datasets include corpora assembled to support data analysis, drawn from sources such as survey text, text messages, the Congressional Record, political speeches, and more.
Direct download access to data sets requires the creation of a personal account. In addition, analysis of ICPSR data sets requires the use of specialized software. For more information on this process, please consult the ICPSR Get Help page or schedule an appointment with the Library Data Lab.
Available in the Library Data Lab (189 Doe Annex).
The Penn Historical Corpora, including the Penn-Helsinki Parsed Corpus of Middle English, second edition (PPCME2), the Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME), and the Penn Parsed Corpus of Modern British English (PPCMBE), are syntactically annotated corpora of prose text samples of English from the indicated time periods.
Request access from the D-Lab for ProQuest Historical Newspaper data for the San Francisco Chronicle (1865-1922). Note that the quality of the OCR (results from automated Optical Character Recognition) is quite low and varies from paper to paper.
Library Resources: Available by request
Text mining access to the following resources requires mediation by the Library and the vendors involved. Researchers should expect to provide a description of their research, and, depending on the scope of the request, there may be associated costs. Please contact firstname.lastname@example.org for more information.
ProQuest Historical Newspapers
Researchers may request OCR full text from any of the following specific newspapers for a specific time period:
"Text Mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources... The difference between regular data mining and text mining is that in text mining the patterns are extracted from natural language text rather than from structured databases of facts."
Send questions about text and data mining access to library resources to the shared email address above, which brings together librarians and campus partners with subject, copyright, technical, and licensing expertise.
For help with text mining tools and software, check out the D-Lab.
Questions and suggestions related to this guide can go to Cody Hennesy.