Library Resources for Text Analysis: Large collections and platform-wide access
Many library-licensed online resources do not support text mining applications. The publishers, vendors, or collections listed here are exceptions, and offer some mode of access to texts for large-scale analysis.
Researchers can text mine UCB-subscribed journals and books published by Elsevier on the ScienceDirect full-text platform. Sign up for a developer account to use the Elsevier APIs for non-commercial purposes, and make sure to query the API from UC Berkeley IPs to ensure full access. You can also use the APIs to access citation data and abstracts from scholarly journals indexed by Scopus. For more information, see their Text Mining documentation.
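As a rough illustration of querying the API with a developer key, the sketch below builds (but does not send) a full-text request for a single article. The endpoint path, the `X-ELS-APIKey` header name, the DOI, and the key are assumptions made for illustration; check them against Elsevier's Text Mining documentation before relying on them.

```python
from urllib.parse import quote
from urllib.request import Request

# Assumed base URL for article retrieval by DOI; confirm in Elsevier's documentation.
ELSEVIER_ARTICLE_URL = "https://api.elsevier.com/content/article/doi/"


def build_article_request(doi: str, api_key: str) -> Request:
    """Build (but do not send) a full-text request for one DOI."""
    url = ELSEVIER_ARTICLE_URL + quote(doi)
    headers = {
        "X-ELS-APIKey": api_key,  # assumed header name for the developer key
        "Accept": "text/xml",     # ask for an XML response
    }
    return Request(url, headers=headers)


# Placeholder DOI and key, for illustration only.
request = build_article_request("10.1016/j.example.2020.01.001", "YOUR_API_KEY")
print(request.full_url)
```

Sending the request with `urllib.request.urlopen(request)` from a UC Berkeley IP address is what should return the subscribed full text; from elsewhere the response may be limited.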
Nearly 14 million books from the HathiTrust Library are currently available for analysis, offering various levels of immediate access. Check the HTRC tab on this guide for more information to help you get started.
JSTOR's Data for Research service lets you download word frequencies, citations, key terms, and n-grams for up to 25,000 JSTOR articles at a time, or submit a request for larger sets of articles.
The Linguistic Data Consortium (LDC) collects language data from written texts and transcriptions of speech, in various languages, to support corpus linguistics. The Library's subscription begins in 2016, and the Library is currently working to migrate legacy collections from the Berkeley Language Center. If you don't see an LDC dataset in OskiCat, search the LDC catalog and email firstname.lastname@example.org with any questions.
"Individual researchers are encouraged to download subscription and open access content for TDM purposes directly from the SpringerLink platform. No registration or API key is required. Full-text content can be accessed easily and programmatically at friendly URLs based on the content’s Digital Object Identifier (DOI)." via Springer's Text and Data Mining Policy.
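The DOI-based access described in the quote might be sketched as follows. The URL pattern used here is an assumption, not taken from this guide; confirm the current form in Springer's Text and Data Mining Policy. The DOI in the example is a placeholder.

```python
from urllib.parse import quote

# Assumed URL pattern; verify against Springer's Text and Data Mining Policy.
SPRINGER_PDF_PATTERN = "https://link.springer.com/content/pdf/{doi}.pdf"


def springer_pdf_url(doi: str) -> str:
    """Turn a DOI (bare, 'doi:'-prefixed, or a doi.org link) into a full-text URL."""
    doi = doi.strip()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    # quote() keeps "/" intact and escapes characters that are unsafe in a URL path.
    return SPRINGER_PDF_PATTERN.format(doi=quote(doi))


# Placeholder DOI, for illustration only.
print(springer_pdf_url("doi:10.1007/978-3-000-00000-0_1"))
```

When downloading many items this way, pausing between requests is a sensible precaution.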
Library Resources for Text Analysis: Smaller collections and individual publications
Digital "genetic" editions of the original manuscripts for five major works by Samuel Beckett, featuring full TEI/XML downloads.
The purpose of the Beckett Digital Manuscript Project is to reunite the manuscripts of Samuel Beckett's works in a digital way, and to facilitate genetic research: the project brings together digital facsimiles of documents that are now preserved in different holding libraries, and adds transcriptions of Beckett's manuscripts, tools for bilingual and genetic version comparison, a search engine, and an analysis of the textual genesis of his works.
UC Berkeley has licensed access to the full-text corpus data from BYU's COCA: Corpus of Contemporary American English, COHA: Corpus of Historical American English, and GloWbE: Global Web-based English.
Some ICPSR datasets include corpora assembled to support data analysis, drawn from sources such as survey text, text messages, the Congressional Record, political speeches, and more.
Direct download access to data sets requires the creation of a personal account. In addition, analysis of ICPSR data sets requires the use of specialized software. For more information on this process, please consult the ICPSR Get Help page or schedule an appointment with the Library Data Lab.
A 19 MB zip file containing an XML document for every full-text article from Godey's Lady's Book, Parts I-III (Accessible Archives). The magazine was intended to entertain, inform, and educate the women of America, and covers fashion, biographical sketches, articles about mineralogy, handcrafts, female costume, the dance, equestrienne procedures, health and hygiene, recipes, remedies, and the like.
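One way to start working with a zip file of per-article XML documents is to count word frequencies across the whole archive. The sketch below uses only the Python standard library and makes no assumption about the archive's XML schema: it gathers all text nodes regardless of element names. The function name, file names, and tags in the demo are hypothetical.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter


def word_counts_from_zip(zip_source) -> Counter:
    """Count lowercase word tokens across every .xml file in a zip archive."""
    counts = Counter()
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".xml"):
                continue
            root = ET.fromstring(archive.read(name))
            # itertext() yields every text node, so no element names are assumed.
            text = " ".join(root.itertext())
            counts.update(re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower()))
    return counts


# Demo on a tiny in-memory archive; the element names here are made up.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("a.xml", "<article><title>Fashion notes</title><p>Notes on fashion.</p></article>")
    demo.writestr("b.xml", "<article><p>Recipes and remedies.</p></article>")
buffer.seek(0)
print(word_counts_from_zip(buffer).most_common(3))
```

For a real download, pass the path to the zip file instead of the in-memory buffer. Counting all text nodes will also pick up any metadata stored as element text, so inspect a few documents first and narrow the extraction to the article body if needed.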
The Linguistic Data Consortium's NY Times Corpus contains over 1.8 million articles (excluding wire service articles) published in the New York Times between January 1, 1987 and June 19, 2007. The corpus also includes over 650,000 article summaries; human- and algorithm-assigned tags drawn from a normalized indexing vocabulary of people, organizations, locations, and topic descriptors; and Java tools for parsing corpus documents from .xml into memory-resident objects.
Available in the Library Data Lab (189 Doe Annex).
The Penn Historical Corpora, including the Penn-Helsinki Parsed Corpus of Middle English, second edition (PPCME2), the Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME), and the Penn Parsed Corpus of Modern British English (PPCMBE), are syntactically annotated corpora of prose text samples of English from the indicated time periods.
A 105 MB zip file containing an XML document for every full-text article from the Pennsylvania Gazette (Accessible Archives). This paper provides a first-hand view of colonial America, the American Revolution, and the New Republic, offering important social, political, and cultural perspectives on each of these periods.
Request access from the D-Lab for ProQuest Historical Newspaper data for the San Francisco Chronicle (1865-1922). Note that the quality of the OCR (results from automated Optical Character Recognition) is quite low and varies from paper to paper.
Library Resources: Available by request
Text mining access to the following resources will require mediation by the Library and vendors involved. Researchers should expect to provide a description of their research, and depending on the scope of the request there may be associated costs. Please contact email@example.com for more information.
ProQuest Historical Newspapers
Researchers may request OCR full text from any of the following specific newspapers for a specific time period:
"Text Mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources... The difference between regular data mining and text mining is that in text mining the patterns are extracted from natural language text rather than from structured databases of facts."