UC Berkeley has licensed access to the full-text corpus data from three BYU corpora: COCA (Corpus of Contemporary American English), COHA (Corpus of Historical American English), and GloWbE (Global Web-based English).
The Linguistic Data Consortium (LDC) collects language data from written texts and from transcriptions of speech, in various languages, to support corpus linguistics. The Library's subscription dates from 2016, and the Library is currently migrating legacy collections from the Berkeley Language Center. If you don't see an LDC dataset in OskiCat, search the LDC catalog and email firstname.lastname@example.org with any questions.
The Linguistic Data Consortium's NY Times Corpus covers articles published in the New York Times between January 1, 1987 and June 19, 2007. The corpus includes: over 1.8 million articles (excluding wire service articles); over 650,000 article summaries; human- and algorithm-assigned tags drawn from a normalized indexing vocabulary of people, organizations, locations, and topic descriptors; and Java tools for parsing corpus documents from .xml into a memory-resident object.
15 million words of American English automatically annotated for logical structure, word and sentence boundaries, part of speech (multiple tag sets), shallow parse (noun and verb chunks), and named entities.