Library Guides: Text Mining & Computational Text Analysis: Linguistic Corpora

Linguistic Corpora

BYU Corpus Data
UC Berkeley has licensed access to the full-text corpus data from BYU's COCA: Corpus of Contemporary American English, COHA: Corpus of Historical American English, and GloWbE: Global Web-based English. See the BYU Corpus Data section below for full details.
Corpus Resource Database (CoRD)
CoRD provides links to and descriptions of a large number of corpora, subcorpora and databases. (University of Helsinki)
Linguistic Data Consortium Corpora
The LDC collects language data from both written texts and transcriptions of speech, in various languages, to support corpus linguistics. The Library's subscription begins in 2016, and the Library is currently working to migrate legacy collections from the Berkeley Language Center. If you don't see an LDC dataset in UC Library Search, search the LDC catalog and email tdm-access at berkeley.edu with any questions.
Open American National Corpus
15 million words of American English automatically annotated for logical structure, word and sentence boundaries, part of speech (multiple tag sets), shallow parse (noun and verb chunks), and named entities.
Scottish Corpus of Text & Speech (1945-present)
The Scottish Corpora project has created large electronic corpora (over 4.5 million words) of written and spoken texts for the languages of Scotland. See also the Helsinki Corpus of Older Scots (1450-1700) and the Corpus of Modern Scottish Writing (1700-1945).

BYU Corpus Data

English language corpora from BYU

UC Berkeley has licensed access to the full-text corpus data for the following BYU English language collections. You can search these corpora online without accessing the full-text data:

COCA: Corpus of Contemporary American English
The corpus contains more than 520 million words of text (20 million words for each year from 1990 to 2015), divided equally among spoken, fiction, popular magazine, newspaper, and academic texts.
COHA: Corpus of Historical American English
COHA contains more than 400 million words of text from the 1810s-2000s and is balanced by genre, decade by decade.
GloWbE: Global Web-based English
GloWbE contains about 1.9 billion words of text from twenty different countries.

Full-text corpus data

Full-text corpus data are available separately for COCA, COHA, and GloWbE.

COCA: Corpus of Contemporary American English - Apply for Access

COHA: Corpus of Historical American English - Apply for Access

GloWbE: Global Web-based English - Apply for Access

Note that each dataset is available in three different formats: Database, Word/lemma/PoS, and Linear text.
For more information about the data formats see corpus.byu.edu.
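
As a rough illustration of working with the Word/lemma/PoS format, the sketch below counts lemma frequencies. It assumes each token sits on its own line as tab-separated fields ending in word, lemma, and part-of-speech tag. That layout, the sample lines and tags, and the names Token, parse_wlp, and lemma_counts are all assumptions made for this example, not the documented file specification, so check the format notes at corpus.byu.edu before adapting it.

```python
from collections import Counter
from typing import Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    """One corpus token (hypothetical structure for this sketch)."""
    word: str
    lemma: str
    pos: str


def parse_wlp(lines: Iterable[str]) -> Iterator[Token]:
    """Yield tokens from tab-separated word/lemma/PoS lines.

    Assumption: the last three fields on each line are word, lemma, PoS.
    Any leading ID columns are ignored.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        word, lemma, pos = fields[-3:]
        yield Token(word, lemma, pos)


def lemma_counts(tokens: Iterable[Token]) -> Counter:
    """Count tokens by lowercased lemma."""
    return Counter(t.lemma.lower() for t in tokens)


# Made-up sample lines; real files would be opened and passed in instead.
sample = [
    "The\tthe\tat\n",
    "corpora\tcorpus\tnn2\n",
    "\n",
    "grow\tgrow\tvv0\n",
    "Corpus\tcorpus\tnn1\n",
]
counts = lemma_counts(parse_wlp(sample))
print(counts.most_common(3))
```

Because parse_wlp is a generator, a large file can be streamed line by line (for example, by passing an open file handle) without loading it fully into memory.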

NOW: Corpus of News on the Web

NOW contains 3.7 billion words of data from web-based newspapers and magazines from 2010 to the present. The corpus grows by about 4-5 million words each day (from about 10,000 new articles), or about 130 million words each month. Note: full-text data for this corpus is not available.
