Library Guides: Text Mining & Computational Text Analysis: Newspapers & Magazines

Newspapers & Magazines

Chronicling America (Library of Congress)
The Chronicling America API provides access to information about historic U.S. newspapers and to millions of digitized newspaper pages. The OCR data for those pages is also available for bulk download. See the full list of digitized newspaper titles (1836-1922) for more information.
The Guardian and The Observer Archive 1791-1909
Full-text, structured XML files of OCRed text from the Guardian and the Observer newspapers during the years 1791-1909. 205,357 pages. From ProQuest Historical Newspapers. To access, fill out the form linked from the catalog record.
Los Angeles Sentinel (1934-2005)
ProQuest Historical Newspaper data for the Los Angeles Sentinel, 1934-2005. OCRed content (the output of automated Optical Character Recognition); quality varies.
NY Times APIs
The Article Search API provides access to headlines, abstracts, lead paragraphs and more (but NOT full-text articles) from the New York Times, 1851 to present.
San Francisco Chronicle Archive (1865-1922)
Request access from the D-Lab for ProQuest Historical Newspaper data for the San Francisco Chronicle (1865-1922). Note that the OCR (the output of automated Optical Character Recognition) is of quite low quality, and the quality varies from paper to paper.
Times of India (1838-2005)
ProQuest Historical Newspaper data for the Times of India from 1838-2005. OCR content quality varies.
Vogue Archive
The Vogue Archive contains the entire run of Vogue magazine (US edition), from the first issue in 1892 to the current month, reproduced in high-resolution color page images. Every page, advertisement, cover and fold-out has been included, with rich indexing enabling you to find images by garment type, designer and brand names. XML and JPEG files.
Wall Street Journal (1987-92), Associated Press (1988-90)... (TIPSTER)
LDC's TIPSTER corpus was compiled to advance the state of the art in effective document detection (information retrieval) and data extraction from large, real-world data collections. Among other sources, it includes portions of the Wall Street Journal, the San Jose Mercury News, and the AP Newswire from the late 1980s and early 1990s. (Read more about TIPSTER)
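
Two of the resources listed above, Chronicling America and the NY Times Article Search API, are queried over HTTP rather than delivered as files. The sketch below shows one way a request URL could be built for each and how NY Times results might be reduced to the fields this guide mentions (headlines and lead paragraphs). The endpoint URLs, the parameter names (andtext, format, page, q, api-key) and the response field names reflect the providers' public documentation as best recalled here and are assumptions to verify against the current documentation; the function names are hypothetical, and the NY Times API requires your own API key.

```python
from urllib.parse import urlencode

# Assumed endpoints: confirm against each provider's current API documentation.
CHRONAM_SEARCH = "https://chroniclingamerica.loc.gov/search/pages/results/"
NYT_SEARCH = "https://api.nytimes.com/svc/search/v2/articlesearch.json"


def chronam_search_url(terms, page=1):
    """Build a Chronicling America page-search URL that requests JSON results."""
    params = {"andtext": terms, "format": "json", "page": page}
    return CHRONAM_SEARCH + "?" + urlencode(params)


def nyt_search_url(query, api_key, page=0):
    """Build a NY Times Article Search URL (pages are assumed to start at 0)."""
    params = {"q": query, "api-key": api_key, "page": page}
    return NYT_SEARCH + "?" + urlencode(params)


def nyt_summaries(payload):
    """Pull headline, date, and lead paragraph out of a decoded JSON response.

    Missing fields become empty strings so incomplete records do not raise.
    """
    docs = payload.get("response", {}).get("docs", [])
    return [
        {
            "headline": (doc.get("headline") or {}).get("main", ""),
            "pub_date": doc.get("pub_date", ""),
            "lead_paragraph": doc.get("lead_paragraph", ""),
        }
        for doc in docs
    ]
```

The functions only build URLs and reshape an already-decoded response, so they can be tried without network access. To fetch real results, the URL could be passed to urllib.request.urlopen and the body decoded with json.load, ideally with a pause between requests to respect each provider's rate limits.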

Available by request: ProQuest Historical Newspapers

Researchers may request OCR full text from any of the following newspapers for a specific time period, though requests require significant processing time. The following sets are already available for text and data mining (TDM) use:
