The Chronicling America API provides access to information about historic U.S. newspapers and to millions of digitized newspaper pages. The OCR data for those pages is also available for bulk download. See the full list of digitized newspaper titles (1836-1922) for more information.
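As a rough illustration of how a query might be composed, the sketch below builds a search URL for the Library of Congress JSON interface. The base URL, the `q`, `dates`, `fo`, and `c` parameters, and the helper name `build_chronam_url` are assumptions made for this example and do not come from this guide; check the current Library of Congress API documentation before relying on them.

```python
from urllib.parse import urlencode

# Assumed base URL for the Chronicling America collection on loc.gov.
BASE_URL = "https://www.loc.gov/collections/chronicling-america/"


def build_chronam_url(query, start_year=None, end_year=None, per_page=25):
    """Return a search URL that asks for JSON results (hypothetical helper)."""
    params = {
        "q": query,      # free-text search terms
        "fo": "json",    # assumed flag requesting JSON instead of HTML
        "c": per_page,   # assumed results-per-page parameter
    }
    # Only add a date filter when both ends of the range are given.
    if start_year is not None and end_year is not None:
        params["dates"] = f"{start_year}/{end_year}"
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_chronam_url("suffrage", 1900, 1910))
```

The helper only constructs the URL; fetching it (for example with `urllib.request.urlopen`) and paging through results are left out so the sketch stays independent of network access.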
Full-text, structured XML files of OCRed text from the Guardian and the Observer newspapers during the years 1791-1909. 205,357 pages. From ProQuest Historical Newspapers. To access, fill out the form linked from the catalog record.
ProQuest Historical Newspaper data for the Los Angeles Sentinel, 1934-2005. OCR'ed content (results from automated Optical Character Recognition; quality varies).
The Article Search API provides access to headlines, abstracts, lead paragraphs and more (but NOT full-text articles) from the New York Times, 1851 to present.
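A minimal sketch of working with the Article Search API follows. It assumes the `articlesearch.json` endpoint with `q`, `api-key`, `begin_date`, `end_date`, and `page` parameters, and a response shaped as `response.docs[]` with `headline.main`, `abstract`, `lead_paragraph`, and `pub_date` fields. The helper names `build_search_url` and `summarize_docs` are hypothetical, and the field names should be verified against the New York Times developer documentation.

```python
from urllib.parse import urlencode

# Assumed endpoint for the Article Search API (version 2).
ARTICLE_SEARCH_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"


def build_search_url(query, api_key, begin_date=None, end_date=None, page=0):
    """Compose a request URL; dates are assumed to use YYYYMMDD format."""
    params = {"q": query, "api-key": api_key, "page": page}
    if begin_date:
        params["begin_date"] = begin_date
    if end_date:
        params["end_date"] = end_date
    return ARTICLE_SEARCH_URL + "?" + urlencode(params)


def summarize_docs(payload):
    """Pull the metadata fields out of an already-parsed JSON response.

    Missing fields come back as None, since older records may be sparse.
    """
    docs = payload.get("response", {}).get("docs", [])
    return [
        {
            "headline": (doc.get("headline") or {}).get("main"),
            "abstract": doc.get("abstract"),
            "lead_paragraph": doc.get("lead_paragraph"),
            "pub_date": doc.get("pub_date"),
        }
        for doc in docs
    ]
```

Because the API returns metadata rather than full text, `summarize_docs` keeps only the headline, abstract, lead paragraph, and publication date. An API key is required; the value passed to `build_search_url` is a placeholder supplied by the caller.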
Request access from the D-Lab for ProQuest Historical Newspaper data for the San Francisco Chronicle (1865-1922). Note that the quality of the OCR (results from automated Optical Character Recognition) is quite low and varies from paper to paper.
The Vogue Archive contains the entire run of Vogue magazine (US edition), from the first issue in 1892 to the current month, reproduced in high-resolution color page images. Every page, advertisement, cover and fold-out has been included, with rich indexing enabling you to find images by garment type, designer and brand names. XML and JPEG files.
LDC's TIPSTER corpus was compiled to advance the state of the art in effective document detection (information retrieval) and data extraction from large, real-world data collections. Among other sources, it includes portions of the Wall Street Journal, San Jose Mercury News, and the AP Newswire from the late 1980s and early 1990s. (Read more about TIPSTER)
Available by request: ProQuest Historical Newspapers
Researchers may request OCR full text from any of the following specific newspapers for a specific time period, though requests will require significant processing time. The following sets are already available for TDM use: