Popular APIs and datasets for text analysis
This page lists free datasets and public APIs from popular sources that may suit computational text analysis projects.
Blogger Corpus (2004)
The collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts.
Chronicling America (Library of Congress)
The Chronicling America API provides access to information about historic U.S. newspapers and millions of digitized newspaper pages. The OCR data for these pages is available for bulk download. See the full list of digitized newspaper titles (1836-1922) for more information.
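As a rough illustration of how a page search might be scripted, the sketch below builds a search URL and pulls a few fields out of a JSON response. The base URL, the `andtext`, `format`, `page`, and `rows` parameters, and the `items`, `title`, `date`, and `ocr_eng` keys are assumptions based on the API's historical documentation and may have changed, so check the current Library of Congress documentation before relying on them.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL for the page search; verify against current LOC documentation.
BASE_URL = "https://chroniclingamerica.loc.gov/search/pages/results/"


def build_search_url(terms, page=1, rows=20):
    """Build a page-search URL that asks for JSON output."""
    params = {"andtext": terms, "format": "json", "page": page, "rows": rows}
    return BASE_URL + "?" + urlencode(params)


def summarize_pages(payload):
    """Return (title, date, first 80 OCR characters) for each result item."""
    return [
        (item.get("title", ""), item.get("date", ""), item.get("ocr_eng", "")[:80])
        for item in payload.get("items", [])
    ]


def search_pages(terms, page=1, rows=20):
    """Fetch one page of results (requires network access)."""
    with urlopen(build_search_url(terms, page, rows)) as response:
        return summarize_pages(json.load(response))
```

Keeping URL construction and response parsing in separate functions makes it easy to adjust either one if the parameter names or response keys differ from what is assumed here.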
Congress and Government Information APIs (Sunlight Foundation)
Sunlight Foundation provides APIs to a variety of government information sources, including: Congress API v3: information on legislators, districts, committees, bills, and votes, as well as real-time notice of hearings, floor activity, and upcoming bills. Open States API: information on the legislators and activities of all 50 state legislatures, Washington, D.C., and Puerto Rico. Real-Time Federal Campaign Finance API: a JSON and CSV API that delivers up-to-the-minute campaign finance information on federal candidates, committees, PACs, and other groups that file electronically with the Federal Election Commission.
CourtListener APIs and Bulk Legal Data
Opinions, docket files, and more from 420 courts.
Delpher (Dutch language resources)
Dutch newspapers, books, journals and radio bulletins available in full-text, along with rich datasets, APIs and other digital humanities tools for interaction.
Library of Congress: 25 million bibliographic metadata records
The LOC release of 25 million MARC records for free bulk download. MARC (Machine-Readable Cataloging) is an international metadata standard for the representation and communication of bibliographic and related information.
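To give a sense of what a binary MARC record looks like, here is a minimal sketch of a parser that uses only the Python standard library. It assumes the records follow the usual MARC 21 layout (a 24-byte leader, a directory of 12-byte entries, then the field data) and are encoded as UTF-8; records in the older MARC-8 encoding would need a conversion step. The function name `parse_marc_record` is hypothetical, and a maintained MARC library is likely a better choice for real projects.

```python
FIELD_END = b"\x1e"   # field terminator
SUBFIELD = b"\x1f"    # subfield delimiter


def parse_marc_record(raw):
    """Parse one binary MARC 21 record into a list of (tag, value) pairs.

    Control fields (tags below 010) yield a string; data fields yield a
    list of (subfield code, text) pairs. Assumes UTF-8 encoded records.
    """
    leader = raw[:24]
    base = int(leader[12:17].decode("ascii"))   # offset where field data begins
    directory = raw[24:base - 1]                # drop the directory's terminator
    fields = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12].decode("ascii")
        tag, length, start = entry[:3], int(entry[3:7]), int(entry[7:12])
        data = raw[base + start:base + start + length].rstrip(FIELD_END)
        if tag < "010":
            fields.append((tag, data.decode("utf-8", errors="replace")))
        else:
            # Skip the two indicator characters, then split on the delimiter.
            subfields = [
                (chunk[:1].decode("ascii"), chunk[1:].decode("utf-8", errors="replace"))
                for chunk in data[2:].split(SUBFIELD)[1:]
            ]
            fields.append((tag, subfields))
    return fields
```

A bulk file holds many records back to back, each ending with the record terminator byte `0x1d`, so a caller would split the file on that byte (or read the length from the first five leader bytes) before passing each record to this function.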
NY Times APIs
The Article Search API provides access to headlines, abstracts, lead paragraphs and more (but NOT full-text articles) from the New York Times, 1851 to present.
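A minimal sketch of an Article Search query follows. It assumes the version 2 endpoint shown below, an API key obtained from the NY Times developer site, and a response shaped as `response.docs[]` with `headline.main`, `abstract`, `lead_paragraph`, `pub_date`, and `web_url` keys; the function names are hypothetical, and the field names should be confirmed against the current API documentation.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed Article Search v2 endpoint.
ENDPOINT = "https://api.nytimes.com/svc/search/v2/articlesearch.json"


def build_article_search_url(query, api_key, begin_date=None, end_date=None, page=0):
    """Build an Article Search URL; dates use the YYYYMMDD form."""
    params = {"q": query, "page": page, "api-key": api_key}
    if begin_date:
        params["begin_date"] = begin_date
    if end_date:
        params["end_date"] = end_date
    return ENDPOINT + "?" + urlencode(params)


def extract_articles(payload):
    """Pull the headline, abstract, and lead paragraph from each document."""
    docs = (payload.get("response") or {}).get("docs") or []
    return [
        {
            "headline": (doc.get("headline") or {}).get("main"),
            "abstract": doc.get("abstract"),
            "lead_paragraph": doc.get("lead_paragraph"),
            "pub_date": doc.get("pub_date"),
            "web_url": doc.get("web_url"),
        }
        for doc in docs
    ]


def search_articles(query, api_key, **kwargs):
    """Fetch one page of results (requires network access and a valid key)."""
    with urlopen(build_article_search_url(query, api_key, **kwargs)) as response:
        return extract_articles(json.load(response))
```

Because the API does not return full article text, the abstract and lead paragraph are usually the longest text fields available for analysis.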
PLOS Search API
Query content from the seven open-access, peer-reviewed journals of the Public Library of Science using any of the twenty-three terms in the PLOS Search.
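A fielded PLOS Search query might be sketched as follows. The endpoint URL, the Solr-style parameters (`q`, `fl`, `rows`, `start`, `wt`), the example field names, and the `response.docs` response shape are all assumptions drawn from how the service has historically been documented; the function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed search endpoint; confirm against the current PLOS API documentation.
PLOS_SEARCH = "https://api.plos.org/search"


def build_plos_url(field, phrase,
                   fields=("id", "title", "journal", "publication_date"),
                   rows=10, start=0):
    """Build a fielded phrase query, for example title:"text mining"."""
    params = {
        "q": f'{field}:"{phrase}"',
        "fl": ",".join(fields),   # fields to return for each document
        "rows": rows,
        "start": start,
        "wt": "json",
    }
    return PLOS_SEARCH + "?" + urlencode(params)


def extract_docs(payload):
    """Return the list of matching documents from a JSON response."""
    return (payload.get("response") or {}).get("docs") or []


def search_plos(field, phrase, **kwargs):
    """Fetch one page of results (requires network access)."""
    with urlopen(build_plos_url(field, phrase, **kwargs)) as response:
        return extract_docs(json.load(response))
```

Paging through a larger result set would mean increasing `start` by `rows` on each request until fewer than `rows` documents come back.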
reddit API
Access data from posts, threads, comments, users, and more from reddit and its subreddits.
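The sketch below shows one way to collect post text from a subreddit listing. It assumes the public `.json` listing URLs and the `Listing` response shape (`data.children[].data` with `title`, `author`, and `selftext`, plus an `after` paging cursor). The `USER_AGENT` value and the function names are hypothetical, and reddit may require authenticated (OAuth) access or impose rate limits, so treat this as a starting point only.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

USER_AGENT = "text-analysis-example/0.1"  # hypothetical client identifier


def build_listing_request(subreddit, sort="new", limit=25, after=None):
    """Build a request for one page of a subreddit listing."""
    params = {"limit": limit}
    if after:
        params["after"] = after   # cursor returned by the previous page
    url = f"https://www.reddit.com/r/{subreddit}/{sort}.json?{urlencode(params)}"
    return Request(url, headers={"User-Agent": USER_AGENT})


def extract_posts(listing):
    """Return ([(title, author, selftext), ...], next-page cursor)."""
    data = listing.get("data", {})
    posts = [
        (child["data"].get("title"),
         child["data"].get("author"),
         child["data"].get("selftext", ""))
        for child in data.get("children", [])
    ]
    return posts, data.get("after")


def fetch_posts(subreddit, **kwargs):
    """Fetch one page of posts (requires network access)."""
    with urlopen(build_listing_request(subreddit, **kwargs)) as response:
        return extract_posts(json.load(response))
```

Passing the returned cursor back in as `after` requests the next page, which is how a larger corpus of posts could be assembled.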
Twitter Streaming APIs
Public streams provide access to public data flowing through Twitter. Suitable for following specific users or topics, and data mining. You can also access single-user streams, containing roughly all of the data corresponding with a single user’s view of Twitter.
Yelp API
Access to business data, including location, photos, Yelp rating, price levels, hours of operation, and types of transactions. Also includes a Review API, which returns up to 3 review excerpts for a business. See also: the Yelp Dataset Academic Challenge.