Library Guides: Text Mining & Computational Text Analysis: Government Documents

Government Documents

Caselaw Access Project
The Caselaw Access Project (“CAP”) expands public access to U.S. law. It contains over 360 years of published U.S. court decisions (going back to 1658), digitized from the collection of the Harvard Law Library.
Congress.gov API
The Congress.gov API includes bills, amendments, summaries, Congress, members, the Congressional Record, committee reports, nominations, treaties, and House Communications. Hearing transcripts and Senate Communications are to be added over time. A free API key is required; sign up to obtain one.
CourtListener APIs and Bulk Legal Data
Opinions, docket files, and more from 420 courts.
FDSys: Bulk Data
Bulk data downloads of major U.S. Government publications, including Congressional Bills, Commerce Business Daily, Federal Register, Public Papers of the Presidents of the United States, Supreme Court Decisions 1937-1975 (FLITE), and more.
Congress and Government Information APIs (Sunlight Foundation) Static Resource
This source is no longer updated as of 2020. The Sunlight Foundation provided APIs to a variety of government information sources, including:
Congress API v3: information on legislators, districts, committees, bills, and votes, as well as real-time notice of hearings, floor activity, and upcoming bills.
Open States API: information on the legislators and activities of all 50 state legislatures, Washington, D.C., and Puerto Rico.
Real-Time Federal Campaign Finance API: a JSON and CSV API that delivers up-to-the-minute campaign finance information on federal candidates, committees, PACs, and other groups that file electronically with the Federal Election Commission.
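As a rough illustration of how the Congress.gov API entry above might be used once you have a key, the sketch below only builds a request URL; it does not contact the server. The base URL, the "bill" endpoint, and the api_key, format, limit, and offset parameters are assumptions drawn from the public Congress.gov API documentation and should be checked there; build_congress_url is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumed base URL for version 3 of the Congress.gov API; verify against the official docs.
BASE_URL = "https://api.congress.gov/v3"


def build_congress_url(endpoint: str, api_key: str, limit: int = 20, offset: int = 0) -> str:
    """Return a request URL for an endpoint such as 'bill' (hypothetical helper)."""
    params = {
        "api_key": api_key,  # the free key obtained at sign-up
        "format": "json",    # ask for JSON rather than XML
        "limit": limit,      # number of records per page
        "offset": offset,    # starting record, for paging through results
    }
    return f"{BASE_URL}/{endpoint.strip('/')}?{urlencode(params)}"


# "YOUR_API_KEY" is a placeholder, not a working key.
url = build_congress_url("bill", api_key="YOUR_API_KEY", limit=5)
print(url)
```

The resulting URL could then be passed to urllib.request.urlopen or a similar HTTP client, and the JSON response paged through by increasing offset.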

Congressional Record text corpus (ProQuest)

ProQuest Congressional Record (1789-2005): Alpha release

These files were derived from one large, unstructured XML file obtained from ProQuest. Harrison Dekker loaded the original file into BaseX, and Scott McGinnis queried the BaseX database to:

create a single file for each day;
sort those files into folders based on Congressional sessions, each representing a two-year span.

The Congressional data here were derived from the following publications:

Annals of Congress (1789-1824)
Register of Debates (1824-1837)
Congressional Globe (1833-1873)
Congressional Record (1873-2005)

More information about the Congressional Record is available on the US Congress library guide.

Please note:

OCR quality varies, and is particularly problematic for older material.
The ProQuest data did not include any content from 2003-2005 (the 108th Congress), so that folder is currently missing.
In some files, the XML may be missing a <fulltext> wrapper.
Folder naming convention: yyyy-yyyy-^session_number^
File naming convention: yyyymmdd-PQCR^serial_number^
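The notes above can be turned into a small amount of defensive code. The following is a minimal sketch, not the method used to build the corpus: it assumes folder names look like 1873-1875-43 and file names like 18730304-PQCR0001.xml (the exact form of the session and serial numbers and the file extension are guesses), and it assumes each file is well-formed XML. The function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from datetime import date

# Assumed concrete forms of the conventions yyyy-yyyy-^session_number^
# and yyyymmdd-PQCR^serial_number^ (optional file extension).
FOLDER_RE = re.compile(r"^(?P<start>\d{4})-(?P<end>\d{4})-(?P<session>\w+)$")
FILE_RE = re.compile(r"^(?P<ymd>\d{8})-PQCR(?P<serial>[^.]+)(?:\.\w+)?$")


def parse_folder_name(name):
    """Split a folder name into (start_year, end_year, session_number)."""
    m = FOLDER_RE.match(name)
    if m is None:
        raise ValueError(f"unexpected folder name: {name!r}")
    return int(m["start"]), int(m["end"]), m["session"]


def parse_file_name(name):
    """Split a file name into (date, serial_number)."""
    m = FILE_RE.match(name)
    if m is None:
        raise ValueError(f"unexpected file name: {name!r}")
    ymd = m["ymd"]
    return date(int(ymd[:4]), int(ymd[4:6]), int(ymd[6:])), m["serial"]


def extract_text(xml_string):
    """Return the text inside <fulltext>, or all text if the wrapper is missing."""
    root = ET.fromstring(xml_string)
    node = root if root.tag == "fulltext" else root.find(".//fulltext")
    if node is None:
        node = root  # wrapper missing: fall back to every text node in the file
    # Collapse runs of whitespace, which are common in OCR output.
    return " ".join("".join(node.itertext()).split())


print(parse_folder_name("1873-1875-43"))
print(parse_file_name("18730304-PQCR0001.xml"))
print(extract_text("<record><p>No wrapper here.</p></record>"))
```

Files that are not well-formed XML will raise xml.etree.ElementTree.ParseError, so a batch job over the whole corpus would likely want to catch that error and log the file name rather than stop.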

Helpful information:

GPO Website

Online Congressional Record (Bound Edition)
The Congressional Record is the official record of the proceedings and debates of the United States Congress, 1873-present. (U.S. Government Publishing Office)

Other Tools and Resources

the @unitedstates project
A shared commons of data and tools for accessing data from the United States government.
