
Text Mining & Computational Text Analysis

Government Documents

Congressional Record text corpus (ProQuest)

ProQuest Congressional Record (1789-2005): alpha release, UCB access only

These files were derived from one large, unstructured XML file obtained from ProQuest. Harrison Dekker loaded the original file into BaseX, and Scott McGinnis queried the BaseX database to

  1. create a single file for each day
  2. sort those files into folders, one per Congress (each covering a two-year span).

The Congressional data here were derived from the following publications:

  • Annals of Congress (1789-1824)
  • Register of Debates (1824-1837)
  • Congressional Globe (1833-1873)
  • Congressional Record (1873-2005)

More information about the Congressional Record is available on the US Congress library guide.

Please note:

  • OCR quality varies and is particularly poor for older material.
  • The ProQuest data did not include any content from 2003-2005 (the 108th Congress), so that folder is currently missing.
  • Some files are missing the <fulltext> wrapper element in their XML.
  • Folder naming convention: yyyy-yyyy-^session_number^
  • File naming convention: yyyymmdd-PQCR^serial_number^
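
As a rough illustration of working with the conventions above, the Python sketch below parses folder and file names and pulls the text out of a file whether or not the <fulltext> wrapper is present. It rests on several assumptions that are not confirmed by the notes above: the carets are read as placeholder delimiters rather than literal characters, the serial number is treated as arbitrary trailing text, files may or may not carry an .xml extension, each file is otherwise well-formed XML, and the wrapper element is named exactly fulltext with no namespace. The function names and the sample names used to exercise them are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from datetime import date, datetime

# Assumed patterns: carets in the documented conventions are treated as
# placeholder delimiters, not literal characters in the names.
FOLDER_RE = re.compile(r"(\d{4})-(\d{4})-(\d+)")
FILE_RE = re.compile(r"(\d{8})-PQCR(.+?)(?:\.xml)?")


def parse_folder_name(name: str) -> tuple[int, int, int]:
    """Split a 'yyyy-yyyy-N' folder name into (start_year, end_year, N)."""
    match = FOLDER_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"unexpected folder name: {name!r}")
    start, end, number = match.groups()
    return int(start), int(end), int(number)


def parse_file_name(name: str) -> tuple[date, str]:
    """Split a 'yyyymmdd-PQCR<serial>' file name into (date, serial)."""
    match = FILE_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    stamp, serial = match.groups()
    # strptime raises ValueError if the eight digits are not a real date.
    return datetime.strptime(stamp, "%Y%m%d").date(), serial


def extract_text(xml_source: str) -> str:
    """Return the text of the first <fulltext> element, or of the whole
    document when that wrapper is missing."""
    root = ET.fromstring(xml_source)
    wrapper = next(root.iter("fulltext"), None)
    node = wrapper if wrapper is not None else root
    return "".join(node.itertext()).strip()
```

Falling back to the root element keeps files without the wrapper usable, at the cost of possibly including text from other elements; depending on what else the files contain, a stricter filter may be preferable.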

Helpful information:

 

GPO Website

Other Tools and Resources