Text Mining & Computational Text Analysis: Congressional Record

Congressional Record text corpus (ProQuest)

ProQuest Congressional Record (1789-2005): Alpha release, UCB access only

These files were derived from one large, unstructured XML file obtained from ProQuest. Harrison Dekker loaded the original file into BaseX, and Scott McGinnis queried the BaseX database to

  1. create a single file for each day
  2. sort those files into folders by Congress (the two-year spans that the folder names refer to as sessions).

The Congressional data here were derived from the following publications:

  • Annals of Congress (1789-1824)
  • Register of Debates (1824-1837)
  • Congressional Globe (1833-1873)
  • Congressional Record (1873-2005)

More information about the Congressional Record is available on the US Congress library guide.

Please note:

  • OCR quality varies, and is particularly problematic for older material.
  • The ProQuest data did not include any content from 2003-2005 (the 108th Congress), so that folder is currently missing.
  • In some files, the XML is missing the <fulltext> wrapper.
  • Folder naming convention: yyyy-yyyy-^session_number^
  • File naming convention: yyyymmdd-PQCR^serial_number^
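
The naming conventions and the missing-wrapper caveat above can be handled along the following lines. This is a minimal sketch, not the method used to build the corpus. It assumes that the carets in the conventions only mark placeholders and do not appear in real names, that serial numbers contain no dots, that files may or may not carry an extension, and that falling back to all text in the document is acceptable when no <fulltext> element is found. The function names and the sample names in the comments are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from datetime import date

# Assumed shapes (hypothetical examples, not taken from the corpus):
#   folder: "1873-1875-43"    file: "18731201-PQCR12345" or "18731201-PQCR12345.xml"
FOLDER_RE = re.compile(r"^(\d{4})-(\d{4})-(\d+)$")
FILE_RE = re.compile(r"^(\d{4})(\d{2})(\d{2})-PQCR([^.]+)(?:\.\w+)?$")


def parse_folder_name(name):
    """Return (start_year, end_year, session_number) from a folder name."""
    match = FOLDER_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected folder name: {name!r}")
    start, end, session = match.groups()
    return int(start), int(end), int(session)


def parse_file_name(name):
    """Return (date, serial_number) from a file name; the serial stays a string."""
    match = FILE_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    year, month, day, serial = match.groups()
    # date() raises ValueError if the digits do not form a real calendar date.
    return date(int(year), int(month), int(day)), serial


def extract_text(xml_source):
    """Return the text inside <fulltext>, or all text if the wrapper is missing."""
    root = ET.fromstring(xml_source)
    node = root if root.tag == "fulltext" else root.find(".//fulltext")
    if node is None:
        node = root  # assumed fallback: no <fulltext> wrapper in this file
    # Join all nested text and collapse runs of whitespace.
    return " ".join("".join(node.itertext()).split())
```

Note that `extract_text` only covers well-formed XML in which the wrapper element is absent. A file that is not well-formed will still raise `xml.etree.ElementTree.ParseError` and would need separate handling.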

Copyright © 2014-2016 The Regents of the University of California. All rights reserved. Except where otherwise noted, this work is subject to a Creative Commons Attribution-Noncommercial 4.0 License.