

Text Mining & Computational Text Analysis

Government Documents

Congressional Record text corpus (ProQuest)

ProQuest Congressional Record (1789-2005): Alpha release, UCB access only

These files were derived from one large, unstructured XML file obtained from ProQuest. Harrison Dekker loaded the original file into BaseX, and Scott McGinnis queried the BaseX database to:

  1. create a single file for each day
  2. sort those files into folders based on Congressional sessions, each representing a two-year span.
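
As a rough illustration of the two-year grouping, the sketch below maps a calendar year to a Congress number and its two-year span. It assumes the folders correspond to numbered Congresses counted from 1789 (consistent with the 108th Congress covering 2003-2005) and ignores the exact day on which a Congress convenes. The function name is hypothetical and is not part of the process used to build these files.

```python
def congress_for_year(year):
    """Return (congress_number, start_year, end_year) for a calendar year.

    Assumption: the 1st Congress began in 1789 and every Congress spans
    two years. Dates in the first days of an odd year may actually belong
    to the previous Congress; this sketch does not handle that.
    """
    if year < 1789:
        raise ValueError("there was no Congress before 1789")
    number = (year - 1789) // 2 + 1
    start = 1789 + 2 * (number - 1)
    return number, start, start + 2
```

For example, both 2003 and 2004 map to the 108th Congress with the span 2003-2005.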

The Congressional data here were derived from the following publications:

  • Annals of Congress (1789-1824)
  • Register of Debates (1824-1837)
  • Congressional Globe (1833-1873)
  • Congressional Record (1873-2005)

More information about the Congressional Record is available in the US Congress library guide.

Please note:

  • OCR quality varies and is particularly poor for older material.
  • The ProQuest data did not include any content from 2003-2005 (the 108th Congress), so that folder is currently missing.
  • In some files, the XML is missing the <fulltext> wrapper element.
  • Folder naming convention: yyyy-yyyy-^session_number^
  • File naming convention: yyyymmdd-PQCR^serial_number^
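
The naming conventions and the missing-wrapper caveat above can be handled along the following lines. This is a minimal sketch, not a tested loader for the corpus. It assumes that the `^...^` markers are placeholders, that the session number is numeric, that the serial number is alphanumeric, that files may or may not carry an extension, and that each file is well-formed XML. The regular expressions, function names, and sample names are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from datetime import datetime
from pathlib import Path

# Hypothetical patterns inferred from the stated conventions:
#   folder: yyyy-yyyy-<session_number>
#   file:   yyyymmdd-PQCR<serial_number>
FOLDER_RE = re.compile(r"^(\d{4})-(\d{4})-(\d+)$")
FILE_RE = re.compile(r"^(\d{8})-PQCR(\w+)$")


def parse_folder(name):
    """Return (start_year, end_year, session_number), or None if no match."""
    m = FOLDER_RE.match(name)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2)), int(m.group(3))


def parse_file(name):
    """Return (date, serial_number), or None if the name does not fit."""
    m = FILE_RE.match(Path(name).stem)  # drop any extension first
    if m is None:
        return None
    try:
        day = datetime.strptime(m.group(1), "%Y%m%d").date()
    except ValueError:  # eight digits that are not a real calendar date
        return None
    return day, m.group(2)


def extract_text(xml_string):
    """Return the text under <fulltext>, or all document text if it is absent.

    Raises xml.etree.ElementTree.ParseError if the input is not well-formed.
    """
    root = ET.fromstring(xml_string)
    node = root if root.tag == "fulltext" else root.find(".//fulltext")
    if node is None:
        node = root  # wrapper missing: fall back to the whole document
    # Join all nested text and collapse runs of whitespace.
    return " ".join("".join(node.itertext()).split())
```

Falling back to the whole document keeps files without the wrapper usable, at the cost of possibly including metadata text that the wrapper would otherwise have excluded.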

Helpful information:

  • GPO Website