Liberating Congressional Financial Disclosure Data from PDFs
By OpenGov Coding Intern Ross Tsiomenko
Editor’s Note: This post began as an OpenSecrets.org challenge at the PDF Liberation Hackathon. The original hacking challenge can be found here.
The PDF Data Extraction Challenge
“Members, officers, and candidates of the House of Representatives submit Financial Disclosure Reports every year that detail the source, type, amount, or value of their incomes, as required by the Ethics in Government Act of 1978. However, there are over 1200 people that file these forms every year, including 435 Representatives, and each form can contain over 10 pages of important financial information. These forms are only available on the Clerk of the House website in PDF format, meaning this data can only be viewed by going to a specific website and searching for a specific name. Our goal is to automate the extraction of this data and make it available to everyone for easy viewing, searching, comparison, and analysis. The only other source for this data right now is OpenSecrets.org, which has to compile this information by hand, meaning they cannot handle the volume of the Reports on a timely basis.”
The current prototype, built during the PDF Liberation Hackathon, can handle some electronically filed forms using the ABBYY Cloud OCR API, a paid cloud OCR service.
Example Congressional Financial Disclosure PDF document. The goal is to extract the financial information at the bottom into an open data format.
Financial Disclosure Reports are not text-based PDFs but scanned images, so OCR (optical character recognition) software must be used to extract the data. ABBYY Cloud OCR is the only software currently known to extract tabular data correctly; the prototype uses a shell script to upload a PDF to the Cloud API, which returns a text file with most columns and rows intact. This text is then cleaned up and turned into a CSV file using Python.
Results: Ross successfully extracted Congressional Financial Disclosure data from a PDF into this tabular spreadsheet. Click here to view the prototype on GitHub.
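The cleanup step can be pictured roughly as follows. This is a minimal sketch, not the prototype's actual code: it assumes the OCR output keeps columns apart with runs of two or more spaces, and the function names and sample rows are made up for illustration.

```python
import csv
import io
import re


def ocr_text_to_rows(text, min_gap=2):
    """Split each non-blank line into cells wherever min_gap or more
    whitespace characters occur in a row (assumed column separator)."""
    splitter = re.compile(r"\s{%d,}" % min_gap)
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines left over from the scan
        rows.append(splitter.split(line))
    return rows


def rows_to_csv(rows):
    """Serialize rows to CSV text; the csv module quotes cells with commas."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


# Hypothetical OCR output, not taken from a real report.
sample = (
    "Asset                 Value of Asset        Type of Income\n"
    "\n"
    "Example Bank Savings  $1,001 - $15,000      INTEREST\n"
)

if __name__ == "__main__":
    print(rows_to_csv(ocr_text_to_rows(sample)))
```

Splitting on wide gaps rather than single spaces keeps values such as "$1,001 - $15,000" in one cell. Real OCR output will need more cleanup than this, for example when a column drifts or a cell is empty.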
Note that there are two types of reports: electronically filed and handwritten. Both are scanned in, but electronically filed reports contain typed text, which OCR handles well, while handwritten forms are beyond any fully automated OCR effort and must be handled differently, for example by breaking up the reports for use in a CAPTCHA service (see the end of the Next Steps section for more).
There is still a lot of work to be done. The software must be expanded to cover all possible Report form variations, and must be tested on a large data set for accuracy. See the next section for a development roadmap.
- The current PDF extraction prototype can be found here.
- Personal Financial Statements can be found here.
Note: there is no bulk download of PDFs; they are available only through a search form.
* Create a bulk package of PDFs for testing
The Clerk of the House website does not allow downloading all PDF files for a given year, nor does it allow filtering by electronically filed or handwritten forms. To properly test any software that is created, a test data set must first be built; a single zip file of, for example, all 2013 disclosures would be invaluable to the people working on code. It would also be very helpful to create two zip files, one with electronically filed forms and one with handwritten ones. Finally, it would be great if this process could be automated, so that bulk downloads proceed automatically for all subsequent years (at least while the Clerk of the House website remains unchanged). A starting point is to leave the search box blank and set the filing year to 2013, which should give you 41 pages of 818 separate search results with download links.
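Once the PDFs have been downloaded, the packaging step might look like the sketch below. It makes several assumptions: the files already sit in a local folder, someone has labeled each file as electronic or handwritten (the `kinds` mapping is hypothetical, since the website offers no such sorting), and the function name and zip naming scheme are invented here. The download step itself is left out because it depends on how the search results pages are structured.

```python
import zipfile
from pathlib import Path


def build_test_packages(pdf_dir, kinds, out_dir, year):
    """Write one zip per filing kind and return {kind: zip path}.

    kinds maps a PDF file name to a label such as "electronic" or
    "handwritten"; files without a label go into an "unsorted" zip.
    """
    # Group the PDFs by label first so each archive is written once.
    groups = {}
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        groups.setdefault(kinds.get(pdf.name, "unsorted"), []).append(pdf)

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    archives = {}
    for kind, pdfs in groups.items():
        zip_path = out_dir / f"{year}_{kind}.zip"
        # Mode "w" overwrites an older archive, so reruns do not duplicate entries.
        with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
            for pdf in pdfs:
                archive.write(pdf, arcname=pdf.name)
        archives[kind] = zip_path
    return archives
```

Calling `build_test_packages("pdfs", kinds, "packages", 2013)` would produce files such as `packages/2013_electronic.zip` and `packages/2013_handwritten.zip`, provided the folder contains PDFs with those labels.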
* Ensure current version works on multiple types of electronic forms
There is no single standard for Financial Disclosure Reports: there are different form types (A and B, depending on the person filing), and each report contains different Schedules that may or may not be omitted (such as “Schedule III – Assets and Unearned Income” or “Schedule IV – Transactions”). The software should account for all these possibilities and handle them accordingly.
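One way to cope with omitted Schedules is to split the OCR text at each Schedule heading and then process only the sections that are actually present. The sketch below assumes headings look like the two examples above ("Schedule", a Roman numeral, a dash, a title); the regular expression and function name are illustrative and would need checking against real forms.

```python
import re

# Assumed heading shape: "Schedule III – Assets and Unearned Income".
SCHEDULE_HEADING = re.compile(
    r"^\s*SCHEDULE\s+([IVX]+)\b[\s\-\u2013:]*(.*)$", re.IGNORECASE
)


def split_schedules(text):
    """Return {numeral: {"title": str, "lines": [str, ...]}} per schedule found."""
    schedules = {}
    current = None
    for line in text.splitlines():
        match = SCHEDULE_HEADING.match(line)
        if match:
            numeral = match.group(1).upper()
            current = schedules.setdefault(
                numeral, {"title": match.group(2).strip(), "lines": []}
            )
        elif current is not None and line.strip():
            # Any non-blank line belongs to the most recent heading.
            current["lines"].append(line.strip())
    return schedules


# Hypothetical OCR text with Schedules III and IV only.
sample = (
    "Schedule III \u2013 Assets and Unearned Income\n"
    "Example Bank Savings  $1,001 - $15,000\n"
    "Schedule IV \u2013 Transactions\n"
    "Example Fund  Purchase\n"
)

if __name__ == "__main__":
    for numeral, section in split_schedules(sample).items():
        print(numeral, section["title"], len(section["lines"]))
```

A Schedule that was left out of a report simply does not appear in the returned dictionary, so downstream code can test for its key instead of assuming a fixed layout.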
* Convert the bulk package of electronically filed forms
After a test data set of electronically filed forms has been created, the current prototype should be tested on as many forms as possible to ensure it works with all possible variations.
* Transition to open-source ABBYY alternative
Once everything is working, an alternative to the paid ABBYY Cloud OCR service should be found. Although ABBYY works well, it is not free: processing all forms filed within a calendar year would take 10,000+ page requests (not counting development trial and error), which could cost up to $900 according to ABBYY pricing.
Tools like Tesseract and Tabula have been tried, and they do not currently work for this task. Financial Disclosure Reports are scanned documents, not text-based ones, so the software must handle both OCR and tabular data (a rare combination); Tabula and other Apache PDFBox derivatives only work with text-based PDFs, and Tesseract, while it has great OCR capabilities, does not handle tabular data well.
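One possible direction, which the prototype does not currently use, is to take word positions from an OCR engine and rebuild the table rows in a separate step. The sketch below assumes the engine can report each word with the pixel coordinates of its top-left corner; the `(text, x, y)` input format, the function name, and the tolerance value are all assumptions made for illustration.

```python
def words_to_rows(words, row_tolerance=5):
    """Group (text, x, y) word boxes into rows of text, left to right.

    Words whose top edge lies within row_tolerance pixels of the first
    word in a row are treated as part of that row.
    """
    rows = []
    # Visit words top to bottom so each row is built before the next starts.
    for text, x, y in sorted(words, key=lambda w: (w[2], w[1])):
        if rows and abs(y - rows[-1]["y"]) <= row_tolerance:
            rows[-1]["cells"].append((x, text))
        else:
            rows.append({"y": y, "cells": [(x, text)]})
    # Within each row, order the words by their horizontal position.
    return [[text for _, text in sorted(row["cells"])] for row in rows]


# Hypothetical word boxes from a slightly skewed scan.
sample_words = [
    ("INTEREST", 400, 102),
    ("Savings", 10, 100),
    ("$1,001", 200, 101),
    ("Fund", 10, 130),
    ("DIVIDENDS", 400, 131),
]

if __name__ == "__main__":
    for row in words_to_rows(sample_words):
        print(row)
```

This only recovers rows. Assigning words to columns, for instance by clustering their x positions, would be a further step, and the tolerance would need tuning for skewed or low-resolution scans.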
* Tackle Handwritten Financial Disclosure PDFs
From the original OpenSecrets challenge: “Members of the House of Representatives submit a yearly report on their personal finances. Though there is the option to submit the report electronically, some Members choose to download the report and hand enter their information. The challenge for these reports is to build a CAPTCHA type program that will allow us to crowd-source the data entry for these filings. The CAPTCHA program will need to split the PDF into readable bits and know what each bit means (the field in the form). It is also important for this program to not split a word from the PDF mid-letter.”
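The requirement not to split a word mid-letter can be approached by cutting only where the scan has a run of blank pixel columns. The following is a minimal sketch under stated assumptions: the input is a list holding the number of dark pixels in each pixel column of one line of a form field (computing that list from an image is not shown), and the function name and gap threshold are hypothetical.

```python
def find_cut_points(column_ink, min_gap=3):
    """Return x positions where a strip can be cut without touching ink.

    column_ink[x] is the count of dark pixels in pixel column x. A cut is
    placed in the middle of every interior run of blank columns that is at
    least min_gap wide; narrower runs are treated as gaps between letters.
    """
    cuts = []
    gap_start = None
    for x, ink in enumerate(column_ink):
        if ink == 0:
            if gap_start is None:
                gap_start = x  # a blank run begins here
        else:
            # Ink again: close the blank run, ignoring the left margin.
            if gap_start is not None and gap_start > 0 and x - gap_start >= min_gap:
                cuts.append((gap_start + x) // 2)
            gap_start = None
    return cuts


# Hypothetical profile: margin, two close letters, a wide gap, a second word.
sample_profile = [0, 0, 5, 7, 0, 6, 0, 0, 0, 0, 4, 4, 0, 0]

if __name__ == "__main__":
    print(find_cut_points(sample_profile))
```

Because every cut falls on a blank column, no letter is sliced, and the minimum gap width keeps the cut between words, not between letters. Connected cursive handwriting may leave no blank columns at all, so this approach would likely need a fallback for such fields.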