Liberating Congressional Financial Disclosure Data from PDFs
By OpenGov Coding Intern Ross Tsiomenko
Editor’s Note: this was originally an OpenSecrets.org challenge at the PDF Liberation Hackathon. The original hacking challenge can be found here.
The PDF Data Extraction Challenge
“Members, officers, and candidates of the House of Representatives submit Financial Disclosure Reports every year that detail the source, type, amount, or value of their incomes, as required by the Ethics in Government Act of 1978. However, there are over 1200 people that file these forms every year, including 435 Representatives, and each form can contain over 10 pages of important financial information. These forms are only available on the Clerk of the House website in PDF format, meaning this data can only be viewed by going to a specific website and searching for a specific name. Our goal is to automate the extraction of this data and make it available to everyone for easy viewing, searching, comparison, and analysis. The only other source for this data right now is OpenSecrets.org, which has to compile this information by hand, meaning they cannot handle the volume of the Reports on a timely basis.”
Ross Tsiomenko, OpenGov Foundation Coding Intern
The current prototype built during the PDF Liberation Hackathon can handle some electronically-filed forms using the ABBYY Cloud OCR API, a paid cloud OCR service.
Example Congressional Financial Disclosure PDF Document. Goal is to extract the financial information at the bottom into an open data format.
Financial Disclosure Reports are not text-based PDFs, but rather scanned-in images, meaning OCR (optical character recognition) software must be used to extract the data. ABBYY Cloud OCR is the only software currently known to extract tabular data correctly; the prototype uses a shell script to upload a PDF to the Cloud API, which returns a text file with most columns and rows intact. This is then cleaned up and turned into a CSV file using Python.
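The clean-up step described above can be sketched roughly as follows. This is a minimal illustration, not the prototype's actual code: it assumes the OCR text keeps columns separated by runs of two or more spaces, and the function names and the sample rows are made up for the example.

```python
import csv
import io
import re


def ocr_text_to_rows(text):
    """Split column-aligned OCR text into rows of cells.

    Assumption: columns are separated by two or more spaces, and
    single spaces occur only inside a cell.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines left over from the scan
        rows.append(re.split(r"\s{2,}", line))
    return rows


def rows_to_csv(rows):
    """Serialize rows to a CSV string, quoting cells where needed."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Invented sample standing in for the text file returned by the OCR service.
sample = """
Asset                 Value of Asset        Type of Income
Example Bank Savings  $1,001 - $15,000      Interest
"""
print(rows_to_csv(ocr_text_to_rows(sample)))
```

Real OCR output is messier than this (merged columns, stray characters), so a production version would need per-form rules on top of the simple whitespace split.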
Results: Ross was able to successfully extract Congressional Financial Disclosure data from a PDF into this tabular spreadsheet. Click here to view the prototype on GitHub.
Note there are two types of reports – electronically filed and handwritten. Both are scanned in, but electronically filed reports contain typed text that OCR handles well, while the handwritten forms are beyond any fully automated OCR effort and must be handled differently, for example by breaking up the reports for use in a CAPTCHA service (see the end of the Next Steps section for more).
There is still a lot of work to be done. The software must be expanded to cover all possible Report form variations, and must be tested on a large data set for accuracy. See the next section for a development roadmap.
- The current PDF extraction prototype can be found here.
- Personal Financial Statements can be found here.
Note: there is no bulk download of PDFs, only through a search option.
* Create a bulk package of PDFs for testing
The Clerk of the House website does not allow downloading of all PDF files by year, nor does it allow sorting by electronically filed forms or handwritten ones. To properly test any software that is created, a test data set must first be created; a single zip file of, for example, all 2013 disclosures, would be invaluable to the people working on code. It would also be very helpful to create 2 zip files, one with electronically filed forms, and one with handwritten ones. Finally, it would be great if this process could somehow be automated, so that bulk downloads would proceed automatically for all subsequent years (at least while the Clerk of the House website remains unchanged). A starting point for this is to leave the search box blank and set the filing year to 2013, which should give you 41 pages of 818 separate search results with download links.
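Once the PDFs have been downloaded, the packaging step could look something like the sketch below. It covers only the zipping, not the download from the Clerk of the House website. The function name `package_by_kind`, the `kinds` mapping, and the archive naming scheme are all assumptions made for illustration.

```python
import zipfile
from pathlib import Path


def package_by_kind(pdf_dir, kinds, out_dir, year):
    """Write one zip per filing kind and return a dict of zip paths.

    `kinds` maps a PDF file name to "electronic" or "handwritten";
    how that label is obtained (by hand or by a classifier) is left open.
    """
    pdf_dir, out_dir = Path(pdf_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Group the PDFs by label; unlabeled files are skipped, not guessed.
    grouped = {}
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        kind = kinds.get(pdf.name)
        if kind is not None:
            grouped.setdefault(kind, []).append(pdf)

    archives = {}
    for kind, pdfs in grouped.items():
        zip_path = out_dir / f"{year}-{kind}.zip"
        with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
            for pdf in pdfs:
                archive.write(pdf, arcname=pdf.name)
        archives[kind] = zip_path
    return archives
```

Calling `package_by_kind("pdfs/2013", kinds, "bundles", 2013)` would produce `bundles/2013-electronic.zip` and `bundles/2013-handwritten.zip`, provided both labels appear in `kinds`.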
* Ensure current version works on multiple types of electronic forms
There is no single standard for Financial Disclosure Reports – there are different form types (A and B, depending on the person filing), and different Schedules in each report that may or may not be omitted (such as “Schedule III – Assets and Unearned Income” or “Schedule IV – Transactions”). The software should account for all these possibilities and handle them accordingly.
* Convert the bulk package of electronically filed forms
After a test data set of electronically filed forms has been created, the current prototype should be tested on as many forms as possible to ensure it works with all possible variations.
* Transition to open source ABBYY alternative
Once everything is working, an alternative to the paid ABBYY Cloud OCR service should be found. Although ABBYY works great, it is not free; processing all forms filed within a calendar year would take 10,000+ page requests (not counting development trial and error), which could cost up to $900 according to ABBYY pricing.
Software like Tesseract and Tabula has been tried, and does not currently work. Financial Disclosure Reports are scanned documents, not text-based, so the software must handle both OCR and tabular data (a rarity); Tabula and other Apache PDFBox derivatives only work with text-based PDFs, and Tesseract, while having great OCR capabilities, does not handle tabular data well.
* Tackle Handwritten Financial Disclosure PDFs
From the original OpenSecrets.org challenge – “Members of the House of Representatives submit a yearly report on their personal finances. Though there is the option to submit the report electronically, some Members choose to download the report and hand enter their information. The challenge for these reports is to build a CAPTCHA type program that will allow us to crowd-source the data entry for these fillings. The CAPTCHA program will need to split the PDF into readable bits and know what each bit means (the field in the form). It is also important for this program to not split a word from the PDF mid-letter.”
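One way to avoid cutting through a letter is to cut only where a scanned line has no ink at all. The sketch below is a simplified illustration of that idea, not part of the challenge or the prototype. It assumes each line of handwriting has already been reduced to a count of dark pixels per column, and the `min_gap` threshold and function name are hypothetical.

```python
def find_cut_columns(ink_per_column, min_gap=3):
    """Return x positions where a scanned line can be cut safely.

    `ink_per_column[x]` is the number of dark pixels in column x.
    A cut is placed in the middle of every interior run of at least
    `min_gap` empty columns, so no cut ever passes through ink.
    Narrower gaps are treated as spacing between letters and left alone.
    """
    cuts = []
    gap_start = None
    for x, ink in enumerate(ink_per_column):
        if ink == 0:
            if gap_start is None:
                gap_start = x  # a new run of empty columns begins
        else:
            wide_enough = gap_start is not None and x - gap_start >= min_gap
            # gap_start > 0 ignores the left margin of the line.
            if wide_enough and gap_start > 0:
                cuts.append((gap_start + x - 1) // 2)
            gap_start = None
    return cuts


# Two "words": ink at columns 2-7 (with a 1-column letter gap) and 12-13.
profile = [0, 0, 5, 7, 3, 0, 4, 6, 0, 0, 0, 0, 8, 2, 0, 0]
print(find_cut_columns(profile))
```

Handwriting often joins or overlaps words, so a real tool would likely need a fallback for lines where no sufficiently wide gap exists.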
“I want to be able to do that automatically, because ain’t nobody got time to sit here and hack on every government [PDF] document we want to make open . . . the longer-term goal is to improve government processes, detect fraud, waste, and abuse, and provide open data for start-ups to generate economic value with.”— Travis Korte on liberating civic data from PDFs
If you are a government watch-dog trying to follow the scent of financial scandal, you face a dilemma. Much of the information you need is difficult to access–but not because of legal barriers. It’s because members of Congress submit their financial records via PDF.
Portable Document Format, or PDF, is by its nature difficult to transfer into other formats, and can’t be used by developers to produce helpful applications. It’s hard to search for specific information in PDFs, so you often have to scroll through every page to find what you’re looking for. And information like large collections of numbers can’t simply be moved from PDFs into spreadsheets, so journalists and researchers can’t manipulate the data or turn it into helpful visuals.
If you need information stored in PDFs you have two choices. You can either spend thousands of dollars to purchase PDF extraction software, or you can hire a hapless intern to spend hours hunched over a screen, scrolling through data page by page and entering it–by hand–into a database. Sadly, there’s yet to be an easy solution to working with data locked in PDFs.
Hence the PDF Liberation Hackathon: on January 19, 2014, developers and civic-minded individuals gathered in cities around the world in a heroic effort to liberate this crucial public data from inaccessible PDFs. The event focused on converting large collections of documents, ranging from House of Representatives Financial Disclosures to decades of US Foreign Aid Reports, into open data formats usable by developers. Many of the hackathon’s attendees represented transparency-minded organizations who plan to search, sift, organize, and publish the information for public knowledge. But a bigger goal emerged from these challenges–to automate the process of PDF extraction so that anyone who struggles with PDFs, from government workers to civic hackers, will never have to deal with the problem again.
Delivered by OpenGov Foundation Executive Director Seamus Kraft at a memorial for Sean Keefer on January 24, 2014 at Ames Hall on the George Washington University Mount Vernon Campus.
Every one of us is hurting. We’ve lost a loyal friend and stellar student. We’ve lost a talented and joyful teammate just beginning to grasp the gifts he had. You have lost a big brother, and a son. I’m so sorry.
Thank you for sharing Sean with us, and for giving me the chance to celebrate his life here with you today.
Sean joined our OpenGov family 4 short months ago. He stood out from the moment I met him.
“Hello,” he wrote to me not even a week into his first semester. “I’m a Computer Science student at GWU and I am excited to work with the OpenGov Foundation to create disruptive products. For the past 15 months I worked as a development intern at a Portland startup where I helped build, test and maintain an e-commerce website. I led a group of interns, developed the internal metrics system from the ground up. Why should you welcome me into your team? Because I want to build great products that will change lives.”
Who is this kid? I thought. How could a 19-year-old be that accomplished, that confident in his abilities, that fired up to help other people? Either he’s pulling my leg or he’s going to be a rock star.
Sean was a star. He’d come over one day a week and sit next to me at the dining room table, palpably excited to be coding. That rubbed off on everyone around him. It’s not fair that I’ll never again get to sit with him, and see him there smiling, doing what he loved.
Sean, I know you’re watching us from up in heaven. I miss you, but know we’ll meet again. Until then…
May the road rise to meet you,
May the wind be always at your back,
May the sun shine warm upon your face,
May the rains fall soft upon your fields,
And, until we meet again,
May God hold you in the hollow of His hand.
On Tuesday, OpenGov coding intern Sean Keefer passed away. He was just 19, and a freshman at George Washington University.
Sean was a wonderful young man who loved to run and listen to music, but most of all, loved to code. He joined OpenGov in October 2013, excited to apply his talents to helping others. Just last week he finished his first project: creating the first online, accessible and user-friendly edition of the laws of Raleigh, NC. He was eager to push his work live, and put it into the hands of the people who deserve equal access to their government, but cannot secure it by themselves. That is everything to which we aspire in the open government and open data communities. And that was Sean.
Tomorrow at 4 PM EST, the OpenGov Team will join Sean’s family, friends and classmates for a memorial vigil at GWU’s Ames Hall on the Mount Vernon Campus, located at 2100 Foxhall Road NW, Washington, DC 20007.
We are heartbroken, and ask that you keep Sean, his Mom and Dad, and his two younger sisters in your thoughts and prayers.
We miss you, bud. Rest in peace.
– Seamus, Leili, Chris and Bill
My First Hackathon: Learning to Extract
The 2014 PDF Liberation Hackathon
Editor’s Note: OpenGov Intern Matt Steinberg (pictured at right below) recaps his first hackathon, underscoring how “open government” isn’t just for professional software developers.
Over the weekend of January 16-18 the Sunlight Foundation sponsored a PDF Liberation Hackathon which I participated in as a non-developer. The event focused on the problem that many organizations have reams of data in PDF. This is an issue because PDF does not allow the user to interact with the document in word searches and other ways; the data they contain is essentially locked up and difficult to manipulate. This means that a lot of data, including congressional financial disclosures and non-profit expenditures, is not easily viewable, and cannot be used to create tables, graphs, diagrams, and apps. The goal of the event was to make progress in unlocking these documents by testing and tweaking different software to convert documents into more usable formats.
A term that I learned that is essential to opening PDFs is OCR, or optical character recognition–this is the way the characters in PDFs or scans are converted to other formats such as .txt files.
At first I worked with Damarius Hayes from the docHive team. I learned that many businesses have loads of tax forms dating years back that are stuck in PDFs. Damarius showed me how the software the docHive team is developing allows the user to mark off regions of the page, measured in pixels, as little rectangles. The data within these rectangles may be extracted into tables and spreadsheets which are effective for building diagrams, graphs, and other useful data-related things. Since the visual structure is consistent from form to form, you may isolate the important areas of the document with the rectangles, and boom–you then have a template which you may use to extract info from other structurally identical forms.
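The template idea can be illustrated with a toy sketch. This is not docHive's code. It assumes a page is a grid (here, lines of characters standing in for rows of pixels) and a template is a hypothetical mapping from field name to a `(top, left, bottom, right)` rectangle.

```python
def crop(page, box):
    """Return the part of `page` inside box = (top, left, bottom, right)."""
    top, left, bottom, right = box
    return [row[left:right] for row in page[top:bottom]]


def apply_template(page, template):
    """Cut every named rectangle of the template out of one page."""
    return {field: crop(page, box) for field, box in template.items()}


# Invented stand-in for a scanned form; each string is one row of the grid.
page = [
    "NAME: J. DOE   ",
    "YEAR: 2013     ",
]
template = {"name": (0, 6, 1, 15), "year": (1, 6, 2, 10)}
print(apply_template(page, template))
```

Because the rectangles are defined once per form layout, the same `template` can be reused on every structurally identical form, and each cropped region can then be handed to OCR on its own.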
I then worked with the docHive team to set up Homebrew on my computer. Homebrew is a “package management system, a collection of software tools to automate the process of installing, upgrading, configuring, and removing software packages for a computer’s operating system in a consistent manner” (Wikipedia). It is essentially open source software that allows the user to easily download other open source software. I then downloaded Tesseract, an OCR program, and ImageMagick, an image-processing tool. It took a while to get Tesseract to start working, but finally, with the help of Edward Duncan from docHive, I was able to convert a document from PDF to text.
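For readers who want to try the same conversion, here is a rough sketch of how the two tools can be chained from Python. It is an assumption about a typical workflow, not a record of the exact commands used at the hackathon. It handles only the first page, the helper names are hypothetical, and it needs ImageMagick (with Ghostscript for PDF support) and Tesseract installed. On ImageMagick 7 the command is `magick` rather than `convert`.

```python
import subprocess


def build_ocr_commands(pdf_path, out_base, dpi=300):
    """Return the two command lines: rasterize page 1, then OCR it."""
    image_path = f"{out_base}.png"
    return [
        # ImageMagick: render the first page ("[0]") of the PDF as a PNG.
        ["convert", "-density", str(dpi), f"{pdf_path}[0]", image_path],
        # Tesseract: write the recognized text to "<out_base>.txt".
        ["tesseract", image_path, out_base],
    ]


def pdf_first_page_to_text(pdf_path, out_base):
    """Run both commands in order and return the path of the text file."""
    for command in build_ocr_commands(pdf_path, out_base):
        subprocess.run(command, check=True)  # raises if a tool fails
    return f"{out_base}.txt"
```

Calling `pdf_first_page_to_text("report.pdf", "report")` would leave the recognized text in `report.txt`. A higher `dpi` usually helps OCR accuracy on scans, at the cost of larger intermediate images.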
I feel that as a non-developer the biggest thing I contributed was a spirit of curiosity and learning. By asking the developers questions I was a help to them by compelling them to slow down and gain a macroscopic perspective on their work. In fact almost everyone to whom I asked questions ended up tutoring me and expressing their gratitude for my questions. Most said that they were learning from me as well. This is useful because there was a great deal of technical jargon being thrown around and many of the eventual users of the tools that were and will be developed out of this event are amateurs just like me. I have already reached out to Damarius and Edward of docHive and they continue to help me understand what they do, where to go from here and what I can do to help. Edward in particular has given me some specific technical help on how to use Tesseract to convert PDFs to txt. The next step for me is to do private research on all the jargon I heard and to start testing out the different tools I was exposed to – namely Tesseract and ImageMagick. I currently have a project converting PDFs into markdown and think this will be a good opportunity to learn the ins and outs of the software.
All of this is important because it has to do with opening data to the public – helping to construct a more transparent society. This concept of open-data is relatively new to me and I hope to learn more by reading the book “Open Data Now” by Joel Gurin. Having data be easily accessible is essential to a vital and engaged society. After helping to liberate civic data people really need from PDFs, I feel energized to contribute my skills and my time to bring that society to life in the United States.
Bipartisan Group of Judiciary Committee Members Strongly Question Access Barriers & Costs Faced by Citizens, Job Creators
WASHINGTON, DC – The US House Judiciary Subcommittee on Courts, Intellectual Property, and the Internet held a hearing yesterday to discuss the scope of copyright in America, focusing on existing copyright restrictions placed on public information like legal codes, laws and standards. During the hearing, a bipartisan group of Members of Congress and witnesses expressed significant concerns with current copyright rules that block the open access to the law necessary for a healthy democracy, and that can even force Americans to pay hard-earned money to access the laws by which they must live. Going further, Rep. Zoe Lofgren (D-CA) and Rep. Darrell Issa (R-CA) stated that these issues must be addressed in any copyright reform legislation considered by the US Congress.
WATCH Rep. Zoe Lofgren (D-CA): “I agree with Mr. Issa that there’s no copyright reform that we should support that doesn’t resolve this issue.”
Key Quotes: Lofgren
“It seems to me very clear that you cannot have secret law. If you’re going to require people to adhere to a standard, that has to be in public domain. I’m sympathetic–I understand that there’s a business model set up–but you can’t allow the business model to trump the rule of law.”
“. . . if you incorporate by reference a document, that has to be part of the public record.”
“If there’s a fee, for example, that assumes that the public doesn’t have an interest . . . there’s a public interest in this; it’s not just the people in the business. It’s the public’s right to know. Is this of sufficient standard? Well, the only way you’re going to find out is to have free access to that. And to put up a screen to that, if it’s part of the law, is completely and wholly inappropriate.”
“I agree with Mr. Issa that there’s no copyright reform that we should support that doesn’t resolve this issue.”
“Copyright protection for laws, codes and standards appears to clash with the fundamental ability of our citizens to know what laws and regulations they must live by.” – US House Judiciary Chairman Bob Goodlatte (R-VA)
Key Quotes: Goodlatte
Copyright-restricted laws, standards and legal codes is “. . . An issue that has received less public attention, but is one that does go to the heart of how citizens interact with their government.”
“It was also the subject of the very first copyright case heard by the Supreme Court in 1834.
“Copyright protection for laws, codes and standards appears to clash with the fundamental ability of our citizens to know what laws and regulations they must live by.
“It is fortunate that the number of states seeking to claim copyright protection on their laws and regulations, despite long-standing Copyright Office and Administration views to the contrary, has sharply declined.”
Key Quotes: Issa
“I just want to go on the record that in the copyright reform we’re considering in the committee, in order to have my vote in the final passage we’ll have to rectify the ambiguity in the law so that every American has free access to every law that he or she must live under.”
“Who authors a law? . . . If the state of Idaho, the state of Georgia, the state of Mississippi, if they produce a law, every single person who voted for it is an author. It doesn’t belong to some entity by definition. . . . in its rawest form, isn’t in fact every single person who participates in a law, or the inclusion by association of a standard, in fact an author, and therefore, if I’m willing to release it to everyone, as an owner of that copyright, an undivided owner, don’t you ultimately have no possibility of protection?”
“If it is a voluntary standard it’s available for copyright, and I understand that. But if it is incorporated into law, at that point shouldn’t you object to it being incorporated or recognize that you’re waiving any copyright objections from the public having free and fair access to essentially a law that they must comply with?”
Key Quotes: Malamud
“…The law belongs to the people. If we give those without great means a substandard web site as their only access to the law, we have put a poll tax on access to justice. When we require a license to speak the law, we have made a mockery of freedom of speech. When we deliberately restrict access to the law—including the public safety codes that protect our homes, families, and workplaces—we have violated the fundamental principle of the rule of law that underpins our democracy.”
Key Quotes: Johnson
“. . . materials created by the US government and state governments do not deserve copyright protection, nor have they ever received it.”
“At its core, this issue touches on the American ideal for justice–that we must know the laws that govern us. This right is fundamental to the rule of law that underpins our democracy, particularly when the concept that ignorance is no excuse pervades our process. It is also central to upholding our system of checks and balances by holding congress accountable for legislation it passes or fails to pass.”
“As we review copyright protection in anticipation of the next great copyright act, we must continue to protect Americans’ access to laws and justice by protecting access to public materials in the public domain.”
“It’s easy to take for granted how important public databases are in our increasingly digital democracy; unless public documents are digitized and available they are often out of reach of many.”
Key Quotes: Farenthold
“I’m going to have to agree that once something is enacted into law the public ought to have a right to get to it free.”
“Don’t the standard-setting organizations collect membership dues and generate revenue from the members who participate? I mean, I understand that in the old days it cost money to print up the books and distribute them, but now the marginal cost of making this information available over the internet is basically none.”
“And there’s zero value to a light bulb that doesn’t fit the light bulb standard, to use your analogy. Shouldn’t the private sector that benefits from these pay and the public should have them free?”
“Why shouldn’t I be able to print out a copy of the electric code to make sure the electrician hooked the green wire up to ground in my house?”
Key Quotes: Love
“I think that the US laws, works of federal employees, and federal laws and regulations are not subject to copyright. I think it would be good to extend that rule to laws at the state level, on everything from court opinion to regulations to statutes.”
SanFranciscoCode.org, BaltimoreCode.org & The State Decoded Featured As Local Government Doing It Right Online
WASHINGTON, DC – A just-released New America Foundation report – “Public Pathways: A Guide to Online Engagement Tools for Local Governments” – features the OpenGov Foundation’s work upgrading municipal codes into more user-friendly, accessible and understandable formats on the Internet, underscoring the power of legal open data and $0-cost software to strengthen efficiency, transparency and accountability in America’s local governments. Building off the State Decoded open source software project, OpenGov has modernized the laws for citizens and public servants in San Francisco, Baltimore, Chicago, Philadelphia and Maryland, with more on the way in 2014.
From the report:
“Budgets are not the only incomprehensible municipal document. Laws can be extremely challenging to read and comprehend because they are not written in easy to understand language. It can be hard to find specific and relevant laws because most ordinances are provided in PDFs (portable document format) that do not allow for simple document searching. To ease the burden of reading complicated laws The OpenGov Foundation built a tool used by the City of Baltimore that opens law to everyone. BaltimoreCode.org transforms the Baltimore City Charter and Code from unalterable, often hard to find online files and updates them into user friendly, organized and modern website formats in accessible language. The goal of providing more accessible law is to add clarity, context, and public understanding of the laws’ impact on Baltimore citizens’ daily lives. The City of Baltimore was the first to become an open law city, but other cities are catching up. San Francisco is also working with the Foundation and residents to open San Francisco’s law by making them user friendly.”