“I want to be able to do that automatically, because ain’t nobody got time to sit here and hack on every government [PDF] document we want to make open . . . the longer-term goal is to improve government processes, detect fraud, waste, and abuse, and provide open data for start-ups to generate economic value with.”— Travis Korte on liberating civic data from PDFs
If you are a government watchdog trying to follow the scent of financial scandal, you face a dilemma. Much of the information you need is difficult to access–but not because of legal barriers. It’s because members of Congress submit their financial records via PDF.
By its nature, the Portable Document Format, or PDF, is difficult to convert into other formats and can’t be used by developers to produce helpful applications. It’s hard to search for specific information in PDFs, so you often have to scroll through every page to find what you’re looking for. And data like large collections of numbers can’t simply be moved from PDFs into spreadsheets, so journalists and researchers can’t manipulate it or turn it into helpful visuals.
If you need information stored in PDFs, you have two choices. You can either spend thousands of dollars to purchase PDF extraction software, or you can hire a hapless intern to spend hours hunched over a screen, scrolling through data page by page and entering it–by hand–into a database. Sadly, there is still no easy way to work with data locked in PDFs.
Hence the PDF Liberation Hackathon: on January 19, 2014, developers and civic-minded individuals gathered in cities around the world in a heroic effort to liberate this crucial public data from inaccessible PDFs. The event focused on converting large collections of documents, ranging from House of Representatives Financial Disclosures to decades of US Foreign Aid Reports, into open data formats usable by developers. Many of the hackathon’s attendees represented transparency-minded organizations that plan to search, sift, organize, and publish the information for public knowledge. But a bigger goal emerged from these challenges–to automate the process of PDF extraction so that anyone who struggles with PDFs, from government workers to civic hackers, will never have to deal with the problem again.
OpenGov’s Role in Liberating Civic Data from PDFs
The OpenGov Foundation attended the PDF Liberation Hackathon for personal reasons. We have to deal with a lot of legal code in PDFs to pursue our mission of developing and deploying technologies that support every citizen’s ability to participate in their government. So we wanted to understand what tools work for extracting information from PDFs, and we wanted to share what we learned with others.
Obtaining legal code and legislation in a developer-friendly format remains one of the biggest challenges we face in putting the law online. Most government websites publish legal code in PDF only. For example, take a look at the Baltimore municipal government’s official website. All the laws there are locked in PDFs, and are difficult to navigate–you have to know exactly what you’re looking for, and peruse long documents to find it. We worked long and hard to put up BaltimoreCode.org so that citizens can easily browse, search, and even comment on the laws that govern them. And we struggled to convince the city government to provide us with the data to make it all possible. Often government offices fail to grasp the importance of providing the law in any other format; in some cases they don’t understand the issue of PDFs at all. To date, we have depended on manual labor in those cases–extracting data from PDFs by hand. Needless to say, manual extraction doesn’t scale. An automated PDF extraction technique would be worth its weight in gold to us, and to our thousands of citizen-users.
We have barely scratched the surface of the PDFs currently locking away civic data from those who need it most. Millions of journalists, academics, and everyday citizens need to know the information sealed away in these documents.
PDF Liberation Progress Made
A number of important insights emerged after the two-day hackathon. You can see in-depth documentation from a variety of participants on the event’s GitHub Gist.
OpenGov Foundation coding intern Rostislav Tsiomenko successfully extracted information from electronically-filed Personal Financial Disclosure (PFD) forms, which members of Congress use to report their expenditures and assets and which are of great interest to governmental watchdogs. Ross used a cloud-based service known as ABBYY Cloud OCR, which he supplemented with some Python code, to convert the information into a neatly separated CSV spreadsheet. You can read more about his work here.
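The pipeline Ross describes can be sketched roughly: once an OCR service such as ABBYY has returned plain text, a short Python script can split it into rows and write a CSV. The field pattern, function names, and sample lines below are invented for illustration; real disclosure forms need a layout-specific parser.

```python
import csv
import re

def parse_disclosure_text(ocr_text):
    """Split OCR'd disclosure text into (asset, value range) rows.
    The regex assumes a hypothetical 'Asset  $low - $high' layout."""
    rows = []
    for line in ocr_text.splitlines():
        # e.g. "ACME Corp Stock  $15,001 - $50,000"
        match = re.match(r"(.+?)\s{2,}\$?([\d,]+)\s*-\s*\$?([\d,]+)", line)
        if match:
            rows.append([match.group(1).strip(), match.group(2), match.group(3)])
    return rows

def write_csv(rows, path):
    """Write the parsed rows out as a neatly separated spreadsheet."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["asset", "value_low", "value_high"])
        writer.writerows(rows)

# Hypothetical OCR output for two asset lines:
sample = "ACME Corp Stock  $15,001 - $50,000\nTreasury Bonds  $1,001 - $15,000"
rows = parse_disclosure_text(sample)
write_csv(rows, "disclosures.csv")
```

The hard part in practice is not the CSV writing but the parsing: every form layout needs its own pattern, which is exactly why automating extraction across many document types is so difficult.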
Before: Congressional Financial Disclosure PDF Form
After: Extracted Data
Other PDF Extraction Projects
“The less stuff the end-user has to do, the better it usually goes.” — DocHive’s Michael Nance speaking here at the PDF Liberation Hackathon.
The development team building DocHive came all the way from Raleigh, NC to contribute their time, talents, and software. DocHive is a new open source PDF extraction tool which requires the development of a template for each new PDF form, but may be able to liberate image-based PDFs by isolating pixels into little rectangles and then extracting the data within those rectangles and placing them into a spreadsheet. It’s a great step towards liberating PDFs created by scanning paper documents, which can be tricky. Read more about it from our intern Matt here.
DC’s winning entry, “What Word Where”, incorporates some of the ideas of DocHive. It’s a tool that “treat[s] a page of scanned text as a geography” and uses code from geographic information systems to identify box boundaries on a page as physical borders. Developers can then take the boundary information collected and use it to create templates to locate specific data to extract with the Tesseract optical character recognition (OCR) engine. This should allow users to extract only the information they need from many pages of documents. It should also work on both image-based and electronically-filed documents, and it’s the closest we’ve come to an effective and affordable extraction tool. You can see the full documentation here.
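The template idea can be sketched in miniature: OCR engines like Tesseract can report a bounding box for each recognized word, and a GIS-style containment test then assigns every word to the template field whose rectangle contains it. The coordinates, field names, and sample words below are invented for illustration; they are not from the actual “What Word Where” code.

```python
# A template maps field names to bounding boxes (x0, y0, x1, y1) on the page.
# These rectangles are hypothetical placeholders for one form layout.
TEMPLATE = {
    "name":   (50, 40, 300, 60),
    "amount": (320, 40, 450, 60),
}

def contains(box, point):
    """GIS-style point-in-rectangle test."""
    x0, y0, x1, y1 = box
    x, y = point
    return x0 <= x <= x1 and y0 <= y <= y1

def assign_words(template, words):
    """words: list of (text, (x, y)) word centers as an OCR engine
    might report them. Returns each field's reassembled text."""
    fields = {name: [] for name in template}
    for text, center in words:
        for name, box in template.items():
            if contains(box, center):
                fields[name].append(text)
    return {name: " ".join(parts) for name, parts in fields.items()}

# Hypothetical OCR output: three words with their page coordinates.
ocr_words = [("Jane", (60, 50)), ("Doe", (120, 50)), ("$1,500", (330, 50))]
result = assign_words(TEMPLATE, ocr_words)
```

Once a template exists for a form, the same boxes can be applied to every page of every filing that uses that layout, which is what makes the approach scale.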
Also, DC participant Aaron Shumacher identified an issue with the New York Police Department’s Crime Statistics. It appears that, in addition to publishing weekly data sets in PDF only, when the NYPD updates the crime statistics on its website it takes down the previous week’s data. Aaron is using Tabula and ScraperWiki to liberate the statistics, but has also written a script to automatically download PDFs from the site as they are posted so that the information can be archived and extracted later. This will make it possible for interested parties to look at trends in crime over time, and is something government websites should be doing already. He explains his process in a video here.
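A minimal version of that archiving idea might look like the standard-library Python below. The URL, directory name, and filename scheme are placeholders, not Aaron’s actual script; the key move is date-stamping each download so successive weekly reports don’t overwrite each other.

```python
import os
import urllib.request
from datetime import date

def stamped_name(url, day):
    """Build a date-stamped filename so each week's copy of the
    report is preserved even after the site replaces it."""
    return f"{day.isoformat()}_{os.path.basename(url)}"

def archive_pdf(url, archive_dir="crime_stats"):
    """Fetch url into archive_dir unless today's copy already exists.
    Meant to be run on a schedule (e.g. a weekly cron job)."""
    os.makedirs(archive_dir, exist_ok=True)
    dest = os.path.join(archive_dir, stamped_name(url, date.today()))
    if not os.path.exists(dest):
        urllib.request.urlretrieve(url, dest)
    return dest
```

Run weekly, a script like this builds exactly the historical archive the NYPD’s own site lacks, ready for later extraction with tools like Tabula.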
The main takeaway from the hackathon? PDF extraction remains a tricky business, especially for documents created as scans. Currently, extraction tools require some tweaking and coding expertise–but significant progress was made at the hackathon. And liberating data from PDFs is worth the fight. It will open up a new era of accountability and development in the public and private sectors, as open data crusaders publish government expenditures, use police statistics to track trends in crime, and even publish the laws of the United States in usable and interactive formats. Want to help? The Sunlight Foundation will continue to search for a good extraction tool–jump in on the action here. Need help? Submit PDFs you’re fighting with on PDFoff.org.
Together, we will set civic information free.