AmericaDecoded.org Beta Launched by the OpenGov Foundation
Contact: Seamus Kraft
WASHINGTON, DC - The OpenGov Foundation today announced the beta launch of AmericaDecoded.org, America’s modern, user-friendly and restriction-free online law library. Through this new website, citizens, stakeholders and public servants can access, explore and use the municipal laws and legal codes of San Francisco (CA), Baltimore (MD), Chicago (IL), and Philadelphia (PA), and the state laws and legal codes of Maryland, Virginia and Florida. For those seeking to advance local government transparency and citizen engagement, AmericaDecoded.org provides the tools and resources needed to transform any inaccessible legal information into the open source State Decoded format, then add it to this growing online law library. Right now, efforts are underway to “decode” Raleigh (NC), Washington (DC), Boston (MA), Las Vegas (NV), New York (NY) and Miami (FL), with more on the way.
The America Decoded network helps break down the frustrating barriers that exist between everyday Americans and the most important public information in any city or state – the law itself – by harnessing the latest in website design and open data standards, providing all the expensive tools lawyers use at $0 cost. And unlike many legacy legal information providers, AmericaDecoded.org provides the public laws to the American people the right way: without any copyright restrictions, paywalls or fees.
The laws and legal codes in the AmericaDecoded.org open law library are powered by the State Decoded software, which gives citizens a ridiculously easy-to-use way to access their local, state, and federal legal codes. What does powerful, modern open law data do for you?
- Careful organization by article and section makes browsing a breeze.
- A site-wide search allows you to find the laws you’re looking for by topic.
- Scroll-over definitions translate legal jargon into common English.
- Downloadable, developer-friendly legal code lets you take the law into your own hands.

Best of all, everything on the site remains cost- and restriction-free.
We’re a non-profit, non-partisan startup working to make government better. That means making it easier for people to access and use as much government information as possible. We believe innovative technology can help deliver a government that listens, works for its citizen-users, and learns from them. We are dedicated to putting better data and better tools in more hands. Our goal is to make or adapt those tools to be easy to use, efficient, scalable and free.
About The State Decoded
The State Decoded is a platform that displays legal codes, court decisions, and information from legislative tracking services to make it all more understandable to normal humans. With beautiful typography, embedded definitions of legal terms, and a robust API, this project aims to make our laws a centerpiece of media coverage.
In development since June 2010, The State Decoded is in open beta testing now, with a growing network of sites running the software. This work is made possible by a generous grant from the Knight Foundation.
The District is about to get a lot more user-friendly.
This weekend 250 people will descend on the World Bank headquarters in Washington, DC to improve their city with public data. Armed with laptops and software building skills, these civic technologists will create applications and websites that make it easier for Washington, DC residents to have a voice in local government, and hold it accountable.
These community-improving efforts, known as Open Data Day DC, are part of the 2014 International Open Data Hackathon. In over 100 cities on five continents, technology experts, everyday citizens and public servants will gather to solve local problems using software and public data sets, like laws or spending numbers. BaltimoreCode.org, for example, started at a similar event last year.
The OpenGov Foundation team will be there, working on projects to liberate the District’s regulations and rules from inaccessible formats. But that’s just one of a number of promising projects. Open Data Day DC activities include four helpful workshops and a myriad of engaging projects for developers, designers, and interested citizens to tackle.
Organized by leading open data developers from GovTrack.us, Sunlight Foundation, USAID, and the World Bank, the event welcomes everyone–no coding experience required–and aims to “strengthen the open data community and to make connections between people and between projects.”
How it Works
In the spirit of open data and information, all registrants are welcome to submit projects prior to the event using the official Open Data Day DC hackpad. Participants can peruse the proposals, volunteer for a project, add information, or just show up at the event with laptops and open minds. Project teams will have three sentences in which to pitch their ideas, and then attendees will choose how they want to get involved. The workshops run simultaneously with project development. No competitions or bro-like shenanigans here, just straight-up do-gooders working together to exchange expertise and ideas.
The workshops aim to introduce important concepts to attendees: open data, open collaboration, open mapping, and the basics of coding with Python. Project proposals range from building an open directory of social services in DC, to mapping oil drilling infrastructure in Nigeria, to visualizing aid results in Afghanistan.
OpenGov Foundation’s Role
Open data remains essential to what we do at OpenGov Foundation. Accessing the data that we need, in the format our developers need, remains the biggest barrier we encounter in posting the law online and opening up legislation for crowdsourcing. We’re excited to connect with other open data disciples and to build the movement for open data and transparency. We’re also bringing two projects with us to the event and look forward to getting some help with them:
State Decoded Global Law Search and Compare
Lawmakers frequently lean on existing laws in other jurisdictions as examples when creating laws for their own city or state. For example, a lawmaker in Maryland may want to look at laws about gambling in other states before starting to write a gambling bill herself. Right now, finding and organizing these laws remains difficult and time consuming.
We post state- and city-level legal codes online using The State Decoded software. With this project, we aim to develop a cross-instance search function that enables users to easily compare laws and regulations across all of our current State Decoded instances.
DC Municipal Regulations Conversion
The D.C. Regulations are just as important as the law itself in determining what rules citizens must live by. Right now it is very difficult to search these regulations or to identify and follow code references, and there are no APIs or even bulk downloads for developers to use.
The Municipal Regulations of D.C. reference existing laws that are already easily accessible online. By scraping dcregs.dc.gov and converting these regulations into the State Decoded XML format and the DC Code prototype format, we can make the city’s regulations just as easy to browse and add cross-links between the two bodies of law.
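The conversion step can be sketched in Python. The element names below are illustrative guesses at a State Decoded-style record, not the project’s actual schema, and the sample section is invented:

```python
import xml.etree.ElementTree as ET

def section_to_xml(identifier, title, text):
    """Build a State Decoded-style XML document for one regulation
    section. Element names here are illustrative, not the official
    State Decoded schema."""
    law = ET.Element("law")
    ET.SubElement(law, "structure").text = "DC Municipal Regulations"
    ET.SubElement(law, "section_number").text = identifier
    ET.SubElement(law, "catch_line").text = title
    ET.SubElement(law, "text").text = text
    return ET.tostring(law, encoding="unicode")

# A scraper would supply these values for each section on dcregs.dc.gov.
xml = section_to_xml("1-100", "Definitions", "As used in this title...")
```

Once each scraped section is serialized this way, cross-links between the regulations and the DC Code become a matter of matching section numbers across the two XML collections.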
These two projects will help us move towards where we want to be with the State Decoded, but more importantly, will enhance the ability of citizens and government officials to access the law in useful ways.
How to Get Involved
If you’re going to Open Data Day DC, come say hello to the OpenGov Team! We’re looking forward to meeting other open data disciples. You can also connect with us at firstname.lastname@example.org, or on Twitter at @FoundOpenGov.
Registration is now closed; however, if you’re not registered but still want to be involved, take a look at the hackpad for ideas on future projects. Follow the Twitter hashtags #OpenDataDay #DC. And contact the event organizers to learn about other great meetups in the DC Metro area. The huge response to this year’s International Open Data Hackathon highlights the exploding field of open data and civic technology, and the amazing works that are already out there. There’s plenty to do, and all are welcome. Get involved!
Liberating Congressional Financial Disclosure Data from PDFs
By OpenGov Coding Intern Ross Tsiomenko
Editor’s Note: this was originally an OpenSecrets.org challenge at the PDF Liberation Hackathon. The original hacking challenge can be found here.
The PDF Data Extraction Challenge
“Members, officers, and candidates of the House of Representatives submit Financial Disclosure Reports every year that detail the source, type, amount, or value of their incomes, as required by the Ethics in Government Act of 1978. However, there are over 1200 people that file these forms every year, including 435 Representatives, and each form can contain over 10 pages of important financial information. These forms are only available on the Clerk of the House website in PDF format, meaning this data can only be viewed by going to a specific website and searching for a specific name. Our goal is to automate the extraction of this data and make it available to everyone for easy viewing, searching, comparison, and analysis. The only other source for this data right now is OpenSecrets.org, which has to compile this information by hand, meaning they cannot handle the volume of the Reports on a timely basis.” Read More
The current prototype built during the PDF Liberation Hackathon can handle some electronically-filed forms using the ABBYY Cloud OCR API, a paid cloud OCR service.
Example Congressional Financial Disclosure PDF document. The goal is to extract the financial information at the bottom into an open data format.
Financial Disclosure Reports are not text-based PDFs, but rather scanned-in images, meaning OCR (optical character recognition) software must be used to extract the data. ABBYY Cloud OCR is the only software currently known to extract tabular data correctly; the prototype uses a shell script to upload a PDF to the Cloud API, which returns a text file with most columns and rows intact. This is then cleaned up and turned into a csv file using Python.
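The Python cleanup step can be sketched as follows, assuming the OCR output keeps each table row on one line with columns separated by tabs or runs of spaces (a hypothetical layout, not ABBYY’s exact output):

```python
import csv
import io
import re

def ocr_text_to_csv(raw_text):
    """Clean line-oriented OCR output into CSV. Assumes each table row
    survived OCR as one line, with columns split by tabs or runs of
    two or more spaces -- an assumed layout, not ABBYY's documented one."""
    out = io.StringIO()
    writer = csv.writer(out)
    for line in raw_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between table rows
        cells = [c.strip() for c in re.split(r"\t|\s{2,}", line)]
        writer.writerow(cells)
    return out.getvalue()

# Hypothetical two-row excerpt of an OCR'd disclosure table.
sample = "Asset\tOwner\tValue\nAcme Corp stock\tSP\t$15,001 - $50,000\n"
csv_text = ocr_text_to_csv(sample)
```

Note that `csv.writer` quotes any value containing a comma, so dollar ranges like “$15,001 - $50,000” survive as a single column.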
Results: Ross was able to successfully extract Congressional Financial Disclosure data from a PDF into this tabular spreadsheet. Click here to view the prototype on Github.
Note that there are two types of reports: electronically filed and handwritten. Both are scanned in, but electronically filed reports contain typed text that OCR handles easily, while the handwritten forms are beyond any fully automated OCR effort and must be handled differently, for example by breaking up the reports for use in a CAPTCHA service (see the end of the Next Steps section for more).
There is still a lot of work to be done. The software must be expanded to cover all possible Report form variations, and must be tested on a large data set for accuracy. See the next section for a development roadmap.
- The current PDF extraction prototype can be found here.
- Personal Financial Statements can be found here.
Note: there is no bulk download of PDFs; they are available only through a search option.
* Create a bulk package of PDFs for testing
The Clerk of the House website does not allow downloading all PDF files by year, nor does it allow sorting by electronically filed versus handwritten forms. To properly test any software that is created, a test data set must first be assembled; a single zip file of, for example, all 2013 disclosures would be invaluable to the people working on code. It would also be very helpful to create two zip files, one with electronically filed forms and one with handwritten ones. Finally, it would be great if this process could somehow be automated, so that bulk downloads would proceed automatically for all subsequent years (at least while the Clerk of the House website remains unchanged). A starting point is to leave the search box blank and set the filing year to 2013, which should give you 41 pages listing 818 separate search results with download links.
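Automating that bulk download might look like the sketch below. The endpoint URL, query-string parameters, and link markup are all assumptions about the Clerk’s site, not documented API details, and would need to be adjusted against the real search form:

```python
import re
import urllib.request

# Hypothetical search endpoint; verify against the Clerk's actual site.
SEARCH_URL = "http://clerk.house.gov/public_disc/financial-search.aspx"

def extract_pdf_links(html):
    """Pull PDF links out of one page of search results.
    The href pattern is an assumption about the page's markup."""
    return re.findall(r'href="([^"]+\.pdf)"', html, flags=re.IGNORECASE)

def download_year(year=2013, pages=41):
    """Walk every results page for one filing year and save each PDF.
    Query parameter names here are guesses, not a documented API."""
    for page in range(1, pages + 1):
        url = f"{SEARCH_URL}?FilingYear={year}&Page={page}"
        html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
        for link in extract_pdf_links(html):
            filename = link.rsplit("/", 1)[-1]
            urllib.request.urlretrieve(link, filename)
```

Zipping the resulting folder would produce exactly the kind of test package described above, and rerunning the script each year would keep it current for as long as the site’s layout holds.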
* Ensure current version works on multiple types of electronic forms
There is no single standard for Financial Disclosure Reports: there are different form types (A and B, depending on the person filing) and different Schedules in each report that may or may not be omitted (such as “Schedule III – Assets and Unearned Income” or “Schedule IV – Transactions”). The software should account for all these possibilities and handle them accordingly.
* Convert the bulk package of electronically filed forms
After a test data set of electronically filed forms has been created, the current prototype should be tested on as many forms as possible to ensure it works with all possible variations.
* Transition to open source ABBYY alternative
Once everything is working, an alternative to the paid ABBYY Cloud OCR service should be found. Although ABBYY works great, it is not free; processing all forms filed within a calendar year would take 10,000+ page requests (not counting development trial and error), which could cost up to $900 according to ABBYY pricing.
Tools like Tesseract and Tabula have been tried, and they do not currently work. Financial Disclosure Reports are scanned documents, not text-based, so the software must handle both OCR and tabular data (a rare combination); Tabula and other Apache PDFBox derivatives only work with text-based PDFs, and Tesseract, while it has great OCR capabilities, does not handle tabular data well.
* Tackle Handwritten Financial Disclosure PDFs
From the original OpenSecrets challenge: “Members of the House of Representatives submit a yearly report on their personal finances. Though there is the option to submit the report electronically, some Members choose to download the report and hand enter their information. The challenge for these reports is to build a CAPTCHA type program that will allow us to crowd-source the data entry for these fillings. The CAPTCHA program will need to split the PDF into readable bits and know what each bit means (the field in the form). It is also important for this program to not split a word from the PDF mid-letter.”
“I want to be able to do that automatically, because ain’t nobody got time to sit here and hack on every government [PDF] document we want to make open . . . the longer-term goal is to improve government processes, detect fraud, waste, and abuse, and provide open data for start-ups to generate economic value with.”– Travis Korte on liberating civic data from PDFs
If you are a government watchdog trying to follow the scent of financial scandal, you face a dilemma. Much of the information you need is difficult to access, but not because of legal barriers: it’s because members of Congress submit their financial records via PDF.
Portable Document Format, or PDF, is by its nature difficult to transfer into other formats, and can’t be used by developers to produce helpful applications. It’s hard to search for specific information in PDFs, so you often have to scroll through every page to find what you’re looking for. And information like large collections of numbers can’t simply be moved from PDFs into spreadsheets, so journalists and researchers can’t manipulate the data or turn it into helpful visuals.
If you need information stored in PDFs you have two choices. You can either spend thousands of dollars to purchase PDF extraction software, or you can hire a hapless intern to spend hours hunched over a screen, scrolling through data page by page and entering it–by hand–into a database. Sadly, there’s yet to be an easy solution to working with data locked in PDFs.
Hence the PDF Liberation Hackathon: on January 19, 2014, developers and civic-minded individuals gathered in cities around the world in a heroic effort to liberate this crucial public data from inaccessible PDFs. The event focused on converting large collections of documents, ranging from House of Representatives Financial Disclosures to decades of US Foreign Aid Reports, into open data formats usable by developers. Many of the hackathon’s attendees represented transparency-minded organizations that plan to search, sift, organize, and publish the information for public knowledge. But a bigger goal emerged from these challenges: to automate the process of PDF extraction so that anyone who struggles with PDFs, from government workers to civic hackers, will never have to deal with the problem again.
OpenGov’s Role in Liberating Civic Data from PDFs
The OpenGov Foundation attended the PDF Liberation Hackathon for personal reasons. We have to deal with a lot of legal code in PDFs to pursue our mission of developing and deploying technologies that support every citizen’s ability to participate in their government. So we wanted to understand what tools work for extracting information from PDFs, and we wanted to share what we learned with others.
Obtaining legal code and legislation in a developer-friendly format remains one of the biggest challenges we face in putting the law online. Most government websites publish legal code in PDF only. For example, take a look at the Baltimore municipal government’s official website. All the laws there are locked in PDFs, and are difficult to navigate–you have to know exactly what you’re looking for, and peruse long documents to find it. We worked long and hard to put up BaltimoreCode.org so that citizens can easily browse, search, and even comment on the laws that govern them. And we struggled to convince the city government to provide us with the data to make it all possible. Often government offices fail to grasp the importance of providing the law in any other format; in some cases they don’t understand the issue of PDFs at all. To date, we have depended on manual labor in those cases–extracting data from PDFs by hand. Needless to say, manual extraction doesn’t scale. An automated PDF extraction technique would be worth its weight in gold to us, and to our thousands of citizen-users.
We have barely scratched the surface of the PDFs currently locking away civic data from those who need it most. Millions of journalists, academics, and everyday citizens need to know the information sealed away in these documents.
PDF Liberation Progress Made
A number of important insights emerged after the two-day hackathon. You can see in-depth documentation from a variety of participants on the event’s GitHub Gist.
OpenGov Foundation coding intern Rostislav Tsiomenko successfully extracted information from electronically-filed Personal Financial Disclosure (PFD) forms, which members of Congress use to report their expenditures and assets and which are of great interest to governmental watchdogs. Ross used a cloud-based service known as ABBYY Cloud OCR, which he supplemented with some Python code, to convert the information into a neatly separated CSV spreadsheet. You can read more about his work here.
Before: Congressional Financial Disclosure PDF Form
After: Extracted Data
Other PDF Extraction Projects
“The less stuff the end-user has to do, the better it usually goes.” — DocHive’s Michael Nance speaking here at the PDF Liberation Hackathon.
The development team building DocHive came all the way from Raleigh, NC to contribute their time, talents, and software. DocHive is a new open source PDF extraction tool which requires the development of a template for each new PDF form, but may be able to liberate image-based PDFs by isolating pixels into little rectangles, then extracting the data within those rectangles and placing it into a spreadsheet. It’s a great step towards liberating PDFs created by scanning paper documents, which can be tricky. Read more about it from our intern Matt here.
DC’s winning entry, “What Word Where”, incorporates some of the ideas behind DocHive. It’s a tool that “treat[s] a page of scanned text as a geography” and uses code from geographic information systems to identify box boundaries on a page as physical borders. Developers can then take the collected boundary information and use it to create templates that locate specific data to extract with Tesseract’s optical character recognition (OCR). This should allow users to extract only the information they need from many pages of documents. It should also work on both image-based and electronically filed documents, and it’s the closest we’ve come to an effective and affordable extraction tool. You can see the full documentation here.
Also, DC participant Aaron Shumacher identified an issue with the New York Police Department’s Crime Statistics. It appears that, in addition to publishing weekly data sets in PDF only, when the NYPD updates the crime statistics on its website it takes down the previous week’s data. Aaron is using Tabula and ScraperWiki to liberate the statistics, but has also written a script to automatically download PDFs from the site as they are posted so that the information can be archived and extracted later. This will make it possible for interested parties to look at trends in crime over time, and is something government websites should be doing already. He explains his process in a video here.
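A minimal version of that archive-as-posted idea could look like the sketch below, which saves each PDF under a dated filename so weekly updates never overwrite earlier snapshots. The folder layout and filename scheme are assumptions for illustration; the actual NYPD URL structure is not shown here:

```python
import datetime
import os
import re
import urllib.request

def archive_name(url, day=None):
    """Build a dated local filename so each snapshot of the same
    report URL is kept instead of overwritten."""
    day = day or datetime.date.today()
    base = re.sub(r"[^\w.-]", "_", url.rsplit("/", 1)[-1])
    return f"{day.isoformat()}_{base}"

def archive_pdf(url, folder="crime-stats-archive"):
    """Download a PDF unless today's copy is already on disk.
    Meant to run on a daily schedule (e.g. cron)."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, archive_name(url))
    if not os.path.exists(path):
        urllib.request.urlretrieve(url, path)
    return path
```

Run on a schedule, a script like this builds exactly the kind of historical record the NYPD site itself discards, ready for later extraction with Tabula or similar tools.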
The main takeaway from the hackathon? PDF extraction remains a tricky business, especially for documents created as scans. Currently, extraction tools require some tweaking and coding expertise–but significant progress was made at the hackathon. And liberating data from PDFs is worth the fight. It will open up a new era of accountability and development in the public and private sectors, as open data crusaders publish government expenditures, use police statistics to track trends in crime, and even publish the laws of the United States in usable and interactive formats. Want to help? The Sunlight Foundation will continue to search for a good extraction tool–jump in on the action here. Need help? Submit PDFs you’re fighting with on PDFoff.org.
Together, we will set civic information free.
Delivered by OpenGov Foundation Executive Director Seamus Kraft at a memorial for Sean Keefer on January 24, 2014 at Ames Hall on the George Washington University Mount Vernon Campus.
Every one of us is hurting. We’ve lost a loyal friend and stellar student. We’ve lost a talented and joyful teammate just beginning to grasp the gifts he had. You have lost a big brother, and a son. I’m so sorry.
Thank you for sharing Sean with us, and for giving me the chance to celebrate his life here with you today.
Sean joined our OpenGov family 4 short months ago. He stood out from the moment I met him.
“Hello,” he wrote to me not even a week into his first semester. “I’m a Computer Science student at GWU and I am excited to work with the OpenGov Foundation to create disruptive products. For the past 15 months I worked as a development intern at a Portland startup where I helped build, test and maintain an e-commerce website. I led a group of interns, developed the internal metrics system from the ground up. Why should you welcome me into your team? Because I want to build great products that will change lives.”
Who is this kid? I thought. How could a 19-year-old be that accomplished, that confident in his abilities, that fired up to help other people? Either he’s pulling my leg or he’s going to be a rock star.
Sean was a star. He’d come over one day a week and sit next to me at the dining room table, palpably excited to be coding. That rubbed off on everyone around him. It’s not fair that I’ll never again get to sit with him, and see him there smiling, doing what he loved.
Sean, I know you’re watching us from up in heaven. I miss you, but know we’ll meet again. Until then…
May the road rise to meet you,
May the wind be always at your back,
May the sun shine warm upon your face,
May the rains fall soft upon your fields,
And, until we meet again,
May God hold you in the hollow of His hand.