A Brief Summary
After a hackathon a few months back, we were joking about creating an easy way to take the data we’d painstakingly parsed from PDFs, Word documents, and XML files, and “translate” it back into a format that government agencies are used to. Many of us have been shell-shocked in dealing with PDFs from government agencies, which are often scanned documents, off kilter and photocopied many times over. Fundamentally, they’re very difficult to pry information out of. For the OpenGov Foundation’s April Fools’ prank, we created Govify.org, a tool to convert plain text into truly ugly PDFs.
The first version was a quick, [one-line ImageMagick command](https://gist.github.com/krues8dr/9437567). We quickly produced a few sample documents and decided that it would be fantastic if users could upload their own files and convert them. It soon became clear that the process might take a couple of seconds and a decent amount of CPU – so to deal with any sort of load, we’d need a modular, decentralized process, rather than a single webpage to do everything.
— Ben Balter (@BenBalter) April 1, 2014
As Ben Balter points out, there are a lot of moving pieces to this relatively simple setup. Govify.org is actually a combination of PHP, Compass + SASS, Capistrano, Apache and Varnish, Rackspace Cloud services and their great API tools, Python and Supervisord, and ImageMagick with a bash script wrapper. Why in the world would you use such a hodgepodge of tools across so many languages? Or, as most people are asking these days, “why not just build the whole thing with Node.js?”
The short answer is, the top concern was time. We put the whole project together in a weekend, using very small pushes to build standalone, modular systems. We reused components wherever possible and tried to wholly avoid known pitfalls via the shortest route around them. A step-by-step breakdown of those challenges follows.
We started with just a one-line ImageMagick command, which:
Takes text as an input
Fills the text into separate images
Adds noise to the images
Rotates the images randomly
And finally outputs all of the pages as a PDF.
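The steps above can be sketched as a small Python helper that assembles an ImageMagick argument list. This is a minimal illustration and not the command from the linked gist: the function name, the noise strength, and the tilt range are assumptions, and for simplicity it applies one random angle to every page rather than a different angle per page.

```python
import random


def build_govify_command(text_path, pdf_path, max_tilt=2.0, rng=None):
    """Return an ImageMagick argument list for one uglified PDF.

    Hypothetical helper: it only builds the command, it does not run it.
    """
    rng = rng or random.Random()
    # One small random tilt, shared by every page of the document.
    angle = round(rng.uniform(-max_tilt, max_tilt), 2)
    return [
        "convert",
        f"text:{text_path}",    # render the text file as page images
        "-attenuate", "0.5",    # assumed noise strength
        "+noise", "Gaussian",   # add scanner-style speckle
        "-background", "white", # fill the corners exposed by rotation
        "-rotate", str(angle),  # knock the pages off kilter
        pdf_path,               # a .pdf extension selects PDF output
    ]
```

On a machine with ImageMagick installed, the returned list could be handed to `subprocess.run(cmd, check=True)`.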
Using that to create a few sample documents, we began putting together a rough website to show them off. Like everyone else who needs to build a website in zero time, we threw Bootstrap onto a really basic template (generated with HTML5 Boilerplate). We used a few SASS libraries – Compass, SASS Bootstrap, and Keanu – to get some nice helpers, and copied in our standard brand styles that we use everywhere else. A few minutes in Photoshop and some filler text later, and we had a full website.
We needed a nice way to deploy the site as we made changes, and our preferred tool is Capistrano. There are other tools available, like Fabric for Python or Rocketeer for PHP, but Capistrano excels in being easy to use, easy to modify, and mostly standalone. It’s also been around for a very long time, and it’s the one that we’ve been using the longest.
We’re using Rackspace for most of our hosting, so we stood up a box with Varnish in front of Apache and stuck the files on there. Website shipped!
Once that was done, we made the decision to allow users to upload their own files. At OpenGov, we’re primarily a PHP shop, so we decided to use PHP. OK, OK – stop groaning already. PHP is not the most elegant language in the world, and never will be. It has lots of horns and warts, and people like to trash it as a result. That being said, there are a few things it’s great at.
First and foremost, it’s incredibly easy to optimize. Tools like APC and HipHop VM allow you to take existing PHP scripts and make them run *very* well. The variety and diversity of optimization tools for PHP make it a very attractive language for dealing with high-performance apps, generally.
Second, it’s a “web-first” language, rather than one that’s been repurposed for the web – and as a result, it’s very quick to build handlers for common web-tasks without using a single additional library or package. (And most of those tasks are very well documented on the PHP website as well.) Handling file uploads in PHP is a very simple pattern.
So in no time at all, we were able to create a basic form where users could input a file to upload, have that file processed on the server, and output back our PDF. Using the native PHP ImageMagick functions to translate the files seemed like a lot of extra work for very little benefit, so we kept that part as a tiny shell script.
At this point, however, we realized that the file processing itself was slow enough that any significant load could slow the server considerably. Rather than spinning up a bunch of identical servers, a job queue seemed like an ideal solution.
Creating a Job Queue
A very common pattern for large websites that do processing of data is the job queue, where single items that need processing are added to a list somewhere by one application, and pulled off the list to be processed by another. (Visual explanation, from the Wikipedia Thread Queue article.) Since we’re using Rackspace already, we were able to use Rackspace Cloud Files to store our files for processing, and the Rackspace Queue to share the messages across the pieces of the application. The entire Rackspace Cloud stack is controllable via their API, and there are nice libraries for many languages available.
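The producer/consumer shape of a job queue can be shown in a few lines of Python. This sketch uses the standard library's in-process `queue.Queue` as a stand-in for the Rackspace Queue, and a string rename as a stand-in for the real conversion; the function name and file names are made up for illustration.

```python
import queue
import threading


def run_demo(names):
    """Push each name onto a queue and let one worker 'process' them in order."""
    jobs = queue.Queue()
    results = []

    def worker():
        while True:
            job = jobs.get()
            if job is None:       # sentinel: no more work is coming
                break
            results.append(f"{job}.pdf")  # placeholder for the real processing

    consumer = threading.Thread(target=worker)
    consumer.start()
    for name in names:            # the "web frontend" side: enqueue and move on
        jobs.put(name)
    jobs.put(None)
    consumer.join()
    return results
```

The point of the pattern is that the side adding jobs never waits for processing to finish; it only waits long enough to put a message on the list.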
On our frontend, we were able to drop in the php-opencloud library to get access to the API. Instead of just storing the file locally, we push it up to Rackspace Cloud Files, and then insert a message into our queue, listing the details of the job. We also now collect the user’s email address, so that we can email to let them know that their file is ready.
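The post does not show what a queue message looks like, so the following is only a guess at a reasonable shape: a small JSON document naming the stored file and the address to notify. The field names `container`, `object`, and `email` are assumptions, not the actual Govify format.

```python
import json

REQUIRED_FIELDS = {"container", "object", "email"}  # hypothetical field names


def build_job_message(container, object_name, email):
    """Serialize the details of one conversion job as a JSON string."""
    body = {"container": container, "object": object_name, "email": email}
    return json.dumps(body, sort_keys=True)


def parse_job_message(raw):
    """Decode a job message and reject it if a required field is missing."""
    body = json.loads(raw)
    missing = REQUIRED_FIELDS - body.keys()
    if missing:
        raise ValueError(f"job message is missing fields: {sorted(missing)}")
    return body
```

Keeping the message to pointers and metadata, rather than the file contents, lets the queue stay small while the storage service holds the actual document.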
The backend processing, however, presented a different set of challenges. Generally, you want an always-running process that is constantly checking the queue for new files to process. For processes that take a variable amount of time, you don’t want just a Cron job, since the processes can start stacking up and choke the server – instead we just have a single run loop that runs indefinitely, a daemon or service.
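A single run loop of this kind might look like the sketch below. Everything here is illustrative: `fetch_job`, `handle_job`, and `should_stop` are hypothetical callables supplied by the caller, and the five-second idle pause is an arbitrary choice.

```python
import time


def run_worker(fetch_job, handle_job, should_stop, idle_sleep=5.0, sleep=time.sleep):
    """Process jobs one at a time until should_stop() returns True.

    Because only one job runs at a time, slow jobs delay later ones
    instead of piling up alongside them the way overlapping cron runs can.
    """
    processed = 0
    while not should_stop():
        job = fetch_job()
        if job is None:
            sleep(idle_sleep)     # queue is empty, so wait before polling again
            continue
        try:
            handle_job(job)
            processed += 1
        except Exception as exc:  # one bad job should not kill the loop
            print(f"job failed: {exc!r}")
    return processed
```

In a real daemon, `should_stop` would typically just return `False` until the process is asked to shut down.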
For all the things that PHP is good at, memory management is not on the list. Garbage collection is not done very well, so large processes can start eating memory rapidly. PHP also has a hard memory limit, and a script that exceeds it is simply killed in an uncatchable way.
Python, on the other hand, does a rather admirable job of this. Creating a quick script to get the job back out of the Rackspace Queue, pull down the file to be manipulated, and push that file back up was a rather simple task using the Rackspace Pyrax library. After several failed attempts in trying to use both the python-daemon and daemonize packages as a runner for the script, we reverted to using Supervisor to keep the script going instead.
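The pull-down, convert, push-back-up step could be organized roughly as follows. To avoid guessing at Pyrax calls, the storage and conversion operations are passed in as plain callables; `download`, `convert`, and `upload` are hypothetical names, as are the job's `object` field and the `.pdf` naming rule.

```python
import os
import tempfile


def process_job(job, download, convert, upload):
    """Fetch one file, convert it, and store the result under a new name."""
    with tempfile.TemporaryDirectory() as workdir:
        source = os.path.join(workdir, "input.txt")
        output = os.path.join(workdir, "output.pdf")
        download(job["object"], source)   # pull the uploaded text down locally
        convert(source, output)           # e.g. shell out to the ImageMagick script
        remote_name = job["object"] + ".pdf"
        upload(output, remote_name)       # push the finished PDF back up
    # The temporary directory and both local files are gone at this point.
    return remote_name
```

Working inside a temporary directory keeps a long-running worker from slowly filling the disk with leftover files.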
Obviously, this isn’t the most elegant architecture ever created. It would have made far more sense to use a single language for the whole application – most likely Python, even though very little is shared across the different pieces aside from the API.
That being said, this thing scales remarkably well. Everything is nicely decentralized, and would perform well under significant load. However, we didn’t really get very significant load from our little prank – most people were just viewing the site and example PDFs, and very few were uploading their own. Sometimes overengineering is its own reward.
Not bad for three days of work, if I do say so myself.
All of the pieces are available on Github and GPL2 licensed for examining, forking, and commenting on.
Yesterday’s Document Technology, Tomorrow!
Washington, DC – The OpenGov Foundation today announced the beta release of Govify.org, a new website to help citizens transform open text documents into government-ready PDF files. Simply upload a machine-readable file to Govify, enter your email address and, moments later, a fuzzy and off-kilter PDF is delivered straight to your inbox for downloading, printing and more.
“The truth is, opening government data and government itself is very hard,” said OpenGov Foundation Executive Director Seamus Kraft. “To make it easier on old-fashioned government and its folkways, we built you Govify.org. More paper cuts and fewer trees. That’s our motto.”
One of the biggest barriers preventing governments from embracing Open Data is their inability to produce documents in a government-ready, paper-based format. Like a PDF. Or a book. But the closed-format documents Govify churns out can be easily used for any government purpose – even printed, re-scanned and printed again to attain that extra special level of efficient government bureaucracy.
To further the global advancement of government-ready PDFs, the OpenGov Foundation also released the Govify open source code on Github.
Example Govify-ed documents:
How Govify Works
1. Image Conversion
First, Govify reads in the source text you upload, and converts it into an image.
2. Noise Generation
Next, Govify adds a suitable amount of noise, to give you that authentic scanned-in-the-1990s look.
3. Rotation, PDF Export
Last, Govify rotates the image – just like first-day interns do it – and sends a PDF straight to your email inbox.
“For information about the laws, there’s usually much less resistance. And the benefit is very different, very powerful, very broad…The D.C. Code is a very complicated thing and it takes some understanding to know where in the Code you should be looking for something. And, in fact, knowing where the D.C. Code sits in relation to D.C. municipal regulations and case law and other aspects. So it actually — it takes a community to invest in some of these before you can make heads or tails of it.”
- Josh Tauberer, Govtrack.us
AmericaDecoded.org Beta Launched by the OpenGov Foundation
Contact: Seamus Kraft
WASHINGTON, DC - The OpenGov Foundation today announced the beta launch of AmericaDecoded.org, America’s modern, user-friendly and restriction-free online law library. Through this new website, citizens, stakeholders and public servants can access, explore and use the municipal laws and legal codes of San Francisco (CA), Baltimore (MD), Chicago (IL), and Philadelphia (PA), and the state laws and legal codes of Maryland, Virginia and Florida. For those seeking to advance local government transparency and citizen engagement, AmericaDecoded.org provides the tools and resources needed to transform any inaccessible legal information into the open source State Decoded format, then add it to this growing online law library. Right now, efforts are underway to “decode” Raleigh (NC), Washington (DC), Boston (MA), Las Vegas (NV), New York (NY) and Miami (FL), with more on the way.
The America Decoded network helps break down the frustrating barriers that exist between everyday Americans and the most important public information in any city or state – the law itself – by harnessing the latest in website design and open data standards, providing all the expensive tools lawyers use at $0 cost. And unlike many legacy legal information providers, AmericaDecoded.org provides the public laws to the American people the right way: without any copyright restrictions, paywalls or fees.
The laws and legal codes in the AmericaDecoded.org open law library are powered by the State Decoded software, which provides citizens with a ridiculously easy-to-use way to access your local, state, and federal legal code. What does powerful, modern open law data do for you?
Careful organization by article and section makes browsing a breeze.
A site-wide search allows you to find the laws you’re looking for by topic.
Scroll-over definitions translate legal jargon into common English.
Downloadable developer-friendly legal code lets you take the law into your own hands.
Best of all, everything on the site remains cost- and restriction-free.
We’re a non-profit, non-partisan startup working to make government better. That means making it easier for people to access and use as much government information as possible. We believe innovative technology can help deliver a government that listens, works for its citizen-users, and learns from them. We are dedicated to putting better data and better tools in more hands. Our goal is to make or adapt those tools to be easy to use, efficient, scalable and free.
About The State Decoded
The State Decoded is a platform that displays legal codes, court decisions, and information from legislative tracking services to make it all more understandable to normal humans. With beautiful typography, embedded definitions of legal terms, and a robust API, this project aims to make our laws a centerpiece of media coverage.
In development since June 2010, The State Decoded is in open beta testing now, with a growing network of sites running the software. This work is made possible by a generous grant from the Knight Foundation.
The District is about to get a lot more user-friendly.
This weekend 250 people will descend on the World Bank headquarters in Washington, DC to improve their city with public data. Armed with laptops and software building skills, these civic technologists will create applications and websites that make it easier for Washington, DC residents to have a voice in local government, and hold it accountable.
These community-improving efforts – Open Data Day DC – are part of the 2014 International Open Data Hackathon. In over 100 cities on five continents, technology experts, everyday citizens and public servants will gather to solve local problems using software and public data sets – like laws or spending numbers. BaltimoreCode.org, for example, started at a similar event last year.
The OpenGov Foundation team will be there, working on projects to liberate the District’s regulations and rules from inaccessible formats. But that’s just one of a number of promising projects. Open Data Day DC activities include four helpful workshops and a myriad of engaging projects for developers, designers, and interested citizens to tackle.
Organized by leading open data developers from GovTrack.us, Sunlight Foundation, USAID, and the World Bank, the event welcomes everyone – no coding experience required – and aims to “strengthen the open data community and to make connections between people and between projects.”
How it Works
In the spirit of open data and information, all registrants are welcome to submit projects prior to the event, using the official Open Data Day DC hackpad. Participants can peruse the proposals, volunteer for a project, add information, or just show up at the event with laptops and open minds. Project teams will have three sentences in which to pitch their ideas, and then attendees will choose how they want to get involved. The workshops run concurrently with project development. No competitions or bro-like shenanigans here, just straight-up do-gooders working together to exchange expertise and ideas.
The workshops aim to introduce important concepts to attendees: open data, open collaboration, open mapping, and the basics of coding with Python. Project proposals range from building an open directory of social services in DC, to mapping oil drilling infrastructure in Nigeria, to visualizing aid results in Afghanistan.
OpenGov Foundation’s Role
Open data remains essential to what we do at OpenGov Foundation. Accessing the data that we need, in the format our developers need, remains the biggest barrier we encounter in posting the law online and opening up legislation for crowdsourcing. We’re excited to connect with other open data disciples and to build the movement for open data and transparency. We’re also bringing two projects with us to the event and look forward to getting some help with them:
State Decoded Global Law Search and Compare
Lawmakers frequently lean on existing laws in other jurisdictions as examples when creating laws for their own city or state. For example, a lawmaker in Maryland may want to look at laws about gambling in other states before starting to write a gambling bill herself. Right now, finding and organizing these laws remains difficult and time consuming.
We post state- and city-level legal code online, using The State Decoded software. With this project we aim to develop a cross-instance search function to enable users to easily compare laws and regulations across all of our current State Decoded instances.
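One possible shape for a cross-instance search is to send the same query to every site and merge the answers, tagging each hit with where it came from. The sketch below assumes each instance is represented by a caller-supplied search function returning dictionaries with an optional `score`; none of these names come from the State Decoded API.

```python
def search_all(instances, query):
    """Run one query against every instance and merge the hits.

    `instances` maps a jurisdiction name to a callable taking the query
    and returning a list of dicts (hypothetical interface).
    """
    merged = []
    for jurisdiction, search in instances.items():
        try:
            hits = search(query)
        except Exception:
            continue  # one unreachable site should not sink the comparison
        for hit in hits:
            merged.append({"jurisdiction": jurisdiction, **hit})
    # Best matches first; hits without a score sort last.
    merged.sort(key=lambda hit: hit.get("score", 0), reverse=True)
    return merged
```

A lawmaker comparing gambling laws would then see results from every jurisdiction in a single ranked list instead of visiting each site in turn.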
DC Municipal Regulations Conversion
The D.C. Regulations are just as important as the law itself when determining what rules citizens must live by. Right now it is very difficult to search these regulations or to identify and follow code references, and there is no API or even a bulk download for developers to use.
The Municipal Regulations of D.C. reference existing laws that are already easily accessible online. By scraping dcregs.dc.gov and converting these regulations into the StateDecoded XML format and the DC Code prototype format we can just as easily browse the city’s regulations and add cross-links between the two bodies.
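Cross-linking starts with spotting references to the Code inside regulation text. As a rough sketch, a regular expression can pull out section numbers. The citation pattern below (for example "D.C. Official Code § 1-204.04") is an assumption about how references are written and would certainly miss other styles found in the real regulations.

```python
import re

# Assumed citation style: "D.C. Code § 47-2829" or "D.C. Official Code § 1-204.04".
CODE_REFERENCE = re.compile(r"D\.C\. (?:Official )?Code §+\s*(\d+-\d+(?:\.\d+)?)")


def find_code_references(text):
    """Return the cited section numbers in order of first appearance, without repeats."""
    sections = CODE_REFERENCE.findall(text)
    return list(dict.fromkeys(sections))
```

Each section number found this way could then be turned into a link to the matching section of the online Code.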
These two projects will help us move towards where we want to be with the State Decoded, but more importantly, will enhance the ability of citizens and government officials to access the law in useful ways.
How to Get Involved
If you’re going to Open Data Day DC, come say hello to the OpenGov Team! We’re looking forward to meeting other open data disciples. You can also connect with us at firstname.lastname@example.org, or on Twitter at @FoundOpenGov.
Registration is now closed; however, if you’re not registered but still want to be involved, take a look at the hackpad for ideas on future projects. Follow the Twitter hashtags #OpenDataDay #DC. And contact the event organizers to learn about other great meetups in the DC Metro area. The huge response to this year’s International Open Data Hackathon highlights the exploding field of open data and civic technology, and the amazing works that are already out there. There’s plenty to do, and all are welcome. Get involved!