Transforming US Civic Data Into a “What You See Is What You Mean” Format
City and state governments across America are adopting Open Data policies at a fantastic clip. Technology hotbed San Francisco has one, but places far from Silicon Valley – like Pittsburgh, PA – are also joining the municipal open data movement. The increasing availability of free tools and enthusiastic volunteer software developers has opened the door for a vast amount of new government data to be made publicly available in digital form. But merely putting this data out there on the Internet is only the first step.
Much of this city data is released under the assumption that a government agency must publish something – anything – and fast. In this rush to demonstrate progress, little thought is given to the how. But the citizens who care about this data – and are actually building websites and applications with it – need to access it in machine-readable, accessible, and standards-compliant formats, as the Sunlight Foundation explains here. This explains why most city open data sets aren’t seen or used. There is a vast difference between merely opening data and publishing Open Data.
By publishing data in good formats that adhere to modern, widely-accepted standards, users of the data can reuse existing software to manipulate and display it in a variety of ways, without having to start from scratch. Moreover, it allows easy comparison between data from different sources. If every city and every organization adopts its own standard for data output, this task becomes absolutely insurmountable – the data will grow faster than anyone on the outside can possibly keep up with.
Most Government “Open Data” Is A Useless Mess
Take, for example, the mess that is data.gov. Lots of data is available – but many of these datasets are Windows-only self-extracting zip archives of Excel files without headings, which are nearly useless. This is not what the community at large means by “Open Data” – there are closed formats at every step along the way.
Similarly, data which is released with its own schema, rather than adopting a common standard, is just as problematic. If you’re forcing your users to learn an entirely new data schema – essentially, a brand new language – and to write entirely new parsing software – a brand new translator – just to interact with your data, you’ve added a considerable barrier to entry that undercuts openness and accessibility.
A good standard lets anyone who wants to interact with the data do so easily, without having to learn anything new or build anything new. Standard programming libraries can be built, so that it’s as simple as opening a webpage for everyone. This means that in most programming languages, using a standards-based data source can be as simple as interacting with the web: import httplib and you’re done.
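As a minimal sketch of that point, here is what consuming a standards-based open-data feed could look like using nothing beyond the Python standard library. The endpoint URL and field names below are hypothetical, not an actual government API:

```python
# Minimal sketch of consuming a standards-based open-data feed using only
# the Python standard library. The URL and field names are hypothetical.
import json
import urllib.request

def fetch_dataset(url):
    """Download and decode a JSON dataset from an open-data endpoint."""
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))

def section_titles(dataset):
    """Pull the human-readable titles out of a list of law sections."""
    return [section["title"] for section in dataset.get("sections", [])]

# dataset = fetch_dataset("https://example.gov/api/laws.json")
# print(section_titles(dataset))
```

Because the format is shared, the same two functions work against any publisher that follows the standard – that is the entire payoff.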
Evaluating Existing Standards
Every day at The OpenGov Foundation, I work with legal and legislative data. Laws, legal codes, legislation, regulations, and judicial opinions are a few examples. What standard do we use? Well, let’s look at the most common standards available for publishing legal data on the Internet:
- Akoma Ntoso – a well-known XML-based format that is very verbose. The level of complexity presents a high barrier to entry for most users, and has prevented its wide adoption.
- United States Legislative Markup (USLM) – another XML-based format, used by the US House of Representatives. It has the advantage of being relatively concise, extensible, and easy to use.
- State Decoded XML – the format used by The State Decoded project. Currently, it only supports law code data, and is not widely adopted outside of this project.
- JSON – JSON is not a standard tailored to legal data, but a general-purpose format well suited to relational and tabular data and chunks of plain text. A variant, JSON-LD, has the same properties but is better suited to linked, relational data. JSON is commonly used for transferring data on the web, but it is not practical for annotated or marked-up data.
None of these are ideal. But if I had to pick a single option to move forward, the USLM standard is the most attractive for several reasons:
- It is older, established, and has good documentation
- It is easily implemented and used
- It is extensible, but not especially verbose
- It is designed to handle inline markup and annotation, such as tables, mathematical formulas, and images
It also acts as a very good “greatest common factor” as a primary format – it can be translated easily into common formats such as HTML, Microsoft Word, plain text, and even JSON – and it addresses the most common needs (e.g., tables or annotations) without the superfluous complexity other formats require.
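To illustrate how mechanical that translation can be, here is a hedged sketch that flattens a small USLM-style XML fragment into JSON with Python’s standard library. The element names below are simplified stand-ins, not the actual USLM schema:

```python
# Hedged sketch: translate a simplified USLM-style XML fragment to JSON.
# The tag names here are illustrative, not the real USLM schema.
import json
import xml.etree.ElementTree as ET

USLM_FRAGMENT = """
<section>
  <num>1.</num>
  <heading>Short title</heading>
  <content>This Act may be cited as the Example Act.</content>
</section>
"""

def section_to_dict(xml_text):
    """Flatten one <section> element into a plain dictionary."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in root}

print(json.dumps(section_to_dict(USLM_FRAGMENT), indent=2))
```

A real converter has to handle nesting, tables, and inline markup, but the shape of the problem is the same: one well-defined tree in, one well-defined structure out.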
Setting the Standard for Open Law & Legislative Data
Moving forward, the next step beyond simply exporting USLM data from existing data sources would be to have end-to-end solutions that speak USLM natively. Instead of editing Word or WordPerfect documents to craft legislation, lawmakers could write bills in new tools that look and feel like Word, but actually craft well-formatted USLM XML behind the scenes, instead of a closed-source, locked-in format. This is what we call “What You See Is What You Mean” – or WYSIWYM.
Here at The OpenGov Foundation, we believe in a rich, standards-based data economy, and we are actively doing our part to contribute. Our open-source online policymaking platform – Madison – already consumes USLM, and we are actively working on a WYSIWYM editor to make it easier to create and modify policy documents in Madison. We are also investigating USLM support for The State Decoded – both as an input and an output format. Hopefully, other software projects will follow suit – creating an interoperable ecosystem of legal data for everyone in the United States.
The law is the most important data set in your town, city or state. Yet it is often the hardest information to find, access and use on the Internet. Whether you are a developer, a lawyer, or a government worker, the frustrations with today’s paper-based legal information are endless. But those maddening frustrations are starting to end.
Tomorrow, June 13 at 11am, The OpenGov Foundation’s Seamus Kraft will share the secrets of decoding the law in a talk titled “Liberating Your Laws for the Internet Age”. Our team of software developers is solving these problems with open source software, liberating America’s laws from inaccessible, closed data formats into ridiculously useful “decoded” open data with user-friendly websites and powerful tools on top.
The conversation – moderated by Jonathan Askin, Professor of Law at Brooklyn Law School, and Dazza Greenwood, of MIT Media Lab and CIVICS.com – is part of the MIT Legal Hackathon, a four-day online “unhackathon” during which participants can collaborate on projects, attend planned events, or create their own sessions. Everything is geared towards solving shared legal and technical challenges related to government and the law. The event runs June 12–15; session topics range from open data and crowdsourcing legislation to transitioning governments to open source software. Anyone interested in working on these issues is welcome to participate. You can register for the event online for free.
Watch NYC Council Member Kallos’ MIT Legal Hackathon Preview Video
New York City Council Member Ben Kallos will help to lead the event, providing a keynote address on Friday at noon, and participating in sessions. The Council Member recently introduced seven bills related to open government and data in the New York City Council; you can comment on them using Madison, a collaborative online policy drafting platform. The Council Member’s bills include The Law Online Act, City Records Online Act, and eNotices Act, as well as bills on crime data and open maps.
Head on over to the MIT Legal Hackathon website to join us.
Watch Live Tomorrow & Submit Questions Now on District Plan to Increase Urban Farming
Tomorrow at 11 AM EST, the DC Council will hold a legislative hearing on the Urban Farming and Food Security Act, a plan to transform vacant District building lots into food-producing urban farms. And for the first time ever, DC residents can go online to participate in this official Council hearing before, during, and after the gavel drops.
To get involved, simply visit the urban farming bill in MadisonDC, log in, and submit your questions for Council Member David Grosso to answer during the hearing. Grosso has been drafting the plan online in cooperation with Council Members Cheh, Wells, and District residents in the MadisonDC collaborative policymaking platform.
Details on June 12 Online Hearing
What: Participate in DC Council’s First-Ever Online, Real-Time Legislative Hearing
When: Now through the conclusion of tomorrow’s Live 11 AM EST Hearing
On What: Council Member Grosso’s plans to increase DC urban farming opportunities
How to Get Involved
1. Click here to visit the Urban Farming and Food Security Act
2. Log into MadisonDC community to submit questions, feedback, anything!
3. Join tomorrow’s 11 AM EST hearing live online, keep your questions coming in MadisonDC!
Council Member Grosso has already started reviewing citizen urban farming questions, and will do so all the way through tomorrow’s live hearing before the Committee on Finance and Revenue. The open-to-all online Council event, like Grosso’s continuing MadisonDC beta test, is the first time such direct citizen involvement has been attempted in Washington, D.C. city government. The long-term goal is to transform legislative hearings for the Internet age, improving the efficiency, transparency and outcomes of the Council hearing process itself.
MadisonDC is the District of Columbia’s version of the free Madison software that reinvents government for the Internet Age. Madison 1.0 powered the American people’s successful defense of Internet freedom from Congressional threats. It delivered the first crowdsourced bill in the history of the U.S. Congress. And now, the non-partisan, non-profit OpenGov Foundation has released Madison 2.0, empowering you to participate in your government, efficiently access your elected officials, and hold them accountable.
Currently in beta, Madison 2.0 is open source software that can be used to open any government data production process on the Web – from regulations to rule-making, legislation to letter-writing. Bottom line: Madison is custom-built to connect the decision-makers in our democracy to the people they serve. Click here if you want Madison and The OpenGov Foundation to help you get better results from your local, state or federal government.
A Brief Summary
After a hackathon a few months back, we were joking about creating an easy way to take the data we’d painstakingly parsed from PDFs, Word documents, and XML files, and “translate” it back into a format that government agencies are used to. Many of us have been shell-shocked by dealing with PDFs from government agencies, which are often scanned documents, off-kilter and photocopied many times over. Fundamentally, they’re very difficult to pry information out of. So for The OpenGov Foundation’s April Fools’ prank, we created Govify.org, a tool to convert plain text into truly ugly PDFs.
A quick [one-line ImageMagick command](https://gist.github.com/krues8dr/9437567) was the first version. We quickly produced a few sample documents, and decided that it would be fantastic if users could upload their own files and convert them. It quickly became clear that the process might take a couple of seconds and a decent amount of CPU – so to deal with any sort of load, we’d need a modular, decentralized process, rather than a single webpage to do everything.
As Ben Balter points out, there are a lot of moving pieces in this relatively simple setup. Govify.org is actually a combination of PHP, Compass + SASS, Capistrano, Apache and Varnish, Rackspace Cloud services and their great API tools, Python and Supervisord, and ImageMagick with a bash script wrapper. Why in the world would you use such a hodgepodge of tools across so many languages? Or, as most people are asking these days, “why not just build the whole thing with Node.js?”
The short answer is that the top concern was time. We put the whole project together in a weekend, using very small pushes to build standalone, modular systems. We reused components wherever possible and tried to avoid known pitfalls via the shortest route around them. A step-by-step breakdown of those challenges follows.
We started with just a one-line ImageMagick command, which:

- Takes text as an input
- Fills the text into separate images
- Adds noise to the images
- Rotates the images randomly
- And finally outputs all of the pages as a PDF
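The steps above can be sketched as a small Python wrapper around ImageMagick’s `convert`. The specific flag values here are representative, not the original gist’s:

```python
# Sketch of the Govify pipeline as a Python wrapper around ImageMagick's
# `convert`. Flag values are representative, not the original gist's.
import subprocess

def build_convert_command(text_path, pdf_path, degrees=1.5):
    """Build the argument list for one text-to-ugly-PDF conversion."""
    return [
        "convert",
        "-size", "2550x3300",        # letter-size pages at 300 DPI
        f"caption:@{text_path}",     # fill the text into page images
        "-attenuate", "0.5",         # scale the noise that follows
        "+noise", "Gaussian",        # add scanned-document noise
        "-rotate", str(degrees),     # skew the pages slightly
        pdf_path,                    # output all pages as a single PDF
    ]

def govify(text_path, pdf_path):
    """Run the conversion (requires ImageMagick on the PATH)."""
    subprocess.run(build_convert_command(text_path, pdf_path), check=True)
```

Keeping the command-building separate from the shell-out makes the pipeline easy to test and to parameterize later.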
Using that to create a few sample documents, we began putting together a rough website to show them off. Like everyone else who needs to build a website in zero time, we threw Bootstrap onto a really basic template (generated with HTML5 Boilerplate). We used a few SASS libraries – Compass, SASS Bootstrap, and Keanu – to get some nice helpers, and copied in our standard brand styles that we use everywhere else. A few minutes in Photoshop and some filler text later, and we had a full website.
We needed a nice way to deploy the site as we make changes, and our preferred tool is Capistrano. There are other tools available, like Fabric for Python or Rocketeer for PHP, but Capistrano excels in being easy to use, easy to modify, and mostly standalone. It’s also been around for a very long time and the one that we’ve been using the longest.
We’re using Rackspace for most of our hosting, so we stood up a box with Varnish in front of Apache and stuck the files on there. Website shipped!
Once that was done, we made the decision to allow users to upload their own files. At OpenGov, we’re primarily a PHP shop, so we decided to use PHP. OK, OK – stop groaning already. PHP is not the most elegant language in the world, and never will be. It has lots of horns and warts, and people like to trash it as a result. That being said, there are a few things it’s great at.
First and foremost, it’s incredibly easy to optimize. Tools like APC and HipHop VM allow you to take existing PHP scripts and make them run *very* well. The variety and diversity of optimization tools for PHP make it a very attractive language for dealing with high-performance apps, generally.
Second, it’s a “web-first” language, rather than one that’s been repurposed for the web – and as a result, it’s very quick to build handlers for common web-tasks without using a single additional library or package. (And most of those tasks are very well documented on the PHP website as well.) Handling file uploads in PHP is a very simple pattern.
So in no time at all, we were able to create a basic form where users could upload a file, have that file processed on the server, and get our PDF back as output. Using the native PHP ImageMagick functions to translate the files seemed like a lot of extra work for very little benefit, so we kept that part as a tiny shell script.
At this point, however, we realized that the file processing itself was slow enough that any significant load could slow the server considerably. Rather than spinning up a bunch of identical servers, a job queue seemed like an ideal solution.
Creating a Job Queue
A very common pattern for large websites that do processing of data is the job queue, where single items that need processing are added to a list somewhere by one application, and pulled off the list to be processed by another. (Visual explanation, from the Wikipedia Thread Queue article.) Since we’re using Rackspace already, we were able to use Rackspace Cloud Files to store our files for processing, and the Rackspace Queue to share the messages across the pieces of the application. The entire Rackspace Cloud stack is controllable via their API, and there are nice libraries for many languages available.
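Stripped of the Rackspace specifics, the pattern looks like this in-memory sketch using Python’s standard library; a real deployment swaps `queue.Queue` for a hosted queue service and local strings for files in cloud storage:

```python
# In-memory sketch of the producer/consumer job-queue pattern.
# A real deployment replaces queue.Queue with a hosted queue service.
import queue
import threading

jobs = queue.Queue()
results = []

def producer(filenames):
    """The web frontend: enqueue one job per uploaded file."""
    for name in filenames:
        jobs.put({"file": name, "email": "user@example.com"})

def consumer():
    """The backend worker: pull jobs off the queue and process them."""
    while True:
        job = jobs.get()
        if job is None:          # sentinel: no more work
            break
        results.append(f"processed {job['file']}")
        jobs.task_done()

worker = threading.Thread(target=consumer)
worker.start()
producer(["a.txt", "b.txt"])
jobs.put(None)                   # tell the worker to stop
worker.join()
```

The point of the split is that the frontend returns to the user immediately, while any number of workers drain the queue at their own pace.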
On our frontend, we were able to drop in the php-opencloud library to get access to the API. Instead of just storing the file locally, we push it up to Rackspace Cloud Files, and then insert a message into our queue listing the details of the job. We also now collect the user’s email address, so that we can email them to let them know that their file is ready.
The backend processing, however, presented a different set of challenges. Generally, you want an always-running process that is constantly checking the queue for new files to process. For processes that take a variable amount of time, you don’t want just a cron job, since the processes can start stacking up and choke the server – instead, you want a single run loop that runs indefinitely: a daemon or service.
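A minimal version of such a run loop might look like this; `fetch_job` and `process` are placeholders for the real queue client and converter:

```python
# Minimal daemon-style run loop: poll for work, process it, and back off
# briefly when the queue is empty. fetch_job/process are placeholders.
import time

def run_loop(fetch_job, process, poll_interval=5.0, max_polls=None):
    """Poll for jobs forever (or for max_polls iterations, for testing)."""
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        job = fetch_job()
        if job is None:
            time.sleep(poll_interval)   # nothing queued; wait and retry
            continue
        process(job)
```

The `max_polls` escape hatch exists only so the loop can be exercised in tests; in production it runs until the process is killed.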
For all the things that PHP is good at, memory management is not on the list. Garbage collection is not done very well, so long-running processes can start eating memory rapidly. PHP also has a hard memory limit, and when it is hit, the process is killed in an uncatchable way.
Python, on the other hand, does a rather admirable job of this. Creating a quick script to get the job back out of the Rackspace Queue, pull down the file to be manipulated, and push the result back up was a rather simple task using the Rackspace Pyrax library. After several failed attempts at using the python-daemon and daemonize packages as a runner for the script, we reverted to using Supervisor to keep the script running instead.
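For reference, a minimal supervisord stanza for a worker like this could look as follows; the program name and paths are illustrative, not our actual deployment:

```ini
; Illustrative supervisord config: keep the queue worker running and
; restart it if it dies. Program name and paths are made up.
[program:govify-worker]
command=/usr/bin/python /srv/govify/worker.py
directory=/srv/govify
autostart=true
autorestart=true
stderr_logfile=/var/log/govify-worker.err.log
```

With `autorestart=true`, Supervisor handles the crash-and-restart cycle that the daemonization packages were supposed to solve.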
Obviously, this isn’t the most elegant architecture ever created. It would have made far more sense to use a single language for the whole application – most likely Python, even though very little is shared across the different pieces aside from the API.
That being said, this thing scales remarkably well. Everything is nicely decentralized, and would perform well under significant load. However, we didn’t really get very significant load from our little prank – most people were just viewing the site and example PDFs, and very few were uploading their own. Sometimes overengineering is its own reward.
Not bad for three days of work, if I do say so myself.
All of the pieces are available on GitHub, GPL2-licensed, for examining, forking, and commenting on.
Yesterday’s Document Technology, Tomorrow!
Washington, DC – The OpenGov Foundation today announced the beta release of Govify.org, a new website to help citizens transform open text documents into government-ready PDF files. Simply upload a machine-readable file to Govify, enter your email address and, moments later, a fuzzy and off-kilter PDF is delivered straight to your inbox for downloading, printing and more.
“The truth is, opening government data and government itself is very hard,” said OpenGov Foundation Executive Director Seamus Kraft. “To make it easier on old-fashioned government and its folkways, we built you Govify.org. More paper cuts and fewer trees. That’s our motto.”
One of the biggest barriers preventing governments from embracing Open Data is their inability to produce documents in a government-ready, paper-based format. Like a PDF. Or a book. But the closed-format documents Govify churns out can be easily used for any government purpose – even printed, re-scanned and printed again to attain that extra special level of efficient government bureaucracy.
To further the global advancement of government-ready PDFs, the OpenGov Foundation also released the Govify open source code on Github.
Example Govify-ed documents:
How Govify Works
1. Image Conversion
First, Govify reads in the source text you upload, and converts it into an image.
2. Noise Generation
Next, Govify adds a suitable amount of noise, to give you that authentic scanned-in-the-1990s look.
3. Rotation, PDF Export
Last, Govify rotates the image – just like first-day interns do it – and sends a PDF straight to your email inbox.