[Developers Blog] My First Hackathon: Learning to Extract

The 2014 PDF Liberation Hackathon

Editor’s Note: OpenGov Intern Matt Steinberg (pictured at right below) recaps his first hackathon, underscoring how “open government” isn’t just for professional software developers.

Over the weekend of January 16-18, the Sunlight Foundation sponsored a PDF Liberation Hackathon, in which I participated as a non-developer. The event focused on a common problem: many organizations have reams of data stored in PDFs. This is an issue because many PDFs, scanned documents in particular, cannot be searched by word or otherwise interacted with; the data they contain is essentially locked up and difficult to manipulate. As a result, a lot of data, including congressional financial disclosures and non-profit expenditures, is not easily viewable and cannot be used to create tables, graphs, diagrams, and apps. The goal of the event was to make progress in unlocking these documents by testing and tweaking different software that converts them into more usable formats.

A term I learned that is essential to opening PDFs is OCR, or optical character recognition: the process by which software recognizes the characters in scans or image-based PDFs and converts them into machine-readable text, such as .txt files.

At first I worked with Damarius Hayes from the docHive team. I learned that many businesses have loads of tax forms dating back years that are stuck in PDFs. Damarius showed me how the software the docHive team is developing lets the user mark off regions of a page as little rectangles, defined by their pixel coordinates. The data inside those rectangles can then be extracted into tables and spreadsheets, which are effective for building diagrams, graphs, and other useful data-related things. Since the visual structure is consistent from form to form, you can mark the important areas of one document with rectangles, and boom: you have a template that you can use to extract info from other structurally identical forms.
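To make the template idea concrete for fellow non-developers, here is a minimal Python sketch. It is not docHive's code: the names `Rect`, `Word`, `extract_fields`, and `TAX_FORM_TEMPLATE`, the pixel coordinates, and the sample words are all made up for illustration. It assumes an OCR step has already produced each word along with its pixel position on the page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """A rectangle in pixel coordinates, measured from the page's top-left corner."""
    left: int
    top: int
    width: int
    height: int

    def contains(self, x: int, y: int) -> bool:
        return (self.left <= x < self.left + self.width
                and self.top <= y < self.top + self.height)


@dataclass(frozen=True)
class Word:
    """One recognized word and the pixel position of its top-left corner."""
    text: str
    x: int
    y: int


def extract_fields(template: dict[str, Rect], words: list[Word]) -> dict[str, str]:
    """Collect the words that fall inside each named rectangle of the template."""
    row = {}
    for field, rect in template.items():
        inside = [w for w in words if rect.contains(w.x, w.y)]
        # Simplification: order words top to bottom, then left to right.
        inside.sort(key=lambda w: (w.y, w.x))
        row[field] = " ".join(w.text for w in inside)
    return row


# Hypothetical template for a made-up tax form: field name -> region of the page.
TAX_FORM_TEMPLATE = {
    "name": Rect(left=100, top=50, width=400, height=30),
    "total_income": Rect(left=500, top=300, width=150, height=30),
}

# Made-up OCR output for two forms that share the same layout.
form_a = [Word("Form", 10, 10), Word("Jane", 110, 55), Word("Doe", 170, 55),
          Word("52,000", 510, 305)]
form_b = [Word("Form", 10, 10), Word("John", 110, 56), Word("Roe", 172, 56),
          Word("48,500", 512, 304)]

# One template, many forms: each form becomes one row of a table.
rows = [extract_fields(TAX_FORM_TEMPLATE, form) for form in (form_a, form_b)]
for row in rows:
    print(row)
```

The same template works on both forms because the fields sit in the same place on each page; words outside every rectangle, like the "Form" heading, are simply ignored.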

I then worked with the docHive team to set up Homebrew on my computer. Homebrew is a “package management system, a collection of software tools to automate the process of installing, upgrading, configuring, and removing software packages for a computer’s operating system in a consistent manner” (Wikipedia). It is essentially open source software that lets the user easily install other open source software. I then installed Tesseract, an OCR engine, and ImageMagick, an image-processing tool that can turn PDF pages into images for Tesseract to read. It took a while to get Tesseract working, but finally, with the help of Edward Duncan from docHive, I was able to convert a document from PDF to text.
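For anyone who wants to retrace these steps, here is a rough Python sketch of the two-stage conversion. It is my own reconstruction, not the exact commands Edward gave me. It assumes both tools were installed with `brew install imagemagick tesseract`, that ImageMagick's command is named `convert` (newer versions call it `magick`), and that ImageMagick can read PDFs, which typically requires Ghostscript. The helper names and the file name `disclosure.pdf` are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def rasterize_command(pdf: Path, image: Path, dpi: int = 300) -> list[str]:
    """ImageMagick command that renders the PDF's pages into an image file."""
    # Assumption: ImageMagick 6's `convert`; on ImageMagick 7 the binary is `magick`.
    return ["convert", "-density", str(dpi), str(pdf), "-depth", "8", str(image)]


def ocr_command(image: Path, output_base: Path) -> list[str]:
    """Tesseract command; Tesseract appends ".txt" to the output base itself."""
    return ["tesseract", str(image), str(output_base)]


def pdf_to_text(pdf: Path) -> Path:
    """Run both steps and return the path of the resulting text file."""
    for tool in ("convert", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found; try `brew install imagemagick tesseract`")
    image = pdf.with_suffix(".tiff")
    output_base = pdf.with_suffix("")
    subprocess.run(rasterize_command(pdf, image), check=True)
    subprocess.run(ocr_command(image, output_base), check=True)
    return Path(str(output_base) + ".txt")


# Print the commands for a hypothetical file without running anything.
print(" ".join(rasterize_command(Path("disclosure.pdf"), Path("disclosure.tiff"))))
print(" ".join(ocr_command(Path("disclosure.tiff"), Path("disclosure"))))
```

Calling `pdf_to_text(Path("disclosure.pdf"))` would actually run both tools. Some scans may need extra ImageMagick options or a different resolution before Tesseract reads them well, so treat the settings here as a starting point.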

I feel that, as a non-developer, the biggest thing I contributed was a spirit of curiosity and learning. My questions helped the developers by compelling them to slow down and take a macroscopic view of their work. In fact, almost everyone I questioned ended up tutoring me and thanking me for asking, and most said that they were learning from me as well. That kind of exchange matters because a great deal of technical jargon was being thrown around, and many of the eventual users of the tools that were, and will be, developed out of this event are amateurs just like me.

I have already reached out to Damarius and Edward of docHive, and they continue to help me understand what they do, where to go from here, and what I can do to help. Edward in particular has given me specific technical help on using Tesseract to convert PDFs to .txt files. The next step for me is to research all the jargon I heard on my own and to start testing the tools I was exposed to, namely Tesseract and ImageMagick. I currently have a project converting PDFs into Markdown, and I think it will be a good opportunity to learn the ins and outs of the software.

All of this is important because it has to do with opening data to the public and helping to construct a more transparent society. The concept of open data is relatively new to me, and I hope to learn more by reading the book “Open Data Now” by Joel Gurin. Easily accessible data is essential to a vital and engaged society. After helping to liberate civic data that people really need from PDFs, I feel energized to contribute my skills and my time to bring that society to life in the United States.