Getting That Data Out of That Ugly PDF

A real-life PDF received by Amanda Loder of StateImpact New Hampshire. She had about 15 pages like this that she wanted to put into a sortable table. And she did it!

Has some government agency sent you a completely messy, crookedly scanned copy of an Excel print-out? Are they claiming that it would be impossible for them to share with you the original spreadsheet?

Don’t despair! There is still hope for you.

There is a very special trick called “optical character recognition” (or “OCR,” if you’re cool) that can help you convert those fuzzy tables into actual, usable Excel spreadsheets. While OCR software can be costly, we have found at least one website that can help you out for much less scratch: Online OCR. The one caveat is that you only get about five pages for free. After that, you have to sign up, get a password and pay something like 7 cents a sheet. Annoying, but still better than manual data entry.

But be warned: You’ll want to go through and make sure that your numbers still add up to whatever they add up to in your original PDFs. Depending on the quality of the scan, a 7 might look like a 1, a 3 like an 8… You get the idea. Make sure you dutifully clean and check the work. And check it again.
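
If your table has a totals column, you can let a short script do the first pass of that checking. What follows is only a rough sketch in Python, and it assumes a few things that may not be true of your data: that you have saved the OCR output as a CSV, that each row carries its own printed total, and that the column names (`q1`, `q2`, `total`) and the sample rows are made up for illustration. It will not catch two misreads that happen to cancel each other out, so you still need to eyeball the results against the original.

```python
import csv
import io
from decimal import Decimal

# Made-up sample data standing in for OCR output saved as CSV.
# In the second row, pretend the scan turned a 17 into an 11.
SAMPLE = """town,q1,q2,total
Concord,17,30,47
Keene,11,38,55
"""

def parse_number(text):
    """Turn a cell like ' 1,234 ' into a Decimal, ignoring commas and spaces."""
    return Decimal(text.replace(",", "").strip())

def find_total_mismatches(csv_text, value_columns, total_column):
    """Return (line_number, computed_sum, printed_total) for rows that don't add up."""
    mismatches = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # Data rows start on line 2 because line 1 is the header.
    for line_number, row in enumerate(reader, start=2):
        computed = sum(parse_number(row[col]) for col in value_columns)
        printed = parse_number(row[total_column])
        if computed != printed:
            mismatches.append((line_number, computed, printed))
    return mismatches

for line_number, computed, printed in find_total_mismatches(SAMPLE, ["q1", "q2"], "total"):
    print(f"Line {line_number}: cells add up to {computed}, but the printed total is {printed}")
```

Any line the script flags is a good place to start comparing your spreadsheet against the scan, digit by digit.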

And then pat yourself on the back for overcoming yet another obstacle in your quest for government transparency. Well done, you!

