As we’ve all discovered, many government agencies prefer releasing records in portable document format, or PDF. Sometimes that’s helpful, for example with narrative text files. But not so much for data.
This tutorial will show you one free way to convert PDFs with tabular data into spreadsheets. The data I’ll use comes from a PDF I converted recently: The number of sworn police officers for the top 50 municipalities in the United States. Here’s what the file looks like:
You can’t copy/paste this text into a spreadsheet, unfortunately, and you don’t want to waste time or risk a correction by typing the data manually. So let’s convert it.
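To picture what the conversion involves, here is a minimal Python sketch. It is not necessarily the method this tutorial uses. It assumes the PDF's text has already been extracted with its column layout preserved, so that each table row is one line and columns are separated by runs of two or more spaces. The function names and the sample rows (City A, City B and their numbers) are made up for illustration.

```python
import csv
import io
import re


def layout_text_to_rows(text):
    """Split layout-preserved text into rows of cells.

    Assumption: columns are separated by two or more spaces, while a
    single space only ever appears inside a cell (e.g. "City A").
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between rows
        rows.append(re.split(r"\s{2,}", line))
    return rows


def rows_to_csv(rows):
    """Serialize rows as CSV text that any spreadsheet program can open."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Made-up placeholder rows, not figures from the police staffing PDF.
sample = """
City            Sworn officers
City A          1,234
City B            567
"""

rows = layout_text_to_rows(sample)
csv_text = rows_to_csv(rows)
print(csv_text)
```

Real PDFs are rarely this tidy. Cells that contain wide gaps, or columns separated by a single space, would break the two-space rule, so spot-check the output against the original document.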
“Once a week, buy a key person coffee. Learn what they want from you before telling them what you want from them. When possible, do interviews in person. Build relationships. While on a story, log contact info for good sources you meet.” — Erin Barnett, The Oregonian
Norman Rockwell's "The Runaway" shows a beat cop talking up a runaway child.
In our latest StateImpact webinar, our database reporting coordinator Matt Stiles gave us a rundown of some things to keep in mind to really own your beat. Before he was a data nerd, he was a nerdy beat reporter. Stiles covered federal courts in Dallas and City Hall in Houston. Both had challenges. The feds wouldn’t talk to him. And he couldn’t get the folks at City Hall to stop talking to him.
As you know, beat reporting is hard. The best beat reporters are organized, they really care about a subject and they’re asking the right questions. Some tips to remember to own your beat:
1.) Be aware and be around.
The best beat cops are in diners and on street corners meeting people and, in this case, perhaps discovering a runaway. If the officer in the painting were out in his car, or back at the station house, he might not have seen this kid or looked closely enough to recognize the pack on the floor. This kid might have gotten away.
Our own Chris Amico describes data apps as “[tools] for both the public and reporters that enables open-ended exploration of an on-going news story or issue.”
Matt Stiles explains data best practices, apps of the future, and how you can request your very own custom built data app in our latest data webinar:
News App Basics:
How will we decide what to build?
- News: Is there a story?
- Network: Will other stations benefit?
- ROI: Will people use it?
- Station plan: vision, goals and specs.
- Guide: Early consultation with NPR about ideas, calendar. Station writes “story” about app and basic specs.
- Package: What stories, documents might accompany apps?
- Promotion: How will we make sure people see the work?
- Clarity: Same as a text story. Be careful not to mislead, overwhelm, confuse readers.
- Ethics: Micro vs. aggregated. Is it fair to name individuals, organizations in data without their comment?
- “News” vs. Transparency: What’s our mission? Not data for data’s sake.
We’ve updated the content-management system so that it’s easier to add data tables to posts — and to make them searchable and sortable with pagination. Even better: Reporters don’t have to touch any code.
This functionality is possible thanks to a plugin that transfers data from your Google Docs spreadsheets into our database and ultimately into your blog posts. Here’s how it works.
First, you’ll want to format the columns and rows in your Google Docs spreadsheet exactly how you’d like them to appear in the browser. For example, if I wanted this display:
Google Fusion Tables allows journalists to publish, visualize and analyze large data sets in the browser without expensive software. Learn how this free tool can help you create custom online maps, graphs and timelines, mash up different data sets and collaborate on data. Our Database Reporting Coordinator Matt Stiles’ webinar on Google Fusion Tables is now available below.
How to use Fusion Tables?
- Upload data: kml, csv, xls
- Upload metadata: table name, description, attribution
- Share your data
- Merge with other tables
- Make a map in minutes
A detailed tutorial post is also available here.
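The "merge with other tables" step is easier to grasp with a small example. The Python sketch below is not Fusion Tables code. It only illustrates the idea of joining two tables on a shared column. The table contents, the column names and the choice to keep only rows that appear in both tables are assumptions made for illustration, and Fusion Tables may handle unmatched rows differently.

```python
import csv
import io


def read_table(csv_text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def merge_tables(left, right, key):
    """Join two tables on a shared column, keeping only matching rows.

    Assumption: each key value appears at most once in the right table.
    """
    lookup = {row[key]: row for row in right}
    merged = []
    for row in left:
        match = lookup.get(row[key])
        if match is not None:
            merged.append({**row, **match})
    return merged


# Hypothetical tables that share a "county" column.
population = read_table("county,population\nCounty A,1000\nCounty B,2500\n")
schools = read_table("county,schools\nCounty A,4\nCounty C,7\n")

merged = merge_tables(population, schools, "county")
print(merged)
```

Only County A appears in both tables, so it is the only row in the result. Inconsistent spellings of the shared column's values are the usual reason a merge drops rows, so it helps to standardize them before merging.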
UPDATE: We’ve improved functionality on tables for reporters and readers. See the updated documentation. You can still use this tutorial for small graphics, if you choose, but the newer solution might save you some time.
StateImpact bloggers will occasionally need to include basic data tables in posts. Here’s a simple method that doesn’t involve too much code.
First, structure your data in Microsoft Excel or Google Docs so its rows and columns appear as you want them in the post. Copy the data:
Open this converter, which changes your copied text to HTML. Paste your data in the form. Notice that tabs are used to denote columns in text copied from the spreadsheet:
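For the curious, the kind of transformation such a converter performs can be sketched in a few lines of Python. This is an illustration under assumptions, not the converter's actual code: it treats the first row as a header, splits each line on tabs and escapes characters that have special meaning in HTML. The function name and sample data are hypothetical.

```python
import html


def tsv_to_html_table(text):
    """Turn tab-separated text (as copied from a spreadsheet) into an HTML table.

    Assumption: the first non-empty line holds the column headers.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    parts = ["<table>"]
    for index, line in enumerate(lines):
        tag = "th" if index == 0 else "td"
        cells = "".join(
            f"<{tag}>{html.escape(cell.strip())}</{tag}>"
            for cell in line.split("\t")
        )
        parts.append(f"  <tr>{cells}</tr>")
    parts.append("</table>")
    return "\n".join(parts)


# Made-up sample of text copied from a spreadsheet (tabs between columns).
copied = "State\tReporters\nState A\t2\nState B\t3"
table_html = tsv_to_html_table(copied)
print(table_html)
```

Escaping matters because a cell such as "R&D" would otherwise produce invalid HTML once pasted into a post.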
Embedding DocumentCloud interactives into posts is simple using a button in the visual editor.
To start, upload your document and add any sections, notes and descriptions, and then publish the document. When open, highlight and copy the document URL:
Next, create a new post. Select the DocumentCloud button in the ribbon above the visual editor:
Paste the URL in the pop-up window. Notice the option to have the document render in a normal or wide view. A wide post eliminates the right rail, showing the document across the entire page at 940px. You can also check a box to add a sidebar if you want to display your notes and sections. The pop-up looks like this:
After you click insert, a short code with the DocumentCloud URL will appear in the editor:
Add any categories or post text, as needed, and save a draft or publish. You’re done!
A big part of StateImpact’s editorial mission is to be data-driven — to focus on asking for, acquiring, cleaning up and presenting numbers and information in ways to best educate our readers. We’re lucky to have Matt Stiles coordinate our data effort and teach us what he means when he talks about being a data journalist. Here’s his first webinar with the group — Data 101.
What is “Data Journalism”?
- The use of electronic records to find, support and explain stories.
- Basic social science methods
- Less “he said”, “she said”
- Visualization not always required
Find The Right Tool
- Story type: online, radio – both?
- What’s the need: Data queries, visualization, maps, text analysis?
- Data structure: Should it be cleaned, reorganized, normalized before you begin any analysis?
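As a rough illustration of the clean-up the last question refers to, here is a minimal Python sketch. It assumes the data arrives as rows of text with inconsistent spacing, mixed-case column names and numbers formatted with thousands separators. The helper names and the sample row are hypothetical.

```python
def clean_row(row):
    """Normalize column names and collapse stray whitespace in values."""
    cleaned = {}
    for name, value in row.items():
        key = name.strip().lower().replace(" ", "_")
        cleaned[key] = " ".join(value.split())
    return cleaned


def to_int(value):
    """Convert a number formatted like "1,234" to an integer."""
    return int(value.replace(",", ""))


# Made-up example of a messy row as it might come out of a spreadsheet.
raw = {" City ": "  City   A ", "Sworn Officers": "1,234"}

row = clean_row(raw)
row["sworn_officers"] = to_int(row["sworn_officers"])
print(row)
```

Doing this before any analysis means sorting, summing and merging behave predictably, since "1,234" stored as text will not add up and " City A" will not match "City A".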
Basic Data Journalism Tools
- Spreadsheets: Excel, Calc
- Databases: Access, MySQL, Base
- Mapping: ArcGIS, QGIS
- Statistics: SPSS, SAS, R
- Online: Google, “Hidden” Web
Online Journalism Tools
- Data Analysis: Fusion Tables, MS Web Apps, Google Docs
- Visualization: Many Eyes, Tableau Public, Google Charts, Highcharts
- Mapping: OpenHeatMap, Tableau, Fusion Tables, GeoCommons, QGIS
- Text: DocumentCloud, Wordle, xpdf
Few journalists are programmers or graphic designers, but that doesn’t mean they can’t dabble in data journalism online. Here are some free, easy tools that you, our StateImpact reporters, will be using to tell stories.
• IBM Many Eyes: A site with multiple visualization tools, including interactive bubble charts, line graphs and tree maps. It’s also among the best (free) places to experiment with text visualization. It has weaknesses, though. Users can’t style graphics to match their own CSS, and the embeds require a Java browser plugin. Still, it’s a neat tool.
• Google Charts API: A tool that allows you to create and embed multiple, customizable chart styles without branding or clutter. The API also has multiple libraries for programmers to create charts dynamically from data. Here’s an example.