Skip to content

Latest commit

 

History

History
157 lines (114 loc) · 9.35 KB

bulletproofing-data-projects.md

File metadata and controls

157 lines (114 loc) · 9.35 KB

Bulletproofing Data Projects

Adapted from ProPublica's A Guide to Bulletproofing Your Data

from jotted down notes

  • Make a copy and file it away as raw data. Never work with the raw data.

  • Create a new data notebook or diary file. Keep track of everything you do.

  • Ask for the documentation.

  • Ask how data was compiled.

  • Ask to speak with whomever compiled.

  • Count the number of rows.

  • Count the number of columns.

  • Understand what values should be/can be for a given field.

  • Are values missing in particular fields?

  • Spot check spelling, particularly when dealing for street names, address, cities, categories, dates.

    • Clean where necessary
  • Keep definitions of terms in your data notebook or diary file.

  • Get basic stats

    • Mean
    • Median
    • Max
    • Min
    • Distribution
  • What are the max number of rows excel can handle?

  • What could be missing?

  • When making comparisons, are you comparing the same unit of measure?

  • Is there a canonical data source

  • Understand the limitations of the data

    • What can it show?
    • What can't it show?
  • Reports and studies

    • Ask how something is calculated/what is the methodology?
    • Ask to see the raw numbers
    • Compare to known sources
    • Question potential motivation for findings
  • Canonical Sources

    • BLS?
    • Census?
    • State economic development

Random

  • It's all about the methodology. If we're doing our own, explain the process and the assumptions. If it's someone else's make an effort to understand. Regardless, describe to readers simply.

  • Using a different lens can't I also write that a household grossing $68,640 can "afford" to spend $1,716 a month on rent?

  • We recommend using government or academic data before startup PR data, and try to limit to a single source for apples-to-apples comparisons. Or at the very least, understand the limitations of what we're trying to publish.

  • In stories when citing facts and figures from both the city of LA and Los Angeles County we must be clear which we are referring to on each reference

A Guide to Bulletproofing Your Data

  • Document any methodology and present it to readers clearly. Post a link to it prominently in the app or graphic. The methodology (or "nerd box") should detail the sources of the data, the steps you took to clean the data and the exact analysis you ran on the data.

by Jennifer LaFleur, ProPublica

Integrity Checks for Every Data Set

  • Make sure you know how many records you should have and that you have them all. Certain versions of Excel, especially ones used in gov't agencies, have row limits; be concerned if you or someone in your newsroom tells you they have 65,536. Database programs like MySQL have limits on the size of integers and on the length of string fields. These limits tend to be powers of two -- 2^n or 2^n-1. or 1,048,576 records, so be skeptical if your strings are 255 characters long or if a sum appears to be 2,097,152.
  • Double-check totals or counts. Check for studies or summary reports.
  • Consistency-check all fields. Are all city names spelled the same way? How about other important fields? Check by running a GROUP BY and sorting alphabetically by every important field. Check it for spelling inconsistencies. For example, if you’re analyzing a database of highway accidents, GROUP BY and sort ascending on the road name to check for inconsistencies.
  • Other basic checks: make sure all states/cities/counties are included. Check the range of fields. (For example, check for DOBs that would make people too old or too young.)
  • Check for missing data or blank fields. Are they real values, or did something happen with an import or append query?
  • Check your methodology (if necessary) against other similar research.

Beyond Basic Checks

  • Keep a data notebook (or computer file) and write down everything you do.
  • Know the source of the data.
  • Get similar data from another source.
  • Create a back-up copy of the database.
  • Check against reports.
  • Make sure you’re using the right tool. You may need to do more than counting and sorting.
  • Check with experts from different sides of the issue.
  • Find similar stories and study what they did.
  • Look at it. If you can actually physically go spot check records, do it.
  • Don’t forget the gut check. If someone just doesn’t seem right, it probably isn’t.
  • Beware of lurking variables
  • Come up with a standard naming convention for your files so you can stay organized. I don't recommend naming files "final" or even "super final."
  • If you think you’re in over your head, call on an expert to help.

Some Notes About Others’ Studies

  • Get the questionnaire and the methodology. If they won’t give it to you, that’s a red flag.
  • Beware of nonscientific methods: Web surveys, man on the street or other self-selection.
  • Know the sample size, which will give you the sampling error.
  • Again, know the source.
  • Account for margin of error and non-response or “don’t know” when drawing conclusions.
  • If possible, run statistical tests on the data. What may look significant to you, may not be.
  • When reporting, avoid false precision. Saying 52.18 percent of people think “blah, blah, blah” is portraying an impossible accuracy to readers.
  • Put your numbers in perspective

Find the Right Methodology

  • Read research reports.
  • Finding an existing data model - There are some accepted methodologies for dealing with certain types of data.
  • Find an expert to bounce your methodology off during the process.
  • Show findings to the targets of the story.
  • Duplicate your work. To make sure you didn’t mess something up along the way.
  • Maintain a consistent universe of cases. If you have to filter or redefine your universe, be able to explain why you isolated certain records or cases.
  • Give yourself enough time to follow through on collecting information for your database before you start writing. If you’ve built an organic database, where information may need to be updated or will change after additional reporting, set a cut off date and don’t make any more changes to the database unless the data is inaccurate or the new information will change the meaning of the story.

Other Tips From Our Colleagues

Sarah Cohen of the New York Times on homemade databases:

  • Number the pages of your documents to keep them in order and include the number when you enter the data. It helps you stay organized and with double-checking later on.
  • Add fields that relate to how “publishable” information is. I usually create fields that anyone can fill in that say 1) is the name spelling checked.

Ron Campbell of the CHCF Center for Health Reporting on documenting your work:

“I try to document every stage of my work using three tools:”

  • A work log (in Word or a text file); I describe what I'm trying to do at each stage and paste in queries.
  • Query files. Easy to do in SQL Server. Just be sure to put in a comment above the query explaining what the hell you were trying to do.
  • The Comment tool in Excel. Again useful for documenting what you're trying to do.

Russell Clemmings of the Fresno Bee on rechecking your data:

  • Write a different query that should yield the same results and see if it does so.
  • Pull a random sample from your results and check them against the raw data.
  • Have someone who knows the data check your results before publication -- even the target of the story, if possible.
  • Double-check surprising results -- if citations spiked by 50 percent in one year, it could be a story or it could (more likely) be an error.

For More Information

Numbers in the Newsroom: Using Math and Statistics in News by Sarah Cohen for Investigative Reporters and Editors, Inc.

Precision Journalism by Philip Meyer. Indiana University Press, Bloomington. 4th Edition. 2002.

News and Numbers by Victor Cohn. Iowa State University Press, Ames. 1989.

How to Lie with Statistics by Darrell Huff. W. W. Norton & Company, New York. 1954 (renewed 1984)

Innumeracy: Mathematical Illiteracy and Its Consequences by John Allen Paulos. Vintage Books, New York. 1990.

A Mathematician Reads the Newspaper by John Allen Paulos. Anchor Books, New York. 1995. (Also, check out the tape from Paulos keynote address at NICAR 2002 in Philadelphia)

IRE Resource Center: www.ire.org

Danielle Cervantes contributed to this guide.

Understanding the difference between average and median and that one might be best used over the other but you have to calculate both to make that determination. You should understand how conclusions of a study were arrived at before parroting the summary. It's all about the methodology. If we're doing our own "analysis", explain the process and the assumptions. If it's someone else's make an effort to understand. Regardless, describe to readers simply and straightforward how the findings were arrived at. Be wary of creating data soup, pulling figures from a mix of public, private, internet startup, academic and advocacy groups. Each likely has it's own methodology that makes it hard to create a comparison. And between you and me for now, but is it really a story every time Zillow or Uber or Trulia or an insurance company puts out a news release?