Cleaner Data

Do it once (and only once)


David Eads @eads | NPR Visuals @nprviz

Based on an NPR Visuals blog post and a talk by Geoff Hing

NICAR 2015, ATL


Slides: http://recoveredfactory.net/cleaner-data-nicar15

Source: http://github.com/eads/data-workflow

Made with Tarbell

Alternate titles


Do it once and then repeat it ad nauseam

Do it over and over until you get it right

Do it once... and only once... until you need to issue a correction

My kind of town

city_state
CCHICAGO, IL
CDHICAGO
CHHICAGO IL
CHICAGO CH
CHICAHO IL
CHCAGO IL
CHCAGO IL
CHCAGO IL
CHCAIGO IL
CHCIACO IL
CHCIAGO
...

Automate everything

Do your homework

  • What questions are you asking of your data?
  • Read the data dictionary... then read it again.
  • No data dictionary? Make one yourself.
  • How was the dataset created?
  • Test data entry to see common mistakes for yourself
  • Google interesting values, cryptic field names
  • Look for other data about the same subject
  • Query, visualize or publish a sample
  • What don't you need to clean?
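
One quick way to query, chart or publish a sample — a rough sketch using only the standard library and the conviction file from the csvstat example later in this deck; the sample size and output path are placeholders:


# sample.py -- pull a random sample of rows to query, chart or publish
import csv
import random

SAMPLE_SIZE = 100  # placeholder

with open('data/Criminal_Convictions_ALLCOOK_05-09.csv') as f:
    reader = csv.reader(f)
    header = next(reader)
    rows = list(reader)

sample = random.sample(rows, min(SAMPLE_SIZE, len(rows)))

with open('data/sample.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(sample)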

My favorite tools

  • Excel / Google spreadsheets
  • csvkit
  • Git / version control
  • Relational database (PostgreSQL)
  • Custom scripts (bash & Python)

Visual inspection

  • Empty columns
  • Values that stand out
  • Sort to find min / max / outliers
  • Search for things you expect (or don't)
  • Meaningless columns
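
A rough sketch of a few of these checks in Python, using the conviction data from the next slide (the "statute" column appears there; everything else is a placeholder):


# inspect.py -- flag empty columns and eyeball a column's extremes
import csv
from collections import defaultdict

non_empty = defaultdict(list)

with open('data/Criminal_Convictions_ALLCOOK_05-09.csv') as f:
    reader = csv.DictReader(f)
    for row in reader:
        for field, value in row.items():
            if value and value.strip():
                non_empty[field].append(value.strip())

# Columns that never contain a value are candidates for dropping
for field in reader.fieldnames:
    if field not in non_empty:
        print('Empty column: {0}'.format(field))

# Sort distinct values to spot oddities at either end of the range
distinct = sorted(set(non_empty['statute']))
print('First few: {0}'.format(distinct[:5]))
print('Last few: {0}'.format(distinct[-5:]))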

Quick data stats


csvstat -c statute data/Criminal_Convictions_ALLCOOK_05-09.csv

14. statute
<type 'unicode'>
Nulls: True
Unique values: 1616
5 most frequent values:
720-570/402(c):29290
720-5/19-1(a):14697
720-5/16A-3(a):13613
720-570/401(c)(2):13415
720-570/401(d)(i):10959
Max length: 27
Row count: 321590

Case study

The Defense Logistics Agency's "1033" program transfers surplus military gear to local law enforcement agencies

End product

MRAPs And Bayonets: What We Know About The Pentagon's 1033 Program
by Arezou Rezvani, Jessica Pupovac, David Eads and Tyler Fisher

Script-based workflow

  • Carry out repeatable transformations to clean the data
  • Import into a relational database
  • Join with other data sets
  • Generate and export analysis and summaries in a journalist-friendly format

Control script


#!/bin/bash

echo 'IMPORT DATA'
echo '-----------'
./import.sh

echo 'CREATE SUMMARY FILES'
echo '--------------------'
./summarize.sh

echo 'EXPORT PROCESSED DATA'
echo '---------------------'
./export.sh

Work in public

Clean and convert Excel data to CSV with Python

  • Parse the date field, which represents dates in two different formats
  • Strip out extra spaces from any strings (of which there are many)
  • Split the National Stock Number into two additional fields:
    • First two digits represent the "federal supply group" (e.g. "WEAPONS").
    • First four digits represent the "federal supply class" (e.g. "Guns, through 30 mm").


# NSN parsing example: split the National Stock Number on hyphens.
# The first segment is the four-digit federal supply class; its
# first two digits are the federal supply group.
if header == 'nsn':
    nsn_parts = cell_value.split('-')
    row_dict['federal_supply_class'] = nsn_parts[0]
    row_dict['federal_supply_group'] = nsn_parts[0][:2]
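
The date and whitespace steps from the list above might look something like this — a minimal sketch, and the two formats shown are assumptions rather than the ones in the actual LESO spreadsheet:


# Date parsing and whitespace cleanup (formats shown are assumptions)
from datetime import datetime

DATE_FORMATS = ('%m/%d/%Y', '%Y-%m-%d %H:%M:%S')

def parse_date(value):
    """Try each known format and return an ISO 8601 date string."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError('Unrecognized date: {0!r}'.format(value))

def clean_cell(value):
    """Collapse runs of whitespace and strip leading/trailing spaces."""
    if isinstance(value, str):
        return ' '.join(value.split())
    return value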

Import data to PostgreSQL

  • Import the cleaned CSVs generated from the source data
  • Import a "FIPS crosswalk" CSV: Maps county name and state to the Federal Information Processing Standard identifier to identify counties.
  • Import a CSV file with Federal Supply Codes
  • Import 5-year county population estimates from the US Census Bureau's American Community Survey


# Example import statement
echo "Import FIPS crosswalk"
psql leso -c "CREATE TABLE fips (
  county varchar,
  state varchar,
  fips varchar
);"
psql leso -c "COPY fips FROM '`pwd`/src/fips_crosswalk.csv' DELIMITER ',' CSV HEADER;"

Summarize: Time series by category


-- Total acquisition cost per federal supply category, by shipment year
select
  c.name,
  sum(quantity * acquisition_cost) as total_cost,
  extract(year from ship_date) as year
from data as d
join codes as c on d.federal_supply_category = c.code
group by c.name, year
order by year desc;

Dump the data
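
In this workflow the dump happens in export.sh; an equivalent sketch in Python with psycopg2 (the query and output path here are placeholders) writes a query result straight to a CSV:


# export.py -- dump a query result to a journalist-friendly CSV
import psycopg2

QUERY = "SELECT * FROM data ORDER BY ship_date"  # placeholder query

with psycopg2.connect(dbname='leso') as conn:
    with conn.cursor() as cur:
        with open('export/data.csv', 'w') as f:
            cur.copy_expert(
                'COPY ({0}) TO STDOUT WITH CSV HEADER'.format(QUERY), f)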

Make corrections


Impact!

Rand Paul: "Mr. Estevez, in the NPR investigation of 1033 program they list that 12,000 bayonets have been given out. What purpose are bayonets being given out for?"
