Do it once and then repeat it ad nauseaum
Do it over and over until you get it right
Do it once... and only once... until you need to issue a correction
My kind of town
Do your homework
- What questions are you asking of your data?
- Read the data dictionary... then read it again.
- No data dictionary? Make one yourself.
- How was the dataset created?
- Test data entry to see common mistakes for yourself
- Google interesting values, cryptic field names
- Look for other data about the same subject
- Query, visualize or publish a sample
- What don't you need to clean?
- Empty columns
- Values that stand out
- Sort to find min / max / outliers
- Search for things you expect (or don't)
- Meaningless columns
Quick data stats
csvstat -c statute data/Criminal_Convictions_ALLCOOK_05-09.csv
Unique values: 1616
5 most frequent values:
Max length: 27
Row count: 321590
Defense Logistics Agency "1033" program provides military gear to local law enforcement agencies
- Carry out repeatable transformations to clean the data
- Import into a relational database
- Join with other data sets
- Generate and export analysis and summarization in journalist friendly format
echo 'IMPORT DATA'
echo 'CREATE SUMMARY FILES'
echo 'EXPORT PROCESSED DATA'
Work in public
Clean and convert Excel data to CSV with Python
- Parse the date field, which represents dates in two different formats
- Strip out extra spaces from any strings (of which there are many)
- Split the National Stock Number into two additional fields:
- First two digits represent the "federal supply group" (e.g. "WEAPONS").
- First four digits represent the "federal supply class" (e.g. "Guns, through 30 mm").
# NSN parsing example
if header == 'nsn':
nsn_parts = cell_value.split('-')
row_dict['federal_supply_class'] = nsn_parts
row_dict['federal_supply_group'] = nsn_parts[:2]
Import data to PostgreSQL
- Import the cleaned CSVs generated from the source data
- Import a "FIPS crosswalk" CSV: Maps county name and state to the Federal Information Processing Standard identifier to identify counties.
- Import a CSV file with Federal Supply Codes
- Import 5 year county population estimates from the US Census Bureau's American Community Survey
# Example import statement
echo "Import FIPS crosswalk"
psql leso -c "CREATE TABLE fips (
psql leso -c "COPY fips FROM '`pwd`/src/fips_crosswalk.csv' DELIMITER ',' CSV HEADER;"
Summarize: Time series by category
sum(quantity * acquisition_cost) as total_cost,
extract(year from ship_date) as year
from data as d
join codes as c on d.federal_supply_category = c.code
group by c.name, year
order by year desc
Dump the data
Rand Paul: "Mr. Estevez, in the NPR investigation of 1033 program they list that 12,000 bayonets have been given out. What purpose are bayonets being given out for?"