Drunks in the US¶

Notebook to retrieve and cleanup statistics on US adults age 26 and above who reported bing drinking in the previous month based on data collected from approximately 67,500 individuals in 2011 and 2012 by the Substance Abuse and Mental Health Services Administration and published by Bloomberg.

In [1]:

import pandas as pd
import matplotlib.pyplot as plt
import geonamescache

from lxml import html

gc = geonamescache.GeonamesCache()

url ='http://www.bloomberg.com/visual-data/best-and-worst/most-drunks-states'
xpath = '//*[@class="hid"]'

tree = html.parse(url)
table = tree.xpath(xpath)[0]
raw_html = html.tostring(table)

df = pd.read_html(raw_html, header=0, parse_dates=[1])[0]
df.head()

Out[1]:

	Rank	State	Binge drank in previous month
0	1	District of Columbia t	30.10%
1	2	North Dakota t	27.96%
2	3	Rhode Island t	27.94%
3	4	South Dakota t	27.37%
4	5	Wisconsin t	26.71%

Cleanup columns by removing unnecessary characters and convert percentage value into float.

In [2]:

df['State'] = df['State'].apply(lambda x: x.rstrip('t').strip())
df['Binge drank in previous month'] = df['Binge drank in previous month'].apply(lambda x: float(x.rstrip('%')))

Map state names to FIPS codes and delete columns not needed in the Web based map.

In [3]:

states = gc.get_dataset_by_key(gc.get_us_states(), 'name')
df['fips'] = df['State'].apply(lambda x: 'US' + states[x]['fips'])

del df['Rank'], df['State']
df.head(5)

Out[3]:

	Binge drank in previous month	fips
0	30.10	US11
1	27.96	US38
2	27.94	US44
3	27.37	US46
4	26.71	US55

Use more meaningful column heading and save as CSV.

In [4]:

df.columns = ['% of adults age 26+ who binge drank in previous month', 'fips']
df.to_csv('../static/data/csv/drunks-us-states.csv', encoding='utf-8', index=False)