This notebook retrieves and processes data on the 50 cities with the highest homicide rates in the world.

In [1]:
import io
import json
import os
import re

import pandas as pd
import requests

from geonamescache import GeonamesCache

re_num = re.compile(r'^[\d,.]+$')
gc = GeonamesCache()
cnames = gc.get_countries_by_names()

Get the data

Fetch the Wikipedia table from the page List of cities by murder rate as a CSV file.

In [2]:
url = 'http://wikitables.geeksta.net/dl/?url=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FList_of_cities_by_murder_rate&idx=0'
csv = requests.get(url).text

Clean and Process the data

Read the CSV into a pandas data frame and convert all numeric values to floats.

In [3]:
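def fix_num(x):
    # Convert strings made up only of digits, commas and periods to floats;
    # return every other value unchanged
    if isinstance(x, str) and re_num.match(x):
        return float(x.replace(',', ''))
    return x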

df = pd.read_csv(io.StringIO(csv), index_col='Rank')
df = df.applymap(fix_num)
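
To make the conversion concrete, here is a small sketch of how a helper like fix_num treats a few cell values. The sample values are made up for illustration and are not read from the Wikipedia table.

```python
import re

# Same pattern as in the notebook: only digits, commas and periods
re_num = re.compile(r'^[\d,.]+$')

def fix_num(x):
    """Convert numeric-looking strings to floats, leave everything else alone."""
    if isinstance(x, str) and re_num.match(x):
        return float(x.replace(',', ''))
    return x

# Hypothetical sample cells: two numeric strings, a city name, an integer
samples = ['1,411', '187.14', 'San Pedro Sula', 42]
converted = [fix_num(s) for s in samples]
print(converted)
```

Strings with thousands separators become floats, while city names and values that are already numeric pass through untouched. If only thousands separators need handling, the thousands=',' argument of pd.read_csv may be a simpler option.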

Set the 2-letter and 3-letter ISO country codes from the country names.

In [4]:
df['iso'] = df['Country'].apply(lambda x: cnames[x]['iso'])
df['iso3'] = df['Country'].apply(lambda x: cnames[x]['iso3'])
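
A plain dictionary lookup like the one above raises a KeyError as soon as a country name in the table is spelled differently from the name used as a key. The following is a minimal sketch of a more forgiving lookup. The small cnames dictionary is only a stand-in for the result of get_countries_by_names(), and the aliases mapping and the country_code helper are hypothetical names introduced for this illustration.

```python
# Tiny stand-in for gc.get_countries_by_names() (assumption: same shape,
# country name mapped to a dict with 'iso' and 'iso3' keys)
cnames = {
    'Honduras': {'iso': 'HN', 'iso3': 'HND'},
    'Brazil': {'iso': 'BR', 'iso3': 'BRA'},
}

# Hypothetical mapping from alternative spellings to the names used as keys
aliases = {'Brasil': 'Brazil'}

def country_code(name, key='iso'):
    """Return the requested code for a country name, or None if unknown."""
    name = aliases.get(name, name)
    country = cnames.get(name)
    return country[key] if country else None

print(country_code('Honduras'), country_code('Brasil', 'iso3'), country_code('Atlantis'))
```

Returning None instead of raising keeps the apply call running, and the unmatched rows can be inspected afterwards.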

Retrieve the geographic coordinates of the 50 cities from GeonamesCache, falling back to the GeoNames web service for cities that are not in the cache.

In [5]:
cities_by_geoid = gc.get_cities()
geonames_params = {
    'maxRows': 1,
    'username': os.environ['GEONAMES_USER']
}

def geonames_search(name, country):
    # Copy the shared parameters instead of mutating the module-level dict
    params = dict(geonames_params, q=name, country=country)
    resp = requests.get('http://api.geonames.org/searchJSON', params=params)
    if resp.ok:
        return resp.json()
    return None

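def get_geoid(cityname, iso):
    # Look in the local cache first; each entry is a single-item dict of
    # the form {geonameid: city_data}
    for city in gc.get_cities_by_name(cityname):
        geoid, data = next(iter(city.items()))
        # Pick the first city that matches the ISO code (assumes the cached
        # city data carries a 'countrycode' field)
        if data['countrycode'] == iso:
            return geoid

    # No cached match for this country: fall back to the web service
    content = geonames_search(cityname, iso)
    if content and content.get('geonames'):
        d = content['geonames'][0]
        geoid = d['geonameId']
        cities_by_geoid[geoid] = {
            'longitude': d['lng'],
            'latitude': d['lat']
        }
        return geoid
    # Show the response (or None) so failed lookups can be diagnosed
    print(content)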

def get_city_prop(geoid, prop):
    if geoid in cities_by_geoid:
        return cities_by_geoid[geoid][prop]

df['city_geoid'] = df.apply(lambda x: get_geoid(x['Municipality'], x['iso']), axis=1)
df['longitude'] = df['city_geoid'].apply(lambda x: get_city_prop(x, 'longitude'))
df['latitude'] = df['city_geoid'].apply(lambda x: get_city_prop(x, 'latitude'))
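
Because the coordinates can come from two different sources, their types may not match. The sketch below parses a hand-written payload shaped like a searchJSON response and normalizes the coordinates to floats. The payload values, including the placeholder geonameId, are illustrative only, and the assumption that lat and lng arrive as strings should be checked against a real response. The first_hit_coords helper is a hypothetical name, not part of the notebook.

```python
import json

# Hand-written sample payload; the geonameId is a placeholder, not a real ID
sample = '''
{"totalResultsCount": 1,
 "geonames": [{"geonameId": 1234567, "name": "San Pedro Sula",
               "countryCode": "HN", "lat": "15.50417", "lng": "-88.025"}]}
'''

def first_hit_coords(payload):
    """Return (geoid, longitude, latitude) of the first hit, or None."""
    hits = payload.get('geonames') or []
    if not hits:
        return None
    hit = hits[0]
    # float() accepts both strings and numbers, so either type is handled
    return hit['geonameId'], float(hit['lng']), float(hit['lat'])

coords = first_hit_coords(json.loads(sample))
print(coords)
```

A payload without a geonames list, for example an error response, yields None rather than an exception.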

Remove unused columns and give the remaining columns more descriptive names.

In [6]:
df.drop(['iso', 'city_geoid'], axis=1, inplace=True)
df.columns = ['Municipality', 'Country', 'Homicides (2013)', 'Population (2013)', 'Homicides per 100,000 People', 'iso3', 'longitude', 'latitude']
df.head()
Out[6]:
      Municipality    Country    Homicides (2013)  Population (2013)  Homicides per 100,000 People  iso3  longitude  latitude
Rank
1     San Pedro Sula  Honduras   1411              753990             187.14                        HND   -88.025    15.50417
2     Caracas         Venezuela  4364              3247971            134.36                        VEN   -66.87919  10.48801
3     Acapulco        Mexico     940               833294             112.80                        MEX   -99.8901   16.86336
4     Cali            Colombia   1930              2319684            83.20                         COL   -76.5225   3.43722
5     Maceió          Brazil     795               996733             79.76                         BRA   -35.73528  -9.66583

5 rows × 8 columns
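
As a quick sanity check on the rate column, the sketch below recomputes the value for San Pedro Sula from its homicide and population figures. It assumes the rate is simply homicides divided by population, scaled to 100,000 people. The rate_per_100k helper is a hypothetical name, and other rows may differ in the last digit depending on how the source rounds.

```python
def rate_per_100k(homicides, population):
    """Homicides per 100,000 inhabitants, rounded to two decimals."""
    return round(homicides / population * 100_000, 2)

# Figures for San Pedro Sula from the table above
rate = rate_per_100k(1411, 753990)
print(rate)
```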

In [7]:
df.to_csv('../static/data/csv/most-murderous-cities.csv')

Limit to cities in Brazil.

In [8]:
df_bra = df[df['iso3'] == 'BRA']
df_bra.to_csv('../static/data/csv/most-murderous-cities-bra.csv')

Map Preview


Ramiro Gómez

About this post

This post was written by Ramiro Gómez (@yaph) and published on June 30, 2014.

