One of the things I like the most about graduate school this semester is that I have a dedicated 3 credits to work on a thesis. The thesis part is a tad daunting, but the credits mean that I have time to work on a research project of my choosing. At this point I’m still in the exploratory phase which led me to an interesting data set that I’ll be using in this post: the Federal Trade Commission’s Consumer Sentinel Network Data Book. The FTC’s Consumer Sentinel Network collects complaints on a variety of different types of fraud including credit card fraud, imposter scams, shop-at-home and catalog sales, and identity theft. According to the 2013 data set the CSN collected over 1.6 million complaints of fraud, with about 300,000 (14%) of the complaints being related to identity theft.
The data book comes in .xlsx format, with different sheets focused on different areas. I am interested in doing my thesis on identity theft in particular, and perhaps on geographic differences in identity theft victimization. As luck would have it, the FTC’s 2013 data set includes a breakdown of identity theft complaints by metropolitan statistical area (MSA). I had the idea of mapping this data, and since I’m also taking a class on R for data visualization (and machine learning, which isn’t relevant here but is very exciting) I’ll use R and ggmap to visualize the Complaints per 100,000 Population for each area.
Here’s my plan of attack:
1. clean up the data
2. get the latitude and longitude for each MSA
3. get the data into R
4. map the points, using dot size to represent the Complaints per 100,000 Population
Cleaning Up the Data
After saving the relevant data sheet in csv format, I’ll read it into Pandas and see what it looks like.
```python
import pandas as pd

df = pd.read_csv("C:/Users/Andrew/Desktop/IDTheftMSA.csv")
df.describe()
```
The describe function produces a table that looks like this:
| | Metropolitan Area | Complaints | Complaints Per 100,000 Population | Rank |
|---|---|---|---|---|
| count | 385 | 384 | 384 | 384 |
| unique | 379 | 263 | 303 | 378 |
| top | Metropolitan Area | Complaints | Complaints Per 100,000 Population | Rank |
| freq | 7 | 7 | 7 | 7 |
Uh-oh. It’s probably normal that there are a few duplicates in the Complaints and Complaints Per 100,000 Population columns, but I would think that each Metropolitan Area and Rank value would be unique. Since the count value for both of these columns is higher than the unique value, we know there are duplicate values. Also troubling is the fact that there is a row with Metropolitan Area being the only populated column.
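Before fixing anything, here’s a quick sanity check (my own aside, not part of the data book workflow) that confirms which values are repeated:

```python
# Count how often each Metropolitan Area value appears; anything > 1 is a repeat
counts = df["Metropolitan Area"].value_counts()
print(counts[counts > 1])

# Look at the repeated header rows directly
print(df[df["Metropolitan Area"] == "Metropolitan Area"])
```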
The problem of duplicate values is easy to fix. The top and freq rows of the table above show that the header values (Metropolitan Area, Complaints, Complaints Per 100,000 Population, and Rank) appear 7 times throughout the data. One could simply drop rows like these with the `df.drop_duplicates()` function, but that would leave the first appearance of the header in our data, because Pandas would treat it as “unique” and only the other 6 appearances as duplicates. Instead, let’s just keep every row whose Metropolitan Area value is neither empty nor equal to “Metropolitan Area”.
```python
df = df.loc[(df["Metropolitan Area"] != "Metropolitan Area") &
            (pd.notnull(df["Metropolitan Area"]))].reset_index(drop=True)
```
Now when I use df.describe()
I get this:
| | Metropolitan Area | Complaints | Complaints Per 100,000 Population | Rank |
|---|---|---|---|---|
| count | 377 | 377 | 377 | 377 |
| unique | 377 | 262 | 302 | 377 |
| top | New York-Newark-Jersey City, NY-NJ-PA Metropol… | 54 | 38.9 | 321 |
| freq | 1 | 7 | 5 | 1 |
Great! I have 377 Metropolitan Area values, and they’re all unique. Next I’ll drop the last two rows, which are part of a footnote I don’t need at the moment.
```python
# Drop the last two rows, which are footnote text, and reset the index
df = df.iloc[range(len(df) - 2)].reset_index(drop=True)
```
Using the Google Maps API to Get Latitude and Longitude Information
Next up, I need latitude and longitude information for each MSA so I can tell ggmap where to plot each data point. After looking around, I decided the best way to do this would be with the Google Maps API. However, if you look at the values in our Metropolitan Area column, you’ll see that each one looks like this: Abilene, TX Metropolitan Statistical Area
I’m not looking for an area, just a single latitude and longitude point, so I’ll go through each Metropolitan Area value and remove the “Metropolitan Statistical Area” or “Micropolitan Statistical Area” text.
```python
# Strip the MSA suffix from each Metropolitan Area value
for i, row in df.iterrows():
    if "Metropolitan Statistical Area" in row["Metropolitan Area"]:
        df.loc[i, "Metropolitan Area"] = row["Metropolitan Area"].replace(" Metropolitan Statistical Area", "")
    elif "Micropolitan Statistical Area" in row["Metropolitan Area"]:
        df.loc[i, "Metropolitan Area"] = row["Metropolitan Area"].replace(" Micropolitan Statistical Area", "")
```
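As an aside, pandas can do the same cleanup without an explicit loop; a one-liner like the following should be equivalent (a sketch, assuming every value in the column is a non-null string):

```python
# Vectorized version of the suffix-stripping loop above
df["Metropolitan Area"] = (df["Metropolitan Area"]
                           .str.replace(" Metropolitan Statistical Area", "")
                           .str.replace(" Micropolitan Statistical Area", ""))
```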
Now our values look like this: Abilene, TX. A separate issue is that some MSAs are made up of multiple cities, and some even span multiple states. I tested a few of these out and Google seems to do a good job of picking a central point, which suits my purposes just fine.
Now I’m ready to use the Google Maps API. In order to use it, I had to get a unique key, which is a random string of characters that identifies the user of the API. Next, I have to figure out what the request should look like. Below is the URL used to interact with the Maps API.
https://maps.googleapis.com/maps/api/geocode/json?address=[CITY,STATE]&key=[YOURKEY]
I’ll be filling in the [CITY,STATE] placeholder with values from our Metropolitan Area column, and using my key in the [YOURKEY] placeholder. So my requests will look something like:
https://maps.googleapis.com/maps/api/geocode/json?address=Abilene, TX&key=abcdefghijklm12345678
I specified in the URL that I’d like the response in JSON format. I’ll run one request and show the response from Google.
```python
import requests, json

url = 'https://maps.googleapis.com/maps/api/geocode/json?address='
key = '&key=%s' % MY_SECRET_KEY
exampleCity = "Abilene, TX"

reqURL = url + exampleCity + key
req = requests.get(reqURL)
response = json.loads(req.text)
response
```

```
{'results': [{'types': ['locality', 'political'],
              'place_id': 'ChIJRWkGKjmOVoYRX6kjIr_m2Qc',
              'address_components': [{'types': ['locality', 'political'],
                                      'long_name': 'Abilene', 'short_name': 'Abilene'},
                                     {'types': ['administrative_area_level_2', 'political'],
                                      'long_name': 'Taylor County', 'short_name': 'Taylor County'},
                                     {'types': ['administrative_area_level_1', 'political'],
                                      'long_name': 'Texas', 'short_name': 'TX'},
                                     {'types': ['country', 'political'],
                                      'long_name': 'United States', 'short_name': 'US'}],
              'geometry': {'viewport': {'northeast': {'lng': -99.62308399999999, 'lat': 32.6244809},
                                        'southwest': {'lng': -99.84967089999999, 'lat': 32.343292}},
                           'bounds': {'northeast': {'lng': -99.62308399999999, 'lat': 32.6244809},
                                      'southwest': {'lng': -99.84967089999999, 'lat': 32.343292}},
                           'location': {'lng': -99.73314390000002, 'lat': 32.4487364},
                           'location_type': 'APPROXIMATE'},
              'formatted_address': 'Abilene, TX, USA'}],
 'status': 'OK'}
```
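If you’d rather not worry about the spaces and commas in the address, requests can build and encode the query string for you. This is just an alternative way to make the same call (MY_SECRET_KEY stands in for your own key):

```python
# Alternative: pass the address and key as params and let requests
# handle URL-encoding the spaces and commas.
params = {"address": exampleCity, "key": MY_SECRET_KEY}
req = requests.get("https://maps.googleapis.com/maps/api/geocode/json", params=params)
response = req.json()
print(response["status"])  # 'OK' if the geocode succeeded
```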
The great thing about Python’s json library is that it turns the JSON response into a dictionary containing nested lists and dictionaries, which you can navigate with the usual indexing and a little trial and error. I’m interested in the latitude and longitude data, which is held under the `location` key in the `geometry` dictionary, which in turn sits inside the first element of the list returned by the `results` key of the main `response` dictionary from our request. Got all that? Me neither. Play around with the `type` function on a few of the elements in the response and you’ll be able to drill your way down to the latitude and longitude data you’re after.
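For example, drilling down one level at a time (my own illustration of that trial-and-error process) looks something like this:

```python
# Each call peels back one layer of the response
type(response)                                         # dict
type(response['results'])                              # list
type(response['results'][0])                           # dict
type(response['results'][0]['geometry'])               # dict
type(response['results'][0]['geometry']['location'])   # dict with 'lat' and 'lng'
```

With that mapped out, below is how grabbing the coordinates looks in my code.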
```python
# Dictionary containing both latitude and longitude
lonlatData = response['results'][0]['geometry']['location']

latitude = lonlatData['lat']
longitude = lonlatData['lng']
print("%f, %f" % (latitude, longitude))
```

```
32.448736, -99.733144
```
Now let’s iterate through every row in our dataframe and get the latitude and longitude for each city, adding the values to the dataframe as we go.
```python
# Create new columns to hold the latitude and longitude data
df["Lat"] = 0.0
df["Lon"] = 0.0

# Iterate through every row and geocode its metropolitan area
for i, row in df.iterrows():
    try:
        city = row["Metropolitan Area"]
        reqURL = url + city + key
        req = requests.get(reqURL)
        response = json.loads(req.text)
        lonlatDat = response['results'][0]['geometry']['location']
        latitude = lonlatDat['lat']
        longitude = lonlatDat['lng']
        df.loc[i, "Lat"] = latitude
        df.loc[i, "Lon"] = longitude
    except Exception:
        print("Couldn't get %s" % row["Metropolitan Area"])
```
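One caveat: the free Geocoding API enforces usage limits, so if a bunch of “Couldn’t get” messages show up, throttling the loop may help. A hypothetical helper (geocode is my own name, not something used above) could pause briefly between requests:

```python
import time

# Hypothetical helper: geocode one area and pause so the loop stays
# under the API's per-second request limit (tune the pause to your quota).
def geocode(area, pause=0.2):
    req = requests.get(url + area + key)
    time.sleep(pause)
    return json.loads(req.text)
```

Inside the loop you’d then call geocode(row["Metropolitan Area"]) instead of building the request by hand.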
Now look at your latitude and longitude data in all its glory:
# df["Lat"] df["Lon"] #
Beautiful. In Part 2 I’ll show how to use this data to plot Identity Theft Complaints per 100,000 Population on a map of the United States! Thanks for reading and as always, if you have comments e-mail me at osentand(at)gmail.com.