A while back, I started messing around with data visualization stuff, and came up with a choropleth North American map that attempted to show the places the Tragically Hip played most often. I now have a slightly shinier and more granular map that shows cities, which is step 0.1 on the way to glory. Not that I have any idea what constitutes ‘glory’ in this instance, but I’m told that life is about the journey, not the destination.
Once again, I started with the [admittedly unreliable] show archive on the Hip’s site. (Actually, I started with the .csv I generated last time, but whatever. That’s the data source. It’s still only North American shows, but I think it’d be easy enough to extend to the rest of them.) To get the shiny googlemap with markers for places they’ve played, I came up with the following steps:
- Pull a list of cities & states/provinces out of the csv.
- Ask google for the lat/lng coordinates of those cities.
- Turn said coordinates into xml tags.
- Feed the xml into googlemaps and get a marked map.
This went way better than my experiments last time, but there’s a lot I still want to do with this map to make it more useful and interesting. It’s still some lazy, sloppy code, but here it is anyway.
Step one: get a list of cities to look up.
import csv

queries = []
reader = csv.reader(open('concerts.csv'), delimiter=",")
for row in reader:
    # only North American shows; column 4 is the country
    if row[4].strip() in ['United States', 'Canada']:
        city = row[2] + "+" + row[3]
        queries.append(city)
# a set throws away the duplicates
locales = set(queries)
So that gives us a list of unique cities. (I may want to know that they’ve played Toronto 1938459 times or whatever, but I don’t need to look up Toronto’s coordinates more than once.) The way the geocode api works is that you pass in a URL with a bunch of parameters in the query string, and then it returns the coordinates in whatever format you’ve requested. So the next step is to set up all the junk to build those URLs.
# Python 2 module names; in Python 3 both of these live in urllib.parse
import urllib
from urlparse import urlunparse

scheme = 'http'
netloc = 'maps.google.com'
path = '/maps/geo'
params = ''
fragment = ''
query_dict = {'output': 'csv',
              'sensor': 'false',
              'key': 'REDACTED',
              'q': ''}
all_query_strings = []
for city in locales:
    query_dict['q'] = city
    newqs = urllib.urlencode(query_dict)
    all_query_strings.append(newqs)
all_urls = []
for place_qs in all_query_strings:
    url = urlunparse((scheme, netloc, path, params, place_qs, fragment))
    all_urls.append(url)
Yeah, I probably should have used list comprehensions there, but I tend to write those when I’m refactoring stuff. The first pass at something tends to be step by rudimentary step. At any rate, all_urls is now a list of URLs to give to google, which is step two.
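For comparison, here is roughly what the list-comprehension version might look like. This is only a sketch, written for Python 3 (where `urlencode` and `urlunparse` both live in `urllib.parse`), and the function name `build_geocode_urls` is made up for illustration. It joins city and region with a space rather than a literal `+`, on the assumption that the geocoder wants them as separate words: `urlencode` turns a space into `+` but escapes a literal `+` as `%2B`.

```python
from urllib.parse import urlencode, urlunparse


def build_geocode_urls(locales, key="REDACTED"):
    """Return one geocoding URL per 'City ST' string, in sorted order."""
    base = {"output": "csv", "sensor": "false", "key": key}
    return [
        urlunparse(("http", "maps.google.com", "/maps/geo", "",
                    urlencode(dict(base, q=place)), ""))
        for place in sorted(locales)
    ]


# Two sample places; the real set comes out of concerts.csv.
all_urls = build_geocode_urls({"Toronto ON", "Syracuse NY"})
for url in all_urls:
    print(url)
```

Sorting the set is optional; it just makes the output order predictable from run to run.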
import time
import httplib2

f = open('coords.csv', 'wb')
# using regular file handling instead of the csv module because i am lazy
# and google sends back strings.
h = httplib2.Http()
for url in all_urls:
    resp, content = h.request(url)
    f.write(content)
    f.write('\n')
    # be polite: one request per second
    time.sleep(1)
f.close()
There’s a limit to how many requests you can send in a given time period, but I don’t know what that limit is and am not in any hurry. Also, I should only have to run this once, so I’m not uptight about that one-second sleep in there. At any rate, I now have a csv called coords.csv that looks like this:
200,4,28.3936186,-81.5386842
200,4,32.9911550,-117.2711481
Etc. The format is: response code, accuracy, latitude, longitude. Now I want to turn all of that into xml, which is just straight-up string interpolation.
all_xml = open('coords.xml', 'wb')
reader = csv.reader(open('coords.csv'), delimiter=",")
for row in reader:
    # columns are: response code, accuracy, latitude, longitude
    latitude = row[2]
    longitude = row[3]
    xml_tag = "<marker lat='%s' lng='%s'/>\n" % (latitude, longitude)
    all_xml.write(xml_tag)
all_xml.close()
And that’s pretty much that. Add in the other xml stuff to make sure it’s a valid xml document, and then call the whole thing from an html file with some canned javascript.
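The "other xml stuff" could be handled in the same pass. What follows is a Python 3 sketch under a few assumptions: the root element is called `markers` (any name should do, since the page only looks for `marker` elements), rows whose response code is not 200 are skipped, and the `602` row in the sample is a made-up failed lookup. The function name `coords_to_xml` is hypothetical.

```python
import csv
import io
from xml.sax.saxutils import quoteattr


def coords_to_xml(csv_text):
    """Turn 'code,accuracy,lat,lng' rows into a complete <markers> document."""
    lines = ['<?xml version="1.0" encoding="UTF-8"?>', "<markers>"]
    for row in csv.reader(io.StringIO(csv_text)):
        # Skip blank lines and any lookup that did not come back as 200.
        if len(row) != 4 or row[0] != "200":
            continue
        lat, lng = row[2], row[3]
        # quoteattr adds the surrounding quotes and escapes the value.
        lines.append("  <marker lat=%s lng=%s/>" % (quoteattr(lat), quoteattr(lng)))
    lines.append("</markers>")
    return "\n".join(lines) + "\n"


sample = (
    "200,4,28.3936186,-81.5386842\n"
    "\n"
    "602,0,0,0\n"
    "200,4,32.9911550,-117.2711481\n"
)
print(coords_to_xml(sample))
```

Writing the returned string to coords.xml gives the page below a well-formed file to load.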
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <script src="http://maps.google.com/maps?file=api&v=2&key=REDACTED" type="text/javascript"></script>
    <script type="text/javascript">
      function initialize() {
        if (GBrowserIsCompatible()) {
          var map = new GMap2(document.getElementById("map_canvas"));
          map.setCenter(new GLatLng(37, -92), 4);
          map.setUIToDefault();
          GDownloadUrl("coords.xml", function(data) {
            var xml = GXml.parse(data);
            var markers = xml.documentElement.getElementsByTagName("marker");
            for (var i = 0; i < markers.length; i++) {
              var point = new GLatLng(parseFloat(markers[i].getAttribute("lat")),
                                      parseFloat(markers[i].getAttribute("lng")));
              map.addOverlay(new GMarker(point));
            }
          });
        }
      };
    </script>
  </head>
  <body onload="initialize()" onunload="GUnload()">
    <div id="map_canvas" style="width: 1000px; height: 600px"></div>
  </body>
</html>
Et voilà. An unuseful map of unlabeled markers, some of which are incorrect (like, why is there one in the Cayman Islands?). Having the coordinates, though, opens up a lot of possibilities for future awesomeness, which I’m sure someone else will come up with. My own current thoughts involve things like grouping the markers by year (or possibly by tour) and letting people turn them on and off; and one marker per show would be better than one marker per city (although, in that case, do I stick with city markers and just say they’ve played Toronto a lot, or do I do the data massaging necessary to get coordinates for and show Lee’s Palace vs the Horseshoe vs the ACC?); and some way to make it collaborative and interactive would be completely amazing. Like, there would be one giant hipmap and if someone were to be logged into their google account, they could click on a marker for a show they’ve attended and add their info and their two cents. And who knows what I’m going to do about setlists. Something, something, something, someday, maybe.
A few months ago, my external hard drive died. It had all my music on it, nearly ten years of accumulated mp3s from various sources: ripping my own collection, music from my friends, buying downloads from various sources, those glorious months in 2002 when eMusic gave you unlimited downloads, and almost certainly a few things that were torrented. It was a pretty devastating loss, really; much of it is easy enough to find again, but some of my favorite things were lost to the ether and I haven’t been able to get them back.
I tried to look on the bright side, though; organizing my itunes is always this horrible ordeal and it’s never finished and there were a few months that I didn’t really listen to music because the mere IDEA of opening my itunes gave me anxiety attacks. (You know the ones: HOW DO WE TAG THE TAGS?!) So declaring itunes bankruptcy seemed like an okay idea! I could pull in the album I wanted to listen to right then, make sure the metadata was okay, listen, move on. That had actually been going fairly well, even if I couldn’t necessarily put my hands on some rare b-side from back when there were actualfax b-sides.
…and then I started listening to Hip shows. I have maybe 170 right now, which is not a lot compared to the hardcore people who have been doing this for a long time, but seems ludicrous to people who don’t do it. And I was digging through my itunes the other night, looking for a particular show, and realized that I have once again got to the point where I don’t recognize a lot of the stuff in my itunes, and the rest of it is the Hip. I joked with a friend that my itunes has started spawning new music again, and he laughed at me. [We have a running joke that my itunes is some kind of musical font, because somehow it is always full of stuff I do not recognize and have never listened to, and sometimes I find it and am like, okay, what the hell is THIS and where did it even come from? I mean, I download stuff, yeah, but it's not like I have told my computer to just go download every mp3 on the internet, but sometimes I feel like that's exactly what it's done.] Then he related a conversation he had with some of his buddies about how most people have around 600 songs digitally available and then probably a handful of albums that haven’t been ripped in some manner. He has around 7,000, and his friends felt this was a LOT of music. I felt it was probably about average, and so I decided to ask the people I know and tally results.
I’m pretty much ready to declare the experiment over, because the average has now been holding steady for the last 20 people I’ve entered, even when their answer consists of a screenshot of a music folder, which is on its own 1-tb drive and contains 52,000+ songs and includes no boots or shows. Commercial releases only!
After that long-winded and probably boring explanation, here are the results:
total respondents: 56
total songs: 527,168
low answer: 0
high answer: 52,760
average song count: 9,413
median song count: 5,679
throwing out the freaks: 5,391 (this is the average of respondents who have fewer than 20,000 songs; it’s close to the median, which a math teacher informed me was the better number to use as an indicator anyway)
<10,000 songs: 43 people
10-20,000 songs: 5 people
>20,000 songs: 8 people
Responses collected by checking out all the shared itunes folders at work, asking twitter, asking LJ/DW, and asking on the hipbase (the Tragically Hip fan forum, which I expected to skew the results more than it actually did — I figured people there were likely to have a lot of music, and unlikely to have it all digitized; in the end, it was a wash).
The caveats here are, of course, huge. Some people answered with the size of their collection, and so I divided it by 6 megs to get a song count that is probably wrong. Many people estimated their answers. Many people told me that they have a lot of stuff that isn’t ripped; a few have literal rooms full of vinyl that has never been digitized. I got answers about ipod vs hard drive; work vs home; hard drive vs music server. In those cases, I used the highest number, because I figured that’s the one that more accurately represented the answer to my [extremely poorly worded] question about how much digital music is available to people for listening. Some people gave range estimates; I took the middle number in that case.
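To make that bookkeeping concrete, here is a rough Python sketch of the normalizing rules described above, followed by the three summary numbers. The answers in it are made up, not the real responses, and the `(kind, value)` tuple format and the `to_song_count` helper are invented purely for illustration.

```python
from statistics import mean, median

MB_PER_SONG = 6  # the rough per-song size mentioned above


def to_song_count(answer):
    """Normalize one survey answer to a single song count."""
    kind, value = answer
    if kind == "songs":        # a plain count
        return value
    if kind == "megabytes":    # collection size -> estimated count
        return value // MB_PER_SONG
    if kind == "range":        # (low, high) estimate -> middle number
        low, high = value
        return (low + high) // 2
    if kind == "several":      # ipod vs hard drive etc. -> highest number
        return max(value)
    raise ValueError("unknown answer kind: %r" % (kind,))


# Made-up answers, NOT the real survey data.
answers = [
    ("songs", 600),
    ("megabytes", 42000),
    ("range", (4000, 6000)),
    ("several", [1200, 3400]),
    ("songs", 52000),
]
counts = [to_song_count(a) for a in answers]
under_20k = [c for c in counts if c < 20000]

print("average:", mean(counts))
print("median:", median(counts))
print("average under 20,000:", mean(under_20k))
```

With a handful of very large collections in the mix, the plain average gets pulled upward while the median barely moves, which is the argument for the median.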
For a while it was looking as if most people either had fewer than 10,000 songs or more than 20,000 songs; there weren’t a lot of people in between. That settled out a bit, but it’s the reason for the breakdown in the number of people with n songs. Many of the people I would consider music geeks didn’t have too many songs digitized, but made a point to say, “this isn’t even close to my entire music collection.”
So! I have drawn the following conclusions:
- I was right.
- My collection is totally and completely reasonable.
Nothing like bad stats and terrible science to validate my position! \o/
This isn’t a personal blog, so I don’t really say much about what I’m up to, but I found a few pictures from the Amsterdam concert I went to last week. And I am actually in one of them! I always appreciate photographic evidence that I’m not making up my entire life.
These photos on flickr, by Henk Ritskes, are all really good, but here are the two I actually care about.

This is early in the show, and nothing is going on right there, so it’s pretty calm. Just over the watermark, you can see my face right at Gord’s feet, my arms on the stage, looking up. I’m not kneeling; I’m on my toes. I look like I’m about eight years old. I didn’t feel like it at the time, but I look ridiculously tiny in that photograph. Note that I’m surrounded by guys who are all much, much bigger than I am.

This is the closing song (‘Blow at High Dough,’ for those of you who care about such things), a shot of the same area of the pit, and it’s pretty easy to imagine me curled against the stage, trying to protect my head but mostly just getting kicked around as the crowd presses in. It wasn’t a 90s-style NIN pit or anything (I was in those, too, and came out hurting), but it got pretty rough for me. But note that I am not complaining! I’m small and by myself and deserve what I get for standing there. I wouldn’t change a thing. Well, okay, maybe I would have had a little less beer dumped down my back.
There was a moment in this show, during ‘Locked in the Trunk of a Car,’ which starts off a little slow, a little quiet, and the room was mostly dark. And then the lights flashed on, bright glaring white shot through with smoke, and the drums kicked in, low and heavy and driving, and I looked to my left and the pit was this huge writhing mass, and people were hanging over the rail of the balconies, and it went up and back and on forever, alive. I thought, “yes,” and then I didn’t think anymore for a long time.
And that is pretty much what I have to say.
The very short background to this is that I love data visualizations, and I love the Tragically Hip. So when I saw this tutorial on making choropleth maps with python, I thought, hey, I can totally do that. I’ve been wanting to mess around with data visualization stuff for a while now, and this seemed like a pretty good place to start. I decided I wanted to generate a map of North America showing me where the Hip has played most often. Seemed like a pretty simple and straightforward place to start.
First, I needed data, which I got from the Hip’s show archive. I just pasted it into a text editor and did some find-and-replace to make it a .csv. I really thought that would be the hard part. Ha!
Next, I needed a map. The tutorial uses a US county map, but that wouldn’t work for this. First, Canada is not actually in the United States! Also, they do not have counties. They have ridings, but reliable Canadian sources tell me they only care about ridings during elections. So anyway, this North American map is the one I ended up with.
If you read the first post I linked to, you will learn valuable information, such as the fact that the map has to be an .svg, which is a Scalable Vector Graphic, which is really XML. I hope you are now thinking, “oh god, will we have to parse XML?” Sadly, the answer is yes. Also, this particular map has shitty XML, which will make it all the more exciting.
But whatever! Just ignore that impending sense of dread and look at the map in a text editor. Each state/province/territory (henceforth SPT) has an ISO-3166 code assigned to it as an ID. So Illinois is US-IL, and Ontario is CA-ON. The csv I made lists city, SPT, and country, but not in handy ISO-3166 form, so I had to do something about that. I am not sure I really did the right thing, but here’s the code I wrote to do it.
# read in the concert csv
concerts = [r for r in csv.reader(open('concerts.csv', 'r'))]
for show in concerts:
    if show[4].strip() == "Canada":
        show[3] = "CA-" + "%s" % (show[3])
        # quebec's code is QC in iso-world
        if show[3] == "CA-PQ":
            show[3] = "CA-QC"
    elif show[4].strip() == "United States":
        show[3] = "US-" + "%s" % (show[3].upper())
This changes my csv from this:
11/07/2009,Landmark Theatre, Syracuse,NY,United States,TTH_SMT_20091107_1328
to this:
11/07/2009,Landmark Theatre, Syracuse,US-NY,United States,TTH_SMT_20091107_1328
I feel like there is a better way to do it, but that was the first solution that popped into my head, and it worked just fine. From there, I needed to figure out how often they played in a given SPT. I KNOW there is a better way to do this, but I stared at it for a few minutes before the laziness won out and I moved on.
north_american_shows = {}
for show in concerts:
    # if the state's not already there, add it. the first time through,
    # it will never be there, so everything will be set to 0.
    if show[3] not in north_american_shows.keys():
        north_american_shows[show[3]] = 0
    # look, i know this is stupid. i don't care. just count, okay.
    if show[3] in north_american_shows.keys():
        frequency = north_american_shows[show[3]]
        north_american_shows[show[3]] = frequency + 1
As you can see, I first made an empty dictionary called north_american_shows, and then I ripped through the csv and gave my dict keys corresponding to all the ISO codes. I set all their initial values to zero. Then I went through AGAIN, and just incremented the values by one for every show they played in a given SPT. Printing out north_american_shows at this point gives something like this:
{'': 11, 'BE': 1, 'US-NY': 83, 'US-PA': 33, 'US-TN': 3, ...}
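For what it's worth, one possible tidier version uses `collections.Counter` and only keeps keys that look like real ISO codes. This is a Python 3 sketch, not the code that produced the map: the helper names `iso_code` and `count_shows` are made up, and every sample row except the Syracuse one is invented.

```python
import csv
import io
import re
from collections import Counter

ISO_PREFIX = {"Canada": "CA", "United States": "US"}
FIXES = {"CA-PQ": "CA-QC"}  # the archive says PQ; the ISO code is QC
ISO_CODE = re.compile(r"^(CA|US)-[A-Z]{2}$")


def iso_code(region, country):
    """Return e.g. 'US-NY', or None if the row is not a usable US/Canada row."""
    prefix = ISO_PREFIX.get(country.strip())
    if prefix is None:
        return None
    code = "%s-%s" % (prefix, region.strip().upper())
    code = FIXES.get(code, code)
    return code if ISO_CODE.match(code) else None


def count_shows(rows):
    """Count shows per ISO code in a single pass, skipping unusable rows."""
    codes = (iso_code(row[3], row[4]) for row in rows if len(row) > 4)
    return Counter(code for code in codes if code is not None)


# Only the first row is real; the rest are hypothetical.
sample = io.StringIO(
    "11/07/2009,Landmark Theatre, Syracuse,NY,United States,TTH_SMT_20091107_1328\n"
    "01/01/2000,Some Venue, Montreal,PQ,Canada,HYPOTHETICAL_ID_1\n"
    "01/02/2000,Some Venue, Toronto,ON,Canada,HYPOTHETICAL_ID_2\n"
    "01/03/2000,Some Venue, Toronto,ON,Canada,HYPOTHETICAL_ID_3\n"
    "01/04/2000,Some Venue, Brussels,,BE,HYPOTHETICAL_ID_4\n"
)
north_american_shows = count_shows(csv.reader(sample))
print(north_american_shows)
```

A `Counter` also answers 0 for any key it has never seen, so a place with no shows can be looked up without checking for the key first.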
I think I probably should have tried to make a dict that only used valid ISO codes as keys, instead of everything in that SPT column, and I probably should have just gone through the csv once, rather than twice. At any rate, this got me a usable dictionary that told me shocking things like mostly the Hip plays in Ontario. That piece of information, by the way, is also nicely conveyed by this:

That is a word cloud I made in wordle in approximately 30 seconds by pasting in the ‘city’ column from my csv. But let us not be deterred by the fundamental uselessness of our exercise! Probably I would never do anything at all if I let that stop me.
Anyway! We now have our data. A quick glance through it seemed to say I should break up the distribution like this:
0
1-5
6-10
11-20
21-50
51-75
76-100
100+
There is only one place they’ve played more than 100 shows (Ontario), and only a few in the 76-100 range, so more granularity than that didn’t make much sense to me. Of course, I know exactly nothing about statistics, so I could be wrong.
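Those buckets can be written down as a list of upper bounds and looked up with the standard library's `bisect` module. This is just a sketch of one way to do the binning; the names `UPPER_BOUNDS` and `color_class` are made up, and "100+" is taken to mean "more than 100".

```python
from bisect import bisect_left

# Inclusive upper bound of every bucket except the last, open-ended one:
# 0, 1-5, 6-10, 11-20, 21-50, 51-75, 76-100, and then everything above 100.
UPPER_BOUNDS = [0, 5, 10, 20, 50, 75, 100]


def color_class(show_count):
    """Map a show count to a bucket index from 0 (no shows) to 7 (over 100)."""
    return bisect_left(UPPER_BOUNDS, show_count)


for n in (0, 3, 83, 250):
    print(n, "->", color_class(n))
```

Changing the breakdown then means editing one list rather than a chain of comparisons.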
Time to pick out some colors. Here, I mostly just did what Flowing Data told me to and used ColorBrewer; I needed eight colors. (Actually, I ended up needing nine because the last color was white, so you couldn’t see the SPT delineations for places they’ve never played.)
After that, the fun part. I expected this to be pretty easy; I’m familiar with Beautiful Soup, the parser used in the tutorial. All I needed to do was go through the XML, find a tag that carried one of my ISO codes as an ID, and then append a fill attribute to color the area based on my chosen color scheme/distribution. Unfortunately, Beautiful Soup did not deal at all well with the poor markup of the map; even just reading it in and then printing it out did all sorts of terrible things. I ended up having to use lxml, which I don’t really like (mostly because it’s hard to build, although that isn’t an issue on Snow Leopard, thank god). Here’s what I ended up with for this part of it:
colors = ['#EFEDF5', '#DADAEB', '#BCBDDC', '#9E9AC8', '#807DBA', '#6A51A3',
          '#54278F', '#3F007D']
tree = etree.parse('map.svg')
tags = tree.find('{http://www.w3.org/2000/svg}g')
for tag in tags.iterdescendants():
    for a in tag.attrib.values():
        if re.match(r'(CA|US)\-.{2}', a):
            if a in north_american_shows.keys():
                if north_american_shows[a] > 100:
                    color_class = 7
                elif north_american_shows[a] > 75:
                    color_class = 6
                elif north_american_shows[a] > 50:
                    color_class = 5
                elif north_american_shows[a] > 20:
                    color_class = 4
                elif north_american_shows[a] > 10:
                    color_class = 3
                elif north_american_shows[a] > 5:
                    color_class = 2
                else:
                    color_class = 1
            else:
                color_class = 0
            color = colors[color_class]
            tag.set('fill', color)
print etree.tostring(tree)
For whatever reason, I had to pass in the link and then the tag I was looking for. I have no idea why; a coworker had to tell me to do that (thanks, Nat!). Once I had the root tag, I iterated through all its descendants, looking for IDs that looked like an ISO code. I initially looked for ones that matched something in north_american_shows, but then I realized that wouldn’t quite get it done; because there are places they have never played, those places are not in the dict. Hence the regex and the if/else; I needed to find those, too, and set them to the color I chose for 0.
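The "link" is the SVG XML namespace. Because the map's root element declares it with `xmlns`, the parser knows every element by a namespace-qualified name, written `{namespace}tag`. The sketch below shows the effect on a tiny made-up SVG. It swaps in the standard library's `xml.etree.ElementTree` so it runs without lxml installed (both use the same `{namespace}tag` notation), and the ids, shapes, and the `fills` dict are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# A tiny stand-in for the real map; the ids and shapes are made up.
svg_text = (
    '<svg xmlns="%s">'
    '<g><path id="US-NY" d="M0 0"/><path id="CA-ON" d="M1 1"/>'
    '<path id="ocean" d="M2 2"/></g>'
    '</svg>' % SVG_NS
)
root = ET.fromstring(svg_text)

# The bare tag name finds nothing, because the element's real name
# includes the namespace.
missing = root.find("g")
group = root.find("{%s}g" % SVG_NS)

fills = {"US-NY": "#54278F"}  # hypothetical colours keyed by ISO code
for element in group.iter():
    region = element.get("id", "")
    if re.match(r"(CA|US)-.{2}$", region):
        # Regions missing from the dict get the "never played" colour.
        element.set("fill", fills.get(region, "#EFEDF5"))

# Serialize with a plain xmlns declaration instead of ns0: prefixes.
ET.register_namespace("", SVG_NS)
print(missing)
print(ET.tostring(root, encoding="unicode"))
```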
Once that was done, I ran:
$ python colourize.py > hipmap.svg
And, amazingly enough, I have a map that shows me things I already knew.

The darker the purple, the more often they’ve played in that SPT. I’m pretty pleased with myself, even though this is the simplest possible visualization I could do and it took me a lot longer than I feel it should have. Seriously, I spent many days fighting with XML parsers, trying to get it figured out. But I’m definitely curious to know how the code could be better, because it’s obvious to me that it could be.
In the meantime, I think I’m going to try doing something with google maps, incorporating more granularity in terms of where they’ve played and on what tour. I’d also like to do something with setlists, but that data is a little harder to get into a useful format. It’s out there, for sure, thanks to fans more obsessive than I am, but it’s going to take me a while to figure out what I want to do and how I want to do it.
Before I do any of that, though, I’m going to go see some Hip concerts.
Oh, right. The full script, for completeness’ sake:
import csv
import re
from lxml import etree

def do_stuff():
    # okay. step one is to dump the csv into a list of lists.
    concerts = [r for r in csv.reader(open('concerts.csv', 'r'))]
    # so we have a list of lists. now we need the strings to be useful. while
    # you're in there, strip out any whitespace, because we hate whitespace
    # here in python land.
    for show in concerts:
        if show[4].strip() == "Canada":
            show[3] = "CA-" + "%s" % (show[3])
            # fix the PQ/QC thing in the data
            if show[3] == "CA-PQ":
                show[3] = "CA-QC"
        elif show[4].strip() == "United States":
            show[3] = "US-" + "%s" % (show[3].upper())
    # okay, now we need to find out how often they play in a given place. so,
    # uh, let's build a dictionary.
    north_american_shows = {}
    for show in concerts:
        # if the state's not already there, add it. the first time through,
        # it will never be there, so everything will be set to 0.
        if show[3] not in north_american_shows.keys():
            north_american_shows[show[3]] = 0
        # look, i know this is stupid. i don't care. just count, okay.
        if show[3] in north_american_shows.keys():
            frequency = north_american_shows[show[3]]
            north_american_shows[show[3]] = frequency + 1
    colors = ['#EFEDF5', '#DADAEB', '#BCBDDC', '#9E9AC8', '#807DBA', '#6A51A3',
              '#54278F', '#3F007D']
    tree = etree.parse('map.svg')
    tags = tree.find('{http://www.w3.org/2000/svg}g')
    for tag in tags.iterdescendants():
        for a in tag.attrib.values():
            if re.match(r'(CA|US)\-.{2}', a):
                if a in north_american_shows.keys():
                    if north_american_shows[a] > 100:
                        color_class = 7
                    elif north_american_shows[a] > 75:
                        color_class = 6
                    elif north_american_shows[a] > 50:
                        color_class = 5
                    elif north_american_shows[a] > 20:
                        color_class = 4
                    elif north_american_shows[a] > 10:
                        color_class = 3
                    elif north_american_shows[a] > 5:
                        color_class = 2
                    else:
                        color_class = 1
                else:
                    color_class = 0
                color = colors[color_class]
                tag.set('fill', color)
    print etree.tostring(tree)

if __name__ == "__main__":
    do_stuff()
My company, Leapfrog Online, is looking to hire a Python web developer. There’s more about the company and the tech team here, and details of the job itself here. I think our tech team is pretty cool, and we try hard (with high-level support) not to be dicks about using open-source tools. Which is to say: we use them and we try to give back to the community by sending our engineers to conferences (as attendees and presenters), sponsoring said conferences (we’ve sponsored PyCon and Windy City Rails in the past), submitting patches to the tools we use, and releasing our code when possible. So I think, tech-wise, it is a pretty good place to work.
Among the software and test engineers, though, I’m the only woman. We’re also a pretty white bunch of people. I often hear things like, “well, I’d hire a woman, but none apply!” And I raise my eyebrows and think, “well, where are you looking?” Our recruiting is poor, and we all admit it; we have one HR person who does almost all of it, and she knows very little about technology. Some of us will post a link on our blogs, if we have them, or on twitter, and that’s about it. We tend to have a problem finding qualified devs in general, let alone qualified devs from under-represented groups. But today I asked my VP about it, and he is totally behind the idea, and asked me to come up with some ideas about where we might focus our recruiting efforts to attract more female and minority applicants.
Any ideas? Please feel free to let me know in comments, or email me (zerbie at gmail), or send this around to any groups or lists you know about.
[ETA: There are some ideas in the comments to this post on geekfeminism that I plan to follow up on.]