more hip visualizations, part 0.1

A while back, I started messing around with data visualization stuff, and came up with a choropleth North American map that attempted to show the places the Tragically Hip played most often. I now have a slightly shinier and more granular map that shows cities, which is step 0.1 on the way to glory. Not that I have any idea what constitutes ‘glory’ in this instance, but I’m told that life is about the journey, not the destination.

Once again, I started with the [admittedly unreliable] show archive on the Hip’s site. (Actually, I started with the .csv I generated last time, but whatever. That’s the data source. It’s still only North American shows, but I think it’d be easy enough to extend to the rest of them.) To get the shiny googlemap with markers for places they’ve played, I came up with the following steps:

  1. Pull a list of cities & states/provinces out of the csv.
  2. Ask google for the lat/lng coordinates of those cities.
  3. Turn said coordinates into xml tags.
  4. Feed the xml into googlemaps and get a marked map.

This went way better than my experiments last time, but there’s a lot I still want to do with this map to make it more useful and interesting. It’s still some lazy, sloppy code, but here it is anyway.

Step one: get a list of cities to look up.

import csv

queries = []

reader = csv.reader(open('concerts.csv'), delimiter=",")
for row in reader:
    if row[4].strip() in ['United States', 'Canada']:
        city = row[2] + "+" + row[3]
        queries.append(city)

locales = set(queries)

So that gives us a list of unique cities. (I may want to know that they’ve played Toronto 1938459 times or whatever, but I don’t need to look up Toronto’s coordinates more than once.) The way the geocode api works is that you pass in a URL with a bunch of parameters in the query string, and then it returns the coordinates in whatever format you’ve requested. So the next step is to set up all the junk to build those URLs.

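import urllib
from urlparse import urlunparse  # python 2 module; it lives in urllib.parse in python 3

scheme = 'http'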
netloc = 'maps.google.com'
path = '/maps/geo'
params = ''
fragment = ''
query_dict =   {'output': 'csv',
                'sensor': 'false',
                'key': 'REDACTED',
                'q': '',}

all_query_strings = []
for city in locales:
    query_dict['q'] = city
    newqs = urllib.urlencode(query_dict)
    all_query_strings.append(newqs)

all_urls = []
for place_qs in all_query_strings:
    url = urlunparse((scheme, netloc, path, params, place_qs, fragment))
    all_urls.append(url)

Yeah, I probably should have used list comprehensions there, but I tend to write those when I’m refactoring stuff. The first pass at something tends to be step by rudimentary step. At any rate, all_urls is now a list of URLs to give to google, which is step two.
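For the curious, here is a rough sketch of what the list-comprehension version might look like. It is written for Python 3, where urlencode and urlunparse both live in urllib.parse; the build_urls name and the two sample cities are made up for illustration, and the key is still a placeholder.

```python
from urllib.parse import urlencode, urlunparse

def build_urls(locales, key='REDACTED'):
    """Return one geocode URL per city, in sorted order."""
    # parameters shared by every request; 'q' is filled in per city
    base = {'output': 'csv', 'sensor': 'false', 'key': key}
    return [
        urlunparse(('http', 'maps.google.com', '/maps/geo', '',
                    urlencode(dict(base, q=city)), ''))
        for city in sorted(locales)
    ]

# hypothetical sample input, in the same "city+state" form as above
urls = build_urls({'Toronto+ON', 'Syracuse+NY'})
```

One thing this makes easy to see: urlencode escapes a literal + as %2B, so the separator between city and state goes out as an encoded plus sign rather than as a space.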

import time

import httplib2

f = open('coords.csv', 'wb')

# using regular file handling instead of the csv module because i am lazy
# and google sends back strings.
h = httplib2.Http()
for url in all_urls:
    resp, content = h.request(url)
    f.write(content)
    f.write('\n')
    time.sleep(1)
f.close()

There’s a limit to how many requests you can send in a given time period, but I don’t know what that limit is and am not in any hurry. Also, I should only have to run this once, so I’m not uptight about that one-second sleep in there. At any rate, I now have a csv called coords.csv that looks like this:

200,4,28.3936186,-81.5386842
200,4,32.9911550,-117.2711481

Etc. The format is: response code, accuracy, latitude, longitude. Now I want to turn all of that into xml, which is just straight-up string interpolation.

all_xml = open('coords.xml', 'wb')
reader = csv.reader(open('coords.csv'), delimiter=",")
for row in reader:
    latitude = row[2]
    longitude = row[3]
    xml_tag = "<marker lat='%s' lng='%s'/>\n" % (latitude, longitude, )
    all_xml.write(xml_tag)
all_xml.close()

And that’s pretty much that. Add in the other xml stuff to make sure it’s a valid xml document, and then call the whole thing from an html file with some canned javascript.
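As a sketch of the "other xml stuff", something like the following would build the whole document in one go instead of interpolating strings. It is Python 3 and uses the standard library's ElementTree. The root element name markers is an assumption (the javascript only looks for marker tags, so any root should do), and the middle row of the sample is an invented failed lookup, included to show rows without a 200 response code being skipped.

```python
import csv
import io
import xml.etree.ElementTree as ET

def markers_xml(csv_text):
    """Turn geocoder csv rows into a <markers> document, skipping failures."""
    root = ET.Element('markers')  # assumed root name
    for row in csv.reader(io.StringIO(csv_text)):
        # rows look like: response code, accuracy, latitude, longitude
        if len(row) != 4 or row[0] != '200':
            continue
        ET.SubElement(root, 'marker', lat=row[2], lng=row[3])
    return ET.tostring(root, encoding='unicode')

# first and last rows are from coords.csv above; the middle one is made up
sample = (
    "200,4,28.3936186,-81.5386842\n"
    "602,0,0,0\n"
    "200,4,32.9911550,-117.2711481\n"
)
doc = markers_xml(sample)
```

Letting the library write the tags also means attribute values get escaped properly, which string interpolation will not do for you.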

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <script src="http://maps.google.com/maps?file=api&amp;v=2&amp;key=REDACTED" type="text/javascript"></script>
    <script type="text/javascript">

    function initialize() {
      if (GBrowserIsCompatible()) {
        var map = new GMap2(document.getElementById("map_canvas"));
        map.setCenter(new GLatLng(37, -92), 4);
        map.setUIToDefault();

        GDownloadUrl("coords.xml", function(data) {
          var xml = GXml.parse(data);
          var markers = xml.documentElement.getElementsByTagName("marker");
          for (var i = 0; i < markers.length; i++) {
            var point = new GLatLng(parseFloat(markers[i].getAttribute("lat")),
                                    parseFloat(markers[i].getAttribute("lng")));
            map.addOverlay(new GMarker(point));
          }
        });
      }
    };
    </script>
  </head>
  <body onload="initialize()" onunload="GUnload()">
    <div id="map_canvas" style="width: 1000px; height: 600px"></div>
  </body>
</html>

Et voilà. An unuseful map of unlabeled markers, some of which are incorrect (like, why is there one in the Cayman Islands?). Having the coordinates, though, opens up a lot of possibilities for future awesomeness, which I’m sure someone else will come up with. My own current thoughts involve things like grouping the markers by year (or possibly by tour) and letting people turn them on and off; and one marker per show would be better than one marker per city (although, in that case, do I stick with city markers and just say they’ve played Toronto a lot, or do I do the data massaging necessary to get coordinates for and show Lee’s Palace vs the Horseshoe vs the ACC?); and some way to make it collaborative and interactive would be completely amazing. Like, there would be one giant hipmap and if someone were to be logged into their google account, they could click on a marker for a show they’ve attended and add their info and their two cents. And who knows what I’m going to do about setlists. Something, something, something, someday, maybe.
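The grouping-by-year idea could start from something like this rough Python 3 sketch. It assumes the date sits in the first column of the csv with the year last, and that city and state/province are columns 2 and 3 as in the code above. The shows_by_year name is made up, and the second sample row is invented purely for illustration.

```python
import csv
import io
from collections import defaultdict

def shows_by_year(csv_text):
    """Map year -> list of (city, state/province), one entry per show."""
    by_year = defaultdict(list)
    for row in csv.reader(io.StringIO(csv_text)):
        # assumed columns: date, venue, city, state/province, country, id
        year = row[0].strip().split('/')[-1]
        by_year[year].append((row[2].strip(), row[3].strip()))
    return dict(by_year)

# the second row is a hypothetical show, not real data
sample = (
    "11/07/2009,Landmark Theatre, Syracuse,NY,United States,TTH_SMT_20091107_1328\n"
    "11/08/2009,Example Hall, Toronto,ON,Canada,EXAMPLE_ID\n"
)
groups = shows_by_year(sample)
```

Each year's list could then be written out as its own set of markers, which is one possible way to get the on/off toggling.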

data visualizations and the tragically hip, part 0

The very short background to this is that I love data visualizations, and I love the Tragically Hip. So when I saw this tutorial on making choropleth maps with python, I thought, hey, I can totally do that. I’ve been wanting to mess around with data visualization stuff for a while now, and this seemed like a pretty good place to start. I decided I wanted to generate a map of North America showing me where the Hip has played most often. Seemed like a pretty simple and straightforward place to start.

First, I needed data, which I got from the Hip’s show archive. I just pasted it into a text editor and did some find-and-replace to make it a .csv. I really thought that would be the hard part. Ha!

Next, I needed a map. The tutorial uses a US county map, but that wouldn’t work for this. First, Canada is not actually in the United States! Also, they do not have counties. They have ridings, but reliable Canadian sources tell me they only care about ridings during elections. So anyway, this North American map is the one I ended up with.

If you read the first post I linked to, you will learn valuable information, such as the fact that the map has to be an .svg, which is a Scalable Vector Graphic, which is really XML. I hope you are now thinking, “oh god, will we have to parse XML?” Sadly, the answer is yes. Also, this particular map has shitty XML, which will make it all the more exciting.

But whatever! Just ignore that impending sense of dread and look at the map in a text editor. Each state/province/territory (henceforth SPT) has an ISO-3166 code assigned to it as an ID. So Illinois is US-IL, and Ontario is CA-ON. The csv I made lists city, SPT, and country, but not in handy ISO-3166 form, so I had to do something about that. I am not sure I really did the right thing, but here’s the code I wrote to do it.

# read in the concert csv
concerts = [r for r in csv.reader(open('concerts.csv', 'r'))]

for show in concerts:
    if show[4].strip() == "Canada":
        show[3] = "CA-" + "%s" % (show[3])
        # quebec's code is QC in iso-world
        if show[3] == "CA-PQ":
            show[3] = "CA-QC"
    elif show[4].strip() == "United States":
        show[3] = "US-" + "%s" % (show[3].upper())

This changes my csv from this:

11/07/2009,Landmark Theatre, Syracuse,NY,United States,TTH_SMT_20091107_1328

to this:

11/07/2009,Landmark Theatre, Syracuse,US-NY,United States,TTH_SMT_20091107_1328

I feel like there is a better way to do it, but that was the first solution that popped into my head, and it worked just fine. From there, I needed to figure out how often they played in a given SPT. I KNOW there is a better way to do this, but I stared at it for a few minutes before the laziness won out and I moved on.

    north_american_shows = {}
    for show in concerts:
        # if the state's not already there, add it. the first time through,
        # it will never be there, so everything will be set to 0.
        if show[3] not in north_american_shows.keys():
            north_american_shows[show[3]] = 0

        # look, i know this is stupid. i don't care. just count, okay.
        if show[3] in north_american_shows.keys():
            frequency = north_american_shows[show[3]]
            north_american_shows[show[3]] = frequency + 1

As you can see, I first made an empty dictionary called north_american_shows, and then I ripped through the csv and gave my dict keys corresponding to all the ISO codes. I set all their initial values to zero. Then I went through AGAIN, and just incremented the values by one for every show they played in a given SPT. Printing out north_american_shows at this point gives something like this:

{'': 11, 'BE': 1, 'US-NY': 83, 'US-PA': 33, 'US-TN': 3, ...}

I think I probably should have tried to make a dict that only used valid ISO codes as keys, instead of everything in that SPT column, and I probably should have just gone through the csv once, rather than twice. At any rate, this got me a usable dictionary that told me shocking things like mostly the Hip plays in Ontario. That piece of information, by the way, is also nicely conveyed by this:
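A single-pass version that only keeps valid-looking codes might look like the sketch below. It is Python 3 and leans on collections.Counter. The count_shows name is made up, the sample rows after the first are invented, and the regex is a guess at what counts as a valid key (CA- or US- followed by two capital letters).

```python
import re
from collections import Counter

# assumed definition of a "valid" key: CA-XX or US-XX
ISO_CODE = re.compile(r'^(CA|US)-[A-Z]{2}$')

def count_shows(concerts):
    """Count shows per state/province/territory in a single pass."""
    return Counter(show[3] for show in concerts if ISO_CODE.match(show[3]))

# hypothetical rows, already converted to ISO form as above
concerts = [
    ['11/07/2009', 'Landmark Theatre', ' Syracuse', 'US-NY', 'United States', 'TTH_SMT_20091107_1328'],
    ['', 'Example Venue', 'Toronto', 'CA-ON', 'Canada', ''],
    ['', 'Example Venue', 'Toronto', 'CA-ON', 'Canada', ''],
    ['', 'Example Venue', 'Brussels', 'BE', 'Belgium', ''],
]
counts = count_shows(concerts)
```

A Counter also returns 0 for keys it has never seen, so places with no shows need no special handling when the counts get looked up later.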

[Image: word cloud]

That is a word cloud I made in wordle in approximately 30 seconds by pasting in the ‘city’ column from my csv. But let us not be deterred by the fundamental uselessness of our exercise! Probably I would never do anything at all if I let that stop me.

Anyway! We now have our data. A quick glance through it seemed to say I should break up the distribution like this:

0
1-5
6-10
11-20
21-50
51-75
76-100
100+

There is only one place they’ve played more than 100 shows (Ontario), and only a few in the 76-100 range, so more granularity than that didn’t make much sense to me. Of course, I know exactly nothing about statistics, so I could be wrong.
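For what it's worth, those eight bins can be expressed as a list of upper edges plus a binary search, which saves writing a chain of comparisons. This is only a sketch in Python 3; the UPPER_EDGES and color_class names are made up for illustration.

```python
from bisect import bisect_left

# upper edge of each bin except the last: 0, 1-5, 6-10, 11-20, 21-50, 51-75, 76-100
UPPER_EDGES = [0, 5, 10, 20, 50, 75, 100]

def color_class(count):
    """Return the bin index (0 through 7) for a show count."""
    # bisect_left finds the first edge that is >= count, so a count equal to
    # an edge stays in that bin and anything above 100 lands in bin 7
    return bisect_left(UPPER_EDGES, count)
```

Changing the distribution then means editing one list rather than touching every branch.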

Time to pick out some colors. Here, I mostly just did what Flowing Data told me to and used ColorBrewer; I needed eight colors. (Actually, I ended up needing nine because the last color was white, so you couldn’t see the SPT delineations for places they’ve never played.)

After that, the fun part. I expected this to be pretty easy; I’m familiar with Beautiful Soup, the parser used in the tutorial. All I needed to do was go through the XML, find a tag that carried one of my ISO codes as an ID, and then append a fill attribute to color the area based on my chosen color scheme/distribution. Unfortunately, Beautiful Soup did not deal at all well with the poor markup of the map; even just reading it in and then printing it out did all sorts of terrible things. I ended up having to use lxml, which I don’t really like (mostly because it’s hard to build, although that isn’t an issue on Snow Leopard, thank god). Here’s what I ended up with for this part of it:

    colors = ['#EFEDF5', '#DADAEB', '#BCBDDC', '#9E9AC8', '#807DBA', '#6A51A3',
             '#54278F', '#3F007D']

    tree = etree.parse('map.svg')
    tags = tree.find('{http://www.w3.org/2000/svg}g')

    for tag in tags.iterdescendants():
        for a in tag.attrib.values():
            if re.match(r'(CA|US)\-.{2}', a):
                if a in north_american_shows.keys():
                    if north_american_shows[a] > 100:
                        color_class = 7
                    elif north_american_shows[a] > 75:
                        color_class = 6
                    elif north_american_shows[a] > 50:
                        color_class = 5
                    elif north_american_shows[a] > 20:
                        color_class = 4
                    elif north_american_shows[a] > 10:
                        color_class = 3
                    elif north_american_shows[a] > 5:
                        color_class = 2
                    else:
                        color_class = 1
                else:
                    color_class = 0

                color = colors[color_class]
                tag.set('fill', color)

    print etree.tostring(tree)

For whatever reason, I had to pass in the link and then the tag I was looking for. I have no idea why; a coworker had to tell me to do that (thanks, Nat!). Once I had the root tag, I iterated through all its descendants, looking for IDs that looked like an ISO code. I initially looked for ones that matched something in north_american_shows, but then I realized that wouldn’t quite get it done; because there are places they have never played, those places are not in the dict. Hence the regex and the if/else; I needed to find those, too, and set them to the color I chose for 0.
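The likely reason for the link: it is the SVG namespace. The map declares it on its root element, so every tag inside really has the qualified name {http://www.w3.org/2000/svg}g rather than plain g, and ElementTree-style APIs want the qualified form. The Python 3 sketch below shows this with the standard library's xml.etree.ElementTree instead of lxml (lxml accepts the same {namespace}tag syntax). The tiny svg string is a stand-in for the real map file.

```python
import xml.etree.ElementTree as ET

SVG_NS = 'http://www.w3.org/2000/svg'

# a tiny stand-in for the real map file
svg = (
    '<svg xmlns="http://www.w3.org/2000/svg">'
    '<g><path id="US-IL"/><path id="CA-ON"/></g>'
    '</svg>'
)
root = ET.fromstring(svg)

plain = root.find('g')                         # None: there is no un-namespaced <g>
qualified = root.find('{%s}g' % SVG_NS)        # finds the namespaced <g>
via_map = root.find('svg:g', {'svg': SVG_NS})  # same element, using a prefix map

# collect the ids of everything under the group
ids = [el.get('id') for el in qualified.iter() if el.get('id')]
```

The prefix-map form is handy when there are a lot of lookups, since the long URL only has to be written once.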

Once that was done, I ran:

$ python colourize.py > hipmap.svg

And, amazingly enough, I have a map that shows me things I already knew.

[Image: north american hip map]

The darker the purple, the more often they’ve played in that SPT. I’m pretty pleased with myself, even though this is the simplest possible visualization I could do and it took me a lot longer than I feel it should have. Seriously, I spent many days fighting with XML parsers, trying to get it figured out. But I’m definitely curious to know how the code could be better, because it’s obvious to me that it could be.

In the meantime, I think I’m going to try doing something with google maps, incorporating more granularity in terms of where they’ve played and on what tour. I’d also like to do something with setlists, but that data is a little harder to get into a useful format. It’s out there, for sure, thanks to fans more obsessive than I am, but it’s going to take me a while to figure out what I want to do and how I want to do it.

Before I do any of that, though, I’m going to go see some Hip concerts.

Oh, right. The full script, for completeness’ sake:

import csv
import re
from lxml import etree

def do_stuff():
    # okay. step one is to dump the csv into a list of lists.
    concerts = [r for r in csv.reader(open('concerts.csv', 'r'))]

    # so we have a list of lists. now we need the strings to be useful. while
    # you're in there, strip out any whitespace, because we hate whitespace
    # here in python land.
    for show in concerts:
        if show[4].strip() == "Canada":
            show[3] = "CA-" + "%s" % (show[3])
            # fix the PQ/QC thing in the data
            if show[3] == "CA-PQ":
                show[3] = "CA-QC"
        elif show[4].strip() == "United States":
            show[3] = "US-" + "%s" % (show[3].upper())

    # okay, now we need to find out how often they play in a given place. so,
    # uh, let's build a dictionary.
    north_american_shows = {}
    for show in concerts:
        # if the state's not already there, add it. the first time through,
        # it will never be there, so everything will be set to 0.
        if show[3] not in north_american_shows.keys():
            north_american_shows[show[3]] = 0

        # look, i know this is stupid. i don't care. just count, okay.
        if show[3] in north_american_shows.keys():
            frequency = north_american_shows[show[3]]
            north_american_shows[show[3]] = frequency + 1

    colors = ['#EFEDF5', '#DADAEB', '#BCBDDC', '#9E9AC8', '#807DBA', '#6A51A3',
              '#54278F', '#3F007D']

    tree = etree.parse('map.svg')
    tags = tree.find('{http://www.w3.org/2000/svg}g')

    for tag in tags.iterdescendants():
        for a in tag.attrib.values():
            if re.match(r'(CA|US)\-.{2}', a):
                if a in north_american_shows.keys():
                    if north_american_shows[a] > 100:
                        color_class = 7
                    elif north_american_shows[a] > 75:
                        color_class = 6
                    elif north_american_shows[a] > 50:
                        color_class = 5
                    elif north_american_shows[a] > 20:
                        color_class = 4
                    elif north_american_shows[a] > 10:
                        color_class = 3
                    elif north_american_shows[a] > 5:
                        color_class = 2
                    else:
                        color_class = 1
                else:
                    color_class = 0

                color = colors[color_class]
                tag.set('fill', color)

    print etree.tostring(tree)

if __name__ == "__main__":
    do_stuff()