data visualizations and the tragically hip, part 0

The very short background to this is that I love data visualizations, and I love the Tragically Hip. So when I saw this tutorial on making choropleth maps with Python, I thought, hey, I can totally do that. I’ve been wanting to mess around with data visualization stuff for a while now, and generating a map of North America showing where the Hip has played most often seemed like a simple, straightforward place to start.

First, I needed data, which I got from the Hip’s show archive. I just pasted it into a text editor and did some find-and-replace to make it a .csv. I really thought that would be the hard part. Ha!

Next, I needed a map. The tutorial uses a US county map, but that wouldn’t work for this. First, Canada is not actually in the United States! Also, they do not have counties. They have ridings, but reliable Canadian sources tell me they only care about ridings during elections. So anyway, this North American map is the one I ended up with.

If you read the first post I linked to, you will learn valuable information, such as the fact that the map has to be an .svg, which is a Scalable Vector Graphic, which is really XML. I hope you are now thinking, “oh god, will we have to parse XML?” Sadly, the answer is yes. Also, this particular map has shitty XML, which will make it all the more exciting.

But whatever! Just ignore that impending sense of dread and look at the map in a text editor. Each state/province/territory (henceforth SPT) has an ISO-3166 code assigned to it as an ID. So Illinois is US-IL, and Ontario is CA-ON. The csv I made lists city, SPT, and country, but not in handy ISO-3166 form, so I had to do something about that. I am not sure I really did the right thing, but here’s the code I wrote to do it.

import csv

# read in the concert csv
concerts = [r for r in csv.reader(open('concerts.csv', 'r'))]

for show in concerts:
    if show[4].strip() == "Canada":
        show[3] = "CA-" + show[3]
        # quebec's code is QC in iso-world
        if show[3] == "CA-PQ":
            show[3] = "CA-QC"
    elif show[4].strip() == "United States":
        show[3] = "US-" + show[3].upper()

This changes my csv from this:

11/07/2009,Landmark Theatre, Syracuse,NY,United States,TTH_SMT_20091107_1328

to this:

11/07/2009,Landmark Theatre, Syracuse,US-NY,United States,TTH_SMT_20091107_1328
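
(In hindsight, a little lookup table would do the same job with less branching. This is just a sketch, not what the script below uses:)

    # hypothetical tidier version: map country names to ISO prefixes
    PREFIXES = {'Canada': 'CA', 'United States': 'US'}
    FIXUPS = {'CA-PQ': 'CA-QC'}  # the archive says PQ; ISO says QC

    for show in concerts:
        prefix = PREFIXES.get(show[4].strip())
        if prefix:
            code = '%s-%s' % (prefix, show[3].strip().upper())
            show[3] = FIXUPS.get(code, code)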

So yes, there are better ways to do it (see the sketch above), but the if/elif was the first solution that popped into my head, and it worked just fine. From there, I needed to figure out how often they played in a given SPT. I KNOW there is a better way to do this, too, but I stared at it for a few minutes before the laziness won out and I moved on.

    north_american_shows = {}
    for show in concerts:
        # if the state's not already there, add it. the first time through,
        # it will never be there, so everything will be set to 0.
        if show[3] not in north_american_shows:
            north_american_shows[show[3]] = 0

        # look, i know this is stupid. i don't care. just count, okay.
        if show[3] in north_american_shows:
            frequency = north_american_shows[show[3]]
            north_american_shows[show[3]] = frequency + 1

As you can see, I made an empty dictionary called north_american_shows, then ripped through the csv: for each show, if its ISO code wasn’t already a key, I added it with a value of zero, and then checked for the key AGAIN (it is always there by that point) before incrementing its value by one. Printing out north_american_shows at this point gives something like this:

{'': 11, 'BE': 1, 'US-NY': 83, 'US-PA': 33, 'US-TN': 3, ...}
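
(Writing it up, it’s obvious the whole loop could be a single tidy pass with collections.defaultdict; a quick sketch, not what I actually ran:)

    from collections import defaultdict

    # missing keys spring into existence as 0, so no membership checks
    north_american_shows = defaultdict(int)
    for show in concerts:
        north_american_shows[show[3]] += 1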

Something like that is probably what I should have done in the first place, along with only keeping valid ISO codes as keys instead of everything in that SPT column, and definitely without that pointless second membership check. At any rate, this got me a usable dictionary that told me shocking things like: mostly, the Hip plays in Ontario. That piece of information, by the way, is also nicely conveyed by this:

[image: word cloud of the cities the Hip has played]

That is a word cloud I made in Wordle in approximately 30 seconds by pasting in the ‘city’ column from my csv. But let us not be deterred by the fundamental uselessness of our exercise! Probably I would never do anything at all if I let that stop me.

Anyway! We now have our data. A quick glance through it seemed to say I should break up the distribution like this:

0
1-5
6-10
11-20
21-50
51-75
76-100
101+

There is only one place they’ve played more than 100 shows (Ontario), and only a few in the 76-100 range, so more granularity than that didn’t make much sense to me. Of course, I know exactly nothing about statistics, so I could be wrong.
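
(Fair warning: those cutoffs turn into a long if/elif chain in the script below. If I were less lazy, the standard library’s bisect module could collapse the binning into a single lookup; a sketch:)

    import bisect

    # upper bound of each colour class, in order
    BOUNDS = [0, 5, 10, 20, 50, 75, 100]

    def color_class(count):
        # 0 shows -> 0, 1-5 -> 1, 6-10 -> 2, ..., 76-100 -> 6, 101+ -> 7
        return bisect.bisect_left(BOUNDS, count)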

Time to pick out some colors. Here, I mostly just did what Flowing Data told me to and used ColorBrewer; I needed eight colors. (Actually, I ended up pulling nine, because the lightest color in the eight-color scheme was white, and you couldn’t see the SPT delineations for places they’ve never played; dropping the white one left me the eight you see below.)

After that, the fun part. I expected this to be pretty easy; I’m familiar with Beautiful Soup, the parser used in the tutorial. All I needed to do was go through the XML, find a tag that carried one of my ISO codes as an ID, and then append a fill attribute to color the area based on my chosen color scheme/distribution. Unfortunately, Beautiful Soup did not deal at all well with the poor markup of the map; even just reading it in and then printing it out did all sorts of terrible things. I ended up having to use lxml, which I don’t really like (mostly because it’s hard to build, although that isn’t an issue on Snow Leopard, thank god). Here’s what I ended up with for this part of it:

    
    colors = ['#EFEDF5', '#DADAEB', '#BCBDDC', '#9E9AC8', '#807DBA', '#6A51A3',
              '#54278F', '#3F007D']

    # the map's tags live in the SVG namespace, so find() needs the
    # full {namespace}tag form
    tree = etree.parse('map.svg')
    tags = tree.find('{http://www.w3.org/2000/svg}g')

    for tag in tags.iterdescendants():
        for a in tag.attrib.values():
            # anything that looks like an ISO-3166-2 code, e.g. CA-ON
            if re.match(r'(CA|US)-.{2}', a):
                if a in north_american_shows:
                    if north_american_shows[a] > 100:
                        color_class = 7
                    elif north_american_shows[a] > 75:
                        color_class = 6
                    elif north_american_shows[a] > 50:
                        color_class = 5
                    elif north_american_shows[a] > 20:
                        color_class = 4
                    elif north_american_shows[a] > 10:
                        color_class = 3
                    elif north_american_shows[a] > 5:
                        color_class = 2
                    else:
                        color_class = 1
                else:
                    # never played here; gets the lightest colour
                    color_class = 0

                color = colors[color_class]
                tag.set('fill', color)

    print etree.tostring(tree)

For whatever reason, I had to pass in that namespace URL along with the tag I was looking for; I had no idea why until a coworker told me to do it (thanks, Nat!). Apparently that’s just how ElementTree-style APIs spell namespaced tags: {namespace-url}tagname. Once I had the root tag, I iterated through all its descendants, looking for attribute values that looked like an ISO code. I initially looked only for values that matched something in north_american_shows, but then I realized that wouldn’t quite get it done; because there are places they have never played, those places are not in the dict. Hence the regex and the if/else; I needed to find those, too, and set them to the color I chose for 0.
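
(For what it’s worth, lxml also speaks XPath, which can grab the id-carrying elements directly, namespaces and all; a sketch, not what my script does:)

    # given the tree parsed above
    NS = {'svg': 'http://www.w3.org/2000/svg'}

    # every element in the SVG namespace that carries an id attribute
    for el in tree.xpath('//svg:*[@id]', namespaces=NS):
        print el.get('id')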

Once that was done, I ran:

$ python colourize.py > hipmap.svg
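
(Shell redirection works fine, but an lxml tree can also write itself out with tree.write('hipmap.svg'), which skips the print entirely.)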

And, amazingly enough, I have a map that shows me things I already knew.

[image: north american hip map]

The darker the purple, the more often they’ve played in that SPT. I’m pretty pleased with myself, even though this is the simplest possible visualization I could do and it took me a lot longer than I feel it should have. Seriously, I spent many days fighting with XML parsers, trying to get it figured out. But I’m definitely curious to know how the code could be better, because it’s obvious to me that it could be.

In the meantime, I think I’m going to try doing something with Google Maps, incorporating more granularity in terms of where they’ve played and on what tour. I’d also like to do something with setlists, but that data is a little harder to get into a useful format. It’s out there, for sure, thanks to fans more obsessive than I am, but it’s going to take me a while to figure out what I want to do and how I want to do it.

Before I do any of that, though, I’m going to go see some Hip concerts.

Oh, right. The full script, for completeness’ sake:

import csv
import re
from lxml import etree

def do_stuff():
    # okay. step one is to dump the csv into a list of lists.
    concerts = [r for r in csv.reader(open('concerts.csv', 'r'))]
    
    # so we have a list of lists. now we need the strings to be useful. while
    # you're in there, strip out any whitespace, because we hate whitespace
    # here in python land.
    for show in concerts:
        if show[4].strip() == "Canada":
            show[3] = "CA-" + show[3]
            # fix the PQ/QC thing in the data
            if show[3] == "CA-PQ":
                show[3] = "CA-QC"
        elif show[4].strip() == "United States":
            show[3] = "US-" + show[3].upper()
    
    # okay, now we need to find out how often they play in a given place. so,
    # uh, let's build a dictionary. 
    north_american_shows = {}
    for show in concerts:
        # if the state's not already there, add it. the first time through,
        # it will never be there, so everything will be set to 0.
        if show[3] not in north_american_shows:
            north_american_shows[show[3]] = 0

        # look, i know this is stupid. i don't care. just count, okay.
        if show[3] in north_american_shows:
            frequency = north_american_shows[show[3]]
            north_american_shows[show[3]] = frequency + 1
    
    colors = ['#EFEDF5', '#DADAEB', '#BCBDDC', '#9E9AC8', '#807DBA', '#6A51A3',
              '#54278F', '#3F007D']

    # the map's tags live in the SVG namespace, so find() needs the
    # full {namespace}tag form
    tree = etree.parse('map.svg')
    tags = tree.find('{http://www.w3.org/2000/svg}g')

    for tag in tags.iterdescendants():
        for a in tag.attrib.values():
            # anything that looks like an ISO-3166-2 code, e.g. CA-ON
            if re.match(r'(CA|US)-.{2}', a):
                if a in north_american_shows:
                    if north_american_shows[a] > 100:
                        color_class = 7
                    elif north_american_shows[a] > 75:
                        color_class = 6
                    elif north_american_shows[a] > 50:
                        color_class = 5
                    elif north_american_shows[a] > 20:
                        color_class = 4
                    elif north_american_shows[a] > 10:
                        color_class = 3
                    elif north_american_shows[a] > 5:
                        color_class = 2
                    else:
                        color_class = 1
                else:
                    # never played here; gets the lightest colour
                    color_class = 0

                color = colors[color_class]
                tag.set('fill', color)

    print etree.tostring(tree)
    

if __name__ == "__main__":
    do_stuff()


7 Responses to data visualizations and the tragically hip, part 0

  1. Julie Steele 2009-11-23 at 15:34

    Sweet! Love it. What could be better than the combination of good tunes and good tech?

    I’m curious: was the availability of a useable map what made the decision to visualize location by state/province rather than city? Or was it a granularity issue?

    It would be potentially fascinating to see set list data, particularly if you found a way to correlate that with region, world events, days of the week, or Gord Downie’s coffee consumption. :-)

    • pam 2009-11-23 at 15:48

      Life’s too short for bad coffee!

      And… yeah, I’m not sure how I would have done a choropleth map with city data. You need areas to fill in for these things, so that doesn’t make much sense if you’re using cities. I might have tried to do it by county, but having it be two separate countries that divide things up and identify areas differently made that really hard.

      I definitely do want to do cities, but I think markers on a googlemap will be a better fit. That, however, will require me to get latitude and longitude coordinates for the cities, and if I’m doing THAT, I maybe want to do it by actual venue. Like, they have played a lot in Toronto, but I’d be curious to watch them go from smaller clubs to, say, the Gardens. I could do a set of markers for each tour, and you could turn them on and off and watch them take over the world! Or, you know, Canada.

      I also agree that there’s a ton of cool stuff to do with setlists, but that is step four or five of my crazy plan to visualize all this stuff.

      • Julie Steele 2009-11-23 at 16:19

        Yeah, I understand about setlist data being a few steps down the road. But I don’t think I’ve seen anyone analyze that kind of data before, and my imagination is all fired up over how you could explore it.

        You could assign each song a “cheerfulness” value (subjectively assigned from 0-5) perhaps, or a “popularity” value (based on number of weeks on radio charts), or any number of other values, and then look for patterns in which songs are played most often, where in the show they’re played, in which cities certain songs are skipped, etc. It’s the kind of data where the outliers would be the most interesting parts.
        “What? They’ve never played ‘Poets’ in Seattle?” That kind of thing.

        I can think of a bunch of other bands I’d love to do this with, too. Then you could aggregate! “Wow, songs played in Memphis are 75% more cheerful than songs played in Austin for all five bands we looked at.”

        This is how pet projects turn into monsters, I know, but when it’s about music I can’t help feeding them.

        • pam 2009-11-23 at 16:37

          Oh, man, I will be doing this the rest of my life! But those are hilariously excellent ideas, and then there can be heated arguments about what the cheerfulness rating of ‘Fireworks’ should be! This all makes me wish I’d paid more attention back in my stats class. Or any attention at all.

  2. murklins 2009-11-23 at 20:49

    I read this, finally! You totally had me expecting some terrible code in that part where you populate your dictionary of locations/show frequency, but it was really anti-climactic. I guess that’s for the best, code efficiency-wise, but my day has been blah so I was hoping for more of a thrill. Instead, I just learned that you don’t much like the else clause.

    I keep thinking I am the absolute KING of typos these days, but it is just my FF dictionary telling me that none of my words are actual French words. Except shocking. Shocking is fine.

    • pam 2009-11-24 at 03:27

      Well, the hilariously terrible code had the disadvantage of not actually working, so I had to write better code. Sorry it failed to brighten your day, although I do find that shocking. Shocking, I say. Possibly I am even outraged, Darren Nichols style.

