
more hip visualizations, part 0.1

A while back, I started messing around with data visualization stuff, and came up with a choropleth North American map that attempted to show the places the Tragically Hip played most often. I now have a slightly shinier and more granular map that shows cities, which is step 0.1 on the way to glory. Not that I have any idea what constitutes ‘glory’ in this instance, but I’m told that life is about the journey, not the destination.

Once again, I started with the [admittedly unreliable] show archive on the Hip’s site. (Actually, I started with the .csv I generated last time, but whatever. That’s the data source. It’s still only North American shows, but I think it’d be easy enough to extend to the rest of them.) To get the shiny googlemap with markers for places they’ve played, I came up with the following steps:

  1. Pull a list of cities & states/provinces out of the csv.
  2. Ask google for the lat/lng coordinates of those cities.
  3. Turn said coordinates into xml tags.
  4. Feed the xml into googlemaps and get a marked map.

This went way better than my experiments last time, but there’s a lot I still want to do with this map to make it more useful and interesting. It’s still some lazy, sloppy code, but here it is anyway.

Step one: get a list of cities to look up.

import csv

queries = []

reader = csv.reader(open('concerts.csv'), delimiter=",")
for row in reader:
    if row[4].strip() in ['United States', 'Canada']:
        city = row[2] + "+" + row[3]
        queries.append(city)

locales = set(queries)

So that gives us a list of unique cities. (I may want to know that they’ve played Toronto 1938459 times or whatever, but I don’t need to look up Toronto’s coordinates more than once.) The way the geocode api works is that you pass in a URL with a bunch of parameters in the query string, and then it returns the coordinates in whatever format you’ve requested. So the next step is to set up all the junk to build those URLs.

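# python 2 modules, to match the urllib.urlencode call below
import urllib
from urlparse import urlunparse

scheme = 'http'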
netloc = 'maps.google.com'
path = '/maps/geo'
params = ''
fragment = ''
query_dict =   {'output': 'csv',
                'sensor': 'false',
                'key': 'REDACTED',
                'q': '',}

all_query_strings = []
for city in locales:
    query_dict['q'] = city
    newqs = urllib.urlencode(query_dict)
    all_query_strings.append(newqs)

all_urls = []
for place_qs in all_query_strings:
    url = urlunparse((scheme, netloc, path, params, place_qs, fragment))
    all_urls.append(url)

Yeah, I probably should have used list comprehensions there, but I tend to write those when I’m refactoring stuff. The first pass at something tends to be step by rudimentary step. At any rate, all_urls is now a list of URLs to give to google, which is step two.
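For what it's worth, the list-comprehension version of the URL-building step might look something like the sketch below. It is written for Python 3 (so urllib.parse rather than the Python 2 urllib and urlparse modules used above), it only builds URL strings and never sends a request, and build_geocode_urls and HYPOTHETICAL_KEY are made-up names. One thing worth knowing: urlencode escapes a literal "+" as "%2B", so this version joins city and state with a space and lets urlencode turn that into "+".

```python
from urllib.parse import urlencode, urlunparse

# Host and path are copied from the post; nothing here checks that they still work.
SCHEME, NETLOC, PATH = "http", "maps.google.com", "/maps/geo"


def build_geocode_urls(places, api_key):
    """Return one geocoder URL per unique (city, region) pair, in sorted order."""
    base_query = {"output": "csv", "sensor": "false", "key": api_key}
    return [
        urlunparse((SCHEME, NETLOC, PATH, "",
                    urlencode(dict(base_query, q="%s %s" % place)), ""))
        for place in sorted(set(places))  # set() drops the repeat visits
    ]


# HYPOTHETICAL_KEY is a placeholder, not a real API key.
urls = build_geocode_urls(
    [("Toronto", "ON"), ("Syracuse", "NY"), ("Toronto", "ON")],
    api_key="HYPOTHETICAL_KEY",
)
for url in urls:
    print(url)
```

Sorting the places is optional; it just makes the output order predictable from one run to the next.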

import time
import httplib2

f = open('coords.csv', 'wb')

# using regular file handling instead of the csv module because i am lazy
# and google sends back strings.
h = httplib2.Http()
for url in all_urls:
    resp, content = h.request(url)
    f.write(content)
    f.write('\n')
    time.sleep(1)
f.close()

There’s a limit to how many requests you can send in a given time period, but I don’t know what that limit is and am not in any hurry. Also, I should only have to run this once, so I’m not uptight about that one-second sleep in there. At any rate, I now have a csv called coords.csv that looks like this:

200,4,28.3936186,-81.5386842
200,4,32.9911550,-117.2711481

Etc. The format is: response code, accuracy, latitude, longitude. Now I want to turn all of that into xml, which is just straight-up string interpolation.

all_xml = open('coords.xml', 'wb')
reader = csv.reader(open('coords.csv'), delimiter=",")
for row in reader:
    latitude = row[2]
    longitude = row[3]
    xml_tag = "<marker lat='%s' lng='%s'/>\n" % (latitude, longitude, )
    all_xml.write(xml_tag)
all_xml.close()

And that’s pretty much that. Add in the other xml stuff to make sure it’s a valid xml document, and then call the whole thing from an html file with some canned javascript.
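The "other xml stuff" can be as little as a single root element wrapped around the marker tags. Here is a minimal Python 3 sketch, assuming a root element called markers (a made-up name; the javascript below only looks for marker tags, so any root name should do) and assuming that rows without a 200 status should be skipped. coords_to_xml is a hypothetical helper, and the non-200 row in the sample is invented for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET


def coords_to_xml(csv_text):
    """Turn 'status,accuracy,lat,lng' rows into a <markers> XML document."""
    root = ET.Element("markers")  # root element name is an assumption
    for row in csv.reader(io.StringIO(csv_text)):
        # Skip blank lines and any lookup that did not come back with status 200.
        if len(row) != 4 or row[0] != "200":
            continue
        ET.SubElement(root, "marker", lat=row[2], lng=row[3])
    return ET.tostring(root, encoding="unicode")


# Two rows from the post, plus a blank line and an invented failed lookup.
sample = (
    "200,4,28.3936186,-81.5386842\n"
    "\n"
    "500,0,0,0\n"
    "200,4,32.9911550,-117.2711481\n"
)
xml_doc = coords_to_xml(sample)
print(xml_doc)
```

Building the document through an XML library rather than string interpolation means attribute quoting and escaping are handled for you.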

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <script src="http://maps.google.com/maps?file=api&amp;v=2&amp;key=REDACTED" type="text/javascript"></script>
    <script type="text/javascript">

    function initialize() {
      if (GBrowserIsCompatible()) {
        var map = new GMap2(document.getElementById("map_canvas"));
        map.setCenter(new GLatLng(37, -92), 4);
        map.setUIToDefault();

        GDownloadUrl("coords.xml", function(data) {
          var xml = GXml.parse(data);
          var markers = xml.documentElement.getElementsByTagName("marker");
          for (var i = 0; i < markers.length; i++) {
            var point = new GLatLng(parseFloat(markers[i].getAttribute("lat")),
                                    parseFloat(markers[i].getAttribute("lng")));
            map.addOverlay(new GMarker(point));
          }
        });
      }
    };
    </script>
  </head>
  <body onload="initialize()" onunload="GUnload()">
    <div id="map_canvas" style="width: 1000px; height: 600px"></div>
  </body>
</html>

Et voilà. An unuseful map of unlabeled markers, some of which are incorrect (like, why is there one in the Cayman Islands?). Having the coordinates, though, opens up a lot of possibilities for future awesomeness, which I’m sure someone else will come up with. My own current thoughts involve things like grouping the markers by year (or possibly by tour) and letting people turn them on and off; and one marker per show would be better than one marker per city (although, in that case, do I stick with city markers and just say they’ve played Toronto a lot, or do I do the data massaging necessary to get coordinates for and show Lee’s Palace vs the Horseshoe vs the ACC?); and some way to make it collaborative and interactive would be completely amazing. Like, there would be one giant hipmap and if someone were to be logged into their google account, they could click on a marker for a show they’ve attended and add their info and their two cents. And who knows what I’m going to do about setlists. Something, something, something, someday, maybe.
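The group-by-year idea is mostly a matter of bucketing rows by the year in the date column. A rough Python 3 sketch, assuming the column order of the concerts csv (date first, city third, state or province fourth) and dates that end in a four-digit year; cities_by_year is a hypothetical helper and the Montreal rows are made up.

```python
from collections import defaultdict


def cities_by_year(rows):
    """Map each year to the set of (city, region) pairs played that year."""
    grouped = defaultdict(set)
    for row in rows:
        year = row[0].strip().split("/")[-1]  # "11/07/2009" -> "2009"
        grouped[year].add((row[2].strip(), row[3].strip()))
    return dict(grouped)


# First row is from the post; the others are invented sample data.
rows = [
    ["11/07/2009", "Landmark Theatre", " Syracuse", "NY", "United States", "id-1"],
    ["01/01/2000", "Hypothetical Hall", "Montreal", "PQ", "Canada", "id-2"],
    ["01/02/2000", "Hypothetical Hall", "Montreal", "PQ", "Canada", "id-3"],
]
by_year = cities_by_year(rows)
print(by_year)
```

Each year's set could then become its own group of markers that gets switched on and off together.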

data visualizations and the tragically hip, part 0

The very short background to this is that I love data visualizations, and I love the Tragically Hip. So when I saw this tutorial on making choropleth maps with python, I thought, hey, I can totally do that. I’ve been wanting to mess around with data visualization stuff for a while now, and this seemed like a pretty good place to start. I decided I wanted to generate a map of North America showing me where the Hip has played most often. Seemed like a pretty simple and straightforward place to start.

First, I needed data, which I got from the Hip’s show archive. I just pasted it into a text editor and did some find-and-replace to make it a .csv. I really thought that would be the hard part. Ha!

Next, I needed a map. The tutorial uses a US county map, but that wouldn’t work for this. First, Canada is not actually in the United States! Also, they do not have counties. They have ridings, but reliable Canadian sources tell me they only care about ridings during elections. So anyway, this North American map is the one I ended up with.

If you read the first post I linked to, you will learn valuable information, such as the fact that the map has to be an .svg, which is a Scalable Vector Graphic, which is really XML. I hope you are now thinking, “oh god, will we have to parse XML?” Sadly, the answer is yes. Also, this particular map has shitty XML, which will make it all the more exciting.

But whatever! Just ignore that impending sense of dread and look at the map in a text editor. Each state/province/territory (henceforth SPT) has an ISO-3166 code assigned to it as an ID. So Illinois is US-IL, and Ontario is CA-ON. The csv I made lists city, SPT, and country, but not in handy ISO-3166 form, so I had to do something about that. I am not sure I really did the right thing, but here’s the code I wrote to do it.

# read in the concert csv
concerts = [r for r in csv.reader(open('concerts.csv', 'r'))]

for show in concerts:
    if show[4].strip() == "Canada":
        show[3] = "CA-" + "%s" % (show[3])
        # quebec's code is QC in iso-world
        if show[3] == "CA-PQ":
            show[3] = "CA-QC"
    elif show[4].strip() == "United States":
        show[3] = "US-" + "%s" % (show[3].upper())

This changes my csv from this:

11/07/2009,Landmark Theatre, Syracuse,NY,United States,TTH_SMT_20091107_1328

to this:

11/07/2009,Landmark Theatre, Syracuse,US-NY,United States,TTH_SMT_20091107_1328

I feel like there is a better way to do it, but that was the first solution that popped into my head, and it worked just fine. From there, I needed to figure out how often they played in a given SPT. I KNOW there is a better way to do this, but I stared at it for a few minutes before the laziness won out and I moved on.

    north_american_shows = {}
    for show in concerts:
        # if the state's not already there, add it. the first time through,
        # it will never be there, so everything will be set to 0.
        if show[3] not in north_american_shows.keys():
            north_american_shows[show[3]] = 0

        # look, i know this is stupid. i don't care. just count, okay.
        if show[3] in north_american_shows.keys():
            frequency = north_american_shows[show[3]]
            north_american_shows[show[3]] = frequency + 1

As you can see, I first made an empty dictionary called north_american_shows, and then I ripped through the csv and gave my dict keys corresponding to all the ISO codes. I set all their initial values to zero. Then I went through AGAIN, and just incremented the values by one for every show they played in a given SPT. Printing out north_american_shows at this point gives something like this:

{'': 11, 'BE': 1, 'US-NY': 83, 'US-PA': 33, 'US-TN': 3, ...}

I think I probably should have tried to make a dict that only used valid ISO codes as keys, instead of everything in that SPT column, and I probably should have just gone through the csv once, rather than twice. At any rate, this got me a usable dictionary that told me shocking things like mostly the Hip plays in Ontario. That piece of information, by the way, is also nicely conveyed by this:

word cloud

That is a word cloud I made in wordle in approximately 30 seconds by pasting in the ‘city’ column from my csv. But let us not be deterred by the fundamental uselessness of our exercise! Probably I would never do anything at all if I let that stop me.
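Circling back to the counting step for a second: a dict that only has valid ISO codes as keys, built in one pass, could be sketched roughly like this. It is Python 3, it uses collections.Counter from the standard library, and iso_code, count_shows, and every sample row except the Syracuse one are made up for illustration.

```python
import re
from collections import Counter

ISO_CODE = re.compile(r"^(CA|US)-[A-Z]{2}$")
COUNTRY_PREFIX = {"Canada": "CA", "United States": "US"}


def iso_code(region, country):
    """Return a code such as 'US-NY', or None if the row is not usable."""
    prefix = COUNTRY_PREFIX.get(country.strip())
    if prefix is None:
        return None
    code = "%s-%s" % (prefix, region.strip().upper())
    if code == "CA-PQ":  # the data says PQ; the ISO code for Quebec is QC
        code = "CA-QC"
    return code if ISO_CODE.match(code) else None


def count_shows(rows):
    """Count shows per state/province/territory, ignoring unusable rows."""
    codes = (iso_code(row[3], row[4]) for row in rows)
    return Counter(code for code in codes if code is not None)


rows = [
    ["11/07/2009", "Landmark Theatre", " Syracuse", "NY", "United States", "id-1"],
    ["01/01/2000", "Hypothetical Hall", "Montreal", "PQ", "Canada", "id-2"],
    ["01/02/2000", "Hypothetical Hall", "Montreal", "PQ", "Canada", "id-3"],
    ["01/03/2000", "Hypothetical Venue", "Brussels", "", "BE", "id-4"],
]
counts = count_shows(rows)
print(counts)
```

A Counter also returns 0 for keys it has never seen, which is handy for places they have never played.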

Anyway! We now have our data. A quick glance through it seemed to say I should break up the distribution like this:

0
1-5
6-10
11-20
21-50
51-75
76-100
101+

There is only one place they’ve played more than 100 shows (Ontario), and only a few in the 76-100 range, so more granularity than that didn’t make much sense to me. Of course, I know exactly nothing about statistics, so I could be wrong.
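Turning a show count into one of those eight buckets doesn't need a long if/elif chain. One way to sketch it in Python 3 is with the standard library's bisect module; UPPER_BOUNDS and color_class are hypothetical names, and exactly 100 shows is assumed to belong to the 76-100 bucket.

```python
from bisect import bisect_left

# Upper bound of each bucket: 0, 1-5, 6-10, 11-20, 21-50, 51-75, 76-100.
# Anything above the last bound falls into the final, eighth bucket.
UPPER_BOUNDS = [0, 5, 10, 20, 50, 75, 100]


def color_class(show_count):
    """Return the bucket index (0 through 7) for a number of shows."""
    return bisect_left(UPPER_BOUNDS, show_count)


for shows in (0, 1, 5, 6, 83, 100, 101):
    print(shows, color_class(shows))
```

The index that comes back can be used directly to pick a colour out of an eight-item list.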

Time to pick out some colors. Here, I mostly just did what Flowing Data told me to and used ColorBrewer; I needed eight colors. (Actually, I ended up needing nine because the last color was white, so you couldn’t see the SPT delineations for places they’ve never played.)

After that, the fun part. I expected this to be pretty easy; I’m familiar with Beautiful Soup, the parser used in the tutorial. All I needed to do was go through the XML, find a tag that carried one of my ISO codes as an ID, and then append a fill attribute to color the area based on my chosen color scheme/distribution. Unfortunately, Beautiful Soup did not deal at all well with the poor markup of the map; even just reading it in and then printing it out did all sorts of terrible things. I ended up having to use lxml, which I don’t really like (mostly because it’s hard to build, although that isn’t an issue on Snow Leopard, thank god). Here’s what I ended up with for this part of it:

    colors = ['#EFEDF5', '#DADAEB', '#BCBDDC', '#9E9AC8', '#807DBA', '#6A51A3',
              '#54278F', '#3F007D']

    tree = etree.parse('map.svg')
    tags = tree.find('{http://www.w3.org/2000/svg}g')

    for tag in tags.iterdescendants():
        for a in tag.attrib.values():
            if re.match(r'(CA|US)\-.{2}', a):
                if a in north_american_shows.keys():
                    if north_american_shows[a] > 100:
                        color_class = 7
                    elif north_american_shows[a] > 75:
                        color_class = 6
                    elif north_american_shows[a] > 50:
                        color_class = 5
                    elif north_american_shows[a] > 20:
                        color_class = 4
                    elif north_american_shows[a] > 10:
                        color_class = 3
                    elif north_american_shows[a] > 5:
                        color_class = 2
                    else:
                        color_class = 1
                else:
                    color_class = 0

                color = colors[color_class]
                tag.set('fill', color)

    print etree.tostring(tree)

For whatever reason, I had to pass in the link and then the tag I was looking for. I have no idea why; a coworker had to tell me to do that (thanks, Nat!). Once I had the root tag, I iterated through all its descendants, looking for IDs that looked like an ISO code. I initially looked for ones that matched something in north_american_shows, but then I realized that wouldn’t quite get it done; because there are places they have never played, those places are not in the dict. Hence the regex and the if/else; I needed to find those, too, and set them to the color I chose for 0.
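The "link" is the SVG XML namespace: elements in a namespaced document have to be looked up by their fully qualified name, written {namespace}tag. Here is a small Python 3 sketch of that behaviour using the standard library's xml.etree.ElementTree instead of lxml (a swap made only so the example has no dependencies), run against a tiny made-up SVG rather than the real map. The fills dict and its single entry are placeholders.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
ET.register_namespace("", SVG_NS)  # keep plain <svg>/<path> tag names on output

# A tiny stand-in for the real map; the ids are ISO-style region codes.
svg_text = (
    '<svg xmlns="http://www.w3.org/2000/svg">'
    '<g><path id="CA-ON" d="M0 0"/><path id="US-NY" d="M1 1"/></g>'
    "</svg>"
)
root = ET.fromstring(svg_text)

unqualified = root.find("g")              # None: the bare tag name does not match
group = root.find("{%s}g" % SVG_NS)       # found: namespace plus tag name

fills = {"CA-ON": "#3F007D"}              # placeholder region-to-colour mapping
for element in group.iter():
    region = element.get("id")
    if region is not None:
        # Regions missing from the mapping get the lightest colour.
        element.set("fill", fills.get(region, "#EFEDF5"))

colored_svg = ET.tostring(root, encoding="unicode")
print(colored_svg)
```

Looking up the id attribute directly, rather than looping over every attribute value, also avoids matching an ISO-looking string that happens to live in some other attribute.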

Once that was done, I ran:

$ python colourize.py > hipmap.svg

And, amazingly enough, I have a map that shows me things I already knew.

north american hip map

The darker the purple, the more often they’ve played in that SPT. I’m pretty pleased with myself, even though this is the simplest possible visualization I could do and it took me a lot longer than I feel it should have. Seriously, I spent many days fighting with XML parsers, trying to get it figured out. But I’m definitely curious to know how the code could be better, because it’s obvious to me that it could be.

In the meantime, I think I’m going to try doing something with google maps, incorporating more granularity in terms of where they’ve played and on what tour. I’d also like to do something with setlists, but that data is a little harder to get into a useful format. It’s out there, for sure, thanks to fans more obsessive than I am, but it’s going to take me a while to figure out what I want to do and how I want to do it.

Before I do any of that, though, I’m going to go see some Hip concerts.

Oh, right. The full script, for completeness’ sake:

import csv
import re
from lxml import etree

def do_stuff():
    # okay. step one is to dump the csv into a list of lists.
    concerts = [r for r in csv.reader(open('concerts.csv', 'r'))]

    # so we have a list of lists. now we need the strings to be useful. while
    # you're in there, strip out any whitespace, because we hate whitespace
    # here in python land.
    for show in concerts:
        if show[4].strip() == "Canada":
            show[3] = "CA-" + "%s" % (show[3])
            # fix the PQ/QC thing in the data
            if show[3] == "CA-PQ":
                show[3] = "CA-QC"
        elif show[4].strip() == "United States":
            show[3] = "US-" + "%s" % (show[3].upper())

    # okay, now we need to find out how often they play in a given place. so,
    # uh, let's build a dictionary.
    north_american_shows = {}
    for show in concerts:
        # if the state's not already there, add it. the first time through,
        # it will never be there, so everything will be set to 0.
        if show[3] not in north_american_shows.keys():
            north_american_shows[show[3]] = 0

        # look, i know this is stupid. i don't care. just count, okay.
        if show[3] in north_american_shows.keys():
            frequency = north_american_shows[show[3]]
            north_american_shows[show[3]] = frequency + 1

    colors = ['#EFEDF5', '#DADAEB', '#BCBDDC', '#9E9AC8', '#807DBA', '#6A51A3',
              '#54278F', '#3F007D']

    tree = etree.parse('map.svg')
    tags = tree.find('{http://www.w3.org/2000/svg}g')

    for tag in tags.iterdescendants():
        for a in tag.attrib.values():
            if re.match(r'(CA|US)\-.{2}', a):
                if a in north_american_shows.keys():
                    if north_american_shows[a] > 100:
                        color_class = 7
                    elif north_american_shows[a] > 75:
                        color_class = 6
                    elif north_american_shows[a] > 50:
                        color_class = 5
                    elif north_american_shows[a] > 20:
                        color_class = 4
                    elif north_american_shows[a] > 10:
                        color_class = 3
                    elif north_american_shows[a] > 5:
                        color_class = 2
                    else:
                        color_class = 1
                else:
                    color_class = 0

                color = colors[color_class]
                tag.set('fill', color)

    print etree.tostring(tree)

if __name__ == "__main__":
    do_stuff()

adventures in app development

There are many important pieces to The Puzzle Of Pam, and one of the biggest is my unreasonable and undying love for delicious. I think I have four — sorry, five — accounts there, plus access to quite a few more. I am often asked how I spent my weekend, and it’s not unusual for me to say something like, “re-tagged everything in delicious,” or something that might sound benign except for the fact that I have multiple thousands of bookmarks and only some of my management tasks have been scripted.

Anyway. One of the things about del that makes me a sad pamda is their bundle management interface. I know that many people don’t bundle at all (just yesterday I was asked, “what’s a bundle?”), but I like it. It keeps my tags organized and readable and useful, both to me and to other people. Depending on what you’re doing with the account, bundles can be used to impose some hierarchical structure on an otherwise flat setup. So I like bundling, and it’s important to me and to my likeminded friends. And I feel that bundling should be full of fun, but actually it is full of nails, sorrow, and repetitive stress injuries.

I was going to talk about WHY I hate the bundle management interface, and about the greasemonkey scripts and user styles and crazy tagging hacks that make it slightly less painful, but that is also not the point of this post. [This is one of the things that's difficult for me about this blog. How much background do I give? My usual MO is to just start waving my hands in the air and shouting, with the expectation everyone will be able to follow along. I'm not sure that's the case here, but... I'm also not sure I care! I'm awesome that way.]

The point of this post is that I have, after a few years of threatening to do so, started working on my Glorious Bundle Management App Of Magnificent Amazingness. It needs a name that does not consist entirely of adjectives, and probably it also needs unicorns and sparkles. The thing is, though, that I’m not an app developer. I am a test developer. And so it’s a strange and interesting learning experience to try to build something for other people to use. I’m not used to considering other people! I design for my own needs, and then the second I show it to someone else, it breaks. And my needs are… specific and strange, especially when it comes to delicious bundle management.

I’m still not sure what’s going to work and what isn’t, and the design has already changed a little bit as I’ve worked on it, but the basic idea is an interface that pulls in all your bundles, shows you the tags in those bundles, and allows you to add or remove tags via drag-and-drop. There will be a list down one side of the page that will show you your unbundled tags or your bundled ones; if it’s showing you all tags, you can hover over them and see which bundles those tags are currently in. None of this is built into the del management interface. You have to use a GM script to see what bundles a tag is in. You have to go back and forth between a lot of screens to see what tags are in which bundles. Etc etc. So I feel my app will transform bundling from something that is soulsucking to something that is soulenriching.

I’m not very far along, for a few reasons. One, I just started. Two, I’m writing it in PHP and jQuery, neither of which I know all that much about. I find it a little sadmaking that I would rather dig through the increasingly foggy recesses of my mind to try to remember where the fuck to put semicolons (EVERYWHERE) than try to figure out how to deploy a Python app. I just don’t want to mess with it. I feel like, even for someone who knows Python fairly well, and who deploys applications to production servers on a regular basis, the deployment (and, to some extent, the development) of Python web apps sucks. [Yeah, yeah, appengine. It's STILL more difficult than it needs to be, IMO. I am lazy.] I just want to write code for a fairly simple web app, and I don’t want to have to do much else. And PHP makes that really fast and easy, even if the combination of PHP and javascript means I’m having nightmares about curly braces.

At my job, we’re big on iterative development, and I’ve watched it work for years, so that’s what I’ve been trying to do. I have a nice step-by-step list that does not include any items like, “5. ??????” or “17. Make it work.” I even have user stories. I’ve noticed, though, that it’s really easy to get distracted. Granted, I am an easily distracted person. But when it’s just me working on something that is largely for myself, when there’s no one to say, “um, Pam, probably you do not need to spend the next seven hours writing CSS for a login form, because the login form does not currently log anyone in to anything,” I tend to get a little lost.

And, like I said, some of that is me being me, and some of it is that it’s my project and I have to be my own PM, but I also feel like I’m falling into a rabbit hole I see other people going down. I think it’s a danger in user-focused javascripty apps in general, to spend a lot of time up front focusing on making it pretty “for the user,” when the user would probably like to have something to use. I like to think that users would rather have something that works than something that looks good sucking.

So anyway, I have to keep reminding myself of that, and reining myself in, and this process has been slow and a little painful, but also very illuminating. Updates as warranted.