Using Google refine to produce a list of Ontario’s private schools

Last week, I excitedly read the announcement and watched the screencasts of Google Refine 2.0.

Today, I used it to transform this hot mess of HTML:

Into this structured data goodness:


This is really powerful. It would have been incredibly useful when I was compiling a list of every local election official in the United States for TurboVote. Or when I was making a list of every secondary school, CEGEP, college and university in Canada for CMSF. But I was done with that.

Or so I thought. It turns out the Ontario Ministry of Education’s list of schools is broken. (I called and told them about this weeks ago.) And unlike most other provinces, the ministry doesn’t provide a spreadsheet listing all schools. The school finder does have a list of schools. And each school has its own web page with the data I need. So I wrote a scraper, downloaded 934 pages of HTML and parsed the data in to a spreadsheet. Done, right? Wrong.

What’s missing from the list? Ontario’s private schools. That page is hard to find — it’s buried in the FAQ for parents. The list of schools is hidden behind a submit button.

So now to extract all that data with Google refine. First, I used TextMate to remove the extraneous HTML, giving me just the list of schools. Importing that a running some transforms got me 90% of the way there. Since I couldn’t figure out how to merge the last few freeform columns (enhancement request filed), so I exported to TSV, opened in Excel, merged a few columns and re-imported into Google refine. I then extracted the school types, affiliations, OSSD credit-granting status and association membership.

Check out the files I used:

Join the Conversation

3 Comments

  1. Hey Paul. Great stuff. Have you all the school data somewhere? I’m move to the GTA and it’s been difficult to find the school information in a useful for so I would appreciate getting a hold of your list. I’ll bang it up web mashup so I can look at houses for rent next to schools…

Leave a comment

Your email address will not be published. Required fields are marked *