Last week, I excitedly read the announcement and watched the screencasts of Google Refine 2.0.
Today, I used it to transform this hot mess of HTML:
Into this structured data goodness:
This is really powerful. It would have been incredibly useful when I was compiling a list of every local election official in the United States for TurboVote. Or when I was making a list of every secondary school, CEGEP, college and university in Canada for CMSF. But I was done with that.
Or so I thought. It turns out the Ontario Ministry of Education’s list of schools is broken. (I called and told them about this weeks ago.) And unlike most other provinces, the ministry doesn’t provide a spreadsheet listing all schools. The school finder does have a list of schools. And each school has its own web page with the data I need. So I wrote a scraper, downloaded 934 pages of HTML and parsed the data in to a spreadsheet. Done, right? Wrong.
What’s missing from the list? Ontario’s private schools. That page is hard to find — it’s buried in the FAQ for parents. The list of schools is hidden behind a submit button.
So now to extract all that data with Google refine. First, I used TextMate to remove the extraneous HTML, giving me just the list of schools. Importing that a running some transforms got me 90% of the way there. Since I couldn’t figure out how to merge the last few freeform columns (enhancement request filed), so I exported to TSV, opened in Excel, merged a few columns and re-imported into Google refine. I then extracted the school types, affiliations, OSSD credit-granting status and association membership.
Check out the files I used:
Awesome. Now all you need is a kid to send to one of these schools ;-p
Hey Paul. Great stuff. Have you all the school data somewhere? I’m move to the GTA and it’s been difficult to find the school information in a useful for so I would appreciate getting a hold of your list. I’ll bang it up web mashup so I can look at houses for rent next to schools…
Email me and I’ll get you the GTA schools.
Leave a comment