This is really powerful. It would have been incredibly useful when I was compiling a list of every local election official in the United States for TurboVote. Or when I was making a list of every secondary school, CEGEP, college and university in Canada for CMSF. But I was done with that.
Or so I thought. It turns out the Ontario Ministry of Education’s list of schools is broken. (I called and told them about this weeks ago.) And unlike most other provinces, the ministry doesn’t provide a spreadsheet listing all schools. The school finder does have a list of schools. And each school has its own web page with the data I need. So I wrote a scraper, downloaded 934 pages of HTML and parsed the data in to a spreadsheet. Done, right? Wrong.
So now to extract all that data with Google refine. First, I used TextMate to remove the extraneous HTML, giving me just the list of schools. Importing that a running some transforms got me 90% of the way there. Since I couldn’t figure out how to merge the last few freeform columns (enhancement request filed), so I exported to TSV, opened in Excel, merged a few columns and re-imported into Google refine. I then extracted the school types, affiliations, OSSD credit-granting status and association membership.
Check out the files I used: