This weekend has been spent trying to make head or tail of the various ways that the recent UK election results have been summarised on Wikipedia.
The good news: each constituency has its own page, with a name that's easily findable. There's even an index page with a list of them all.
The bad news: while there are a few common themes in the way the pages are laid out, there are also a lot of special cases.
The result: while I can get a good proportion of the results quite easily, the remaining few are tricky to grab automatically. It's now got to the point where I'm starting to play whack-a-mole. Every time I tweak my parser to work on a currently non-working page, it more than likely breaks one that's currently working.
Heigh-ho! Such is the way of parsing human-editable content.
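One thing that might take some of the sting out of the whack-a-mole is a small regression check: keep a saved copy of each page that already parses, alongside the result confirmed by hand, and re-run the lot after every tweak. The sketch below is only an illustration of that idea. The `parse_results` function, the two made-up layouts and the `FIXTURES` data are all hypothetical stand-ins, not the parser or the pages described above.

```python
import re
from typing import Callable, Dict, List, Tuple

Result = List[Tuple[str, int]]  # (candidate, votes) pairs

# Hypothetical saved snippets standing in for two page layouts, each paired
# with the result that was checked by hand. Real fixtures would be saved pages.
FIXTURES: Dict[str, Tuple[str, Result]] = {
    "layout_a": (
        "Alice Smith: 12,345\nBob Jones: 6,789",
        [("Alice Smith", 12345), ("Bob Jones", 6789)],
    ),
    "layout_b": (
        "Alice Smith | 12345\nBob Jones | 6789",
        [("Alice Smith", 12345), ("Bob Jones", 6789)],
    ),
}


def parse_results(text: str) -> Result:
    """Toy parser (an assumption): accepts 'name: votes' or 'name | votes' lines."""
    rows: Result = []
    for line in text.splitlines():
        match = re.match(r"^\s*(.+?)\s*[:|]\s*([\d,]+)\s*$", line)
        if match:
            votes = int(match.group(2).replace(",", ""))
            rows.append((match.group(1), votes))
    return rows


def failing_pages(
    parser: Callable[[str], Result],
    fixtures: Dict[str, Tuple[str, Result]],
) -> List[str]:
    """Return the names of fixtures whose parsed output no longer matches."""
    return [
        name
        for name, (text, expected) in fixtures.items()
        if parser(text) != expected
    ]


if __name__ == "__main__":
    broken = failing_pages(parse_results, FIXTURES)
    print("all fixtures pass" if not broken else f"regressions: {broken}")
```

Every time a tricky page is finally made to parse, its saved copy and expected result go into the fixtures, so the next tweak that breaks it is flagged straight away rather than discovered later.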
Part of the challenge is that the format used is quite loose too, and a lot of people have tried to write good parsers for it. I'm currently using this library, but with a few manual tweaks to ensure some of the pages parse as expected.
[Image credit: Jeremy Keith]