Wikipedia parsing

This weekend has been spent trying to make head or tail of the various ways that the recent UK election results have been summarised on Wikipedia.

The good news: each constituency has its own page, with a name that's easily findable. There's even an index page with a list of them all.

The bad news: while there are a few common themes in the way the pages are laid out, there are also a lot of special cases.

The result: while I can get a good proportion of the results quite easily, the remaining few are tricky to grab automatically. It's now got to the point where I'm playing whack-a-mole: every time I tweak my parser to handle a page that doesn't currently work, it more than likely breaks one that does.
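A toy illustration of why the special cases bite (this isn't my actual parser, and the constituency data here is made up): a naive key=value extraction works fine on a tidy infobox, but falls over as soon as a value nests a template or spans multiple lines.

```python
import re

# A simplified, well-behaved infobox; real pages are far messier.
WIKITEXT = """{{Infobox UK constituency
| name       = Exampleshire North
| electorate = 72,000
| mp         = A. Member
}}"""

def parse_template_params(wikitext):
    """Extract key=value parameters from a simple wikitext template.

    Only handles one parameter per line with a flat value. Nested
    templates, links, and multi-line values all defeat this regex --
    which is exactly where the whack-a-mole starts.
    """
    params = {}
    for match in re.finditer(r"^\|\s*(\w+)\s*=\s*(.+?)\s*$",
                             wikitext, re.MULTILINE):
        params[match.group(1)] = match.group(2)
    return params

print(parse_template_params(WIKITEXT))
# Works here, but a value like {{nowrap|72,000}} would come back mangled.
```

The moment an editor wraps a value in another template, or splits it across lines, a rule like this silently returns garbage, and patching the regex for that page risks breaking the tidy ones.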

Heigh-ho! Such is the way of parsing human-editable content.

Part of the challenge is that wikitext itself is quite a loose format - plenty of people have tried to write good parsers for it. I'm currently using this library, but with a few manual tweaks to ensure some of the pages parse as expected.

Getting there!

[Image credit: Jeremy Keith]