The Path to Philosophy
Wikipedia trivia: if you take any article, click on the first link in the article text not in parentheses or italics, and then repeat, you will eventually end up at “Philosophy”.
— Randall Munroe, title text of xkcd/903.
The method described in the quote at the top of this page is actually away to make a graph out of Wikipedia, I realized. Some sort of reverse tree. Not exactly the right description, since cycles are possible (and do exist on Wikipedia).
I wondered whether it would be possible to write a script that starts from some article on Wikipedia and goes on until it finds Philosophy or ends up in a cycle.
As it turns out, it is. Once you get past the parsing of HTML, it is not that hard, even. For that, I avoided regular expressions, and went straight to html.parser.HTMLParser
. Can’t say I used it before, but it was a breeze. 90% of the code in the parser is concerned with keeping track of all the conditions whether it is the right time to follow a link. Surprisingly, handling the “not in parentheses” condition was the simplest. I want to keep the code out of this post and in a Gist, but I can’t resist sharing this snippet beforehand:
def handle_data(self, data): self.open_parens += data.count('(') - data.count(')')
For my walker, I ran into the problem of redirects: this could cause problems — or at least redundancy — for cycles. I solved the problem by making what can be described as a dictionary of sets. I sought help from Stack Overflow, and got a neat new toy out of it, but in the end I just went with what I had already.
Long story short: most articles seem to end up in Philosophy eventually (often in ~20 steps), but some end up in a cycle. It is possible that there is some article in there without a valid link, which means there could be dead ends (I don’t check for that, only for Philosophy and cycles), but I haven’t found one yet.
I thought of building an extensive network of Wikipedia articles, but decided against it because of the high volume of requests I would need to send to Wikipedia for it.
And finally, for those of you craving code, here it is. Requires Python 3.
“I thought of building an extensive network of Wikipedia articles, but decided against it because of the high volume of requests I would need to send to Wikipedia for it.”
You can still do that! Wikipedia shares a dump of their database: http://en.wikipedia.org/wiki/Wikipedia:Database_download
I thought of that, but the volume of downloading I would have to do anyway put me off.