Tuesday, March 16, 2010

Combating link rot on Wikipedia

One of the main principles of Wikipedia is verifiability: the idea that any fact you find in an article can also be found in a reliable external source (that's why there are so many footnotes in any given Wikipedia article). These external sources can be offline paper publications or, more often than not, online web pages. Unfortunately, web pages often change or become unavailable, a process nicknamed link rot, which undermines verifiability.

One way to combat link rot and to ensure that a reader can always find the sources behind a Wikipedia article is to rely on online archiving services such as the Internet Archive or WebCite. The idea is to submit each linked web page to an archive so that a copy of the referenced page exists if the original becomes unavailable.

There is no automatic way to submit all links on a Wikipedia to an archive, and different projects have come up with different solutions. The English Wikipedia used to send every new link added to its articles to the WebCite archive (to the point that the archive had to increase its server capacity). The French Wikipedia has devised a way to link to archived versions of linked pages at the Wikiwix search engine, but I don't know the particulars.

So far the Hungarian Wikipedia doesn't have a systematic way of eliminating dead external links. As a first step in the right direction, I slightly modified a component of the Pywikipedia framework to go through every single page in the Hungarian Wikipedia and send every external link to the WebCite archive. The method was inefficient: I am not a programmer, and both Python and the WebCite website often crashed. (An ideal program would have used the external links database dump, which contains only the links without the irrelevant article text.)
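
For the curious, the general shape of such a pass can be sketched in a few lines of Python. This is not my modified Pywikipedia component, only a rough illustration under some assumptions: the names `extract_external_links` and `archive_links` are made up for this example, the regular expression is a simplification that will miss or mangle some URLs, and `submit` is a hypothetical function that you would have to write yourself to hand a URL to the archive. The sketch remembers which links were already sent, so a run that crashes halfway can be resumed without resubmitting everything.

```python
import re

# Simplified pattern for http(s) URLs in wikitext. It stops at whitespace,
# brackets, pipes, braces, angle brackets and quotes, so it is only an
# approximation of what MediaWiki itself treats as an external link.
URL_PATTERN = re.compile(r"https?://[^\s\[\]<>|\"{}]+")


def extract_external_links(wikitext):
    """Return the distinct external URLs in wikitext, in order of first appearance."""
    seen = set()
    links = []
    for match in URL_PATTERN.finditer(wikitext):
        # Drop trailing punctuation that usually belongs to the sentence.
        url = match.group(0).rstrip(".,;:!?)")
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links


def archive_links(pages, submit, already_sent=None):
    """Call submit(url) once for every distinct link found in pages.

    pages        -- iterable of (title, wikitext) pairs
    submit       -- hypothetical callable that sends one URL to the archive
                    and returns True on success
    already_sent -- URLs submitted by an earlier run (lets you resume)

    Returns (sent, failed): the set of all submitted URLs and a list of
    (title, url) pairs that could not be submitted.
    """
    sent = set(already_sent or ())
    failed = []
    for title, wikitext in pages:
        for url in extract_external_links(wikitext):
            if url in sent:
                continue
            try:
                ok = submit(url)
            except Exception:
                # The archive or the network may be down; note it and move on.
                ok = False
            if ok:
                sent.add(url)
            else:
                failed.append((title, url))
    return sent, failed
```

Saving the returned `sent` set to a file between runs and passing it back in as `already_sent` is what makes a periodic rerun cheap: only links that are new or that failed last time are submitted again.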

As a result of my efforts, the vast majority of the external web pages that were linked from the Hungarian Wikipedia and were alive at the end of 2009 can now be found in the WebCite archive. (Such as this copy of the Nobel prize website.) I will run my program periodically to include new links added to articles.

The logical extension of my work would be to add links to the archived versions next to the original links when a page dies. This could be done either manually or automatically; however, I have neither the expertise nor the time to make this happen.
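
To give an idea of what the automatic variant might involve, here is a minimal sketch of the text-editing part only. It assumes you already know that a given URL is dead and what the address of its archived copy is. The function name `add_archive_link`, the "archived" label and the example addresses are all made up for illustration, and the sketch only handles links written in the bracketed `[url label]` form, not bare URLs or citation templates.

```python
import re


def add_archive_link(wikitext, dead_url, archive_url, label="archived"):
    """Append a bracketed link to archive_url after each bracketed link to dead_url.

    Links that are already followed by the archive link are left alone,
    so running the function twice does not add the link twice.
    """
    pattern = re.compile(
        r"\[" + re.escape(dead_url) + r"(?:\s[^\]]*)?\]"   # [dead_url optional label]
        + r"(?!\s*\[" + re.escape(archive_url) + r")"      # not already annotated
    )
    addition = " [" + archive_url + " " + label + "]"
    # A function is used as the replacement so that backslashes in the
    # URLs are never interpreted as regex escapes.
    return pattern.sub(lambda m: m.group(0) + addition, wikitext)


text = "Source: [http://example.org/a Example page]."
print(add_archive_link(text, "http://example.org/a", "http://archive.example/xyz"))
```

Deciding that a page is really dead, and not just temporarily unreachable, is the harder part and is not covered here.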