Wiley using fake DOIs to trap web crawlers…and researchers
Over the holiday weekend, some interesting news broke via the Twitter account of Rik Smith-Unna, a PhD student at Cambridge University in the UK. In a Google Doc he shared, Smith-Unna described how his entire institution was blocked from accessing all of Wiley’s materials on the assumption that a legitimate, academic information-mining crawl was, in fact, a botnet or some similarly sinister process. He goes on to describe Wiley contacting the university to determine the source of this “data breach”. At the root of all this confusion? Several DOIs, or Digital Object Identifiers, assigned to resources associated with Wiley products.
A brief aside for anyone unfamiliar with DOIs: a DOI is a sequence of letters and numbers meant to uniquely identify a particular digital object (hence the name), and DOIs are widely used for scholarly articles published online. They allow any user to quickly and easily navigate to the primary online home of any such article, regardless of whether platforms, URLs, or journal names have changed since the item’s publication; this is an invaluable service for researchers and librarians, among others. CrossRef, a not-for-profit association of publishers, handles much of the assignment of DOIs for scholarly materials.
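For readers who like to see the moving parts, here is a minimal sketch in Python of what "navigating by DOI" amounts to. It relies on two general facts about DOIs: every DOI begins with a prefix starting with "10.", followed by a slash and a suffix, and the public resolver lives at https://doi.org/. The function names and the sample DOI are hypothetical, made up for illustration; nothing here comes from Wiley, CrossRef, or Smith-Unna's crawler.

```python
from typing import Tuple
from urllib.parse import quote


def split_doi(doi: str) -> Tuple[str, str]:
    """Split a DOI into (prefix, suffix) at the first slash.

    This is a loose sanity check, not a full validator: it only confirms
    the "10." prefix and the presence of a non-empty suffix.
    """
    doi = doi.strip()
    prefix, sep, suffix = doi.partition("/")
    if not sep or not prefix.startswith("10.") or not suffix:
        raise ValueError(f"not a DOI: {doi!r}")
    return prefix, suffix


def resolver_url(doi: str) -> str:
    """Build the doi.org link that redirects to the item's current home."""
    prefix, suffix = split_doi(doi)
    # Percent-encode unusual characters in the suffix, keeping slashes.
    return f"https://doi.org/{prefix}/{quote(suffix, safe='/')}"


# Hypothetical DOI, used only as an example.
print(resolver_url("10.1234/abc.2015.001"))
```

The point of the sketch is that the link depends only on the identifier, never on the publisher's current URL scheme, which is why a DOI that points at a trap rather than a real item undermines the whole arrangement.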
So what was the problem with the Wiley DOIs accessed by Smith-Unna? In short, they were fake: dummy DOIs meant to catch anyone attempting to crawl through, and harvest information on, materials hosted by Wiley online. On the surface, this doesn’t seem like poor practice on the part of the publisher. However, in addition to blocking access in response to legitimate scholarly inquiry (as happened to Smith-Unna), there are some serious issues with this approach.
First, as mentioned by Geoffrey Bilder (CrossRef’s Director of Strategic Initiatives) in a reply to Smith-Unna’s tweet, CrossRef discourages publishers and content platforms from using fake DOIs or DOI-related things in this way. As Smith-Unna points out, the value of DOIs is that they are both unique and stable; a DOI remains the same regardless of any platform or URL changes, and each DOI is associated with one, and only one, digital item. Muddling this up with “fake” DOIs that are neither stable nor unique compromises the whole system. Can you imagine a researcher trying to access a legitimate article through its DOI and getting rerouted to one of these dummy DOI pages? Not only would this be very problematic for the researcher, it could also lead to a situation like the one Smith-Unna and Cambridge found themselves in. Independent researchers and researchers at smaller institutions may also have less ability to convince a large publisher that their access to that DOI was legitimate.
Second, there are likely better ways of combating unauthorized crawling. Many other sites do this just fine without relying on dummy DOIs. According to Bilder, in conversation with Smith-Unna on Twitter, this is not the first time CrossRef has encountered this behavior, but other platforms have stopped after CrossRef contacted them. If other platforms in the scholarly publishing realm are able to cope adequately with the threat of unauthorized crawling without resorting to dummy DOIs, then why use them? The obvious answer is that it is easy, but I expect that, now that Bilder is aware of the situation, Wiley and its platforms will cease using this method.
Ideologically, though, this is of concern because it demonstrates a lack of understanding of, or respect for, researchers on the part of Wiley and other publishers. As Bilder tweeted, this is not the first time CrossRef has had to deal with this issue, meaning that Wiley is not the first academic publisher to use dummy DOIs or “DOI-like things”. DOIs have become an important part of maintaining the scholarly record; as I’ve said a few times above, they are unique and stable identifiers for (among other things) journal articles, and most citation styles now recommend including them in lists of references in order to simplify the process of tracking down cited articles. They may not be entirely essential to research, but you’d be hard-pressed to come up with an argument that they are not extremely helpful to it. By setting up fake DOIs, Wiley and other publishers are demonstrating that saving money is more important to them than maintaining the integrity of the DOI system. Preventing unauthorized crawling is something these platforms should be doing, yes, but setting up fake DOIs is only one way of doing this. It is a cheap and easy way of doing it, but that shouldn’t matter more to these publishers than the integrity of a system that is so beneficial to their consumers.
That is why we should find this kind of behavior concerning: it demonstrates that supporting research is not the chief priority of these publishers.