Sunday, January 25, 2015

Breaking the chain: the fragile web of metadata

I can still hear you sayin'
You would never break the chain
(Never break the chain)

Fleetwood Mac - The Chain Lyrics | MetroLyrics 


One of the amazing things about working with electronic resources in the library environment is how easy we can make things for our users.  Back when I was an undergrad, in what now seems like the Jurassic Period, researching a topic meant going to the library and using the card catalog to find books, then venturing into the reference room to confront the vast shelves of periodical indexes.  It was a time-consuming process that involved checking for the same topics in volume after volume of the same index, writing down citations, and then hoping that (1) your library had the title and (2) someone hadn't stolen or defaced the issue you needed (the dreaded microfilm format had its own problems).

Now, thanks to the Internet and incredible advances in online indexing and publishing, a researcher can go to a library's databases, research a topic, and be presented with appropriate citations.  Many of these citations will link directly to the article; the rest will have a link to the library's link resolver, which will offer links to the article, to the library's catalog (to look for a print copy), to the ILL system, or to other appropriate services.

This link resolver software (explained in detail here) is, of course, metadata-driven, and relies on specific information to form the link that is presented to the user.  This metadata may include journal title, ISSN, article title, volume, issue, and page number, or other information, but frequently the link will instead be based on the Digital Object Identifier (DOI) assigned to the article.  In theory, this is a highly stable way of creating links, as the DOI should remain the same even if the article moves to another location.  Let me reiterate that:  in the unstable world of electronic journal publishing, the DOI is a persistent identifier that solves many of the problems that arise when journals change publishers, platforms, or URLs.

This is an example of a DOI-based link:

http://ezproxy.lib.calpoly.edu/login?url=http://dx.doi.org/10.1017/S1368980008002152?nosfx=y

This takes our users to this article:

 Yeh-Chung Chien, Ya-Jing Huang, Chun-Sen Hsu, Jane C-J Chao and Jen-Fang Liu (2009). Maternal lactation characteristics after consumption of an alcoholic soup during the postpartum ‘doing-the-month’ ritual. Public Health Nutrition, 12, pp 382-388. doi:10.1017/S1368980008002152. 
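For readers who like to see the moving parts, here is a minimal sketch in Python of how a link like the one above could be assembled from nothing but a DOI.  It is not how any particular link resolver product works; the function name `doi_link` is hypothetical, and the proxy prefix, resolver address, and `?nosfx=y` suffix are simply copied from the example link and treated as opaque strings.

```python
from urllib.parse import quote

# Values copied from the example link in this post; another library
# would substitute its own proxy prefix.
EZPROXY_PREFIX = "http://ezproxy.lib.calpoly.edu/login?url="
DOI_RESOLVER = "http://dx.doi.org/"


def doi_link(doi, proxy_prefix=EZPROXY_PREFIX, resolver=DOI_RESOLVER,
             suffix="?nosfx=y"):
    """Build a proxied DOI link (hypothetical helper, for illustration)."""
    doi = doi.strip()
    # Citations often write the identifier as "doi:10.xxxx/yyyy".
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    # Every DOI starts with "10." and has a "/" between prefix and suffix.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"does not look like a DOI: {doi!r}")
    # Percent-encode unusual characters but keep the "/" separator.
    return f"{proxy_prefix}{resolver}{quote(doi, safe='/')}{suffix}"


print(doi_link("doi:10.1017/S1368980008002152"))
```

Notice that the publisher's own address appears nowhere in the result.  The link depends entirely on the DOI resolver to know where the article currently lives, which is exactly what makes it stable when a journal moves.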

So, here's the part that isn't so widely advertised, but should surprise no one:  library link resolver technology isn't perfect.  In fact, I get dozens of emails every day from users reporting broken links.  Sometimes this is user error or confusion, especially when we can't link to the article level (not every publisher has the necessary infrastructure ... or provides the necessary metadata), but sometimes it's obvious that something is seriously wrong.

This last Tuesday, January 20, I suddenly saw a huge upsurge of broken link reports.  It didn't take much detective work to see that they were all DOI-based ... and every link went to a "not found" page for doi.org.  There was nothing we could do (reconfiguring our entire link resolver was not an option), and no way our users could reach any of that content except by manually browsing to the citation they needed using our journals A-Z list and drilling down through the publisher's site.
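The "detective work" amounts to asking which host each reported link was ultimately trying to reach.  As a rough sketch in Python (assuming the reported links are EZproxy-style, with the real destination carried in a `url=` parameter as in the example earlier; the function names and the two `example.org` links are made up for illustration):

```python
from collections import Counter
from urllib.parse import parse_qs, urlsplit


def target_host(reported_url):
    """Return the host a reported link ultimately points at."""
    outer = urlsplit(reported_url)
    # Proxied links carry the real destination in the "url" parameter.
    inner = parse_qs(outer.query).get("url")
    if inner:
        return urlsplit(inner[0]).hostname or "unknown"
    return outer.hostname or "unknown"


def tally_hosts(reported_urls):
    """Count broken-link reports per destination host."""
    return Counter(target_host(u) for u in reported_urls)


# Hypothetical reports: the first link is the example from this post,
# the other two are invented.
reports = [
    "http://ezproxy.lib.calpoly.edu/login?url=http://dx.doi.org/10.1017/S1368980008002152?nosfx=y",
    "http://ezproxy.lib.calpoly.edu/login?url=http://dx.doi.org/10.1234/example",
    "http://journals.example.org/article/1",
]
print(tally_hosts(reports).most_common())
```

When nearly every report in the tally shares one host, the problem is almost certainly that host and not the individual articles.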

So what happened?  It turns out that the registration for the doi.org domain name, which is home to the dx.doi.org resolver, didn't get renewed.  The entire site went dark, breaking all of the services that rely on it, including those of CrossRef, FigShare, and many others, among them library link resolvers.  The mistake was discovered within a few hours, and the service was brought back up, but it took several more hours before the restored domain had fully propagated through the Internet.  There is a good write-up about the incident on the CrossRef blog, but this is the most important point from it:


For all the redundancy built into our systems (multiple servers, multiple hosting sites, Raid drives, redundant power), we were undone by a simple administrative task. 

Conclusion:  metadata is necessary, and wonderful, and accuracy is essential, but remember that the systems that rely on it are fragile, and never forget the potential for human error. 
