Saturday, January 31, 2015

A Tasty Schema

The blog post The Bogotá Manhattan recipe + markup by Joho the Blog drew my attention to a whole set of schemas I had no idea even existed. Hosted at schema.org, these codes let web designers mark up their pages in ways that help search engines do a better job of recognizing and categorizing certain elements.  The example post uses the code "itemprop" combined with "ingredients" and "recipeInstructions" to tell search engines that the contents are a recipe and should be included in results for recipe searches on those ingredients.  I viewed the source on a few other recipe sites, and sure enough, there were the itemprop codes! When I looked at schema.org, I found that there are published schemas for all kinds of things, including products, movies, reviews, restaurants, medical conditions, and many more. This really opened my eyes to all the underlying structure in websites that goes way beyond ordinary HTML.
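Out of curiosity, I sketched what that kind of markup might look like, and how little it takes for a machine to pull the recipe parts back out of a page. The HTML below is a made-up example (not the actual Bogotá Manhattan post), and the parsing code is just an illustration using Python's built-in html.parser:

```python
# A minimal sketch, assuming hypothetical recipe content; the markup pattern
# (itemscope/itemtype/itemprop) follows schema.org's Recipe vocabulary.
from html.parser import HTMLParser

RECIPE_HTML = """
<div itemscope itemtype="http://schema.org/Recipe">
  <span itemprop="name">Bogotá Manhattan</span>
  <span itemprop="ingredients">2 oz rye whiskey</span>
  <span itemprop="ingredients">1 oz sweet vermouth</span>
  <span itemprop="recipeInstructions">Stir with ice and strain into a chilled glass.</span>
</div>
"""

class ItemPropParser(HTMLParser):
    """Collect the text content of any element carrying an itemprop attribute."""
    def __init__(self):
        super().__init__()
        self._current = None   # itemprop name we are currently inside
        self.found = []        # list of (itemprop, text) pairs

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemprop" in attrs:
            self._current = attrs["itemprop"]

    def handle_data(self, data):
        if self._current and data.strip():
            self.found.append((self._current, data.strip()))
            self._current = None

parser = ItemPropParser()
parser.feed(RECIPE_HTML)
for prop, text in parser.found:
    print(f"{prop}: {text}")
# name: Bogotá Manhattan, ingredients: 2 oz rye whiskey, ... -- exactly the
# hooks a search engine needs to treat the page as a recipe.
```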

Read-Only Preferred

In his post The Future of Publishing is Writeable, Terry Jones engages in some interesting speculation. He predicts a coming time in which publishing is fragmented into smaller, often serialized chunks; self-publishing is ascendant; and every reader is an active participant in the creation process.

There is no doubt that these things are possible--look at sites like fanfiction.net, which have been publishing user-created content for years, or the self-publishing programs run by Amazon, Barnes & Noble, and Smashwords.

However, it seems to me much likelier that all of the trends Jones mentions will exist in tandem with more traditional publishing norms, rather than replacing them.  There is still a need for both authority and recognized talent.

Take the concept of authority...  If a researcher publishes her findings on an important breakthrough, I want to read it in her words, exactly as she presents it.  If she includes her data for me to peruse, so much the better, but she is the expert.  I simply do not want her research article to be re-edited by anyone and everyone who thinks they have a different interpretation.  That is why the scientific process publishes a version of record, and then other scientists can publish their opinions, under their own professional names, through recognized channels--letters to the journal, for example, or additional research that reproduces, builds upon, or refutes the original work.

Then there is the concept of talent.  As appealing as the concept of a living, growing work of fiction may be, I suspect most of us want to be entertained, and we want some assurance that the entertainment will be worth the investment of our limited time.  Thus the author as brand-name would seem to me to trump the concept of group authorship.  This Forbes article on author and series branding contains a good discussion of how book sales are linked to audience loyalty, which may be built through big-money promotions or through social media such as GoodReads.

Finally, there is Jones' prediction of fragmentation.  I think that in popular media, there may be ever less patience for this.  As MetaDamen pointed out, binge viewing is on the rise, perhaps to the detriment of relatively serialized media like broadcast television.  Even cases where longer works have been broken into smaller chunks are often more closely related to monetization or the perception that the standard-sized container is not sufficient to do justice to the content (Hobbit trilogy, anyone?).  The one case where this argument might be made is the scholarly monograph, particularly where multiple authors wrote the various chapters.  Where these chapters are separately indexed, assigned DOIs, and made available as PDFs, they may as well be separate publications, albeit with a unifying theme.

Wednesday, January 28, 2015

Re: Dummy Proof Data

I really enjoyed the post by Katie Howard // Metadata LS566 regarding "Dummy Proof Data."  I could completely picture the office with papers stacked everywhere as she tried to impose order; I was amused that she "digs" metadata while using wells as her example; and I quite like the idea of the dummy-proof filing scheme.  What I would like to know, though, is whether it was dummy-proof for the next person in her job.  Seeing Katie's passion for organization, I'm going to guess that it was, but that, in my opinion, would be the ultimate dummy-proof filing system.

It's easy when assigning metadata to go to so much effort to make everything findable by the most inexperienced searcher that we create a system that is impossible to maintain.  This can take many forms, including duplicating the data (and the work) in too many places that must be separately maintained, or spending so much time creating access points that we can't keep up with the work, or developing an arcane, undocumented workflow that can't be understood by others.

This tendency can be balanced by asking questions at the outset like "If we need to make a change, how can we make sure it happens in all places?" and "How far do we really need to extend the Dublin Core for this project?" and the old standard "If you win the lottery tomorrow, can someone else step in and do the work?"

So, bring on the dummy-proof discovery schemes, but let's not forget to dummy-proof it for ourselves as well!

Sunday, January 25, 2015

Citation styles and metadata

"Why Are Publishers and Editors Wasting Time Formatting Citations?" is an interesting Scholarly Kitchen post on the potential use of structured metadata to streamline the tedious, time-consuming, and error-prone process of editing citations to fit whichever of the thousands of available styles a particular publisher requires. The author, Todd Carpenter, wonders why a structured citation schema couldn't be developed, so that authors could just paste in the appropriate identifiers and let cross-linking take care of the rest.  In his vision, each identifier would take advantage of existing databases that can act like authority files (e.g. ISSN supplies the correct journal title, ORCID supplies the correct form of the author's name, and so on), eliminating many of the errors that creep into the citation process.  Each publisher would be able to take the citation metadata record and turn it into a human-readable citation using whatever style was preferred, freeing authors from the tyranny of having to proofread each citation for minutiae like punctuation and given name vs. initials.

For a complete trip down the rabbit hole of citation metadata, publisher practices, and author habits, be sure to read the comments.  There is a lively debate regarding the pros and cons of this proposed schema, as well as information on some recent initiatives to develop prototypes that would include this functionality.

DOI and CrossRef explained

"DOI: The 'Big Brother' in the dissemination of scientific documentation," by Miquel Termens, does an excellent job of explaining the reasons and organizations behind DOI (Digital Object Identifiers). The article also gives a clear, concise overview of the inner workings of CrossRef.  (CrossRef, as their site explains, "... allows the user to move from one article to another at the citation level, regardless of journal or publisher.") I would recommend this article for anyone interested in cross linking or the mechanics of access to scholarly publishing.

Although the article is now 8 years old, not too much appears to have changed, other than how DOIs are used in the e-book world. DOIs are now typically used for scholarly e-books in much the same way they are used for e-journals:  assigned at both title and chapter level by those scholarly publishers who host DRM-free PDF e-book versions on their sites (e.g. Springer, Elsevier ScienceDirect, Wiley, JSTOR, and others).  By adding book chapter DOIs, publishers can take advantage of reference linking (CrossRef et al.), facilitate user access through indexing and discovery systems, and allow for limited resource sharing (ILL and scholarly sharing at the chapter level).

This quote from the article illustrates the importance of DOI in the scholarly publishing chain and explains the dismay of everyone involved when an administrative error temporarily brought the dx.doi.org domain down earlier this week:
The Handle System offers an additional improvement—the working URL is always operative and does not require maintenance. URL obsolescence, and the consequently large number of broken links, is one of the major problems in present day internet structure and it slows down the allocation of links between websites [10]. Handle System provides a solution because, for each document, it generates a persistent address, a URN, which can be used to locate it. DOI codes can operate as URNs if they are preceded by an address resolution server, e.g., [http://dx.doi.org/10.1126/science.1088234]; hence, the reader does not need to be concerned about a link’s operating state as this is attended to by the publisher of the referenced journal by maintaining an up-to-date URL in the DOI databases.
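In other words, the DOI itself is just a name; it becomes a working link only when prefixed with the resolver address, and the publisher keeps the DOI-to-current-URL mapping up to date behind the scenes. A tiny, purely illustrative sketch of that step, using the example DOI from the quote:

```python
# A minimal sketch of DOI resolution as described above; this is not
# CrossRef's or the Handle System's code, just the URL-building step.
RESOLVER = "http://dx.doi.org/"   # the address resolution server named in the article

def doi_to_url(doi: str) -> str:
    """Build the resolvable form of a DOI."""
    return RESOLVER + doi

print(doi_to_url("10.1126/science.1088234"))
# -> http://dx.doi.org/10.1126/science.1088234
# If the article later moves, only the mapping behind the resolver changes;
# the DOI (and every citation that carries it) stays the same.
```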

Breaking the chain: the fragile web of metadata

I can still hear you sayin'
You would never break the chain
(Never break the chain)

Fleetwood Mac - The Chain Lyrics | MetroLyrics 


One of the amazing things about working with electronic resources in the library environment is how easy we can make things for our users.  Back when I was an undergrad, in what now seems like the Jurassic Period, researching a topic meant going to the library and using the card catalog to find books, then venturing into the reference room to confront the vast shelves of periodical indexes.  It was a time-consuming process that involved checking for the same topics in volume after volume of the same index, writing down citations, and then hoping that (1) your library had the title and (2) someone hadn't stolen or defaced the issue you needed (the dreaded microfilm format had its own problems).

Now, thanks to the Internet and incredible advances in online indexes and publishing, today's researcher can go to a library's databases, research a topic, and be presented with appropriate citations.  Many of these citations will link directly to the article; the rest will have a link to the library's link resolver, which will offer links to the article, or to the library's catalog (to look for a print copy), ILL system, or other appropriate services.

This link resolver software (explained in detail here) is, of course, metadata-driven, and relies on specific information to form the link that is presented to the user.  This metadata may include journal title, ISSN, article title, volume/issue/page number, or other information, but frequently will instead be based on the Digital Object Identifier (DOI) assigned to the article. In theory, this is a highly stable way of creating links, as the DOI should remain the same even if the article moves to another location.  Let me reiterate that:  in the unstable world of electronic journal publishing, DOI is a persistent link that solves many of the problems that happen when journals change publishers, platforms, or URLs.

This is an example of a DOI-based link:

http://ezproxy.lib.calpoly.edu/login?url=http://dx.doi.org/10.1017/S1368980008002152?nosfx=y

This takes our users to this article:

 Yeh-Chung Chien, Ya-Jing Huang, Chun-Sen Hsu, Jane C-J Chao and Jen-Fang Liu (2009). Maternal lactation characteristics after consumption of an alcoholic soup during the postpartum ‘doing-the-month’ ritual. Public Health Nutrition, 12, pp 382-388. doi:10.1017/S1368980008002152. 
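For the curious, here is a rough sketch of how a proxied, DOI-based link like the one above might be assembled. The proxy prefix and the ?nosfx=y parameter are taken from the example link; what that parameter actually does is my assumption (flags like this typically tell the resolver to skip its intermediate menu), and this is an illustration rather than our link resolver's actual code:

```python
# A minimal sketch: proxy prefix + DOI resolver + DOI + an example resolver
# flag. Assumes the pieces shown in the example link above.
PROXY_PREFIX = "http://ezproxy.lib.calpoly.edu/login?url="   # from the example above
RESOLVER = "http://dx.doi.org/"

def build_link(doi: str, suppress_menu: bool = True) -> str:
    """Assemble a proxied DOI link from citation metadata."""
    target = RESOLVER + doi
    if suppress_menu:
        target += "?nosfx=y"   # assumption: tells the link resolver to skip its menu page
    return PROXY_PREFIX + target

print(build_link("10.1017/S1368980008002152"))
# http://ezproxy.lib.calpoly.edu/login?url=http://dx.doi.org/10.1017/S1368980008002152?nosfx=y
```

The fragility discussed below comes from the middle of that chain: if the resolver host disappears, every link built this way breaks at once, no matter how good the metadata is.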

So, here's the part that isn't so widely advertised, but should surprise no one:  library link resolver technology isn't perfect.  In fact, I get dozens of emails every day from users reporting broken links.  Sometimes this is user error or confusion, especially when we can't link to the article level (not every publisher has the necessary infrastructure ... or provides the necessary metadata), but sometimes it's obvious that something is seriously wrong.

This last Tuesday, January 20, I suddenly saw a huge upsurge of broken link reports.  It didn't take much detective work to see that they were all DOI-based ... and every link went to a "not found" page for doi.org.  There was nothing we could do (reconfiguring our entire link resolver was not an option), and no way our users could reach any of that content except by manually browsing to the citation they needed using our journals A-Z list and drilling down through the publisher's site.

So what happened?  It turns out that the dx.doi.org domain name didn't get renewed.  The entire site went dark, and broke all of the services that rely on it, including CrossRef, FigShare, and many others, among them library link resolvers.  The mistake was discovered within a few hours, and the service was brought back up, but it took several more hours before the restored domain had fully propagated through the Internet.  There is a good write-up about the incident on the CrossRef blog, but this is the most important point from it:


For all the redundancy built into our systems (multiple servers, multiple hosting sites, Raid drives, redundant power), we were undone by a simple administrative task. 

Conclusion:  metadata is necessary, and wonderful, and accuracy is essential, but remember that the systems that rely on it are fragile, and never forget the potential for human error. 

Monday, January 19, 2015

Library Automation - past and present

What struck me most strongly while reading Clifford Lynch's From automation to transformation: Forty years of libraries and information technology in higher education was that it was written in 2000, but could easily have been written in 2010.  Having worked in library technical services in various capacities, starting in the mid-1980s as a student assistant in Cataloging, I recognized and experienced the trends and changes the article so accurately describes.  But what fascinates me is how much of that "present state" from 2000 is still the case today, and how many of those predictions are still issues libraries are grappling with.  With all that change mapped out since the 1960's, how is it that so little was different even 10 years after this article was written?

My personal, admittedly biased and perhaps ill-informed, opinion is that several factors apply.  The year the article was written, 2000, fell in the middle of a period of major changes to desktop hardware, subscription models, and OPAC software.  The years since have perhaps been a time of stabilization and more incremental change.  The other possibility is simply that digitization on this scale is unprecedented and difficult, time-consuming and expensive, confusing and scary.  It's taken a while, perhaps longer than expected, for library systems and library users to catch up to what is now possible.

However, I think we are again on the brink of changes.  Last year, my library spent more on e-books than on print monographs for the first time.  Print journal subscriptions are reduced to popular reading and niche disciplines, and we are withdrawing print journal runs at an astonishing pace.  Libraries are de-emphasizing their OPACs in favor of web-scale discovery systems, and a whole new generation of Library Management Systems is under development, under consideration, or in implementation.  Open Access is a disruptive force in scholarly publishing.  Many of the concerns Lynch mentions are finally being addressed, or at least discussed, including resource sharing in a digital world, access versus ownership of resources, digital preservation, and copyright issues.  I think the next few years will be challenging, and interesting!

Metacrap or Meta-utopia?

Cory Doctorow's essay Metacrap: Putting the torch to seven straw-men of the meta-utopia is nearly 14 years old now.  How old is that in Internet Years?  45?  140?  How's the old girl holding up?

First, in all fairness, predicting the future is hard.  The Internet is awash in embarrassing quotes from people who made unfortunate assumptions and had the misfortune to be famous enough for someone to notice.  However, saying that something is never going to happen is a particularly dangerous game because it takes only a single instance to knock down your whole proposition.  Doctorow cheats a bit by conjuring up a utopia, an ideal world which almost by definition cannot be achieved in the real world; of course it's a "pipe-dream"--that's the nature of utopias.

But let's consider what he's really talking about here.  His arguments seem to divide metadata into two classes:  human-created and machine-created.  Interestingly, he seems to dismiss all human-created metadata as rife with lies, errors, and hidden agendas, while he accepts the kind of metadata used by Google as "far more reliable."

I feel that this was unfair in 2001 and is still unfair in 2015.  First, even the observational metadata he prefers is still mapped to a human-created schema.  Second, the systems that use human-created metadata are often robust enough to counter the error factor.  Sure, you can still find a "Cusinart Whish" listed for sale on eBay, but you'll find it correctly placed in the Food Processor category. Third, Doctorow's meta-utopia appears to be a world in which the user would need to have no skill or digital literacy at all.  In the real world, people always have and always will need to exercise personal judgement.

In conclusion, can I find all the downloadable music?  No, but "all" isn't measurable or attainable.  Can I find enough music to listen to?  Sure.  Can a manufacturer discover suppliers?  Probably, and probably by using the same contacts they always did.  Can I easily choose a hotel room?  Sweet fancy Moses, yes!  And that is meta-utopia enough for me.

Thursday, January 15, 2015

Fly the Metadata Skies

Earlier this week I needed to travel from Tennessee back to California.  During the forced inactivity of several hours of air travel, I let my mind wander to LS 566 and the concept of metadata, and I thought it would be interesting to consider the trail of metadata I left as I crossed the country.  What follows is a kind of simplified metadata outline of my journey. Blue text represents data that I generated or that was retrieved or otherwise associated with me; red and purple indicate where the system broke down.  I have highlighted it to show the incredible amount of metadata that we are surrounded with all the time.  Just like a book, or a digital image, or any of the other library-oriented things (tangible or intangible) we will study and categorize and describe and tag in this class, we ourselves have been thoroughly cataloged, described, and tagged, and are tracked through our everyday lives to an incredible extent.



Check in with Airline:  Frequent flyer number and password retrieve flight information associated with my identity (flight number, ticket number, seat number, departure/arrival times, gate number, etc.); check in process adds metadata confirming my intention to board; checking a bag creates a new set of metadata (size and number of bags, baggage check number) and associates it with my identity.

Google Voice Search:  "Navigate to xxx airport" Google associates my identity in some way with my location, the nearest airport, the optimal route and several alternate routes, traffic events on each route, and drive time.

Drive to Airport:  Follow Google's instructions, but miss a turn in the rain and heavy traffic. Google notes location, offers new route based on my location, adjusts drive time.

Arrive Airport:  Return rental car.  Rental company locates my account (name, credit card number, driver's license, loyalty plan member number, rental history, vehicle assigned to me) and adds additional data (time/date of return, condition of car, mileage, fuel level).

TSA Check/Airport Security:  TSA confirms identity (driver's license, full name, name on boarding pass) and checks for metadata regarding no-fly status and other flags (and confirms that none is found).

Board Airplane:  Airline scans boarding pass (my identity plus my seat number), changes metadata status to "boarded," adds name to passenger list.  Gate check carry-on bag (add claim check number to system).

Send Text Messages:  Alert family members that I didn't miss the flight (message time stamp, cell tower, sender, recipient).

Ride in Airplane:  Airline, air traffic control, and other services track metadata associated with flight (aircraft number, flight status, altitude, speed, and a great deal more).

Get off Airplane/Retrieve Carry-on:  Flight status (landed/arrived, time), baggage removed from flight

Next flight (same metadata as previous flight, no gate check)

Arrive at Destination Airport:  Flight status, etc., changed; checked bag arrival noted.

Pick Up Rental Car:  Reservation data (time/date/car type/pickup and return locations) plus my identity in rental car database; rental company assigns specific car to my identity (car license #, make/model/year, mileage, fuel level, car type); rental company upgrades rental car from mid-size to full-size (presumably another customer in my tiny home rental area has requested the larger car, so my data in the system (return to [small town xxx]) flagged my trip as the delivery mechanism), puts my name and slot number on the member board.  Find car, drive to exit.  Metadata problem!  An unmentioned rule says that the first time you rent as a member, you must present the credit card for inspection; on the 3rd car rental of the trip, someone enforces it.  Metadata missing!  The credit card was left at home.  Have to go to customer service, where they have to cancel the reservation, re-enter all the data, and add the credit card I did have with me (the customer service clerk advises, half seriously, "go to a different lane this time").  Return to exit gate, so the gatekeeper (a different one) can inspect said credit card, entered into the system 5 minutes previously, to see if it matches.  Metadata matches!  We drive away quickly.

Google Voice Search:  "Navigate small town xxx" Same as previous navigation.  Begin route home.

METADATA FAILURE!!!  Chosen route is "closed at the county line due to slides."  Google does not know this!  Google insists we keep on this route.  COMMUNICATION FAILURE:  Road closure data not transmitted to services like Google. Worse, it is local-centric.  "Closed at the county line"--which county?  which line?  north or south of me?  is that before or after my connection to the main route north? Even worse:  the detour sign "Road closed X miles ahead, take xxx road" appears several miles PAST the turn-off.  Turn around, return down several miles of highway to (unmarked) turn-off, Google suggesting U-turns and turn-around routes the entire way to, and a few miles into, the detour.

Metadata failure!  Google finally accepts our location as on an alternate route, but warns of an extra hour travel time due to tunnel construction.  Construction is over for the day, rush hour is past, we cruise through it.  Google notes our location and adjusts the travel time down by an hour.

Dinner stop:  Use credit card at restaurant (identity, credit card number, restaurant identity, time/date, amount, etc.)

Arrive at home town:  Return car.  Metadata failure!  The car reservation had to be arranged to/from this specific location because it was the only one open till midnight.  The rental company's metadata listed this location as open until 12AM.  The sign behind the desk said it was open until 12AM.  At 10:15 PM, the desk (servicing at least 3 rental companies) was entirely dark and deserted. Recorded requested metadata on contract envelope (ending mileage, time, fuel level), dropped papers and keys in after-hours return slot, crossed fingers.  (Next day, received email with all recorded metadata duly listed along with full rental data and total cost.)