This week's reading included several interesting articles on linked data and libraries, including this Library Journal blog post by Enis, this post by Owens, this post by O'Dell, and this article by Gonzales. All of these articles contained similar themes regarding the need to move away from the inflexible MARC format and toward something more amenable to web searching. Some of the problems with current library catalogs include archaic search interfaces, record formats unfriendly to web-based and other "non-print" resources, and their invisibility to search engines.
The hope for BIBFRAME is that it would provide a way forward from MARC and allow library collections to be linked in intuitive and useful ways, so that authors, for example, might be linked not just to their works but perhaps also to members of their social group and events of the time and place where they lived. Perhaps a Google search would be able to easily connect potential users to the nearest available copies of a given book.
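The kind of linking described above can be sketched with a toy triple store. This is only an illustration of the linked-data idea, loosely modeled on BIBFRAME's graph approach; the entity names and properties below are simplified stand-ins I invented, not the official BIBFRAME vocabulary.

```python
# A toy triple store showing how linked bibliographic data can be traversed:
# from a work to its author, and from the author to people and events of
# their time. Names and properties are invented for illustration.
triples = [
    ("work:MobyDick", "hasAuthor", "person:Melville"),
    ("person:Melville", "knew", "person:Hawthorne"),
    ("person:Melville", "livedDuring", "event:AmericanRenaissance"),
    ("work:ScarletLetter", "hasAuthor", "person:Hawthorne"),
]

def objects(subject, predicate):
    """Return all objects matching a (subject, predicate, ?) pattern."""
    return [o for s, p, o in triples if s == subject and p == predicate]

# Follow the links: author of Moby-Dick -> his social circle and era.
author = objects("work:MobyDick", "hasAuthor")[0]
print(objects(author, "knew"))         # ['person:Hawthorne']
print(objects(author, "livedDuring"))  # ['event:AmericanRenaissance']
```

The point is that each answer becomes the starting subject for the next question, which is exactly the kind of hop a flat MARC record can't support.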
All of this is very exciting, and every indication is that this concept, or some aspect of it, will become a Real Thing in the next several years. However, I confess to a little skepticism as to how this will play out in actual practice. For a shared catalog among libraries in a system, a consortium, a regional area, or perhaps the world, this could solve a lot of "shared collection" needs. Doing a Google search for something and getting a book from your local library in the top results would be pretty cool.
However, for general web searching, I worry about electronic resources and the issue of authentication and restricted, licensed access. Getting inaccessible results in a Google search quickly dilutes its usefulness, so how, then, does the search handle this? Are the library's results pushed down in relevance or relegated to a "library" silo or special search (like Google Books, perhaps)? Does frustration with licensed access increase the push for Open Access? Or will most of the licensed resources be sufficiently obscure that the main recipients of those results will likely be academics with ways of attaining access? Google already includes some licensed content in its regular search results, with some ways of authenticating users by IP address (I believe it does this with Elsevier ScienceDirect content), but it doesn't yet do this on the scale proposed by BIBFRAME. It will be interesting to follow this over the next several years!
Sunday, March 15, 2015
Wednesday, February 25, 2015
Superschema or Common Schema?
After reading the articles that inspired them (here and here), it was fun to compare Adam's blog post on the useful simplicity behind Dublin Core to Tonya's post on the plausibility of creating a superschema to rule them all. Using very basic set theory to describe these approaches, it seems to me that Dublin Core takes the approach of the intersection of metadata sets, while the superschema idea consists of the union of all metadata sets. Both ideas seem to posit a system in which all elements would be optional, using only those appropriate to the object being described. However, Dublin Core works by providing a limited number of common elements that can describe nearly anything generically, while the superschema would work by providing an almost unlimited number of elements that could describe nearly anything specifically. What an interesting contrast!
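The set-theory contrast above maps directly onto Python's set operators. The two element lists below are invented stand-ins for real schemas, just to make the intersection/union distinction concrete.

```python
# Element sets from two hypothetical metadata schemas.
schema_a = {"title", "creator", "date", "subject", "runtime"}     # e.g. for video
schema_b = {"title", "creator", "date", "subject", "page_count"}  # e.g. for books

# The Dublin Core approach: keep only the elements common to (nearly) everything.
common_core = schema_a & schema_b   # intersection
# The superschema approach: accumulate every element any schema defines.
superschema = schema_a | schema_b   # union

print(sorted(common_core))  # ['creator', 'date', 'subject', 'title']
print(sorted(superschema))
```

The intersection stays small and generic no matter how many schemas join; the union only ever grows, which is exactly the maintenance problem the superschema would face.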
From a practical standpoint, DC has a lot to offer in terms of interoperability, maintainability, and the ease of building fast indexes with understandable browsing facets. The superschema idea would allow a lot of freedom, but describers would need to have a very broad knowledgebase, and even local systems based on it would be highly complex.
From the user's standpoint, what would the superschema system look like? I suspect that it would look a lot like Google. The search algorithm would probably need to rely on keywords, with relevancy heavily informed by the tags from the various schema (so your search for "sage in herbal remedies" wouldn't be swamped by articles published by SAGE Publishing). While I don't know how their proprietary indexing systems work, this sounds to me an awful lot like library discovery layers, and the direction they are moving in.
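The "sage vs. SAGE" idea can be sketched as tag-weighted scoring: a term matched in a subject tag counts for more than the same term matched in a publisher tag. The weights and sample records below are invented for illustration; real discovery-layer relevancy ranking is proprietary and far more elaborate.

```python
# Made-up field weights: a subject match matters more than a publisher match.
FIELD_WEIGHTS = {"subject": 3.0, "title": 2.0, "publisher": 0.5}

docs = [
    {"title": "Herbal remedies at home", "subject": "sage, herbs",
     "publisher": "Acme Press"},
    {"title": "Quarterly sociology review", "subject": "sociology",
     "publisher": "SAGE Publishing"},
]

def score(doc, term):
    """Sum the weights of every field whose value contains the search term."""
    term = term.lower()
    return sum(w for field, w in FIELD_WEIGHTS.items()
               if term in doc.get(field, "").lower())

ranked = sorted(docs, key=lambda d: score(d, "sage"), reverse=True)
print(ranked[0]["title"])  # the herbal-remedies record, not the SAGE journal
```

With the schema tags informing the weights, the herbalism article (subject match, 3.0) outranks the journal whose only "sage" is in its publisher field (0.5).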
To me, the good news here is that the mix-and-match can, and probably will, happen at a higher level than the metadata schema. Individual systems could continue to use the specialty schema that work best. Knowing other schema is still important in case of migration, but hopefully combining datasets will come to rely on something more sophisticated than the most basic of crosswalks. It will be interesting to see where it all goes!
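A "most basic" crosswalk like the one mentioned above is little more than a tag-to-element lookup table. The mappings below (245 to title, 100 to creator, 650 to subject) follow commonly cited MARC-to-Dublin-Core correspondences, but this is a deliberately minimal sketch; a real crosswalk handles many more tags, indicators, and subfields.

```python
# A minimal MARC-to-Dublin-Core crosswalk as a lookup table.
CROSSWALK = {"245": "title", "100": "creator", "650": "subject", "260": "publisher"}

marc_record = {
    "100": ["Melville, Herman"],
    "245": ["Moby-Dick, or, The whale"],
    "650": ["Whaling--Fiction", "Sea stories"],
}

def marc_to_dc(record):
    dc = {}
    for tag, values in record.items():
        element = CROSSWALK.get(tag)
        if element:  # unmapped tags are silently dropped --
            dc.setdefault(element, []).extend(values)  # the lossy part of crosswalking
    return dc

print(marc_to_dc(marc_record))
```

The silent drop of unmapped tags is exactly why one hopes dataset merging eventually relies on something more sophisticated than this.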
Wednesday, February 18, 2015
Expanding linked data: Why isn't this done?
JPow asks an excellent question in his LS blog: Why can't citation data (cited and cited by) be included in surrogate records?
I'm going to talk about this in terms of discovery layers rather than online catalogs, because online catalogs are typically very basic, MARC-based software that is fading rapidly into the background of library systems.
I think that JPow's concept could be implemented using a combination of CrossRef, Web of Science, and Scopus -- a terrific idea, and one that is implemented already to some degree, in some discovery systems, but definitely not to any universal extent.
What's the catch? I suspect that it's mostly about money and proprietary data. Citation indexes like Web of Science and Scopus are very expensive to maintain, and very big money makers for the companies that own them. They are willing to share basic indexing and citation metadata with the discovery services, but part of the agreement is that libraries must ALSO have licensed the databases separately before they are allowed to see results from those databases included in the discovery layer, and even then much of the "secret sauce" of citation tracing and other advanced functionality isn't included. (In fairness, quite a bit of this wouldn't easily translate to the simpler discovery interface.)
What I haven't seen implemented yet is CrossRef, and that has interesting potential. I think that one catch there is that it tends to be implemented as part of the full text of articles, in the references section. That section of the full text would perhaps have to be separated in some way and included in the metadata stored by the discovery service. I think that's possible, though I don't know if any systems are doing it currently. I think authentication could be the other tricky piece, since CrossRef links directly through DOI. This isn't a huge issue for on-campus users (who are generally authenticated by IP address) but directing off-campus users through the right hoops (proxy server, Shibboleth, etc.) is a potential hurdle.
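To make the CrossRef piece concrete: the CrossRef REST API's works route returns a "message" object that may carry a "reference" list, from which cited DOIs can be pulled into a surrogate record. The parsing below runs against an invented sample shaped like that response (no network call); the sample DOIs and titles are fabricated for illustration.

```python
import json

# A made-up response shaped like the CrossRef REST API's /works/{doi} route:
# the real API returns a "message" object that may include a "reference" list.
sample = json.loads("""
{
  "message": {
    "DOI": "10.1234/example.5678",
    "title": ["An invented article"],
    "reference": [
      {"key": "ref1", "DOI": "10.1234/cited.0001"},
      {"key": "ref2", "unstructured": "A citation with no DOI attached."},
      {"key": "ref3", "DOI": "10.1234/cited.0002"}
    ]
  }
}
""")

def cited_dois(response):
    """Pull the DOIs a work cites; references without DOIs are skipped."""
    refs = response["message"].get("reference", [])
    return [r["DOI"] for r in refs if "DOI" in r]

print(cited_dois(sample))  # ['10.1234/cited.0001', '10.1234/cited.0002']
```

Note the middle reference: plenty of real citations arrive "unstructured" with no DOI, which is part of why the cited/cited-by picture in any one system stays incomplete.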
I did check my library's discovery system (ProQuest Summon) and found that it offers our users "People who read this journal article also read" article links attached to records it gets from Thomson Reuters Web of Science. On the other hand, it doesn't offer any extra links for records it gets from Elsevier (ScienceDirect and Scopus). We see the Web of Science information because we've separately licensed those indexes from Thomson Reuters, and that means Summon is "allowed" to show us those records. We don't see the citation links from Scopus because we haven't licensed that product, so Summon isn't allowed to present any results from that dataset. I also find it interesting that Web of Science appears to share usage-based metadata but is not sharing citation-based metadata; I'm guessing maybe they see that as potentially too cannibalistic to their own service.
So, the short answer? JPow is asking for a rabbit, yes, and it's not from a hat, but from a deep and twisty rabbit hole. I don't think it's asking too much, though I do think it would be expensive.
Friday, February 13, 2015
Fire up the Serendipity Engine, Watson!
So, now that I've bashed the whole idea of serendipity in an earlier post and stressed the importance of both "good enough" discovery and teaching users to effectively utilize the tools we give them, I'm going to go off in a totally different direction and write about something completely whimsical. The terrific article by Patrick Carr that I referenced in my earlier post mentioned a site called Serendip-o-matic. Serendip-o-matic bills itself as a Serendipity Engine that aims to help the user "discover photographs, documents, maps and other primary sources" by "first examining your research interests, and then identifying related content in locations such as the Digital Public Library of America (DPLA), Europeana, and Flickr Commons."
How does the Serendipity Engine work? You paste text into its search box--from your own paper, an article source, a Wikipedia entry, or anything else that seems like a useful place to start--and click the Make some magic! button. The application extracts keywords from that text snippet and looks for matches in the metadata available from its sources, returning a list of loosely related content (mostly images at this point in its development).
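The keyword-extraction step described above can be approximated with a naive frequency count. Serendip-o-matic's actual extraction is surely more involved; this is just a sketch of the general technique, with a made-up stopword list and sample snippet.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "on", "for", "with", "across"}

def extract_keywords(text, n=5):
    """Naive keyword extraction: the most frequent non-stopword terms."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(n)]

snippet = ("The whaling ships of Nantucket hunted whales across the Pacific, "
           "and the whaling industry shaped the island's economy.")
print(extract_keywords(snippet))  # 'whaling' ranks first
```

Those extracted terms would then be thrown at the metadata of DPLA, Europeana, and Flickr Commons, and the loose keyword matching is precisely what produces the unexpected-but-related results.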
The idea behind Serendip-o-matic is the very opposite of most search engines or discovery systems, which seek to deliver results that are as relevant as possible. In fact, the whole point of the Serendipity Engine is to deliver the unexpected, yet somehow related, item that will loosen the writer's block, send the mind whirling in a fresh direction, or make the connection the brain sensed but couldn't quite reach.
Metadata makes magic! Try it and see what you think.
Wednesday, February 11, 2015
Discovery vs Serendipity
I found the responses to "How far should we go with ‘full library discovery’?" from several of my fellow students to be quite interesting. I noticed that MetaWhat!Data! allowed for the idea that others might find full library discovery to be overload, but she herself liked the idea of "going further down the rabbit hole of discovery." I'm always interested in what people would like more of in their library systems, so I wish she had elaborated a bit more on that. JPow, on the other hand, decided to test different systems and report on the differences, always a useful exercise. I enjoyed his fresh perspective and choice to concentrate on pre-filtering capabilities. I wonder what sort of user studies these systems may (or may not) have regarding whether the typical user prefers to refine up-front or after receiving results. MadamLibrarian makes an excellent point about the importance of transparency in discovery and making it clear what isn't being searched. How does a library assess the opportunity cost of a discovery system? MetaDamen brought up very good objections to the use of peer data and previous searches, and Adam raised an important issue regarding next-gen search capabilities and the need for both strict disclosure and an opt-out feature.
Finally, there was much discussion about serendipity and the importance of maintaining an active role in the research process. This is the point where I will choose to be a bit of a contrarian and push back a little. First, let's look at passivity versus mindfulness in the research process. I've worked reference shifts for years, both virtual reference and in-person, and my totally biased opinion is that the mindfulness of the research/writing/learning process really doesn't occur during the "source harvesting" stage, at least not for your average college student. Whether we're talking about the old bound volumes of the Reader's Guide, its newer incarnation in Academic Search Complete, or the latest discovery system, that process involves hoovering up as many articles and books as seem to fit the topic, then winnowing them down later. There's plenty of mindfulness in choosing and narrowing that topic, and plenty later in the reading, synthesis, and writing, but the harvesting? Not so much. If the search isn't going well enough to "satisfice," there's always the helpful librarian ready to offer controlled vocabulary and search tips at the teachable moment. A better discovery system is like a better car in this case--you may spend less time looking at the scenery as you whiz past, but it was only the strip mall beside the interstate anyway, and you have more time to enjoy your destination.
But what about the serendipity? Serendipity is such a romantic ideal. But why is it that people wax all poetic over a random misshelving in the book stacks, but deny it in the case of a metadata error? Why can't an automated system provide serendipity in searching--indeed, why can't an automated system do a better job of offering it up? In the classic print-only library, if you have a book about both cats and airplanes, you can only shelve it with the cat books or the airplane books, probably on completely separate floors, if not in separate branches. Which set of users gets the serendipity, and which set misses out entirely? On the other hand, if there's a book on cats and airplanes in the library, your discovery system can present it in the results whether you searched for cats or airplanes. Some librarians go even further and suggest that serendipity in the stacks is a negative concept. The Ubiquitous Librarian "would rather teach students to be good searchers instead of lucky ones." Donald Barclay compares browsing the stacks to "hitting the 3-day sales tables" and notes that discovery systems open a world of shared collections much better and more efficient than what any one library can house. Patrick Carr argues in "Serendipity in the Stacks: Libraries, Information Architecture, and the Problems of Accidental Discovery" that "from a perception-based standpoint, serendipity is problematic because it can encourage user-constructed meanings for libraries that are rooted in opposition to change rather than in users’ immediate and evolving information needs." He suggests that libraries strive to develop "information architectures that align outcomes with user intentions and that invite users to see beyond traditional notions of libraries." Perhaps what is really needed is for libraries to repurpose the notion of serendipity and show users how to generate magical results with better searching techniques in better systems.