WossaMetaU: March 2015

Saturday, March 28, 2015

More Thoughts on the Rights Element

The Rights Element is one of the most straightforward of the elements in terms of what it should contain and why it is there. By definition, it simply "includes a statement about various property rights associated with the resource, including intellectual property rights." Operationally, it establishes who the rightsholder is and frequently offers contact information for prospective users.

I looked over the guidelines linked from our wiki and I think the CARLI version is probably the most straightforward and applicable to our project. The left side is simply labeled "Rights" and the right side is both non-threatening in tone and clear in its intent and directions. I particularly like the second example:

All rights held by William Rainey Harper College Archives. For permission to reproduce, distribute, or otherwise use this image, please contact Firstname Lastname at xxx@harpercollege.edu.

Here is an actual example from the Great Lakes Digital Collection (Newberry Library):

Rights All rights reserved by the Newberry Library. Permission to reproduce in any format must be requested in writing. Contact Photoduplication Department, Newberry Library, 60 W. Walton St., Chicago, IL 60610. Phone: 312-255-3566. E-mail: photoduplication@newberry.org

I think this would translate into one simple statement that could be used for every image, something like:

Rights All rights reserved by the University of Alabama. Permission to reproduce in any format must be requested in writing. Contact [appropriate department at UA, phone: ____, email:____ ]
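If one statement really is going to be reused for every image, it could live in a single template so that a later change to the contact details happens in one place. Here is a minimal Python sketch of that idea. The function names and the record layout are my own assumptions, and the department, phone, and email values stay as placeholders because the owning department has not been confirmed.

```python
# A reusable Rights statement built from one template.
# Field names and record layout are hypothetical, for illustration only.
RIGHTS_TEMPLATE = (
    "All rights reserved by {holder}. "
    "Permission to reproduce in any format must be requested in writing. "
    "Contact {department}, phone: {phone}, email: {email}"
)


def rights_statement(holder, department, phone, email):
    """Fill the shared template with the rightsholder's contact details."""
    return RIGHTS_TEMPLATE.format(
        holder=holder, department=department, phone=phone, email=email
    )


def apply_rights(records, statement):
    """Return copies of the image records with the same Rights value on each."""
    return [{**record, "rights": statement} for record in records]


# Contact details are still unknown, so they remain visible placeholders.
STATEMENT = rights_statement(
    holder="the University of Alabama",
    department="[appropriate department at UA]",
    phone="____",
    email="____",
)

# Two made-up image records, identified only by a placeholder id.
IMAGES = apply_rights([{"id": "image-001"}, {"id": "image-002"}], STATEMENT)
print(IMAGES[0]["rights"])
```

Because every record gets its Rights value from the same string, the statement cannot drift from image to image.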

Right now I'm trying to find out what department that will actually be. Several departments seem to hold licensing rights to images and trademarks: Office of University Advancement, University Athletics, Hoole Library, even the Office of Archaeological Research. I've requested clarification on ownership from the client.

In the meantime, how is this looking to everyone? Feedback welcome!

Thursday, March 26, 2015

A Different Kind of Controlled Vocabulary

I read Adam's post about his difficult element, Relation, and it also made me think about some of the other elements that require descriptions that don't necessarily have a straightforward controlled vocabulary to use. How do you achieve consistency of description while still retaining the freedom that allows those less-controlled elements to serve as a useful complement to the more-controlled elements?

For example, if Subject requires a CV that limits it to specific words and phrases like "Tackling" and "Quarterback sack," description might fill in the gaps with keyword-heavy phrases describing the play in more detail. The job of the Relation element would be to link the images of this play, perhaps from the beginning of an attempted tackle through another player's attempt to block and ending with a successful sack.

As long as all the image catalogers are using basically the same style for these elements, all will be well. However, if one describer says "Auburn Right Tackle Smith launches himself" and another says "Tiger offensive lineman number 89 attempts sack" while a third says "Auburn player tries to stop Alabama players by using interpretive dance," then it's not as useful. And I'm going to admit right up front that that third describer is probably me, so I'm concerned about holding up my part of the job when it comes to actually describing these images according to the rules!

That got me to thinking about newspaper and magazine writing, and it occurred to me that they use style guides. So maybe there's a football writing style guide out there that could help all of us to be consistent yet creative and thorough in our descriptions? I looked around and there are some possible choices. Maybe the owners of the more descriptive elements would be kind to the football-challenged among us and consider making a style guide recommended practice for our digital library. Here are a few examples, but there are probably more out there, or the owners of the appropriate elements might just list some suggested terminology and style:

Wednesday, March 25, 2015

A Different Kind of Schema

Tonight in class, Dr. MacCall talked about ways to automate, or partially automate, metadata entry in digital libraries. While highly sophisticated automation is beyond the scope of our class project, it's useful to be aware of the possibilities in case we need them later in our careers. Using our football image library, Dr. MacCall suggested as an example that perhaps instead of manually typing in a whole, possibly difficult-to-spell name (e.g., "Baumhower") every time, the image cataloger could instead type in the player number (in this case, 73) and lookup software would provide the correct name.

Of course, it couldn't really be that simple, and the discussion quickly turned to all the ways it could go wrong (reassigned numbers, duplicate numbers, partial numbers, and so on), as well as the myriad rules for uniform numbers allowed on the field every game. I never realized that football uniform numbers were so complicated! Furthermore, that complicated uniform number system would have to be written into the rules of the number-to-name matching software. Thus numbering rules intended to reduce confusion on the actual football field would be re-purposed as a schema intended to reduce errors in a virtual library of football images. How wonderfully poetic to think that the rules and structure of the game itself could become an integral part of creating a football image library!
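To see why the lookup is harder than it sounds, here is a rough Python sketch of a number-to-name lookup that admits when it cannot decide. It is only an illustration under my own assumptions: the roster is keyed by season and number, the seasons are guesses, and every row except the pairing of 73 with "Baumhower" is a placeholder rather than real roster data.

```python
from collections import defaultdict

# Illustrative roster rows: (season, jersey number, name).
# Only the pairing of 73 with "Baumhower" comes from the class example;
# the seasons and all other rows are placeholders, not real roster data.
ROSTER = [
    (1975, 73, "Baumhower"),
    (2009, 73, "Placeholder Player A"),  # same number reassigned in a later season
    (2009, 12, "Placeholder Player B"),  # duplicate number within one season
    (2009, 12, "Placeholder Player C"),
]


def build_index(rows):
    """Group names by (season, number) so duplicates stay visible."""
    index = defaultdict(list)
    for season, number, name in rows:
        index[(season, number)].append(name)
    return index


def lookup(index, season, number):
    """Return (status, names), where status is 'ok', 'ambiguous', or 'unknown'."""
    names = index.get((season, number), [])
    if len(names) == 1:
        return "ok", names
    if names:
        return "ambiguous", names  # the cataloger has to choose
    return "unknown", []  # fall back to typing the name


INDEX = build_index(ROSTER)
print(lookup(INDEX, 1975, 73))
print(lookup(INDEX, 2009, 12))
```

Keying on the season handles reassigned numbers, and returning every match instead of the first one keeps duplicate numbers from silently producing the wrong name.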

Tuesday, March 24, 2015

ER&L 2015

I really enjoyed Adam's post on ARLIS 2015. It sounds like he had a great time and got a lot out of his conference experience. I recommend that everyone try to get to a library conference if they can, and I also totally recommend the smaller, more narrowly focused conferences. If you're a first-timer, the smaller events are much less overwhelming, you'll have a better chance to meet people in your chosen field, and like Adam, you'll probably find that a majority of the presentations are directly relevant to your interests.

I also thank Adam for reminding me that I could blog about my experience at ER&L 2015. I won a travel grant from Taylor & Francis that paid my way to Austin for this great conference on Electronic Resources and Libraries. The conference had a nice mix of programming, with a lot aimed at new electronic resources managers, as well as a healthy number of presentations on more advanced topics. Discovery was a hot topic, with several presentations and two different panels, one comparing experiences with different systems, and another analyzing products to see if they favored one publisher's or platform's content over another's (happily, they didn't find evidence of favoritism). Lots of presenters mentioned metadata in a variety of contexts--usage data (improving standards, collection assessment), journal metadata (TRANSFER and PIE-J and "publishers, whatever you do, please stop with the retroactive title changes"), user data and privacy (analyzing data, new NISO initiative), discovery (which system delivers the best relevancy ranking for results?), altmetrics (measuring the impact of scholarship), and interoperability (what were the challenges of migration to this next generation system?). Of course I couldn't get to everything I wanted to hear, but I had a blast trying and would recommend ER&L to anyone starting or considering a career in electronic resources management, collection assessment, or a related job.

Monday, March 23, 2015

Rights Vocabulary: the publisher situation

While I was already thinking about Rights vocabulary, the Copyright Clearance Center conveniently tweeted a link to this post about the massive scale of rights and royalties that publishers must deal with, and ways they are considering to automate or at least improve the process. Rights and royalty data, it turns out, are a bit of a Wild West situation, with practically no standardization, and publishers "still receiving royalty statements from their licensees in all imaginable formats — PDFs, Excel documents, and even paper printouts."

Although the issue of ultimately funding the initiative is still up in the air, the Book Industry Study Group's (BISG) Rights Committee has begun work associated with three major themes: the value of standardization and how it will provide return on investment, development of a standard vocabulary of rights terminology, and ways to improve the visibility and discoverability of rights.

As the publishing industry continues to consolidate, with the resulting need to combine massively varied licensing data and control ever larger numbers of assets, the need to automate this process and create interoperability will become increasingly pressing. While publishers have been getting along without standardization and automation thus far, it seems likely that the industry as a whole will become convinced that enough return can be made on the investment, and will move forward on this issue.

Rights Vocabulary for Digital Libraries

Since the Rights Element is my responsibility for our class project, I have been thinking about the appropriate language to use to make sure the information is both understandable and reasonably aligned with rights language in other digital libraries. I checked to see if there is a controlled vocabulary that we might use, but found that while much attention has been paid to structure and rules, actual right-hand-side standard terms don't seem to have one standard list. This list of licensing vocabulary from the Center for Research Libraries has some useful terms, but it is mostly intended for licensing electronic resources such as databases, journals, and books. PRISM offers a very minimal list of rights terms. Creative Commons provides a clearer set of terminology, adapted for RDF, and several linked data licensing vocabularies are rounded up on this handy site.

For our project, keeping it simple will probably be the best course of action, but it is interesting to see how this might be handled in a linked data environment--and to realize how important proper encoding of rights data could be in an environment where the images in your digital library could be linked to outside resources in unexpected or unimagined ways.

Sunday, March 15, 2015

Gambling Underdog Westerns: How Netflix uses microtagging to give you what you want

Since it's Spring Break and Netflix is FAR more tempting than thinking about metadata, here's an excuse to think about Netflix and metadata at the same time. Imagine the football image library one could build with the infrastructure, algorithms, staff, and finances of Netflix!

Because recommendations are an important way to drive viewership and improve the user experience, Netflix uses 76,897 micro-genres to tag its videos. The combination of these tags results in those strange, wonderful, and oddly specific row headings like "Cerebral Romantic Thrillers from the 1980's." But to someone studying metadata and digital libraries, the story of how Netflix tags those videos and organizes that information is almost as interesting as the results, and Alexis Madrigal tells all about it in this fascinating article in the Atlantic. I guarantee that you will want to go and read this article, but I'll share a few nuggets from it here:

Netflix uses a controlled vocabulary. For example, it's always "Western," never "Cowboy Movie" or "Horse Opera."
Netflix builds its headings using a defined hierarchy and order--a method anyone familiar with Library of Congress Subject Headings will appreciate. Certain descriptors, like "Oscar-Winning," always go toward the front, while other descriptors, like time periods, always go toward the end. The author sums it up with this general formula: "Region + Adjectives + Noun Genre + Based On... + Set In... + From the... + About... + For Age X to Y."
Netflix tagging is done by humans, who watch every video and tag everything. Many of the tags are scales of 1-5.
Netflix uses an algorithm to analyze the tags and build the "personal genres" it displays for its users. Some genres, like "Feel-good," are based on a formula incorporating several tags.
Netflix combines viewing data and tags to try to recommend videos it thinks you will like. The company believes that this is a better method than either relying exclusively on your ratings or trying to recommend videos to you based on what other people watched.
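The fixed ordering in that formula is easy to try out in code. Here is a small Python sketch, not Netflix's actual method, that assembles a heading from whichever slots are filled. The slot order follows the formula quoted in the list, while the slot names, the prefix wording, and the function itself are my own assumptions.

```python
# Slot order follows the formula quoted from the article; the slot names
# and prefix wording are my own labels, not Netflix's internal ones.
SLOT_ORDER = [
    "region", "adjectives", "genre", "based_on",
    "set_in", "from_the", "about", "for_age",
]

PREFIXES = {
    "based_on": "Based on ",
    "set_in": "Set in ",
    "from_the": "from the ",
    "about": "About ",
    "for_age": "for Ages ",
}


def build_heading(**slots):
    """Assemble a heading in the fixed slot order, skipping empty slots.

    'adjectives' is expected to be a list of strings; every other slot
    is a single string.
    """
    unknown = set(slots) - set(SLOT_ORDER)
    if unknown:
        raise ValueError(f"Unknown slot(s): {sorted(unknown)}")
    parts = []
    for slot in SLOT_ORDER:
        value = slots.get(slot)
        if not value:
            continue
        if slot == "adjectives":
            parts.append(" ".join(value))
        else:
            parts.append(PREFIXES.get(slot, "") + value)
    return " ".join(parts)


print(build_heading(
    from_the="1980's",
    genre="Thrillers",
    adjectives=["Cerebral", "Romantic"],
))
# prints: Cerebral Romantic Thrillers from the 1980's
```

The arguments are deliberately passed out of order in the example: the heading still comes out in formula order, which is the whole point of a defined hierarchy.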

And then there's the Perry Mason Effect... but you can read about that on your own!

One more bonus in the article: it includes a "Netflix-Genre Generator" where you can generate your very own personal genres. It also offers the alternative "Gonzo" setting (must be experienced) and the bland "Hollywood" setting (which explains most of the movies Hollywood churns out). Give it a try!

Linked Data, BIBFRAME, and search

This week's reading included several interesting articles on linked data and libraries, including this Library Journal blog post by Enis, this post by Owens, this post by O'Dell, and this article by Gonzales. All of these articles contained similar themes regarding the need to move away from the inflexible MARC format and toward something more amenable to web searching. Some of the problems with current library catalogs include archaic search interfaces, record formats unfriendly to web-based and other "non-print" resources, and their invisibility to search engines.

The hope for BIBFRAME is that it would provide a way forward from MARC and allow library collections to be linked in intuitive and useful ways, so that authors, for example, might be linked not just to their works but perhaps also to members of their social group and events of the time and place where they lived. Perhaps a Google search would be able to easily connect potential users to the nearest available copies of a given book.

All of this is very exciting, and every indication is that this concept, or some aspect of it, will become a Real Thing in the next several years. However, I confess to a little skepticism as to how this will play out in actual practice. For a shared catalog system, among libraries in a system, consortium, regional area, perhaps the world, this could solve a lot of "shared collection" needs. Doing a Google search for something and getting a book from your local library in the top results would be pretty cool.

However, for general web searching, I worry about electronic resources and the issue of authentication and restricted, licensed access. Getting inaccessible results in a Google search quickly dilutes its usefulness, so how, then, does the search handle this? Are the library's results pushed down in relevance or relegated to a "library" silo or special search (like Google Books, perhaps)? Does frustration with licensed access increase the push for Open Access? Or will most of the licensed resources be sufficiently obscure that the main recipients of those results will likely be academics with ways of attaining access? Google already includes some licensed content in its regular search results, with some ways of authenticating users by IP address (I believe it does this with Elsevier ScienceDirect content), but it doesn't yet do this on the scale proposed by BIBFRAME. It will be interesting to follow this over the next several years!

Rights Element: Firm but friendly

Now that the Rights Element has safely retained its place on the island, I'm starting to think about what it should contain. Dr. MacCall made a very interesting point in class, that the Rights Element needs to be carefully balanced. There's an inherent conflict in placing images in a digital library, available to the world--if we didn't want people to enjoy and use the images, we wouldn't go to the time, effort, and expense of making them available; on the other hand, we have to protect the copyright of the images, the wishes of the rightsholder, and the possible revenue stream that use of the images might generate. Therefore, the language in the Rights Element needs to be firm and clear, yet welcoming.

In thinking about this element, I can see that repeatability probably aids clarity. A simple copyright statement stands alone, and is clear and neutral. Information for actually contacting the Bryant Museum or the University of Alabama regarding permission is best handled through a link to the contact page for easier maintenance. But what about that welcoming part?

Tonya helpfully provided a link to the Minnesota Digital Library guidelines in this post, and they are clear and simple. However, I found their Rights example to be a little off-putting from the user perspective: "This image may not be reproduced for any reason without the express written consent...." It's clear, it's firm, but it's not very friendly. I wonder if it would be better to phrase it more like "Interested in using this image? Express written permission is required for all uses. Please contact ... "

What do you all think? Is it more important to invite usage or to discourage unauthorized usage? Can both be accomplished?

Thursday, March 5, 2015

Building controlled vocabularies

I'm a huge fan of using controlled vocabularies for even relatively small tasks. I've been known to hand out journal spreadsheets to subject specialists with renewal decision options locked into drop-down menus to discourage their use of creative, difficult-to-interpret language. And I still grumble about the co-workers who designated items bought with a small pool of special funds with the following notes in our ILS order records: "One-time funds," "One time $," "1-time funds," "1 time $," "1X funds," and "1X $." Seriously, try finding all of those variations with nothing but a text-string based search function!
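For what it's worth, cleaning up those six variants after the fact takes a pattern rather than a plain text-string search. Here is a minimal Python sketch using the standard `re` module. The pattern covers only the six spellings quoted above, and the choice of "One-time funds" as the preferred form is my own assumption.

```python
import re

# Matches only the six spellings listed above (case-insensitive):
# "one" or "1", then "time" (optionally preceded by a space or hyphen) or "x",
# then "funds" or "$".
ONE_TIME = re.compile(r"^(?:one|1)(?:[\s-]?time|x)\s*(?:funds|\$)$", re.IGNORECASE)

# Assumed preferred term; any single agreed spelling would do.
PREFERRED = "One-time funds"


def normalize_note(note):
    """Map any known variant to the preferred term; leave other notes alone."""
    cleaned = note.strip()
    return PREFERRED if ONE_TIME.match(cleaned) else cleaned


VARIANTS = [
    "One-time funds", "One time $", "1-time funds",
    "1 time $", "1X funds", "1X $",
]

for variant in VARIANTS:
    print(f"{variant!r} -> {normalize_note(variant)!r}")
```

A drop-down menu prevents the problem at entry time, while a pattern like this only repairs it afterward, and only for the spellings someone has already noticed.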

Needless to say, I was quite interested in this article on how to build controlled vocabularies. The first thing I learned was that, while "controlled vocabulary" is often used generically, there is a specific hierarchy described by different terms. A controlled vocabulary on its own is just a specific list of terms where only terms from that list can be used for certain purposes. A simple example is a list of library locations--"REF," "Children's," "YA," etc.--that are consistently used in the catalog. A taxonomy is a controlled vocabulary too, but it has a hierarchical structure of parent/child relationships between the terms, suggesting that it is likely larger and more complex. The list of library locations could be part of a larger taxonomy that is all the controlled vocabularies in the ILS--item locations, fund codes, patron classes, and so on. A thesaurus is even more complicated--like a taxonomy, only with more relationships. Think of LCSH with its various relationships--not just broader term and narrower term (parent/child) but also related term, and use/used for (older term/newer term).
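The three levels are easier to tell apart as data structures. Here is a rough Python sketch under my own assumptions: the location codes and the ILS groupings come from the examples above, but the thesaurus entry, including the relationship between the two football terms and the "QB sack" variant, is invented for illustration and not taken from LCSH or any real thesaurus.

```python
from dataclasses import dataclass, field

# 1. Controlled vocabulary: a flat, closed list of allowed terms.
LOCATIONS = {"REF", "Children's", "YA"}


def is_allowed(term, vocabulary=LOCATIONS):
    """A term is valid only if it appears in the list."""
    return term in vocabulary


# 2. Taxonomy: the same terms plus parent/child links (child -> parent).
PARENT = {
    "Item locations": "ILS vocabularies",
    "Fund codes": "ILS vocabularies",
    "Patron classes": "ILS vocabularies",
    "REF": "Item locations",
    "Children's": "Item locations",
    "YA": "Item locations",
}


def ancestors(term, parent=PARENT):
    """Walk up the hierarchy from a term to the root."""
    chain = []
    while term in parent:
        term = parent[term]
        chain.append(term)
    return chain


# 3. Thesaurus: broader/narrower plus related and use/used-for pointers.
@dataclass
class ThesaurusEntry:
    preferred: str
    broader: list = field(default_factory=list)
    narrower: list = field(default_factory=list)
    related: list = field(default_factory=list)
    used_for: list = field(default_factory=list)  # non-preferred variants


# Hypothetical entry: the relationships shown here are invented examples.
ENTRIES = [
    ThesaurusEntry(
        preferred="Quarterback sack",
        broader=["Tackling"],
        used_for=["QB sack"],
    ),
]


def resolve(term, entries=ENTRIES):
    """Return the preferred term for a preferred or non-preferred input."""
    for entry in entries:
        if term == entry.preferred or term in entry.used_for:
            return entry.preferred
    return None


print(is_allowed("Reference"), ancestors("YA"), resolve("QB sack"))
```

Each level keeps everything the previous one had and adds one more kind of relationship, which is why the terms get harder to maintain as you move down the list.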

The article's advice for developing a controlled vocabulary (CV) can be condensed down to the following suggestions:

Define the scope of the CV--how large and complicated does it need to be, and what does it actually need to encompass?
Find good sources for vocabulary--representative content, subject matter experts, search logs (what search terms do your users consistently use?), and existing taxonomies. Consider simply licensing an existing taxonomy if it satisfies your needs.
Have a plan for keeping it updated. Things change--new technologies appear, and terminology changes over time.
Gather terms using your subject matter experts and/or representative documents. Organize them into broad categories including parent/child, related term, and preferred/non-preferred terms. Use dedicated software to manage terms. Creating a graphic representation of the taxonomy may assist with review and categorization.
Export the terms into a machine-readable language for better machine interpretation on the web.
Review and validate the final product, and make sure to incorporate review and validation into the maintenance plan.
Post the new CV to a registry or data warehouse where others can make use of it.

The Rights Element connected to the Source Element

(Edited now that I think I have a better grip on Source, "the most ambiguous, misunderstood, and misused of the 15 core elements.")

Last night I was discussing our assigned project elements with my course group, which includes Adam, Amy, and Katie. I find it interesting how our different elements interact. In particular, I think my element, Rights, could be related to Katie's element, Source.

Over in her blog, Katie wonders whether Source will be useful for this project, primarily because we don't yet know if all images will have an identifiable source, or if we'll be given more detailed information regarding the original images our images are derived from (our 2009 images were likely born digital, but the images from 1975 would have an original somewhere). Her question has a lot of bearing on my element, too, because the Rights for each image may be based on whatever rules govern use of the Source.

Our textbook defines the Rights element quite simply as "'Information about rights held in and over the resource' ... [including] intellectual property rights." Depending on circumstances, the Rights element might hold a very simple statement like "Copyright (c)2009 Paul W. Bryant Museum" but it could also include restrictions on use, such as "Written permission required" or "Noncommercial reuse only."

Importantly, the Rights may vary depending on whether someone wants to use the digital surrogate or needs to use a higher-resolution copy or get access to the original photograph or slide. Alternatively, the Rights may not vary from original to digital copy, but whatever Rights govern the original may also govern the copy. The Rights element, like many Dublin Core elements, may also contain information pertaining to the original image as well as to the digital version. In any of these cases, the data in the Source element would help to both clarify and validate the Rights information.

What do libraries want? It's the metadata, ALL the metadata.

This post was inspired by a suggestion from Damen:

"As I thought about metadata, reclined in my seat at the Intercontinental Chicago, watching the cars and people hurry about on Michigan Avenue, I realized that this had a lot to do with metadata, and ordered another steak from room service."

As I thought about metadata, slumped in my seat at the Hilton, after yet another full day of library software demonstrations from vendors I'm not allowed to name, I realized that this had a lot to do with metadata, and wished I could order a new set of vendors from room service.

For three days, I paid close attention, took good notes, and dutifully listened as the vendors demonstrated their systems according to our requested scenarios. They tried their best. They sent their biggest guns. Their systems all looked fantastic ... if it was 2005.

They all had workable models for dealing with print. They had terrific print circulation statistics, which they wasted far too much time in demonstrating. But the closer to present needs they got, and the more they had been called upon to serve our future needs, the less they had, and the less they even seemed to grasp what we wanted. So many things were still in development, or on some vague roadmap, or minimally coming "in 2016," and we're talking functionality we need RIGHT NOW TODAY. One even prefaced his recitation of potentially supported COUNTER usage reports with the word "NISObabble." I can't even conceive of a professional situation in which that kind of snark would be appropriate, but in my mind it absolutely summed up where their development priorities lay. In sum, those three days were more dispiriting than I could ever have imagined when we began this process well over a year ago. Maybe this process will end in a deal, and maybe it won't--I'm not in a position to vote--but whatever happens, I can see it will be a long road before we have the system we actually need.

So what do our campus libraries want? And why do we want it? Maybe we didn't make it clear. This is my view and only my view, informed by what our system discussed, but I do not speak for that system. I also won't speak to which vendors may or may not meet these needs, or in what way. One thing is obvious in looking at our needs: it does indeed have a lot to do with metadata.

Tools for working with the print collection must reflect the reality of shrinking holdings. They must support shared collection management, robust resource sharing, and easy flagging of last print copy. Special collections and digital libraries must be supported and foregrounded. This means shared, standardized metadata for bibliographic holdings, patron information, and loan rules; the system must support ways to ingest and/or discover metadata related to digital libraries, institutional repositories, and other locally-held content.
Tools for acquiring new materials must support a high degree of automation and have the flexibility to handle individual and package purchases and subscriptions, as well as titles and packages procured through multiple consortia or demand-driven programs. This means support for EDI, KBART, and other major standards, and metadata exchange partnerships with vendors, aggregators and publishers, especially in scholarly fields.
Tools for electronic resources must recognize the scale and complexity of academic collections. Many campuses are adopting "e-preferred" collection policies, and purchasing e-books on a scale unheard of in the days of print. The "million volume" e-book collection is already attainable. This means that systems must be able to handle the tasks of activation, linking, authentication, and collection maintenance with minimal intervention from library staff. Ideally, consortial purchases should be maintainable at the consortial level, while still allowing for local additions and control. Electronic collections should be easily compared to existing print collections. Vendors should work with publishers to ensure that metadata standards and codes of practice such as KBART, TRANSFER and PIE-J are followed by all parties, and holdings data exchange should happen at the vendor-publisher level. Holdings metadata for both e-journals and e-books must be exportable in a useful format for use in 3rd-party applications such as RapidILL; direct partnerships would be even better. Finally, easy linking and authentication through course reserves and course management systems such as Blackboard and Moodle are a must.
Tools for analysis and assessment must privilege electronic usage. As collections trend ever more toward electronic, development priorities need to emphasize more than print circulation statistics. While e-book usage and print book circulation are not directly comparable, each reflects a valid and important aspect of library usage; however, as libraries purchase more of their new materials in electronic formats, print usage moves to the "long tail" where it reflects niche collections and older volumes, and often becomes a tool for weeding rather than guiding new purchases. Usage for electronic resources is the new circulation count, informing purchases through a variety of evaluations--overall usage, cost-per-use, use by discipline, use by publication date, and attempted use ("turnaways"). Academic libraries spend hundreds of thousands of dollars on e-resources, pushing evaluation beyond collection development practice into important discussions regarding budgets and justification to campus stakeholders. This means at minimum that library management systems must support all reports designated standard under COUNTER release 4, for e-journals, e-books, databases, and multimedia, for all forms of usage including attempted usage. They must allow for those reports to be combined with cost data and discipline for thoughtful analysis, and they must be flexible enough to take advantage of the additional information present in reports showing usage of older or open access content.
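To show what combining usage with cost and discipline might look like, here is a small Python sketch of a cost-per-use summary. Everything in it is assumed for illustration: the titles, costs, and counts are made up, and the record layout is not the COUNTER report format. A real workflow would load usage from COUNTER reports and cost from order records.

```python
from collections import defaultdict

# Made-up rows for illustration only; not real titles, prices, or usage.
TITLES = [
    {"title": "Journal A", "discipline": "Chemistry", "cost": 1200.00, "uses": 400, "turnaways": 0},
    {"title": "Journal B", "discipline": "Chemistry", "cost": 800.00, "uses": 0, "turnaways": 15},
    {"title": "Journal C", "discipline": "History", "cost": 300.00, "uses": 60, "turnaways": 2},
]


def cost_per_use(cost, uses):
    """Cost divided by uses, or None when there was no successful use."""
    return None if uses == 0 else round(cost / uses, 2)


def summarize_by_discipline(rows):
    """Total cost, uses, and turnaways per discipline, with cost-per-use."""
    totals = defaultdict(lambda: {"cost": 0.0, "uses": 0, "turnaways": 0})
    for row in rows:
        bucket = totals[row["discipline"]]
        bucket["cost"] += row["cost"]
        bucket["uses"] += row["uses"]
        bucket["turnaways"] += row["turnaways"]
    return {
        discipline: {**t, "cost_per_use": cost_per_use(t["cost"], t["uses"])}
        for discipline, t in totals.items()
    }


SUMMARY = summarize_by_discipline(TITLES)
for discipline, stats in SUMMARY.items():
    print(discipline, stats)
```

Note that "Journal B" has no successful uses but fifteen turnaways, so cost-per-use alone would flag it for cancellation while the attempted-use count suggests unmet demand. That is the kind of judgment the combined reports are meant to support.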

Bottom line, libraries need to manage, share, and assess their entire collections. In the current state, and in the foreseeable future, no system can call itself a complete solution if it doesn't support all three of these needs.