WossaMetaU: controlled vocabulary

Showing posts with label controlled vocabulary.

Thursday, March 26, 2015

A Different Kind of Controlled Vocabulary

I read Adam's post about his difficult element, Relation, and it also made me think about some of the other elements that require descriptions that don't necessarily have a straightforward controlled vocabulary to use. How do you achieve consistency of description while still retaining the freedom that allows those less-controlled elements to serve as a useful complement to the more-controlled elements?

For example, if Subject requires a CV that limits it to specific words and phrases like "Tackling" and "Quarterback sack," the Description element might fill in the gaps with keyword-heavy phrases describing the play in more detail. The job of the Relation element would be to link the images of this play, perhaps from the beginning of an attempted tackle through another player's attempt to block and ending with a successful sack.

As long as all the image catalogers are using basically the same style for these elements, all will be well. However, if one describer says "Auburn Right Tackle Smith launches himself" and another says "Tiger offensive lineman number 89 attempts sack" while a third says "Auburn player tries to stop Alabama players by using interpretive dance," then it's not as useful. And I'm going to admit right up front that that third describer is probably me, so I'm concerned about holding up my part of the job when it comes to actually describing these images according to the rules!

That got me to thinking about newspaper and magazine writing, and it occurred to me that they use style guides. So maybe there's a football writing style guide out there that could help all of us to be consistent yet creative and thorough in our descriptions? I looked around and there are some possible choices. Maybe the owners of the more descriptive elements would be kind to the football-challenged among us and consider making a style guide recommended practice for our digital library. Here are a few examples, but there are probably more out there, or the owners of the appropriate elements might just list some suggested terminology and style:

Wednesday, March 25, 2015

A Different Kind of Schema

Tonight in class, Dr. MacCall talked about ways to automate, or partially automate, metadata entry in digital libraries. While highly sophisticated automation is beyond the scope of our class project, it's useful to be aware of the possibilities in case we need them later in our careers. Using our football image library as an example, Dr. MacCall suggested that instead of manually typing in a whole, possibly difficult-to-spell name (e.g., "Baumhower") every time, the image cataloger could type in the player number (in this case, 73) and lookup software would provide the correct name.

Of course, it couldn't really be that simple, and the discussion quickly turned to all the ways it could go wrong (reassigned numbers, duplicate numbers, partial numbers, and so on), as well as the myriad rules for uniform numbers allowed on the field every game. I never realized that football uniform numbers were so complicated! Furthermore, that complicated uniform number system would have to be written into the rules of the number-to-name matching software. Thus numbering rules intended to reduce confusion on the actual football field would be re-purposed as a schema intended to reduce errors in a virtual library of football images. How wonderfully poetic to think that the rules and structure of the game itself could become an integral part of creating a football image library!
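To make the idea concrete, here is a minimal sketch of what number-to-name lookup might look like. Everything in it is my own assumption for illustration: the `RosterEntry` record, the season values, and every player name except "Baumhower" are made up, and a real tool would load its roster from an authoritative source. The point it tries to show is that the lookup has to be keyed on more than the number alone, and that it should refuse to guess when a number is ambiguous.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RosterEntry:
    season: int   # hypothetical season label
    number: int   # uniform number
    name: str     # player surname as it should appear in the metadata


# Made-up sample roster. Only number 73 / "Baumhower" comes from the class example.
ROSTER = [
    RosterEntry(2001, 73, "Baumhower"),
    RosterEntry(2001, 12, "Hypothetical Player A"),
    RosterEntry(2001, 12, "Hypothetical Player B"),  # duplicate number, same season
    RosterEntry(2002, 73, "Hypothetical Player C"),  # number reassigned later
]


def lookup_names(season, number, roster=ROSTER):
    """Return every name that wore `number` in `season` (none, one, or several)."""
    return [e.name for e in roster if e.season == season and e.number == number]


def suggest_name(season, number, roster=ROSTER):
    """Return a name only when the match is unambiguous; otherwise return None
    so the cataloger knows to resolve it by hand."""
    matches = lookup_names(season, number, roster)
    return matches[0] if len(matches) == 1 else None


print(suggest_name(2001, 73))  # unambiguous match
print(suggest_name(2001, 12))  # ambiguous, so no suggestion
```

Returning nothing for an ambiguous number is a deliberate choice in this sketch: a wrong name entered automatically seems worse for the library than a blank the cataloger has to fill in.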

Monday, March 23, 2015

Rights Vocabulary: the publisher situation

While I was already thinking about Rights vocabulary, the Copyright Clearance Center conveniently tweeted a link to this post about the massive scale of rights and royalties that publishers must deal with, and ways they are considering to automate or at least improve the process. Rights and royalty data, it turns out, are a bit of a Wild West situation, with practically no standardization, and publishers "still receiving royalty statements from their licensees in all imaginable formats — PDFs, Excel documents, and even paper printouts."

Although the issue of ultimately funding the initiative is still up in the air, the Book Industry Study Group's (BISG) Rights Committee has begun work associated with three major themes: the value of standardization and how it will provide return on investment, development of a standard vocabulary of rights terminology, and ways to improve the visibility and discoverability of rights.

As the publishing industry continues to consolidate, with the resulting need to combine massively varied licensing data and control ever larger numbers of assets, the need to automate this process and create interoperability will become increasingly pressing. While publishers have been getting along without standardization and automation thus far, it seems likely that the industry as a whole will become convinced that enough return can be made on the investment, and will move forward on this issue.

Rights Vocabulary for Digital Libraries

Since the Rights Element is my responsibility for our class project, I have been thinking about the appropriate language to use to make sure the information is both understandable and reasonably aligned with rights language in other digital libraries. I checked to see if there is a controlled vocabulary that we might use, but found that while much attention has been paid to structure and rules, the actual right-hand-side values don't seem to come from any single standard list. This list of licensing vocabulary from the Center for Research Libraries has some useful terms, but it is mostly intended for licensing electronic resources such as databases, journals, and books. PRISM offers a very minimal list of rights terms. Creative Commons provides a clearer set of terminology, adapted for RDF, and several linked data licensing vocabularies are rounded up on this handy site.

For our project, keeping it simple will probably be the best course of action, but it is interesting to see how this might be handled in a linked data environment--and to realize how important proper encoding of rights data could be in an environment where the images in your digital library could be linked to outside resources in unexpected or unimagined ways.

Thursday, March 5, 2015

Building controlled vocabularies

I'm a huge fan of using controlled vocabularies for even relatively small tasks. I've been known to hand out journal spreadsheets to subject specialists with renewal decision options locked into drop-down menus to discourage their use of creative, difficult-to-interpret language. And I still grumble about the co-workers who designated items bought with a small pool of special funds with the following notes in our ILS order records: "One-time funds," "One time $," "1-time funds," "1 time $," "1X funds," and "1X $." Seriously, try finding all of those variations with nothing but a text-string based search function!
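For the curious, here is a rough sketch of what cleaning up those six variants after the fact might involve. The pattern and the `normalize_fund_note` helper are my own invention, not anything our ILS provides, and the pattern only covers the variants listed above; the next creative spelling would slip right past it, which is rather the argument for a drop-down menu in the first place.

```python
import re

# Matches only the six spellings quoted above, ignoring case:
# "one" or "1", then "-time" / " time" / "x", then "funds" or "$".
ONE_TIME_NOTE = re.compile(r"(?:one|1)(?:[\s-]*time|x)\s+(?:funds|\$)", re.IGNORECASE)

PREFERRED = "One-time funds"  # the single form we would have liked everyone to use


def normalize_fund_note(note):
    """Map any recognized variant to the preferred form; leave other notes alone."""
    if ONE_TIME_NOTE.fullmatch(note.strip()):
        return PREFERRED
    return note


variants = ["One-time funds", "One time $", "1-time funds", "1 time $", "1X funds", "1X $"]
print({normalize_fund_note(v) for v in variants})
```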

Needless to say, I was quite interested in this article on how to build controlled vocabularies. The first thing I learned was that, while "controlled vocabulary" is often used generically, there is a specific hierarchy described by different terms. A controlled vocabulary on its own is just a specific list of terms where only terms from that list can be used for certain purposes. A simple example is a list of library locations ("REF," "Children's," "YA," etc.) that are consistently used in the catalog. A taxonomy is a controlled vocabulary too, but it has a hierarchical structure of parent/child relationships between the terms, suggesting that it is likely larger and more complex. The list of library locations could be part of a larger taxonomy that covers all the controlled vocabularies in the ILS--item locations, fund codes, patron classes, and so on. A thesaurus is even more complicated--like a taxonomy, only with more relationships. Think of LCSH with its various relationships--not just broader term and narrower term (parent/child) but also related term, and use/used for (preferred term/non-preferred term).
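It helped me to picture the three levels as data structures. The sketch below is only an illustration under my own assumptions: the location codes come from the example above, but the category names, the thesaurus entry, and all of the function names are invented, and none of it is taken from LCSH or from the article.

```python
# 1. Controlled vocabulary: a flat list of allowed terms.
LOCATIONS = {"REF", "Children's", "YA"}


def is_allowed(term, vocabulary=LOCATIONS):
    """A term is valid only if it appears on the list."""
    return term in vocabulary


# 2. Taxonomy: the same terms, plus parent/child structure.
TAXONOMY = {
    "ILS vocabularies": ["Item locations", "Fund codes", "Patron classes"],
    "Item locations": ["REF", "Children's", "YA"],
}


def ancestors(term, taxonomy=TAXONOMY):
    """Walk from a term up to the root, nearest parent first."""
    parent_of = {child: parent for parent, kids in taxonomy.items() for child in kids}
    path = []
    while term in parent_of:
        term = parent_of[term]
        path.append(term)
    return path


# 3. Thesaurus: hierarchy plus related terms and use/used-for references.
#    This entry is invented for illustration only.
THESAURUS = {
    "Tackling": {
        "broader": ["Football plays"],
        "narrower": ["Quarterback sack"],
        "related": ["Blocking"],
        "used_for": ["Tackles"],  # non-preferred entry terms that point here
    },
}


def preferred_term(entry_term, thesaurus=THESAURUS):
    """Resolve an entry term to its preferred form, or None if it is unknown."""
    for term, relations in thesaurus.items():
        if entry_term == term or entry_term in relations["used_for"]:
            return term
    return None


print(is_allowed("YA"), ancestors("YA"), preferred_term("Tackles"))
```

Each level keeps everything the previous one had and adds a kind of relationship, which is roughly how I read the article's hierarchy.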

The article's advice for developing a controlled vocabulary (CV) can be condensed down to the following suggestions:

Define the scope of the CV--how large and complicated does it need to be, and what does it actually need to encompass?
Find good sources for vocabulary--representative content, subject matter experts, search logs (what search terms do your users consistently use?), and existing taxonomies. Consider simply licensing an existing taxonomy if it satisfies your needs.
Have a plan for keeping it updated. Things change--new technologies appear, and terminology changes over time.
Gather terms using your subject matter experts and/or representative documents. Organize them into broad categories including parent/child, related term, and preferred/non-preferred terms. Use dedicated software to manage terms. Creating a graphic representation of the taxonomy may assist with review and categorization.
Export the terms into a machine-readable language for better machine interpretation on the web.
Review and validate the final product, and make sure to incorporate review and validation into the maintenance plan.
Post the new CV to a registry or data warehouse where others can make use of it.
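
The article doesn't prescribe a particular machine-readable format for the export step, so as one possibility here is a sketch that writes a tiny vocabulary out as SKOS in Turtle syntax. The SKOS namespace and property names (`skos:Concept`, `skos:prefLabel`, `skos:altLabel`, `skos:broader`, `skos:related`) are real; the `http://example.org/vocab/` base address, the helper functions, and the sample terms and their relationships are assumptions of mine. It builds the text by hand to stay self-contained; dedicated vocabulary software or an RDF library would be the safer choice for real work.

```python
import re

SKOS_NS = "http://www.w3.org/2004/02/skos/core#"


def slug(label):
    """Turn a label into a simple lowercase identifier for use in an IRI."""
    return re.sub(r"[^a-z0-9]+", "-", label.lower()).strip("-")


def quote(label):
    """Escape backslashes and double quotes so the label is a valid Turtle string."""
    return '"' + label.replace("\\", "\\\\").replace('"', '\\"') + '"'


def to_skos_turtle(terms, base="http://example.org/vocab/"):
    """Serialize a list of term dicts as SKOS concepts in Turtle.

    Each dict needs a "label" and may have "alt", "broader", and "related" lists.
    """
    out = [f"@prefix skos: <{SKOS_NS}> .", ""]
    for term in terms:
        props = ["a skos:Concept", f"skos:prefLabel {quote(term['label'])}@en"]
        props += [f"skos:altLabel {quote(a)}@en" for a in term.get("alt", [])]
        props += [f"skos:broader <{base}{slug(b)}>" for b in term.get("broader", [])]
        props += [f"skos:related <{base}{slug(r)}>" for r in term.get("related", [])]
        out.append(f"<{base}{slug(term['label'])}> " + " ;\n    ".join(props) + " .")
    return "\n".join(out) + "\n"


# Invented sample terms; the broader/alt relationships are my assumption.
SAMPLE_TERMS = [
    {"label": "Tackling", "alt": ["Tackles"]},
    {"label": "Quarterback sack", "broader": ["Tackling"]},
]

print(to_skos_turtle(SAMPLE_TERMS))
```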