Thursday, February 26, 2015

Rights metadata

The publication "Rights Metadata Made Simple" offers solid, easy-to-follow, no-excuses advice on incorporating rights metadata into a digital library.  Basing her recommendations on copyrightMD, "an XML schema for rights metadata developed by the California Digital Library (CDL)," author Maureen Whalen includes the following suggestions:

  1. Capture fields that would normally be included in basic descriptive metadata anyway, such as title, in an automated manner if possible.
  2. Capture author information, including nationality, birth and death dates, from an authority file if possible.
  3. If the institution holds both the original work and a digital surrogate, create separate rights metadata for each and clearly differentiate the two works.
  4. Use controlled vocabulary to describe copyright status and publication status to ensure that data entry is consistent and conforms if possible to legal definitions.
  5. Recording creator information, year of creation, copyright status, publication status, and the date(s) rights research was conducted allows both internal and external users of the works "to make thoughtful judgments about how the law may affect use of the work."
  6. Researching and recording the rights information for the contents of a digital collection allows libraries, archives and museums to be more "responsible stewards of the works in our collections and the digital surrogates of those works that we create."
  7. Rights information is not static--it may need to be added to or updated periodically; all staff involved in digitization efforts should know who is in charge of maintaining rights information and how to contact them when new information is learned.
As a final piece of advice, Whalen reminds libraries that, while determining consistent local policies for situations where little copyright or publication information is known is important, this should not be used as an excuse for delay--institutions can start by recording the rights information that is already known and deal with the rest over time.
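To make the advice concrete, here is a minimal Python sketch of a rights metadata record capturing the data points Whalen recommends. The field names and vocabulary values are purely illustrative--they are not the actual copyrightMD elements or value lists.

```python
from datetime import date

# Illustrative controlled vocabularies -- not the actual copyrightMD value lists.
COPYRIGHT_STATUS = {"copyrighted", "public_domain", "unknown"}
PUBLICATION_STATUS = {"published", "unpublished", "unknown"}

def make_rights_record(title, creator_name, nationality, birth_year, death_year,
                       year_of_creation, copyright_status, publication_status):
    """Build a simple rights metadata record, enforcing controlled values."""
    if copyright_status not in COPYRIGHT_STATUS:
        raise ValueError(f"copyright_status must be one of {sorted(COPYRIGHT_STATUS)}")
    if publication_status not in PUBLICATION_STATUS:
        raise ValueError(f"publication_status must be one of {sorted(PUBLICATION_STATUS)}")
    return {
        "title": title,              # ideally captured automatically from descriptive metadata
        "creator": {                 # ideally pulled from an authority file
            "name": creator_name,
            "nationality": nationality,
            "birth_year": birth_year,
            "death_year": death_year,
        },
        "year_of_creation": year_of_creation,
        "copyright_status": copyright_status,
        "publication_status": publication_status,
        "rights_research_date": date.today().isoformat(),  # record when the research was done
    }

record = make_rights_record("Letter to a Friend", "Jane Doe", "US",
                            1890, 1955, 1920, "public_domain", "unpublished")
print(record["copyright_status"], record["rights_research_date"])
```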

The Wayback Machine and the Cornell Web Lab

The D-Lib article "A Research Library Based on the Historical Collections of the Internet Archive," published in February 2006, describes an initiative at Cornell University to create a web archive for social science research in partnership with the Internet Archive.

While the Wayback Machine, described in this excellent January 2015 New Yorker article, archives as much of the web as it can (450 billion pages as of this writing), it is not indexed or readily searchable other than by URL or date.  For social scientists looking to do serious analysis of social trends, more manageable (though still enormous), indexed collections are needed. Cornell's project aimed to harvest collections of archived web pages--approximately 10 billion at the time of the article--and develop methods of automated indexing to make them useful to researchers.  Access would be through scripts or APIs. Designers envisioned researchers conducting projects to trace the development of ideas across the internet, follow the spread of rumors and news, and investigate the influence of social networks.
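As a side note on the "searchable only by URL or date" point: the Internet Archive does expose a simple public availability endpoint that takes exactly those two inputs. A minimal Python sketch (the JSON shape shown reflects that endpoint as of this writing):

```python
import requests

def closest_snapshot(url, timestamp="20060101"):
    """Return the archived copy of `url` closest to a YYYYMMDD timestamp, if any."""
    resp = requests.get(
        "https://archive.org/wayback/available",
        params={"url": url, "timestamp": timestamp},
        timeout=30,
    )
    resp.raise_for_status()
    closest = resp.json().get("archived_snapshots", {}).get("closest")
    # `closest` looks like {"url": ..., "timestamp": ..., "available": true} when a capture exists
    return closest["url"] if closest and closest.get("available") else None

print(closest_snapshot("cornell.edu", "20060201"))
```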

As William Arms, one of the developers of the Cornell project, notes in this article from 2008, researchers were spending as much as 90% of their time simply obtaining and cleaning up their data. "The Web Lab's strategy," he explains, "is to copy a large portion of the Internet Archive's Web collection to Cornell, mount it on a powerful computer system, organize it so that researchers have great flexibility in how they use it, and provide tools and services that minimize the effort required to use the data in research."

The project exists today as the Cornell Web Lab, and though direct links to its site were returning errors as of this writing, a detailed description of current activities can be found on this faculty page, and the tools and services suite is still available through Sourceforge.

Wednesday, February 25, 2015

Superschema or Common Schema?

After reading the articles that inspired them (here and here), I had fun comparing Adam's blog post on the useful simplicity behind Dublin Core to Tonya's post on the plausibility of creating a superschema to rule them all.  Using very basic set theory to describe these approaches, it seems to me that Dublin Core takes the approach of the intersection of metadata sets, while the superschema idea consists of the union of all metadata sets.  Both ideas seem to posit a system in which all elements would be optional, using only those appropriate to the object being described.  However, Dublin Core works by providing a limited number of common elements that can describe nearly anything generically, while the superschema would work by providing an almost unlimited number of elements that could describe nearly anything specifically.  What an interesting contrast!
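To make the set-theory analogy literal, here is a toy Python sketch with made-up, simplified element sets:

```python
# Simplified, invented element sets standing in for three specialized schemas
marc_like = {"title", "creator", "date", "subject", "edition", "scale"}
vra_like = {"title", "creator", "date", "subject", "material", "technique"}
darwin_core_like = {"title", "creator", "date", "subject", "taxon", "habitat"}

common_core = marc_like & vra_like & darwin_core_like   # the Dublin Core approach: intersection
superschema = marc_like | vra_like | darwin_core_like   # the superschema approach: union

print(sorted(common_core))   # a few generic elements that describe anything
print(sorted(superschema))   # many specific elements, most optional for any given object
```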

From a practical standpoint, DC has a lot to offer in terms of interoperability, maintainability, and the ease of building fast indexes with understandable browsing facets. The superschema idea would allow a lot of freedom, but describers would need to have a very broad knowledge base, and even local systems based on it would be highly complex.

From the user's standpoint, what would the superschema system look like?  I suspect that it would look a lot like Google.  The search algorithm would probably need to rely on keywords, with relevancy heavily informed by the tags from the various schema (so your search for "sage in herbal remedies" wouldn't be swamped by articles published by SAGE Publishing).  While I don't know how their proprietary indexing systems work, this sounds to me an awful lot like library discovery layers, and the direction they are moving in.

To me, the good news here is that the mix-and-match can, and probably will, happen at a higher level than the metadata schema.  Individual systems could continue to use the specialty schema that work best.  Knowing other schema is still important in case of migration, but hopefully combining datasets will come to rely on something more sophisticated than the most basic of crosswalks. It will be interesting to see where it all goes!

Thursday, February 19, 2015

Perma.cc: Addressing link rot in legal scholarship

I've written several posts on the subject of persistent identifiers, especially DOI, and their importance in maintaining access and enabling permanent cross-referencing and citation.  However, DOI is intended for stable objects, primarily version-of-record scholarly publications (articles, chapters, etc.).  Blog posts, websites, wikis, social media, and similar Internet content can also be useful for research, but authors are forced to fall back on including ordinary URLs when citing these sources. This problem has been specifically documented in legal scholarship--when Harvard Law School researchers recently surveyed the state of legal citations, they made the disturbing finding that "more than 70% of the URLs within the Harvard Law Review and other journals, and 50% of the URLs found within United States Supreme Court opinions, do not link to the originally cited information."

In response, an online preservation service called Perma.cc was "developed by the Harvard Law School Library in conjunction with university law libraries."  Perma.cc works like this:  if an author wants to cite a website or other online source that lacks a permanent identifier, he or she can go to the Perma.cc site and input the URL. Perma.cc downloads and archives the content at that URL and returns a perma.cc link to the researcher, who then uses it in the citation. When the article is submitted for publication to a participating journal, the journal staff check the perma.cc link for accuracy and then "vest" it for permanent archiving. Readers who click the link are taken to a page that offers a choice of the current live page, the archived page (which may not include linked content such as graphics), and an archived screen shot (which will not include live links). Sites that do not want Perma.cc to make their content publicly available can opt out through a metatag or robots.txt file; in these cases, Perma.cc will place the page content in a dark archive, accessible only to the citing author and vesting organization.  Interestingly, Perma.cc still harvests the content, even though it doesn't make it publicly available.
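To keep the moving parts straight, here is a plain-Python model of that workflow as I've described it. The class and field names are my own invention, and it does not call Perma.cc's actual API--it just encodes the rules above (harvest everything, keep opt-outs dark, let the journal vest the link).

```python
from dataclasses import dataclass, field
from itertools import count

_next_id = count(1)

@dataclass
class PermaRecord:
    source_url: str
    perma_link: str
    dark_archive: bool               # True when the site opted out via metatag or robots.txt
    vested: bool = False             # set by journal staff after checking the link
    views: list = field(default_factory=list)

def archive(source_url, site_opted_out):
    """Harvest a page and return a record; opt-outs are harvested but kept dark."""
    record = PermaRecord(
        source_url=source_url,
        perma_link=f"https://perma.cc/XXXX-{next(_next_id):04d}",   # placeholder ID scheme
        dark_archive=site_opted_out,
    )
    # Content is harvested either way; opting out only limits public availability.
    if not site_opted_out:
        record.views = ["live page", "archived page", "archived screenshot"]
    return record

cited = archive("http://example.com/blog/post-42", site_opted_out=False)
cited.vested = True   # the journal verifies the link and vests it for permanent archiving
print(cited.perma_link, cited.views)
```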

It will be interesting to see if this service (which is still in beta as of this writing) is successful, and if it expands beyond the legal field.  It seems to me to be an excellent companion to DOI, as DOI gets around the problem of copyrighted, toll-access content by maintaining only links, rather than archiving content, while Perma.cc provides the same level of stability for freely accessible but less stable web content.

Wednesday, February 18, 2015

Special Collections and Scholarly Communications

Donald Waters' thoughts on The Changing Role of Special Collections in Scholarly Communications come from the broad perspective of a funding organization (The Andrew W. Mellon Foundation), which allows him an informed view of the world of special collections without the potential bias of someone deeply invested in a specific project. Though written five years ago, this piece is still relevant.

Regarding Waters' opinion on the relative commonness or distinctiveness of collections, I think it is important to distinguish between current material and archival material.  As scholarly publishing is ever more thoroughly packaged and sold in huge lots (subject or title collections sold by publishers or "complete" aggregations from EBSCO, ProQuest, and others), and purchases consolidate at the level of the library consortium, homogenization of basic research holdings is an inevitable outcome.  On the other hand, as libraries are increasingly pressed to make space for more social or student-centered needs within their existing footprints, legacy print collections are indeed likely to become more distinctive, a boutique selection tailored to specific campus needs.  This will probably be most noticeable at smaller, non-PhD-granting institutions where it is difficult to make a value case for what one of my bosses called "the book museum" model. These smaller collections will rely more heavily on sharing arrangements with other institutions, with expanded metadata requirements to ensure discoverability and support collection management at individual campuses (e.g. which library holds, or will hold, the last print copy of a title, allowing others to discard the local copy).

I think Waters makes a terrific suggestion that special collections librarians work with the scholars at their institution to prioritize which collections are digitized and processed.  Given the huge backlogs and extraordinarily varied nature of most collections, accommodating current faculty needs seems like an excellent way to direct the workflow, justify the expenditure to campus stakeholders, and build important bridges between the library and faculty researchers.

Another great idea in theory is having researchers assign metadata to the materials they are working on. I think this would work well in the examples cited, where the researcher is working on a fellowship.  It's a great way to "pay forward" some of the grant money, and leverages the deep subject knowledge of the researcher.  However, I think the system would not scale well to the larger world of all digital libraries and special collections.  Faculty balancing research, teaching, and administrative duties are unlikely to be interested in "helping" the library, or have the spare time to commit even if they were; thus the metadata tasks would either go undone or be relegated to graduate students who might or might not have the expertise or commitment necessary to do the job well.

I had the pleasure last semester of interviewing Dr. Sylvia Huot, a scholar involved in the early development of one of the Mellon-funded projects mentioned in the article, The Roman de la Rose Digital Library. In regard to their original plans for full transcription and tagging, she told me that despite small volunteer transcription projects going on at various universities involved in the project, the leaders eventually realized how impractical it was to attempt transcription of the entire library and were forced to abandon the effort.  "No one with the skills to do a proper transcription job was willing or able to volunteer, and it would have required an enormous influx of funding to hire even skilled graduate students to complete the work."

There is little doubt of the value of excellent metadata, accurate transcriptions, open availability, and cross-institution linking and discovery for special collections.  Usage begets citations begets more usage, which pleases stakeholders and funding organizations and leads to more digitization projects in a virtuous cycle. However, there must be a better way than simply shifting the burden from one constituency to another.  There must be either greater efficiency, perhaps through automation or better tools, or greater funding, if this valuable goal is to be reached.

Expanding linked data: Why isn't this done?

JPow asks an excellent question in his LS blog:  Why can't citation data (cited references and cited-by links) be included in surrogate records?

I'm going to talk about this in terms of discovery layers rather than online catalogs, because online catalogs are typically very basic, MARC-based software that is fading rapidly into the background of library systems.

I think that JPow's concept could be implemented using a combination of CrossRef, Web of Science, and Scopus -- a terrific idea, and one that is already implemented to some degree in some discovery systems, but definitely not universally.

What's the catch? I suspect that it's mostly about money and proprietary data. Citation indexes like Web of Science and Scopus are very expensive to maintain, and very big money makers for the companies that own them. They are willing to share basic indexing and citation metadata with the discovery services, but part of the agreement is that libraries must ALSO have licensed the databases separately before they are allowed to see results from those databases included in the discovery layer, and even then much of the "secret sauce" of citation tracing and other advanced functionality isn't included. (In fairness, quite a bit of this wouldn't easily translate to the simpler discovery interface.)

What I haven't seen implemented yet is CrossRef, and that has interesting potential. I think that one catch there is that it tends to be implemented as part of the full text of articles, in the references section. That section of the full text would perhaps have to be separated in some way and included in the metadata stored by the discovery service. I think that's possible, though I don't know if any systems are doing it currently.  I think authentication could be the other tricky piece, since CrossRef links directly through DOI.  This isn't a huge issue for on-campus users (who are generally authenticated by IP address) but directing off-campus users through the right hoops (proxy server, Shibboleth, etc.) is a potential hurdle.
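For the curious, CrossRef's public REST API already exposes some of this. A hedged Python sketch (the "reference" list only appears when a publisher has deposited open references, and the DOI shown is a placeholder to replace before running):

```python
import requests

def crossref_citations(doi):
    """Return (cited-by count, deposited reference list) for a DOI via api.crossref.org."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=30)
    resp.raise_for_status()
    msg = resp.json()["message"]
    return msg.get("is-referenced-by-count", 0), msg.get("reference", [])

# Replace the placeholder with any real DOI of interest before running.
cited_by, references = crossref_citations("10.1234/placeholder-doi")
print(f"Cited by {cited_by} works; {len(references)} references deposited")
```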

I did check my library's discovery system (ProQuest Summon) and found that it offers our users "People who read this journal article also read" article links attached to records it gets from Thomson Reuters Web of Science. On the other hand, it doesn't offer any extra links for records it gets from Elsevier (ScienceDirect and Scopus). We see the Web of Science information because we've separately licensed those indexes from Thomson Reuters, and that means Summon is "allowed" to show us those records. We don't see the citation links from Scopus because we haven't licensed that product, so Summon isn't allowed to present any results from that dataset. I also find it interesting that Web of Science appears to share usage-based metadata but is not sharing citation-based metadata; I'm guessing maybe they see that as potentially too cannibalistic to their own service.

So, the short answer? JPow is asking for a rabbit, yes, and it's not from a hat, but from a deep and twisty rabbit hole. I don't think it's asking too much, though I do think it would be expensive.

Friday, February 13, 2015

Fire up the Serendipity Engine, Watson!

So, now that I've bashed the whole idea of serendipity in an earlier post and stressed the importance of both "good enough" discovery and teaching users to effectively utilize the tools we give them, I'm going to go off in a totally different direction and write about something completely whimsical. The terrific article by Patrick Carr that I referenced in my earlier post mentioned a site called Serendip-o-matic. Serendip-o-matic bills itself as a Serendipity Engine that aims to help the user "discover photographs, documents, maps and other primary sources" by "first examining your research interests, and then identifying related content in locations such as the Digital Public Library of America (DPLA), Europeana, and Flickr Commons."

How does the Serendipity Engine work?  You paste text into its search box--from your own paper, an article source, a Wikipedia entry, or anything else that seems like a useful place to start--and click the Make some magic! button. The application extracts keywords from that text snippet and looks for matches in the metadata available from its sources, returning a list of loosely related content (mostly images at this point in its development).
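Here is a rough Python sketch of that recipe, using a naive frequency-based keyword extractor (certainly not whatever the real service does) and the DPLA items API, which requires a free api_key:

```python
import re
from collections import Counter
import requests

STOPWORDS = {"the", "and", "that", "with", "from", "this", "have", "were", "their"}

def extract_keywords(text, n=5):
    """Naive keyword extraction: most frequent longer words, minus a few stopwords."""
    words = re.findall(r"[a-z]{4,}", text.lower())
    return [w for w, _ in Counter(w for w in words if w not in STOPWORDS).most_common(n)]

def serendipitous_items(text, api_key, per_keyword=3):
    """Query the DPLA items API for each extracted keyword and collect loosely related titles."""
    found = []
    for kw in extract_keywords(text):
        resp = requests.get(
            "https://api.dp.la/v2/items",
            params={"q": kw, "page_size": per_keyword, "api_key": api_key},
            timeout=30,
        )
        resp.raise_for_status()
        for doc in resp.json().get("docs", []):
            found.append((kw, doc.get("sourceResource", {}).get("title")))
    return found
```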

The idea behind Serendip-o-matic is the very opposite of most search engines or discovery systems, which seek to deliver results that are as relevant as possible.  In fact, the whole point of the Serendipity Engine is to deliver the unexpected, yet somehow related, item that will loosen the writer's block, send the mind whirling in a fresh direction, or make the connection the brain sensed but couldn't quite reach.

Metadata makes magic!  Try it and see what you think.

Thursday, February 12, 2015

TEI and Comic Books

The Text Encoding Initiative (TEI) defines a set of guidelines for encoding digital texts.  TEI encoding does much more than just make texts readable on the Internet--it allows for important bibliographic metadata to be associated with the text, and also makes it possible to encode structural features of the text (such as rhyme schemes, utterances, and character names) to enable better retrieval and deeper analysis.  TEI by Example is a good place to get a quick taste of what TEI consists of--Section 5 TEI: Ground Rules stands by itself and is relatively easy to understand.
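For a flavor of what this looks like in practice, here is a toy TEI fragment of my own invention, using core elements only, parsed with Python to pull out the encoded speakers:

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>A Toy Dialogue</title></titleStmt>
      <publicationStmt><p>Unpublished example</p></publicationStmt>
      <sourceDesc><p>Born digital</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <sp><speaker>Librarian</speaker><p>Have you tried the discovery layer?</p></sp>
      <sp><speaker>Student</speaker><p>Is that the big search box?</p></sp>
    </body>
  </text>
</TEI>"""

root = ET.fromstring(SAMPLE)
# Structural encoding makes the utterances and speakers directly retrievable.
for sp in root.findall(".//tei:sp", TEI_NS):
    speaker = sp.find("tei:speaker", TEI_NS).text
    line = sp.find("tei:p", TEI_NS).text
    print(f"{speaker}: {line}")
```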

While TEI Lite, a subset of TEI "designed to meet '90% of the needs of 90% of the TEI user community'" is widely used, TEI is also extensible, allowing for customization to meet the needs of a specific project.  One fun example of this is Comic Book Markup Language, developed by John Walsh. This extension allows for encoding of comics and graphic novels, including structural features such as panels and speech bubbles, and sound effects such as "ZAP!" This slide deck from a Charleston Conference 2014 presentation on CBML (screen will appear blank--click the arrows or use left/right arrow keys) explains further about CBML and also provides a useful example of TEI in action.

Wednesday, February 11, 2015

Discovery vs Serendipity

I found the responses to "How far should we go with ‘full library discovery’?" from several of my fellow students to be quite interesting.  I noticed that MetaWhat!Data! allowed for the idea that others might find full library discovery to be overload, but she herself liked the idea of "going further down the rabbit hole of discovery." I'm always interested in what people would like more of in their library systems, so I wish she had elaborated a bit more on that.  JPow, on the other hand, decided to test different systems and report on the differences, always a useful exercise. I enjoyed his fresh perspective and choice to concentrate on pre-filtering capabilities.  I wonder what sort of user studies these systems may (or may not) have regarding whether the typical user prefers to refine up-front or after receiving results.  MadamLibrarian makes an excellent point about the importance of transparency in discovery and making it clear what isn't being searched: how does a library assess the opportunity cost of a discovery system?  MetaDamen brought up very good objections to the use of peer data and previous searches, and Adam raised an important issue regarding next-gen search capabilities and the need for both strict disclosure and an opt-out feature.

Finally, there was much discussion about serendipity and the importance of maintaining an active role in the research process.  This is the point where I will choose to be a bit of a contrarian and push back a little.  First, let's look at passivity versus mindfulness in the research process. I've worked reference shifts for years, both virtual reference and in-person, and my totally biased opinion is that the mindfulness of the research/writing/learning process really doesn't occur during the "source harvesting" stage, at least not for your average college student.  It doesn't matter whether we're talking about the old bound volumes of the Reader's Guide, its newer incarnation in Academic Search Complete, or the latest discovery system: that process involves hoovering up as many articles and books as seem to fit the topic, and winnowing them down later. There's plenty of mindfulness in choosing and narrowing that topic, and plenty later in the reading, synthesis, and writing, but the harvesting?  Not so much.  If the search isn't going well enough to "satisfice," there's always the helpful librarian ready to offer controlled vocabulary and search tips at the teachable moment.  A better discovery system is like a better car in this case--you may spend less time looking at the scenery as you whiz past, but it was only the strip mall beside the interstate anyway, and you have more time to enjoy your destination.

But what about the serendipity?  Serendipity is such a romantic ideal.  But why is it that people wax all poetic over a random misshelving in the book stacks, but deny it in the case of a metadata error? Why can't an automated system provide serendipity in searching--indeed, why can't an automated system do a better job of offering it up?  In the classic print-only library, if you have a book about both cats and airplanes, you can only shelve it with the cat books or the airplane books, probably on completely separate floors, if not in separate branches.  Which set of users gets the serendipity, and which set misses out entirely?  On the other hand, if there's a book on cats and airplanes in the library, your discovery system can present it in the results whether you searched for cats or airplanes.  Some librarians go even further and suggest that serendipity in the stacks is a negative concept.  The Ubiquitous Librarian "would rather teach students to be good searchers instead of lucky ones." Donald Barclay compares browsing the stacks to "hitting the 3-day sales tables" and notes that discovery systems open a world of shared collections much better and more efficient than what any one library can house. Patrick Carr argues in "Serendipity in the Stacks: Libraries, Information Architecture, and the Problems of Accidental Discovery" that "from a perception-based standpoint, serendipity is problematic because it can encourage user-constructed meanings for libraries that are rooted in opposition to change rather than in users’ immediate and evolving information needs."  He suggests that libraries strive to develop "information architectures that align outcomes with user intentions and that invite users to see beyond traditional notions of libraries." Perhaps what is really needed is for libraries to repurpose the notion of serendipity and show users how to generate magical results with better searching techniques in better systems.

Thursday, February 5, 2015

Identifier persistence

I agree with Madam Librarian that the article A Policy Checklist for Enabling Persistence of Identifiers was convoluted and difficult to follow in places.  For the sake of my own understanding, I'll try to summarize here the basic points I got out of it.  Maybe this will help others in the class as well.

The article presents a set of numbered questions, but then proceeds to address them in a completely different order while attempting to map them to a checklist, which is numbered in yet another style. For the sake of sanity, I will just briefly summarize the article's main points in the order presented, and present examples where it seems useful.

What should I identify persistently?  Analyze your resources, decide which ones can be consistently identified in some way, and then prioritize these identifiable resources.  There will probably be many items that have identifiers, but only a subset of these will require persistence; typically these would represent key access points for your user community.

What steps should I take to guarantee persistence?  This is best handled through policies, supported by automation of processes. Information management should be decoupled from identifier management. In practice, this means that information within a system is identified and managed using local keys--for example, the URL for a journal article. However, the identifiers for this same information that are shared with outside entities (indexing services and library databases, for example) should be based on indirect identifiers, which can be updated when necessary in a way that is invisible to users.

An example of this is an article DOI.  The hypothetical article "36 Tips for Awesomeness" (local identifier) is published in the Spring 2006 issue of Fabulous Journal, given a URL of www.fabulousjournal.com/36_Tips_for_Awesomeness (local identifier), and assigned a DOI of 10.8992/fj.1234 (persistent identifier).  Over the next few years, the journal is bought by another publisher and all content is moved to www.awesomepublisher.com/journal/(ISSN)1000-0001. A year or two after that, the new publisher merges Fabulous Journal with Really Cool Journal, requests a new ISSN, and moves all content to www.awesomepublisher.com/journal/(ISSN)2002-200X. Awesome Publisher has good policies for persistence and updates the DOI target with each change.  This is the result:

10.8992/fj.1234 initially points to www.fabulousjournal.com/36_Tips_for_Awesomeness

10.8992/fj.1234 then points to www.awesomepublisher.com/journal/(ISSN)1000-0001/36_Tips_for_Awesomeness

10.8992/fj.1234 currently points to www.awesomepublisher.com/journal/(ISSN)2002-200X/36_Tips_for_Awesomeness

As long as the services that refer to this article use the DOI instead of the article URL, it will remain accessible despite the changes going on in the background.
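A quick way to see that indirection in action is to ask doi.org where a DOI currently points. The DOI from the example above is made up, so the sketch below falls back on the DOI Handbook's own DOI, a standard real-world example:

```python
import requests

def doi_target(doi):
    """Follow the doi.org redirect chain and return the current landing URL."""
    resp = requests.head(f"https://doi.org/{doi}", allow_redirects=True, timeout=30)
    return resp.url

# doi_target("10.8992/fj.1234") would work here if that DOI existed; it is the
# post's hypothetical example, so use a real one instead:
print(doi_target("10.1000/182"))   # the DOI Handbook's own DOI
```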

What technologies should I use to guarantee persistence?  Whichever ones work best with your existing technology and workflow. It's more important that the process works seamlessly and with minimum effort than it is to commit to one specific technology, no matter what That One IT Guy in your division says.

How long should identifiers persist?  The answer to this is, as long as is appropriate, but make sure that you (1) don't promise what you can't deliver (no one can actually guarantee "forever") and (2) are up-front about it ("provisions are in place to guarantee persistence for a minimum of 30 years beyond the online publication date" or "this link will expire in 7 days").

What do you mean by "persistent"?  The article explains that there are degrees of persistence, and breaks them down into a list (I'll use the same article example to explain).

Persistence of Name or Association:
 (1) The title "36 Tips for Awesomeness" will always be associated with that specific article on awesome tips--it won't suddenly be associated with an article on cattle diseases.
 (2) The article may continue to be referred to in various places as www.fabulousjournal.com/36_Tips_for_Awesomeness even though that URL no longer works. In other words, the association persists in unmaintained places outside the control of the resource owner.
 (3) The article will always be associated with DOI 10.8992/fj.1234, whether or not the publisher updates the DOI information when the article changes location.

Persistence of Service:
 (1) Retrieval:  Can the item still be obtained over the guaranteed time period?  In the case of our article, two of the three listed URLs would eventually fail to retrieve the article, but the DOI should continue to work, resulting in retrieval of the article no matter where it is hosted.
 (2) Resolution:  A URL may resolve without resulting in a successful retrieval. For example, the original author of our 36 tips might get into a copyright dispute with the new publisher, resulting in the article being taken down.  In this case, the publisher might arrange for the URL to resolve to a page with the basic metadata for the article and a brief note about the missing content.  If the URL instead results in a "page not found" error, then it lacks persistent resolution.

Whether a service guarantees retrieval or only resolution is an important distinction and should be clearly stated; the two are related but not the same.
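Here is a rough sketch of how one might check that distinction programmatically, with the caveat that telling a tombstone page from real content ultimately requires inspecting the page body; the marker string below is just a stand-in:

```python
import requests

def check_persistence(url, tombstone_marker="no longer available"):
    """Classify a link as retrieval, resolution-only, or neither."""
    try:
        resp = requests.get(url, allow_redirects=True, timeout=30)
    except requests.RequestException:
        return "no resolution (request failed)"
    if resp.status_code == 404:
        return "no resolution (page not found)"
    if tombstone_marker in resp.text.lower():
        return "resolution only (landed on a metadata/tombstone page)"
    return "retrieval (content returned)"

print(check_persistence("http://www.fabulousjournal.com/36_Tips_for_Awesomeness"))
```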

Persistence of Accountability:
This is mostly for archival purposes. Is some kind of metadata maintained that gives the history of who has created and edited a specific record?

TL;DR:  Persistent identifier policies in an information management environment should clearly outline the following:  which identifiers will be persistent, how persistence will be maintained, how long the user can expect persistence to last, and whether persistence guarantees access to a specific item (retrieval) or guarantees access to (at minimum) information about that item (resolution).


Wednesday, February 4, 2015

Clip and Save: Cataloging and metadata in large video collections

Tonight's class lecture included some interesting thoughts on video cataloging, particularly on how granular it needs to be.  Is it useful to catalog, for example, the entire NBC news broadcast from February 3, 2015, as if it were a single, monolithic item?  Perhaps our users really just want to watch that 2 1/2 minute segment on Harper Lee's new novel.

Thinking along exactly these lines, NBC has digitized and indexed 12,000 separate stories from its archives, drawing from both current material and much older content--as far back as the 1920's. NBC now sells this digitized content to libraries, universities, and K-12 schools (there are both higher education and K-12 versions) in the form of a streaming media database called NBC Learn. Each clip has a full set of descriptive metadata, and what is more interesting, this metadata has been converted into a complete set of MARC records available (readable XML version here) to subscribing libraries if they want to make every individual clip available through their online catalog.  While one can certainly debate the value of adding thousands of records for short video clips to the library catalog, it is interesting that the records are available.

VAST Academic Video, a similar-sized video clip collection from Alexander Street Press, which features more scholarly, independent, and/or historical material, likewise fully indexes every clip in the collection and provides MARC records for 80% or more of them. VAST also supplies its index data to the major library discovery products and will soon be available through Google searches.

Comparing the two databases side-by-side, it's easy to see that VAST far outstrips NBC Learn in terms of specific indexed fields; on the other hand, it should be noted that the keyword fields in NBC Learn are comprehensively populated and also individually hot-linked to perform an instant keyword search of the entire database.  When converted to MARC 21 format, VAST's records are more complete and standard in content, and the subjects map to LCSH. NBC Learn's records are shorter, and the keywords map to 653 - Index Term-Uncontrolled, resulting in keyword-searchable records, but no controlled subjects. The end result is two good, but quite different, approaches to indexing massive amounts of short video content.

NBC Learn                     | VAST Academic Video Online
Title                         | Title
                              | Translated Title
Transcript                    | Transcript
Citation (MLA, APA, Chicago)  | Citation (MLA, APA, Chicago)
Source                        | Publisher
Producer                      | Institution
Creator                       | Author
                              | Director
                              | Narrator
Air/Publish Date              | Release Date
Event Date                    | Date Recorded
Resource Type                 | Format
                              | Content Type
Clip Length                   | Duration
Copyright                     | Copyright Message
Copyright Date                | Original Release Date
Description                   | Abstract/Summary
                              | Release Notes
                              | Related links
Keywords                      | Subject
                              | Discipline
                              | Field of Interest
                              | Specialized Area of Interest
                              | Topic/Theme
                              | Health Subject
                              | Anthropologist/Ethnographer
                              | Client/Patient Age
                              | Cultural Group Discussed
                              | Featured Speaker
                              | Language Of Edition
                              | Original Language
                              | Person Discussed
                              | Place Discussed
                              | Place Published / Released
                              | Series
                              | Series Number
                              | Therapist
                              | Therapist/Psychologist Gender
                              | Therapist/Psychologist Race/Ethnicity
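To make the 650-versus-653 point concrete, here is a simplified Python sketch of the mapping. The field strings are hand-built for illustration, not output from a real MARC library, and the sample data is invented:

```python
# VAST's LCSH subjects land in MARC 650 (controlled subject) fields, while
# NBC Learn's keywords land in 653 (uncontrolled index term) fields.
def to_marc_subject_fields(record):
    fields = []
    for heading in record.get("lcsh_subjects", []):   # controlled vocabulary
        fields.append(f"650 _0 $a {heading}")
    for keyword in record.get("keywords", []):        # uncontrolled terms
        fields.append(f"653 __ $a {keyword}")
    return fields

vast_clip = {"lcsh_subjects": ["Civil rights movements -- United States"]}
nbc_clip = {"keywords": ["Harper Lee", "To Kill a Mockingbird", "publishing"]}

print(to_marc_subject_fields(vast_clip))
print(to_marc_subject_fields(nbc_clip))
```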

Data ownership indeed

Many thanks to MetaWhat! Data! for bringing this article on "Questions of Data Ownership on Campus" to my attention.  This is a very interesting read, and I appreciated the personal perspective that MetaWhat! Data! brought to her write-up.  I think the concerns about privacy and security are well-founded, and universities must address these concerns as well as be up-front about disclosing what data they collect and how they use it.  However, the article also makes a valid point that universities increasingly need this data to provide competitive services and justify their existence to ever more tightfisted funding agencies.

To underline the intense interest universities now have in collecting and using data, I will list several ideas that I have heard proposed in the last year (caveat:  these are pie-in-the-sky, none are remotely close to implementation):

  • Software that can associate individual student success (i.e. grades, attendance) with instructor teaching or course style (e.g. flipped, lecture, seminar, experiential, etc.) and recommend specific instructors or courses in which the individual student might be more successful.  (For example, at registration, the student selects Physics 201, and the course system says "because you did well with Dr. Jones, Dr. Smith may be a professor you would like," or possibly "other students who did well with Dr. Jones also did well with Dr. Smith."  Actually choosing Dr. Smith's class is left to the student.)
  • Software that counts occupancy of rooms over the course of the school year based on mobile device connections to the building wi-fi.  The occupancy data (anonymized) informs decisions on where to upgrade the facility (e.g. "study rooms always packed, this justifies the cost of adding more" or "seating areas near west-facing windows fill last in summer--improve climate control or install shades").
  • Tracking whether students are more successful when they receive different kinds of library instruction (in-class with peer instructor, with librarian instructor, outside of class with or without extra credit, etc.).
  • Tracking at-risk students to see which additional support services they use most, and which seem to increase success the most.
  • Tracking research database usage from initial search to article download to determine optimal paths to offer users and justify continued subscription expenditures.
Are these great ideas, or really creepy?  Or both, like Google Now?  Is it Big Brother come to campus, or a more benign Big Sister?  Tell me what you think in the comments.