More 2003 IA Summit - Trip Reports
Stacy Surla shares her thoughts on the IA Summit. Comment to ssurla@aspensys.com.

Metadata Harvesting
Karl Fast, University of Western Ontario

Karl Fast presented a vision of a future in which metadata migrate out of the vertical collections, or silos, where they mostly reside today and play a more active part in getting the content they represent into users' hands. He sees this happening through "metadata harvesting" systems that build on existing technologies and protocols.

The vision is quite thought-provoking. In the first part of his presentation, Fast described the current state of metadata affairs and argued the benefits of logical next steps towards making metadata work in our present larger context.

Fast's practical discussion of currently available technological tools was icing on the cake. Whether metadata harvesting takes off using these or different tools is almost irrelevant to the thesis of the presentation. With new ways of looking at metadata and new tools becoming available, theory may well break through into practice in the near future. Fast's PowerPoint presentation is an experience.

Overview of Metadata Today

Metadata is fundamental to retrievability in a networked world. However, metadata exchange over the web remains ad hoc, messy, and informal. Metadata are distributed, not centralized, and heterogeneous, not standardized. As these realities impact our networked lives, the way we think about metadata is changing.

Definitions of metadata include the following:

  • Metadata is data about data. This is highly accurate but not very helpful.

  • Metadata is a fielded record of summary information about a document. This index-card model is often how metadata is thought of and created, but it is also a rather narrow description.

  • Metadata is whatever form we use to create metadata. This includes not only the index card, but also HTML header tags, passports, sheet music, product brochures, and more. Again, quite useful but somewhat restrictive when trying to look at the big picture.

  • Metadata is a document surrogate. In other words, metadata is something that stands for the main item of interest. Fast favors this definition as an improvement on some of the others - partly for the additional properties a word like "surrogate" can suggest, as described a little further below.

Reasons to use metadata include:

  • Size: Metadata are smaller and more nimble than the documents they represent.

  • Intellectual property: Abstracts, for example, can provide access to the pith of a document while avoiding intellectual property problems.

  • Non-textual documents: Metadata can represent objects other than text, and provide retrievability where indices constructed on the basis of text searches cannot do so.

  • Additional information: Metadata can contain information that is not part of the document.
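
The reasons above can be illustrated with a minimal sketch of a surrogate record. The class name, field names, and sample values here are hypothetical, chosen only for illustration and not taken from the presentation. The sketch shows a small record standing in for a non-textual object, searched without touching the object itself.

```python
from dataclasses import dataclass, field


@dataclass
class Surrogate:
    """A small fielded record that stands in for a larger document.

    All field names are illustrative assumptions, not a standard schema.
    """
    title: str
    creator: str
    abstract: str                     # conveys the gist without the full text
    media_type: str = "text/html"     # lets non-textual objects be described
    extra: dict = field(default_factory=dict)  # information not in the document


def matches(record: Surrogate, term: str) -> bool:
    """Case-insensitive search over the surrogate, never the document itself."""
    haystack = " ".join([record.title, record.creator, record.abstract])
    return term.lower() in haystack.lower()


# A made-up record for an image, which a full-text index could not search.
photo = Surrogate(
    title="Harbour at dusk",
    creator="A. Photographer",
    abstract="Colour photograph of fishing boats at sunset.",
    media_type="image/jpeg",
    extra={"rights": "licensed, thumbnail only"},
)

print(matches(photo, "sunset"))
print(matches(photo, "mountain"))
```

The record is a few hundred bytes regardless of how large the photograph is, and the `extra` field carries rights information that appears nowhere in the image.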

The Current Search and Metadata Situation

Here is how search engines work from a metadata perspective:

  • Information provider: A web site contains information.

  • Search provider: A search robot goes out to a web site and spiders it for content. The metadata it collects is held in a repository.

  • User: A user's browser searches the search provider's repository for results.

Interesting issues with the current system:

  • Not coordinated: There is no formal coordination between information providers and search providers.

  • Not synchronized: The process is perpetually incomplete and out of sync. The search provider's repository gives neither real-time nor complete results, only the most recently indexed ones.

The Vision

Fast suggests that metadata harvesting offers solutions to these issues. It provides a framework for coordinated exchange of metadata, including:

  • Synchronous search and retrieval
  • Asynchronous aggregation

The Tools

The Open Archives Initiative Protocol for Metadata Harvesting (OAI protocol) provides a slightly but significantly different model for search than the one currently used. It enables metadata to become recombinant and migratory. This is how metadata harvesting works:

  • Information provider: In addition to the web site and all the information it contains, the information provider also offers a Metadata Repository which contains metadata about the site. The repository exposes at least Dublin Core, to comply with a minimal metadata standard.

  • Search provider: A search robot goes out to the Metadata Repository associated with a web site and harvests its content. It can also spider the whole site. The metadata it collects is held in its aggregated repository.

  • User: A user's browser searches the search provider's repository for results.

The OAI protocol has a number of attractive features. It uses open standards (XML, HTTP), is cross-platform, and has a low barrier to entry. It can use any interface. A useful one that has already been developed is called OAIster (as in "oyster"). It is being used as part of a metadata harvesting project at the University of Michigan.
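
To make the harvesting step concrete, here is a minimal sketch in Python of what a harvester might do: build a ListRecords request and pull Dublin Core fields out of the XML reply. The base URL, the sample response, and the function names are assumptions made up for illustration; only the verb, the oai_dc metadata prefix, and the XML namespaces come from the OAI protocol and Dublin Core. No network call is made, so the reply is an inline sample.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def q(namespace, tag):
    """Return a namespace-qualified tag name in ElementTree's {ns}tag form."""
    return "{%s}%s" % (namespace, tag)


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None):
    """Build a ListRecords request; from_date limits the harvest to newer records."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Extract identifier, titles, and creators from a ListRecords reply."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for rec in root.iter(q(OAI_NS, "record")):
        path = q(OAI_NS, "header") + "/" + q(OAI_NS, "identifier")
        records.append({
            "identifier": rec.findtext(path),
            "titles": [e.text for e in rec.iter(q(DC_NS, "title"))],
            "creators": [e.text for e in rec.iter(q(DC_NS, "creator"))],
        })
    return records


# Made-up reply from a hypothetical repository at example.org.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:1</identifier>
        <datestamp>2003-03-20</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An Example Title</dc:title>
          <dc:creator>Example, Author</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""

print(list_records_url("http://example.org/oai", from_date="2003-03-01"))
for record in parse_records(SAMPLE_RESPONSE):
    print(record["identifier"], record["titles"], record["creators"])
```

Because the request is an ordinary HTTP GET and the reply is plain XML, a harvester like this needs nothing beyond a standard library, which is one way to read the "low barrier to entry" claim.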

Other Related Initiatives

To put metadata harvesting in context, Fast listed a number of other search and retrieval systems that are kinds of metadata harvesting, or use similar concepts, or will require metadata harvesting to be successful. These include:

  • Napster: A distributed system, wherein users' computers store content that other users can access directly. The information on the location of resources is kept in the metadata repository on a Napster central index server.

  • Gnutella: A very distributed system, wherein each user's computer has not only files but indexes (distributed repositories) to the location of other files downstream. The client does the harvesting.

  • Z39.50 search protocol standard: This ANSI/NISO protocol facilitates communication between clients and servers. It enables the user to search remote databases, identify records which meet specified criteria, and retrieve some or all of the identified records. It enables uniform access to a large number of diverse and heterogeneous information sources. The relationship of Z39.50 to OAI is analogous to that between SGML and XML. In other words, OAI greatly simplifies putting the principles of search into action.

  • RSS: This is a newsfeed protocol used in blogging. (RSS stands for "RDF Site Summary," where RDF is the Resource Description Framework.) It is similar to the OAI model, except that instead of a synchronous metadata repository being part of a website, an RSS file, representing whatever should go into the feed, is provided as a separate document. An RSS aggregator accesses this file to pull a feed that can be published directly within, for instance, a website or a PDA. RSS contains no concept of queries or archives. In a way, it is OAI-lite.

  • The Semantic Web: In this great and still-elusive vision of the web, metadata are highly granular and embedded in documents. These metadata are meant to enable a user's web agents to browse, select, and do things with content they find on the web. Metadata harvesting will be crucial, Fast states, to the success of the Semantic Web.
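
The RSS item in the list above describes an aggregator pulling feed files and republishing their entries. A rough sketch of that step follows. It assumes the simple un-namespaced rss/channel/item layout rather than the RDF-based variant, and the feeds, URLs, and function names are invented for illustration.

```python
import xml.etree.ElementTree as ET


def read_feed(xml_text):
    """Return the title and link of every item in one RSS document."""
    root = ET.fromstring(xml_text.strip())
    return [
        {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        }
        for item in root.iter("item")
    ]


def aggregate(feeds):
    """Merge the items of several feeds into one list, in the order given."""
    merged = []
    for xml_text in feeds:
        merged.extend(read_feed(xml_text))
    return merged


# Two made-up feeds standing in for files fetched from separate sites.
FEED_A = """
<rss version="2.0"><channel><title>Feed A</title>
<item><title>First post</title><link>http://example.org/a/1</link></item>
<item><title>Second post</title><link>http://example.org/a/2</link></item>
</channel></rss>
"""

FEED_B = """
<rss version="2.0"><channel><title>Feed B</title>
<item><title>Other news</title><link>http://example.org/b/1</link></item>
</channel></rss>
"""

for entry in aggregate([FEED_A, FEED_B]):
    print(entry["title"], entry["link"])
```

The aggregator can only take whatever the file currently holds. It has no way to ask for a subset or for older entries, which matches the observation that RSS has no concept of queries or archives.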

Interesting and Weird Thoughts

Fast spent a little time talking about the possible role of metadata not only as surrogates for documents, but possibly also playing the role of agents (or ambassadors or stunt doubles), in a sense. While surrogates are static, if they could adapt to new contexts they would be dynamic. In addition to immutable properties (e.g. title, author, creation date), metadata might also have mutable properties (e.g. relationships to other surrogates, "aboutness" descriptors that are somehow dynamic). He referenced Carl Lagoze's work in this area.
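
One way to picture the split between immutable and mutable properties is the sketch below. It is purely illustrative: the class names, fields, and the `adapt` method are hypothetical and are not drawn from Fast's talk or from Lagoze's work.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CoreProperties:
    """Properties that never change once the surrogate is created."""
    title: str
    author: str
    created: str


@dataclass
class AdaptiveSurrogate:
    """A surrogate with a fixed core plus properties that vary by context."""
    core: CoreProperties
    related: set = field(default_factory=set)       # identifiers of other surrogates
    aboutness: dict = field(default_factory=dict)   # context name -> descriptors

    def adapt(self, context, descriptors):
        """Record how the document should be described in a given context."""
        self.aboutness[context] = list(descriptors)


doc = AdaptiveSurrogate(CoreProperties("Sample Report", "A. Writer", "2003-03-23"))
doc.related.add("oai:example.org:2")
doc.adapt("library", ["cataloguing", "resource discovery"])
doc.adapt("web search", ["crawling", "indexing"])

print(doc.core.title)
print(sorted(doc.aboutness))
```

The frozen core rejects reassignment of title, author, or creation date, while the relationships and "aboutness" descriptors can grow as the surrogate travels to new contexts.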

Summary

The presentation was a great overview of metadata and metadata tools in the context of our current environment. This overview served as background to the articulation of Fast's vision of metadata exchange as it might properly function in a distributed, heterogeneous, networked environment.

 

For more information, please contact dc-ia-owner@yahoogroups.com