Functional Requirements for Sharing Tag Data

The TagCommons effort is operating something like a software project. The process is very lightweight, but has an important step that is often forgotten in discussions about ontologies and formats: for what purposes are we designing this? These days, a good way to look at functional requirements for software is to identify use cases and then derive engineering requirements. This was the first outcome of the working group, and the results are summarized here. We will describe the use cases first and then the requirements.

Use Cases for Sharing Tag Data

Case 1: Personal Bookmarking across tagging sites

Tagging happens in a context and for a purpose. Tagging a photo might happen in the context of viewing the photo on a photo sharing site and for purposes such as researching a topic or sharing with friends. Tagging a news article might happen on a different site for the same research or social objective, but the tagging tool is a different system. This use case is to enable an application that can combine the tags from different content and tag sources (such as the photo site, the news site, and any other bookmarking service) so the user can manage tags across these sources. In short, the problem is to support various tagging goals (tasks, purposes) across disparate contexts (content sources, tagging tools, etc.). The technical requirement is rather obvious: to overcome the fact that every site with tagging data is a silo with, at best, a proprietary API.

Case 2: Browsing and searching others’ tag data across sources

Tag data is not always personal; it can also be useful to see how other people have tagged the same thing, or other things they have tagged under the same label. Some tagging sites provide tools that nicely aggregate collective tagging data (such as folksonomies) to tap into popular opinion, trends, and so forth. Again, the use case is a simple generalization of these browsing and searching services across tag data sources. For example, today you can find common tags used in blog categories on Technorati, but you can’t easily merge them with similar collective tagging data on del.icio.us or Flickr. This case has the same technical requirement as personal tagging, but applied to aggregate data: to be able to compare, connect, and intersect tagging data across these sources without owning the sources.

This use case of aggregate tag data can also be viewed from different perspectives, depending on whether the data are about individual tag assertions or about group behavior or collections of tag facts. Collections of tags, in turn, may be filtered, sorted, or otherwise aggregated from individual tags: for example, the “top ten” tags on an object.
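
As a rough illustration of deriving a collection from individual tag assertions, here is a minimal Python sketch of a “top N” computation pooled across sources. The `TagAssertion` record, the `top_tags` function, and all of the sample users, URLs, and source names are assumptions made up for this example; they are not part of any TagCommons specification.

```python
from collections import Counter
from typing import NamedTuple


class TagAssertion(NamedTuple):
    """One individual tag fact: who tagged what, with which label, on which source."""
    tagger: str
    item: str
    label: str
    source: str


def top_tags(assertions, item, n=10):
    """Return the n most frequent labels applied to `item`, pooled across sources."""
    counts = Counter(a.label for a in assertions if a.item == item)
    return counts.most_common(n)


# Hypothetical sample data; users, URLs, and source names are invented.
assertions = [
    TagAssertion("alice", "http://example.org/photo/1", "sunset", "photo-site"),
    TagAssertion("bob", "http://example.org/photo/1", "sunset", "bookmark-site"),
    TagAssertion("bob", "http://example.org/photo/1", "beach", "bookmark-site"),
    TagAssertion("carol", "http://example.org/article/7", "beach", "news-site"),
]

print(top_tags(assertions, "http://example.org/photo/1", n=2))
# prints [('sunset', 2), ('beach', 1)]
```

The point of the sketch is that the aggregate view only works if assertions from different sources arrive in a comparable shape, which is exactly what the silos prevent today.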

Case 3: Social Re-Search Using Tag Data

Google is great, but it takes work sorting through all the noise. Why should we have to repeat the effort if others have already gone through it? When I enter the term “folksonomy” into a search engine I want to be able to see results that lots of others have tagged with “folksonomy”. I want to find Vander Wal’s writing right away because other people have implicitly marked it as required reading. If some spammer has attempted to hijack the popular tag, I want the masses to reject him with their tagging. I want to be told, without knowing to ask, that the term “folksonomy” correlates with the term “tagging”, because people have tagged the same things with both tags. I want to go to Rojo and see which blogs tagged with this word are most read, across all major blogging systems. Ideally, I want to see what my colleagues around the world have tagged as such, using whatever tagging system they choose. More generally, when I do knowledge work on the web, I want to take advantage of all the work other people have done. I want to discover other people doing the same work, perhaps to share or connect up. This is the vision that launched the Web, and it drives the goal of accelerating human knowledge and understanding. What does it take? Again, this use case requires that there be some way for a service to meaningfully search, compare, and integrate tag data from multiple, independent applications.
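
To make the “folksonomy correlates with tagging” idea concrete, here is one simple way it could be computed: count how many items carry both labels. This is only a sketch of raw co-occurrence counting, not a claim about how any search service does it; the function names and the sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence(assertions):
    """Count, for each pair of labels, how many items carry both labels."""
    labels_by_item = defaultdict(set)
    for _tagger, item, label in assertions:
        labels_by_item[item].add(label)

    pair_counts = defaultdict(int)
    for labels in labels_by_item.values():
        # Sorting makes each pair appear in one canonical order.
        for a, b in combinations(sorted(labels), 2):
            pair_counts[(a, b)] += 1
    return dict(pair_counts)


def related_labels(assertions, label):
    """Labels that share at least one item with `label`, most frequent first."""
    related = {}
    for (a, b), count in cooccurrence(assertions).items():
        if a == label:
            related[b] = count
        elif b == label:
            related[a] = count
    return sorted(related.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical (tagger, item, label) assertions gathered from several sites.
sample = [
    ("u1", "doc-a", "folksonomy"),
    ("u2", "doc-a", "tagging"),
    ("u1", "doc-b", "folksonomy"),
    ("u3", "doc-b", "tagging"),
    ("u3", "doc-b", "ontology"),
    ("u2", "doc-c", "spam"),
]

print(related_labels(sample, "folksonomy"))
# prints [('tagging', 2), ('ontology', 1)]
```

A real service would likely normalize these counts and weigh them by how many distinct people made the assertions, but even this crude version depends on being able to pool assertions from independent applications.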

Case 4: Multimedia Cross Reference

If tagging data were available from multiple sources through a common mechanism, it would also be possible to prebuild a rich reference source of connections. Imagine looking at any web site containing a tagged item and seeing links to images, books, music, videos, games, or any other content that was similarly tagged. Today’s “mashups” are built on point-to-point integrations between proprietary APIs, such as a maps API and a local information database. The same holds for mashups that use tags as the data to correlate information. Point-to-point integrations are inherently limited in scale, and tend to reinforce data hegemony. A public cross reference service would need to be agnostic about the sources over which it is generating links.

Case 5: Organizing Documents Using Tags

In our enthusiasm for HTML and the Web, we often overlook how much information resides in document repositories. Makers of commercial document repositories, CAD systems, or open source systems such as source management systems could expose a tag data access mechanism that would enable add-on services. For example, a code module in an open source project could be tagged for the various purposes to which it has been applied, and others could then find it using the labels in that tagging. By opening up tagging across repositories of any kind of digital document, independent of the APIs of particular applications, the makers and users of document repositories could benefit from the value of collaborative tagging that we see on the Internet.

Case 6: Tag MetaSearch and MetaMonitoring

Already, there are dozens of tagging services, each offering a different slant on the basic collaborative bookmark. Like other players in the web ecosystem, they are finding niches and learning to compete. For example, a site called LibraryThing specializes in tagging books, and has shown that an independent tag space can compete with the large commercial interests that dominate the book distribution channels. Similarly, there is room for competition in the UI and search quality for tag-based search, as shown by RawSugar. On top of these services, this ecosystem will spawn a “metasearch” layer for searching and monitoring across such tagging services. Just as RSS readers monitor and aggregate multiple feeds, and metasearch engines crawl the deep web for things like travel deals, a tag metasearch service will allow users to place bets with their attention based on tags across various tagging services. If the tagging data can be exposed consistently by these tagging services, then the meta sites will be able to offer a more powerful and meaningful service to their users.

Case 7: Social Research on Collective Intelligence

As user-contributed content on the Web becomes a mainstream phenomenon, more and more people will be exposed to the concept of tagging. Their collective tagging activity offers an unprecedented opportunity to study how people learn from groups at different scales, how they use language, how social trends emerge and evolve, and many other research topics. An open mechanism for getting this data, and a mechanism that encourages meaningful comparisons across a wide variety of sources, would be an extremely valuable resource for this area of research.

Case 8: Distributing tagged information to the Semantic Web

Up to now we have been silent about mechanisms for sharing, but an obvious candidate is the Semantic Web. The vision of the Semantic Web is to allow powerful computation over the structured data that runs through the unstructured content on the Web. The Semantic Web has a standards-based set of mechanisms for data sharing, including, notably, ontologies for expressing agreements on meaning and a tuple-based format for exposing data called RDF. Tag data can be viewed as a kind of structured data (at the very simplest, tuples relating people, labels, and objects), and some of the more interesting scenarios of tag data integration would involve automated reasoning over these tuples. For example, to find other people who share your interest in some topic, you might ask a service that can find tags which are related to your topic, sources of tagging data mentioning those tags, and people associated with those tag assertions in those sources. Sophisticated reasoning could be employed to decide, for instance, whether two users on different sites are the same person, or whether two uses of the same tag label are about the same concept or word sense. The technical requirements for tagging systems to participate in the Semantic Web include the need to define an ontology of tagging data that could be used across the variety of tagging systems that could participate. An excellent example of this is the Revyu.com site, a user review site with tagging that exposes its tag data in RDF using a tag ontology.
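
To show what “tuples relating people, labels, and objects” might look like when broken into subject-predicate-object statements, here is a small Python sketch. It deliberately uses plain tuples rather than real RDF tooling, and the `EX` namespace, the predicate names, and every identifier under example.org are invented for illustration; they are not taken from Revyu or from any published tag ontology.

```python
# Hypothetical vocabulary: these predicate names and example.org identifiers
# are illustrative only and do not come from any published tag ontology.
EX = "http://example.org/tagging#"


def tagging_triples(assertion_id, tagger, item, label, source):
    """Express one tag assertion as (subject, predicate, object) triples."""
    return [
        (assertion_id, EX + "taggedBy", tagger),
        (assertion_id, EX + "taggedItem", item),
        (assertion_id, EX + "tagLabel", label),
        (assertion_id, EX + "fromSource", source),
    ]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


def taggers_using(triples, label):
    """Find the people whose tag assertions carry the given label."""
    ids = {s for (s, _p, _o) in match(triples, predicate=EX + "tagLabel", obj=label)}
    return sorted(
        o for (s, _p, o) in match(triples, predicate=EX + "taggedBy") if s in ids
    )


ALICE = "http://example.org/people/alice"
BOB = "http://example.org/people/bob"
CAROL = "http://example.org/people/carol"

store = []
store += tagging_triples("_:a1", ALICE, "http://example.org/doc/1", "folksonomy", "site-a")
store += tagging_triples("_:a2", BOB, "http://example.org/doc/1", "ontology", "site-a")
store += tagging_triples("_:a3", CAROL, "http://example.org/doc/2", "folksonomy", "site-b")

print(taggers_using(store, "folksonomy"))
```

The query finds Alice and Carol even though their assertions come from different sources, which is the kind of cross-source question the “find other people who share your interest” scenario relies on. Deciding that two such identifiers denote the same person is the harder reasoning problem mentioned above, and this sketch does not attempt it.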

Functional Requirements for Sharing Mechanisms

Given these use cases, we can now rationally proceed with identifying the technical functionality required to enable them. (If we had started from functionality, we might have become lost in “wouldn’t it be cool if” discussions, which rarely converge.)

There are several ways one could imagine enabling tag data sharing. One is to have a central authority run a giant tagging service in the sky, but of course that is not in the spirit or culture of the Web. Another is to create a piece of open source software that makes it compellingly cheaper and better to use the open source code than to build your own tagging features. The open source software could then expose an API that all the world could access. This approach has missed the window of opportunity, since tagging is already built into many proprietary systems and because tagging isn’t that hard to implement.

We can, however, envision a kind of interoperability mechanism, whether an official Standard or just a Web convention, in which tag data can be exposed and accessed. Given the reality of heterogeneous systems, sources, and tagging contexts, such a mechanism could not dictate a single, standardized data model for tagging. Instead, it could specify the concepts of tagging in such a way that when the data is exposed, it can be meaningfully mapped, matched, compared, combined, or otherwise manipulated. As has been argued elsewhere, a tag ontology (an “ontology of folksonomy”) could be core to this approach. On top of the ontology, Semantic Web technologies can be used to expose the data, connect to other sources of data, and enable services that act on the data.

In general, our process for developing an ontology-based data sharing mechanism goes like this. We identify a common conceptualization, and work out a specification at the semantic level. We identify and build systems that commit to the specifications at various levels of commitment, and hook up the ecosystem. In particular, we come up with a conceptualization of tagging that enables the power we want while allowing innovation in implementation, optimization, and extension. We hash out those concepts that are clear, and try to make unambiguous definitions for terms. We identify those concepts that are vague, and set out to clarify them. And we lay out a conceptual framework for identifying those areas where systems will differ.

So, what does this have to do with functional requirements? Ontologies are designed, like software, against functional requirements. The use cases tell us which distinctions are required in the conceptualization. They tell us where concepts need to be made explicit. And the process of defining the ontology helps us see the differences between implementation-level details that can be abstracted away and fundamental distinctions that are inherent in the kind of data we want to share.

All that having been said, analysis of the use cases above led to identifying the following requirements for a tag ontology.

  1. Core Concepts. Tag data should include, at a minimum, the relation between a person doing the tagging, an object tagged, and a symbolic label for the tag. This information is found in all tag data sources, and is core to the services we envision enabling. Other information might also be found, but it is auxiliary to this basic concept. For example, it is possible to capture the date of a tag relationship being asserted, but the information in the relation can be meaningfully used without knowing its date. Similarly, there might be information about a set of tag labels or tagged objects or tagging people, but the meaning of these sets could be derived from the notion that they are involved in this basic relation. The requirement for a tag ontology is to clarify the core concept and how it can relate to other information that is auxiliary or derivative.
  2. Tag Data Sources. There must be a way to get tag data from multiple sources, retrieving data by person, tagged object, or label. For example, there needs to be some way to say “find all the tag assertions about this item from that source” and “find all items tagged with this label on that source”. The ontology must allow these distinctions, and therefore it must account for the identification of tag data sources.
  3. Auxiliary and Derived Information. Among the data found in existing tagging systems that could be addressed in this ontology are the date of tag assertion, the language of the tag label, the polarity of an assertion (allowing for negative tagging to vote against spam), and a notion of privacy or access control. Some systems have a notion of “tag type” or “tagged item type”, which also can be used to constrain searches and comparisons. In order to enable the use cases outlined above, we do not have to specify a semantics or representation for all of these data. The requirement is that the ontology design consider these data and allow some mechanism to account for them, even if it is optional.
  4. Identity and Matching. In order to meaningfully retrieve tag data or compare it across sites, there must be some source-independent way of specifying whether two tag assertions are the same or different. In particular, there needs to be a way to identify taggers, tagged objects, and tag labels, and some way to establish whether two taggers, tagged objects, or tag labels are the same, given a representation of their identity. This is a nontrivial requirement; for example, determining the identity of people on the web is its own area of research.
  5. Namespaces and Entity Mappings. Because different tag data sources will use their own proprietary namespaces and identity management systems, a data sharing agreement will need to allow for services that map the identities of taggers, tagged objects, and tag labels across systems. For example, there needs to be expressive power to say that one source ignores case in tag labels, and another canonicalizes multiword phrases as atomic tokens. In addition to matching the string value of tag labels, the agreement must address the issues of how a service might support matching on synonyms, variants of word morphology, lexical context (where a tag label has different meanings if in the context of other labels in the same tagging assertion), translation across natural languages, alternative identities for the same person, and alternative identities for the same object.
  6. Mappings to Standards. Given the open and distributed nature of the applications envisioned, the agreement should address how it would be mapped to related notions in proposed standards or conventions, such as microformats and other ontologies (e.g., FOAF, SKOS, Annotea, Dublin Core).
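
To make a few of these requirements more tangible, here is a minimal Python sketch covering the core relation (1), retrieval by source, tagger, item, or label (2), optional auxiliary data (3), and one very simple label mapping (5). Everything in it is an assumption for illustration: the `TagAssertion` fields, the `find` function, the `match_key` canonical form (ignore case and word separators), and the sample data. It is not a proposed ontology, and it leaves identity matching for people and objects (4) untouched.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class TagAssertion:
    # Requirement 1: the core relation of tagger, tagged object, and label.
    tagger: str
    item: str
    label: str
    # Requirement 2: the source that exposed this assertion.
    source: str
    # Requirement 3: auxiliary data, all optional.
    asserted_on: Optional[date] = None
    language: Optional[str] = None
    positive: bool = True  # False models a negative tag, such as a vote against spam


def match_key(label: str) -> str:
    """One possible canonical form (an assumption): ignore case and word separators."""
    return "".join(ch for ch in label.casefold() if ch.isalnum())


def find(assertions, *, source=None, tagger=None, item=None, label=None):
    """Retrieve assertions by any combination of source, tagger, item, and label."""
    key = match_key(label) if label is not None else None
    return [
        a
        for a in assertions
        if (source is None or a.source == source)
        and (tagger is None or a.tagger == tagger)
        and (item is None or a.item == item)
        and (key is None or match_key(a.label) == key)
    ]


# Hypothetical sample data; names, URLs, sources, and the date are invented.
data = [
    TagAssertion("alice", "http://example.org/doc/1", "Semantic Web", "site-a"),
    TagAssertion("bob", "http://example.org/doc/1", "semanticweb", "site-b",
                 asserted_on=date(2007, 3, 1)),
    TagAssertion("carol", "http://example.org/doc/2", "semantic_web", "site-b",
                 positive=False),
]

# "find all the tag assertions about this item from that source"
about_doc1_on_b = find(data, source="site-b", item="http://example.org/doc/1")

# "find all items tagged with this label on that source"
items_on_b = sorted({a.item for a in find(data, source="site-b", label="Semantic Web")})

print([a.tagger for a in about_doc1_on_b])  # prints ['bob']
print(items_on_b)
```

Note that `match_key` handles only the easiest part of requirement 5. Synonyms, morphology, lexical context, and translation would each need their own mapping service, which is why the agreement has to leave room for them rather than hard-code one rule.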

Prior Work

In addition to the TagCommons Working Group, other sources of inspiration for this analysis are cited in hyperlinks inline in the text.
