Ontologies Vs. Formats Vs. Schema Vs. APIs
Friday, March 2nd, 2007
The TagCommons Working Group is having a fascinating discussion about the mechanism by which a community can agree to share tag data. Here are some of the options before us:
- point-to-point translation of static data files with a proprietary data format
- point-to-point integration using data derived by crawling and screen-scraping sites
- point-to-point integration using an API (REST, Web Service, etc) that assumes a particular data model and encapsulates the format in code
- point-to-point integration accessing databases with documented schemas
- common content formats, such as microformats and I-tags
- common database schemas using a standard schema definition language
- common ontologies and RDF / SPARQL for interchange
Folks in the group who have tried several of these options described how each might work. Interestingly, this is at the heart of a cultural clash between pragmatists and theoreticians, programming and knowledge representation, Web 2.0 style mashups and Semantic Web style data integration.
This is a context for us to examine why data sharing agreements exist at different levels and the roles of specifications at each. This blog post is an attempt to clarify the potential role of ontology in data sharing agreements, and to hopefully dispel some common misunderstandings.
Commitment and Clarity in Agreements
In any social agreement, there are choices of commitment (how much to require vs how much to let vary), and clarity (in how much detail to specify commitments vs how much to leave to contextual interpretations). The U.S. Constitution, for instance, commits the government to fundamental rights using vague language, yet sets up a very clear judicial mechanism to make interpretations. The early days of HTML and JavaScript were very lenient in what was required (low commitment) but also rather vague on compliance (lots of optional features about which implementers were told to “do their best”). The result was rapid adoption/implementation of the first versions of HTML and then a long medieval period where the de facto standard was bounced around by a feudal war among browser vendors and versions. We have learned from that experience and others like it that a good agreement should require as little commitment as possible from the parties but be as clear as possible about what those commitments mean.
I think ontologies are a technology to make a minimal commitment while being as clear as possible. The minimalism comes by abstracting away from implementation details, which are biased by needs of efficiency and convenience. The clarity comes from careful specification, with at least some of the specification document couched in a formal language that forces one to be explicit about assumptions and the meanings of terms.
At the same time, standards and conventions are almost always propagated by some useful tool, service, or content. So I think we need to consider both the semantic level and the possible format-, schema-, or API-level interfaces that could be consistent with the semantic level. It is not practical to assume that the world will adopt a standard data format or API; the world likes to make many of these to suit individual needs. However, I think it is practical, a pragmatic and not a theoretical choice, to specify a common conceptualization, anticipating and accounting for lower level conventions which exist and which could be built.
Ontological Commitment
That raises another question: is it better to try to get agreement on a single conceptualization (a common denominator) or to enrich it with concepts that can account for extensions and additions which are not universally held? Here I feel there is room for healthy debate, although in my experience one can have one’s cake and eat it too. The notion of ontological commitment can help us here. If we say that an ontology is like a contract or treaty, and the parties agree to commit to the contract, what is the nature of the commitment? In standard programming interfaces, one might be asked to implement all the methods described in an interface specification (and pass all the compliance unit tests, etc.). In the case of an ontology, however, the notion of commitment is to use the vocabulary in a logically consistent manner, but not to require that all queries that can be expressed in the vocabulary can be answered. In other words, there is no implied commitment to data or inferential completeness.
This property allows us to make an ontology that specifies a way of thinking about the world which can represent and access the data from many systems that have different data and offer different computational services. For example, we can say that the common conceptualization of tagging is the assertion of a relation between tagger, tagged item, and tag label; the ontology would give these roles standard names and define axioms that constrain their well-formed use. Then we could say that tag assertions happen at points in time, and specify how this might be stated. This does not imply that all parties to a common ontology of tag data need to offer the date of a tag assertion, only that, if they do have date data, the formal vocabulary for representing it (e.g., in an RDF tuple) is specified in a well-defined way. Similarly, we can say that a tag assertion is a special case of a “bookmark” defined in some other ontology such as Annotea (or the other way around, it doesn’t matter for this point). We could say that tag labels are related in some semantic way to Dublin Core vocabulary, which has its own ontology, or that taggers are related to some notion of person defined in an ontology such as FOAF. Committing to a tag ontology does not imply committing to a bookmarking ontology or a metadata ontology or a personal identity ontology. It just means that the notion of tag assertion and its formal specification is defined to be consistent with these other ontologies.
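To make the tagger / tagged item / tag label relation concrete, here is a minimal sketch in Python. It is only an illustration, not a proposed TagCommons vocabulary: the namespace `http://example.org/tag-ontology#` and the property names (`tagger`, `taggedItem`, `tagLabel`, `assertedAt`) are hypothetical, and plain tuples stand in for RDF triples. The point it tries to show is that the date has a defined place in the vocabulary, yet a system with no date data still produces a well-formed assertion.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Hypothetical namespace and property names, invented for illustration only.
TAG_NS = "http://example.org/tag-ontology#"


@dataclass(frozen=True)
class TagAssertion:
    """One tagging event: a tagger applies a label to an item."""
    tagger: str                             # identifier of the person or agent tagging
    item: str                               # identifier (e.g. a URL) of the tagged item
    label: str                              # the tag label itself
    asserted_at: Optional[datetime] = None  # optional: not every system records it


def to_triples(assertion_id: str, a: TagAssertion) -> list[tuple[str, str, str]]:
    """Express an assertion as (subject, predicate, object) triples."""
    triples = [
        (assertion_id, TAG_NS + "tagger", a.tagger),
        (assertion_id, TAG_NS + "taggedItem", a.item),
        (assertion_id, TAG_NS + "tagLabel", a.label),
    ]
    # The date triple is emitted only when the source system has the data.
    if a.asserted_at is not None:
        triples.append(
            (assertion_id, TAG_NS + "assertedAt", a.asserted_at.isoformat())
        )
    return triples


# One system records dates, another does not; both use the same vocabulary.
with_date = TagAssertion("alice", "http://example.org/page", "ontology",
                         datetime(2007, 3, 2, 12, 0, 0))
without_date = TagAssertion("bob", "http://example.org/page", "semantics")

dated_triples = to_triples("_:a1", with_date)
undated_triples = to_triples("_:a2", without_date)
```

A consumer that asks for the date of the second assertion simply gets no answer. Nothing in the vocabulary has been violated, which is the sense in which there is no commitment to data completeness.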
Thus, an ideal data sharing agreement would include a common conceptualization, developed collaboratively, attempting to account for the needs of many use cases and allowing for the expression of data from many systems. The conceptualization would be specified in an ontology, defined carefully and delivered in standard languages. It would build on and integrate with existing work on other public ontologies. And it would be deliberately designed so that it could be, and easily will be, mapped to existing and desired conventions at the levels of schema, format, and API. Ideally, these mappings would also be delivered in an open forum, and reference examples of data would be available to demonstrate how the whole stack works.
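As a rough sketch of what such mappings might look like, the Python below maps two different lower-level conventions onto one shared conceptualization. Both source shapes are hypothetical and invented for this example: a pipe-delimited static export (`user|url|space separated tags`) and a dictionary of the kind a JSON API might return (`author`, `href`, `tags`, `time`). Neither is the format of any real service.

```python
from typing import NamedTuple, Optional


class Tagging(NamedTuple):
    """The shared conceptualization: who tagged what with which label."""
    tagger: str
    item: str
    label: str
    date: Optional[str] = None  # present only if the source provides it


def from_flat_file(line: str) -> list[Tagging]:
    """Map a hypothetical static export line: 'user|url|space separated tags'."""
    user, url, tags = line.strip().split("|")
    return [Tagging(user, url, t) for t in tags.split()]


def from_api_record(record: dict) -> list[Tagging]:
    """Map a hypothetical API record with author, href, tags and optional time."""
    return [
        Tagging(record["author"], record["href"], t, record.get("time"))
        for t in record["tags"]
    ]


flat = from_flat_file("alice|http://example.org/page|ontology rdf\n")
api = from_api_record({
    "author": "bob",
    "href": "http://example.org/page",
    "tags": ["ontology"],
    "time": "2007-03-02",
})

# Once mapped, data from both sources can be queried together.
merged = flat + api
labels_for_page = sorted(
    {t.label for t in merged if t.item == "http://example.org/page"}
)
```

Each source keeps its own format or API; only the small mapping functions need to know about the common conceptualization.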