Managed Vocabularies, Ontologies, and an Enterprise Scale Knowledge Graph: A Primer

This blog post provides a primer on managed vocabularies and ontologies, how they relate to a knowledge graph, and how they can be implemented in one. It is an excerpt from the ebook The Rise of the Knowledge Graph.

Managed vocabularies are not a new thing; the West Key Number System dates back to the late 19th century! Most enterprises today, regardless of size, use some form of controlled vocabulary. Large organizations typically have a meta-index, a list of their controlled vocabularies. But let’s take a closer look at how these vocabularies are actually managed in many enterprises.


On the face of it, a vocabulary is a simple thing: just a list of names and identifiers. A vocabulary can be easily represented as a spreadsheet with only a couple of columns, or as a simple table in a relational database. But suppose you want to build a relational database that uses a controlled vocabulary, say, one that lists the states in the United States. You build a table with 50 rows and two columns, using the two-letter postal code as the key to that table. There aren’t any duplicates, because the post office assigns each state a unique code, so the table is easy for a relational database system to manage. Any reference to a state, anywhere in that database, refers to that table, using the key.
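The state table described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `STATES`, `add_state`, `customers`, and `state_name` are hypothetical, only three rows are shown, and a dictionary stands in for the relational table and its primary key.

```python
# Hypothetical sketch: a controlled vocabulary as a table keyed by postal code.
# Only three rows are shown; a full table would have one row per state.
STATES = {}


def add_state(code, name):
    """Add a row, rejecting duplicate keys as a primary-key constraint would."""
    if code in STATES:
        raise ValueError(f"duplicate key: {code}")
    STATES[code] = name


for code, name in [("CA", "California"), ("NY", "New York"), ("TX", "Texas")]:
    add_state(code, name)

# Another table refers to a state only through the key, never by name.
customers = [
    {"name": "Ada", "state": "NY"},
    {"name": "Grace", "state": "CA"},
]


def state_name(customer):
    """Resolve a reference by looking up its key in the vocabulary."""
    return STATES[customer["state"]]
```

Because every reference goes through the key, renaming an entry in `STATES` changes what every referring record resolves to, which is the property a single database gets for free.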

But what happens when we have multiple applications in our enterprise, using multiple databases? Each of them could try to maintain the same table in each database. But, of course, maintaining multiple copies of a reference vocabulary defeats many of the purposes of having a controlled vocabulary at all. Suppose we have a change of policy—we decide that we want to include US territories (Puerto Rico, Guam, American Samoa, etc.), as well as the District of Columbia, and treat them as states. We have to find all these tables and update them accordingly. Since the data is repeated across multiple systems, do we actually know where it even is, let alone that it is consistent and unambiguous?

In enterprises that manage a large number of applications, it is not uncommon to find several, or even dozens, of different versions of the same reference vocabulary. Each of them is represented in some data system in a different way, typically as tables in a relational database. There is no single place where the reference data, or knowledge, is managed.
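To make the drift problem concrete, here is a minimal sketch that compares copies of the same vocabulary held by different systems. The system names (`billing`, `shipping`) and the function `diff_copies` are assumptions made up for this example; they do not describe any particular product.

```python
def diff_copies(copies):
    """For each system, return the codes found elsewhere but missing locally."""
    all_codes = set().union(*(table.keys() for table in copies.values()))
    return {system: all_codes - set(table) for system, table in copies.items()}


# Hypothetical state of two systems after a policy change that added
# Puerto Rico: billing was updated, shipping was not.
copies = {
    "billing": {"NY": "New York", "CA": "California", "PR": "Puerto Rico"},
    "shipping": {"NY": "New York", "CA": "California"},
}

missing = diff_copies(copies)
```

Here `missing["shipping"]` contains `"PR"`, while `missing["billing"]` is empty. A check like this only helps once you know where every copy lives, which is exactly the knowledge that tends to be missing.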

This is the most common state of practice in enterprises today. The value of reference knowledge is acknowledged and appreciated, and that knowledge is even explicitly represented. But it is not well managed at an enterprise level, and hence fails to deliver many of its advantages.

What Is an Ontology?

We have already seen, in our discussion of controlled vocabularies, a distinction between an enterprise-wide vocabulary and the representations of that vocabulary in various data systems. The same tension happens on an even larger scale with the structure of the various data systems in an enterprise; each data system embodies its own reflection of the important entities for the business, with important commonalities repeated from one system to the next.

We’ll illustrate an ontology with a simple example: an online bookstore. How would you describe this business? One way to approach it would be to say that the bookstore has customers and products. Furthermore, there are several types of products: books, of course, but also periodicals, videos, and music.

There are also accounts, and the customers have accounts of various sorts. Some are simple purchase accounts, where the customer places an order, and the bookstore fulfills it. Others might be subscription accounts, where the customer has the right to download or stream some products.

All of these things (customers, products, accounts, orders, subscriptions, etc.) exist in the business. They are related to one another in various ways; an order is for a product, for a particular customer. There are different types of products, and the ways in which they are fulfilled are different for each type. An ontology is a structure that describes all of these types of things and how they are related to one another.

More generally, an ontology is a representation of all the different kinds of things that exist in the business, along with their interrelationships.

An ontology for one business can be very different from an ontology for another. For example, an ontology for a retailer might describe customers, accounts and products. For a clinic, there are patients, treatments and visits.

The simple bookstore ontology is intended for illustrative purposes only; it clearly has serious gaps, even for a simplistic understanding of the business of an online bookstore. For example, it has no provision for order payment or fulfillment. It includes no consideration of multiple orders or quantities of a particular product in an order. It includes no provision for an order from one customer being sent to the address of another (e.g., as a gift). But, simple as it is, it shows some of the capabilities of an ontology, in particular:

An ontology can describe how commonalities and differences among entities can be managed. All products are treated in the same way with respect to placing an order, but they will be treated differently when it comes to fulfillment.
An ontology can describe all the intermediate steps in a transaction. If we just look at an interaction between a customer and the online ordering website, we might think that a customer requests a product. But in fact, the website will not place a request for a product without an account for that person, and, since we will want to use that account again for future requests, each order is separate and associated with the account.

All of these distinctions are represented explicitly in an ontology. For example, see the ontology in Figure 1. The first thing you’ll notice about Figure 1 is probably that it looks a lot like graph data; this is not an accident. Graphs are a sort of universal data representation mechanism; they can be used for a wide variety of purposes. One purpose they are suited for is to represent ontologies, so our ontologies will be represented as graphs, just like our data. When ontologies are combined with graph data to form a knowledge graph, this uniformity in representation (i.e., both data and ontologies are represented as graphs) simplifies the combination process greatly.

The nodes in this diagram indicate the types of things in the business (Product, Account, Customer); the linkages represent the connections between things of these types (an Account belongs to a Customer; an Order requests a Product). Dotted lines in this and subsequent figures indicate more specific types of things: Books, Periodicals, Videos, and Audio are more specific types of Product; Purchase Account and Subscription are more specific types of Account.

Figure 1. A simple ontology that reflects the enterprise data structure of an online bookstore.
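As a rough illustration of how an ontology like this can be written down as graph data, the sketch below encodes the relationships described in the text as subject, link, object triples. The link labels (`belongsTo`, `placedOn`, `requests`, `subClassOf`) and the helper names are assumptions chosen for this example; the figure's own labels may differ.

```python
# Each triple is (source node, link label, target node).
# Link labels are assumed names; the text describes only the relationships.
ONTOLOGY = [
    ("Account", "belongsTo", "Customer"),
    ("Order", "placedOn", "Account"),
    ("Order", "requests", "Product"),
    ("Book", "subClassOf", "Product"),
    ("Periodical", "subClassOf", "Product"),
    ("Video", "subClassOf", "Product"),
    ("Audio", "subClassOf", "Product"),
    ("PurchaseAccount", "subClassOf", "Account"),
    ("Subscription", "subClassOf", "Account"),
]


def subtypes(parent):
    """Return the more specific types of a node (the dotted lines)."""
    return sorted(s for s, p, o in ONTOLOGY if p == "subClassOf" and o == parent)


def relationships(node):
    """Return the non-hierarchical links that start at a node (the solid lines)."""
    return [(p, o) for s, p, o in ONTOLOGY if s == node and p != "subClassOf"]
```

With this encoding, `subtypes("Account")` yields the two account types and `relationships("Order")` shows that an order is tied both to an account and to a product.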

Put yourself in the shoes of a data manager, and have a look at Figure 1; you might be tempted to say that this is some sort of summarized representation of the schema of a database. This is a natural observation to make, since the sort of information in Figure 1 is very similar to what you might find in a database schema. But there is a key difference: every item in an ontology like the one in Figure 1 (i.e., every node, every link) can be referenced from outside the ontology. In contrast to a schema for a relational database, which only describes the structure of that database and is implicit in that database implementation, the ontology in Figure 1 is represented explicitly for use by any application, in the enterprise or outside. A knowledge graph relies on the explicit representation of ontologies, which can be used and reused, both within the enterprise and beyond.

Just as was the case for vocabularies, we can have ontologies that have a very general purpose and could be used by many applications, as well as very specific ontologies built for a particular enterprise or industry.

Example Enterprise Ontology Use Cases

A model of media productions

A television and movie production company manages a wide variety of properties. Beginning with the original production of a movie or television show, there are many derivative works: translations of the work into other languages (dubbed or with subtitles), releases in markets with different commercial break structure, limited-edition broadcasts, releases on durable media, streaming editions, etc. To coordinate many applications that manage these things, the company develops an ontology that describes all these properties and the relationships between them.

Process model for drug discovery

A pharmaceutical company needs to usher several drugs through a complex pipeline of clinical trials, tests, and certifications. The process involves an understanding not only of drugs and their interactions with other chemicals, but also of the human genome and cascades of chemical processes. The company builds an ontology to describe all of these things and their relationships.

Government model for licensing

A government agency manages licenses for many purposes: operating motor vehicles, marriage licenses, adoptions, business licenses, etc. Many of the processes (identification, establishing residency, etc.) are the same for all the forms of license. The agency builds an ontology to model all the requirements and expresses constraints on the licenses in those terms.

A key role that an ontology plays in enterprise data management is to mediate variation between existing data sets. Each data set exists because it successfully satisfies some need for the enterprise. Different data sets will overlap in complex ways; the ontology provides a conceptual framework for tying them together.
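One way to picture this mediating role is a mapping from each data set's own column names to shared ontology terms. The sketch below is an assumption-laden illustration: the data set names (`orders_db`, `crm`), their columns, and the function `to_ontology_terms` are all hypothetical.

```python
# Hypothetical column-to-ontology mappings for two existing data sets.
MAPPINGS = {
    "orders_db": {"cust_no": "Customer", "sku": "Product"},
    "crm": {"client_id": "Customer", "acct": "Account"},
}


def to_ontology_terms(source, record):
    """Re-key a record by ontology term, dropping columns with no mapping."""
    mapping = MAPPINGS[source]
    return {mapping[col]: value for col, value in record.items() if col in mapping}


order_row = to_ontology_terms("orders_db", {"cust_no": 17, "sku": "B-001", "ts": "n/a"})
crm_row = to_ontology_terms("crm", {"client_id": 17, "acct": "A-9"})
```

After translation, both rows expose the customer under the same term, `Customer`, even though the source systems call it `cust_no` and `client_id`. Neither data set had to change its own structure to be tied to the other.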

Explicit ontologies provide a number of advantages in enterprise data management:

Since the ontology itself is a data artifact (managed in RDF), it can be searched and queried (e.g., using SPARQL).
Regulatory requirements and other policies can be expressed in terms of the ontology, rather than in terms of specific technical artifacts of particular applications. This allows such policies to be centralized and not duplicated throughout the enterprise data architecture.
The ontology provides guidance for the creation of new data artifacts, so that data designers don’t have to start from scratch with every data model they build.
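The first advantage, that an ontology can be queried like any other data, can be illustrated without an RDF store. The sketch below uses plain Python tuples and a small pattern matcher as a stand-in for SPARQL; the triples, the `match` function, and the link labels are assumptions for this example, and a real deployment would run the query against an RDF store instead.

```python
# A few ontology statements as (subject, predicate, object) triples.
TRIPLES = [
    ("Book", "subClassOf", "Product"),
    ("Video", "subClassOf", "Product"),
    ("Order", "requests", "Product"),
    ("Account", "belongsTo", "Customer"),
]


def match(pattern, triples=TRIPLES):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [
        triple
        for triple in triples
        if all(want is None or want == got for want, got in zip(pattern, triple))
    ]


# Roughly what a SPARQL query such as
#   SELECT ?s WHERE { ?s :subClassOf :Product }
# would ask, using a hypothetical default prefix.
product_types = [s for s, _, _ in match((None, "subClassOf", "Product"))]
```

The same matcher answers questions such as "what refers to Product?" with `match((None, None, "Product"))`, which is the kind of question a policy author might ask before writing a rule in terms of the ontology.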

Ontology Versus Data Model

In the ebook The Rise of the Knowledge Graph, we made the distinction between reference and conceptual knowledge, where reference knowledge is represented by a vocabulary and conceptual knowledge is represented by an ontology.

An ontology is conceptual in that it describes the entities, or concepts, in the business domain, independently of any particular application. An application might omit some concepts that are not relevant to its use; for example, a fulfillment application only needs to know about the product and the recipient. It doesn’t need to know about the distinction between a customer and an account, and might not include any reference to those concepts at all. Conversely, the data model for such an application might include artifacts specific to fulfillment that are not even visible to the rest of the business, such as tracking information for third-party delivery services. The ontology coordinates all the data models in the enterprise to provide a big-picture view of the enterprise data landscape.
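The relationship between an application data model and the ontology can be sketched as a simple comparison. Everything named here is an assumption for illustration: the field names, the `compare` function, and in particular the `Recipient` concept, which the simple bookstore ontology does not include.

```python
# Concepts in the enterprise ontology; "Recipient" is an assumed addition.
ONTOLOGY_CONCEPTS = {"Customer", "Account", "Order", "Product", "Recipient"}

# A hypothetical fulfillment data model: each field maps to an ontology
# concept, or to None when it is an artifact local to the application.
FULFILLMENT_MODEL = {
    "product": "Product",
    "recipient": "Recipient",
    "tracking_number": None,
}


def compare(model, concepts):
    """Report which concepts a data model uses, omits, and adds locally."""
    used = {concept for concept in model.values() if concept is not None}
    unknown = used - concepts
    if unknown:
        raise ValueError(f"not in the ontology: {sorted(unknown)}")
    return {
        "used": sorted(used),
        "omitted": sorted(concepts - used),
        "local": sorted(field for field, concept in model.items() if concept is None),
    }


report = compare(FULFILLMENT_MODEL, ONTOLOGY_CONCEPTS)
```

In this sketch the fulfillment model omits `Account`, `Customer`, and `Order` entirely and carries one local field, `tracking_number`, that the rest of the business never sees.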

Ontology and Representation

A simple ontology is relatively easy to build; the classes correspond to types of things that are important to the business. These can often be found in data dictionaries for existing systems or enterprise glossaries. The basic relationships between these things are a bit more difficult to tease out, but usually reflect relationships that are important to the business. A simple starting ontology based on this sort of enterprise research will usually have a complexity comparable to what we have in Figure 1. Even a simple ontology of this sort can provide a lot of value; the basic relationships represented in such an ontology are usually reflected in a number of data-backed systems throughout the organization. A simple ontology can act as a hub for an important subset of the data in the enterprise.

Further development of an ontology proceeds in an incremental fashion. For example, the category “products” can be divided into different types of products. These are distinguished not just by their names, but also by the sorts of data we use to describe them. In a bookstore example, both books and videos can be products. Books have numbers of pages, whereas videos have run times. Both of them have publication dates, though we could easily imagine products (perhaps not from a bookstore) that don’t have publication dates. We can extend an ontology in this manner to cover different details in different data sources.
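A minimal sketch of this kind of incremental extension follows. The property names (`pageCount`, `runTime`, `publicationDate`) and the later `GiftCard` type with its `faceValue` property are hypothetical, chosen only to mirror the books and videos example.

```python
# Properties used to describe each product type (names are assumptions).
PROPERTIES = {
    "Book": {"pageCount", "publicationDate"},
    "Video": {"runTime", "publicationDate"},
}


def add_type(name, properties):
    """Extend the ontology with a new product type, leaving others untouched."""
    if name in PROPERTIES:
        raise ValueError(f"type already defined: {name}")
    PROPERTIES[name] = set(properties)


def shared(*types):
    """Return the properties that every listed type has in common."""
    return set.intersection(*(PROPERTIES[t] for t in types))


common_before = shared("Book", "Video")

# A later increment: a product type with no publication date.
add_type("GiftCard", {"faceValue"})
common_after = shared("Book", "Video", "GiftCard")
```

Before the extension, books and videos share `publicationDate`; after adding a type without one, no property is common to all three. That is why a publication date belongs to specific product types rather than to products in general.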

