This post discusses some history of knowledge graph technology, what went wrong, and key advances that have allowed knowledge graphs to reenter the enterprise “data arena.” We also highlight some important outcomes of the technology.
In May 2001, Tim Berners-Lee, James Hendler and Ora Lassila published an article in Scientific American that provided a vision of a “Semantic Web” that empowers “intelligent agents” to automate many tasks.
Tim Berners-Lee is credited with “inventing” the World Wide Web as we know it. In this space, humans publish and share documents. This Web is oriented toward human presentation and consumption of information. In the Semantic Web, we publish data that is oriented toward machines. The vision articulated in the article describes how intelligent software agents will leverage this rich, machine understandable data to automate many tasks. In other words, the machine “knows what we mean.”
The vision inspired what is now commonly called the “Semantic Web stack” of standards and elements that provides a repeatable means to create, publish, discover and consume machine understandable content. The graphic below is a common depiction of the stack.
Semantic Web Standards Create Machine Understandable Context.
Subsequent to the initial release of Web Ontology Language (OWL) in 2004, the World Wide Web Consortium (W3C) articulated a set of standards and methods under the “Semantic Web” label. In this construct, the primary standards that adopters and vendors implement to create machine understandable, contextualized knowledge include Resource Description Framework (RDF), RDF Schema (RDFS), OWL, and SPARQL Protocol And RDF Query Language (SPARQL).
RDF provides the means to create, store and exchange semantic data. RDF data forms a directed, labeled graph in which each statement links a subject to an object via a named predicate, so every concept is defined by its explicit relationships to other identified resources rather than by opaque, self-contained records. RDFS is a set of classes and properties that builds on RDF to provide basic elements for the description of concepts in the RDF data. OWL builds on RDFS to add significantly more constructs to specify or model domains or applications.
SPARQL provides the means to query semantic data, including from distributed sources. The “Protocol” portion of SPARQL standardizes the means to publish and communicate with semantic data services, known, in general, as “SPARQL endpoints.” RDF, RDFS and OWL are materialized in a simple three-element structure, commonly called a “triple,” statement, or fact. Existing data sources may be represented as triples. Triples from otherwise disparate data sources may be linked to create a universal and machine understandable “knowledge graph.” Generally, think of OWL as the “knowledge” and RDF as the “graph”; thus, knowledge graph.
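To make the triple idea concrete, here is a minimal sketch in plain Python (no RDF library; the `ex:` IRIs and data are illustrative, though the FOAF namespace is a real, widely used vocabulary). It shows how triples from two disparate sources link simply by sharing an IRI, and how a pattern match over the merged graph answers a question in the spirit of a SPARQL query:

```python
# Each statement is a (subject, predicate, object) triple; IRIs are plain strings here.
EX = "http://example.org/"          # hypothetical namespace for this example
FOAF = "http://xmlns.com/foaf/0.1/" # FOAF: a real, widely used vocabulary

# Triples from two disparate sources, linked by using the same IRI for Alice.
hr_source = {
    (EX + "alice", FOAF + "name", "Alice Smith"),
    (EX + "alice", EX + "worksFor", EX + "acme"),
}
crm_source = {
    (EX + "acme", FOAF + "name", "Acme Corp"),
}
graph = hr_source | crm_source  # merging two RDF graphs is just a set union

def match(graph, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts like a SPARQL variable."""
    return [t for t in graph
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Roughly: SELECT ?name WHERE { ex:alice foaf:name ?name }
names = [o for (_, _, o) in match(graph, s=EX + "alice", p=FOAF + "name")]
print(names)  # ['Alice Smith']
```

A real triple store adds indexing, persistence and the full SPARQL grammar, but the underlying data model is exactly this simple.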
In addition, Semantic Web standards enable machine reasoning services that infer new facts from existing facts; that is, semantic technologies make implicit data explicit. Semantic Web standards allow human and machine data consumers to know what data producers mean.
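As a toy illustration of “making implicit data explicit,” the sketch below forward-chains two standard RDFS entailment rules over a few hypothetical facts until no new facts appear. (Real reasoners implement many more rules and far better algorithms; this only demonstrates the principle.)

```python
# Two RDFS entailment rules, applied to a fixed point:
#   rule 1: (x rdf:type C) and (C rdfs:subClassOf D)        =>  (x rdf:type D)
#   rule 2: (C rdfs:subClassOf D) and (D rdfs:subClassOf E) =>  (C rdfs:subClassOf E)
TYPE, SUBCLASS = "rdf:type", "rdfs:subClassOf"

facts = {
    ("ex:fido", TYPE, "ex:Dog"),
    ("ex:Dog", SUBCLASS, "ex:Mammal"),
    ("ex:Mammal", SUBCLASS, "ex:Animal"),
}

def infer(facts):
    """Apply both rules repeatedly until no new facts are produced."""
    facts = set(facts)
    while True:
        new = set()
        for (s, p, o) in facts:
            for (s2, p2, o2) in facts:
                if p == TYPE and p2 == SUBCLASS and o == s2:
                    new.add((s, TYPE, o2))
                if p == SUBCLASS and p2 == SUBCLASS and o == s2:
                    new.add((s, SUBCLASS, o2))
        if new <= facts:          # fixed point reached
            return facts
        facts |= new

closed = infer(facts)
# The implicit fact that Fido is an Animal is now explicit:
print(("ex:fido", TYPE, "ex:Animal") in closed)  # True
```

No one ever asserted that Fido is an animal; the reasoner derived it from the class hierarchy, which is precisely what “inferring new facts from existing facts” means.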
Referring back to the diagram, the first layer, URI/IRI (Uniform Resource Identifier / Internationalized Resource Identifier), is a string of standardized form that universally and uniquely identifies a resource or document. The Uniform Resource Locator (URL) is a subset of URI that also specifies where a resource resides and how to retrieve it; in other words, URLs make data accessible. The Internationalized Resource Identifier (IRI) generalizes the URI to the full Unicode character set for international use.
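In practice, full IRIs are long, so Semantic Web syntaxes conventionally abbreviate them with prefixes. The sketch below expands a compact name such as `foaf:name` to its full IRI; the prefix bindings shown are the real, published namespaces for RDF, RDFS and FOAF:

```python
# Common prefix-to-namespace bindings (these are the actual published namespaces).
PREFIXES = {
    "rdf":  "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "foaf": "http://xmlns.com/foaf/0.1/",
}

def expand(curie):
    """Expand a compact IRI of the form 'prefix:local' to a full IRI string."""
    prefix, local = curie.split(":", 1)
    return PREFIXES[prefix] + local

print(expand("foaf:name"))  # http://xmlns.com/foaf/0.1/name
```

The expanded string is the identifier that actually appears in the graph, which is how two independently produced datasets can refer to exactly the same property or class.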
XML, together with XML namespaces and XML Schema, was originally intended to ensure a common syntax for the Semantic Web. XML provides platform independence and data exchange using a common language. Importantly, RDF is NOT “another XML language.” Rather, RDF can be serialized in XML, although other standardized RDF syntaxes, such as Turtle, are more compact and efficient; RDF serialized in XML is therefore less common, though still fully compatible.
Logic and rules allow trusted conclusions to be drawn from trusted inputs. The idea of “provenance,” or lineage, is an important aspect of semantically integrated data. Provenance provides a means to trace data throughout its lifecycle from source to consumption. Implementers apply digital signatures and cryptography to maintain security during information exchange and interactions.
Semantic Web standards create machine understandable context in a standard and repeatable methodology. Please see “Machine Understandable Context and Why it Matters” for more information about machine understandable context. Just as the Web is distributed and decentralized, so are data sources that employ Semantic Web technologies. The idea is for data producers to publish machine understandable content that software and human consumers can discover and consume in a reliable and repeatable manner.
The concepts in Service Oriented Architecture are germane. As the ecosystem grows, the need for standardized publishing, finding and invoking of semantic data and services applies. Unlike conventional data standards, ontologies need not be centrally managed. Ontologies grow, evolve and adapt over time as adoption increases, and the superior ontologies naturally gain traction. Because ontology is grounded in the study of what exists and how entities relate, terminology across information systems tends to converge. In a community or enterprise setting, ontologies (and ontology architecture) can be managed to optimize the ontology lifecycle and its application.
The need for “top down” or “highly coordinated” planned and implemented data architectures is diminished because the model is decentralized, distributed and based on formal ontology, which is designed to achieve semantic interoperability in a federated manner. Though a full treatment is beyond the scope of this article, ontology-based approaches adopt the open-world assumption: “one never has all the facts.” Previous approaches did not make this assumption, which resulted in inflexible designs wherein requirements (e.g., queries) had to be known in the design phase.
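The contrast can be shown in a few lines of illustrative Python (the data and names are hypothetical). Under the closed-world assumption of a conventional database, a missing fact is treated as false; under the open-world assumption of RDF/OWL, it is merely unknown:

```python
facts = {("ex:alice", "ex:worksFor", "ex:acme")}  # everything we have been told
query = ("ex:bob", "ex:worksFor", "ex:acme")      # does Bob work for Acme?

# Closed world (typical of SQL systems): absence of the fact means "no".
closed_world_answer = query in facts                   # False

# Open world (RDF/OWL): absence only means "unknown" -- we never had all the facts.
open_world_answer = True if query in facts else None   # None = unknown
print(closed_world_answer, open_world_answer)  # False None
```

This is why ontology-based systems remain flexible when new sources arrive: an absent fact was never asserted to be false, so adding it later contradicts nothing.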
In the 1990s there was a recognition that languages such as HTML and XML were insufficient for knowledge representation. HTML is oriented to rendering information in a human friendly presentation. XML provides a platform-independent data exchange model. (See The Description Logic Handbook: Theory, Implementation and Applications, edited by Franz Baader et al., 2003.)
Path to a Standard Ontology Language.
In 1999, the European Union sponsored development of the Ontology Inference Layer (OIL). Note, sometimes “Information” is used in place of “Inference.” OIL was based on the strong formal foundations of Description Logics, namely SHIQ. OIL was compatible with the very lightweight RDFS model, first drafted at the W3C in 1998.
In 2000, the Defense Advanced Research Projects Agency (DARPA), at the time led by James Hendler, initiated the DARPA Agent Markup Language (DAML) project. DAML was to serve as the foundation for the next generation of the Web which would increasingly utilize “smart” agents and programs. One goal was to reduce the heavy reliance on human interpretation of data. DAML extended XML, RDF and RDFS to support machine understandability. DAML included “some” strong formal foundations of Description Logics but focused more on pragmatic application.
Circa 2001, groups from the US and the EU collaborated to merge DAML and OIL into what became known as DAML+OIL. DAML+OIL provided formal semantics that support machine and human understandability. The new language also provided an axiomatization, i.e., inference rules that expand reasoning services and make the semantics machine-operational.
In 2004, the W3C derived OWL from DAML+OIL and published it as a “standard” knowledge representation language for authoring ontologies. The initial OWL specification featured three “species” of OWL: OWL Lite, OWL DL, and OWL Full, in increasing order of expressiveness and sophistication. In 2009, the W3C released OWL 2, which articulated profiles of OWL tailored to different reasoning requirements and application areas. The latest W3C OWL 2 recommendation is dated 11 December 2012.
This entire evolution can be aptly characterized as “making the data intelligent instead of the software.” Since the data is “common” to all software processes, we can more effectively realize intelligent systems.
Well, if this is so great, why is there not more broad adoption of these standards and associated methodologies?
As evidenced by the aforementioned Scientific American article, there was considerable enthusiasm. Many knowledge engineers developed elaborate ontologies using DAML, then OWL. But these models had limited utility; they sort of just “sat there.”
Application developers slowly began to use “some” ontology to link disparate data; however, processing (e.g., querying) graphs was compute intensive and the volume of data that could be queried at acceptable performance levels was limited.
Another factor that limited adoption was the allure of reasoning. Many excited developers were soon discouraged because reasoning is complex and often counterintuitive. So semantic technologies endured the “trough of disillusionment” and many lost interest.
But there are reasons to be reenergized.
The advent of “big data” ushered in broad adoption of innovative and affordable data processing techniques, namely Massively Parallel Processing (MPP). MPP implementations scale horizontally on commodity compute resources, making them more affordable. MPP presented an opportunity to affordably scale graph data as well, including graphs built using RDF and OWL.
Another adoption enabler might be best described as Model Driven Architecture (MDA). MDA presented an approach to designing systems and applications that made extensive use of metadata (e.g., OWL). MDA ushered in a new level of abstraction and flexibility that facilitated adoption of OWL-based designs. The abstraction enabled by MDA meant that vendors could empower end users to employ OWL-based systems without requiring any specialized knowledge of OWL or RDF.
MPP, MDA and a data-driven approach (aka epistemology) combine to usher semantic technologies into the mainstream, and to enterprise scope and scale. Even today, however, I urge caution and limited (i.e., narrowly scoped) use of the reasoning services offered by semantic technologies.
What does this combination look like?
The viable implementation of OWL and RDF results in the knowledge graph. For a more thorough treatment of this concept, please see this article. Briefly, a knowledge graph is a connected graph of data and associated metadata applied to model, integrate and access an organization’s information assets. The knowledge graph represents real-world entities, facts, concepts, and events as well as the relationships among them. Knowledge graphs yield a more accurate and more comprehensive representation of an organization’s data.
Creating a knowledge graph is pretty straightforward. But as we said earlier, we need to query very large knowledge graphs at acceptable performance levels. This is achieved using MPP, which allows one to quickly load and flexibly query very large knowledge graphs. A successful MDA implementation allows users to ask questions of all or parts of the knowledge graph in a familiar point-and-click paradigm. Cambridge Semantics’ Anzo platform represents the intersection of MPP and MDA in a way easily adopted by non-technical users.
The result is a uniform information access layer in which, for example, users configure analytic dashboards that provide seamless access to interconnected information. Automated clients, such as AI and ML services, enjoy a similar on-demand approach.
For decision makers, the knowledge graph shrinks the decision cycle and enables more complete access to information, which results in better decisions. For executives and managers, the knowledge graph allows for reallocation of technical engineering resources. For example, because the knowledge graph provides a uniform structure, fewer resources are required to discover, select and prepare data for consumption. Thus, more resources can be shifted to higher order tasks such as analytics development.
Knowledge graph solutions deploy as an overlay atop an enterprise’s existing landscape of data sources, allowing users to build blended, analytics-ready datasets against the underlying data resources without displacing or disrupting any existing processes or platforms.