A portion of this blog is an excerpt from the O’Reilly ebook The Rise of the Knowledge Graph, co-authored by Ben Szekely, Dean Allemang, and Sean Martin. If you’d like to learn more about knowledge graphs and how they stitch together the concepts in this post, please check out the ebook as well as the other posts in this blog series.
This is the second post in the series An Integrated Data Enterprise. All of the concepts described in my previous post and to which knowledge graphs contribute—data fabric, data mesh, data-centric revolution, and FAIR data—have been motivated by a common awareness of some of the ways in which typical data architectures fail. These failure modes are common not just in enterprise data systems but in any shared data situation, on an enterprise, industrial, or global scale.
Data Architecture Failure Modes
Probably the most obvious failure mode is centralization of data. When we have a system that we call a “database” with a “data manager,” this encourages data to be stored in a single place, harking back to the image of the temple at Delphi, where insight was a destination, and those who sought it made a journey to the data, rather than having the data come to them. In an enterprise setting, this sort of centralization has very concrete ramifications. When we do realize that we need to combine information from multiple sources, the central data repository—whether we call it a data warehouse, a data lake, or some other metaphor—has to be able to simultaneously scale out in a number of dimensions to succeed.
An interpretation of the temple at Delphi, from Assassin's Creed.
Obviously, this scale must consider data volume and data complexity, as many more types of data must be represented and integrated. However, it also means delivering what is required by the business in reasonable time frames, given the inevitable constraints on your skilled data delivery resources. Naturally, the business will always have a never-ending queue of requests for new combinations of data, and this puts an enormous strain on both the technology and the people behind this kind of approach. Bitter experience has taught the business that data centralization projects are slow, inflexible, expensive, and high risk, as they fail to deliver so frequently.
A corresponding weakness of a centralized data hub is that it tends to suppress variation in data sets. If we have very similar data sets in the organization with considerable overlap, the tendency in any centralized system is to iron out their differences. While this may be a valuable thing to do, it is also costly in time and effort. If our system fails to be fully effective until all of these differences have been eliminated, then we will spend most of our careers (and most of the business's time) in a state where we have not yet aligned all of our data sources, and our system will not be able to provide the capabilities the business needs.
Another failure mode is completely nontechnical, and has to do with data ownership. For the purposes of this blog post, the data owner is the person or organization who is responsible for making sure that data is trustworthy, that it is up to date, that the systems that serve it are running at an acceptable service level, and that the data and metadata are correct at all times. If we have a single data hub, the ownership of the data is unclear. Either the manager of the hub is the owner of all the component data sets (not a sustainable situation), or the owner of the data does not have ownership of the platform (since not everyone can manage the platform), and hence their ownership is not easy to enforce. As a result, it is typical in such an architecture to find multiple copies, of varying currency, of the same data in very similar data sources around the enterprise—each in its own data silo, with no way of knowing the comparative quality of the data.
Another failure mode, which is emphasized by Dave McComb in The Data-Centric Revolution (Technics Publications), is that data is represented and managed specifically in service to the applications that use it, making the enterprise data architecture application centric. This means that data is not provided as an asset in its own right, and is notoriously difficult to reuse. Data migration processes speak of data being “trapped” in an application or platform, and processes being required to “release” it so the enterprise can use it.
In military intelligence, there is a well-known principle that is summarized as “need to know”—intelligence (i.e., data) is shared with an agent if and only if they need to know it. It is the responsibility of a data manager to determine who needs to know something. But a lesser-known principle is the converse—“responsibility to provide.” In a mission-critical setting, if we discover that a failure occurred because some agent lacked knowledge of a situation, and that we had that knowledge but did not provide it, then we are guilty of having sabotaged our own mission. It is impossible to know in advance who will need to know what information, but the information has to be available so that the data consumer can determine this and have access to the data they need. The modern concepts we are talking about here—data mesh, data fabric, data-centric operation—represent a shift in emphasis from “need to know” to “responsibility to provide.”
The attitude of “responsibility to provide” is familiar from the World Wide Web, where documents are made available to the world and delivered as search results, and then it is the responsibility of the data consumer to use them effectively and responsibly. In a “responsibility to provide” world, describing and finding data becomes key.
Data-centric Example: NAICS Codes
As a simple example of the shift in emphasis from application-centric, centralized data management to a data-centric, distributed data fabric, we’ll consider the management of North American Industry Classification System (NAICS) codes. NAICS is a set of about six thousand codes that describe what industry a company operates in. Any bank that operates with counterparties in the US will have several reasons to classify their industry. These range from “know your customer” applications, in which transactions are spot-audited for realism based on the operating industry of the players in the transaction (e.g., a bicycle shop is probably not going to buy tons of bulk raw cotton direct from the gin), to credit evaluation (which industries are stable in the current market), and many others. It is common for dozens or hundreds of datasets in a financial enterprise to reference the NAICS codes.
The NAICS codes are probably the easiest data entity to reuse; they are maintained by the US Census Bureau, which publishes them on a very regular and yet quite slow basis (once every five years). The codes are very simple: six numeric digits, along with a name and a description. They are in a hierarchical structure, which is reflected directly in the codes, and are made available free of charge in many formats from many sources, all of which are in perfect agreement on the content. The new releases every five years are fully backward compatible with all previous versions. NAICS codes present none of the usual problems with managing shared vocabularies.
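To make the "hierarchy reflected directly in the codes" point concrete, here is a minimal Python sketch. It assumes the usual NAICS convention that each shorter prefix of a six-digit code (down to two digits) names a parent level. The function names `ancestors`, `parent`, and `level` are illustrative, not part of any published NAICS tooling.

```python
from typing import List, Optional

# Level names by code length, following the published NAICS structure.
LEVELS = {
    2: "sector",
    3: "subsector",
    4: "industry group",
    5: "NAICS industry",
    6: "national industry",
}


def _check(code: str) -> None:
    """Reject anything that is not a 2- to 6-digit numeric string."""
    if not code.isdigit() or not 2 <= len(code) <= 6:
        raise ValueError(f"not a NAICS-style code: {code!r}")


def level(code: str) -> str:
    """Name the hierarchy level implied by the length of the code."""
    _check(code)
    return LEVELS[len(code)]


def ancestors(code: str) -> List[str]:
    """Return every prefix from the sector down to the code itself."""
    _check(code)
    return [code[:n] for n in range(2, len(code) + 1)]


def parent(code: str) -> Optional[str]:
    """Return the next level up, or None for a two-digit sector code."""
    _check(code)
    return code[:-1] if len(code) > 2 else None


# Caveat: a few sectors are published as ranges (for example 31-33),
# so a two-digit prefix is not always the full sector label.
print(ancestors("522110"))
```

Because the hierarchy can be derived from the code alone, any copy of the codes that keeps the digits intact carries its own roll-up structure, which is part of what makes NAICS such an easy candidate for reuse.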
Nevertheless, the management of NAICS codes as shared resources in actual financial institutions is always a mess. Here are some of the things that go wrong:
- They are represented in just about every way imaginable (XML documents, spreadsheets, tables in relational databases, parts of tables in relational databases, etc.). Despite the simple form of a code, it can be represented in idiosyncratic ways; the names of the columns in a spreadsheet or of the elements in an XML document often do not match anything in the published codes.
- Despite the fact that the NAICS codes are standardized, it is quite common to find that an enterprise has minted its own codes, to add new ones that it needs but were missing in the published version. These augmented codes typically don’t follow the pattern of the NAICS codes, and in no circumstance are they ever fed back to the NAICS committee for integration into the upcoming versions.
- Because they are external to the enterprise, nobody owns them. That is, nobody takes responsibility for making sure that the latest version is available, or that the current version is correct. Nobody views it as a product, with service agreements about how they will be published or the availability of the servers that publish them.
As a result of these situations, it is typical to find a dozen or more copies of the NAICS codes in any organization, and even if a data catalog somewhere indicates this, it does not indicate the version information or any augmentations that have been made. The simplest possible reusable data resource turns instead into a source of confusion and turmoil.
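The kind of drift described above can be surfaced with a simple set comparison. The sketch below is illustrative only: the copy names (`crm_spreadsheet`, `risk_db_table`) and the code values are made-up toy data, not a real NAICS release, and `audit_copy` is a hypothetical helper.

```python
from typing import Dict, Iterable, List


def audit_copy(published: Iterable[str], local: Iterable[str]) -> Dict[str, List[str]]:
    """Compare one local copy of the codes against the published reference."""
    published_set, local_set = set(published), set(local)
    return {
        # Codes the enterprise added that the standard does not contain.
        "minted_locally": sorted(local_set - published_set),
        # Published codes the local copy lacks (possibly an older version).
        "missing_locally": sorted(published_set - local_set),
    }


# Toy reference list and toy local copies (assumptions for illustration).
published = ["522110", "522130", "523150"]
copies = {
    "crm_spreadsheet": ["522110", "522130"],
    "risk_db_table": ["522110", "522130", "523150", "999001"],
}

report = {name: audit_copy(published, codes) for name, codes in copies.items()}
for name, findings in report.items():
    print(name, findings)
```

A report like this does not fix the ownership problem, but it does make visible which copies have been augmented and which have fallen behind.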
Requirements For A New Data Management Paradigm
When we think of how new approaches to enterprise data management are changing the data landscape, a number of recurring themes come up, in terms of requirements they place on data management:
- Flexibility in the face of complex or changing data
- Description in terms of business concepts
- Ability to deal with unanticipated questions
- Data-centric (as opposed to application-centric)
- Data as a product (with SLA, customer satisfaction, etc.)
- FAIR (findable, accessible, interoperable, and reusable)
Let’s take a look at how this new paradigm deals with our simple example of NAICS codes. A large part of the value of a standardized coding system like NAICS is the fact that it is a standard; making ad hoc changes to it damages that value. But clearly, many users of the NAICS codes have found it useful to extend and modify the codes in various ways. The NAICS codes have to be flexible in the face of these needs; they have to simultaneously satisfy the conflicting needs of standardization and extensibility. Our data landscape needs to be able to satisfy apparently contradictory requirements of this sort in a consistent way.
The NAICS codes have many applications in an enterprise, which means that the reusable NAICS data set will play a different role in combination with other data sets in various settings. A flexible data landscape will need to express the relationship between NAICS codes and other data sets; is the code describing a company and its business, or a market, or is it linked to a product category?
The problems with management of the NAICS codes become evident when we compare the typical way they are managed with a data-centric view. The reason why we have so many different representations of NAICS codes is that each application has a particular use for them, and hence maintains them in a form that is suitable for that use. An XML-based application will keep them as a document, a database will embed them in a table, and a publisher will have them as a spreadsheet for review by the business. Each application maintains them separately, without any connection between them. There is no indication about whether these are the same version, whether one extends the codes, and in what way. In short, the enterprise does not know what it knows about NAICS codes, and doesn’t know how they are managed.
If we view NAICS codes as a data product, we expect them to be maintained and provisioned like any product in the enterprise. They will have a product description (metadata), which will include information about versions. The codes will be published in multiple forms (for various uses); each of these forms will have a service level agreement, appropriate to the users in the enterprise.
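One way to picture such a product description is as a small metadata record. The following Python sketch is an assumption about what that record might hold. The class names, field names, locations, owner, and availability figures are all hypothetical and do not come from any particular data catalog or data fabric product.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Distribution:
    """One published form of the product, with its own service level."""
    media_type: str             # e.g. "text/csv" or "application/xml"
    location: str               # hypothetical internal address
    availability_target: float  # agreed uptime fraction for this form


@dataclass
class DataProduct:
    """Product description (metadata) for a shared data set."""
    name: str
    standard_version: str       # which external release this tracks
    owner: str                  # who is accountable for the product
    local_extensions: List[str] = field(default_factory=list)
    distributions: List[Distribution] = field(default_factory=list)

    def formats(self) -> List[str]:
        """List the media types in which the product is published."""
        return [d.media_type for d in self.distributions]

    def is_extended(self) -> bool:
        """True if the enterprise has added codes beyond the standard."""
        return bool(self.local_extensions)


naics = DataProduct(
    name="NAICS codes",
    standard_version="2022",
    owner="reference-data-team",  # hypothetical owner
    local_extensions=["999001"],  # hypothetical locally minted code
    distributions=[
        Distribution("text/csv", "data://reference/naics.csv", 0.99),
        Distribution("application/xml", "data://reference/naics.xml", 0.999),
    ],
)
print(naics.formats(), naics.is_extended())
```

The point of the sketch is that version, extensions, ownership, and per-format service levels are recorded once, next to the data, rather than being implicit in each application's private copy.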
There are many advantages to managing data as a product in this way. Probably the most obvious is that the enterprise knows what it knows: there are NAICS codes, and we know what version(s) we have and how they have been extended. We know that all the versions, regardless of format or publication, are referring to the same codes. Furthermore, these resources are related to the external standard, so we get the advantages of adhering to an industry standard. They are available in multiple formats, and can be reused by many parts of the enterprise. Additionally, changes and updates to the codes are done just once, rather than piecemeal across many resources.
The vision of a data fabric, data mesh, or data-centric enterprise is that every data resource will be treated this way. Our example of NAICS was intentionally very simple, but the same principles apply to other sorts of data, both internal and external. A data fabric is made up of an extensible collection of data products of this sort, with explicit metadata describing what they are and how they can be used. At Cambridge Semantics, we focus on the data fabric as the vehicle for this distributed data infrastructure; most of our comments would apply equally well to any of the other approaches.
This blog is the second post of the series An Integrated Data Enterprise. Keep an eye out for the next chapters.