If data is a strategic asset, why is so much emphasis placed on analytics?
People who work in data management often hear and talk about the importance of data. We hear phrases like “Data is a strategic asset” and “Data is the new oil.” But trends such as AI, ML, and analytics in general capture the imagination, and the budgets, of decision makers. Why? Perhaps it’s because AI, ML, and analytics are simply cooler tools, with eye-catching visualizations, brainy models, and the promise of wow-factor insights. Yet the “data is a strategic asset” refrain persists. In some circles we talk about “connecting the dots,” but we pay little homage to creating the dots.
Bad data leads to bad answers.
The phrase “garbage in, garbage out” comes to mind: if one’s premise is flawed, so too are one’s conclusions. Nevertheless, we can’t seem to tame our fascination with “shiny objects.” This article discusses the prerequisite foundation on which to build successful AI, ML, and advanced analytics capabilities, and why that foundation should be a preliminary consideration in any analytics strategy.
A successful analytics strategy provides complete and precise information, on demand, from which analytics derive new information and insights. In Defense communities there is a concept called the “decision cycle.” The idea is to get through the decision cycle faster than your opponents, that is, your competition. However, if the demand for actionable information is great and the underlying analytic apparatus is not efficient, then analysts will necessarily offer what amounts to conjecture to decision makers. We can call this the “dangerous shortcut,” and it becomes a vicious cycle that leads to undesired outcomes.
To avoid the “dangerous shortcut” and provide the best analytic outcomes, data must flow to and through the analysis stage with minimum friction and maximum context. Context implies a holistic and accurate understanding of your domain of interest. For example, in a manufacturing setting, if an analytic is unaware of all the factors influencing a given machine in an assembly line, it may not make the best maintenance predictions. The same dynamic applies to AI and ML initiatives. In other words, without effective “Information Architecture,” enterprises will struggle to operationalize AI initiatives at any meaningful scale and effectiveness. So how do we realize an effective Information Architecture?
First, let’s abstract the decision cycle as an “insight creation cycle” and partition it into four stages: collect data, process data, analyze data, and act on the findings. In practice, data collection is usually not the challenge, at least not in the macro, so we won’t discuss it here.
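The four stages above can be sketched as a trivial pipeline. This is a minimal illustration only; the stage implementations, field names, and the maintenance rule are all hypothetical placeholders, not part of any real system.

```python
# Minimal sketch of the four-stage insight cycle: collect -> process -> analyze -> act.
# All data and thresholds below are illustrative placeholders.

def collect():
    # Stage 1: raw readings arrive from disparate sources, in source units.
    return [{"machine": "m1", "temp_f": 212}, {"machine": "m2", "temp_f": 180}]

def process(raw):
    # Stage 2: search, identify, and prepare -- here, normalize units
    # and keep only rows that have the field we need.
    return [{"machine": r["machine"],
             "temp_c": round((r["temp_f"] - 32) * 5 / 9, 1)}
            for r in raw if "temp_f" in r]

def analyze(prepared):
    # Stage 3: a trivial "analytic" flags machines at or above boiling.
    return [r["machine"] for r in prepared if r["temp_c"] >= 100.0]

def act(findings):
    # Stage 4: act on the findings.
    for m in findings:
        print(f"dispatch maintenance to {m}")

act(analyze(process(collect())))
```

Even in this toy version, the process stage carries the normalization burden that the rest of the article argues is the real bottleneck.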
Information Technology can be abstracted as a cycle.
Many, if not most, enterprises lament an “analysis bottleneck,” that is, difficulty in achieving insights quickly enough. In reality, a disproportionate and manually intensive part of the insight cycle is spent in the data processing stage, in which we search for, identify, and prepare data for analysis. I assert that the data processing stage is the real bottleneck enterprises should confront. For example, the market research firm International Data Corporation (IDC) stated that 73% of the cycle is spent on data searching and preparation, in other words, data processing. Therefore, a critical enabler of a successful analytics strategy is effective and efficient data processing prior to analysis.
The Process stage is the actual bottleneck and where Knowledge Graphs enable organizations to successfully implement advanced analytics strategies.
In fact, one byproduct of data collection arguably provided the impetus for the “data lake,” which is essentially a large repository of unprocessed data. I recall that, circa 2010, the senior technical director of a large government enterprise proclaimed something like, “We’ve made the bold decision to move [data] to the cloud. Collocating our data will provide a common access point from which we will realize more effective analysis.” As a data integration “expert,” my reaction was, “Yeah, right. Juxtaposing data does not equal data integration, context, and seamless access.” Indeed, data lakes were arguably a tacit acknowledgment of an unmet need: improving the process stage.
Collocating data does not integrate data. Data processing remains a predominant activity.
A newer architectural construct, the “data fabric,” applies technologies and methodologies to harmonize data and provide frictionless access to it in integrated form. In other words, data fabrics seek to overcome the limitations of simply collocating data in data lakes. The idea is to create interconnected, business-oriented information for more seamless, on-demand access for analysis.
The Data Fabric powered by knowledge graph technology provides normalized, contextualized and harmonized information for human and machine consumption.
Another dynamic contributing to the need for seamless access to information is uncertainty. Organizations must assume, and embrace, the unexpected. Therefore, data models and practices that assume change are more responsive and adaptable to the unforeseen. “Quickly” updating a relational schema, for example, is not what I mean by adaptable. I am referring to data models based on an “open world assumption,” a topic I’ll expand on in a future post.
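To make the adaptability point concrete, here is a minimal sketch of why a schemaless, statement-based model absorbs the unforeseen without migration. The predicates and identifiers (the `ex:` and `machine:` names) are hypothetical, invented for illustration.

```python
# Sketch: a statement-based (triple) model accepts unanticipated facts
# with no schema migration. All names below are illustrative placeholders.

triples = set()

def add_fact(subject, predicate, obj):
    """Record one (subject, predicate, object) statement."""
    triples.add((subject, predicate, obj))

# Initial understanding: machines have a type and a location.
add_fact("machine:42", "rdf:type", "ex:Machine")
add_fact("machine:42", "ex:location", "Line 3")

# An unanticipated requirement arrives: track vibration readings.
# No ALTER TABLE, no migration -- we simply assert new statements.
add_fact("machine:42", "ex:vibrationHz", 57.2)

def properties_of(subject):
    """Everything currently known about a subject. Under an open world
    assumption this is only what we know so far, not a complete record."""
    return {p: o for s, p, o in triples if s == subject}

print(properties_of("machine:42"))
```

Contrast this with a relational table, where the new `ex:vibrationHz` attribute would require a schema change before a single reading could be stored.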
So, how do we actually construct a data fabric? More importantly, how do we create “seamless access” to data, especially at enterprise scope and scale?
From the early 2000s until circa 2009, many, including me, would have touted W3C Semantic Web technologies, namely RDF and OWL. After all, the primary aim of these standards was to create machine-understandable data, at Web scale! In my opinion, we were right. But, alas, these standards proved difficult to scale from both technology and methodology perspectives. RDF is based on a graph data model (think “nodes and links”), and OWL is an extension of RDF Schema (RDFS), which in turn extends RDF. OWL’s purpose is to provide human- and machine-understandable meaning. Disparate data are normalized to RDF and harmonized with OWL.
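The “normalize then harmonize” idea can be shown with plain Python sets standing in for an RDF triple store. The source records, field names, and the shared `ex:` vocabulary are all hypothetical; a real implementation would use an RDF library and SPARQL, but the mechanics are the same.

```python
# Sketch: normalizing two disparate source records into one shared,
# RDF-style triple vocabulary. All names are illustrative placeholders.

crm_record = {"cust_id": "C-100", "fname": "Ada", "city": "London"}
billing_record = {"customerNumber": "C-100", "town": "London", "balance": 12.5}

def normalize_crm(rec):
    # Map CRM fields onto the shared vocabulary.
    s = f"ex:customer/{rec['cust_id']}"
    return {(s, "ex:givenName", rec["fname"]),
            (s, "ex:locatedIn", rec["city"])}

def normalize_billing(rec):
    # Harmonization: 'town' and 'city' both map to ex:locatedIn.
    s = f"ex:customer/{rec['customerNumber']}"
    return {(s, "ex:locatedIn", rec["town"]),
            (s, "ex:balance", rec["balance"])}

# Union the triples: because both systems yield the same subject URI,
# their facts merge into one view, and the duplicate location collapses.
graph = normalize_crm(crm_record) | normalize_billing(billing_record)

# A simple graph-pattern query: everything known about one customer.
about = {(p, o) for s, p, o in graph if s == "ex:customer/C-100"}
print(sorted(about, key=str))
```

This is the essence of the integration step: juxtaposed records become linked statements in one graph, queryable as a whole.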
Technically, graph processing is much more compute-intensive than, say, the relational model. That characteristic made querying RDF at scale cost-prohibitive. From a practice point of view, we got carried away with OWL and the promise of machine reasoning, and we were overwhelmed by complexity. Further, the open world assumption combined with machine reasoning often yielded counterintuitive results! Semantic Web technologies entered the so-called “trough of disillusionment.”
Fortunately, the arrival of “big data,” ironically helped at least in part by data lakes, gave new life to Massively Parallel Processing (MPP) designs and implementations (e.g., HDFS, Apache Spark). MPP has become an enabler of scalable graph-based solutions, including those built on RDF and OWL. Additionally, methodologies have emerged that make it easier to build and maintain solutions based on RDF and OWL.
Now the knowledge graph, built on RDF and OWL, has become viable for enterprise data integration and access. Using knowledge graph platforms like Cambridge Semantics’ Anzo, organizations can rapidly and incrementally create knowledge graphs to power their data fabrics. We can de-bottleneck the data processing stage.
Early in my Navy career, I attended a micro-miniature electronics soldering course. The instructor emphasized that 90 percent of the effective soldering process was preparation. The actual act of soldering was a small portion of the overall task.
Similarly, I would argue that the most effective strategies place a premium on the prerequisite activities related to data architecture. Organizations that do not will remain stuck in error-prone and manually intensive data preparation, at the expense of effective and efficient decision making and its associated outcomes.
So before you go too far on your advanced analytics journey, consider investing in an enterprise data fabric built on knowledge graph technology.