Corporate CIOs make dozens of critical decisions every day. While many are made on instinct and experience, others depend on deep insight built on data analysis. These decisions not only affect company strategy and outlook, but require asking and answering unforeseen, complex (read: computationally and logistically expensive) questions in rapidly evolving data ecosystems. So what’s the solution to maintaining a data-driven corporate strategy in an ever-changing data environment?
The solution must be able to answer complex, unanticipated questions and create an accurate, connected model that supports querying combined data from a diverse and changing collection of enterprise and publicly available sources. Several architectures have been pushed to deliver these qualities; at the moment, data mesh and data fabric are the two most popular.
At Cambridge Semantics, we don’t care what you want to call your preferred architecture, but our team does have strong opinions on its implementation. Our stipulation is simple:
Any approach that can’t quickly integrate multiple data sources spanning both structured and unstructured content and rapidly query across them at scale is fundamentally flawed.
What is your best data architecture option? Let’s consider a few alternatives, each of which might be considered politically safe but carries the baggage of substantial limitations.
Data warehouses and online analytical processing (OLAP) cubes are the old-school approach to reporting and analytical querying. They rely on established technologies that have been refined over decades. The downside is the infamous inflexibility and high cost of a methodology that requires pre-defining a data schema: changes require experts who must create or modify the ETL pipelines or ELT transformations that shape the data to the updated model.
Furthermore, structured systems cannot natively integrate the estimated 80% of enterprise data that is unstructured: text and images in documents, emails, product catalogs, instant messages, transcriptions and presentations. While this approach lacks flexibility, it is tailor-made for reporting and analysis on a prescribed set of metrics.
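To make the schema-first constraint concrete, here is a toy sketch using SQLite as a stand-in for a warehouse. The table, columns and values are invented for illustration; the point is simply that the schema must exist before any data can land, and a new source attribute forces a schema change plus a change to every pipeline that feeds the table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The schema must be defined up front, before any data arrives.
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.execute("INSERT INTO sales VALUES ('EU', 120.0)")

# A new upstream field ("channel") cannot be loaded until someone alters
# the schema -- and updates every ETL job that shapes data into this table.
conn.execute("ALTER TABLE sales ADD COLUMN channel TEXT")
conn.execute("INSERT INTO sales VALUES ('US', 75.5, 'online')")

total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]  # 195.5
```

The code runs, but notice that nothing about it is automatic: each change is a manual, expert-driven step, which is exactly the rigidity described above.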
Data virtualization attempts to overcome many of the challenges of managing a data warehouse by leaving source data in the operational systems where it is first created. Source data can be queried directly in a pattern called a “federated query”. Proceed with caution, as the details get thorny quickly. Virtualization can add significant performance overhead: suboptimal federated sub-queries drag too much intermediate result data across the network, causing latency. Decision-making and analytic queries generate large scans and cross joins that are hard to optimize even when the data sits in a single database, let alone when it is scattered across an enterprise.
To be successful, this approach requires careful query planning, which in turn depends on an understanding of remote source statistics, management of network load and tactical caching. Unforeseen queries can cause major performance degradation, upsetting operational source system administrators and application end users. A harder problem in a federated approach is handling data quality issues in the remote systems, which requires on-the-fly transformation and validation. On the other hand, data virtualization sometimes provides a straightforward way for applications that make point queries to unite the latest data from several source systems, and it can solve social issues surrounding data ownership and security.
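The intermediate-data problem can be sketched in a few lines of Python. The two “remote” systems below are simulated as in-memory tables (the names and rows are invented); the contrast is between a naive federation plan that ships full scans across the network and a plan that pushes the filter down to the source.

```python
# Simulated operational systems a federation layer might query.
crm_system = [
    {"customer_id": 1, "name": "Acme", "region": "EU"},
    {"customer_id": 2, "name": "Globex", "region": "US"},
]
billing_system = [
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": 2, "amount": 75.5},
    {"customer_id": 1, "amount": 30.0},
]

def naive_revenue_by_region(region):
    """Suboptimal plan: fetch ALL rows from both systems over the
    network, then filter and join locally."""
    customers = list(crm_system)       # full scan shipped across the wire
    invoices = list(billing_system)    # full scan shipped across the wire
    ids = {c["customer_id"] for c in customers if c["region"] == region}
    return sum(i["amount"] for i in invoices if i["customer_id"] in ids)

def pushed_down_revenue_by_region(region):
    """Better plan: push the region filter to the CRM source so only
    the matching customer ids cross the network."""
    ids = {c["customer_id"] for c in crm_system if c["region"] == region}
    return sum(i["amount"] for i in billing_system if i["customer_id"] in ids)
```

Both plans return the same answer; the difference is how many intermediate rows move between systems, which is the overhead a federated query planner must reason about for every unforeseen query.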
Some graph databases are great! Our product, Anzo®, actually includes its own proprietary graph database, AnzoGraph DB, at the heart of its software package. More on this in a second. What many other graph database vendors often obscure is that their underlying designs have the same drawbacks as a structured and/or data virtualization approach. No other graph database is OLAP (data warehouse style) in nature, meaning optimized for complex transformation and analytics queries at huge scale.
At best, a few other graph databases allow for the creation of virtual graphs that push queries down to source systems, with the same underlying virtualization complexities discussed earlier. Furthermore, many of these graph databases are based on labeled property graphs (LPGs). This is a topic in itself, but LPGs usually require users to pre-create a schema to receive the incoming data, with the aforementioned flexibility limitations of a relational database whenever changes are required or a new data source is added. Most are really a destination for data that has been pre-integrated using external pipelines, and are therefore best for point use cases and applications that exploit the inherent graph structure, such as network analysis or fraud detection.
Now, nearly any approach to data integration can succeed to some extent given enough time and resources, and there is certainly considerable nuance to the descriptions above, but there is a better design: the enterprise knowledge graph.
Cambridge Semantics’ flagship offering, Anzo, generates an enterprise knowledge graph that renders these legacy approaches obsolete. Anzo excels at both critical parameters: rapid integration of diverse data sources and query at scale. Let’s circle back and discuss the technical underpinnings of Anzo and its embedded knowledge graph database, AnzoGraph.
Uniquely, AnzoGraph is an in-memory OLAP graph database, which gives it a significant advantage over relational data warehouses: it is essentially schemaless with regard to incoming data structure. Instead, Anzo automatically generates and catalogs the instructions for creating connected data models from source data. This includes unstructured data: Anzo offers distributed pipelines that connect with natural language processing (NLP) services to natively handle these formats and their content. This approach benefits both the data model and the data itself. In the knowledge graph world, ontologies describe data instead of schemas, and because ontologies are themselves just descriptive metadata, they can be manipulated through queries, offering unprecedented flexibility to a data engineer performing an integration. It also means that source data, regardless of its shape, complexity or dirtiness, can be loaded sight unseen and then restructured and cleansed using powerful transformation queries.
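The “load sight unseen, then cleanse” idea can be illustrated with a minimal sketch. This is not Anzo’s actual API; it just models data as subject-predicate-object triples (the entities and predicates are invented) to show that nothing needs to be declared before loading, and that cleansing is simply another pass over the data.

```python
# Dirty source data loaded as-is: no pre-defined schema required.
raw_triples = [
    ("cust:1", "name", "  Acme "),   # stray whitespace in a literal
    ("cust:1", "Region", "EU"),      # inconsistent predicate casing
    ("cust:2", "name", "Globex"),
    ("cust:2", "region", "US"),
]

def cleanse(triples):
    """A 'transformation query': restructure and cleanse after loading
    by normalizing predicate names and stripping literal whitespace."""
    return [(s, p.lower(), o.strip() if isinstance(o, str) else o)
            for s, p, o in triples]

def match(triples, predicate):
    """Minimal pattern match: (subject, object) pairs for a predicate."""
    return [(s, o) for s, p, o in triples if p == predicate]

clean = cleanse(raw_triples)
regions = match(clean, "region")  # finds both customers despite the
                                  # inconsistent source spelling
```

The order of operations is the point: in a warehouse the cleansing logic must be designed before the load; here the data lands first and the model and quality rules are applied afterwards, as queries.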
This late-binding approach works because, from a performance perspective, AnzoGraph both scales without limit and abstracts away the complexities of data virtualization. It does so through a massively parallel processing (MPP) architecture that partitions all the data across a compute cluster and breaks incoming queries into steps executed simultaneously on every data partition, or shard. AnzoGraph also supports a federated query capability we call Virtual Knowledge Graphs. This means our customers get the benefits of the approaches above in one system, because users can create a knowledge graph that mixes and matches data that is materialized (loaded) with data that is remotely sourced. When performance is needed for aggregation or transformation queries, data is materialized into the knowledge graph. Remotely sourcing data through automatically generated push-down queries is useful when accessing part of the knowledge graph remotely, perhaps to retrieve the most up-to-date information.
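The MPP pattern itself is simple to sketch in Python, though real engines implement it across machines rather than threads. In this toy version (shard count and data are illustrative), values are hash-partitioned across shards, one step of an aggregation query runs on every shard at once, and the partial results are combined.

```python
from concurrent.futures import ThreadPoolExecutor

N_SHARDS = 4
values = list(range(1, 101))

# Partition step: hash each value to a shard, as an MPP engine would
# when distributing data across a compute cluster.
shards = [[] for _ in range(N_SHARDS)]
for v in values:
    shards[hash(v) % N_SHARDS].append(v)

def shard_step(shard):
    """The per-shard piece of an aggregation query: a partial sum."""
    return sum(shard)

# Execute the step on all shards simultaneously, then combine partials.
with ThreadPoolExecutor(max_workers=N_SHARDS) as pool:
    partials = list(pool.map(shard_step, shards))
total = sum(partials)  # 5050, the same answer a single-node sum gives
```

Each shard touches only its slice of the data, which is why this style of execution scales with the size of the cluster rather than the size of a single machine.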
Because of our unique approach, customers don’t come to us with straightforward problems. It’s precisely because the options above have failed that our team, with decades of experience building knowledge graphs, gets engaged. If you are interested in our growing list of success stories, join my upcoming webinar, The Future of Data Integration.