Corporate CIOs make dozens of critical decisions every day. While many will be made with instinct and experience, others will depend on deep insight built on data analysis. These decisions not only affect company strategy and outlook, but also require asking and answering unforeseen and complex (read: computationally and logistically expensive) questions in rapidly evolving data ecosystems. So what’s the solution for maintaining a data-driven corporate strategy in an ever-changing data environment?
At Cambridge Semantics, we don’t care what you want to call your preferred architecture, but our team does have strong opinions on its implementation. Our stipulation is simple:
Any approach that can’t quickly integrate multiple data sources spanning both structured and unstructured content and rapidly query across them at scale is fundamentally flawed.
What is your best data architecture option? Let’s consider a few alternatives, each of which might be considered politically safe but carries the baggage of substantial limitations.
Data warehouses and online analytical processing (OLAP) cubes are the old-school approach in the reporting and analytical querying world. They rely on established technologies that have been refined over decades. The downside is the infamous inflexibility and high cost of a methodology that requires pre-defining a data schema. Every change requires experts to create or modify the ETL pipelines or ELT transformations that shape the data to the updated model.
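To make the rigidity concrete, here is a minimal sketch of a warehouse-style ETL step. The table, column names, and sample record are all hypothetical; the point is that the target schema is fixed up front, so any new source field is dropped until an engineer updates both this mapping and the warehouse table itself.

```python
# Hypothetical ETL step: shape raw source records onto a fixed,
# pre-defined fact-table schema. The "region" field is new upstream and
# is silently dropped until both this mapping and the table DDL change.

FACT_SALES_COLUMNS = ("sale_id", "customer_id", "amount_usd", "sale_date")

def to_fact_sales(source_row: dict) -> tuple:
    """Project a raw source record onto the rigid fact-table schema."""
    return tuple(source_row.get(col) for col in FACT_SALES_COLUMNS)

row = {"sale_id": 1, "customer_id": 42, "amount_usd": 99.5,
       "sale_date": "2023-01-15", "region": "EMEA"}  # "region" is new

print(to_fact_sales(row))  # "region" never reaches the warehouse
```

The same pattern appears whether the transformation lives in Python, SQL, or a dedicated ETL tool: the schema decision is baked in before the data arrives.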
Furthermore, structured systems cannot natively integrate the estimated 80% of existing enterprise data that is unstructured: text and images in documents, emails, product catalogs, instant messages, transcriptions and presentations. While this approach lacks flexibility, it is tailor-made for reporting and analysis on a prescribed set of metrics.
Data virtualization attempts to overcome many of the challenges of managing a data warehouse by leaving source data in the operational systems where it is first created. Source data can be queried directly in a pattern called a “federated query”. Proceed with caution, though, as the details quickly get thorny. Virtualization can add significant performance overhead: suboptimal federated sub-queries drag too much intermediate result data across the network unnecessarily, introducing latency. Decision-making and analytic queries generate large scans and cross joins that are hard to optimize even when the data is accumulated in a single database, let alone scattered across an enterprise.
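The overhead problem can be sketched in a few lines. This toy example (all source names and sizes are illustrative assumptions, and the "sources" are just in-memory lists standing in for remote systems) shows a naive federated join that pulls full intermediate results to the coordinator before joining, when a push-down plan would have shipped only a handful of rows.

```python
# Toy model of a naive federated join: without filter push-down, the
# coordinator fetches every row from each source system and joins them
# locally, moving far more data than the final answer requires.

orders_source = [{"order_id": i, "customer_id": i % 1000}
                 for i in range(100_000)]
customers_source = [{"customer_id": c, "tier": "gold" if c < 10 else "std"}
                    for c in range(1000)]

# Naive federation: drag everything across the "network", join locally.
fetched_rows = len(orders_source) + len(customers_source)

gold_ids = {c["customer_id"] for c in customers_source
            if c["tier"] == "gold"}
gold_orders = [o for o in orders_source if o["customer_id"] in gold_ids]

print(fetched_rows, len(gold_orders))
# A smarter planner would push the tier filter down to the customer
# source and ship only the ten matching ids, shrinking the transfer
# by orders of magnitude.
```

Here 101,000 rows cross the wire to produce a 1,000-row answer; real federated planners try to avoid exactly this, but as the article notes, large scans and cross joins resist such optimization.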
Some graph databases are great! Our product, Anzo®, actually includes its own proprietary graph database, AnzoGraph DB, at the heart of its software package. More on this in a second. What many other graph database vendors often obscure is that their underlying design has the same drawbacks as a structured and/or data virtualization approach. No other graph database is OLAP (data warehouse style) in nature, meaning optimized for complex transformation and analytics queries at huge scale.
Now, nearly any approach to data integration can succeed to a certain extent given enough time and resources, and there is certainly considerable nuance to the descriptions above, but there is a better design: the enterprise knowledge graph.
Cambridge Semantics’ flagship offering, Anzo, generates an enterprise knowledge graph that renders these legacy approaches obsolete. Anzo excels at both critical parameters: rapid integration of diverse data sources and query at scale. Let’s circle back and discuss the technical underpinnings of Anzo and its embedded knowledge graph database, AnzoGraph.
Uniquely, AnzoGraph is an in-memory OLAP graph database, which gives it significant advantages over relational data warehouses because it is essentially schemaless with regard to incoming data structure. Instead of a fixed schema, Anzo automatically generates and catalogs the instructions for creating connected data models from source data. This includes unstructured data: Anzo offers distributed pipelines that connect with natural language processing (NLP) services to natively handle these formats and their content.

This approach is beneficial for manipulating both the data model and the data itself. In the knowledge graph world, ontologies are used to describe data instead of schemas, and since ontologies are themselves just descriptive metadata, they can be easily manipulated through queries and offer unprecedented flexibility to a data engineer performing an integration. It also means that source data, regardless of its shape, complexity, or dirtiness, can be loaded sight unseen and then restructured and cleansed using powerful transformation queries.
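The load-sight-unseen-then-transform pattern can be illustrated without any graph engine at all. This sketch uses plain Python triples rather than Anzo or SPARQL, and every record, subject id, and property name in it is a made-up example; it shows how schemaless loading lets messy source predicates be unified after the fact by a transformation step instead of an up-front schema decision.

```python
# Minimal sketch of load-then-transform over a triple store modeled as
# a plain Python set. No schema is declared before loading: every key
# in a source record simply becomes a predicate.

triples = set()

def load(record: dict, subject: str) -> None:
    """Load any record sight unseen, one (subject, predicate, object)
    triple per field."""
    for key, value in record.items():
        triples.add((subject, key, value))

load({"name": "Acme", "hq": "Boston"}, "company/1")
load({"Name": "Globex", "headquarters": "Springfield"}, "company/2")

# Transformation step: unify synonymous predicates *after* loading,
# the kind of cleanup a fixed relational schema forces up front.
CANONICAL = {"Name": "name", "headquarters": "hq"}
for s, p, o in list(triples):
    if p in CANONICAL:
        triples.discard((s, p, o))
        triples.add((s, CANONICAL[p], o))

print(sorted(triples))
```

In Anzo itself this restructuring would be expressed as graph transformation queries against ontology-described data, but the shape of the workflow, dirty data in first and structure imposed later, is the same.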
Because of our unique approach, customers don’t come to us with straightforward problems. It is precisely because the aforementioned options have failed that our team, with decades of experience building knowledge graphs, gets engaged. If you are interested in learning about our growing list of success stories, join my upcoming webinar, The Future of Data Integration.