MPP Data Virtualization

Posted by Ben Szekely on Mar 24, 2020 7:31:59 PM

The Cambridge Semantics Breakthrough That Finally Allows You to Leave Your Data Where It Is While Integrating It at Scale - Without Compromises.

Background - The Enterprise Data Fabric

The Enterprise Data Fabric is an architecture for modern data management that anticipates the need to connect data across the enterprise at speed and scale. Central to the success of any data fabric is a semantic graph-based integration layer that uses business concept models to blend complex and diverse data and present them to users as easy-to-consume data products.

Realizing a data fabric by moving all your data to one place - even to scalable cloud storage - is not feasible for today's digital transformation initiatives. The data fabric is best conceived of and implemented as an overlay architecture spanning existing transactional databases, data warehouses, data lakes, catalogs, document stores - all either in the cloud, on-prem or both. The discovery and integration layer in the fabric integrates and blends data, drawing on subsets of data from across the underlying data landscape as required.

The devil has always been in the details behind each take on "subsets as required", ranging from pure federation, to pre-positioning data in an efficient manner, to boil-the-ocean data lakes and Enterprise Data Warehouses.

The Seduction of Pure Federation

Pure federation has always represented an attractive approach to data virtualization, and some companies and vendors are exploring this route for the Enterprise Data Fabric. Pure federation moves data only at query time, decomposing a query or API call into queries or API requests against multiple source systems, then bringing the results back and performing joins locally.

While offering the seductive benefit of not having to pre-position, copy or store data, this approach has practical scale and performance limitations that cause it to break down quickly as the data sources and integration requirements approach data fabric scale. For example, underlying distributed join operations are not planned optimally and, as a result, end up dragging far too many results across the network. Another problem is that the query engines in relational and semantic/graph pure virtualization tools are typically single-threaded, based on OLTP system architectures. Each query or request therefore benefits from only a single thread to fetch data from multiple systems and join it together, leading to impossibly long query times for all but the simplest use cases. Most federation platforms address this reality by caching - introducing the very data copying, currency and latency challenges the approach seeks to avoid.
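To make the join-planning problem concrete, here is a minimal Python sketch. It is an illustration only, not how any particular federation product works: the two "remote" sources are in-memory lists, and the names ORDERS, CUSTOMERS, remote_scan, naive_federated_join and pushdown_federated_join are all hypothetical. It compares how many rows cross the simulated network when a join is planned naively versus when the filter is pushed down to the sources.

```python
# Two hypothetical "remote" sources, modeled as in-memory lists of dicts.
ORDERS = [
    {"order_id": i, "customer_id": i % 1000, "total": float(i)}
    for i in range(10000)
]
CUSTOMERS = [
    {"customer_id": i, "region": "EU" if i % 10 == 0 else "US"}
    for i in range(1000)
]


def remote_scan(source, predicate=None):
    """Simulate a remote query.

    Returns the matching rows plus the number of rows that would
    have to travel across the network to the federation engine.
    """
    rows = [r for r in source if predicate is None or predicate(r)]
    return rows, len(rows)


def naive_federated_join(region):
    """Fetch both sources in full, then filter and join locally."""
    orders, sent_orders = remote_scan(ORDERS)
    customers, sent_customers = remote_scan(CUSTOMERS)
    wanted = {c["customer_id"] for c in customers if c["region"] == region}
    result = [o for o in orders if o["customer_id"] in wanted]
    return result, sent_orders + sent_customers


def pushdown_federated_join(region):
    """Push the region filter to the customer source first, then
    ask the order source only for orders of the matching customers."""
    customers, sent_customers = remote_scan(
        CUSTOMERS, lambda c: c["region"] == region
    )
    wanted = {c["customer_id"] for c in customers}
    orders, sent_orders = remote_scan(
        ORDERS, lambda o: o["customer_id"] in wanted
    )
    return orders, sent_customers + sent_orders


if __name__ == "__main__":
    naive_rows, naive_sent = naive_federated_join("EU")
    pushed_rows, pushed_sent = pushdown_federated_join("EU")
    print(len(naive_rows), naive_sent)
    print(len(pushed_rows), pushed_sent)
```

Both plans return the same orders, but the naive plan transfers every row from both sources before joining. Note that both versions still run on a single thread, which is the second limitation described above.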

[Figure: MPP DV - Pure Federation]


The Practical Cambridge Semantics Approach

At Cambridge Semantics, we have long known that federation was not the answer to integrating and providing interactive access to complex data at scale. Therefore, over the last decade, culminating in Anzo 5.0, we have pioneered a highly scalable method that affords data engineers and architects great latitude over when and how much data is "pre-positioned" in scalable, cheap storage prior to blending and integration. We accomplished this by using metadata to automate the generation of Spark ETL pipelines, scaled most recently on Kubernetes, to transform large amounts of non-graph data into a fully managed, compressed graph form. Users of our catalog then select which pre-positioned data sets they want to combine into Graphmarts within our MPP Graph Engine, AnzoGraph. AnzoGraph uses many vCPU cores in parallel to load the data into memory for blending, transformation, and of course, access and analytics. Administrators and application owners decide the life-cycle of the pre-positioned data and manage it according to available resources and operational considerations.

[Figure: MPP DV - Anzo Prepositioning]

This approach has worked exceptionally well for our data fabric customers to date, but more flexibility and scale are required as the number and complexity of use cases and required data sources grow to encompass the entire data landscape of the enterprise.

MPP Virtualization - The Cambridge Semantics Breakthrough

AnzoGraph, the MPP Semantic Graph Engine, has thousands of cool features that boil down to two main superpowers:

  1. The ability to parallelize each query across all vCPU cores in a cluster, taking advantage of the abundance of network speed, CPU, and memory in today's cloud offerings.
  2. The ability to rapidly load data into memory from disk or network - also fully parallel.

Anzo's practical approach above uses these capabilities to rapidly load pre-positioned data into memory and then run integration and transformation queries - all fully parallel across clusters in any cloud. It's amazing.

However, Sean Martin, CTO of Cambridge Semantics, had often wondered: "Could these superpowers be directed toward a purer approach to virtualization and, in many cases, avoid the pre-positioning of graph data with Spark?"

Sean and his cross-functional R&D team, which spans both our middleware and database organizations, have proven over the past year that the answer is a resounding YES.

The Graph Data Interface: Parallel Loading from Any Source

The first part of the team's breakthrough journey was enabling AnzoGraph to load from any source, not just the pre-positioned graph files. The team architected the Graph Data Interface (GDI) within AnzoGraph, allowing the engine's parallel loading capability to target any source - RDBMS, streaming, API, graph, flat files, etc. The GDI includes a SPARQL-based mapping extension so that query authors (or generator tools working on behalf of non-technical users) can convert the results of queries against data sources to RDF in memory. The GDI is highly intelligent, able to use database keys and concurrent RDBMS connections along with a specialized, fully partitioned push-down SQL query generator to parallelize both loading and on-the-fly federated query access to remote data sources. Our initial tests yielded load speeds in the same ballpark as loading pre-positioned graph data from disk! What this means is that often nothing is lost by leaving your data in the RDBMS and only loading it into memory when blending and access are required. Of course, the option remains to materialize to graph files on disk if source system constraints, operational realities or latency require it.
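The idea of partitioning a load by database key and fetching the partitions over concurrent connections can be sketched in a few lines of Python. This is a hedged illustration of the general technique only, not the GDI's actual implementation or syntax: it uses SQLite and a thread pool as stand-ins, and the function names (partition_queries, fetch, to_triples, parallel_load, demo), the customer table and the ex: prefixes are all assumptions made for the example.

```python
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor


def partition_queries(table, key, lo, hi, parts):
    """Split the key range [lo, hi] into contiguous range queries."""
    step = -(-(hi - lo + 1) // parts)  # ceiling division
    queries = []
    for start in range(lo, hi + 1, step):
        end = min(start + step - 1, hi)
        sql = f"SELECT {key}, name FROM {table} WHERE {key} BETWEEN ? AND ?"
        queries.append((sql, (start, end)))
    return queries


def fetch(db_path, sql, params):
    """Each worker opens its own connection to the source database."""
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(sql, params).fetchall()
    finally:
        conn.close()


def to_triples(rows):
    """Map each relational row to a (subject, predicate, object) triple."""
    return [(f"ex:customer/{cid}", "ex:name", name) for cid, name in rows]


def parallel_load(db_path, table, key, lo, hi, parts):
    """Run one range query per partition concurrently and merge the results."""
    queries = partition_queries(table, key, lo, hi, parts)
    with ThreadPoolExecutor(max_workers=parts) as pool:
        chunks = pool.map(lambda q: fetch(db_path, q[0], q[1]), queries)
        return [t for rows in chunks for t in to_triples(rows)]


def demo():
    """Build a small throwaway source table and load it in four partitions."""
    fd, db_path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    try:
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
        conn.executemany(
            "INSERT INTO customer VALUES (?, ?)",
            [(i, f"name{i}") for i in range(1, 101)],
        )
        conn.commit()
        conn.close()
        return parallel_load(db_path, "customer", "id", 1, 100, 4)
    finally:
        os.remove(db_path)


if __name__ == "__main__":
    print(len(demo()))
```

In this sketch the key range is assumed to be known up front and evenly populated. A real generator would presumably discover key bounds and distribution from the source, which is beyond what this example attempts.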

[Figure: MPP DV - Anzo MPP Virtualization]


The Power of Views

For several releases now, AnzoGraph has featured views - the ability to define virtual or materialized (in memory) named graphs with SPARQL CONSTRUCT queries. The views can be referenced from within other SPARQL queries just like any other named graph, including joins with other named graphs. Building views on top of queries built with GDI loading clauses means that we can now use the MPP and in-memory power of AnzoGraph to query data in parallel while leaving the data where it is. Administrators or business analysts do not need to pre-load the data into memory or anticipate the types of data users will need in advance, yielding highly efficient and elastic use of cloud computing resources.
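The difference between a virtual and a materialized view can be shown with a small conceptual model in Python. This is only a sketch of the idea, not AnzoGraph's view mechanism or SPARQL syntax: the View class, the load_customers and load_orders functions and the ex: identifiers are hypothetical stand-ins for queries that would read remote sources on demand.

```python
class View:
    """A named graph defined by a query function rather than stored triples."""

    def __init__(self, name, query, materialized=False):
        self.name = name
        self._query = query
        self._materialized = materialized
        self._cache = None
        self.evaluations = 0  # how many times the underlying query ran

    def triples(self):
        # A materialized view reuses its in-memory result once computed.
        if self._materialized and self._cache is not None:
            return self._cache
        self.evaluations += 1
        result = list(self._query())
        if self._materialized:
            self._cache = result
        return result


def load_customers():
    # Stand-in for a query that reads a remote source on demand.
    return [("ex:c1", "ex:name", "Ada"), ("ex:c2", "ex:name", "Lin")]


def load_orders():
    # Stand-in for a second remote source.
    return [("ex:o1", "ex:placedBy", "ex:c1"), ("ex:o2", "ex:placedBy", "ex:c1")]


def orders_with_names(customers, orders):
    """Join two views exactly as if they were ordinary named graphs."""
    names = {s: o for s, p, o in customers.triples() if p == "ex:name"}
    return [
        (s, names[o])
        for s, p, o in orders.triples()
        if p == "ex:placedBy" and o in names
    ]


if __name__ == "__main__":
    customers = View("customers", load_customers)             # virtual
    orders = View("orders", load_orders, materialized=True)   # materialized
    print(orders_with_names(customers, orders))
    print(orders_with_names(customers, orders))
    print(customers.evaluations, orders.evaluations)
```

The virtual view re-runs its query on every access, so it always reflects the source and holds nothing in memory between queries, while the materialized view pays the load cost once and then serves repeated queries from memory.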

The Virtualization Nirvana

All of these MPP capabilities work seamlessly with the semantic metadata-driven approach to the data fabric already embodied within Anzo. Users now have total control over how they elevate data for discovery and integration within the data fabric - no matter where it lies. The great news for our customers is that all their existing data mappings in an Anzo solution will now generate not only the Spark ETL jobs they are accustomed to, but also, without any changes at all, direct-to-database Virtual ELT data load queries, as well as fully virtualized, federation-style query access to all of their data. In due course, switching between these three options will become seamless and highly granular. Most importantly, users will no longer have to choose between leaving their data where it is (lake, warehouse, repository) and having lightning-fast integration and query performance on blended data.

Getting Started

These MPP Virtualization capabilities will be coming soon in GA releases of Anzo and AnzoGraph. However, we are already rolling out these capabilities for evaluation to customers developing their virtualization strategy for their data fabric and data integration initiatives. Please contact us to get involved. We'd love your feedback and use cases to continue pushing the art of the possible.


Tags: Virtualization, MPP
