Integrating Unstructured Data Sources into an Enterprise-Scale Knowledge Graph

A Real-World Example in Production

Because much of the information stored in modern enterprise data ecosystems comes in the form of textual data, an enterprise-scale knowledge graph platform must be able to integrate large collections of this unstructured data. Knowledge graphs need to surface facts from these sources, connect them to more structured data such as relational databases, data lakes or APIs, and support analysis across both. This post showcases one production customer using Cambridge Semantics’ knowledge graph platform, Anzo, to do exactly this.

Anzo treats unstructured data as a first-class citizen in the knowledge graph. Anzo onboards unstructured data -- sources that contain text, such as PDFs, text messages or text snippets embedded in structured data -- directly into the knowledge graph using configurable, scalable pipelines that require no custom coding. These pipelines generate a graph model for the unstructured text and extracted metadata; they create connections in the graph between these elements and related entities; and, optionally, they build an Elasticsearch index that supports fast, fully integrated search queries spanning both free text and semantic relationships within the graph.

Several of Anzo’s customers are leveraging this capability to build scalable, complex knowledge graphs that stretch across diverse sets of structured and unstructured data. To better understand the transformative impact that Anzo’s support of unstructured data provides, let’s take a look at one of these use cases, where Anzo is being used in a mission-critical search application backed by a knowledge graph.

Architectural overview of the customer’s Anzo-backed Knowledge Graph Platform, with the unstructured data components highlighted

Context and Requirements of the Use Case

A large US company needs to build an integrated, large-scale search application that leverages data from multiple applications across the enterprise. Unstructured and structured data from these previously siloed applications must be harmonized into a single, flexible model on which the search application is built. Anzo delivers the scalable, comprehensive knowledge graph that brings these numerous structured and unstructured data sources together. There are a number of complex technical requirements:

Allow analysts to execute text-based searches across millions of documents spanning multiple years, with interactive (1-3 second) response times.
Accommodate hundreds of thousands of new documents added to the knowledge base daily, each searchable within hours of becoming available.
Enable complex filtering and sorting of text-based search results by related but distinct metadata attributes, e.g. people, dates and document classification categories.
Support the development and use of a flexible data model (ontology) that harmonizes structured and unstructured data and can be updated over time.
Incorporate and surface results from a cloud-based, ML-driven text analytics engine used for document classification; facilitate development, testing and validation of this analytics engine to further improve its efficacy over time.

Implementation Details

Onboarding

On a daily basis, Anzo’s pipelines crawl hundreds of thousands of documents to bring them into a production knowledge graph. As part of this onboarding process, the pipelines send the text from each document to an Amazon SageMaker endpoint that hosts the text analytics engine; the endpoint returns structured information about the documents to Anzo. Anzo then incorporates this output into the knowledge graph, connecting it to the related entities. Additionally, the pipeline builds an Elasticsearch index over the text contained in all communication records, so that text searches can be executed within knowledge graph queries.
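The onboarding flow above can be pictured with a minimal Python sketch. It is not Anzo's implementation: the `Document` fields, the `ex:` predicates, the category names and the `keyword_classifier` stand-in are all hypothetical, and the classifier is passed in as a plain function where the production pipeline would call the hosted text analytics endpoint.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]


@dataclass
class Document:
    doc_id: str
    text: str
    author: str  # hypothetical metadata field


def keyword_classifier(text: str) -> Dict[str, float]:
    """Stand-in for the remote text analytics engine (assumption).

    Returns a score per category; the categories are illustrative only.
    """
    lowered = text.lower()
    return {
        "contract": 1.0 if "contract" in lowered else 0.0,
        "invoice": 1.0 if "invoice" in lowered else 0.0,
    }


def onboard(
    doc: Document,
    classify: Callable[[str], Dict[str, float]],
    min_score: float = 0.5,
) -> List[Triple]:
    """Turn one document plus the classifier output into graph triples."""
    subject = f"doc:{doc.doc_id}"
    triples: List[Triple] = [
        (subject, "rdf:type", "ex:Document"),
        (subject, "ex:text", doc.text),
        # Link the document to a related entity already in the graph.
        (subject, "ex:author", f"person:{doc.author}"),
    ]
    # Keep only the categories the classifier scored at or above min_score.
    for label, score in sorted(classify(doc.text).items()):
        if score >= min_score:
            triples.append((subject, "ex:classifiedAs", f"category:{label}"))
    return triples


if __name__ == "__main__":
    doc = Document("42", "Please review the attached contract.", "alice")
    for triple in onboard(doc, keyword_classifier):
        print(triple)
```

Injecting the classifier as a function keeps the sketch runnable offline and mirrors the separation described above, where the analytics engine is hosted apart from the pipeline that consumes its output.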

Overview of the architectural components used in the customer’s automated pipelines for unstructured data onboarding

This near real-time pipeline is scaled to process thousands of electronic documents per minute, every day, with no human intervention required. The customer’s knowledge graph is populated with references to the documents, the text content of the documents, and the output from SageMaker; all of these artifacts are linked directly to other related entities in the knowledge graph.
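As a rough sanity check on how these figures fit together, the short calculation below estimates how long a day's documents take to process. The specific numbers (300,000 documents per day, 2,000 documents per minute) are assumptions chosen to fall within the ranges quoted in this post, not measurements from the customer's system.

```python
def minutes_to_process(docs_per_day: int, docs_per_minute: int) -> float:
    """Minutes needed to work through one day's documents at a steady rate."""
    return docs_per_day / docs_per_minute


if __name__ == "__main__":
    daily_docs = 300_000  # assumed value for "hundreds of thousands" per day
    rate = 2_000          # assumed value for "thousands" per minute
    minutes = minutes_to_process(daily_docs, rate)
    print(f"{minutes:.0f} minutes ({minutes / 60:.1f} hours)")
```

Under these assumed figures a full day's intake clears in 150 minutes, which is consistent with the requirement that new documents become searchable within hours.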

Querying and Analysis

The result of the onboarding process described above is an enterprise knowledge graph enriched with the output from processing unstructured text content comprising millions of documents stretching back three years. Loading this massive, heterogeneous knowledge graph -- with tens of billions of triples -- into AnzoGraph alongside the associated Elasticsearch indexes enables sophisticated querying and analysis at interactive speeds.

The customer’s analysts use the platform to ask complex questions quickly and easily. Using a tailored front-end that sits on top of the text-enriched knowledge graph, the analysts issue free-text searches across the millions of documents stored within the knowledge graph. They are able to filter their search results based on connections with the harmonized structured data in the graph. The searches and other UI controls are translated into queries executed against Anzo that combine Elasticsearch text querying with SPARQL queries traversing multiple hops in the knowledge graph. Leveraging the processing power of hundreds of CPU cores operating in parallel in the AnzoGraph MPP query engine cluster and Elasticsearch cluster, the vast majority of queries return results in under 2 seconds. The end users experience a dynamic, interactive search platform powered by free-text searching coupled with complex queries across many different dimensions of data in the knowledge graph.

Overview of the architectural components leveraged by the customer when querying the unstructured data in AnzoGraph and Elasticsearch

Logical diagram of hybrid Knowledge Graph/Elasticsearch queries against unstructured and structured data in the knowledge graph
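The logic of such a hybrid query can be illustrated with a small, self-contained Python sketch. It is only an in-memory analogy: a toy inverted index stands in for the text index, a list of triples stands in for the graph, and the `ex:author` and `ex:memberOf` predicates and all sample data are hypothetical. It does not show Anzo's query syntax.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]


def build_text_index(docs: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each lowercase token to the ids of the documents containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token.strip(".,;:!?")].add(doc_id)
    return index


def text_search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    return set.intersection(*(index.get(term, set()) for term in terms))


def objects(triples: List[Triple], subject: str, predicate: str) -> Set[str]:
    """All objects reachable from a subject over one predicate."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def hybrid_search(
    index: Dict[str, Set[str]],
    triples: List[Triple],
    query: str,
    department: str,
) -> List[str]:
    """Text hits whose author belongs to the department (a two-hop filter)."""
    results = []
    for doc_id in text_search(index, query):
        authors = objects(triples, doc_id, "ex:author")
        if any(department in objects(triples, a, "ex:memberOf") for a in authors):
            results.append(doc_id)
    return sorted(results)


docs = {
    "doc:1": "Quarterly risk review for the credit desk.",
    "doc:2": "Risk review postponed until next week.",
    "doc:3": "Lunch menu for Friday.",
}
triples = [
    ("doc:1", "ex:author", "person:alice"),
    ("doc:2", "ex:author", "person:bob"),
    ("doc:3", "ex:author", "person:alice"),
    ("person:alice", "ex:memberOf", "dept:compliance"),
    ("person:bob", "ex:memberOf", "dept:sales"),
]

if __name__ == "__main__":
    index = build_text_index(docs)
    print(hybrid_search(index, triples, "risk review", "dept:compliance"))
    # prints ['doc:1']
```

The text search narrows the candidates first and the graph traversal then filters them, which is one plausible ordering; a production engine would choose the evaluation order itself.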

Text Analytics Engine Development

Beyond forming the basis of the search and query platform exposed to end users, Anzo’s knowledge graph platform and unstructured data integration capabilities serve the client’s internal data science team as a flexible validation framework that drives development of their text analytics engine. In a pre-production environment, the data science team uses Anzo’s unstructured data pipelines to rapidly feed large volumes of historical unstructured data as input into their text analytics engine. Anzo brings the engine’s output from these historical runs back into the knowledge graph, where it can be analyzed at scale alongside the related structured data.

Architectural overview of the customer’s validation and testing framework for their text classification model

The customer’s data science team leverages this non-production pipeline as the basis of a powerful backtesting and validation framework. The team iteratively runs historical documents through the pipeline, adjusting the classification model between runs. They then analyze the outputs alongside each other in the graph and compare precision and recall at different score thresholds across runs. Anzo’s scalable onboarding pipelines and AnzoGraph’s in-memory query engine significantly reduce the operational and financial costs associated with this process. The result is a comprehensive and agile workflow that enables iterative development of the text classification model, driven by and fully integrated with the customer knowledge graph.
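The comparison step in this backtesting loop can be sketched in a few lines of Python. This is an illustration under assumptions, not the customer's framework: the run names, scores and ground-truth labels are made up, and each model run is assumed to emit one confidence score per document.

```python
from typing import Dict, List, Tuple


def precision_recall(
    scores: List[float], labels: List[bool], threshold: float
) -> Tuple[float, float]:
    """Precision and recall when scores at or above threshold count as positive."""
    predicted = [score >= threshold for score in scores]
    tp = sum(p and y for p, y in zip(predicted, labels))
    fp = sum(p and not y for p, y in zip(predicted, labels))
    fn = sum((not p) and y for p, y in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def compare_runs(
    runs: Dict[str, List[float]], labels: List[bool], thresholds: List[float]
) -> Dict[str, List[Tuple[float, float]]]:
    """Evaluate every run at every threshold against the same labels."""
    return {
        name: [precision_recall(scores, labels, t) for t in thresholds]
        for name, scores in runs.items()
    }


# Hypothetical ground truth and per-document scores from two model versions.
labels = [True, True, False, False, True]
runs = {
    "run_a": [0.9, 0.4, 0.6, 0.2, 0.8],
    "run_b": [0.9, 0.7, 0.3, 0.2, 0.8],
}

if __name__ == "__main__":
    thresholds = [0.3, 0.5, 0.7]
    for name, results in compare_runs(runs, labels, thresholds).items():
        for threshold, (precision, recall) in zip(thresholds, results):
            print(f"{name} @ {threshold}: precision={precision:.2f} recall={recall:.2f}")
```

Evaluating every run against the same labelled historical documents is what makes the runs comparable; in the setup described above, those outputs would sit side by side in the graph.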

What are you waiting for?

In this post, we’ve demonstrated how a production customer has used Anzo to build a complex knowledge graph of structured and unstructured data with millions of documents and billions of data points. This massive, rapidly expanding knowledge graph is used to harmonize previously siloed structured and unstructured datasets, to power a state-of-the-art search application, and to facilitate development of a text classification engine.

If you’re interested in learning more about how Anzo can help your organization build a robust, cutting-edge solution using a knowledge graph platform, get in touch with us at Cambridge Semantics and we’d be happy to show you more.

The Smart Data Blog