At Cambridge Semantics, Inc. my team and I work with a diverse array of customers across such verticals as Life Sciences, Healthcare, Financial Services, Government and Manufacturing. Although these industries differ, as presales engineers we are commonly asked similar questions about barriers to launching a knowledge graph project. In this blog and video we dive into these common themes and share our responses.
Can we connect to product X, Y, or Z?
Product X, Y, or Z could be an upstream system, such as Oracle, Teradata, or Snowflake. It can also be a downstream system, such as Tableau or another BI tool that consumes data from Anzo. The short answer is Yes.
From an upstream perspective, Anzo has out-of-the-box connectors to connect to and ingest data from a multitude of sources. It could be unstructured data like text files or PDFs, or structured/semi-structured files like CSV, XML, or JSON. It could also be data from a traditional relational source, or any data source that provides a JDBC driver. Through our partnership with CData, over 100 different sources are available through that driver option. Customers can also create direct queries to API sources. In the case of closed systems, we create semantic services to directly ingest information from that source.
Now, let’s talk about pushing data downstream. Anzo connects to traditional BI tools such as Power BI or Tableau, and to data science tools or relational databases. The APIs that allow for downstream access also include endpoints for ODBC, JDBC, and OData. If Jupyter Notebooks are your goal, there’s a library called PyAnzo for directly pushing data into those notebooks.
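As a rough illustration of what consuming an OData endpoint can look like from a script, the sketch below builds a query URL from the standard OData system query options ($select, $filter, $top). The host name, feed path, entity set, and property names are hypothetical placeholders, not actual Anzo endpoints, and the helper function is ours rather than part of any Anzo library.

```python
from urllib.parse import quote, urlencode


def build_odata_url(service_root, entity_set, select=None, filter_expr=None, top=None):
    """Build an OData query URL from standard system query options."""
    params = {}
    if select:
        params["$select"] = ",".join(select)
    if filter_expr:
        params["$filter"] = filter_expr
    if top is not None:
        params["$top"] = str(top)

    url = f"{service_root.rstrip('/')}/{quote(entity_set)}"
    if params:
        # Keep "$" and "," literal; encode spaces as %20 rather than "+".
        url += "?" + urlencode(params, safe="$,", quote_via=quote)
    return url


# Hypothetical service root and entity set, for illustration only.
url = build_odata_url(
    "https://anzo.example.com/odata/hypothetical-feed",
    "Customer",
    select=["name", "country"],
    filter_expr="country eq 'US'",
    top=10,
)
print(url)
```

A BI tool normally builds requests like this for you; the point is only that the downstream contract is an ordinary HTTP query, so any client that speaks OData can consume it.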
Will Anzo replace my ETL tool and/or my Data Catalog?
No, these are very complementary tools to the Anzo platform. Your traditional ETL tools are great. They're incredibly useful for the data transformation steps needed to pull data from your operational data sources and place it into your data warehouses. Anzo does not replace that step.
Anzo does create its own metadata catalog, but it is complementary to standard data catalogs. Collibra and Alation contain a wealth of metadata that can be used to link and harmonize data in the knowledge graph. Anzo often treats this information as another data source used to connect entities in the knowledge graph.
Is Anzo copying, moving, or virtualizing data?
Anzo has a hybrid approach with three routes for ingesting data from a data source. The first and most common route is Spark. Anzo leverages Spark pipelines to move and convert traditional relational sources into RDF and store the resulting compressed RDF within your data lake. Data is not being moved. As needed, we will accelerate that into our in-memory graph database to create the knowledge graph.
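To make "converting relational sources into RDF" a little more concrete, here is a minimal sketch in plain Python that turns table rows into N-Triples lines. It is not Anzo's Spark pipeline. The namespace, table, and column names are assumptions chosen for illustration, every value is emitted as a plain string literal, and column names are assumed to be simple identifiers.

```python
from urllib.parse import quote

BASE = "http://example.org/"  # hypothetical namespace, not an Anzo default
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def escape_literal(value):
    """Escape a value for use inside an N-Triples string literal."""
    return (str(value).replace("\\", "\\\\").replace('"', '\\"')
            .replace("\n", "\\n").replace("\r", "\\r"))


def row_to_triples(table, key_column, row):
    """Map one relational row to N-Triples: one subject per primary key."""
    subject = f"<{BASE}{table}/{quote(str(row[key_column]), safe='')}>"
    triples = [f"{subject} <{RDF_TYPE}> <{BASE}{table}> ."]
    for column, value in row.items():
        if column == key_column or value is None:
            continue  # NULLs simply produce no triple
        predicate = f"<{BASE}{table}#{column}>"
        triples.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return triples


rows = [
    {"id": 1, "name": "Acme", "country": "US"},
    {"id": 2, "name": 'Globex "EU"', "country": None},
]
lines = [t for row in rows for t in row_to_triples("Customer", "id", row)]
print("\n".join(lines))
```

Each row becomes a subject, each non-null column becomes a predicate and literal, and the table name becomes the class. A real pipeline would also handle datatypes and foreign keys.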
The second route uses our horizontally scalable back-end graph database (GDI) to bring data into memory on demand. The main difference between this and the first option is when you're taxing your sources of data. In the first approach, ingestion is orchestrated to run at intervals. In the second approach, you hit the source system only when the user requests it.
The third route uses direct virtualized connections into the data source.
The best route will vary based on the data source. If you have something like an API, where the data is changing quite frequently, it might make sense to use a virtualized connection, whereas for historical data you might use the Spark-based approach.
How do I build my first model?
Anzo generates models directly from data sources using the available metadata. If you're generating a model from a database, Anzo will use the foreign key connections to link the entities that it creates in the generated model. Using out-of-the-box tooling, you can edit that model as you’d like without needing to know OWL, so it’s easy to extend the model by adding classes or properties as needed.
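The idea of deriving a model from database metadata can be sketched with nothing more than SQLite's built-in schema introspection. The example below is an illustration under our own assumptions, not Anzo's implementation: each table becomes a class, non-key columns become its properties, and each foreign key becomes a link between two classes. The table and column names are made up.

```python
import sqlite3


def generate_model(conn):
    """Return (classes, links) derived from table and foreign-key metadata."""
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' "
        "AND name NOT LIKE 'sqlite_%' ORDER BY name")]
    classes, links = {}, []
    for table in tables:
        # table_info rows: (cid, name, type, notnull, dflt_value, pk)
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        # foreign_key_list rows: (id, seq, ref_table, from_col, to_col, ...)
        fks = conn.execute(f'PRAGMA foreign_key_list("{table}")').fetchall()
        fk_columns = {fk[3] for fk in fks}
        # Foreign-key columns become links rather than plain properties.
        classes[table] = [c[1] for c in columns if c[1] not in fk_columns]
        links.extend((table, fk[3], fk[2]) for fk in fks)
    return classes, links


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        total REAL,
        customer_id INTEGER REFERENCES customer(id)
    );
""")
classes, links = generate_model(conn)
print(classes)
print(links)
```

Here the orders table ends up as a class linked to customer through customer_id, which is the kind of starting point you would then refine in a modeling editor.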
There’s also what we call the bring your own model (BYOM) approach. If you already have a model, you can import it directly into the Anzo Modeling Editor. In this approach, as you're bringing in data sources you will have to create virtualized mappings and data layers to transform your source data into that model.
What skills do I need?
SPARQL and RDF are not the most widely distributed skills. We’ve seen a spectrum of customers, ranging from those with plenty of knowledge graph and semantics expertise to those just diving in. The need to know some of these concepts will vary based on user roles.
Analysts, your data consumers using the BI tools, don’t really need to be knowledgeable in these concepts. They will continue to use their BI tools, but the data pushed to them will be accelerated thanks to Anzo.
For the users who build models or link and harmonize data (data architects and engineers), there are a couple of useful skills:
Data Modeling
There's a lot of overlap with traditional data modeling tools (e.g. Erwin diagrams), so these skills will translate easily. Anzo also has tooling to help accelerate the generation of those models.
SPARQL
Being able to write SPARQL queries can be important to help harmonize and transform data. SQL and SPARQL have a lot of overlap, so don’t be daunted. But again, Anzo contains a lot of features that will help accelerate learning and development. You may not even have to write SPARQL queries. Anzo directly suggests connections for you and will generate different visualizations and tables without a user writing queries. However, in some advanced use cases, you will want to know and understand how SPARQL works.
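To give a feel for that overlap, the sketch below asks the same question in both languages: which customers placed orders over 100, largest first. The tables, the ex: namespace, and the property names are hypothetical. Only the SQL is executed here, using Python's built-in sqlite3 module; the SPARQL text is shown for comparison and would need a SPARQL engine to run.

```python
import sqlite3

SQL_QUERY = """
SELECT c.name, o.total
FROM orders AS o
JOIN customer AS c ON o.customer_id = c.id
WHERE o.total > 100
ORDER BY o.total DESC
"""

# The same question as a SPARQL query (hypothetical vocabulary).
# The JOIN becomes a shared variable (?customer), WHERE becomes FILTER.
SPARQL_QUERY = """
PREFIX ex: <http://example.org/>
SELECT ?name ?total
WHERE {
  ?order    a ex:Order ;
            ex:total ?total ;
            ex:placedBy ?customer .
  ?customer ex:name ?name .
  FILTER (?total > 100)
}
ORDER BY DESC(?total)
"""

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL, customer_id INTEGER);
    INSERT INTO customer VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO orders VALUES (1, 250.0, 1), (2, 80.0, 1), (3, 120.0, 2);
""")
rows = conn.execute(SQL_QUERY).fetchall()
print(rows)
print(SPARQL_QUERY)
```

SELECT, filtering, and ORDER BY carry over almost word for word. The main shift is that joins are expressed by reusing a variable across triple patterns instead of matching key columns.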
When diving into more advanced use cases, feel free to reach out to Cambridge Semantics. We would be happy to help you out.