This document is a guide both for those applying industry-standard and other publicly available graph database benchmarks to Cambridge Semantics' database offering and for those devising new ones. We wrote it because we are learning that, when it comes to AnzoGraph DB, graph database practitioners too frequently either apply the wrong yardsticks or misunderstand the results of their comparisons of AnzoGraph DB against other offerings. The hope is that this article helps interested parties appropriately position and correctly benchmark AnzoGraph DB.
Introduction
Many organizations are beginning to explore graph database technologies because they offer the potential to revolutionize the way they create, aggregate, connect up and analyze even their most heterogeneous and complex data. As with any new technology, data practitioners are looking for ways to quickly familiarize themselves with the capabilities of commercial offerings from vendors in the category, as well as those of comparable open-source projects. This helps them decide which products might provide the most appropriate solutions to their various business data problems.
One of the ways they do this is through comparative benchmarking of the different software options under consideration, attempting to determine the relative performance of different features in as close to an "apples versus apples" environment as can practically be achieved. In the case of database software, this usually means using hardware, data sets and a query mix that are as identical as practically possible, carefully recording the times for each element under test for each program being evaluated, and finally comparing all the results side by side. In some cases, a comparison of relative costs may also be included as a criterion.
As any quick Google search will reveal, the test driving of graph databases has broadly been approached in the same manner as that of their RDBMS brethren. Various organizations, and even some vendors, have published the results of graph-specific benchmarks (e.g. Graph 500, the Lehigh University Benchmark (LUBM) or SP2B) in order to educate themselves or the market on each specific graph database's sweet spot and how the products rank on various measures. In many cases, potential graph database users will also devise their own benchmarks consisting of datasets and queries representative of the business problem they need to solve. So the questions are: which benchmarks are most appropriate when evaluating software like AnzoGraph DB, and is there an easy way to try them? Read on for answers.
With AnzoGraph DB you are probably not comparing apples to apples
While general-purpose graph database technology has evolved over the last couple of decades, AnzoGraph DB represents a completely new and exciting departure for this technology. Until now, all the graph database technologies that we are aware of have been optimized around a transaction-processing data access pattern, known in the industry as OLTP (Online Transaction Processing). In complete contrast, AnzoGraph DB is first-of-its-kind enterprise-scale Graph Data Warehouse software. In other words, it specializes in OLAP (Online Analytical Processing). We call what it does GOLAP (Graph Online Analytical Processing), and you need to think about it and benchmark it differently.
Analytics (OLAP) versus Transactions (OLTP) briefly
The difference between these two complementary styles of database system and their target use cases is well understood in the world of relational database technologies. But until now this distinction has never been a factor in the selection of a graph database, because there was really only one form of graph technology available. Relational-style OLTP (e.g. MySQL or AWS Aurora) describes mission-critical, real-time transactional workloads, in which the system responds immediately to user requests. Examples include a shopping cart on an online web site, an EHR (Electronic Health Record) system, or the data manipulation that supports an ATM banking withdrawal. This access pattern is usually characterized by extremely frequent and fast insert, update and delete read/write operations, where an individual record or a small number of records, "nodes/vertices" in graph speak, are pinpointed quickly along with their related records connected by "edges".
Transactional-application-style queries are usually relatively simple compared to those performed by data warehouse technologies like Oracle's Exadata, Amazon Redshift, Snowflake, Teradata and now AnzoGraph DB! In contrast to transactional workloads, OLAP analytics access patterns contain many aggregations, groupings, filters and join operations that read in very large amounts of integrated data blazingly fast, allowing users to slice and dice that data to analyze it and produce forecasts, roll-up reports and insights. Another difference is that OLAP systems are designed to very quickly load enormous amounts of data originating in other systems and then post-process it inside the database as needed, using ELT (Extract [from data source], Load [into data warehouse], Transform [using queries to clean, reshape, integrate and connect up the data]) queries.
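To make the contrast concrete, here is a minimal sketch in SPARQL over a hypothetical retail dataset (the ex: prefix, predicates and data are illustrative assumptions, not a shipped schema):

```sparql
PREFIX ex: <http://example.com/ontology#>

# OLTP style: pinpoint one customer's open orders (a handful of records).
# All names here are hypothetical.
SELECT ?order ?status
WHERE {
  ex:customer42 ex:placed ?order .
  ?order ex:status ?status .
  FILTER (?status = "open")
}
```

```sparql
PREFIX ex: <http://example.com/ontology#>

# OLAP style: scan and aggregate every sale in the graph,
# grouped across two dimensions. All names here are hypothetical.
SELECT ?region ?quarter (SUM(?amount) AS ?totalSales) (AVG(?amount) AS ?avgSale)
WHERE {
  ?sale ex:store   ?store ;
        ex:amount  ?amount ;
        ex:quarter ?quarter .
  ?store ex:region ?region .
}
GROUP BY ?region ?quarter
ORDER BY ?region ?quarter
```

The first query touches a few records; the second reads the entire graph. A GOLAP engine is built for the second shape of work.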
Often, completely solving a business problem requires solutions that employ both types of database system. In a supermarket context, for example, the OLTP systems would be responsible for recording an individual sale, decreasing a stock level or facilitating a transfer of payment. Meanwhile, an OLAP system would be responsible for figuring out which coupons to issue the customer, or for providing management reports and analysis on aggregations of all the sales data from the entire supermarket chain for the day, month, year or even decade.
Does this mean we can finally scale up and perform complex queries across very large knowledge-graph data volumes?
Yes, it does. AnzoGraph DB can certainly scale and support sophisticated queries across very large amounts of integrated, connected data, described as a knowledge-graph, to answer enterprises' knottiest data questions. The software supports enormously long and complex queries; indeed, we have often seen query strings in the hundreds of kilobytes. These queries can touch the farthest reaches of enormous graphs and, through multiple hops, everything in between, requiring huge numbers of join operations, either to pull back and compute on data or in the multitude of filters that shape the questions being asked across many dimensions at the same time.
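As a flavor of what such a multi-hop query can look like, here is a small sketch over a hypothetical supply-chain knowledge-graph (the ex: vocabulary, starting node and thresholds are invented for illustration):

```sparql
PREFIX ex: <http://example.com/ontology#>

# Walk an arbitrary number of supplier relationships out from one
# (hypothetical) manufacturer, join each supplier found with its
# risk profile, and filter across several dimensions at once.
SELECT DISTINCT ?supplier ?riskScore ?country
WHERE {
  ex:acmeManufacturing (ex:suppliesTo|^ex:suppliesTo)+ ?supplier .
  ?supplier ex:riskScore ?riskScore ;
            ex:country   ?country .
  FILTER (?riskScore > 0.8)
  FILTER (?country IN ("DE", "FR", "IT"))
}
ORDER BY DESC(?riskScore)
```

The property path in the first triple pattern is standard SPARQL 1.1; each hop it expands is effectively another join, which is exactly the kind of work an MPP engine parallelizes.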
It is important to understand that the design of AnzoGraph DB is fundamentally different from that of all the graph stores you may have used in the past. AnzoGraph DB is a true MPP (massively parallel processing) compute engine designed for use on multi-core commodity servers in the cloud or the data center, or on clusters of such servers. Unlike other MPP-style graph engines, it works by turning every query handed to it into a pipeline in which every step is executed in parallel by every processor in the cluster, each against a separate segment of the data, in exactly the same way that HPC (High-Performance Computing) supercomputers work.
Does AnzoGraph DB support transactions?
Yes. AnzoGraph DB is ACID compliant; however, it is not designed or optimized for high-throughput small transactions. Use cases that require transactions can use AnzoGraph DB unless they need high-throughput, sub-second response times for those transactions.
What about Graph algorithm support?
Most graph databases support the execution of analytics algorithms that are specific to a graph data representation. AnzoGraph DB is no exception, so in addition to being a high-performance GOLAP engine, it also supports a growing library of algorithms that you can execute in parallel against very large volumes of data.
Do I have to have an enormous server or cluster of servers to run AnzoGraph DB?
No, you don't, unless you have a lot of data and/or complex queries. AnzoGraph DB will run on laptops and servers as small as a single- or dual-core CPU with 8GB of memory, and on such small machines modest datasets (in the tens of millions to one hundred million triples range) load and execute OLAP-style queries just fine. However, the difference in performance between AnzoGraph DB and other graph stores will likely be less visible at these scales, because all of them will seem to do reasonably well at small data volumes and while the queries remain simple. Once the data volumes begin to increase (into the hundreds of millions to multiple billions of nodes and edges) or the query complexity rises, the difference in performance will quickly become clear. Naturally, you will need to scale up the available RAM and number of CPU cores to accommodate rising data volumes. The good news is that you can, because AnzoGraph DB scales from a laptop to a supercomputer-sized server farm depending on your needs.
So just how much data can AnzoGraph DB handle?
Way more than you'll need. Most graph stores scale data volume and query throughput first by increasing the size and power of the database server hardware (vertical scaling) and after that by replicating the server horizontally, taking a complete or partial copy of the data onto additional "replica" machines in order to service more query requests in parallel. For this reason, users often find that they quickly reach the point where it becomes impractical to add any more data to their database, since their total database volume is essentially limited by the storage and computational capacity of a single server machine. Copying the entire database to another server provides capacity for additional simultaneous queries, but it does not increase the overall amount of data that can be stored and queried. In our experience, most other graph databases have fairly modest upper limits for practical use, even on large server hardware, and this should not be too surprising given that they are not designed for OLAP.
With AnzoGraph DB's shared-nothing parallel architecture, horizontal scaling is implemented differently. When data volume (or indeed query throughput) demands start to rise beyond what a single Linux server can accommodate, AnzoGraph DB can be configured to use the combined resources of multiple commodity servers, and it will automatically shard (the AnzoGraph DB term is "slice") the data evenly across them as the data is loaded.
In this way, additional servers may be added to an AnzoGraph DB cluster as required to take advantage of its ability to scale performance linearly. The largest AnzoGraph DB cluster created to date was assembled in 2016 with the help of Google's cloud team and comprised two hundred regular Intel-based 64-core server systems linked together on the same network, creating a single logical database running on all 12,800 CPU cores simultaneously. Every query sent to the system was decomposed into many thousands of smaller step operations and run in parallel by every CPU core over every slice of the complete data set, which contained over a trillion facts. This configuration shattered the previous record for the LUBM benchmark by doing in a couple of hours what had previously taken a high-end industry-standard relational data warehouse system many days. Given the multitude of software and hardware performance improvements made over the intervening years, we believe that today AnzoGraph DB would replicate the feat much faster on a cluster a fraction of that size!
So what does AnzoGraph DB do really well?
AnzoGraph DB is first-of-its-kind graph data warehouse software. You should expect it to perform like existing data warehouse technologies, and that is what you should consider comparing it to when you benchmark it, while noting that it also delivers the extraordinarily flexible capabilities of a schemaless graph. Unlike most graph databases, AnzoGraph DB is therefore very well suited to facilitating highly scalable, iterative data integration and the associated analytics. In fact, because it is schemaless and based on RDF triples, it solves many of the most difficult data integration problems that plague and limit the use of traditional RDBMS data warehouse and Hadoop Big Data technologies.
After your raw data is loaded in parallel from heterogeneous application systems and data lakes with ELT (Extract [from upstream raw data source], Load [into AnzoGraph DB], Transform [clean, connect and reshape your data into a graph]) ingestion queries, AnzoGraph DB can then be used to perform all your data integration work in-database. It does this using business rules implemented as a series of high-performance transformation queries. Transformation queries can take advantage of the fact that AnzoGraph DB is a multi-graph system, which provides a simple means to optionally keep the results of a transformation separate from the data sets the transformation query acted on, each stored in isolation in its own graph container but queried as configurable collections that form a single logical graph. This method of refining raw data enables extraordinary agility in data integration through fast iteration, especially compared to the laborious process of creating and debugging traditional ETL pipelines.
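As a hedged illustration of what such an in-database transformation can look like, the following standard SPARQL 1.1 Update query reshapes hypothetical raw records into a separate curated graph container, leaving the raw graph untouched for further iterations (all graph names, prefixes and predicates here are invented for the example):

```sparql
PREFIX ex:  <http://example.com/ontology#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

# Illustrative business rule: keep only sales with a valid positive
# amount, writing the cleaned shape into its own graph container.
INSERT {
  GRAPH <urn:graph:curated-sales> {
    ?sale a ex:Sale ;
          ex:amount   ?amount ;
          ex:customer ?customer .
  }
}
WHERE {
  GRAPH <urn:graph:raw-sales> {
    ?sale ex:rawAmount  ?rawAmount ;
          ex:customerId ?customer .
    BIND (xsd:decimal(?rawAmount) AS ?amount)
    FILTER (?amount > 0)
  }
}
```

Because the raw graph is untouched, a rule that turns out to be wrong can simply be revised and re-run against the same source container, which is what makes the iteration fast.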
The process described above is quite unlike the ETL activities required to integrate and load most other graph databases, where it is usually necessary to understand all the data sources upfront, pre-create the graph structure outside the database, and then load it into a predefined, fixed target graph schema. AnzoGraph DB is far more flexible than graph databases that require upfront schema creation, since it can load all required data first, without regard to its schema, and is powerful and flexible enough to use queries to iteratively transform that data into the knowledge-graphs needed for analytics or to feed downstream tools.
AnzoGraph DB supports both batch and interactive queries in exactly the same way. Depending on the level of query complexity and the number of available CPU cores, interactive analytics against even enterprise-scale knowledge-graphs is perfectly feasible, with many queries returning in a fraction of a second. The multi-graph query support also has utility when it comes to managing access to data, since different elements of the graph (and even individual properties on nodes) can be stored in isolation in their own graph containers, with incoming application queries scoped to the composite logical graph formed by just the graph containers to which that user has access.
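For example, a query can be scoped to a permitted set of graph containers with standard SPARQL dataset clauses; a minimal sketch, again using invented graph names and vocabulary:

```sparql
PREFIX ex: <http://example.com/ontology#>

# Only the two (hypothetical) containers this user is allowed to see
# are composed into the logical graph for this query.
SELECT ?customer (SUM(?amount) AS ?totalSpend)
FROM <urn:graph:curated-sales>
FROM <urn:graph:customer-master>
WHERE {
  ?sale ex:customer ?customer ;
        ex:amount   ?amount .
}
GROUP BY ?customer
ORDER BY DESC(?totalSpend)
LIMIT 10
```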
What about AnzoGraph DB in the cloud?
Another place to think about AnzoGraph DB differently is cloud deployment and how to reduce the overall cost of data analytics operations; this is another dimension that data practitioners may wish to benchmark when looking for a solution. While AnzoGraph DB will naturally operate 24x7 like other data warehouses, it was also designed to let our users take advantage of cloud computing's on-demand infrastructure and its accompanying pay-as-you-consume business model. For many, it is worth understanding how AnzoGraph DB can be quickly and automatically deployed and undeployed using cloud provider APIs and Kubernetes.
Depending on how fast the cloud provider's data center can provision its servers in parallel, deployment can take as little as two or three minutes, or just a minute if Kubernetes hosts have been pre-provisioned. As mentioned earlier, AnzoGraph DB's parallel data loading is extremely fast; for reference, all the data for our two-hundred-node cluster benchmark in 2016 was fully loaded in 1740 seconds. After loading, ELT, analytics and data extraction queries can be run as required, and then the server or cluster can be automatically destroyed, ending the charging period.
Since AnzoGraph DB scales linearly, it is often as economical to use a larger cluster (thereby making more parallel processing available for loading and analytics) to achieve the same results in a much shorter time as it is to leave less cluster hardware running longer. Because charges for cloud hardware are often linear, using more servers for a shorter period to achieve the same computing results costs the same as using fewer servers for a longer period. In situations where an analytics problem has a time-critical dimension, the cloud's ability to quickly provide enormous amounts of hardware on demand for parallel processing can be both an easy and an economical solution.
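To make the economics concrete with assumed, purely illustrative prices: at $1 per server-hour, a 4-server cluster that takes 2 hours costs 4 × 2 × $1 = $8, while a 16-server cluster that finishes the same work in 30 minutes costs 16 × 0.5 × $1 = $8, delivering the answer four times sooner for the same spend.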
For detailed information on how to deploy AnzoGraph DB on various cloud providers and other platforms, please refer to the deployment guide.
I don't have all that much data and I don't want to create a multi-server cluster... is AnzoGraph DB still something I should try?
Yes, definitely. We have seen significant value achieved on even small systems where there is not much data but the query complexity is high (e.g. analytical queries) or fast load speeds are required. Naturally, hardware with a higher number of CPU cores and hyperthreading enabled will perform better, because AnzoGraph DB is a parallel database. Many people run AnzoGraph DB on their laptops using Docker if they are not running Linux natively, and sometimes even under WSL on Windows 10, although the latter is not yet officially supported. As additional data science functions are added in support of feature engineering, along with integration with popular data science software packages like Zeppelin, Jupyter, R and Pandas, users may find AnzoGraph DB extremely useful for integrating data from multiple sources, including unstructured data, into a knowledge-graph to support advanced data analytics and machine learning activities.
General benchmarking advice
Benchmarks are good for a general comparison of common features. However, if you are evaluating different graph databases, use as much real-world business data, and as many queries designed to solve your specific business problems, as possible for the most useful results. You should also define what success means for each of the tests you create. How much data will need to be loaded and queried? Are there SLAs for the number of simultaneous queries and for response times? How difficult is each system to configure and tune to get the results you need consistently, and as data volumes rise? In our experience, these real-world results will be much more relevant and useful in a readout than some algorithm comparison that might never be used in practice. Likewise, if one database completes a query a few seconds faster but is extremely complex to query, configure or tune, that speed might not be what really matters to your organization.
AnzoGraph DB Benchmark do's and don'ts
- Do appreciate that most of the available benchmarks were not designed to test a graph data warehouse, and make adjustments where you can; otherwise, it is simply not a fair comparison.
- Do use bigger test data volumes if you can. At least 100 million to a few billion triples on an appropriate server will help you understand what the system can do, and multiple billions if you decide to try benchmarking a cluster.
- Do design and run complex queries that include many subqueries and complex filters. Queries should span many entity types that, in an equivalent RDBMS, would require many SQL JOIN operations.
- Do enjoy the extremely fast data loading. In the right hardware environments, we have seen parallel loads of as many as 3-4 million triples per second per cluster node on systems with good inter-server network bandwidth and SSD disks.
- To achieve the fastest loading times, split your large RDF data files into many similarly sized files so that the load can be spread evenly across every slice in the cluster. Each available core will be responsible for simultaneously loading an individual file into its slice, so if the data files are unevenly sized, the biggest files may still be loading while most of the cores on the cluster sit idle.
- One method you can use to split up your files before benchmarking is to load them into AnzoGraph DB and then use the COPY command to copy the contents of the system out to multiple (optionally compressed) data files of similar size; when you benchmark data loading, you can use these files instead. RDF data compresses very well, so it makes sense to keep the data you want to load in a compressed format, since AnzoGraph DB can load compressed files directly too (see the minimal load sketch after this list).
- Do use the most recent version of AnzoGraph DB; we are making improvements all the time.
- Do please involve Cambridge Semantics if you can. We are very interested in understanding our customers' benchmarks and often use them to improve AnzoGraph DB's query performance.
- AnzoGraph DB includes a diagnostic tool called XRay that you can access from the web console and the command line. It can be used anywhere AnzoGraph DB is deployed to capture all the internal system state needed by our engineering team to understand every detail of how your parallel, distributed queries are executing.
- Note that an XRay file does not contain any of your data at all, so you need not worry about confidentiality when you send Cambridge Semantics an XRay to look at.
- Each XRay is an RDF data file that we then load into, you guessed it, AnzoGraph DB, to do query performance analytics using a front-end query tool that we call 'The Doctor'.
- After examining an XRay, the CSI engineering team may use the information to improve the performance of the database in future releases, and will often suggest an alternative way of writing your query that performs better immediately.
- Do run queries twice during benchmarking. The very first time each of your queries is run, it is converted to C++ code, compiled, distributed and only then executed, which can take a little time. The second time, the query is simply executed, and this is the run to measure. There is no need for multiple warm-up runs.
- Do use appropriate hardware. While individual users will achieve value using AnzoGraph DB on a laptop for some relatively small scale data integration and analytics, the system comes into its own on a server-class machine or better yet a cluster of server-class machines that have sufficient CPU cores and RAM.
- For clusters, use a minimum of 4 nodes (4x8, 4x16, 8x16, 4x32, etc.) to overcome network overhead with additional parallel processing capacity. A two-node cluster will generally perform more slowly than a single server: AnzoGraph DB needs to move data between nodes while running a query, and a 2-node system has only two hardware links, whereas a bigger cluster has more hardware links available and so is faster, because the amount of data moved is a function of the query's needs rather than of the hardware supporting it. It is usually better to opt for more nodes with fewer CPU cores than for fewer nodes, or a single node, with higher core counts (choose 4x16 cores over 1x64 cores).
- For clusters, ALWAYS match the hardware so that all cluster server nodes are identical in terms of CPU and RAM, keeping the cluster's compute capacity and the shards on each node in balance. You want to avoid having a single part of your cluster drag down the performance of the entire system, which is what can happen given the parallel nature of AnzoGraph DB.
- Do try your benchmarks using both persisted as well as in-memory configurations.
- AnzoGraph DB is a high performance parallel in-memory engine (like Apache Spark, but with high performance!), with optional persistence backing to disk.
- Persistence enables faster restarts of the database without reloading.
- However, if your problem does not need persistence (i.e. you can parallel load your data from scratch each time instead of from AnzoGraph DB's persistent store), turn persistence off for significantly faster load and insert/delete query performance. The performance of other kinds of queries is not affected by this setting.
- Do ensure cluster networking hardware provides inter-node bandwidth of at least 10Gbps.
- Networking hardware beyond that minimum can dramatically improve load and query performance. Anything less than the 10Gbps minimum may possibly function but is not supported by Cambridge Semantics.
- When you have configured an AnzoGraph DB cluster, you can run the network benchmark from the AnzoGraph DB web console to check whether your cluster's networking hardware really meets AnzoGraph DB's minimum requirements.
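The load sketch promised above: a standard SPARQL 1.1 Update LOAD statement that pulls one hypothetical, gzip-compressed file into a named graph. The file URL and graph name are illustrative assumptions, and AnzoGraph DB's bulk parallel-load options go beyond this standard form, so consult the documentation for the directory-oriented syntax it supports.

```sparql
# File URL and target graph are invented for this example.
LOAD <file:///data/sales/part-0001.ttl.gz> INTO GRAPH <urn:graph:raw-sales>
```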
What else will it be useful for me to know?
In addition to graph specific algorithms, AnzoGraph DB also has a growing list of data science algorithms that support in-graph feature engineering for Machine Learning.
Here is a list of what to expect (and perhaps test in your benchmarks):
- Very fast loading of large amounts of data
- AnzoGraph DB requires little or no tuning or configuration to achieve high performance; it's fast out of the box.
- Developers are not required to create and maintain partitions or decide how data is distributed across the cluster; it's handled automatically.
- There are no indexes to create and maintain; AnzoGraph DB handles all of that for you.
- Very fast in-graph complex queries for both ELT and various forms of analytics
- Testing ELT transformation query performance is important, as these queries offer the fastest and easiest way to clean, integrate and shape your data in-graph to create knowledge-graphs.
- Data warehouse operations like Advanced Grouping, Windowed Aggregates & Views
- Standards (SPARQL 1.1, RDF, RDFS-Plus inferencing)
- Labeled Property Graphs (RDF*, a proposed W3C standard)
- An extensive library of Excel-like functions and algorithms to support business analytics
- An expanding library of graph algorithms
- An expanding library of data science functions and algorithms
- C++ and Java SDKs with APIs that allow the implementation of custom UDFs (user-defined functions), UDAs (user-defined aggregates) and UDSs (user-defined services), enabling the creation of parallel connectors to third-party systems and services for loading and exporting data in parallel, as well as integrating remote processing in sub-queries.
To learn more, read our release notes and get AnzoGraph DB today.
Please note: The license agreement that accompanies AnzoGraph DB and governs your use of the software restricts the general publication of benchmark results for the AnzoGraph DB software without the written permission of Cambridge Semantics, Inc.