This document is a guide both for those applying industry-standard and other publicly available graph database benchmarks and for those devising new ones in the context of Cambridge Semantics' database offering. We are writing it because, when it comes to AnzoGraph DB, we are finding that graph database practitioners too frequently either apply the wrong yardsticks or misunderstand the results of their comparisons of AnzoGraph DB against other offerings. The hope is that this article helps interested parties appropriately position and correctly benchmark AnzoGraph DB.
Many organizations are beginning to explore graph database technologies because they offer the potential to revolutionize the way they create, aggregate, connect, and analyze even their most heterogeneous and complex data. As with any new technology, data practitioners are looking for ways to quickly familiarize themselves with the capabilities of commercial offerings from vendors in the category, as well as those of comparable open-source projects. This helps them decide which programs might provide the most appropriate solutions to their various business data problems.
One of the ways they do this is through comparative benchmarking of the different software options under consideration, attempting to determine the relative performance of different features in as close to an "apples versus apples" environment as can practically be achieved. In the case of database software, this usually means using hardware, data sets, and a query mix that are as identical as practically possible, carefully recording the times for each element under test for each program being evaluated, and finally comparing all the results side by side. In some cases, a comparison of relative costs may also be included as a criterion.
The test driving of graph databases has broadly been approached in the same manner as for their RDBMS brethren, as a quick Google search will reveal. Various organizations and even some vendors have published the results of graph-specific benchmarks (e.g. Graph 500, the Lehigh University Benchmark (LUBM) or SP2B) in order to educate themselves or the market on where the sweet spot for each specific graph database program lies and how they rank on various measures. In many cases, potential graph database users will also devise their own benchmarks consisting of datasets and queries representative of the business problem they need to solve. So the questions are: which benchmarks are most appropriate when evaluating software like AnzoGraph DB, and is there an easy way to try them? Read on for answers.
With AnzoGraph DB you are probably not comparing apples to apples
While general-purpose graph database technology has evolved over the last couple of decades, AnzoGraph DB represents a completely new and exciting departure for this technology. The reason is that, until now, all graph database technologies that we are aware of have been optimized around a transaction processing data access pattern, known in the industry as OLTP (Online Transaction Processing). In complete contrast, AnzoGraph DB is the first-of-its-kind enterprise-scale Graph Data Warehouse software. In other words, it specializes in OLAP (Online Analytical Processing). We call what it does GOLAP (Graph Online Analytical Processing), and you need to think about and benchmark it differently.
Analytics (OLAP) versus Transactions (OLTP) briefly
The difference between these two styles of complementary database systems and their target use cases is well understood in the world of relational database technologies. But until now this distinction has never been a factor in the selection of a graph database, because there was really only one form of graph technology available. Relational-style OLTP (e.g. MySQL or AWS Aurora) describes mission-critical, real-time transactional workloads. It enables processing in which the system responds immediately to user requests. Examples include a shopping cart on an online retail site, an EHR (Electronic Health Record) system, or the data manipulation that supports an ATM banking withdrawal. This pattern of access is usually characterized by extremely frequent and fast insert, update and delete read/write operations, where an individual record or a small number of records, or "nodes/vertices" in graph speak, are pinpointed quickly, along with their related records connected by "edges".
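As a rough, minimal sketch of what such a pinpoint, OLTP-style lookup looks like in SPARQL (the ex: prefix, the customer identifier and the property names are hypothetical placeholders, not a real schema):

    PREFIX ex: <http://example.org/>

    # Pinpoint one known node and the handful of records connected to it by edges
    SELECT ?account ?balance
    WHERE {
      ex:customer123 ex:holdsAccount ?account .   # a single, known customer node
      ?account       ex:balance      ?balance .   # its directly connected account records
    }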
Transactional-application queries are usually relatively simple compared to those performed by data warehouse technologies like Oracle's Exadata, Amazon Redshift, Snowflake, Teradata and now AnzoGraph DB! In contrast to transactional workloads, OLAP analytics access patterns contain many aggregations, groupings, filters and join operations that read in very large amounts of integrated data blazingly fast, allowing users to slice and dice that data to analyze it and provide forecasts, roll-up reports, and insights. Another difference is that OLAP systems are designed to very quickly load enormous amounts of data originating in other systems and then post-process it inside the database as needed using ELT (Extract [from data source], Load [into data warehouse], Transform [using queries to clean, reshape, integrate and connect up the data]) queries.
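By contrast, here is a minimal sketch of an OLAP-style SPARQL aggregation of the kind described above, combining joins, filters and groupings over the whole graph (again, the ex: vocabulary and the date range are purely illustrative assumptions):

    PREFIX ex:  <http://example.org/>
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

    # Roll up a year of sales by region and product category,
    # joining sales to stores, products and regions across the entire graph
    SELECT ?region ?category (SUM(?amount) AS ?totalSales) (COUNT(?sale) AS ?saleCount)
    WHERE {
      ?sale    ex:amount   ?amount ;
               ex:soldAt   ?store ;
               ex:product  ?product ;
               ex:date     ?date .
      ?store   ex:region   ?region .
      ?product ex:category ?category .
      FILTER (?date >= "2019-01-01"^^xsd:date && ?date < "2020-01-01"^^xsd:date)
    }
    GROUP BY ?region ?category
    ORDER BY DESC(?totalSales)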
Often, solving a business problem completely requires solutions that employ both types of database system. For example, in a supermarket context, the OLTP system would be responsible for recording an individual sale, decreasing a stock level, or facilitating a transfer of payment. Meanwhile, an OLAP system would be responsible for figuring out which coupons to issue the customer, or for providing management reports and analysis on aggregations of all the sales data from the entire supermarket chain for the day, month, year or even decade.
Does this mean we can finally scale up and perform complex queries across very large knowledge-graph data volumes?
Yes, it does. AnzoGraph DB can certainly scale and support sophisticated queries across very large amounts of integrated, connected data, described as a knowledge-graph, to answer enterprises' knottiest data questions. The software supports enormously long and complex queries; indeed, we have often seen query strings in the hundreds of kilobytes. These queries can touch the farthest reaches of enormous graphs and, through multiple hops, everything in between, requiring huge numbers of join operations, whether to pull back and compute on data or to evaluate the multitude of filters that shape the questions being asked across many dimensions at the same time.
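To make "multiple hops" concrete, here is a minimal sketch using a SPARQL property path, which lets a single query traverse a chain of edges of arbitrary length before filtering and aggregating (the supply-chain vocabulary is a hypothetical example, not part of any real data set):

    PREFIX ex: <http://example.org/>

    # Follow supplier-of-supplier chains to any depth from one product,
    # then group the suppliers reached by the region they are located in
    SELECT ?region (COUNT(DISTINCT ?supplier) AS ?suppliersInChain)
    WHERE {
      ex:productX ex:suppliedBy+ ?supplier .   # one or more hops along the chain
      ?supplier   ex:locatedIn   ?region .
      FILTER (?region != ex:domesticRegion)    # one of potentially many shaping filters
    }
    GROUP BY ?region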
It is important to understand that the design of AnzoGraph DB is fundamentally different from the design of all the graph stores you may have used in the past. AnzoGraph DB is a true MPP (massively parallel processing) compute engine designed for use on a multi-core commodity server in the cloud or data center, or on clusters of such servers. Unlike other MPP-style graph engines, it works by turning every query handed to it into a pipeline in which every step is executed in parallel by every processor in the cluster, each against a separate segment of the data, in exactly the same way that HPC (High-Performance Computing) supercomputers operate.
Does AnzoGraph DB support transactions?
Yes. AnzoGraph DB is ACID compliant; however, it is not designed or optimized for high-throughput small transactions. Use cases that require transactions can use AnzoGraph DB, unless they need high-throughput, sub-second response times for those transactions.
What about Graph algorithm support?
Most graph databases support the execution of analytics algorithms that are specific to a graph data representation. AnzoGraph DB is no exception, so in addition to being a high-performance GOLAP engine, it also supports a growing library of algorithms that you can execute in parallel against very large volumes of data.
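The invocation syntax for AnzoGraph DB's built-in algorithm library is specific to the product, so check the documentation for the exact form; as a flavour of in-database graph analytics, even plain SPARQL can express a simple measure such as in-degree centrality. The ex:follows vocabulary below is a hypothetical example:

    PREFIX ex: <http://example.org/>

    # A simple in-degree centrality measure expressed in plain SPARQL:
    # count incoming "follows" edges per node and rank the most-followed nodes
    SELECT ?person (COUNT(?follower) AS ?inDegree)
    WHERE {
      ?follower ex:follows ?person .
    }
    GROUP BY ?person
    ORDER BY DESC(?inDegree)
    LIMIT 10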
Do I have to have an enormous server or cluster of servers to run AnzoGraph DB?
No, you don't, unless you have a lot of data and/or complex queries. AnzoGraph DB will run on laptops and servers as small as a single- or dual-core CPU with 8GB of memory, and on such small machines modest datasets (in the tens of millions to one hundred million triples range) can load and execute OLAP-style queries just fine. However, the difference in performance between AnzoGraph DB and all the other graph stores will likely be less visible at these scales, because all of them will seem to do reasonably well at small data volumes and while the queries remain simple. Once the data volumes begin to increase (into the hundreds of millions to multiple billions of nodes and edges) or the query complexity rises, the difference in performance will quickly become clear. Obviously, you will likely need to scale up the available RAM and number of CPU cores to accommodate rising data volumes. The good news is that you can, because AnzoGraph DB can be scaled up from a laptop to a supercomputer-sized server farm depending on your needs.
So just how much data can AnzoGraph DB handle?
Way more than you'll need. Most graph stores scale up data volume and query throughput capacities by first increasing the size and power of the database server hardware (this is called vertical scaling) and then replicating the server horizontally, taking either a complete or partial copy of the data onto additional "replica" machines in order to service more query requests in parallel. For this reason, users often find that they quickly reach the point where it becomes impractical to add any more data to their database, since the total database volume is essentially limited by the storage and computational capacity of a single server machine. Copying that entire database to another server to provide capacity for additional simultaneous queries does not increase the overall amount of data that can be stored and queried. In our experience, most other graph databases have fairly modest upper limits for practical use, even on large hardware servers, and this should not be too surprising given they are not designed for OLAP.
With AnzoGraph DB's shared-nothing parallel architecture, horizontal scaling is implemented differently. When data volume (or indeed query throughput) demands start to rise beyond what a single Linux server can accommodate, AnzoGraph DB can be configured to use the combined resources of multiple commodity servers, and it will automatically shard out (the AnzoGraph DB term is "slice") the data across them evenly as data is loaded.
In this way, additional servers may be added to an AnzoGraph DB cluster as required to take advantage of its ability to scale up performance linearly. The largest AnzoGraph DB cluster created to date was assembled in 2016 with the help of Google's cloud team and comprised two hundred regular Intel-based 64-core servers linked together on the same network, creating a single logical database running on all 12,800 CPU cores simultaneously. Every query sent to the system was decomposed into many thousands of smaller steps and run in parallel by every CPU core over every slice of the complete data set, which contained over a trillion facts. This particular configuration shattered the previous record for the LUBM benchmark by doing in a couple of hours what had previously taken a high-end industry-standard relational data warehouse system many days. Given the multitude of software and hardware performance improvements made over the intervening years, we believe that today AnzoGraph DB would replicate the feat much faster on a cluster a fraction of that size!
So what does AnzoGraph DB do really well?
AnzoGraph DB is the first-of-its-kind graph data warehouse software. You should expect it to perform like existing data warehouse technologies, and those are what you should consider comparing it to when you benchmark it, while noting that at the same time it also delivers the extraordinarily flexible capabilities of a schemaless graph. This means that, unlike most graph databases, AnzoGraph DB is very well suited to facilitating highly scalable, iterative data integration and associated analytics. In fact, because it is schemaless and based on RDF triples, it solves many of the most difficult data integration problems that plague and limit the use of traditional RDBMS data warehouse and Hadoop Big Data technologies.
After your raw data is loaded in parallel from heterogeneous application systems and data lakes with ELT (Extract [from upstream raw data source], Load [into AnzoGraph DB], Transform [clean, connect and reshape your data into a graph]) ingestion queries, AnzoGraph DB can then be used to perform all your data integration work in-database. It does this using business rules implemented as a series of high-performance transformation queries. Transformation queries can take advantage of the fact that AnzoGraph DB is a multi-graph system, which provides a simple means to optionally keep the results of a transformation separate from all the other data sets the transformation query might have acted on, each stored in isolation in its own graph container but queried as configurable collections that form a single logical graph. This method of refining raw data enables extraordinary agility in data integration through fast iteration, especially compared to the laborious process of creating and debugging traditional ETL pipelines.
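A minimal sketch of this ELT pattern in standard SPARQL 1.1 Update, where raw data is loaded into a staging graph container and a transformation query writes cleaned, reshaped results into a separate curated container (the file location, graph URIs and ex: properties are illustrative assumptions):

    PREFIX ex: <http://example.org/>

    # Load: pull raw triples from an upstream source into a staging graph container
    LOAD <file:///data/raw_customers.ttl> INTO GRAPH <http://example.org/graphs/staging> ;

    # Transform: apply a simple business rule in-database and write the
    # cleaned, connected results into a separate curated graph container
    INSERT {
      GRAPH <http://example.org/graphs/curated> {
        ?customer a ex:Customer ;
                  ex:fullName ?name .
      }
    }
    WHERE {
      GRAPH <http://example.org/graphs/staging> {
        ?customer ex:rawName ?name .
        FILTER (?name != "")          # discard records failing a data-quality rule
      }
    }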
The process described above is quite unlike the ETL activities required to integrate and load most other graph databases, where it is usually necessary to understand all the data sources upfront in order to pre-create the graph structure outside of the database and then load it into a predefined, fixed target graph schema. AnzoGraph DB is far more flexible than graph databases that require upfront schema creation, since it can load all required data first without regard to its schema, and it is powerful and flexible enough to use queries to iteratively transform that data into the knowledge-graphs needed for analytics or to feed downstream tools.
AnzoGraph DB supports both batch and interactive queries in exactly the same way. Depending on the level of query complexity and the number of available CPU cores, interactive analytics against even enterprise-scale knowledge-graphs is perfectly feasible, with many queries returning in a fraction of a second. The multi-graph query support also has utility when it comes to managing access to data, since different elements of the graph (and even individual properties on nodes) can be stored in isolation in their own graph containers, with incoming application queries scoped to the composite logical graph formed by just the graph containers to which that user has access.
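A minimal sketch of that scoping in standard SPARQL, where FROM clauses assemble the composite logical graph from only the containers a given user is permitted to see (the graph URIs and ex: properties are hypothetical):

    PREFIX ex: <http://example.org/>

    # The logical graph for this query is just the two permitted containers;
    # sensitive properties held in other graph containers are simply not visible
    SELECT ?patient ?birthYear ?diagnosis
    FROM <http://example.org/graphs/demographics>
    FROM <http://example.org/graphs/diagnoses>
    WHERE {
      ?patient ex:birthYear ?birthYear .   # stored in the demographics container
      ?patient ex:diagnosis ?diagnosis .   # stored in the diagnoses container
    }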
What about AnzoGraph DB in the cloud?
Another place to think about AnzoGraph DB differently is cloud deployments and how to reduce the overall cost of data analytics operations. This is another dimension that data practitioners may often wish to benchmark when looking for a solution. While AnzoGraph DB will naturally operate 24x7 like other data warehouses, it was also designed to let our users take advantage of cloud computing's on-demand infrastructure and its accompanying pay-as-you-consume business model. For many, it is worth understanding how AnzoGraph DB can be quickly and automatically deployed and undeployed using cloud provider APIs and Kubernetes.
Depending on how fast the cloud provider data center can provision its servers in parallel, this can take as little as two or three minutes, or just a minute if Kubernetes hosts have been pre-provisioned. As mentioned earlier, AnzoGraph DB's parallel data loading is extremely fast; for reference, all the data for our two-hundred-node cluster benchmark in 2016 was fully loaded in 1740 seconds. After loading, ELT, analytics, and data extraction queries can be run as required, and then the server or cluster can be automatically destroyed, ending the charging period.
Since AnzoGraph DB scales linearly, it is often just as economical to use a larger cluster (thereby making more parallel processing available for loading and analytics) and achieve the same results in a much shorter time as it is to leave a smaller amount of cluster hardware running longer. Because the charges for cloud hardware are often linear, using more servers for a shorter period to achieve the same computing results costs about the same as using fewer servers for a longer period. In situations where an analytics problem has a time-critical dimension, the ability of the cloud to quickly provide enormous amounts of hardware on demand for parallel processing can offer a solution that is both easy and economical.
For detailed information on how to deploy AnzoGraph DB on various cloud providers and other platforms, please refer to the deployment guide.
I don't have all that much data and I don't want to create a multi-server cluster... is AnzoGraph DB still something I should try?
Yes, definitely. We have seen significant value achieved with even small systems where there is not much data but the query complexity is high (e.g. analytical queries) or fast load speeds are required. Obviously, hardware with a higher number of CPU cores and hyperthreading enabled is going to perform better, because AnzoGraph DB is a parallel database. Many people run AnzoGraph DB on their laptops using Docker if they are not running Linux natively, and sometimes even under WSL on Windows 10, although the latter is not yet officially supported. As additional data science functions are added in support of feature engineering, along with support for integration with popular data science software packages like Zeppelin, Jupyter, R and Pandas, users may find AnzoGraph DB extremely useful when integrating data from multiple sources, including unstructured data, into a knowledge-graph in order to support advanced data analytics and machine learning activities.
General benchmarking advice
Benchmarks are good for a general comparison of common features. However, if you are doing an evaluation of different graph databases, try to use as much real-world business data, and as many queries designed to solve your specific business problems, as possible for the most useful results. You should also define what success means for each of the tests you create. How much data will need to be loaded and queried? Are there SLAs for the number of simultaneous queries and for response times? How difficult is it to configure and tune the database to get the results you need consistently, and as data volumes rise? In our experience, these real-world results will be much more relevant and useful in a readout than a comparison based on some algorithm that might never be used; likewise, if one database completes a query a few seconds faster but is extremely complex to query, configure or tune, that speed might not be what is really important to your organization.
AnzoGraph DB Benchmark do's and don'ts
What else will it be useful for me to know?
In addition to graph specific algorithms, AnzoGraph DB also has a growing list of data science algorithms that support in-graph feature engineering for Machine Learning.
Here is a list of what to expect (and perhaps test in your benchmarks):
To learn more, read our release notes and get AnzoGraph DB today.