The Future of Data Management, Collective Intelligence & the Wisdom of Knowledge Graphs

Posted by Boris Shalumov on Oct 14, 2021 7:15:00 AM

In this blog post you will find the documented conversation between me, Sr. Knowledge Graph Engineer Boris Shalumov, and Knowledge Graph Expert and CTO Sean Martin. Below you will find links to the full conversation on your favorite podcast channels and the full written transcript.

To better help you sort through the transcript, here are the topics covered:

  • Crossing the chasm in technology adoption
  • The data integration business case
  • 50% of Data Warehousing Initiatives Fail
  • We Built the Entire Graph Stack
  • Graph Scalability is Here
  • You don't need to know the questions in advance
  • How do Knowledge Graphs fit into Enterprise IT Landscapes
  • The power of GOLAP
  • What do Knowledge graphs mean for the IT Department?
  • The path to enterprise collective intelligence
  • The future is already here, it's just not evenly distributed

Find Your Channel

YouTube: https://bit.ly/2Wf5yxS
Spotify: https://spoti.fi/3o8hxZq
Google Podcasts: https://bit.ly/3ufCrXU
Apple Podcasts: https://apple.co/3EOBSJj

 

Podcast Transcript

 

Boris Shalumov: 

Welcome to Chaos Orchestra episode number 10 - “The Future of Enterprise IT”.

We live in a world where Google is able to understand the context of our questions and, within milliseconds, find the sources with the right results from billions. Where Facebook has more accurate cognitive models of individuals and of society than the social science departments of the best universities in the world, and is thus able to present us a very tailored perspective on the world. Where Amazon knows our needs and desires better than most of our friends and is able to adjust its recommendations to us every day. And then, as if in a parallel world, there are the companies from the life science and healthcare, automotive, and financial industries that have more or less accepted that it is impossible to integrate all their data, to make all of the company's knowledge transparent and accessible, and to build sustainable and flexible solutions rather than an application for each problem.

This is something that sometimes scares us, yet it's amazing how easy it seems for the big five and how much knowledge they're actually able to monetize. Today's guest is Sean Martin, co-founder and CTO of Cambridge Semantics. With Sean's help, I hope to understand why, despite knowledge graph technologies, natural language processing, and machine learning algorithms being more accessible than ever, companies are still struggling to enable collective intelligence, and what he thinks about how far we are from Google-like search within enterprises over all data, structured and unstructured.

 

Crossing the chasm in technology adoption

 

Sean Martin: There are actually quite a few reasons, but they boil down to a question of focus and having the resources to attack the problem. Integrating data at scale, at any scale really, is generally very expensive because it's fiddly and it takes a lot of attention to detail. So you need people with the right skills. But it's also very exacting; it's very easy to get it wrong and to make mistakes. Then applying more advanced techniques in search, machine learning, and so on also takes enormous amounts of skill, and all of that is extremely costly.

So if you've got a focused business like Google, it's really your job to produce a very good search result when somebody enters something. Or Facebook, which is really attempting to create a massive amount of engagement, to have you spend as much time as possible on their system, because in both cases they're making money out of advertising. They are able to bring to bear huge amounts of resources, in terms of buying the best skills, paying for enormous amounts of R&D, and doing an awful lot of testing at enormous volumes too. They're attacking a problem that they have a very large number of ways of testing, with many, many millions of people using their services, if not billions. That gives them a way to hone what they're doing, whereas other organizations don't get that kind of scale.

And so when you look at what enterprises are doing, they can truly only afford a very small fraction of that kind of investment for their problems. And they're not just tackling one problem. They've got dozens and dozens and dozens of problems that they need answered to pursue their business, and answering them often isn't the core business, unlike Google delivering a good search with a lot of context. If I'm making a motor vehicle or a chemical or a drug, for example, that's what my business is and everything else is sort of ancillary to that.

So the investment tends to follow where the products are, even if it was possible to put in the kind of resources that Facebook or Google or one of the others you mentioned can put in.

So for me, the question is how does one attack this?

Knowing that enterprises have a limit, even if they are very, very profitable enterprises. They are still limited in what they can do in terms of hiring people to do this particular job, which again isn't necessarily right in the core of their business, but an adjunct to it. And so what can we do, given what they want to do, to help them bring down the cost of data integrations to a point where they can do more of them than they can at the moment? Right now, they're really very limited.

 

The data integration business case

 

Sean Martin:

Only the most valuable problems get addressed, the ones that have business cases. Either they're saving money and keeping themselves out of legal trouble (if you're in the banking industry, you know: massive fines if you fall afoul of regulation; same in the pharmaceutical industry), or it's something that can bring down costs. Or in some cases, maybe it's a new business which is data driven, and businesses are starting to get into that, where there's actually a profit motive. But essentially the business case has to be pretty valuable, because out of all the data integration problems that they could potentially tackle that might be useful, they have to decide which ones really, in the end, get done.

 

And then there's a second problem, which is this extremely high rate of failure in enterprises. Google, Facebook, and so on have employed legions of PhDs, quite literally, the people who were at those academic institutes until a year or two ago. They were the leading researchers on machine learning or this or that or the next thing, and they hire these guys. And the problem they give them to solve, more often than not, is not that hard. And not surprisingly, they hit it out of the park. All the time! Simply because you're applying very clever people to the problem and giving them sufficient resources to tackle it.

I’d guess there's plenty of room for failure within the margins of the amount of money those businesses are able to bring in through advertising. This is not the case with most enterprises. 

 

50% of Data Warehousing Initiatives Fail

Now, I've read this statistic somewhere and I have no particular reason to doubt it. I don't know if it's really as high as this or not, but it's still high. Apparently, 50 percent of data warehousing initiatives fail for one reason or another, and there are a whole lot of reasons why they can go wrong. But in general, it's a question of matching two elements. One is the people part of the problem: having the right skills and the right discipline and the right processes and so on. And then there's the technology, the choice of technology. These two things have to go together and work perfectly to avoid a failure now or a failure over time. In general, these systems need to be long lived if you're putting significant investments into them.

So anyway, I think the problem really is that integration is very hard. It is very exacting and it takes a lot of resources to bring to bear on it. The technology, sort of the most recent technology, is relatively expensive to deliver. It's always been quite expensive too, to deliver that integration. And these other big companies are just highly focused and get massive rewards from getting it right. And it's partly that they've picked their battles, right? There's another thing. 

The reason Amazon knows more about us than our friends do is because I don't spend my time telling my friends I want this, or I want that, or I want the next thing. But do I add them to my list on Amazon? Yes, I do. I'm giving them the information. Whereas very little of that is probably shared with anybody else in my life, and I'm sure that's a common pattern.

Similarly, Facebook is recording all of your interactions with everybody else. You're actually giving them that information in a way that's nice and easy for them to digest. They picked the interface. They can capture it in a way that's very clean to them, and they've got plenty of time to get it right. Google is interesting in that they've always got a human in front of the screen who's responding to what's searched for. So why do they always seem to get the context right, even when they don't interpret what I'm asking correctly the first time? They've actually got me sitting there, ready to adjust my search if I didn't like the response, and sometimes they'll give you two or three choices. If you put in an address, you may see some search results, you may see a little map thing. You've got to kind of take the next action and help them narrow that context down. But in every case, they've picked the battlefield and they've thrown enormous resources at the problems, and they're being very successful doing so. I think if any business could attack a problem of their choosing with the same kind of resources, they'd have the same kind of success.

But that's the problem. There just aren't those kind of resources. There aren't those kind of skills available as widely.

And so the question for me is how do we try and level the playing field as best we can? 

So long, long answer. And there's many pieces to unpack within that, but I don't even think I've covered it all. But it's really not a fair competition. That's what I would say.

 

Boris Shalumov: Yeah, agreed, the companies you mentioned have a lot of research resources, but aren’t we at a point-in-time where all these technologies are so accessible, that you don't need to be a pioneer in order to make use of them?

Sean Martin: Well, historically, the most successful data integration technology has been the data warehouse and the various relational variants of that. And we're going back to the 70s, when Codd was writing his papers and Larry Ellison was reading them and starting Oracle, and IBM ended up with DB2, and so on. And then we had multiple generations of data warehousing systems and of ETL systems for gathering data up, like Informatica. We had the cubes and the BI revolution and so on. So there have been successive generations that have built on that original relational base, and it has been very successful.

However, it also had a lot of issues over the years. As I said before, there's a fairly high failure rate. Maybe you didn't get the right system, or it wasn't scalable enough, or you didn't budget properly for it, or you just weren't able to secure the skills that you needed to design the models, or the discipline you needed to be able to do these processes day in, day out and do them cleanly. Whatever those combinations of things are that you had to get right. And that's many years' worth of building on these. The ecosystem that started with relational technology has been the most successful to date.

Those are just the facts. But people are not that happy and comfortable with it, because of all the expense and the difficulties associated with it. So they're constantly looking for easier ways to do things. And I think the whole big data revolution we had over the last 15 or so years was, in many ways, a reaction against some of the constraints of the relational world. But the big data revolution never really delivered a better way of integrating data. And you see a return to people looking at relational and data warehousing, as if they ever went away.

When we started Cambridge Semantics about 14 years ago, I remember being warned by people, and I even heard it at conferences, not to use the words data warehouse. They were dirty words within the enterprise. And if you were proposing a project which was a data warehouse, you'd have to call it something else, otherwise the business would jump away from you. And so people tried to get into big data. And there were a bunch of vendors who tried to make that an alternative way of integrating data, where we went from schema on write to schema on read, and what we ended up with, not unpredictably, was the data swamps and the chaos and so on that followed.

And not only that, those big data technologies were, for the most part, invented by some of the companies you mentioned at the start, for a very specialized purpose. They were for log processing, not for generalized data integration where you've got a lot of heterogeneous data that you're trying to tie together. The big data technologies, the Apache technologies, weren't great for pulling data together.

They were good for aggregating a lot of data, but not really for pulling it together into a coherent whole. What has also happened over the same period, and perhaps over the last 20 years, is people have been learning about graph technologies, and how graph technology starts with a different kind of database from the principles on which relational was built. The problem has been to scale that technology to a point where it could deal with meaningful amounts of data, practical amounts of data. Not only that, the database is just the bottom layer. It's where data is stored. You've still got to get data into this database, into a graph form. And then you need tools and middleware that understand data which is no longer in tabular form. So since the 1970s you've had an environment all built around tabular representations, and big data was, for the most part, tabular too, except for the processing of text.

But let's put that all to the side. In the end, it was things like Hive and Hadoop for dealing with an enterprise's tabular structures, where they were trying to avoid licensing fees from some of the big vendors, but also it was fashionable and people wanted to kick the tires. So we've had tabular structures for most of our structured data since the 70s, whether in the big data environments or the relational ones, and we've had software that understands how to drive those things, also built on tabular technologies, and relatively poor on metadata, too.

Metadata was always kept on the side in spreadsheets. But often there were practical limits to how much you could cram into a set of linked-up tables. So the graph technologies have had to climb that whole hill, to build their whole ecosystem, to essentially deliver the same kinds of things, but with graph. Starting with really no software that understood graph, from the point of view of the database, the middleware, the user interface tools, or the ETL tools that actually populate your graphs.

All of that didn't really exist, and it was always a second best. So not only were people trying to understand what graph was and how to use it, but they were always working with mainly broken software or very limited software. And the thing didn't scale. So there were a lot of problems with it. We didn't have middleware that was graph aware or visualization that was graph aware except in patches, just occasionally you'd see something. 

 

We Built the Entire Graph Stack

 

But it would always be just a part of the problem, and you'd never see a comprehensive solution that tied all the pieces you need together in one coherent way, the way you can when assembling a stack on top of a relational database today. That's pretty easy: you pick your database, you pick your middleware, you pick your BI tool, and they're all going to work together because they were designed to work together. That was not true of the graph world when we started looking at it.

My colleagues, our competitors, and I have essentially had to build every part of that stack since then, the entire ecosystem. And really, what has happened in the last few years is that that stack has now reached, and in some cases is going beyond, what you can do with the traditional stack. But of course, as customers, you've got to make these two worlds coexist. They've invested in this database or that BI tool and they've got skilled users who understand those. There's a lot of trust in those tools, because they've been using them successfully for a long time.

 

Graph Scalability is Here

 

So the graph people have always had to figure out how to retrofit themselves into this world. But the news I think of the last couple of years is, it's getting there. Graphs can be scaled now. We see a number of companies who are building very big graphs. 

Even data virtualization with graphs has reached a similar level of maturity as relational technologies have. Our customers are already using it for enterprise-scale data volumes.

The middleware essentially packages it with a lot of best practices that were not there before. Previously, users, more likely developers and researchers, would have to invent those best practices themselves, figure out what works, and go down a lot of blind alleys. All that experience has now been packaged into systems, into platforms. And there are a number of graph platforms out there, and they are starting to make it into the enterprise. I mean, what middleware essentially does is abstract and make easier to use that on which it is sitting.

Now we have good middleware and good tools for graph. We've got good ETL tools for graph, middleware for graph, good BI tools for graph, and we've got good connectors to the more traditional world. Now finally, we're getting to the point where graph can actually come into its own and play on a more level playing field and deliver far more value as it continues to get improved.

Simply, the reason people have invested in graph is they saw its potential for integrating complex, heterogeneous data alongside tabular and semi-structured data. Tabular data is considered to be simple data; semi-structured data is things like XML and maybe what you've got in Parquet and JSON and those kinds of systems; and then there's complex data, which you extract from textual sources, and video, and audio, and so on.

And actually, there's been a quiet revolution, or maybe even not so quiet revolution on that side, too. When we were first looking at integrating information extracted from text, we were essentially doing partnerships with companies that specialized in textual extracts, and each extractor was highly specialized. In the medical field there were far more of them than in the manufacturing field. Also, obviously, the government had invested a lot for various reasons in extracting data out of text. So, you found very specialized tools that were either statistically-based, or dictionaries, or whatever their secret sauce was. But in the last 10 years, we've now seen machine learning being applied very successfully and potentially much more successfully to that. So it can be done by people who've got data science training. You don't need a specialist vendor anymore to make some highly specialized extractor like you would have 10 years ago. 

Ten years ago, you relied on vendors to build them, and there was enough of a market there for those things to emerge. Now, you can go ahead and just use many of the machine learning frameworks to train your own models to extract whatever you want. So the ability to pull signal out of complex data has enormously improved over the last 10 years. But again, it's not something you get for free. You've got to invest in it.
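To make that concrete, here is a minimal, hypothetical sketch of the kind of thing described above: using a general-purpose, pretrained NLP pipeline (spaCy here, purely as an illustration; it is not a tool named in the conversation) to pull entities out of free text and land them as graph triples with rdflib. The namespace, the example sentence, and the drug name are all made up.

```python
# A minimal sketch (not the specific tooling discussed above): pulling entities
# out of free text with an off-the-shelf NLP pipeline and turning them into
# graph triples. Assumes spaCy and its small English model are installed
# (pip install spacy rdflib && python -m spacy download en_core_web_sm).
import spacy
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/")          # hypothetical namespace
nlp = spacy.load("en_core_web_sm")             # pretrained NER model

text = "Acme Pharma is running a Phase III trial of Zyloprex in Boston."
doc = nlp(text)

g = Graph()
for ent in doc.ents:                           # each named entity becomes a node
    node = EX[ent.text.replace(" ", "_")]
    g.add((node, RDF.type, EX[ent.label_]))    # e.g. ex:ORG, ex:GPE
    g.add((node, EX.mentionedIn, Literal(text)))

print(g.serialize(format="turtle"))
```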

 

You don’t need to know the questions in advance

 

For some of these, you've got to pay a couple of data scientists or a team to build that, and you've got to be ready to invest in the iterations needed to get it together. But the point is you can build it. And so from our point of view, as someone who consumes that kind of data, it's just getting better and better, because you've got the ability to actually pull signal out of text and bring it out in a coherent way that we can then combine with those other forms of less complex data to make it increasingly relevant.

Perhaps your listeners and readers don't know about this, but from my point of view, graph structures, and particularly schemaless ones that can be grown, are absolutely perfect for pulling data out of unstructured sources where you really don't know what kinds of things you're going to find in there. We use OWL as a modeling language, but this applies to any system where you can essentially grow the schema dynamically as data enters it. And when I say grow, I mean either you grow the schema or you're schemaless and you are essentially modeling the data with data, more data. You're constantly discovering new entity types, and you've got to end up with a model that can expand to accept those new entity types with their attributes and so on. And so the form of knowledge graph that our company uses is very good for exactly that. And that has also been, I think, true of some of our competition.
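As an illustration of "growing the schema as data enters it", here is a minimal sketch in RDF using rdflib. All of the class and property names are hypothetical, not the model Sean's company uses; the point is only that a newly discovered entity type is just more triples, with no migration of what was already there.

```python
# A minimal sketch of "modeling the data with data": in RDF you can add a
# previously unseen entity type and its attributes without a schema migration.
# Every name below (ex:ClinicalSite, ex:enrollmentCap, ...) is hypothetical.
from rdflib import Graph, Literal, Namespace, RDF, RDFS

EX = Namespace("http://example.org/")
g = Graph()

# Data we already had: a drug entity.
g.add((EX.Zyloprex, RDF.type, EX.Drug))

# A new entity type discovered in unstructured sources -- just add it.
g.add((EX.ClinicalSite, RDF.type, RDFS.Class))         # the model grows...
g.add((EX.BostonSite, RDF.type, EX.ClinicalSite))      # ...as the data arrives
g.add((EX.BostonSite, EX.enrollmentCap, Literal(250)))
g.add((EX.BostonSite, EX.testsDrug, EX.Zyloprex))      # link to existing data

# Nothing that queried drugs before is broken; the graph simply has more in it.
print(len(g), "triples")
```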

So graph is definitely now showing promise in terms of reducing the skills you need to actually start adopting it. And that means you're reducing the costs, and the time, and the risk. And graph is increasingly capable of addressing more and more of the practical problems you have, because it can address the scale. Problems at a scale that you previously were only able to tackle with well-tuned relational databases, you're now able to tackle with graph.

So I would say the landscape is changing in favor of the knowledge graph, as people are realizing that it's possible, that there are a lot of tools to help them, and that they're going to get results with less risk, and more quickly, than they previously could. And they're going to be able to deliver solutions that in many cases they would not have been able to deliver any other way; they would just not have had an approach to delivering them. So it's actually a pretty exciting time as we watch this technology start to flow out into the market. Obviously it's still an early adopter market.

You have to be knowledge graph interested. But for those that are, the cost of getting into it is just constantly dropping every year. It just gets cheaper and cheaper to do. And I think that's going to continue. 

So another long answer for you there.

 

Boris Shalumov: Yeah, and I think one of the most critical requirements is actually being able to answer questions that you were not prepared to answer. So not building systems that are tailored to one specific use case, but having the kind of flexibility that allows you to build sustainable, long-term systems. But one thing that you mentioned is that the companies that really exploit graph technologies, like the Big Five, for example, are always playing in an area with massive investments and massive numbers of people. So how do enterprises deal with that? It's not the core of their business model to sell this knowledge. How do they close that gap?

 

Sean Martin: First of all, I agree with you.

There are different dimensions to this, but I firmly believe that the data integration enabled by graph is a killer application, now that it scales.

And it's more than that. It's the nature of the data integration that makes it so interesting. Which kind of goes to the second thing you said, which is people need questions to be answered, right? And one of the shortcomings of the previous generation of data integration, the data warehouse relational approach, is you kind of had to know the questions you wanted to ask up front so you could design a system specifically about delivering the answers. And that was difficult unless you were doing something which was very obvious. But business moves very quickly and those technologies would often take a long time to implement.

And by the time you had it working, the business had often moved on or had given up on getting particular questions to be answered or they weren't relevant anymore. So one of the big complaints about the previous warehouse style technologies is that it was too slow, too inflexible. If you had a second order question, one that you hadn't quite anticipated, maybe based on a question that you had anticipated that was being answered by the system because it was designed to do so (it had a schema, it had the data, it had everything you needed), you kind of had a dead stop there until the system could be adjusted through some new version. 

We had banks that were adjusting their systems once every six months, you would get a new release of the warehouse. You know, it just didn't work. So how do you get your questions answered much more quickly? If you think about that previous paradigm, which was really about, I'm going to know what questions people want to ask. I'm going to prepare the schema and figure out what data they need and put that data into that schema. Now you can answer those questions, run reports, run BI. The BI revolution was a way to begin to address the problem of inflexibility, because you can now essentially get an extract and then use a much more agile tool. A Spotfire, or Tableau, or QlikView. One of these tools could at least give you a way of allowing end users to answer the questions.

But they're always doing it on a narrow extract of data coming from somewhere upstream. And quite often your question would fall off the edge of what you'd actually extracted. They were not very good at complex data where you've got lots of different types of things in relationships to each other. They tended to be again more tabular, but at least gave you the beginnings of an answer to how to make this more flexible. That was kind of where the BI revolution came from. 

What graph promises is to turn that paradigm a little bit on its head, because what we're able to do there is not think about what questions you want to answer. I mean, you do in broad strokes, because we're going to answer questions about drugs, or this, or that, or the next thing. You have to know your domain of questions. But because graph technology is so flexible in terms of being able to ingest data of really any kind of schema and model, you just ingest it all, as opposed to thinking very carefully in advance: which questions am I going to answer? What data do I need? And then going through the painful process of setting up the ETL programs or whatever your mechanism is for populating your database. All that was very expensive and painful.

Whereas now we just say: we need all this data related to the things we are going to be answering questions about. We load it all up and we figure out how to knit it together. Unlike in relational, where you start to worry about keys, and joins, and you've got three levels of models (your physical model, your logical model, etc.), all of which is painstakingly knitted together, where any change would be an expensive, breaking thing to have to fix in the applications, and the SQL queries against that data were also brittle any time you wanted to add in more data, for example, if you were missing something. We don't have that problem with graphs, because we simply map all the data available into a graph and we build a comprehensive knowledge graph of everything in the domain. Then the questions you ask are the thing you do after you've got that all working. We've built tools to allow you to formulate those questions, and they will traverse the data in any direction to look at even very complex combinations of things.
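A small, hypothetical sketch of what "map all the data available into a graph" can look like in practice: rows from two siloed source tables become triples, and the join is just a relationship between URIs rather than a foreign-key design baked into a schema. The table layouts, column names, and URIs are invented for illustration.

```python
# A minimal sketch of mapping rows from two siloed tables straight into one
# graph. The "join" is simply a relationship between URIs, not a key design.
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/")
g = Graph()

# Rows as they might arrive from two different source systems (hypothetical).
drugs = [{"drug_id": "D1", "name": "Zyloprex"}]
trials = [{"trial_id": "T7", "drug_ref": "D1", "phase": 3}]

for row in drugs:
    d = EX["drug/" + row["drug_id"]]
    g.add((d, RDF.type, EX.Drug))
    g.add((d, EX.name, Literal(row["name"])))

for row in trials:
    t = EX["trial/" + row["trial_id"]]
    g.add((t, RDF.type, EX.Trial))
    g.add((t, EX.phase, Literal(row["phase"])))
    g.add((t, EX.evaluates, EX["drug/" + row["drug_ref"]]))  # the link is the join

# Later data sources can attach to either entity without breaking anything.
```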

I remember when we first saw this starting to emerge, when the technology first started. I was working in life sciences. They were an early user of knowledge graphs, and they were looking at systems to do something which was very important to them, very expensive, with huge implications for cost. So it's one of the areas they were prepared to invest in, and they were interested in trial site selection, where you're essentially trying to decide where in the world you are going to run your clinical trial. And there are many dimensions to this: What's the drug? Where is the population of people who have this problem? Can I find medically qualified people who know this disease, maybe who know how to administer this?

Drugs are administered in different ways. Have we got people qualified to do that? Are my competitors in the same place? What other trials are being run? That may mean that I can't get a population together to do my trial. Many, many dimensions of things. The way people were doing it previously was by painstakingly looking at each element of that separately, whereas we were able to say, okay, well, let's just bring all of the data that concerns us into one place.

Now you just start asking your questions and add as many dimensions as you need to filter them. We'll add those as we go and we can kind of spin the question on a dime and change the question. But because the data is already there, we haven't got this long lag. And if you've got somebody who knows what they're doing sitting in front of the screen who knows the domain, they're very quickly able to direct their questions in. If they then find they're missing some data, it's not this huge hike and a hugely disruptive thing where everything is going to break to bring in some additional data. You're simply attaching it into the graph. You find a place where the new data conceptually links through some relationship to the data that you've already got. Nothing that you've already got in terms of reporting breaks. But now you've got this extra dimension you can bring into reporting if you want it. So answering the questions has become much easier. 
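To illustrate that trial-site style of question, here is the shape of a multi-dimensional query you might run once everything is in one graph. The model (ex:Site, ex:patientPopulation, ex:hasSpecialist, and the rest) is entirely hypothetical, not any customer's actual data; the point is that adding another dimension is another triple pattern or FILTER, not a schema change.

```python
# A minimal sketch of asking a multi-dimensional question against one graph.
# Vocabulary and the pre-built graph file are hypothetical.
from rdflib import Graph

g = Graph()
g.parse("trial_site_graph.ttl", format="turtle")   # assumed pre-built graph file

query = """
PREFIX ex: <http://example.org/>
SELECT ?site ?population WHERE {
    ?site a ex:Site ;
          ex:patientPopulation ?population ;
          ex:hasSpecialist ?doc .
    ?doc  ex:qualifiedFor ex:Zyloprex .
    FILTER (?population > 500)
    # Need another dimension (competing trials, logistics, ...)?
    # Add another pattern here instead of redesigning a schema.
}
ORDER BY DESC(?population)
"""
for row in g.query(query):
    print(row.site, row.population)
```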

There's another benefit of the approach, which is that once you've got data into this form, it'll often answer adjacent questions and adjacent solutions. So a second solution or third solution can simply be more data coming into the graph to answer more questions in adjacent problems. So you get this huge sort of reusability of data once you've established a knowledge graph with your data.

The reason why this is important is that it tackles the problems of cost and skills, if your skilled people are able to deliver more data products more quickly. A data product is some data where you've done the work to pull it together, clean it up, and deliver it. The time and skills that humans put into getting data together and making it coherent are expensive. It's one of the barriers.

 

Increasing Data Product Reusability

If you have to start every project from scratch, it's an expense, and there's only a limited number you can do. But you can make data products more quickly if you're getting more reusability out of the sources of data and not having to redo large amounts of work, which is what happens when you're lifting up all the data rather than just picking and choosing a little bit here and a little bit there to answer specific questions. You're taking all the data and you clean it en masse. And now it's reusable for many different data products. So the data product is essentially the valuable thing the IT business group will deliver to the business. That is a combination of data that maps onto a solution that has value to the business.

So how do we increase the number of data products you can deliver reliably and repeatedly? 

This is really how we're starting to think about this. For people who want to get started, the promise of being able to do that, and quickly, is exciting. The technology is now mature enough, and it is available, to start to build a data fabric, or a knowledge graph, or a data fabric built on a knowledge graph, that knits data together in a coherent way and makes additional data products easier and more interesting. We are seeing this play out in our customers. Instead of addressing the onesies and twosies solutions, which is where we started, we're now seeing the initial use case get started and, often in a matter of weeks, you can actually see how this is going to work and get value. That's quite common, given the approach doesn't require a massive six-month schema design, relational entity relationship diagrams, and all that stuff.

Instead, you can get started in weeks. Lift up the data you've got, knit it together as quickly as you can, and then start to deliver data products. Data products being queries in any direction against that data. If you're missing data for a second project or a third project, just add that data in. Now it's reusable. So I think we're changing the face of that integration. I think it's early days. I think it's early adopters at these customers. But it's the early adopters who see the promise of some advantage that they can gain by embracing a new technology. And it's very typical in product adoption that the people who see it first are those who are ready to take a risk, because they're looking for some disproportionate advantage and those are the people that we're dealing with, and there's enough to keep us plenty busy. Some very big organizations are making some very big bets now. But I think that's where we stand.

 

How do Knowledge Graphs fit into Enterprise IT Landscapes

 

Boris Shalumov: Can you paint a mental picture of the future Enterprise IT architecture that would enable these things, and - how do knowledge graphs fit into the existing landscape?

 

Sean Martin: Well, any business or organization you deal with has a spectrum, a risk tolerance, and so on. They have an existing landscape of what they've invested in already, both skills and systems and licenses. So we're coming in from the outside, looking at that, and looking for the most rational entry point into a business like that. And generally, it's not a good idea to go in somewhere and say, "we're going to throw out everything you've done before. Your mainframes can go. We don't need this. This technology is much better." People aren't ready for that. They want to learn. Plus, they've got all this investment and they've got a business to run. Generally, for most people, this is not their first order of concern.

So for us, the most reasonable place to start is essentially positioning the knowledge graph as an overlay technology. It can sit on top of whatever you've got already and knit the data that comes from all those different siloed systems into a coherent, connected graph: a knowledge graph that you can create data products out of. That seems to be working very well because it's low risk; if it doesn't work out, in the grand scheme of things you didn't break anything that was there and nothing needed to be replaced. I think in the longer run graph technology is likely to replace other systems and warehouses and so on. If you've got a graph database that is essentially a warehouse, then there's no reason, in the long run, why that couldn't happen. But those systems tend to be run by the most conservative part of the business. They're going to want to see a long track record of all of this working before they're ready to replace their data warehouse with graph data warehouse technology.

What's more likely to happen, is the big data warehouse vendors are going to simply bring graph into their products and extend those products' lives that way.

So in the short run (and we could be talking 10 years), the data fabric that relies on knowledge graph technology is the most reasonable way to enter the enterprise. It addresses the problems that people have integrating data and desiloing data.

We've had 10 years of being told data is the new oil, but the problem is that the oil is at the bottom of the Atlantic, or the North Sea, or somewhere inaccessible up in Alaska. That's the equivalent of it.

So to free it up requires some amount of investment, and it's expensive and tricky. It's this technology that can start to free up all those data assets, so that they can be turned into value for the business. That positioning works, and we're very quickly showing that if you've got a business use case that is valuable and you know where the silos are, then a knowledge graph can be positioned over the top of them. But it's important to build on open standards, because what you really don't want to do is build a new silo, which is something large enterprises are constantly doing with data warehouses. They need a data warehouse to aggregate this data, and then four years later that data warehouse has become its own silo. By using open technologies, we're hopefully building these aggregate views of data, sometimes, if we use virtualization, without even moving the data. At the same time, by using standards, you're making sure that the data is neutral to vendors and is future proof.

In 10 years time, the application may have gone, but the data will still be there and still be readable and understandable simply because of the open data standards we're using.

 

The power of GOLAP

 

Boris Shalumov: Can you explain the term GOLAP? Because I feel like something has shifted in recent years that made the semantic technology stack way more mature. Maybe, if you put yourself in the shoes of a strategic IT executive or a CIO: how might it lower the risk of adopting semantic technologies? So, what exactly changed in the last few years?

 

Sean Martin: GOLAP became viable. This is a really big change. That was led by Barry Zane of data warehouse fame; he was one of the leading technical guys behind Netezza. He founded ParAccel, which happens to be the code base on which Amazon Redshift is built too. Both of those are relational warehouse systems. By OLAP, I mean online analytical processing. GOLAP is graph online analytical processing. In other words, a data warehouse, but built as a graph. Barry built the first one of these at a company he founded that spun out of ParAccel called SPARQL City. In recent years there have been many graph vendors out there, a couple of which are kind of GOLAP capable.

I would contrast GOLAP with GOLTP, which is where most graph databases have their design point. A transaction processing system is a system that captures data through point-of-sale, or it's what's behind a website. It's essentially small transactions, lots of users, very fast, touching a couple of records: insert, update, delete, and so on. That's transaction processing (TP), and OLTP systems are sort of the architectural design point for most graph systems. It certainly was when I was getting into graph and we were building middleware, tools, and so on.

We were using mainly other people's graph databases, and they're all very quick to update an individual instance of a class or to find an instance of a class, because they have indexes every which way. What they were very poor at were multi-hop joins where you touch enormous amounts of data, the equivalent of entire table scans in a relational database. There's a very different design point for that. In the case of Barry's system, it's basically an MPP system. It's a shared-nothing architecture, very similar to things like Netezza, Redshift, Snowflake, Teradata, and so on.

These are big MPP systems where you cluster your software and nothing is shared. The query runs in parallel across the cluster and the results are aggregated and returned. None of the graph databases did that prior to Barry doing it. And so, when he built that system, we finally had a system capable of doing warehouse-style operations, which are often transformational queries. I briefly mentioned Informatica; they have pipelines for populating relational data, moving data from one place to another. They mainly do ETL: extract, transform, and then load. So, you extract the data, you transform it in the pipeline, and then you load it into your system.

Prior to Informatica, and still enormously popular in relational, is ELT, where you extract the data, you load it into the database, and then you use SQL, generally PL/SQL if you're using Oracle, to write the transformations. So, you're transforming the data in the database. Graph technology was too slow for that. The TP systems, the transactional graph systems, were just too slow to do those transformations. They were also very slow to load data. And then, of course, the other thing you want to do in a data warehouse is analytic queries, where you ask questions not about one person, but about an entire population. As in, don't show me this particular sale that happened yesterday, this transaction, but show me all the transactions over all my stores over the last 12 months. That's the kind of thing you ask a data warehouse that you don't ask a transaction-oriented system.
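For a concrete picture of that warehouse-style analytic question ("all the transactions over all my stores over the last 12 months"), here is a sketch of how it might look as a SPARQL aggregation. The vocabulary and file are hypothetical, and a production GOLAP engine like the MPP systems discussed would run this at scale rather than in-memory; this only shows the shape of the query.

```python
# A minimal sketch of a warehouse-style analytic query expressed in SPARQL.
# The vocabulary (ex:Sale, ex:store, ex:amount, ex:date) is hypothetical.
from rdflib import Graph

g = Graph()
g.parse("sales_graph.ttl", format="turtle")        # assumed sales data in RDF

analytic_query = """
PREFIX ex:  <http://example.org/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?store (SUM(?amount) AS ?revenue) (COUNT(?sale) AS ?transactions)
WHERE {
    ?sale a ex:Sale ;
          ex:store  ?store ;
          ex:amount ?amount ;
          ex:date   ?date .
    FILTER (?date >= "2020-10-01"^^xsd:date)       # last 12 months
}
GROUP BY ?store
ORDER BY DESC(?revenue)
"""
for row in g.query(analytic_query):
    print(row.store, row.revenue, row.transactions)
```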

Graph was missing this. But after Barry started SPARQL City and built that first system, it was no longer missing. Now we are able to do that kind of processing, ask those kinds of questions, and have the data integration element. Before that, people had graph in a box: a niche where the data needed lots of relationships, it needed to be highly connected, and you wanted to run particular algorithms that you could only run if the data was arranged in a graph database. That was where people had graph for the longest time.

Now, once you got to GOLAP, you're able to address a much broader set of questions because you're now dealing with much, much more data.

At this point, we have not reached the limit of how big these systems can be built.

No customer has actually taken us to the point where we have to say, "hey, we can't scale beyond this." It's simply a question of expanding the MPP system, adding more processing in parallel, to address the problem.

Now Graph has this other dimension to it that it had been missing. But I'm not sure how widely that is appreciated yet. I think people don't know that GOLAP is a thing, it's viable, and it's able to take on not just regular warehousing style loads, but obviously graph oriented loads.  

Lots of unstructured data, which we're told is something like 80 percent of the data within an enterprise: the textual data, presentations, emails, regulatory documents, product manuals. All that stuff that every enterprise has tons of is actually far more voluminous, in terms of size of data, than the tabular structured data. I'm being very general here, but that data has been ignored up to now. Maybe not ignored; maybe people had brute-force search engines and things to try and help you find things within those corpora. But actually using that data as part of an analytic has not really been possible. GOLAP makes that possible, because you can address very large volumes of data in a coherent way and combine it with the structured data. Essentially, what you're doing is mining unstructured sources.

You're structuring the data in the graph, but in a way that is richer and truer than you can capture that same data with tabular or relational approaches. And now you can query across both of them. Plus, the transformation queries that you can do in graph are how you knit the graph together. You can use them for the same things you use PL/SQL for: cleaning data and rearranging it in ways that are convenient for getting to the data product that some users are going to access. It's the same thing with graph, only in multiple dimensions. Tables are kind of single dimension, single entity, usually. Whereas with graphs you're dealing with potentially thousands of entity types, and you need a similar way to knit them together, which you can now do with ELT queries in graph. It's a huge step function up. PL/SQL and ELT have been very popular for many decades.

I don't think people realize that you can do the same thing with Graph, and it's just a much easier way of integrating data than anything we've had before. 
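As a rough illustration of graph-side ELT (not any specific product's pipeline), the sketch below loads raw data first and then uses a SPARQL UPDATE to knit two record types together inside the graph, which is the graph analogue of transforming staged rows with PL/SQL. The linking rule and all names are invented for the example.

```python
# A minimal sketch of graph-side ELT: load raw triples first, then transform
# and knit them together in place with a SPARQL UPDATE. Names are hypothetical.
from rdflib import Graph

g = Graph()
g.parse("raw_loaded_data.ttl", format="turtle")    # assumed "load first" step

g.update("""
PREFIX ex: <http://example.org/>
INSERT { ?trial ex:evaluates ?drug }               # knit two record types together
WHERE {
    ?trial a ex:TrialRecord ;  ex:drugName ?name .
    ?drug  a ex:Drug ;         ex:name     ?name .
}
""")

g.serialize(destination="transformed_graph.ttl", format="turtle")
```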

I will say what's been misleading is that transaction-oriented graphs have generally had to have external pipelines for preparing data. So, you have to figure out what the graph is going to look like and then you pull data into it. Often you'll create a schema in the graph, kind of like you do with a relational database; you create the equivalent of a table. You're going to create a description that is less flexible than a schemaless GOLAP system that can essentially arrange itself around the data as the data is loaded.

 

Boris Shalumov: Exactly. You're no longer tied to two dimensions, right?

 

Sean Martin: Exactly.

 

What do Knowledge graphs mean for the IT Department?

 

Boris Shalumov: This also brings some organizational changes, because IT departments have to transform. So, you no longer need those roles where people are focused on maintaining the schema of databases, right? How will these roles change? I guess the question is: what's waiting on the other side? Like what things have to change for the enterprise?

 

Sean Martin: That's an interesting question, because I think it's kind of out there still. We're still somewhat speculating on it. One of the things that graph technology enables is this ability to create a de-siloed view of data across a lot of silos. But often those silos are owned by people and fiefdoms, or there are regulations that create resistance. In the financial services industry, the people who sell things are not allowed to trade, and are not necessarily allowed to be involved in the advisory business that is helping a company go public. So there are regulatory reasons and there are technical reasons why things are siloed within enterprises.

Desiloing that data is often a social problem as much as a technical problem. People, they own the data, they own the people who maintain that data, and that's a little fiefdom. There's a whole lot of social stuff that I think will have to get worked out in individual businesses as they try to figure out how to unlock the combined value of data. That’s one big change. It really now is viable to start building these views across data that is owned by lots of different people. That's going to need some people to change their minds, and that's probably going to come from the top more than from the bottom.

Another one is simply starting to think graph first about your data, knowledge graph first. How do we describe our data in the way that we want? Once you start to build a knowledge graph for an enterprise, you start to change the culture of all the different people who are currently creating databases. As you can imagine, businesses everywhere have thousands and thousands of Access databases and spreadsheets, and then there are all the formal systems run by IT, etc. There's a shadow organization generating a lot of data, as well as the formal organization. All that unstructured data is being generated by individuals within the business, as well as formal things around product catalogs, the website, and so on. How do you get people to start to think about doing whatever they've been doing with the tools that they know, but getting the data into the graph?

The data has an inherent value in terms of a network effect. For the web, when we got our first web server, then you've got a bunch of people with their browsers, then the second web server came up and was online, and then more people with their browsers. The same with the telephone system or the fax system. The more people involved, the more servers involved providing data. The more people who are on the network who could answer the phone or get a fax, the more valuable that network became. And so, they became sort of ubiquitous. Well, the technology is now there to do that for data within a business right now. 

But how do you start to arrange things within your business so that all data that is generated finds its way onto this graph, onto these sort of big, logical enterprise level knowledge graphs? Because it's going to be more valuable connected up to everything else with whatever constraints you need around legal, privacy, and so on. The technology is there to police that kind of thing already. So the question is how does an organization become graph first? Where all the data, not just the historical data (which is where we currently really are, we’re desiloing historical data), but any new data that's being generated by applications, how do you get your developers to start thinking graph first? So that's another big change that I think we're going to be requiring. 

If you get down into the weeds of schemas versus ontologies, I think the people who build schemas are going to be just as comfortable with models. They’re just an abstraction away from where they are now. 

In fact, in some ways, the task gets a bit easier. But it also gets bigger, because you're able to address more data, more widely. The aptitude people have for dealing with relational schemas, warehouses, and ETL all applies, just in a new frame. So, I don't think there's anything for them to be worried about. In fact, if anything, their jobs are going to get a bit easier. Those guys are under terrible pressure to deliver for the business all the time. It's deadline, after deadline, after deadline. It's painstaking stuff. If you're able to deliver more quickly, you've got happier business users, more business, and more solutions. You may end up with a more fulfilling work experience, using technologies that are really delivering things that people appreciate, more quickly and in a more agile fashion. So, I'm not so worried about those people, but there's going to have to be a mind shift.

They've got to learn about these technologies and figure out what the similarities are to the things they do already and what the differences are, and adjust around that. I honestly don't think that's going to be a big shift for people of that kind of temperament and skill level. I think it'll be pretty straightforward. That's what we've always found ourselves. When people come and join us, they generally are able to adapt very quickly, even if they've had no knowledge graph experience. 

So I'd say that the main differences are:

  • Changing the way people think about their data, and changes in the way the organization thinks about data as being more valuable when it's connected than when it's disconnected. 
  • Then breaking up the silos that are protecting the data in ways that are not conducive to the best value to the business. 

The latter changes are the ones that will take longer, and they're what you really need in order to get true value.

If I build an application on Amazon Neptune, for example (a great graph store delivered by Amazon), I use RDF and SPARQL and I'm building an app that is capturing data. It's doing transactions or whatever. That's more of a TP system than an OLAP system. Previously, to get data out of an application system like that, I'd have to do some kind of ETL, then put it into some kind of warehouse. Or maybe I'm using Amazon Aurora, one of the elastic databases. I've built an app that is capturing data, and I want to put that into Redshift to do analytics on it. I've got to do an ETL process: figure out what the schema in Redshift is going to be, and reconstitute the data to do my reporting in Redshift. That's the traditional way, and it's quite painful.

Take Amazon Neptune, on the other hand. I build a graph-oriented application. I'm reading and writing the data in RDF. I'm using SPARQL to query it. To do analytics on it, that data just has to be moved to another GOLAP, RDF-capable store. No ETL is required to do the analytics. You just move the data. There's no transformation required. You just use the data as is. So what does it take to get people thinking, "hey, I should build my application on top of Neptune instead of building it on top of MySQL?" The reason being, the data I've created using graph has become much easier to work with.
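A toy sketch of that "no ETL, just move the data" point, using rdflib locally to stand in for the two stores: a real deployment would use each store's SPARQL endpoints or bulk loaders, and the filename here is a placeholder. The triples exported from the transactional application are the very same triples the analytic store ingests, with no schema mapping step in between.

```python
# A minimal sketch: because both systems speak RDF, export from the
# transactional store and load into the analytic store with no mapping.
from rdflib import Graph

# Export of the application's triples (here simulated from a local file).
app_graph = Graph()
app_graph.parse("neptune_export.ttl", format="turtle")   # assumed export file

# "Load" into the analytic store: the very same triples, no transformation.
analytic_graph = Graph()
for triple in app_graph:
    analytic_graph.add(triple)

# Analytics can start immediately against the same model the app wrote.
print(len(analytic_graph), "triples moved with no schema mapping")
```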

There are also other advantages around the model being something that's easy to share with the business, so that both the developers and the people who specify the system have a shared understanding of what all the data means. These are side benefits of using open data standards. But there's a huge advantage, which is:

“Hey, if I'm thinking graph first in terms of building new applications, then that data is immediately usable elsewhere in the graph.”

Also, if I store it using these open data standards, it means that the data is independent of the application in which it originated. It's self-describing. The model travels with the data. Sometime down the road the model can be introspected from the data it's travelling with, so you're still able to use it. It's future-proofing data, as opposed to building a whole lot of MySQL tables, or SQL Server tables, and so on.
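One way to see "the model travels with the data": given nothing but an archived RDF file (a hypothetical one here), you can ask the data itself which classes and properties it uses, with no application code on hand. A minimal sketch:

```python
# A minimal sketch of introspecting the model from the data itself.
from rdflib import Graph

g = Graph()
g.parse("archived_application_data.ttl", format="turtle")   # assumed old export

introspect = """
SELECT DISTINCT ?cls ?prop WHERE {
    ?s a ?cls ;
       ?prop ?o .
}
"""
for row in g.query(introspect):
    print(row.cls, row.prop)
```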

I'm sure we could go on and pick up many different pieces of this, but those, for me, are the biggest things that are going to have to change for this technology to really see its full promise.

 

The path to enterprise collective intelligence

 

Boris Shalumov: I really like you mentioning this network effect, because I feel like graph technology is the only way to reach some kind of collective intelligence within an enterprise. And so, are we heading towards a new era in data? How big of a change is it? Is it a gradual change that has been happening for the past decades?

 

Sean Martin: What you said at the start of that really resonates for me. When we got into this (I was at IBM, we're talking 1999, 2000 something like that), I started looking at this technology. At the time, I was trying to integrate a whole lot of desktop applets and we wanted them to talk to each other. One of them was an instant messaging tool and another was an LDAP-based tool and another one was expenses. It was a set of productivity apps, but they were decoupled.

We wanted to exchange data between them. But we kept on building new apps. So we wanted a neutral way to exchange data between them. We were trying to use XML schema as a description for the packets. We had a messaging system that all these apps could tap into on the desktop and send each other messages. It was pretty clear that XML schema were too brittle. If a new version of an app came out that had a slightly different schema, everything else would get broken. So, we were looking for something else. At the time, someone who was working for me went off to the AI Lab at MIT to finish his PhD. That's right next door to the W3C site. 

He came back, and said, “hey, have you heard about this thing called RDF?” We hadn't. So we looked at it and thought, wow, this is quite amazing. It will allow systems that are developed independently to exchange data in a meaningful way without being tied up to these rigid schema things and without needing schema conversions. 

For me, it lit up there. What I realized is that there was no other way we were going to get there without sufficiently advanced artificial intelligence. I don't really mean the machine learning stuff that we're doing now, but something that can think a bit like us, to try and make sense of all these formats, dialects, sources, nuance, context, and all this stuff that you need to actually understand how to integrate data. Short of that, there was no other way forward except for these standards being used and applied. And of course, then the logical step was: how do we apply the standards? That spurred the thought: let's go start a company to apply those standards on a broad scale across every piece of the system that needs them. That's how I ended up talking to you today.

For me, there was just no other way forward without something that was like Cambridge Semantics, Inc. to enable integration and faster methods of gathering data. So, it hasn't been short. It's been very long. I mean, we've been working on this, my friends have been working on this, and many other people around the world for more than 20 years. But we're getting somewhere, right. 

We've managed to assemble very large systems in different places, at different companies, that are actually starting to deliver this. What really ends up happening is this long, long lead-up where nothing much happens for a long time, or it happens on a very small scale. There are a lot of setbacks; getting to the scale that we have now was not, and continues to not be, that easy. Until you get certain threshold things conquered, there just isn't a way forward. Until we had the ETL tools working properly, it was a real pain to use our database, because you just couldn't get data into it. Then we created ETL tools that I think are going to be world beaters, because so much can be done automatically, freeing somebody up to do something else. Once you tackle these individual problems, eventually the critical mass of them starts to allow for easier adoption across the board.

I expect adoption to continue to increase as we've seen it, and it keeps accelerating in terms of the number of people involved. As the automation improves, and as we apply machine learning and all the modern techniques coming along that can help us, we'll start to see knowledge graphs used on a much broader scale. The last couple of years have seen more interest in knowledge graphs than ever.

This is an anecdotal measure, but I think we had more PoCs in the last quarter than we've ever had before. It's almost like a whole year's worth in one quarter. That's showing that there's a lot more interest. I think that interest is not only because it's being written about in the press and the tools are getting better. 

People are looking at their business problems and saying, "Well, is there a technology out there that is viable, practical, and not too expensive that can deliver this?"

As we continue our own plans to make the software more packaged and automatic, it just gets easier and easier for people to adopt. Look at the iPhone: very simple to use, a ubiquitous interface. If you think about all its antecedents, all the different things we had before (BlackBerrys, Windows Phones), each one added something incrementally until you reached a point where suddenly these things hit the mass market. Now, the mass market for data integration and the generation of data products is smaller than the iPhone's, but the same rules apply. It's a question of maturing the technology, making it more automated, picking the problems you're going to solve, and solving them really, really well. All of those contribute to something that is a world changer. The iPhone was a world changer.

I think we're looking at something similar, at a more modest scale. It may touch a lot of people who will use the data that is generated, but the people who actually integrate data are probably a relatively small community, though a vital one in every business. It's going to be a huge change.

How long does it take for that to trickle into businesses? One PoC, one production deployment at a time, until you start to get critical mass. It's very difficult to estimate. But often these things take off very suddenly, and you're never quite sure what the rocket fuel was. Think about Voice over IP (VoIP). We used various early forms of it, and then suddenly Skype was there in a package. I think the critical things they solved were the quality of the voice and the fact that the client could traverse firewalls automatically, and suddenly the thing took off. They solved the two little things that had held the technology back to a much smaller audience, and suddenly we weren't paying for phone calls anymore. It happened very quickly once those two blockers were finally removed. So it will be interesting to see what that ends up being for knowledge graph technology.

 

The future is already here, it's just not evenly distributed

 

Boris Shalumov: Sean, what do you think? Knowledge graphs are conquering one industry after another, right? We have social networks, search engines, natural language understanding, master data management, all these classical things. Do you think they will soon be present in other applications as well, for example in IoT or in self-driving cars, to connect all the different devices?

 

Sean Martin: The short answer is yes. There's a great quote by William Gibson, the cyberpunk fiction author: "The future is already here, it's just not evenly distributed." I'm paraphrasing; it's probably not exactly that, but it's essentially that. We were seeing knowledge graphs for a long time before they were even called knowledge graphs. I think the term came from Google and the Google Knowledge Graph, which was really a graph intended to provide context for your search results. Different companies have used graphs in different ways. LinkedIn and Facebook are graphs, social graphs. So we saw glimpses of the future and of future use cases there. But of course, as this technology matures, it is going to be used more and more as a far more flexible way of managing data, whether that's data resting in databases on disks waiting to be found, data in flight through IoT systems, or big REST deployments. All of those things are starting to happen; they just happen a little bit at a time. You need various shake-outs. There are many graph engines out there, and they talk in different dialects until people start adopting one. That's one of the things that gets in the way.

SPARQL is a W3C standard that's been with us a long time. There's now a new effort to create an ISO standard called Graph Query Language (GQL), which will be another one of these things vendors will probably have to adopt or be excluded from selection, because people don't want to buy into proprietary skills. As the market shakes down, there will be fewer vendors. All of these are simplifying events that will spur wider adoption. Graph-first will become the way.
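To make the point about standards concrete, here is a minimal, hypothetical sketch (not from the conversation) of what vendor neutrality looks like with SPARQL: the same W3C-standard query can be sent, unchanged, to any compliant endpoint. The endpoint URL, vocabulary, and data are invented, and the Python SPARQLWrapper library is just one common way to issue the query.

```python
# Hypothetical sketch: one standard SPARQL query, usable against any
# SPARQL 1.1-compliant endpoint. Only the endpoint URL would change
# if you swapped one vendor's store for another's.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://example.org/sparql")  # placeholder endpoint
endpoint.setQuery("""
    PREFIX ex: <http://example.org/schema#>
    SELECT ?device ?reading
    WHERE {
        ?device a ex:Sensor ;
                ex:latestReading ?reading .
    }
    LIMIT 10
""")
endpoint.setReturnFormat(JSON)

results = endpoint.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["device"]["value"], binding["reading"]["value"])
```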

I don't see any other way you can do the things people want at scale, in any practical way, without using graphs or whatever follows graphs. 

I also think we're a long way from needing to think about what follows graphs at this point; we've still got to go through the whole graph thing. Will you see it ubiquitously? Yes, you absolutely will. In many cases it'll be invisible. It'll be in your phone apps, you'll be interacting with it on the web, and so on. All of that is absolutely going to happen, because there really isn't any other way to do this that is cheap enough, now that the approach is starting to prove itself out. It will rise head and shoulders above the alternatives simply because it is more practical and cheaper.

There may be certain twists needed in terms of more middleware, for developers for example, but a whole generation of people was taught about graphs in the EU over the last 20 years. I think a lot of them have been waiting for graph software to become practical enough to apply in their businesses, and that's started to happen in the last two or three years. Suddenly we're getting an influx of people in Germany, France, and Spain who know about these technologies, know about the standards, and may have failed 10 years ago (because something wasn't scalable or mature enough), but are now saying, "Hey, you guys, this is something we can actually use."

They're starting to really kick the tires, trying to see if they can apply it to very difficult business problems in all of the areas you mentioned earlier. There will be some failures. There will be some enormous successes, too. That's how we progress.

Once you get momentum behind knowledge graphs, it can snowball very quickly into something very big. I think we're definitely starting to see that.

If you're looking for early indicators, look at the valuations of the graph companies started in the last few years, and look at the money the VCs are putting in. Some people really do think this is going to work, and they're putting their money where their mouths are.

 

Competitive advantage

 

Boris Shalumov: Last question. What do you think: how far away are we from not using knowledge graphs being a huge competitive disadvantage? When will knowledge graphs be a standard part of enterprise landscapes?

 

Sean Martin: I honestly don't know, and I'm probably not the right person to ask that because I'm always optimistic. Too optimistic. I can see where it's going to go, but I have no idea how long it's going to take. Except that it'll take longer than even my most pessimistic view of it.  

We're now starting to see successful production deployments of knowledge graphs at significant scale, in terms of the volumes of data, doing well what we wanted them to do. That's the start of it. How long is it going to take? Who knows. I mean, mainframes are still with us; they're not really going away either.

Here's the thing: it's painful to get any system working. It takes a lot of effort, a lot of trial and error, and a lot of expense. Once people have something working, they're very reluctant to throw it out in favor of something else; it's high risk. So I think it'll take a good while before you see knowledge graphs really replacing everything.

But do I think you'll see them quite soon building these data fabrics? Absolutely. I think that's starting to happen. Enterprises take two or three years to really wind themselves up, and they started looking at this in earnest two or three years ago. Not the R&D skunkworks groups; mainstream IT and architects have been looking at this technology for a few years now. And by looking at it, I mean investing in it, running PoCs, putting systems in production, and pushing back on the features they think they need to meet enterprise expectations: single sign-on, Kerberos authentication, and all the things that get left to the end. All of those have been demanded and are being put in.

We're going to start seeing increasing amounts of interest in particular areas. The data fabric is a very good one, but there are many domain-specific ones too. If you look at any of the graph companies' websites, you can read the use cases and see what their customers are doing with them. That's where you get your feet wet with graphs. Once you've succeeded with something... nothing succeeds like success. Generally, at this point, you're not going to adopt graphs broadly until you've succeeded with them in some way.

It's the early adopters who are getting into this. But once those early adopters have proved the way, you'll see normal adoption by a much bigger population who might buy the product to get the same kind of benefit. So I think we're going to work our way through the normal adoption curve for any technology, and it's been a long time coming.

I don't think people realize quite how much had to be done to make this work. From my point of view, it's been a rewrite of the entire compute stack from top to bottom. We've had to build every single piece of this, and the issue is that if you don't have every single piece, you can't deliver a solution: there's a big gap that has to be filled with some kind of integration, and that makes it too expensive and too risky.

Once you've got all those pieces assembled, suddenly it becomes something that an early adopter can think about taking on. Not a researcher, but someone who actually has a business to run and a business problem to address.

It's going to take a while, but I think you'll see ever-increasing amounts of it. From my obviously biased perspective (I've invested an awful lot of my time working on this), knowledge graph seems to us like a much better approach in general.

It solves so many of the problems that plagued the earlier technologies. In particular, metadata and data can be combined, and you can deal with essentially any level of complexity in the data: lots of relationships, huge heterogeneity in the types of things you want to tie together. Nothing else comes close. So people who are trying to unlock the value of their data will recognize that and say, "Okay, it's time for me to get my feet wet."
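To illustrate the point about combining metadata and data, here is a minimal, hypothetical sketch (not from the conversation): schema-level statements such as RDFS labels and comments live in the same graph as the instance data, so a single query can read both. All names and URIs are invented; it uses the Python rdflib library.

```python
# Hypothetical sketch: metadata (class definitions and documentation) and
# data (instances) stored and queried together in one RDF graph.
from rdflib import Graph

g = Graph()
g.parse(format="turtle", data="""
@prefix ex:   <http://example.org/schema#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# Metadata: what an ex:Supplier is, in human-readable terms.
ex:Supplier a rdfs:Class ;
    rdfs:label "Supplier" ;
    rdfs:comment "An organization we buy parts or services from." .

# Data: an actual supplier instance.
ex:acme a ex:Supplier ;
    rdfs:label "ACME Components Ltd." .
""")

# One query walks from the instance to the documentation of its class.
query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?instanceLabel ?classComment WHERE {
    ?instance a ?class ;
              rdfs:label ?instanceLabel .
    ?class rdfs:comment ?classComment .
}
"""
for row in g.query(query):
    print(row.instanceLabel, "-", row.classComment)
```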

That's been happening on an accelerated basis for the last year or two, even with COVID. Things didn't stop. They slowed down a little bit, but they've come roaring back after the first three or four months of COVID, while the world was kind of trying to figure out what was going on. After that, things really picked up again. 

So anyway, knowledge graph is inevitable, but I'm the wrong guy to guess the time frame. It's taken me much longer, at least 10 years longer, than I thought.

 

 

Tags: Smart Data, Metadata, Semantics, Data Integration, Semantic Web, Data Lake, Machine Learning, Cognitive Computing, Artificial Intelligence, Unstructured Data, Semantic Layer, Graph Database, Podcast, Data Fabric, Knowledge Graph, semantic graph

