In September 2016 Cambridge Semantics attended Strata+Hadoop World 2016 in New York, NY. While we were there, Marty Loughlin, our VP of Financial Services, spoke to a gathering of attendees about who we are and what our platform does. Here is his presentation.
Good morning. Thank you for taking the time to join us this morning. My name is Marty Loughlin, Vice President of Financial Services at Cambridge Semantics. We’re a software company in Boston, and we have a unique approach to the data lake, which we call the Smart Data Lake®, and we’re going to take a few minutes to talk about that.
So, I realize we are the last presentation before lunch so we should probably keep this short, but really not that short. The point here is that what we really want to be able to do in the enterprise is enable our business users to have access to all of the enterprise data in a way that they can ask any question they want of that data, even very complex data, without having to go back to IT for more access or more capabilities. So, we say “any questions?”. These are some examples of the questions that we’re helping our customers answer. They range from, if you think back to the crisis of 2008 in financial services, “what’s my exposure to Lehman?”, which is a very complex question for large organizations to answer, through to questions like “what is the best investigator to use for a phase 2 clinical trial for an injectable liver drug, anywhere in the world?”. The characteristic of these questions is that they require the harmonization of data across many diverse and large data sets. And they also require us to be able to answer questions that we don’t necessarily know in advance.
So, our approach to this is as we think about the enterprise landscape today, we think of it as a puzzle. Your data is residing in many disparate databases, many disparate sources, and it is pretty difficult to ask questions that span those data sources. We can bring a couple together, we can bring maybe three together, in the current BI tools but to go beyond that is very challenging. And at Cambridge Semantics what we are doing is helping you bring all of your enterprise data together with a common, harmonized meaning and a way to access all of that data without having to worry about data integration. So as a business user, I can now view the data in many different ways, ask different questions that span many parts of that data without having to worry about where that data came from or how to join it together.
Of course, the beauty of this is that if you think about any puzzle, you don’t know what pieces are missing until the end of the puzzle. The same applies when you apply a model to your data. When you bring it all together, you can suddenly identify where the gaps are and use that as a way to tap into new data sources to fill in the picture.
So, how do we do this? So, our platform is based on semantic web technology. We use models to describe data, high-level business conceptual models and we build those across all kinds of data sources. So, for example, if you think about a traditional relational data source that has rows and columns, we can represent that as a graph, as an entity and its relationships and properties. We can do this for unstructured data, too. So, we have natural language processing capabilities that identify entities and concepts and sentiments in text, and we can extend the graph with anything that we extract from those sources. This also applies to other kinds of data sources, enterprise systems that may have structured and unstructured data. We can extend the graph with data extracted from those systems also.
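To make the rows-and-columns-to-graph idea concrete, here is a minimal sketch in plain Python. It is an illustration only, not the platform's actual mapping logic: the `row_to_triples` helper, the `Company` table, and the column names are all hypothetical. Each row becomes an entity, and each non-empty column becomes a property of that entity, expressed as a (subject, predicate, object) triple.

```python
# Hypothetical illustration: turn one relational row into graph triples.
Triple = tuple[str, str, str]


def row_to_triples(table: str, key_column: str, row: dict) -> list[Triple]:
    """Represent a row as an entity with a type and one triple per column."""
    subject = f"{table}/{row[key_column]}"  # the row's identity in the graph
    triples = [(subject, "type", table)]
    for column, value in row.items():
        if column == key_column or value is None:
            continue  # the key is already in the subject; skip empty cells
        triples.append((subject, column, str(value)))
    return triples


# Assumed example row; the values are made up for illustration.
company_row = {"id": 42, "name": "Acme Pharma", "market_cap_usd": 1200000000, "ticker": None}
company_triples = row_to_triples("Company", "id", company_row)

for triple in company_triples:
    print(triple)
```

Facts extracted from text or from other systems can be appended to the same list of triples, which is what extending the graph amounts to in this simplified picture.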
And so, what we are doing is building a representation of the data that's expressed in high-level business conceptual terms, that spans all the different data sources that were brought together, but isn’t tied to the physical representation of any of them. We can extend this model easily and we can now ask questions that span many hops in the model. We can ask a question about a company’s market cap and how that relates to safety signals that may be captured in a note. Think about doing that in a relational world, where it requires joins across many complex data sets. You can do it, but it requires sophisticated programmers and developers. Our tools let users build these queries with a visual query builder to access this kind of data.
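The market cap and safety signal question can be sketched as a chain of graph patterns. The example below is a toy pattern matcher in plain Python, written under stated assumptions: the triples, the predicate names, and the `match` function are hypothetical, and this is neither the product's query engine nor SPARQL, the standard query language for this kind of data. It only shows how one question can walk from a note to a drug to a company without hand-written joins.

```python
# Hypothetical data: a note mentions a drug, the drug has a manufacturer,
# and the manufacturer has a market cap.
TRIPLES = [
    ("company/1", "marketCap", "1200000000"),
    ("company/2", "marketCap", "300000000"),
    ("drug/7", "manufacturedBy", "company/1"),
    ("drug/8", "manufacturedBy", "company/2"),
    ("note/3", "mentions", "drug/7"),
    ("note/3", "safetySignal", "elevated liver enzymes"),
]


def match(patterns, triples, binding=None):
    """Yield variable bindings that satisfy every pattern (terms starting with '?' are variables)."""
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        matched = True
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against an earlier binding.
                if candidate.setdefault(term, value) != value:
                    matched = False
                    break
            elif term != value:
                matched = False
                break
        if matched:
            yield from match(rest, triples, candidate)


# "Which safety signals relate to companies, and what is their market cap?"
question = [
    ("?note", "safetySignal", "?signal"),
    ("?note", "mentions", "?drug"),
    ("?drug", "manufacturedBy", "?company"),
    ("?company", "marketCap", "?cap"),
]
results = list(match(question, TRIPLES))
for row in results:
    print(row["?company"], row["?cap"], row["?signal"])
```

Each pattern shares a variable with the next, so the match follows the relationships in the graph one hop at a time.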
So the model that we use, this is an example from the Pharma and Life Science space. That model we use to harmonize those different data sources is also the model that we use to provide users access to the data.
So, you can build dashboards, visualizations, search, analytics, filtering, all driven by the model. These dashboards are graph-aware. And this is the power of being able to ask questions that you didn’t necessarily know in advance. You can infer relationships, you can construct new data through these dashboards and build views that, as we said, answer questions that were not anticipated in advance.
So, how do we do that? We have a very scalable platform that we’ve been developing since 2007. The team that started the company came out of IBM, where they were working on semantic web technology, and built a toolset that allows us to do all of what I described end to end without writing any code. So this is business analyst and business user capabilities. And we can do it at scale. We do that through a couple of different approaches.
So, this picture is our end-to-end architecture. The middle layer is the storage layer. We sit on top of almost any kind of storage platform. So we leverage Hadoop HDFS, we sit on top of Amazon’s S3 storage, we sit on top of Google’s Cloud Platform. We’re independent of the storage itself. And our tools on the bottom layer, this data lake server, allow you to prepare data for storage in the lake. We have tools that catalog data sources, capture configurable metadata and assist the business analyst in mapping those sources onto the common semantic model. We do that for structured and unstructured data. We ship with natural language processing and text analytics capabilities. But more importantly we have a framework into which you can plug third-party tools, so if you are using Linguamatics or Lexalytics or any one of over forty products to do work on unstructured content, we can take the output of that and harmonize it with the graph so that you can do analytics across combined structured and unstructured content.
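As a rough picture of what mapping sources onto a common semantic model means, the sketch below renames source-specific columns to shared property names and then reports which model properties are still unfilled. Everything in it is an assumption made for illustration: the source names, the column names, the `harmonize` and `missing_properties` helpers, and the sample records. It does not reflect the product's actual mapping tools or file formats.

```python
# Hypothetical common model for an "Investigator" entity.
COMMON_PROPERTIES = {"name", "country", "specialty"}

# Hypothetical per-source mappings: source column -> common model property.
SOURCE_MAPPINGS = {
    "crm_export": {"full_name": "name", "ctry": "country"},
    "trial_registry": {"investigator": "name", "therapeutic_area": "specialty"},
}


def harmonize(source: str, record: dict) -> dict:
    """Rename mapped columns to common property names; drop unmapped columns."""
    mapping = SOURCE_MAPPINGS[source]
    return {mapping[col]: value for col, value in record.items() if col in mapping}


def missing_properties(entity: dict) -> set:
    """Report which common model properties the entity still lacks."""
    return COMMON_PROPERTIES - set(entity)


# Made-up records describing the same person in two systems.
from_crm = harmonize("crm_export", {"full_name": "Dr. A. Rivera", "ctry": "ES", "internal_id": 17})
from_registry = harmonize("trial_registry", {"investigator": "Dr. A. Rivera", "therapeutic_area": "hepatology"})

merged = {**from_crm, **from_registry}
print(merged)
print("gaps after CRM only:", missing_properties(from_crm))
print("gaps after both sources:", missing_properties(merged))
```

The gap report is the same idea as the missing puzzle pieces: once sources are expressed in shared terms, it becomes visible which properties no source has supplied yet.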
Once you have the data in the lake, there are ways for end users on the top layer to browse the available data sets. We call this Smart Data Discovery. They can browse the available data sets in the lake, select the ones they are interested in and load them into memory. And this is really the secret sauce in terms of our ability to scale this platform. If you know anything about semantic web technology, it's been around a long time, it’s very powerful, but it has a performance overhead in terms of being able to do interactive queries. We’ve solved that problem with this technology.
So, the Graph Query Engine is a massively parallel, in-memory graph database that is compliant with semantic data standards. We have it running on 200 nodes on Google’s cloud platform at the moment and we're about to complete a 1 trillion triple benchmark. And we’re going to blow the socks off the last one. What that enables us to do is interactive query and analytics, all of the things our dashboards support, at very large scale.
So, when you think about the smart data lake and how you would leverage a platform like this, we think along two dimensions. One is the number of data sources, from few to many. And the other is the size of the data sets, from small to large. And so we can handle simple data, we can handle small data sets. We can handle big data sets that are simple in structure but that’s not really our sweet spot. Our sweet spot is really when you’ve got multiple data sources, very diverse data of different entity types where you want to ask questions that span across those entities and data sets. And we can do that for complex data at very large scale.
If you’re interested in learning a little bit more about what we do, or you want to see a demo of anything from the data ingestion in the Smart Data Lake server component of the platform all the way through to the use cases that we have applied this technology to in Financial Services and Life Sciences, we’re at booth 564 and we’d welcome you over for a demo.
To learn more about the Anzo Smart Data Lake, download our white paper here.