Any biomedical professional is certain to have come across some variant of the headlines in the visual above. Combinations of trending buzzwords in technology and healthcare form half of my news feed; the other half merely mirrors the first.
More often than not, such intriguing pieces make one or more of the following overarching points:
- Data Professionals now have a lot of data and applying Machine Learning techniques to them can yield insights that expedite the drug development process
- Data Professionals do not have all the data they need to be able to answer the questions they care about
- Advances in cloud, data storage, and computing have enhanced the scale and effectiveness of analytical workflows manifold
All these assertions are, for the most part, solidly anchored to reality. There is one minor detail that is often overlooked, probably because it doesn’t make as appealing a headline as some of the other topics mentioned above.
What Do You Mean?
“We have this large volume of data from subscribed database X, subscribed database Y, subscribed database Z, and PubMed, and experimental results, and …. we are struggling to make sense of how they are interrelated.”
Most conversations about data and analytics in the (bio)pharmaceutical industry these days are incomplete without this sentiment. Drug manufacturers, Contract Research Organizations, Contract Manufacturing Organizations, Market Research providers, Equipment Providers, Academicians, and every other organization with anything to do with drug discovery and development are united in their battle against complicated data. The labyrinthine data research processes, encompassing, among others, semi-manual PubMed searches, tired eyes poring over spreadsheets for days, and limitless copy-edit-paste cycles, indicate that there is something fundamentally suboptimal about how data is consumed in the industry. It is, indeed, complicated.
Complicated, though, is not usually a good thing. John Oliver would have probably added:
“If your boss asked you where you were yesterday and you responded ‘It’s complicated’, you better have cracked the interview that he suspects you were at anyway.”
One of the most enduring lessons I took away from graduate school is the distinction between complicated-ness and complexity. Complicated-ness is bad; complexity is a friend when overlaid with something people understand. A watch, for instance, is complex. People, however, understand what the ticking hands mean. The complexity of the watch is necessary for it to do its job. Most users only need its output and are perhaps grateful that they do not have to dabble in the mechanics of the device to tell the time. One wishes the same could be said about how users currently interact with data in the life sciences industry.
Biomedical Data Now Means a Means to Cure
Much like the interface of a watch, a layer of meaning can radically alter a user’s perspective on data. A user, equipped with a Canonical Model that uniquely and consistently identifies entities (such as drugs, genes, proteins, and companies) across all sources, has a realistic shot at uncovering the cures that, as the headlines dutifully inform us, lie hidden in the data. At a more operational level, such a canonical model makes a difference because it enables:
1. Discovery of Latent Knowledge
Once the data are interconnected, relationships that were hitherto fragmented come to the fore. Network biologists can finally materialize the networks of uniquely identified entities sourced from structured and unstructured data. Translational researchers can meaningfully traverse graphs of data about drugs throughout their life cycles (both drugs’ and researchers’) to confirm or disprove their (researchers’, not drugs’) hypotheses. Program Managers can quickly identify signals about an asset’s success or failure through near-real-time analysis of diverse self and competitor data. Market Researchers and Competitive Intelligence professionals can generate actionable insights of greater value faster. Medical devices and wearables firms can expand the spectrum of parameters they observe and to which they can link their data. CROs can approach a potential client as soon as it appears that a drug is likely to advance along the clinical development pathway. There is, indeed, enough cake for everyone.
2. Advancement of Discovered Knowledge
A canonical model lays the foundation upon which the edifice of analytical output is built. Knowledge discovery forms the first few floors of the building. The discovered knowledge in turn serves as input for analytical techniques that advance knowledge. Statisticians and Data Scientists now have access to an unprecedented spectrum of variables to feed into their data-hungry (or data-starved) algorithms and models. The outcomes from such models in turn become inputs into downstream analytical engines, all the while getting enmeshed in the canonical model and extending it.
3. Better resource allocation
From an operational standpoint, the extensibility of the canonical model enables a greater share of the investment of time, money, and personnel to be directed to data analysis rather than data preparation. I will take the liberty of not belaboring the point here, because I have done so elsewhere.
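To make the idea concrete, here is a highly simplified sketch of what a canonical model does: source-specific names for the same entity resolve to one canonical identifier, so relationships recorded under any of those names become traversable as a single graph. Every ID, name, and relationship below is an illustrative assumption, not a real database entry.

```python
# Hypothetical synonym-to-canonical-ID mapping, merged from several sources.
CANONICAL_IDS = {
    "acetylsalicylic acid": "DRUG:0001",  # e.g. from subscribed database X
    "aspirin": "DRUG:0001",               # e.g. from PubMed text mining
    "ASA": "DRUG:0001",                   # e.g. from an internal spreadsheet
    "PTGS2": "GENE:0042",
    "COX-2": "GENE:0042",
}

# Relationships keyed by canonical ID rather than by raw source strings.
EDGES = {
    "DRUG:0001": [("inhibits", "GENE:0042")],
    "GENE:0042": [("implicated_in", "DISEASE:0099")],
}

def resolve(name):
    """Map a raw source string to its canonical ID (or leave it as-is)."""
    return CANONICAL_IDS.get(name, name)

def neighbors(name):
    """Find everything directly related to an entity, however it is named."""
    return EDGES.get(resolve(name), [])

# "aspirin" and "acetylsalicylic acid" now reach the same relationships.
assert neighbors("aspirin") == neighbors("acetylsalicylic acid")
```

Without the resolution step, a search for “aspirin” would miss everything recorded under “ASA”; with it, every downstream analysis sees one entity.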
The canonical model works because it is based on the meaning of data, not just the way it is structured. Enough keyboards have bitten the dust over the last two decades in service of evangelizing semantic technologies that interconnect data based on their meaning. It’s finally happening, it’s finally happening!
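The meaning-based interconnection that semantic technologies provide can be hinted at with a toy triple store: facts are stored as subject-predicate-object statements and queried by pattern, independent of any one source’s table layout. The entities and predicates below are illustrative assumptions, not drawn from any real ontology.

```python
# Facts as (subject, predicate, object) triples, the core idiom of
# semantic technologies such as RDF (shown here without any library).
TRIPLES = {
    ("aspirin", "is_a", "drug"),
    ("aspirin", "inhibits", "COX-2"),
    ("COX-2", "is_a", "protein"),
    ("COX-2", "implicated_in", "inflammation"),
}

def query(s=None, p=None, o=None):
    """Return all triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in TRIPLES
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )

# "What does aspirin inhibit?" -- asked by meaning, not by source schema.
print(query(s="aspirin", p="inhibits"))  # [('aspirin', 'inhibits', 'COX-2')]
```

Because every source contributes statements in the same shape, a new source extends the graph rather than demanding yet another bespoke join.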
To learn more about overlaying semantic technology over a data lake, download our whitepaper "Keeping the Big Data Promise: Data Lakes for Life Sciences".
PS: Dear semantic technology practitioners and critics, the canonical model is not “One Ontology to Rule Them All”. It is, instead, a rich network of interconnected ontologies that users can flexibly traverse to shortlist the data they need for analysis.
PPS: The title is a play on this excellent book.