If you’ve ever taken a look at familiar inventions throughout history and how they came to be, you’ll notice that the majority were built from ideas and concepts that were around at the time but put to work in new ways. There was a problem at hand and someone came up with a novel way of solving it. Our current collective problem is a natural outcome of our speedy transition to a digital society. Simply put, we are all drowning in data.
Statista projects that by 2025, global data creation will grow to more than 180 zettabytes. Anyone who thinks they are ahead of that curve may not appreciate the enormity of the challenge; if not today, then soon. Our ability to interpret and act on data isn't keeping up with what's collectively coming at us. The data isn't just arriving fast; what we need to understand about it keeps changing just as quickly. Ideally, we'd be agile with data, able to move from data to knowledge to insight to action as quickly as possible.
While we’ve figured out great ways to share massive amounts of data, we haven’t figured out the best ways of sharing what we know about it. That requires formalized definitions, meanings and interpretations—a specialized language about the data we care about. It turns out that the pieces to solve this problem are already available and being put to work in a wide variety of compelling, real-world environments. While many of the components might be familiar to some, they are now being used in new ways to solve these problems and more.
Let’s Start with Data
A quick history of the evolution of databases might read something like “indexed, relational, specialized and then multi-model.” Multi-model stands out as a category because it uses metadata to represent (materialize) data in just about any way you’d like to see it: SQL tables, flat files, graphs, key-value and so on. In other words, the same data but many perspectives on it, as the short sketch below illustrates. That flexibility makes multi-model appealing for three patterns today: applications, platforms and fabrics.
Applications meet specific needs for specific users (short timeframe), while platforms meet shared needs for aligned users (moderate timeframe)—with enterprise fabrics intended to meet all potential needs for all potential users, internal or external (much longer timeframe).
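To make the “same data, many perspectives” idea concrete, here is a small, purely illustrative sketch in plain Python. The customer record and its field names are invented for the example; a real multi-model database would do these projections for you from metadata.

```python
import json

# A single "customer" record held once, then projected into different models.
# The record and field names here are hypothetical, purely for illustration.
record = {
    "id": "cust-1001",
    "name": "Avery Chen",
    "email": "avery@example.com",
    "segment": "wholesale",
}

# Document view: the record as a JSON document.
document_view = json.dumps(record, indent=2)

# Relational view: a header row plus a value row, as a SQL table might expose it.
columns = list(record.keys())
row = [record[c] for c in columns]

# Key-value view: one composite key per field.
kv_view = {f"customer/{record['id']}/{k}": v for k, v in record.items()}

# Graph view: subject-predicate-object triples derived from the same fields.
triples = [(record["id"], predicate, value)
           for predicate, value in record.items() if predicate != "id"]

if __name__ == "__main__":
    print(document_view)
    print(columns, row)
    print(kv_view)
    print(triples)
```

Same data, four ways of looking at it; which view you want depends entirely on the question you are asking at the time.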
Returning to our desire to share encoded knowledge about data: while all three multi-model database patterns are useful, the enterprise fabric pattern is both the most difficult to achieve and the most compelling in terms of outcomes.
Ideally, we’d use a multi-model database that could support and integrate all three patterns as needed during an adoption phase in a larger organization. And there are many examples of larger organizations doing just this by standardizing on a single multi-model database technology for all three patterns.
But How Do You Create Metadata in the First Place?
Not surprisingly, the hardest part turns out to be creating and improving the metadata that is used to describe the data. Simple labeling isn’t too hard: where did this data come from, when did we get it, agreed fields and formats, and so on.
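A minimal sketch of that kind of simple labeling might look like the snippet below; the dataset, field names and formats are invented for illustration.

```python
from datetime import datetime, timezone

# Simple descriptive labeling of a dataset: provenance, receipt time, and the
# agreed fields and formats. The dataset and field names are hypothetical.
descriptive_metadata = {
    "source": "partner-sftp/daily-orders.csv",
    "received": datetime.now(timezone.utc).isoformat(),
    "format": "csv",
    "fields": {
        "order_id": "string",
        "customer_email": "string",
        "order_total": "decimal(10,2)",
    },
}

print(descriptive_metadata)
```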
But, for example, how do you determine that something is personally identifiable information (PII)? When data is identified as PII, it should trigger a set of enforced handling rules, and failing to identify PII creates avoidable risk. Making matters more interesting, the rules, definitions and interpretations surrounding PII change themselves, often rapidly; what we know about PII and what it means is always a moving target. If handling PII consistently and uniformly matters to you, how would you verify that your current knowledge about data with potential PII is applied consistently across the entire organization and its ecosystem of partners?
You would first have to encode a set of rules for how PII should be identified in any form of data you are responsible for, keeping those rules updated as data sources and interpretations change. Next, you would define a set of rules for handling data once it has been identified as PII: some uses are OK, others are not. Those rules change as well.
Most importantly, you would have to enforce those rules against all uses of the data that you are responsible for and be able to prove that in an audit context. How people will want to use data may change—and audit requirements likely will too.
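As a loose illustration of what encoding, applying and auditing such rules could look like in software, here is a deliberately simplified Python sketch. The identification patterns, handling policies and audit format are invented for the example; they are not a real compliance framework.

```python
import re
from datetime import datetime, timezone

# Hypothetical, simplified PII rules: how to identify PII and how it may be handled.
# Real rules would be far richer and would change over time.
IDENTIFICATION_RULES = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}
HANDLING_RULES = {
    "email": {"analytics": "allow", "export": "deny"},
    "ssn": {"analytics": "deny", "export": "deny"},
}

def classify(value: str) -> list[str]:
    """Return the PII categories detected in a value."""
    return [name for name, pattern in IDENTIFICATION_RULES.items() if pattern.search(value)]

def enforce(value: str, use: str, audit_log: list[dict]) -> bool:
    """Check a proposed use against the handling rules and record the decision."""
    categories = classify(value)
    allowed = all(HANDLING_RULES[c].get(use) == "allow" for c in categories)
    audit_log.append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "use": use,
        "categories": categories,
        "allowed": allowed,
    })
    return allowed

audit_log: list[dict] = []
print(enforce("Contact: jane@example.com", "export", audit_log))  # False: email export denied
print(enforce("Order total: 42.00", "export", audit_log))         # True: no PII detected
print(audit_log)
```

The point is not the specific rules; it is that identification, handling and audit all flow from one encoded, updatable body of knowledge about the data.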
This three-part problem shows up in a surprising number of situations, with PII just being one illustrative example:
- How do you encode your knowledge about the data?
- How do you use that knowledge to identify and handle important data?
- Most importantly, how do you get the data and the encoded knowledge about the data to be used uniformly everywhere?
- And how do you do this in an agile, trusted way?
Enterprises need a holistic approach to their data, one that lets them drive strategic data initiatives and business growth. Metadata management is the solution.
How We Encode Knowledge About Data Today
There is a wide variety of ways we encode our knowledge about data, ranging from urban folklore to precise knowledge graphs. In between, we’ll find familiar artifacts such as researcher notebooks, data dictionaries, glossaries, ontologies, metadata managers and others.
A better way is to use semantic knowledge graphs (SKGs) to encode our knowledge of data. SKGs are a very handy way of representing deep, specialized meanings and interpretations of facts, digital or otherwise. They have become de rigueur in knowledge and metadata management disciplines because they are a rich, flexible representation that readily encapsulates and extends the artifacts listed above.
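For a small taste of what that encoding can look like, here is an illustrative sketch using rdflib, a common open-source graph library for Python. The vocabulary (namespace, class and property names) is invented for the example rather than drawn from a published ontology.

```python
from rdflib import Graph, Literal, Namespace, RDF, RDFS

# A tiny, illustrative knowledge graph about data. The vocabulary is invented
# for this sketch, not a published ontology.
EX = Namespace("http://example.org/data-knowledge#")
g = Graph()
g.bind("ex", EX)

# Encode knowledge: email addresses are a kind of PII,
# and the customer_email field contains email addresses.
g.add((EX.EmailAddress, RDFS.subClassOf, EX.PersonallyIdentifiableInformation))
g.add((EX.customer_email, RDF.type, EX.DataField))
g.add((EX.customer_email, EX.contains, EX.EmailAddress))
g.add((EX.customer_email, RDFS.label, Literal("Customer email address")))

# Ask the graph: which fields contain (any subclass of) PII?
query = """
PREFIX ex: <http://example.org/data-knowledge#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?field WHERE {
    ?field ex:contains ?kind .
    ?kind rdfs:subClassOf* ex:PersonallyIdentifiableInformation .
}
"""
for result in g.query(query):
    print(result.field)  # http://example.org/data-knowledge#customer_email
```

Notice that the answer to the query follows from the encoded relationships, not from anything hard-coded about the customer_email field itself; add a new kind of PII to the graph and the same query picks it up.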
However, none of these artifacts, from glossaries to knowledge graphs, manages source data; they manage encoded descriptions of source data. They are almost always divorced from the source data itself, and they are not usually intended to let software evaluate data and make decisions about it. To do that, metadata must be created from the data at hand.
How We Create Metadata Today
To interpret any form of data, metadata must be created about that data, and the richer and more automated the metadata creation, the better. We have a stunningly wide variety of tools available to look at data and create rich metadata automatically. We use sentiment analysis on social feeds, image recognition on video and pattern recognition on IoT streams. Even simple textual search becomes powerful when it is informed by this kind of metadata.
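For a flavor of automated metadata creation from text, here is a minimal sketch that uses NLP-based entity extraction to tag a document. It assumes spaCy and its small English model ("en_core_web_sm") are installed; a production pipeline would use models tuned to the domain's vocabulary, and the sample text is invented.

```python
import spacy

# Load a general-purpose English pipeline (assumed to be installed locally).
nlp = spacy.load("en_core_web_sm")

text = (
    "Acme Logistics signed a three-year agreement with Globex in Rotterdam "
    "covering refrigerated shipments starting January 2024."
)

doc = nlp(text)

# Turn the recognized entities into metadata attached to the source text.
metadata = {
    "entities": [{"text": ent.text, "type": ent.label_} for ent in doc.ents],
    "source_length": len(text),
}

print(metadata)
```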
Sadly, in most enterprise environments, automated metadata creation from potentially useful data is in dismal shape. It is usually done by using a combination of expensive coding experts and expensive domain experts to define and create static interpretations of data. As a result, it isn’t done often, and when it is, it requires constant attention. Better technology helps greatly.
Semantic AI technologies use natural language processing (NLP) to let domain experts converse directly with software, using the specialized language they are most comfortable with. Because Semantic AI eliminates the need to translate complex concepts through a coding expert, it is inherently more agile and accurate. It is widely used today in a variety of pursuits where specialized interpretations of data are important. To learn how your business can extract value from its complex data, watch our on-demand webinar “From Unstructured Data to Rich Insights with Semantic Technologies.”
How We Keep Data and Encoded Knowledge Together Today
The last part of our three-part problem is making sure that any time data is used, it is consumed alongside everything known about it: a useful definition, important concepts, how those concepts relate to one another, rules regarding security or privacy, and more.
Just to be clear: data without usable knowledge about the data is of limited use and can create avoidable forms of risk. Also, usable knowledge about data is of limited use if it isn’t readily available when and where the data is being consumed: informed search, contextual applications, grounded analytics, etc.
Many smart technology teams have run into this “connect data with everything we know about it” challenge in one form or another, because it shows up in many places and in many ways. Maybe you’re personally familiar with such an effort?
Most set out to integrate the three functional components through smart software and fail from sheer entropy: despite the best intentions, the integration cannot stay agile, it can’t keep up with the real world, and the project is eventually abandoned. If, however, you store data and the knowledge about that data (metadata) together as a single entity, the problem is neatly solved, and data agility follows.
When you change the metadata, you immediately change how the data is interpreted everywhere it is consumed. Ideally, you’d express your “data knowledge” as a semantic knowledge graph represented as metadata, and use semantic AI to encode and decode your unique knowledge about data quickly and effectively, in whatever specialized language people already use. That leads to the idea of storing this active data, active metadata and active meaning in a data platform like the MarkLogic platform, where they are kept together at all times.
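As a rough illustration of the idea, here is a plain Python sketch of keeping data and everything known about it together in one “envelope” style record. This is not the MarkLogic API; the field names, classifications and policies are invented for the example.

```python
import json
from datetime import datetime, timezone

# One entity that carries both the source data and the knowledge about it.
envelope = {
    "instance": {                      # the source data, unchanged
        "customer_email": "avery@example.com",
        "order_total": 42.00,
    },
    "metadata": {                      # everything we currently know about it
        "provenance": {
            "source": "orders-feed",
            "ingested": datetime.now(timezone.utc).isoformat(),
        },
        "classifications": {"customer_email": ["PII:EmailAddress"]},
        "handling": {"PII:EmailAddress": {"export": "deny"}},
    },
}

def can_export(record: dict, field: str) -> bool:
    """Every consumer reads the same envelope, so a metadata change applies everywhere."""
    meta = record["metadata"]
    for classification in meta["classifications"].get(field, []):
        if meta["handling"].get(classification, {}).get("export") == "deny":
            return False
    return True

print(can_export(envelope, "customer_email"))  # False until the handling rules change
print(json.dumps(envelope, indent=2))
```

Because every consumer reads the same envelope, updating the handling rules in the metadata immediately changes what every consumer is allowed to do with the data.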
Transform your data into knowledge and drive new initiatives with metadata management. Learn how metadata management solutions give context to data.
Jeremy Bentley
Jeremy Bentley is the founder of Semaphore, creators of the Semaphore semantic AI platform, and joined MarkLogic as part of the Semaphore acquisition. He is an engineer by training and has spent much of his career solving enterprise information management problems. His clients are innovators who build new products using Semaphore’s modeling, auto-classification, text analytics, and visualization capabilities. They span many industries, including banking and finance, publishing, oil and gas, government, and life sciences, and have in common a dependence on their information assets and a need to monetize, manage, and unify them. Prior to Semaphore, Jeremy was Managing Director of Microbank Software, a New York-based fintech firm acquired by SunGard Data Systems. Jeremy has a BSc with honors in mechanical engineering from Edinburgh University.