Metadata Enrichment – Getting the most out of your data

To deliver a high-quality personalised experience, it is crucial that the metadata being used is consistent, populated, and well-structured. However, access to high-quality metadata is not a given, and metadata management is a common challenge across the DTV industry.

At The Filter we have developed a suite of data enrichment processors as part of our Data Science Platform, our continuously evolving suite of technologies and processes which allow us to accelerate the delivery of high-quality personalised experiences, without the need for extensive data enrichment and cleansing exercises.

Our Data Science Platform has been built with flexibility in mind, allowing us to customise our models and data enrichment processes for each of our clients, with the understanding that each content catalogue is different and has its own unique set of challenges. This enables us to rapidly run bespoke processes which get the most out of your metadata and design the right features for modelling. As a result, each model we release is bespoke to your needs and, most importantly, to your users.

In this article we will cover some examples of the data enrichment processes we have developed to get the most from our clients’ data. To help bring these processes to life we will demonstrate how our solution works with the film ‘Zack Snyder’s Justice League’:

In ZACK SNYDER’S JUSTICE LEAGUE, determined to ensure Superman’s ultimate sacrifice was not in vain, Bruce Wayne aligns forces with Diana Prince with plans to recruit a team of metahumans to protect the world from an approaching threat of catastrophic proportions.<br><br>Contains strong violence, flashing images

 

DATA CLEANSING

We have developed a suite of processors designed specifically to cleanse your data and make it fit for modelling. Following a review of your data, we select the right processes for your catalogue. In this case, for example, the description is valuable but contains line breaks and text which is repeated across many titles, both of which would create unnecessary noise for our models.

In ZACK SNYDER’S JUSTICE LEAGUE, determined to ensure Superman’s ultimate sacrifice was not in vain, Bruce Wayne aligns forces with Diana Prince with plans to recruit a team of metahumans to protect the world from an approaching threat of catastrophic proportions.<br><br>Contains strong violence, flashing images

This is just one example of the common issues we encounter with metadata. Additional processors are available to carry out basic data cleansing, such as removing special characters and punctuation.
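As a minimal sketch of this kind of cleansing (not The Filter's actual processor), the function below strips break tags and a trailing advisory from a description. The two regular expressions are assumptions made for illustration: they treat any text starting with "Contains" after the final sentence as repeated advisory text, which would need checking against a real catalogue.

```python
import html
import re

# Assumed patterns, for illustration only: HTML line breaks, and a trailing
# content advisory that starts with "Contains" after the final sentence.
BREAK_TAG = re.compile(r"<br\s*/?>", re.IGNORECASE)
ADVISORY = re.compile(r"(?<=[.!?])\s*Contains\s[^.]*$")


def clean_description(raw: str) -> str:
    """Remove break tags and a trailing 'Contains ...' advisory."""
    text = html.unescape(raw)          # decode entities such as &amp;
    text = BREAK_TAG.sub(" ", text)    # replace <br>, <br/>, <BR /> with a space
    text = ADVISORY.sub("", text)      # drop the repeated advisory text
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace


raw = ("Bruce Wayne aligns forces with Diana Prince with plans to recruit a team "
       "of metahumans.<br><br>Contains strong violence, flashing images")
print(clean_description(raw))
```

In practice the advisory wording could be worth keeping as a separate field rather than discarding it, since it only adds noise when it sits inside the description.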

 

DETECTING FRANCHISES & SERIES

To help identify content which may be part of a wider franchise (e.g. James Bond movies) or part of a series (e.g. Game of Thrones S1 is part of the Game of Thrones series) we have designed bespoke processes which use methods such as Natural Language Processing (NLP) to infer these relationships.
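The simplest version of the series case can be illustrated with a rule-based stand-in rather than NLP. The sketch below assumes season markers appear at the end of a title as "S1", "Season 2" or "Series 3"; the pattern and function names are hypothetical and are not taken from our platform.

```python
import re
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Assumed season markers at the end of a title: "S1", "S01", "Season 1", "Series 2".
SEASON_SUFFIX = re.compile(
    r"\s*[-:,]?\s*\b(?:S|Season|Series)\s*(\d{1,2})\s*$", re.IGNORECASE
)


def split_season(title: str) -> Tuple[str, Optional[int]]:
    """Return (series name, season number), or (title, None) if no marker is found."""
    match = SEASON_SUFFIX.search(title)
    if match is None:
        return title.strip(), None
    return title[: match.start()].strip(), int(match.group(1))


def group_by_series(titles: List[str]) -> Dict[str, List[str]]:
    """Group titles that carry a season marker under their series name."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for title in titles:
        series, season = split_season(title)
        if season is not None:
            groups[series].append(title)
    return dict(groups)


print(group_by_series(["Game of Thrones S1", "Game of Thrones: Season 2", "Skyfall"]))
```

A rule like this only covers explicit season markers; franchise links such as the James Bond films carry no such marker, which is where language-based methods and reference data become necessary.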

For example, one of our NLP processes would identify from the title alone that Zack Snyder’s Justice League is part of a wider collection of titles:

We are also able to draw on our ‘canonical data store’ to help connect items which may not have easily identifiable links from the metadata alone. Our ‘canonical data store’ contains a large collection of titles and valuable information which can significantly aid model outputs and has been specifically designed for recommendations. Zack Snyder’s Justice League is an item in our data store and, as you can see, belongs to a wider franchise, the DC Universe:

Understanding these connections within your content is extremely valuable for personalisation. It allows you to recommend relevant content, and allows us to configure our models so that users are shown a breadth of content; for example, we may set up a rule to prevent more than two items from a franchise appearing in the top 10 results, enabling other titles to come through.
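A rule of this kind can be sketched as a simple filter over a ranked list. The function name, its parameters and the sample data are hypothetical; the sketch assumes each recommendation arrives as a (title, franchise) pair, best first, with None for titles that belong to no franchise.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple


def cap_per_franchise(
    ranked: Sequence[Tuple[str, Optional[str]]],
    max_per_franchise: int = 2,
    top_n: int = 10,
) -> List[str]:
    """Keep ranking order but allow at most max_per_franchise titles per franchise."""
    counts: Counter = Counter()
    kept: List[str] = []
    for title, franchise in ranked:
        if franchise is not None:
            if counts[franchise] >= max_per_franchise:
                continue  # franchise already has its quota, skip this title
            counts[franchise] += 1
        kept.append(title)
        if len(kept) == top_n:
            break
    return kept


# Hypothetical ranked output, best first.
ranked = [
    ("Justice League", "DC Universe"),
    ("Man of Steel", "DC Universe"),
    ("Wonder Woman", "DC Universe"),
    ("Inception", None),
]
print(cap_per_franchise(ranked, top_n=3))
```

This version drops the surplus titles entirely; an alternative design would demote them to lower positions instead.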

 

CREATING TAGS & MOODS

Descriptors such as tags and moods can be powerful features for recommendations, helping to provide a deeper level of insight into a viewer’s tastes and preferences. However, we find that many of our clients struggle to fully capture this information, or to store it in a way that allows recommendation algorithms to use it in its raw form. For example, tags are often unstandardised (cop, police and detective used to describe the same thing) or too specific: a client we started working with initially had over 3000 unique tags, of which over a quarter were only used on 1 item. Such tags can be useful for platform features such as search, but do not easily lend themselves to helping us form connections within your catalogue.

In addition to our ‘canonical data store’, which contains a large volume of tags, we have specifically designed processors which predict or extract tags and moods from your existing data. This is achieved by analysing text fields such as the description, genre, and title. Our tagging dictionary means that any descriptors inferred are standardised and not too specific; for example, any police-related content will be given a ‘police’ tag, with tags such as cop and policeman consolidated.
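The consolidation step can be illustrated with a small lookup table. The dictionary below is a toy example built from the tags mentioned above, not our tagging dictionary, and the helper names and the min_items threshold are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List

# Toy synonym dictionary (assumption): maps raw tags to one canonical tag.
TAG_SYNONYMS = {"cop": "police", "policeman": "police", "detective": "police"}


def standardise_tags(raw_tags: Iterable[str]) -> List[str]:
    """Lower-case, map synonyms to a canonical tag and remove duplicates."""
    seen = set()
    result: List[str] = []
    for tag in raw_tags:
        key = tag.strip().lower()
        canonical = TAG_SYNONYMS.get(key, key)
        if canonical and canonical not in seen:
            seen.add(canonical)
            result.append(canonical)
    return result


def drop_rare_tags(catalogue: Dict[str, List[str]], min_items: int = 2) -> Dict[str, List[str]]:
    """Remove tags used on fewer than min_items titles across the catalogue."""
    usage = Counter(tag for tags in catalogue.values() for tag in set(tags))
    return {
        item: [tag for tag in tags if usage[tag] >= min_items]
        for item, tags in catalogue.items()
    }


print(standardise_tags(["Cop", "policeman", "heist"]))
```

Filtering rare tags only affects the features used for modelling; the original tags can still be kept for search.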

Using our example’s description alone, the following tags and moods are likely to be predicted for Zack Snyder’s Justice League:

 

UNDERSTANDING THE CONTEXT

Text information such as description can be extremely valuable and often holds a wealth of information and insight. Drawing on proven models used by the likes of Google, we have designed processes to extract the most value from your text metadata and identify similarities between your content based on the context/meaning.

For example, in the case of Justice League, we can clearly tell from the description alone that the content is in relation to superheroes and how they overcome a potentially ‘catastrophic’ disaster. This is likely to create similarities with titles such as the Avengers series, as shown below.
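To make the idea of description-based similarity concrete, here is a deliberately simple stand-in: bag-of-words cosine similarity. It is not the contextual models referred to above and it only measures shared vocabulary, not meaning. The stopword list and the sample descriptions are invented for illustration.

```python
import math
import re
from collections import Counter
from typing import List

# Small illustrative stopword list (assumption).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "from", "in", "is", "with"}


def tokenise(text: str) -> List[str]:
    """Lower-case the text and keep alphabetic tokens that are not stopwords."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the word-count vectors of two texts."""
    va, vb = Counter(tokenise(a)), Counter(tokenise(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical descriptions, written for this example only.
superheroes = "A team of metahumans is recruited to protect the world from a catastrophic threat."
other_heroes = "A team of heroes must protect the world from a powerful threat."
romance = "Two strangers fall in love over one summer in Paris."

print(cosine_similarity(superheroes, other_heroes) > cosine_similarity(superheroes, romance))
```

Word counts would miss two descriptions that express the same idea in different words, which is the gap that context-aware language models are intended to close.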

 

KEY INDIVIDUALS

Again, using a combination of our ‘canonical data store’ and your existing metadata, we can extract key individuals who are related to multiple pieces of content. These can be famous directors, actors or even characters. One of our processors would identify the following individuals from the description alone, giving us another valuable feature to aid our recommendation models.

This helps us form linkages between other Zack Snyder titles and content which includes these characters.
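One simple way to picture this step is a lookup of known names against the description. The sketch below assumes a small hand-written list of names and roles standing in for a reference data store; it is an illustration, not our processor, and a production approach would need to handle aliases and ambiguous names.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical reference list of known individuals and their roles.
KNOWN_INDIVIDUALS: Dict[str, str] = {
    "Zack Snyder": "director",
    "Superman": "character",
    "Bruce Wayne": "character",
    "Diana Prince": "character",
}


def find_individuals(text: str, known: Dict[str, str] = KNOWN_INDIVIDUALS) -> List[Tuple[str, str]]:
    """Return (name, role) pairs for every known name that appears in the text."""
    found: List[Tuple[str, str]] = []
    for name, role in known.items():
        # Whole-word, case-insensitive match so "ZACK SNYDER'S" still counts.
        if re.search(r"\b" + re.escape(name) + r"\b", text, re.IGNORECASE):
            found.append((name, role))
    return found


description = (
    "In ZACK SNYDER'S JUSTICE LEAGUE, determined to ensure Superman's ultimate "
    "sacrifice was not in vain, Bruce Wayne aligns forces with Diana Prince."
)
print(find_individuals(description))
```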

 

AND MANY MORE…

These are just a few examples of the tasks our processes have been designed to accomplish. Our Data Science Platform has a wealth of additional processes we can draw on, which enable us to be flexible and get the most from your data.

As our ‘Data Science Platform’ has been built with flexibility in mind, we are constantly enhancing existing processors and developing new ones. Here are a few examples of ones we currently have in development and are refining:

 

IMAGES (ETA Q3 2021)

As they say, ‘a picture is worth a thousand words’: content imagery can be extremely powerful and can quickly provide a sense of what type of content it is. We are currently designing a processor which extracts this value from your imagery, identifying genres and moods and giving us another useful feature to call on for modelling. It is important to note that imagery can equally be misleading; as with all of these features, however, it is only a small part of a bigger and much richer picture of your content.

Example of one of our early prototype models predicting the most relevant genres for ‘Game of Thrones’ artwork

 

SUBTITLES / CLOSED CAPTIONS (ETA Q4 2021)

We are currently designing a processor which utilises subtitle/closed caption data to extract key information such as tags, characters, mood and more. In the case of Justice League, we would be able to identify all the characters (even Alfred, Bruce Wayne’s butler), the mood of the content and much more. This enables us to identify less obvious, but still very relevant, connections; for example, the TV series Pennyworth would have a strong connection due to it being based on Bruce Wayne’s butler, Alfred.

As you can see, there is a lot of information you can derive from some basic metadata. Good quality data is fundamental to achieving relevant recommendations and can generate vast uplifts in performance. For example, we achieved a 37% uplift in conversion for one of our clients’ More Like This (MLT) models by deploying a processor designed to extract valuable keywords from descriptions; these keywords were subsequently used in our model to create connections with similar but lesser-known content, something the initial model struggled to do.

Our Data Science Platform has been designed to get the most out of your data and can easily be tailored to your metadata and needs. All these features created can aid recommendations, creating a highly personal and relevant experience for your users. If you have any questions or would like to find out more, please get in contact. If you want to see the tools in action, we are happy to carry out a proof-of-concept on your data to demonstrate what’s possible.

 


For more information please email damien.read@thefilter.com

The Filter offers personalisation of key customer touch points based on proven data science. We put the right content in front of a customer at the right time. More views mean more loyalty, lower churn, and more revenue.

The Filter prides itself on working closely with clients both during set-up and roll-out to make sure that the client’s customers are always getting optimal recommendations. We are an agency not a SaaS provider. We have a genuinely deep TV understanding which means our data science solutions work exceptionally well for your viewers. We work with you to tailor our AI to your customer experience, your audiences, and your content. All integrated effortlessly into your data ecosystem allowing you to benefit from the power of personalisation easily and quickly, developing your platform at speed.