Data Tagging – The Good, The Bad & The Ugly

In the world of recommendations, Netflix has long been seen as the gold standard of metadata, content classification and data tagging. From its fabled bank of hundreds of freelance ‘Taggers’, who spent hours watching every show and deriving over 8,000 tags, to the 2,000 different user taste groups that drive every recommendation, it has the kind of rich metadata that many in the DTV space have only ever dreamed of having behind their own recommendation engine. So how does the competition, both big and small, compete with the crown jewels of the industry? Is it possible to compete, and if so, how?

The simple answer is yes, you can. But it requires any OTT or DTV provider to think seriously about how it designs, builds and manages its content metadata, and to place that metadata front and centre of everything it does. Before we explore how metadata should be structured for video platforms, let’s talk about why data tagging is so important to recommendation engines.


Data Tagging, the great connector… 

Tagging metadata is an extremely useful way to create data connections. Models need well-ordered data to make quality connections between titles, and the richer your tagging, the greater the chance a model has of picking the content most closely related to what a user has just consumed. This is not just true of recommendations; search can also benefit massively from well-tagged metadata. Rich and varied tags help search identify the content a user is aiming for, and with the growth of voice search, tags play an increasingly important role in handling the more complex ways users search for content.


I just need as many tags as I can get, right? 

More is definitely better in some cases. Search, for instance, will benefit from large, complex tagging banks. But for predictive modelling, the opposite is true: fewer tags with higher quality and stronger correlation are better. Models need to make connections between content and user history, and a tag that exists on only one item cannot connect that item to anything else, so it is pretty much redundant for modelling purposes. Having a plethora of single use tags is a common problem across all DTV providers, big or small. In most cases we see at The Filter, single use tags can make up as much as 40% of the overall data universe. This results in missed opportunities and poor content correlation.
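As a rough illustration of how the share of single use tags could be measured, here is a minimal Python sketch. The catalogue shape (a mapping of title to a list of tags), the function name `single_use_share` and the sample titles are all assumptions made for this example, not a description of The Filter’s platform.

```python
from collections import Counter


def single_use_share(catalogue):
    """Return (share, single_use_tags) for a {title: [tags]} mapping (assumed shape)."""
    # Count how many distinct titles carry each tag; set() ignores repeats on one title.
    counts = Counter(tag for tags in catalogue.values() for tag in set(tags))
    single_use = sorted(tag for tag, n in counts.items() if n == 1)
    # Share is measured against the number of distinct tags in the catalogue.
    share = len(single_use) / len(counts) if counts else 0.0
    return share, single_use


# Hypothetical sample data, for illustration only.
catalogue = {
    "Title A": ["superhero", "action", "1980s"],
    "Title B": ["superhero", "action", "alien invasion"],
    "Title C": ["action", "heist"],
}

share, single_use = single_use_share(catalogue)
print(f"{share:.0%} of tags are single use: {single_use}")
```

In this toy catalogue three of the five distinct tags appear on only one title, so none of those three can link their title to another one.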


So why does this happen? 

Very often the reasons are related to legacy systems, combined catalogues, changes in the personnel looking after tags and a lack of structure and rigour in tag management. Most of the time tags are curated by humans, so the normal inaccuracies and differences in interpretation lead to different tags being created that mean the same thing. One person’s ‘police’ is another person’s ‘cop’, ‘romantic’ or ‘romance’, ‘vaccination’ or ‘vaccine’ and so on. This creates noise for a model and the aforementioned inaccuracies. Add different languages and translations into the mix and you quickly have a recipe for out-of-control metadata. And then there are short cuts. Purchasing third party data with rich metadata can seem appealing and solves the issue of coverage; however, it invariably creates other issues by increasing the amount of unconnected data.
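One simple way to collapse equivalent tags such as those above is a hand-maintained mapping from variants to a single canonical form. The sketch below is an assumption-laden illustration: the `CANONICAL` table, the function names and the choice of which variant wins are all hypothetical, and a real catalogue would need a much larger, reviewed mapping.

```python
# Hypothetical variant -> canonical mapping, built from the examples in the text.
CANONICAL = {
    "cop": "police",
    "romantic": "romance",
    "vaccination": "vaccine",
}


def normalise_tag(tag, canonical=CANONICAL):
    """Lower-case, collapse whitespace, then map known variants to one form."""
    cleaned = " ".join(tag.strip().lower().split())
    return canonical.get(cleaned, cleaned)


def normalise_tags(tags, canonical=CANONICAL):
    """Normalise a list of tags, dropping duplicates while keeping first-seen order."""
    seen = []
    for tag in tags:
        norm = normalise_tag(tag, canonical)
        if norm and norm not in seen:
            seen.append(norm)
    return seen


print(normalise_tags(["Cop", "police ", "Romantic", "romance", "heist"]))
# ['police', 'romance', 'heist']
```

Five raw tags become three, and the two titles that previously carried ‘cop’ and ‘police’ separately would now share a tag.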


So how can poor tagging be addressed? 

The first step is to review your data. What does your tagging universe look like, and do you have a large quantity of single use tags? If the answer is yes, then you need to resolve this issue. At The Filter we have developed our data science platform to provide an automated methodology for identifying areas of concern. Using a selection of automated processes, we can target specific issues and consolidate, enrich and manage your tags, ensuring maximum connectivity. In addition, we can derive new tags from your existing data, invariably extracted from description and synopsis data. These derived tags can include keywords and people (actors and directors), and often we can also derive mood and franchises using the same processes.
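To give a feel for what deriving tags from synopsis text can mean at its very simplest, the sketch below matches a small controlled vocabulary against a description. This is only a keyword-matching stand-in under stated assumptions: the `derive_tags` function, the vocabulary and the synopsis are hypothetical, and it is not the automated processing The Filter uses.

```python
import re


def derive_tags(synopsis, vocabulary):
    """Return the tags whose trigger phrases appear as whole words in the synopsis."""
    text = synopsis.lower()
    found = []
    for tag, phrases in vocabulary.items():
        # \b word boundaries stop 'cop' from matching inside 'scope', for example.
        if any(re.search(rf"\b{re.escape(p.lower())}\b", text) for p in phrases):
            found.append(tag)
    return sorted(found)


# Hypothetical controlled vocabulary: canonical tag -> phrases that imply it.
vocabulary = {
    "police": ["police", "cop", "detective"],
    "heist": ["heist", "robbery"],
    "romance": ["romance", "love story"],
}

synopsis = "A veteran detective is drawn into one last robbery."
print(derive_tags(synopsis, vocabulary))
# ['heist', 'police']
```

Because every derived tag comes from a fixed vocabulary, this approach cannot create new single use variants, although it will miss anything the vocabulary does not anticipate.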



Central to the whole process, however, is to agree a consistent approach to how tags are created, logged and managed. A ‘tagging’ methodology should be established that is used within your business and one that can be transferred easily between stakeholders. This will mean, even after your data has been cleaned and enriched, any new content coming into your catalogue is tagged with the same criteria to ensure consistency and quality.  


What does this look like in practice? 

At The Filter we have created a central ‘entity’ methodology which is used across all forms of metadata curation and management, whether that be for genres, tags, moods or people. For tags specifically, we create different classifications, so that we can have one set that is used for search (a broad tag set with unlimited scope) and one set that is used for modelling (well-defined tags that cross over between items of content). If we take a look at an example for Wonder Woman 1984, you can see how our processes removed unnecessary or ambiguous tags.
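The idea of keeping a broad search set and a narrower modelling set could be sketched as follows. The rule used here (a tag qualifies for modelling when it appears on at least `min_titles` titles) is an assumption chosen for illustration, as are the function name and the sample data; it is not The Filter’s ‘entity’ methodology.

```python
from collections import Counter


def split_tag_sets(catalogue, min_titles=2):
    """Split a {title: [tags]} mapping into (search_tags, model_tags)."""
    # Number of distinct titles each tag appears on.
    counts = Counter(tag for tags in catalogue.values() for tag in set(tags))
    # Search keeps every tag: breadth helps users find what they are aiming for.
    search_tags = {title: sorted(set(tags)) for title, tags in catalogue.items()}
    # Modelling keeps only tags shared by enough titles to create connections.
    model_tags = {
        title: sorted(t for t in set(tags) if counts[t] >= min_titles)
        for title, tags in catalogue.items()
    }
    return search_tags, model_tags


# Hypothetical sample data, for illustration only.
catalogue = {
    "Title A": ["superhero", "action", "1980s"],
    "Title B": ["superhero", "action", "alien invasion"],
    "Title C": ["action", "heist"],
}

search_tags, model_tags = split_tag_sets(catalogue)
print(search_tags["Title A"])  # ['1980s', 'action', 'superhero']
print(model_tags["Title A"])   # ['action', 'superhero']
```

Both sets are derived from the same source tags, so a tag that later gains a second title moves into the modelling set without any manual re-tagging.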


By just changing the tags that are associated with Wonder Woman 1984 we can improve the quality of items returned from the same ‘More Like This’ model as demonstrated below. 
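To show why tag changes alone can alter what a similarity model returns, here is a toy ‘More Like This’ ranking based on Jaccard overlap between tag sets. Jaccard similarity is a stand-in chosen for simplicity, not the model The Filter uses, and the titles and tags are hypothetical rather than real Wonder Woman 1984 data.

```python
def jaccard(a, b):
    """Jaccard similarity: shared tags divided by all distinct tags across both sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def more_like_this(seed, catalogue, top_n=3):
    """Rank other titles by tag overlap with the seed title, highest first."""
    seed_tags = catalogue[seed]
    scored = [
        (other, jaccard(seed_tags, tags))
        for other, tags in catalogue.items()
        if other != seed
    ]
    # Sort by descending score, then by title so ties are deterministic.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]


# Hypothetical catalogue with inconsistent tags ('cop' vs 'police', and so on).
noisy = {
    "Seed": ["superhero", "cop", "romantic"],
    "Film X": ["superhero", "police", "romance"],
    "Film Y": ["cop", "heist"],
}
# The same catalogue after equivalent tags have been consolidated.
cleaned = {
    "Seed": ["superhero", "police", "romance"],
    "Film X": ["superhero", "police", "romance"],
    "Film Y": ["police", "heist"],
}

print(more_like_this("Seed", noisy))    # [('Film Y', 0.25), ('Film X', 0.2)]
print(more_like_this("Seed", cleaned))  # [('Film X', 1.0), ('Film Y', 0.25)]
```

With the noisy tags, the near-identical Film X ranks below Film Y purely because of wording differences; once the tags are consolidated it moves to the top without any change to the model.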


Can this type of processing be automated? 

We have developed a number of different automated solutions that can take your data and create tags using natural language processing and other machine learning techniques. We use automation to carry out the bulk of the work, providing consistency at scale. Of course, vetting this type of automation is required, but in our tests the results and quality are impressive. There is still always a need for manual intervention, however, to get the quality level as high as possible, and with our expert team this is handled for you. In addition, we draw upon our own proprietary canonical data store that can append hundreds of additional variables quickly and efficiently.


Top tips for improving the quality of your metadata 

  1. Put metadata quality front and centre of your platform – it does not matter how good your content is, if the metadata behind it is poor, then your models and ultimately your users will suffer. 
  2. Review your data – work out where the pain points are, how many single use tags you have and what areas in particular need enrichment. 
  3. Agree on a metadata structure and methodology – work out what is right for your business but put measures in place that will give you control over the consistency of the tags that are useful for modelling, whether managed in-house or outsourced to someone like ourselves. 
  4. Split your data for different use cases – don’t try to use the same tags for models and search; create data tables that can call on tags depending on the use case. 
  5. Process and clean your data – don’t expect algorithms to produce brilliant results without your data being in top-notch condition and if you need help, then come and speak to the experts! 


Tagging is just the start! 

Tagging is just one area that needs to be tackled with DTV metadata. To find out more about what we do at The Filter to help your business thrive, take a look at our blog on Metadata Enrichment – Getting the most out of your data.


About The Filter 

The Filter offers personalisation of key customer touch points based on proven data science. We put the right content in front of a customer at the right time. More views mean more loyalty, lower churn, and more revenue. 

The Filter prides itself on working closely with clients both during set-up and roll-out to make sure that the client’s customers are always getting optimal recommendations. We are an agency, not a SaaS provider. We have a genuinely deep TV understanding, which means our data science solutions work exceptionally well for your viewers. We work with you to tailor our AI to your customer experience, your audiences, and your content. Everything is integrated effortlessly into your data ecosystem, allowing you to benefit from the power of personalisation easily and quickly and to develop your platform at speed.

For more information, please email