Bulletin, October/November 2006

IA Column
Changing Approaches to Metadata at bbc.co.uk: From Chaos to Control and Then Letting Go Again

by Karen Loasby

Karen Loasby is information architecture team leader for the BBC. She has been at the BBC for five years working on content modeling, controlled vocabularies, metadata schemas and automatic indexing and generally advocating the importance of the human touch here and there in the information retrieval process. She can be reached at karen.loasby<at>bbc.co.uk.

The BBC website is a rich, sprawling landscape. Much of the content is a long journey from the homepage, and many of the audience who fund the site are unaware of the existence of their perfect piece of content in its hinterlands. BBC staff have suspected for many years that metadata could be the solution to guiding the audience through the site, but it has not been simple to find the right approach.

2002: Meta-Lies and Misinformation
If we wind back the clock four years, we find that interest in metadata was restricted to a few search engine optimization enthusiasts. BBC guidelines required some basic metadata to be added to pages with most of the focus on keywords. The motivation was exclusively to improve the ranking of Web pages in search engines, primarily the BBC site search and Google. 

Even the tagging that did exist was often of dubious quality. Sites would store a standard set of keywords in their page templates, which resulted in pages having duplicative and unspecific keywords. It was not unusual to find metadata fields populated with 

“Insert metadata here”
or even
“Barbara, could you put the keywords in here?”

For the most part the keywords made the site search worse, and decent search engines ignored them. 

2004: Structured, Controlled and Enforced
Fast-forwarding two years, we find that interest in metadata has shifted from improving search to the possibility of powering feeds and aggregation pages. This change in focus made it vital to ensure that the metadata was unambiguous, as inadvertent and inappropriate connections between content can be damaging. Wal-Mart discovered this truth to their cost when their movie recommendations system accidentally made inappropriate links between DVDs (see www.msnbc.msn.com/id/10736265/).

Controlled vocabularies (CVs) were created to avoid the ambiguity problems. CVs covering everything the BBC might broadcast were inevitably monsters, and semi-automatic systems were brought in to lighten the load on journalists. In spite of these steps the metadata tagging was still disliked. It wasn’t used for much in the early days, and so it was hard to convince journalists to spend time on it. Even with committed taggers, consistency was hard to achieve, as news stories were difficult to describe objectively. 

Compounding the problems associated with getting good tags, the metadata system was only applied to the parts of bbc.co.uk that were produced in the content management system (CMS), so cross-site aggregation was hampered. The huge CVs were also maintenance intensive since the news domain was constantly growing. 

The CV approach faced a backlash as teams began experimenting with folksonomies and tag clouds. The philosophy-centered design of those who believed we shouldn’t presume to categorize content for our users collided happily with the budget-centered design of those endeavoring to cut costs. The idea that the cheapest solution might be the most ethical and effective was tantalizing. For some it felt that metadata had reached a crisis point.

2006: The Golden Age of Metadata?
The crisis didn’t come to pass. Today metadata is a surprisingly hot topic. Both the director-general and the director of new media scatter major speeches with phrases such as “cracking metadata labeling” and “awesome metadata.” Findability and access to the long tail have become organization-wide issues. 

It is hard to pin the changing fortunes of metadata on any one thing, but there are a number of possible factors: 

  • The rise of digital program content needing tagging. Some of the program metadata is less subjective than for news stories, and it is harder to argue that the BBC shouldn’t be describing this content. The audience expects us to have brand and actor metadata.
  • A focus on audio-video content (AV) rather than Web articles. There is no text to search so metadata is a necessity, not a nice-to-have.
  • Increasing production of data-driven prototypes that can demonstrate the possibilities of metadata. One prototype, the Open Archive (www.hackdiary.com/images/peel-contrib.png), also made use of a rich store of ready-made metadata from the internal BBC program catalog. With Open Archive it was not necessary to visualize what the service could be like since you could experience the serendipitous connections now. 

Maintenance costs and responsiveness were still a problem. A compromise solution (known as the metadata threshold) allows for free-text tagging that is absorbed into formal CVs when enough content is tagged with that term. The solution aims to combine cheap and responsive tagging with unambiguous aggregation power. So far it has been very successful at slashing overheads. The controlled vocabulary, semi-automatic suggestions and metadata threshold are still coupled to the CMS, but the development of an application programming interface (API) should resolve the issue of tagging content being produced in different systems. 
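
The threshold idea can be illustrated with a small sketch. It is not the BBC's implementation: the class name, the normalization rule, the threshold of three items and the sample terms are all assumptions made for illustration. It shows only the general shape of the mechanism, in which free-text terms accumulate until enough distinct pieces of content carry them and are then treated as controlled terms.

```python
from collections import defaultdict


class ThresholdVocabulary:
    """Hypothetical sketch: promote a free-text tag into the controlled
    vocabulary once enough distinct content items have been tagged with it."""

    def __init__(self, threshold=3, controlled=()):
        self.threshold = threshold  # assumed value; the article gives no number
        self.controlled = {self._normalise(term) for term in controlled}
        self._usage = defaultdict(set)  # free-text term -> ids of tagged content

    @staticmethod
    def _normalise(term):
        # Collapse case and whitespace so trivial variants count as one term.
        return " ".join(term.lower().split())

    def tag(self, content_id, term):
        """Record a tag and return True if the term is (now) controlled."""
        key = self._normalise(term)
        if key in self.controlled:
            return True
        self._usage[key].add(content_id)
        if len(self._usage[key]) >= self.threshold:
            # Enough content uses the term: absorb it into the vocabulary.
            self.controlled.add(key)
            del self._usage[key]
            return True
        return False

    def pending(self):
        """Free-text terms still below the threshold, with their item counts."""
        return {term: len(ids) for term, ids in self._usage.items()}


vocab = ThresholdVocabulary(threshold=3, controlled=["Doctor Who"])
for story in ("story-1", "story-2", "story-3"):
    promoted = vocab.tag(story, "podcasting")
print(promoted, sorted(vocab.controlled))
```

Counting distinct content items rather than raw tag events is a design choice in this sketch, so that one heavily tagged story cannot promote a term on its own. A real system would presumably also route promoted terms to a vocabulary editor for disambiguation before they are used for aggregation.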

We are still not really tapping into end-user language, and we will have to do so for the more subjective facets such as mood and style. Users will have to want to gather up and describe assets, so program AV may end up being folksonomy’s savior, too.

More and more we expect to break the back of the work with automation while using human brainpower to perfect the results. We will have to harness the power of folksonomies while remembering there is stuff our audience will demand we know about our own content. Most of all, we have to ensure our choices of metadata systems are made with the user in mind.