4 What We Talk About When We Talk About Metadata (Laura Dawson)

Laura Dawson is Product Manager, Identifiers at Bowker. She’s a 25-year veteran of the book industry, having worked in e-commerce (Barnes & Noble.com), libraries (SirsiDynix), and publishing (Doubleday and Bantam), and has worked as an independent consultant offering expertise on the digital transition for clients including McGraw-Hill, Alibris, Ingram Library Services, Bowker, and Muze. You can find Laura on Twitter at: @ljndawson.

Introduction

As Brian O’Leary noted in his previous chapter, metadata assumes a critical importance once the content is out of the container. Those of us who make our living in publishing by working with metadata regard its sudden popularity with a mix of amusement (that something previously regarded as so dry is now sexy) and exasperation (what took you so long?). In practice, metadata has been important in bookselling for many, many years. But now that a book is no longer necessarily a physical object, discoverability via metadata has become a front-office problem.

While metadata has been important for a while and is now in a new spotlight, the term still means many different things to many different people.  To address that, I think it’s worth revisiting what the book industry means when it says “metadata.”

A Little History

We’ve come to think of metadata as a collection of attributes—ISBN, title, author, copyright year, price, subject category, etc.  This title-level metadata started with library catalogs. Bibliographies as published monographs were, essentially, big books of metadata. “Books in Print,” those large volumes that tried to list every book that could possibly be obtained, were the same. As libraries developed computerized systems and moved away from print bibliographies, MARC[1] became the standard metadata format for catalog records.

The concept of digital metadata for the commercial book world originated in the 1970s and early 1980s, when bar coding on books was introduced and EDI transactions between retailers and publishers began. Trade publishing metadata was very different from that of the scholarly and library world. It was limited, in many cases consisting only of ISBN, availability, and price; because transactions revolved around a physical object, no more metadata than that was really necessary.

But the efficiency of digital metadata could not be denied. Even prior to e-commerce, retailers like Barnes & Noble and Borders rose to prominence because they made great use of computer transactions and scanners, realizing tremendous speed and logistics savings through their computer systems. Metadata—as bare-bones as it was—was a crucial element in the success of the superstore. A database of inventory (consisting of ISBN, title, author, price, status, quantity on hand, quantity on order, and where in the store the book was supposed to be shelved) allowed store personnel to know stock levels and locations of books.

Metadata changed again with the development of graphical user interfaces (GUIs) and the rise of Amazon. Until the early 1990s, computer systems in both libraries and bookstores were large mainframes with dumb green-screen terminals. Software based on Microsoft Windows made it easier and more intuitive to display information, supporting a lot more innovation. Launched in 1995, Amazon took full advantage of these opportunities. In fact, when the meteor of Amazon’s online bookstore hit the publishing industry, it was clear that the world of metadata was never going to be the same.

Through Amazon, consumers were looking at metadata for the first time. No longer relegated to wholesalers’ warehouses and library reference desks, book metadata was front-and-center on the website of “The World’s Largest Bookstore.” Suddenly ISBN and price were not enough.

Why not? Because in order to figure out what they were buying, whether they were interested in buying at all, or which forthcoming books they had to look forward to, consumers needed to see the metadata, too.

Consumers wanted to know as much about each book as humanly possible. They wanted cover images, robust descriptions, and excerpts. They wanted to know when a book was published or going to be published—they wanted to place orders for books before they even rolled off the presses. In response, publishers frantically began supplying their warehouse data. This was frequently garbled, including truncated titles, TITLES IN ALL CAPS, misspelled author names, and nonstandard abbreviations. Amazon (and eventually its competitors) hired staffs of data editors whose job it was to clean up the information received from the ever-widening array of sources.

Libraries soon followed suit, demanding Amazon-like web-based catalog software from their software providers. Books-in-Print, formerly those large volumes of titles relegated to the reference desk, produced weekly CD-ROM updates that libraries (and retailers) could subscribe to. A host of services arose to fulfill the needs of both online retailers and libraries, providing additional content that could not be reliably provided by publishers.

These suppliers included Syndetics (ultimately bought by Books-in-Print’s parent company, Bowker), Muze (now Rovi), Firebrand’s Eloquence, and NetRead’s Jacketcaster. The value of metadata can be seen in the longevity of the firms that help publishers manage it. Even after fifteen years of enormous upheaval in the book industry, all of these companies remain very much in business.

This is because most book sales are now happening online. Consumers are using the web to browse and search for the titles they want. And if there is insufficient (or inaccurate) metadata for those books, consumers simply will not find them. The publisher (and retailer) with the best, most complete metadata offers the greatest chance for consumers to buy books. The publisher with poor metadata risks poor sales—because no one can find those books.

By 1998, it was clear that the metadata marketplace had reached Babel-like proportions. File formats proliferated, and both data receivers and data senders were overstretched in trying to produce and ingest feeds.  The number of book-selling websites and libraries that required metadata had grown so large that the Association of American Publishers called a meeting in New York City, for the first time bringing all concerned parties to the table. It was time for a metadata standard.

Thus began ONIX: ONline Information eXchange.[2] A global standard overseen by EDItEUR, ONIX is perpetually in development. The US standards body is the BISAC (Book Industry Standards And Communications) Metadata Committee, which operates within the Book Industry Study Group (BISG). ONIX, an XML data transmission protocol, quickly became the lingua franca among retailers, distributors, and publishers. Even libraries developed an ONIX-to-MARC mapping, allowing them to use ONIX records to power their online public-access catalogs (OPACs) so that library patrons could view the same information that appeared on Amazon, BarnesandNoble.com, and similar websites.
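Because ONIX is XML, an ONIX record is just a tree of named elements that any trading partner can generate or parse. The sketch below builds a drastically simplified, ONIX-flavored product record in Python; the element names loosely echo ONIX 3.0 (ProductIdentifier, IDValue), but this is an illustration only, not a valid ONIX message, which carries many more mandatory elements and a surrounding message envelope.

```python
import xml.etree.ElementTree as ET

def build_product_record(isbn: str, title: str, price: str) -> str:
    """Build a minimal, ONIX-flavored XML product record.

    Element names loosely follow ONIX 3.0, but this is a sketch:
    real ONIX feeds require many more elements and composites.
    """
    product = ET.Element("Product")
    ident = ET.SubElement(product, "ProductIdentifier")
    # In ONIX code lists, identifier type 15 denotes an ISBN-13.
    ET.SubElement(ident, "ProductIDType").text = "15"
    ET.SubElement(ident, "IDValue").text = isbn
    ET.SubElement(product, "TitleText").text = title
    ET.SubElement(product, "PriceAmount").text = price
    return ET.tostring(product, encoding="unicode")

record = build_product_record("9780000000000", "An Example Book", "19.99")
print(record)
```

Because the record is plain XML, a retailer, distributor, or library system can parse it with any standard XML tooling — which is precisely why ONIX became the lingua franca of the supply chain.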

As metadata standards, ONIX and MARC held steady for about 12 years. “Metadata” came to mean a basic set of tags, or fields, that described a physical book: ISBN, title, author, price, copyright year, synopsis, subject codes, cover image, availability, excerpts, reviews, and several other bits of information that make it easier for customers and patrons to decide whether or not they want to acquire a book via online means.

As the share of online book sales and lending increased, this metadata grew in importance. No longer could publishers, booksellers, or libraries rely on in-store or in-library displays to lead readers to books. Gradually, the value of metadata as a tool to drive “discovery” began to take hold, as retailers, distributors, and lenders considered how readers might discover and locate the books they wanted to read when searching in an online environment.

To this point, the metadata fields in ONIX and MARC largely described a physical product: a hardcover or paperback version of a book. To some extent, they also described intangible aspects of that product—what the book is about, for example. But the metadata we had been using until very recently was, by and large, developed to describe what Brian O’Leary refers to[3] as “the container”—the physical manifestation of a work.

Out of the Container

But what happens when books are set free from their containers (as O’Leary describes in his “Context First”[4] presentation)? How do we describe those products (or services)? How do you ensure that readers can find content that—at least physically—defies description?

As the market migrates from print to digital, metadata becomes an even more critical issue. Without metadata, ebooks are invisible. Because they are not present in our physical world, there is no chance that readers will bump into them serendipitously the way they bump into print books—typically, by seeing other people reading them or catching sight of them on a bookstore table. It is possible to receive a digital book as a gift, but the giver must still discover it.

Ebooks face a discoverability problem that print books never have: they are only discoverable online and by word of mouth.  As far as the digital reader is concerned, without good metadata, the ebook doesn’t exist.

EPUB 3 and Metadata

A printed object does not give up its metadata automatically; gathering it is a separate process. At some stage in the supply chain, a warehouse employee holds an actual printed book in her hands and enters all the relevant data she can glean from its cover, copyright page, and title page. She flips through the book to get a page count. She weighs and measures the book. An ONIX record (magically!) gets created and sent to trading partners. The print book is then shipped separately. These “book in hand” programs have provided publishers, distributors, and retailers with foundational book metadata for years.

Fortunately, ebooks offer a possibility that print books do not: the ability to extract metadata directly from the files themselves. With EPUB 3[5] in particular, it’s possible for publishers to embed relevant metadata within the file, and for their trading partners to then extract this metadata and use it as they need to. The metadata travels with the product, rather than separately.
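This "metadata travels with the product" idea is concrete: an EPUB is a zip archive, and its package (OPF) file carries Dublin Core metadata that a trading partner can read straight out of the file. The sketch below builds a tiny EPUB-like zip in memory and extracts the title and identifier; the OPF path and contents are deliberately minimal (a real EPUB 3 package also has a manifest, a spine, and a META-INF/container.xml pointing to the OPF).

```python
import io
import zipfile
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"  # Dublin Core namespace

# A stripped-down package (OPF) file -- illustrative, not a
# complete or valid EPUB 3 package document.
OPF = """<?xml version="1.0"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier>urn:isbn:9780000000000</dc:identifier>
    <dc:title>An Example Book</dc:title>
  </metadata>
</package>"""

# Build a minimal EPUB-like zip in memory. The OPF path here is an
# assumption; real readers locate it via META-INF/container.xml.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("OEBPS/content.opf", OPF)

# A trading partner can now pull metadata directly from the file,
# with no separate "book in hand" step.
with zipfile.ZipFile(buf) as z:
    root = ET.fromstring(z.read("OEBPS/content.opf"))
    title = root.find(f".//{DC}title").text
    identifier = root.find(f".//{DC}identifier").text

print(title, identifier)
```

The point is the round trip: the same file that delivers the content also delivers a machine-readable description of it.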

EPUB is an XML format, and it’s useful to remember that (as with all things XML) extensibility offers great flexibility and control—and great responsibility. Because it’s possible to enhance product metadata with tags that are not necessarily part of any standard, it’s critical to use those nonstandard elements intelligently.

Who Is Your Audience?

When an object is no longer confined by its container—in this case, book covers—describing it is challenging. As content forms evolve, it is likely that we won’t always be able to say that the thing we are talking about is, in fact, a “book.” Perhaps it’s a database, or an online resource. This harkens back to Books in Print, which went from book to CD-ROM to website-with-an-API.

At the 2011 BISG Making Information Pay Conference in New York, Madi Solomon of Pearson described[6] her efforts to create good metadata for Pearson’s biology products. These products ranged from print books to HTML ebooks to EPUB ebooks to databases. At the same time, Solomon noted, “lots of metadata” is not the same as “good metadata.”

So what do we mean when we say “good metadata”?  As with most things in the digital age, that depends on who’s asking.

Librarians rely on several metadata standards: the Dublin Core,[7] Library of Congress,[8] METS,[9] MARC,[10] and (to some degree) ONIX. All of these standards help librarians describe, locate, purchase, and recommend books (and ebooks).

Distributors, wholesalers, data aggregators (such as Bowker), and retailers rely on ONIX metadata, in large part. While certainly not perfect, ONIX has proven to be reliable and extensible, evolving to meet the challenges of e-commerce, of selling print books via digital means.

However, it’s still early days for the ebook trade. Many ebook retailers (who are not traditional book companies) require publishers to submit (long, wide) spreadsheets of metadata. Some of these ebook retailers do not accept ONIX at all, requiring wholly different data sets.

Some of this makes sense.  Certainly digital book retailers don’t need to see the nonexistent weights and measures of ebooks.  Nor do page counts make much sense.  But they do need to understand the length of the ebook and what sorts of copy and print rights the publisher is granting on that material. While the newest version of ONIX offers expanded capabilities for describing ebooks, ebook retailers (BarnesandNoble.com and Kobo excepted) are not using it.

The point of metadata is not that it comes in any particular format. ONIX, MARC, and spreadsheets are all just the different containers (there’s that word again) for information that trading partners need. The important thing for publishers is that while trading partners will probably take different types of containers, depending on their needs, what’s IN those containers is going to determine whether or not the books get sold. Is the author name spelled right? Is the title accurate? Does the description of the book really describe the book (and not simply say it’s the best resource out there)? Are the subject headings accurate? Is the price right?
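The questions at the end of that paragraph can, in part, be turned into mechanical checks that run the same way regardless of which container the metadata arrives in. The sketch below runs a few such sanity checks over a record represented as a plain dictionary; the field names are illustrative, not drawn from ONIX, MARC, or any retailer's spreadsheet template.

```python
def check_record(record: dict) -> list[str]:
    """Run a few basic quality checks on a metadata record.

    A sketch only: the field names ("title", "author", "price",
    "description") are assumptions for illustration, and real
    pipelines check far more (subject codes, identifiers, etc.).
    """
    problems = []
    title = record.get("title", "")
    if not title:
        problems.append("missing title")
    elif title.isupper():
        problems.append("title is in ALL CAPS")
    if not record.get("author"):
        problems.append("missing author")
    if len(record.get("description", "")) < 50:
        problems.append("description too short to describe the book")
    try:
        if float(record.get("price", "0")) <= 0:
            problems.append("price missing or non-positive")
    except ValueError:
        problems.append("price is not a number")
    return problems

# A record with exactly the garbling described earlier in the chapter.
problems = check_record({"title": "AN EXAMPLE BOOK", "author": "", "price": "x"})
print(problems)
```

Checks like these are roughly what the data editors hired by Amazon and its competitors were doing by hand; the difference now is that a publisher can run them before the feed ever leaves the building.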

How I Learned to Stop Worrying and Love Metadata

Of course, the ultimate audience for metadata, whether for print or digital books, is the consumer.

Selling digitally means, of course, that it is possible to appeal to many different kinds of consumers—there’s no single “consumer” anymore. Different markets require different descriptors; many books are suitable for several specifically targeted markets, each with its own vernacular.

As is often said, this stage of metadata development is “in its early days.” In practical terms, this means there is no universally accepted standard for consumer metadata for out-of-the-container content yet. There may never be a true standard.

This idea may seem distressing, but it reflects an aspect of the digital market that allows us to think multi-dimensionally.

Because it’s possible to embed metadata in an EPUB file, many metadata schemas can be embedded at once. Provided the recipient of each EPUB file has the correct schema to interpret its appropriate metadata, a file can carry as much metadata as an industry can throw at it. Although some concerns have been expressed about the bandwidth needed for successfully uploading and downloading these rich files, schemas are text-based and unlikely to clog most pipelines.

This digital flexibility makes it possible for a single EPUB file to contain a MARC metadata set, an ONIX metadata set, and a proprietary, consumer-friendly taxonomy that only the publisher’s website can render. Barnes & Noble can ignore the MARC and proprietary schemas; Library of Congress can ignore the proprietary schema, and the publisher’s website can ignore the MARC schema.
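The "ignore what you don't understand" pattern described above is simple to sketch: tag each embedded metadata block with the scheme it follows, and let each recipient keep only the schemes it knows how to interpret. Everything here — the scheme labels, the payloads, the sets of understood schemes — is hypothetical.

```python
# Each embedded metadata block is tagged with the scheme it follows.
# The payloads are placeholders, not real MARC/ONIX/JSON data.
embedded = [
    {"scheme": "onix",        "payload": "<Product>...</Product>"},
    {"scheme": "marc",        "payload": "=245 10$aAn Example Book"},
    {"scheme": "proprietary", "payload": '{"mood": "beach read"}'},
]

def extract(blocks: list[dict], understood: set[str]) -> list[dict]:
    """Keep only the blocks whose scheme the recipient understands;
    everything else is silently ignored, not an error."""
    return [b for b in blocks if b["scheme"] in understood]

retailer = extract(embedded, {"onix"})           # a retail site
library = extract(embedded, {"marc", "onix"})    # a library system
pub_site = extract(embedded, {"proprietary"})    # the publisher's own site
print(len(retailer), len(library), len(pub_site))
```

The design point is that unrecognized schemas cost a recipient nothing: filtering is cheap, so abundance in the file does not impose complexity on any single consumer of it.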

Digitization allows for abundance; standards allow for filtering. Metadata is that filter, that lens.

When thinking about metadata, it’s useful to remember the parable of the blind men and the elephant.[11] In this tale, several blind men are presented with an elephant. Each uses his hands to feel a different aspect of the animal: the trunk, an ear, a tusk, a foot, the tail. Each comes up with a different descriptor of the animal: “It’s like a snake,” “It’s big and floppy,” “It’s sharp and hard,” “It’s sturdy and heavy,” “It’s like a rope.” Each interpretation is correct, but limited.

Metadata is that descriptor. ONIX describes a book from one point of view, MARC from another, and consumers will describe a book from a completely different point of view (“That book with the blue cover that was on Oprah yesterday”). It’s important to remember that no single metadata schema describes a book to the full satisfaction of everyone involved in its creation and consumption. That schema would be horribly bloated and ultimately quite fragile.

Over the next decade, this flexibility will frame the challenge publishers and their intermediaries will face.  It’s okay to have multiple metadata schemas—in fact, it’s necessary. It’s okay to have different audiences for metadata; not everybody needs to know the same thing about a book. Much as there’s no one way to describe an elephant, there’s no one way to describe a book.  Developing the workflows that capture and maintain the range of descriptions that “describe a book” will be critical in a world in which “discovery” increasingly means “found it online.”


Give the author feedback & add your comments about this chapter on the web: http://book.pressbooks.com/chapter/metadata-laura-dawson


2 Responses to What We Talk About When We Talk About Metadata (Laura Dawson)

  1. Pilar Wyman on November 7, 2011 at 3:07 pm says:

    What about internal metadata, as well as external? That is, what about indexes?

  2. Anne Hill on April 3, 2012 at 4:03 am says:

    Fascinating, thanks. This has been the most interesting and useful essay in the (book? website? container? collection?) for me so far.

    As an independent author and publisher, I took great care in fine-tuning the metadata in my most recent epub. Uploading it to all the various distributors was a snap, but Kobo wanted a separate spreadsheet (or XML file) just for ONIX metadata. Really? I’ve got all the metadata in the epub. Really—now I have to fire up Excel just for you?

    Reading your article helped me understand why, though it didn’t explain why Kobo doesn’t have an online ONIX input form. At least now I will rest assured that in another few years this too will be obsolete.
