September 25, 2009

A mishmash wrapped in a muddle?

Geoffrey Nunberg thinks Google's book search could be a disaster for scholars. In a recent article in The Chronicle of Higher Education, he cites serious problems with the way Google extracts a book's metadata, problems that diminish the utility of Google's great book project.
What if you are interested in books simply as records of the language as it was used in various periods or genres? With the vast collection of published books at hand, you can track the way “happiness” replaced “felicity” in the 17th century, quantify the rise and fall of propaganda or industrial democracy over the course of the 20th century, or pluck out all the Victorian novels that contain the phrase "gentle reader." But to pose those questions, you need reliable metadata about dates and categories, which is why it's so disappointing that Google book search's metadata are a train wreck: a mishmash wrapped in a muddle wrapped in a mess.
Nunberg gives numerous examples of metadata errors, such as wrong publication dates (for example, a Bob Dylan biography dated 1899 and 29 references to Barack Obama in books dated before he was born). He also points out serious classification errors, such as "an edition of Moby Dick is labeled Computers; The Cat Lover's Book of Fascinating Facts falls under Technology & Engineering."

While Google blames the errors on the libraries and publishers who supply the books for the scanning project, Nunberg points out that it's not libraries supplying bad information and probably not the publishers either. Google is using bookseller classifications instead of more rigorous library classifications. No librarian or bookseller would classify a Mae West biography under Religion, even if the biography's subtitle is "An Icon in Black and White," he notes.

The real problem, writes Nunberg, is that "Google has taken a group of the world's great research collections and returned them in the form of a suburban-mall bookstore."

And some of the metadata is just plain sloppy. Two examples Nunberg found are:
  • Moby Dick: or the White Wall
  • The Mosaic Navigator: The Essential Guide to the Internet Interface, dated 1939 and attributed to Sigmund Freud and Katherine Jones
Google has responded and pledged to fix errors as they are reported. Nunberg clearly states why Google needs to "get it right":
No competitor will be able to come after it on the same scale. Nor is technology going to lower the cost of entry. Scanning will always be an expensive, labor-intensive project. Of course, 50 or 100 years from now control of the collection may pass from Google to somebody else—Elsevier, Unesco, Wal-Mart. But it's safe to assume that the digitized books that scholars will be working with then will be the very same ones that are sitting on Google's servers today, augmented by the millions of titles published in the interim.
The scope of the project means that Google has "the responsibility of making its collections an adequate resource for scholarly research. That means, at a minimum, licensing the catalogs of the Library of Congress and OCLC Online Computer Library Center and incorporating them into the search engine so that users can get accurate results when they search on various combinations of dates, keywords, subject headings, and the like."

Next, we'll summarize Nunberg's conclusions and his reasons for being optimistic about Google Book Search.
