Google's Book Search: A Disaster for Scholars

http://chronicle.com/article/Googles-Book-Search-A/48245/

I dug up this almost year-old op-ed, published in The Chronicle of Higher Education, because it highlights some of the most challenging problems associated with metadata, classification, and the description of collections.

Google's ambitious Book Search project is perhaps the world's largest digital library. One of the fundamental problems with the system seems to be the unreliability of its metadata. Google's lack of effort to improve metadata quality is not surprising, given that the search giant's greatest achievement is its ability to locate useful information without relying on metadata or on Yahoo-like classification schemes. However, as the author points out, books are not merely vehicles for information; they also carry rich contextual information, such as the year of publication, the volume a work belongs to, its classification or genre, and, in the case of a journal, the issue number. While Google captures most of this information for its search index, the data is riddled with errors, and that unreliability impedes any kind of scholarly research using the system. For instance, linguists may wish to use this vast collection to track the way "happiness" replaced "felicity" in the 17th century, or to identify all the Victorian novels containing the phrase "gentle reader". Such questions cannot be answered without reliable metadata about dates and categories.
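To make the dependence on dates concrete, here is a minimal sketch of the kind of query a linguist might run. It assumes each book is available as a (year, text) pair; the function name `counts_by_century` and that record shape are my own illustration, not anything Google Book Search exposes.

```python
import re
from collections import Counter, defaultdict


def counts_by_century(records, words):
    """Count how often each tracked word appears, grouped by century.

    records: iterable of (publication_year, text) pairs (hypothetical shape).
    words:   the words to track, e.g. ["happiness", "felicity"].
    """
    wanted = {w.lower() for w in words}
    totals = defaultdict(Counter)
    for year, text in records:
        # 1601-1700 maps to the 17th century, 1701-1800 to the 18th, etc.
        century = (year - 1) // 100 + 1
        for token in re.findall(r"[a-z]+", text.lower()):
            if token in wanted:
                totals[century][token] += 1
    return totals
```

Every count is filed under whatever year the record carries, so a single misdated book silently moves its words into the wrong century. The analysis is only as good as the dates.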

Google likes to refer to its Book Search as a library but ultimately views books as just another resource to be incorporated into greater Google. For the typical act of googling - keyword-based text search - the scanned text within the pages of the books is sufficient, and the rich metadata provided by a library catalog is of little value. However, scholars are often interested in finding a book for reasons that have nothing to do with the information it contains. In such cases you need metadata, for instance if you are looking for a particular edition of a book or assembling all the French editions of a book published in the 18th century. Book Search has widespread errors in publication dates. A book on Peter F. Drucker is dated 1905, four years before the management consultant was even born. A search for "Internet" in books published before 1950 produces 527 results; "Medicare" for the same period gets almost 1,600. Many famous writers and public figures appear in search results for works dated before the year of their birth.
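Anachronisms like these are exactly the kind of error a simple sanity check could flag. Below is a rough sketch, assuming a hypothetical `BookRecord` with a catalog year and some scanned text, plus a hand-made table of terms and the year before which each should not appear. The 1950 cutoffs simply mirror the searches described in the article, and 1909 is Drucker's birth year as implied by it; none of this reflects how Google actually stores or validates its records.

```python
from dataclasses import dataclass


@dataclass
class BookRecord:
    title: str
    year: int   # publication year as recorded in the catalog metadata
    text: str   # scanned full text, or a snippet of it


# Hypothetical lookup table: term -> year before which it is not expected.
NOT_EXPECTED_BEFORE = {
    "internet": 1950,
    "medicare": 1950,
    "peter f. drucker": 1909,
}


def suspicious_terms(record):
    """Return the terms that look anachronistic for the record's stated year."""
    text = record.text.lower()
    return [
        term
        for term, earliest in NOT_EXPECTED_BEFORE.items()
        if record.year < earliest and term in text
    ]
```

A record dated 1905 whose text mentions Peter F. Drucker would be flagged, while the same text dated 1985 would pass. A check like this cannot say what the correct date is, only that the recorded one deserves a second look.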

Apart from metadata consistency errors, there are classification errors, which again severely limit the potential for scholarly use. The BISAC categories that Google uses are a disastrous choice for a library of this scale. They were designed for large chain bookstores and may perhaps have potential for ad placement. H.L. Mencken's The American Language is classified as Family & Relationships. An edition of Moby Dick is labeled Computers. Moreover, there are occasional errors in which works are renamed.

Not all of these errors can be attributed to Google's algorithms, and a large proportion of them come from various publishers. Still, they clearly highlight the complex problems of organizing information at this scale, akin to the Library of Babel. These problems also mean that scholars will have to put on hold their visions of answering questions such as how "United States" shifted from a plural to a singular noun phrase over the first century of the republic, because the metadata simply won't allow it.