Google Books:
     The Metadata Mess

     Google Book Settlement Conference
     UC Berkeley
     August 28, 2009

     Geoff Nunberg,
     School of Information

1
The Last Library
     "The cost of creating such a library and Google’s significant
     lead time advantage suggest that no other entity will create
     a competing digital library for the foreseeable future."
     Directors of ALA, ACRL, ARL in letter to DOJ Antitrust
     Division, July 29, 2009
     There is no Moore's Law for capture…
     Hence the urgency of concerns about pricing, access,
     exclusivity, privacy… and "quality"

2
Whose interests determine "quality"?
     Google Book Search is "a tremendous public
     good for students, for teachers, for scholars, for
     everyone." Derek Slater, Google
        … but students, scholars and "everyone" may have
        different purposes for using GBS.

3
Three ways of using GBS
     What "Googling" means: barrelling in sideways

     GBS as a borough of Greater Google
        "We just feel this is part of our core mission. There is
        fantastic information in books. Often when I do a
        search, what is in a book is miles ahead of what I find
        on a Web site." Sergey Brin

4
Three ways of using GBS
     Seeking out works & editions: the
     "destination experience"
        A particular edition of Leaves of Grass
        A good edition of Tristram Shandy
        18th-c. French editions of Don Quixote,
        etc.

     The importance of metadata: who,
     when, where, etc.

5
Three ways of using GBS
     "Batch processing": data mining and
     "electronic philology"
        "It's only reporters and computational linguists who
        care if [hit-count estimation] is really precise." Peter
        Norvig, Google
     Text databases and the "new philologies":
        The importance of language to social, intellectual, and
        political history & literary study
        Coincides with the emergence of large-scale historical text
        databases…
           When did happiness replace felicity in the 17th c.?
           Plotting the rise & fall of propaganda
           How did liberalism spread in the early nineteenth-century
           European context?

6
Good enough for scholarship?
     Will GBS be an adequate resource for scholarly
     needs… now and in the future?
     Depends on:
        Quality of imaging
        Reliability and robustness of search tools
        Quality and reliability of metadata
          e.g., date, edition history, author, subject classification,
          etc.

7
Good enough for scholarship?
     Will GBS be an adequate resource for scholarly
     needs… now and in the future?
     Depends on:
        Quality of imaging
        Reliability and robustness of search tools
        Quality and reliability of metadata
          e.g., date, edition history, author, subject classification,
          etc.

     But GBS metadata are awful.

8
Quality Issues:
     Botched Scans, OCR, &c.

9
Metadata Issues:
     1899, annus mirabilis

10
Random Dates
     (Dates appearing on the slide: 1905, 1848, 1900, 1888)

11
The pervasiveness of misdatings
     527 hits returned for "Internet" before 1950
     (Dates appearing on the slide: 1899, 1905, 1878, 1905, 1946,
     1905, 1905, 1905, 1939)

12
Famous before their lifetime
     182 hits reported for "Charles Dickens" before birthdate
     (1812)
     Cf. Jimi Hendrix, 81; Led Zeppelin, 59, etc.
     (Dates appearing on the slide: 1878, 1905, 1946, 1905, 1905)

13
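
A check of the kind this slide implies can be sketched in a few lines of Python. This is only an illustration, not a description of how GBS or any library catalog works: the Record type, the anachronistic function, and the sample volumes are all hypothetical, and the only figure taken from the slide is the 1812 birth year. The idea is that a volume mentioning a person cannot plausibly carry a date earlier than that person's birth.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """A hypothetical catalog record; field names are assumptions."""
    title: str
    listed_year: int  # publication year as given in the metadata


def anachronistic(records, earliest_plausible_year):
    """Return the records whose listed year precedes the earliest plausible year."""
    return [r for r in records if r.listed_year < earliest_plausible_year]


# Invented sample records, for illustration only.
sample = [
    Record("Hypothetical volume A", 1799),
    Record("Hypothetical volume B", 1861),
    Record("Hypothetical volume C", 1805),
]

DICKENS_BIRTH_YEAR = 1812  # the birthdate cited on the slide

flagged = anachronistic(sample, DICKENS_BIRTH_YEAR)
for record in flagged:
    print(record.title, record.listed_year)
```

A flagged record is only a candidate for review: the listed date may be wrong, or the hit may refer to a different person of the same name.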
Ego-surfing, Edgar Cayce Style
     "Our reputation precedes us"

14
The frequency of misdatings
     Search on "candy bar" < 1920
     yields 66 hits, 46 of them
     misdated (70%)

15
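
The percentage on this slide is simple arithmetic, sketched below in Python. The function name misdating_rate is hypothetical; the counts (46 misdated out of 66 hits) are the ones quoted on the slide.

```python
def misdating_rate(misdated: int, total: int) -> float:
    """Share of hits in a hand-checked sample that carry a wrong date."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= misdated <= total:
        raise ValueError("misdated must be between 0 and total")
    return misdated / total


# Counts from the slide: 66 hits for "candy bar" before 1920, 46 misdated.
rate = misdating_rate(46, 66)
print(f"{rate:.1%}")  # prints 69.7%, which the slide rounds to 70%
```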
Classification Errors

16
Classification Errors

17
The Pervasiveness of Misclassification
     family and relationships (4)
     fiction (4)
     biography and autobiography (1)
     Unlabeled (1)
     (others classified as "music,"
     "history," "literary collections")

     Classifications of first 10 hits for
     Tristram Shandy

18
The Pervasiveness of Misclassification
     First 10 hits for Leaves of
     Grass classify it as:
        Juvenile Nonfiction
        Poetry
        Fiction
        Literary Criticism
        Biography & Autobiography
        Counterfeits and Counterfeiting

19
More bad metadata

20
More bad metadata
     Reader, I marketed him.

21
Other metadata issues
     Books ascribed to authors of introductions, or
     given no author at all.

22
Other metadata issues
     Titles linked to unrelated works.

23
Other metadata issues
     Strange bedfellows

24
Who is to blame and what is to be done?
     "We got the metadata from the libraries":
        Yes, sometimes… but libraries didn't classify Hamlet as
        "Antiques and Collectibles" or Speculum as "Health & Fitness"
        Libraries don't use BISAC headings like "Antiques and
        Collectibles" and "Health & Fitness" in the first place…
        And publishers didn't assign BISAC codes to books
        published before the 1980s

25
The world according to BISAC
     Making space for Bambi & Bullwinkle

     … and Schiller, Petrarch & Verlaine

26
The world according to BISAC
     Making shelf space for Bambi & Bullwinkle

     … and scrunching together Schiller, Petrarch & Verlaine

     Squeezing the universal library into a suburban bookstore

27
Correcting the Problem
     Google: "We're on it (but it isn't a first priority)"
        Correcting errors as noticed (like bad scans)?
        Crowdsourcing?
        But errors/bad metadata affect 000,000's of records
        "Error correction" doesn't address poor & missing
        metadata, inconsistent/confusing/inappropriate
        classification schemes
        Why should the metadata decisions be left to Google
        engineers?

28
Correcting the Problem
     HathiTrust to the rescue?
        But HathiTrust makes available only out-of-copyright
        works, and has (relatively) limited computational resources
     Why should Google have no obligations to do
     GBS right?
        Google Book Search is "a tremendous public good for
        students, for teachers, for scholars, for everyone."
        Derek Slater, Google
        But a public good implies a public trust

29
