OMG! MY METADATA IS AS FRESH AS THE BACKSTREET BOYS:
HOW GOOGLE REFINE CAN UPDATE, CLEAN UP AND LINK YOUR
METADATA TO THE WIDER WORLD

SARAH BETH WEEKS
LIBRARY TECHNOLOGY CONFERENCE 2013
WEEKSS@STOLAF.EDU
@RASCALWHALE
SAMPLE PROJECT: NORDIC AMERICAN IMPRINTS

Situation: We wanted to match the publishers of our books against a
list of important Nordic American publishers (compiled by Penny
Huffman) to find materials for our special collections.
Problem: It is hard to compare when publication info is not
controlled.
ANSWER: GOOGLE REFINE!

Google Refine can “match and merge” messy data filled with:
• random, leading or trailing spaces
• stray punctuation
• typos
• odd capitalization
• and more!
1. CREATE YOUR PROJECT USING ANY SPREADSHEET
2. USE “COMMON TRANSFORMS” TO FIX “WHITESPACE” PROBLEMS IN A SINGLE CLICK
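For readers who want to see what the whitespace clean-up amounts to, here is a minimal Python sketch that trims both ends of a value and collapses repeated spaces. It is not Refine's own code; the function name `fix_whitespace` and the sample values are made up for illustration.

```python
import re

def fix_whitespace(value: str) -> str:
    """Trim both ends, then collapse every run of whitespace to one space."""
    return re.sub(r"\s+", " ", value.strip())

# Hypothetical publisher values with leading, trailing and doubled spaces.
samples = ["  Example Publishing House ", "Example   Publishing House"]
cleaned = [fix_whitespace(s) for s in samples]
print(cleaned)
```

After this step both sample values are identical, which is what lets the later clustering steps treat them as one publisher.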
3. CLEAN UP STRAY CHARACTERS ([].?:) USING
   “TRANSFORM” AND REGULAR EXPRESSIONS
(OR JUST USE EXCEL FIND AND REPLACE FOR THIS)
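The same idea can be tried outside Refine. This is a minimal Python sketch, assuming the stray characters are exactly the ones listed on the slide ([ ] . ? :); the name `strip_stray` and the sample values are hypothetical.

```python
import re

# Character class for the stray characters named on the slide: [ ] . ? :
STRAY = re.compile(r"[\[\].?:]")

def strip_stray(value: str) -> str:
    """Delete the stray characters, then trim any space they leave behind."""
    return STRAY.sub("", value).strip()

print(strip_stray("[Example Publishing House?]"))
print(strip_stray("Minneapolis :"))
```

Note that removing every period also strips abbreviations such as "Co.", so it is worth spot checking the results before moving on.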
4. REPEAT COMMON TRANSFORMS
5. CLUSTER AND EDIT
(THIS IS WHERE THE MAGIC HAPPENS)
FUNCTION 1: FINGERPRINT
    (MOST RELIABLE)
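To give a feel for why fingerprinting is so reliable, here is a rough Python sketch of a fingerprint-style key: lowercase the value, drop punctuation and accents, then sort and de-duplicate the words. Values that share a key land in the same cluster. This approximates the general technique and is not Refine's implementation; `fingerprint`, `cluster` and the sample names are assumptions for illustration.

```python
import unicodedata
from collections import defaultdict

def fingerprint(value: str) -> str:
    """Build an order- and punctuation-insensitive key for a value."""
    text = unicodedata.normalize("NFKD", value.strip().lower())
    # Keep letters, digits and whitespace; this drops punctuation and the
    # combining accent marks that NFKD splits off from letters.
    text = "".join(ch for ch in text if ch.isalnum() or ch.isspace())
    # Sort the distinct words so word order no longer matters.
    return " ".join(sorted(set(text.split())))

def cluster(values):
    """Group values by fingerprint; keep only groups with several spellings."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {key: sorted(group) for key, group in groups.items() if len(group) > 1}

names = [
    "Example Publishing House",
    "example publishing house.",
    "House, Example Publishing",
    "Other Press",
]
clusters = cluster(names)
print(clusters)
```

Because the key only ignores case, punctuation, accents and word order, two values rarely collide by accident, which fits the "most reliable" label above.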
NGRAM METHOD
(STILL RELIABLE: MORE MATCHES BUT LESS
RELIABILITY AS YOU DECREASE NGRAM SIZE)
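The trade-off with n-gram size can be seen in a small Python sketch of an n-gram key: strip everything but letters and digits, cut the text into overlapping chunks of n characters, then sort and de-duplicate the chunks. This is an approximation of the technique, not Refine's code; `ngram_fingerprint` and the sample values are hypothetical.

```python
import unicodedata

def ngram_fingerprint(value: str, n: int = 2) -> str:
    """Key made of the sorted, distinct n-character chunks of a value."""
    text = unicodedata.normalize("NFKD", value.lower())
    # Drop spaces, punctuation and combining accent marks.
    text = "".join(ch for ch in text if ch.isalnum())
    grams = {text[i:i + n] for i in range(len(text) - n + 1)}
    return "".join(sorted(grams))

good, typo = "Example Press", "Exmaple Press"
# With 1-grams only the set of letters matters, so the typo still matches.
print(ngram_fingerprint(good, 1) == ngram_fingerprint(typo, 1))
# With 2-grams the swapped letters produce different chunks, so it does not.
print(ngram_fingerprint(good, 2) == ngram_fingerprint(typo, 2))
```

A smaller n forgives more typos but also merges more values that merely share the same letters, so those clusters need a closer look before merging.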
PHONETIC MATCHING
(ESPECIALLY USEFUL WHEN DEALING WITH
          TRANSLATED TEXT)
(MORE FALSE MATCHES TO WATCH FOR
    WITH PHONETIC FUNCTIONS)
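To illustrate how a phonetic key groups spellings that sound alike, here is a minimal Python sketch of classic American Soundex. Soundex is swapped in here as a well-known stand-in; it is not claimed to be one of the phonetic functions Refine itself offers, and the sample names are made up.

```python
# Soundex digit for each consonant; vowels, y, h and w have no digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(name: str) -> str:
    """Return the four-character Soundex code of a name."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same digit
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a digit may repeat after it
    return (first.upper() + "".join(digits) + "000")[:4]

for name in ("Olson", "Olsen", "Olsson", "Robert", "Rupert"):
    print(name, soundex(name))
```

The three Olson spellings share one code, which is the useful case. Robert and Rupert also share a code, which is exactly the kind of false match to watch for.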
NEAREST NEIGHBOR (PPM) MATCHING
(SLOWER AND MORE FALSE MATCHES BUT
 CATCHES WHAT OTHER METHODS MISS)
(SET RADIUS HIGHER, BLOCK CHARACTERS
  LOWER TO GENERATE MORE MATCHES)
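The effect of the radius and block settings can be sketched in Python. This sketch swaps in Levenshtein edit distance for PPM because it is easier to show, and its blocking rule (only compare values that share a run of `block_chars` characters) is a simplification. All names and sample values are assumptions for illustration.

```python
from itertools import combinations

def levenshtein(a: str, b: str) -> int:
    """Number of single-character edits needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

def blocks(value: str, size: int) -> set:
    """All runs of `size` consecutive characters in a value."""
    text = value.lower()
    return {text[i:i + size] for i in range(len(text) - size + 1)}

def near_pairs(values, radius=1, block_chars=6):
    """Pairs that share a block and lie within `radius` edits of each other."""
    pairs = []
    for a, b in combinations(sorted(set(values)), 2):
        if blocks(a, block_chars) & blocks(b, block_chars):
            if levenshtein(a.lower(), b.lower()) <= radius:
                pairs.append((a, b))
    return pairs

values = ["Example Press", "Exampel Press", "Sample Press"]
print(near_pairs(values, radius=1))
print(near_pairs(values, radius=2))
```

With radius 1 nothing matches. With radius 2 the typo "Exampel Press" is caught, but "Sample Press" is pulled in as well, which shows both sides of the warning above.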
AFTER USING OTHER METHODS, RUN
THROUGH FINGERPRINT AND NGRAM AGAIN
BE AWARE THAT THINGS THAT WEREN’T
 CLUSTERED WON’T HAVE BEEN FIXED
6. USE THE TEXT FACET TO SEE ALL
         UNIQUE VALUES
YOU CAN SCROLL THROUGH THE LIST TO
     SPOT CHECK FOR PROBLEMS
CLICK EDIT TO TYPE NEW TEXT FOR ALL
       CELLS WITH THIS VALUE
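As a rough picture of what the text facet and its edit link do, this Python sketch counts each unique value in a column and then rewrites every cell holding one value. The function names and sample column are hypothetical, and the facet is sorted by count here purely for convenience.

```python
from collections import Counter

def text_facet(values):
    """Unique values with their cell counts, most frequent first."""
    return Counter(values).most_common()

def edit_value(values, old, new):
    """Replace every cell equal to `old` with `new`."""
    return [new if v == old else v for v in values]

column = ["Example Press", "Example Press", "Exampel Press"]
print(text_facet(column))

# Fix the stray spelling that clustering did not catch.
column = edit_value(column, "Exampel Press", "Example Press")
print(text_facet(column))
```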
OTHER CLEAN-UP WE DID:
     PUBLISHERS
OTHER CLEAN-UP WE DID:
      GIFT NOTES
ALSO WORKS FOR NUMBERS/DATES
END RESULT?

• Using Google Refine, we reduced the 3,230 unique values
  for city (260|a) to just 1,153. For publishers (260|b), we
  went from 11,342 unique names to approximately 6,500.
• This project helped identify over 2,000 potential
  candidates for our Nordic American Imprints
  collection. (These are still being evaluated.)
• The controlled publishers, cities of publication and
  dates will be added to a local 9xx field for faceting in
  our future special collections discovery tool. Users will
  be able to browse our Nordic American Imprints
  collection by publisher, city or state.
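To put those counts in proportion, a small Python calculation gives the share of unique values that were eliminated. The publisher figure is only as precise as the "approximately 6,500" it starts from; the function name is made up.

```python
def percent_reduction(before: int, after: int) -> float:
    """Share of unique values eliminated, as a percentage with one decimal."""
    return round(100 * (before - after) / before, 1)

city = percent_reduction(3230, 1153)        # 260|a values
publisher = percent_reduction(11342, 6500)  # 260|b values, approximate
print(city, publisher)
```

That works out to roughly 64 percent fewer unique city values and about 43 percent fewer unique publisher names.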
BUT WAIT! THERE’S MORE!!
     LINKED DATA!!!
FREEBASE IS THE DEFAULT SERVICE
(WIKIPEDIA-ESQUE DATA OWNED BY GOOGLE)
CHOOSE THE RIGHT “TYPE” AND MOST
   CELLS WILL BE AUTO-MATCHED
FOR THE REST CLICK THE OPTIONS TO
     SEE WHAT EACH REPRESENTS
• Then click “Match All Identical Cells” (or double checkmarks)
  to link all cells with this text to this Freebase topic
OR “SEARCH FOR MATCH” TO BRING UP
 AN AUTO-FILL LIST TO CHOOSE FROM
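Behind the scenes, reconciliation is a batch of name look-ups sent to a service. Here is a minimal Python sketch, assuming the batch format used by Refine-style reconciliation services: a `queries` field mapping keys such as `q0` to a query object, and a response that flags confident candidates with `match`. No request is sent; the type id `/example/type`, the ids and the sample response are all invented for illustration.

```python
import json

def build_queries(names, type_id, limit=3):
    """One query object per cell value, keyed q0, q1, ..."""
    return {f"q{i}": {"query": name, "type": type_id, "limit": limit}
            for i, name in enumerate(names)}

def auto_matches(response):
    """Map each query key to the id of its first confident candidate."""
    matched = {}
    for key, payload in response.items():
        for candidate in payload.get("result", []):
            if candidate.get("match"):
                matched[key] = candidate["id"]
                break
    return matched

queries = build_queries(["Example Press", "Sample House"], "/example/type")
form_field = json.dumps(queries)  # value that would go in the `queries` field

# Invented response: q0 is a confident match, q1 only has a weak candidate.
sample_response = {
    "q0": {"result": [{"id": "/example/1", "name": "Example Press",
                       "score": 100, "match": True}]},
    "q1": {"result": [{"id": "/example/2", "name": "Sample Housing",
                       "score": 40, "match": False}]},
}
print(auto_matches(sample_response))
```

Cells like `q1` are the ones left for manual review with the options described above.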
EVEN COOLER: NOW YOU CAN BRING
    DATA IN FROM FREEBASE!
CHOOSE WHAT INFO YOU WANT TO ADD
THIS NEW DATA IS NOW ADDED TO YOUR
           SPREADSHEET
TO SEE WHAT COLUMNS (DATA) YOU CAN
        ADD FROM FREEBASE:
Browse the properties at: http://schemas.freebaseapps.com/
MATCH LOCAL SUBJECT HEADING TO LC
    (FREEYOURMETADATA.ORG)
SPARQL ENDPOINTS

• Install the RDF Extension for Google Refine:
  http://refine.deri.ie/
• SPARQL Endpoints:
  http://labs.mondeca.com/sparqlEndpointsStatus/index.html
• CKAN Data Hub: http://datahub.io/dataset/
ADD SPARQL-BASED RECONCILIATION
            SERVICE
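At its simplest, reconciling against a SPARQL endpoint means looking up resources by label. The Python sketch below only builds such a query as text, assuming the endpoint's data uses `rdfs:label`. It does not reproduce what the RDF Extension actually sends, and the function name and sample heading are hypothetical.

```python
def label_lookup_query(name: str, limit: int = 5) -> str:
    """Build a SPARQL query for resources whose rdfs:label equals `name`."""
    # Escape backslashes and double quotes so the name is a valid literal.
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?entity ?label WHERE {\n"
        "  ?entity rdfs:label ?label .\n"
        f'  FILTER (LCASE(STR(?label)) = LCASE("{escaped}"))\n'
        "}\n"
        f"LIMIT {limit}"
    )

query = label_lookup_query("Example Subject Heading")
print(query)
```

The case-insensitive comparison is a design choice for messy local headings; an exact match would be faster on a large endpoint but would miss capitalization differences.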
THANK YOU!

Questions?

Link to a public version of this presentation
at my (personal) blog:
gardenandalibrary.blogspot.com
I’m also happy to take questions by e-mail:
weekss@stolaf.edu
