A Public Metadata Commons:
                                   What is it?
                                Why do we need it?
                                How do we get it?




                                    Kurt Bollacker
                                  Open Data Bay Area
                                    2012 Nov 27


Wednesday, April 3, 2013                                1
A long time ago, there was no “open” data.
                           All of the media we used to create was physical.




Then most (all?) of the media became digital.




The Internet let us ship data around
                                for (almost) free.




And we learned how to connect it all together.




                            So naturally, we started to build a
                           Global Digital Data Commons!

At first it was a “free for all” of
                             academics and enthusiasts.




             Almost all data on the Web was considered to be “open”.
And then folks figured out how to
                           make money from our contributions,




             so they started to “lock down” parts of the Internet that
                previously would have been part of the commons.


Why is this bad?
                           For the data archivist, centrally controlled data
                              can be lost at just a few (a single?) points of failure:

                                      •   Technical Failure

                                      •   Legal Barriers

                                      •   Incompetence




A (Potential) Digital Dark Age




                         "Those who cannot remember the past are
                       condemned to repeat it" --- George Santayana
How Do We Avoid This
                           Lockdown Of Central Control
                             (And Hopefully A Digital Dark Age)?




                   We Need A Practical Perspective On the Problem.
Example Surviving Archives




Data tends to survive if,
                              over the long term, it is:


                                      •   Visible

                                      •   Mobile

                                      •   Well Loved




                              These happen to also be the
                           properties of data in a public commons.


                           Historical Examples:

                           •   Bible / Torah / Koran

                           •   U.S. Constitution

                           •   DNA?

                           Present Day Examples:

                           •   Wikipedia

                           •   Open Street Maps

                           •   Freebase

                           •   MusicBrainz

               Why?
                           •   There are many copies. (mobile)

                           •   Their use is mostly unrestricted. (visible)

                           •   Everyone can access and contribute. (well loved)

But what about data that is still trapped by:



                           •   Technical Barriers?

                           •   Legal Restrictions?

                           •   Limited Resources?




We build a metadata commons to hold
       the “cultural context” of our trapped data.




How does a metadata commons work?


                   [Diagram: Trapped Datasets -> Extraction Processes -> many pieces of Metadata]




                   Even if the original contribution is lost or otherwise
                    made unavailable, we still have its cultural context.
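
As a minimal sketch of the flow in the diagram: an extraction process reads a trapped dataset once and emits small metadata records that can be copied freely. The function name `extract_metadata`, the record fields, and the sample bytes are illustrative assumptions, not something the talk specifies.

```python
import hashlib


def extract_metadata(name, raw_bytes):
    """Hypothetical extraction process: derive shareable facts from one dataset."""
    return {
        "name": name,
        "size_bytes": len(raw_bytes),
        # A checksum lets anyone later verify a rediscovered copy.
        "sha256": hashlib.sha256(raw_bytes).hexdigest(),
    }


# A made-up "trapped" dataset held by a single party.
trapped = {"report.txt": b"quarterly figures ..."}

# Run the extraction and keep only the metadata in the commons.
commons = [extract_metadata(name, data) for name, data in trapped.items()]

# Simulate the original becoming unavailable.
trapped.clear()

# The metadata is still there to describe what was lost.
print(commons[0]["name"], commons[0]["size_bytes"])
```

The point of the sketch is only that the records in `commons` do not depend on the original staying available.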

The cultural context in a metadata commons
                          might contain:

          •       Indices and Tags (to find and organize)

          •       Comments (to analyze and interpret)

          •       Technical metadata (e.g. provenance, format info)

          •       Transforms and Interpretations (to make something useful)
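
One possible shape for such a record, as a sketch only: the class `ContextRecord`, its field names, and the sample values are assumptions made for illustration. Each field mirrors one of the four bullets above, and a small tag index shows the "find and organize" use.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ContextRecord:
    """Hypothetical record of one dataset's cultural context."""
    dataset_id: str
    tags: list = field(default_factory=list)        # to find and organize
    comments: list = field(default_factory=list)    # to analyze and interpret
    technical: dict = field(default_factory=dict)   # provenance, format info
    transforms: list = field(default_factory=list)  # derived, useful forms


def build_tag_index(records):
    """Map each tag to the sorted ids of the datasets that carry it."""
    index = defaultdict(list)
    for rec in records:
        for tag in rec.tags:
            index[tag].append(rec.dataset_id)
    return {tag: sorted(ids) for tag, ids in index.items()}


# Invented sample records.
records = [
    ContextRecord("crawl-2012", tags=["web", "crawl"],
                  technical={"format": "text/html", "source": "example crawler"}),
    ContextRecord("street-map", tags=["geo", "web"],
                  comments=["Coverage is uneven outside cities."]),
]

tag_index = build_tag_index(records)
print(tag_index["web"])
```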




Where is the trapped data that we care about?
         A lot of it is in The World Wide Web!

                                          But the Web is:

                   •       Very large (10TB - 100TB for accessible / deduped)

                   •       Very noisy (useless pages, partial duplicates)

                   •       Very diverse (in content, purpose, and target audience)



                   How do we build a Metadata Commons
                             from the Web?
A Practical Place To Start:

                                      Common Crawl
                            (and cheap cloud computing resources)
                           make the Web far cheaper and easier to
                                   access and manipulate.

                           •   Can be downloaded wholesale

                           •   Can be processed and analyzed in situ.

                           •   Parts can be publicly referenced




This foundation helps us scale up to
                                “Web size”, but:


                           •   What is the useful “metadata of the Web”?

                           •   How do we extract that metadata?




Useful Web Extracts:


                       •   Are interesting to many people (to me!)

                       •   Can be used to answer relevant questions.

                       •   Can be used to build useful products and services.




           Almost everyone will have an itch to scratch.


Specific Examples Of Useful Web Extracts
                               (From the Common Crawl code contest)



                           •    WikiEntities

                           •    Congressional sentiment

                           •    Reach of Facebook on the Web




(A Few) General Shapes Of Web Metadata Extracts

                           •   Link graphs

                           •   N-gram counts

                           •   File Indices by domain or keyword

                           •   Mashups with interesting datasets

                               •   Wikipedia

                               •   Freebase

                               •   Location databases (e.g. Open Street Maps)


              We should all create an extract!
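
To make the first three shapes concrete, here is a rough sketch over two made-up pages. The `PAGES` data, the function names, and the regex-based HTML handling are all assumptions chosen for brevity; a real extract would read crawl records and use a proper HTML parser.

```python
import re
from collections import Counter, defaultdict
from urllib.parse import urljoin, urlparse

# Made-up pages standing in for crawl records: URL -> HTML body.
PAGES = {
    "http://a.example/": '<a href="/about">About</a> <a href="http://b.example/">B</a> open data commons',
    "http://b.example/": '<a href="http://a.example/">A</a> open data for all',
}

HREF_RE = re.compile(r'href="([^"]+)"')  # crude link finder, illustration only
TAG_RE = re.compile(r"<[^>]+>")          # crude tag stripper, illustration only


def link_graph(pages):
    """Link graph: each page URL maps to the set of absolute URLs it links to."""
    graph = defaultdict(set)
    for url, html in pages.items():
        for href in HREF_RE.findall(html):
            graph[url].add(urljoin(url, href))
    return graph


def ngram_counts(pages, n=2):
    """N-gram counts: count word n-grams in the visible text of every page."""
    counts = Counter()
    for html in pages.values():
        words = TAG_RE.sub(" ", html).lower().split()
        counts.update(zip(*(words[i:] for i in range(n))))
    return counts


def index_by_domain(pages):
    """File index by domain: each host name maps to the page URLs under it."""
    index = defaultdict(list)
    for url in pages:
        index[urlparse(url).netloc].append(url)
    return index


print(sorted(link_graph(PAGES)["http://a.example/"]))
print(ngram_counts(PAGES)[("open", "data")])
print(dict(index_by_domain(PAGES)))
```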
How do I create an extract?

                                       An easy Recipe:


                       •   Ingredients:

                           •   A Web crawl snapshot

                           •   A little bit of programming skill

                           •   Access to cloud computing resources (e.g. EMR)

                       •   Directions:

                           •   http://commoncrawl.org/mapreduce-for-the-masses/
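
For readers new to the map/reduce pattern that the recipe relies on, the following is a local, single-process sketch of it. It is not the code from the linked tutorial and uses no Hadoop or EMR APIs; the function names and the `SAMPLE` records are assumptions. It counts crawled pages per domain.

```python
from collections import defaultdict
from urllib.parse import urlparse


def map_record(url, body):
    """Map step: emit (domain, 1) for each crawled page."""
    yield urlparse(url).netloc, 1


def reduce_counts(key, values):
    """Reduce step: sum the counts emitted for one key."""
    return key, sum(values)


def run_local(records):
    """Run map, shuffle, and reduce in one process over (url, body) pairs."""
    shuffled = defaultdict(list)
    for url, body in records:
        for key, value in map_record(url, body):
            shuffled[key].append(value)  # shuffle: group values by key
    return dict(reduce_counts(k, v) for k, v in sorted(shuffled.items()))


# Made-up stand-ins for records from a Web crawl snapshot.
SAMPLE = [
    ("http://a.example/", "<html>...</html>"),
    ("http://a.example/about", "<html>...</html>"),
    ("http://b.example/", "<html>...</html>"),
]

print(run_local(SAMPLE))  # {'a.example': 2, 'b.example': 1}
```

On a cluster, the map and reduce functions keep this shape while the framework handles the shuffle and distributes the work across machines.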



What Happens Once
                           I’ve Made This Awesome Extract?

                             •   Share the extracted data

                             •   Share the code you created / modified

                                 •   https://github.com/commoncrawl/commoncrawl-examples/

                             •   Broadcast it to the world!




And The World Is Saved!




                                Thank you.




Some Useful Links


   •       https://github.com/commoncrawl

   •       http://commoncrawl.org/mapreduce-for-the-masses/

   •       https://github.com/commoncrawl/commoncrawl-examples/

   •       https://aws.amazon.com/amis/common-crawl-quick-start

   •       https://commoncrawl.atlassian.net/wiki/display/CRWL/About+the+Data+Set





