Digital Enterprise Research Institute                                         www.deri.ie




                              Wikipedia (DBpedia):
                           Crowdsourced Data Curation
                      Edward Curry, Andre Freitas, Seán O'Riain




 ed.curry@deri.org
 http://www.deri.org/
 http://www.EdwardCurry.org/
 Copyright 2010 Digital Enterprise Research Institute. All rights reserved.
Speaker Profile
            Research Scientist at the Digital Enterprise Research
             Institute (DERI)
                   Leading international web science research organization
            Researching how the web of data is changing the way businesses
             work and interact with information
                   Projects include studies of enterprise linked data, community-
                    based data curation, semantic data analytics, and semantic
                    search
                   Investigate utilization within the pharmaceutical, oil &
                    gas, financial, advertising, media, manufacturing, health
                    care, ICT, and automotive industries
            Invited speaker at the 2010 MIT Sloan CIO Symposium
             to an audience of more than 600 CIOs
Overview
            Curation Background
                   The Business Need for Curated Data
                   What is Data Curation?
                   Data Quality and Curation
                   How to Curate Data


            Wikipedia (DBpedia) Case Study

            Best Practices from Case Study Learning
The Business Need
               Knowledge workers need:
                   Access to the right information
                   Confidence in that information


               Working with incomplete,
                inaccurate, or wrong
                information can have
                disastrous consequences
The Problems with Data
          Flawed Data
             Affects 25% of critical data in the world's top companies
                 (Gartner)

          Data Quality
             Recent banking crisis (Economist, Dec '09)
             Inaccurate figures made it difficult to manage operations
                 (investment exposure and risk)
                    –   “assets are defined differently in different programs”
                    –   “numbers did not always add up”
                    –   “departments do not trust each other's figures”
                    –   “figures … not worth the pixels they were made of”
What is Data Curation?
        Digital Curation
            Selection, preservation, maintenance, collection, and
                archiving of digital assets

        Data Curation
            Active management of data over its life-cycle

        Data Curators
            Ensure data is
                trustworthy, discoverable, accessible, reusable, and fit for
                use
                   – Museum cataloguers of the Internet age
What is Data Curation?
            Data Governance
                Convergence of data quality, data
                    management, business process management, and
                    risk management

            Data Curation is a complementary activity
                Part of the overall data governance strategy for
                    an organization

            Data Curator = Data Steward?
                   Overlapping terms between communities
Data Quality and Curation
            What is Data Quality?
                Desirable characteristics for information resource
                Described as a series of quality dimensions
                       – Discoverability, Accessibility, Timeliness, Completeness,
                         Interpretation, Accuracy, Consistency, Provenance & Reputation

            Data curation can be used to improve these
             quality dimensions
Data Quality and Curation
            Discoverability & Accessibility
                Curate to streamline search by storing and classifying
                    in an appropriate and consistent manner

            Accuracy
                Curate to ensure data correctly represents the “real-world”
                    values it models

            Consistency
                Curate to ensure data is created and maintained using
                    standardized definitions, calculations, terms, and
                    identifiers
Data Quality and Curation
            Provenance & Reputation
                Curate to track source of data and determine reputation
                Curate to include the objectivity of the source/producer
                       – Is the information unbiased, unprejudiced, and impartial?
                       – Or does it come from a reputable but partisan source?




                       Other dimensions discussed in chapter
How to Curate Data
            Data Curation is a large field with sophisticated
             techniques and processes

            Section provides high-level overview on:
                Should you curate data?
                Types of curation
                Setting up a curation process


               Additional detail and references available in book
               chapter
Should You Curate Data?
            Curation can have multiple motivations
                Improving accessibility, quality, consistency, …

            Will the data benefit from curation?
                Identify the business case
                Determine if the potential return supports the investment

            Not all enterprise data should be curated
                Suits knowledge-centric data rather than transactional
                    operations data
Types of Data Curation
            Multiple approaches to curate data, no single
             correct way
                Who?
                       – Individual Curators
                       – Curation Departments
                       – Community-based Curation
                How?
                       – Manual Curation
                       – (Semi-)Automated
                       – Sheer Curation
Types of Data Curation – Who?
            Individual Data Curators
                Suitable for small quantities of infrequently changing
                    data
                       – (<1,000 records)
                       – Minimal curation effort (minutes per record)
Types of Data Curation – Who?
            Curation Departments
                Curation experts working with subject matter experts
                    to curate data within a formal process
                       – Can deal with large curation effort (1,000s of records)

            Limitations
                Scalability: Can struggle with large quantities of
                    dynamic data (>1 million records)
                Availability: Post-hoc nature creates delay in curated
                    data availability
Types of Data Curation - Who?
            Community-Based Data Curation
                Decentralized approach to data curation
                Crowd-sourcing the curation process
                       – Leverages community of users to curate data
                Wisdom of the community (crowd)
                Can scale to millions of records
Types of Data Curation – How?
            Manual Curation
                Curators directly manipulate data
                Can tie users up with low-value-add activities

            (Semi-)Automated Curation
                Algorithms can (semi-)automate curation activities
                    such as data cleansing, record deduplication, and
                    classification
                Can be supervised or approved by human curators
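
A minimal sketch of semi-automated curation with human approval, assuming records are plain name strings. The names `MergeProposal`, `propose_duplicates`, `apply_approved`, and the 0.85 similarity threshold are illustrative assumptions, not taken from the slides. The algorithm only proposes duplicates; a curator callback decides what is merged.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class MergeProposal:
    """A candidate duplicate pair awaiting a human curator's decision."""
    left: str
    right: str
    score: float


def normalize(name: str) -> str:
    """Basic cleansing: trim, lowercase, collapse internal whitespace."""
    return " ".join(name.lower().split())


def propose_duplicates(records, threshold=0.85):
    """Return likely duplicate pairs; nothing is merged automatically."""
    proposals = []
    for i, left in enumerate(records):
        for right in records[i + 1:]:
            score = SequenceMatcher(None, normalize(left), normalize(right)).ratio()
            if score >= threshold:
                proposals.append(MergeProposal(left, right, round(score, 2)))
    return proposals


def apply_approved(records, proposals, approve):
    """Drop the right-hand record of every proposal the curator approves."""
    dropped = {p.right for p in proposals if approve(p)}
    return [r for r in records if r not in dropped]


records = ["Acme Corp", "acme  corp", "Globex"]
proposals = propose_duplicates(records)
# The curator callback stands in for a review screen; here it approves everything.
curated = apply_approved(records, proposals, approve=lambda p: True)
```

Keeping the approval step as a separate callback means the same proposals can be routed to an individual curator, a department queue, or a community vote.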
Types of Data Curation – How?
            Sheer curation, or Curation at Source
                Curation activities integrated in normal workflow of
                    those creating and managing data
                Can be as simple as vetting or “rating” the results of a
                    curation algorithm
                Results can be available immediately

            Blended Approaches: Best of Both
                Sheer curation + post hoc curation department
                Allows immediate access to curated data
                Ensures quality control with expert curation
Setting up a Curation Process
            5 steps to set up a curation process:
               1 - Identify what data you need to curate
               2 - Identify who will curate the data
               3 - Define the curation workflow
               4 - Identify appropriate data-in & data-out formats
               5 - Identify the artifacts, tools, and processes needed to
                   support the curation process
Wikipedia
              The World's Largest Open Digital Curation Community
Wikipedia
        Open-source encyclopedia
        Collaboratively built by large community
                Challenges existing models of content creation
                More than 19,000,000 articles
                270+ languages, 3,200,000+ articles in English
                More than 157,000 active contributors
            Studies show accuracy and stylistic formality are
             equivalent to resources developed in expert-based
             closed communities
                i.e. Columbia and Britannica encyclopedias
Wikipedia
       MediaWiki
           Wiki platform behind Wikipedia
                  – Widespread and popular technology
           Wikis can also support data curation
                  – Lowers entry barriers for collaborative data curation
       Widely used inside organizations
           Intellipedia covering 16 U.S. Intelligence agencies
           WikiProteins, curated protein data for knowledge
               discovery and annotation
Wikipedia
           Decentralized environment supports creation of
            high quality information with:
               Social organization
               Artifacts, tools & processes for cooperative work
                   coordination


           Wikipedia collaboration dynamics highlight good
            practices
Wikipedia – Social Organization
            Any user can edit its contents
                Without prior registration

            Does not lead to a chaotic scenario
                In practice, a highly scalable approach for high-quality
                    content creation on the Web

            Relies on a simple but highly effective way to
             coordinate its curation process
            Curation is the activity of Wikipedia admins
                Responsibility for information quality standards
Wikipedia – Social Organization
            Four main types of accounts:
                Anonymous users
                       – Identified by their associated IP address
                Registered users
                       – Users with an account on the Wikipedia website
                Administrators/Editors
                       – Registered users with additional permissions in the system
                       – Access to curation tools
                Bots
                       – Programs that perform repetitive tasks
Wikipedia – Social Organization
            Incentives
                Improvement of one's reputation
                Sense of efficacy
                       – Contributing effectively to a meaningful project
                Over time, the focus of editors typically changes
                       – From curators of a few articles in specific topics
                       – To a more global curation perspective
                       – Enforcing quality assessment of Wikipedia as a whole
Wikipedia – Artifacts, Tools &
       Processes
            Wiki Article Editor (Tool)
                   WYSIWYG or markup text editor
            Talk Pages (Tool)
                   Public arena for discussions around Wikipedia resources
            Watchlists (Tool)
                   Help curators actively monitor the integrity and quality of
                    resources they contribute to
            Permission Mechanisms (Tool)
                   Users with administrator status can perform critical actions such
                    as removing pages and granting administrative permissions to new
                    users
Wikipedia – Artifacts, Tools &
       Processes
          Automated Editing (Tool)
                Bots are automated or semi-automated tools that perform repetitive
                 tasks over content
          Page History and Restore (Tool)
                Historical trail of changes to a Wikipedia resource
          Guidelines, Policies & Templates (Artifact)
                Define curation guidelines for editors to assess article quality
          Dispute Resolution (Process)
                Dispute mechanism between editors over the article contents
          Article Editing, Deletion, Merging, Redirection, Transwiking,
           Archival (Process)
                Describe the curation actions over Wikipedia resources
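
The page history and restore idea can be sketched as an append-only list of revisions. This is an illustrative toy, not MediaWiki's implementation; the `PageHistory` class and its method names are assumptions. A restore is itself recorded as a new revision, so the historical trail is never rewritten.

```python
class PageHistory:
    """Append-only revision trail for one page (hypothetical sketch)."""

    def __init__(self, initial_text: str, author: str):
        # Each revision is an (author, text) pair; index 0 is the first version.
        self._revisions = [(author, initial_text)]

    @property
    def current(self) -> str:
        """Text of the latest revision."""
        return self._revisions[-1][1]

    def edit(self, author: str, text: str) -> None:
        """Record a new revision on top of the trail."""
        self._revisions.append((author, text))

    def restore(self, index: int, author: str) -> None:
        """Bring back an earlier revision by appending a copy of it."""
        self._revisions.append((author, self._revisions[index][1]))

    def trail(self):
        """Authors of every revision, oldest first."""
        return [author for author, _ in self._revisions]


page = PageHistory("Dublin is the capital of Ireland.", "alice")
page.edit("203.0.113.7", "Dublin is the capital of France.")  # anonymous edit by IP
page.restore(0, "admin")  # a curator reverts to the first revision
```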
Wikipedia - DBpedia
            DBpedia Knowledge base
                Inherits massive volume of curated Wikipedia data
                Built using information from infobox properties
                Indirectly uses wiki as data curation platform

            DBpedia provides direct access to data
                3.4 million entities and 1 billion RDF triples
                Comprehensive data infrastructure
                       – Concept URIs, definitions, and basic types
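
As a rough illustration of direct access, the sketch below builds a SPARQL request URL for a concept URI's definition (its abstract). It assumes the public endpoint at `https://dbpedia.org/sparql`, the `dbo:abstract` property, and the endpoint's `format` parameter; the function names and the `Wikipedia` resource are chosen only for the example. No request is sent.

```python
from urllib.parse import urlencode

# Assumed public endpoint; an enterprise deployment may host its own.
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"


def build_abstract_query(resource: str, lang: str = "en") -> str:
    """SPARQL query for the abstract of one DBpedia concept URI."""
    return (
        "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
        "SELECT ?abstract WHERE {\n"
        f"  <http://dbpedia.org/resource/{resource}> dbo:abstract ?abstract .\n"
        f"  FILTER (lang(?abstract) = '{lang}')\n"
        "}"
    )


def build_request_url(resource: str) -> str:
    """URL-encode the query; fetching the URL is left to the caller."""
    params = {
        "query": build_abstract_query(resource),
        "format": "application/sparql-results+json",  # assumed endpoint option
    }
    return f"{DBPEDIA_ENDPOINT}?{urlencode(params)}"


url = build_request_url("Wikipedia")
```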
Best Practices from Case Study
       Learning
            Social Best Practices
                Participation
                Engagement
                Incentives
                Community Governance Models

            Technical Best Practices
                Data Representation
                Human and Automated Curation
                Track Provenance
Social Best Practices
            Participation
                Stakeholder involvement for data producers and
                    consumers must occur early in the project
                       – Provides insight into basic questions of what they want
                         to do, for whom, and what it will provide
                White papers are an effective means to present these
                    ideas and solicit opinion from the community
                       – Can be used to establish an informal 'social contract' for
                         the community
Social Best Practices
            Engagement
                Outreach activities are essential for promotion and
                    feedback
                Typical contributor-to-consumer ratios of less than
                    5%
                Social communication and networking forums are
                    useful
                       – Majority of the community may not communicate using
                         these media
                       – Communication by email still remains important
Social Best Practices
            Incentives
                Sheer curation needs a line of sight from the data curating
                    activity to tangible exploitation benefits
                Lack of awareness of the value proposition will slow
                    emergence of collaborative contributions
                Recognizing contributing curators through a formal
                    feedback mechanism
                       – Reinforces contribution culture
                       – Directly increases output quality
Social Best Practices
            Community Governance Models
                Effective governance structure is vital to ensure
                    success of the community
                Internal communities and consortia perform well
                    when they leverage traditional corporate and
                    democratic governance models
                Open communities need to engage the community
                    within the governance process
                       – Follow less orthodox approaches using meritocratic
                         and autocratic principles
Technical Best Practices
            Data Representation
                Must be robust and standardized to encourage
                    community usage and tools development
                Support for legacy data formats and ability to
                    translate data forward to support new technology and
                    standards
            Human & Automated Curation
                Balancing the two will improve data quality
                Automated curation should always defer to, and never
                    override, human curation edits
                       – Automate validating data deposition and entry
                       – Target community at focused curation tasks
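
One way to read the "defer to, never override" rule is as a per-field check on who made the last edit. The sketch below is an assumption about how that could look; `FieldValue`, `apply_bot_edit`, and `apply_human_edit` are hypothetical names, and a record is modelled as a plain dict.

```python
from dataclasses import dataclass


@dataclass
class FieldValue:
    """A field's value together with who last set it."""
    value: str
    source: str  # "human" or "bot"


def apply_human_edit(record: dict, field: str, new_value: str) -> None:
    """Human curators can always set a field."""
    record[field] = FieldValue(new_value, "human")


def apply_bot_edit(record: dict, field: str, new_value: str) -> bool:
    """Apply an automated edit unless a human curator last set the field."""
    current = record.get(field)
    if current is not None and current.source == "human":
        return False  # defer: human curation edits are never overridden
    record[field] = FieldValue(new_value, "bot")
    return True


record = {}
first = apply_bot_edit(record, "country", "Ireland")   # empty field: bot may fill it
apply_human_edit(record, "country", "IE")              # curator corrects the value
second = apply_bot_edit(record, "country", "Eire")     # bot must now defer
```

A rejected bot edit could instead be queued as a suggestion for the curator, which keeps the automation useful without weakening the rule.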
Technical Best Practices
            Track Provenance
                All curation activities should be recorded and
                    maintained as part of the data provenance effort
                       – Especially where human curators are involved
                Users can have different perspectives of provenance
                       – A scientist may need to evaluate the fine-grained
                         experiment description behind the data
                       – For a business analyst the 'brand' of the data provider can
                         be sufficient for determining quality
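
A minimal sketch of recording curation activity, assuming an in-memory log. `CurationEvent`, `ProvenanceLog`, and their fields are illustrative names rather than a standard provenance vocabulary. The two query methods mirror the two perspectives above: a fine-grained history and a coarse list of who touched the data.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class CurationEvent:
    """One recorded curation activity."""
    record_id: str
    actor: str
    actor_type: str  # "human" or "bot"
    action: str
    timestamp: datetime


@dataclass
class ProvenanceLog:
    """Append-only log of curation events (hypothetical sketch)."""
    events: list = field(default_factory=list)

    def record(self, record_id, actor, actor_type, action):
        event = CurationEvent(record_id, actor, actor_type, action,
                              datetime.now(timezone.utc))
        self.events.append(event)
        return event

    def history(self, record_id):
        """Fine-grained view: every event for one record, oldest first."""
        return [e for e in self.events if e.record_id == record_id]

    def sources(self, record_id):
        """Coarse view: just the actors who touched the record."""
        return sorted({e.actor for e in self.history(record_id)})


log = ProvenanceLog()
log.record("r1", "alice", "human", "corrected company name")
log.record("r1", "cleanup-bot", "bot", "normalized whitespace")
log.record("r2", "bob", "human", "created record")
```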
Conclusions
        Data curation can ensure the quality of data and
         its fitness for use
        Pre-competitive data can be shared without
         conferring a commercial advantage
        Pre-competitive data communities
                Common curation tasks carried out once in the public
                    domain
                Reduces cost, increases quantity and quality
Acknowledgements
        Collaborators Andre Freitas & Seán O'Riain

        Insight from Thought Leaders
               Evan Sandhaus (Semantic Technologist), Rob Larson (Vice President Product
                Development and Management), and Gregg Fenton (Director Emerging Platforms)
                from the New York Times
               Krista Thomas (Vice President, Marketing & Communications), Tom Tague
                (OpenCalais initiative Lead) from Thomson Reuters
               Antony Williams (VP of Strategic Development) from ChemSpider
               Helen Berman (Director), John Westbrook (Product Development) from the Protein
                Data Bank
               Nick Lynch (Architect with AstraZeneca) from the Pistoia Alliance.

        The work presented has been funded by Science
         Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2).
Further Information
The Role of Community-Driven
Data Curation for Enterprises
Edward Curry, Andre Freitas, & Seán O'Riain




  In David Wood (ed.),
  Linking Enterprise Data, Springer, 2010.
  Available free at:
  http://3roundstones.com/led_book/led-curry-et-al.html

More Related Content

PPTX
Data Curation at the New York Times
PDF
Challenges Ahead for Converging Financial Data
PDF
Approximate Semantic Matching of Heterogeneous Events
PDF
Developing an Sustainable IT Capability: Lessons From Intel's Journey
PPTX
An Environmental Chargeback for Data Center and Cloud Computing Consumers
PPTX
Building Optimisation using Scenario Modeling and Linked Data
PDF
Using Linked Data and the Internet of Things for Energy Management
PDF
Dealing with Semantic Heterogeneity in Real-Time Information
Data Curation at the New York Times
Challenges Ahead for Converging Financial Data
Approximate Semantic Matching of Heterogeneous Events
Developing an Sustainable IT Capability: Lessons From Intel's Journey
An Environmental Chargeback for Data Center and Cloud Computing Consumers
Building Optimisation using Scenario Modeling and Linked Data
Using Linked Data and the Internet of Things for Energy Management
Dealing with Semantic Heterogeneity in Real-Time Information

What's hot (20)

PPTX
The Role of Community-Driven Data Curation for Enterprises
PPT
Querying Heterogeneous Datasets on the Linked Data Web
PDF
Linked Building (Energy) Data
PDF
Collaborative Data Management: How Crowdsourcing Can Help To Manage Data
PDF
Crowdsourcing Approaches to Big Data Curation for Earth Sciences
PDF
System of Systems Information Interoperability using a Linked Dataspace
PPT
Big Data Public Private Forum (BIG) @ European Data Forum 2013
PDF
Citizen Actuation For Lightweight Energy Management
PDF
The Big Data Value PPP: A Standardisation Opportunity for Europe
PDF
Enterprise Energy Management using a Linked Dataspace for Energy Intelligence
PDF
Transforming the European Data Economy: A Strategic Research and Innovation A...
PDF
Towards Lightweight Cyber-Physical Energy Systems using Linked Data, the Web ...
PDF
Key Technology Trends for Big Data in Europe
PDF
SLUA: Towards Semantic Linking of Users with Actions in Crowdsourcing
PDF
Linked Water Data For Water Information Management
PDF
Interactive Water Services: The Waternomics Approach
PDF
A Capability Maturity Framework for Sustainable ICT
PDF
From Data Platforms to Dataspaces: Enabling Data Ecosystems for Intelligent S...
PPTX
Crowdsourcing Approaches for Smart City Open Data Management
PDF
Big Data and Big Data Management (BDM) with current Technologies –Review
The Role of Community-Driven Data Curation for Enterprises
Querying Heterogeneous Datasets on the Linked Data Web
Linked Building (Energy) Data
Collaborative Data Management: How Crowdsourcing Can Help To Manage Data
Crowdsourcing Approaches to Big Data Curation for Earth Sciences
System of Systems Information Interoperability using a Linked Dataspace
Big Data Public Private Forum (BIG) @ European Data Forum 2013
Citizen Actuation For Lightweight Energy Management
The Big Data Value PPP: A Standardisation Opportunity for Europe
Enterprise Energy Management using a Linked Dataspace for Energy Intelligence
Transforming the European Data Economy: A Strategic Research and Innovation A...
Towards Lightweight Cyber-Physical Energy Systems using Linked Data, the Web ...
Key Technology Trends for Big Data in Europe
SLUA: Towards Semantic Linking of Users with Actions in Crowdsourcing
Linked Water Data For Water Information Management
Interactive Water Services: The Waternomics Approach
A Capability Maturity Framework for Sustainable ICT
From Data Platforms to Dataspaces: Enabling Data Ecosystems for Intelligent S...
Crowdsourcing Approaches for Smart City Open Data Management
Big Data and Big Data Management (BDM) with current Technologies –Review
Ad

Viewers also liked (8)

PDF
Influenciencia del mundo emocional en el aprendizaje
PDF
Crowdsourcing Approaches to Big Data Curation - Rio Big Data Meetup
PDF
Designing Next Generation Smart City Initiatives: Harnessing Findings And Les...
PDF
Towards Unified and Native Enrichment in Event Processing Systems
PDF
Improving Policy Coherence and Accessibility through Semantic Web Technologie...
PDF
Towards a BIG Data Public Private Partnership
PDF
Sustainable IT for Energy Management: Approaches, Challenges, and Trends
PDF
Open Data Innovation in Smart Cities: Challenges and Trends
Influenciencia del mundo emocional en el aprendizaje
Crowdsourcing Approaches to Big Data Curation - Rio Big Data Meetup
Designing Next Generation Smart City Initiatives: Harnessing Findings And Les...
Towards Unified and Native Enrichment in Event Processing Systems
Improving Policy Coherence and Accessibility through Semantic Web Technologie...
Towards a BIG Data Public Private Partnership
Sustainable IT for Energy Management: Approaches, Challenges, and Trends
Open Data Innovation in Smart Cities: Challenges and Trends
Ad

Similar to Wikipedia (DBpedia): Crowdsourced Data Curation (20)

PPTX
Towards Expertise Modelling for Routing Data Cleaning Tasks within a Communit...
PPTX
Metadata Standards and Organizational Resource Allocation: A Case for the Eff...
PDF
Manfred Linking the Real World
PPT
Envisioning a discussion dashboard for collective intelligence of web convers...
PPTX
WikiSym2012 Deletion Discussions in Wikipedia: Decision Factors and Outcomes
PDF
KMWorld Martin Briefing
PDF
Towards Patient Controlled Privacy
PPTX
Self-service Linked Government Data
PPTX
Data2030 Summit Data Megatrends Turner Sept 2022.pptx
PDF
Down to Business: Taking Action Quickly with Linked Data Services
ODP
Knowledge management on the desktop
PDF
Digital DNA for Organic Enterprises
PPTX
Microsoft Purview Data Governance L100 Pitch Deck.PPTX
PDF
Externalization Trend
PPTX
2018 10 igneous
PPTX
Introduction to Open Data
PPTX
Making sense out of disagreement, University of Limerick Interaction Design C...
PPTX
Towards Social semantic journalism
PPT
Linked Open Data
PDF
Keynote Theatre. Keynote Day 2. 16:30 Evelyn de Souza
Towards Expertise Modelling for Routing Data Cleaning Tasks within a Communit...
Metadata Standards and Organizational Resource Allocation: A Case for the Eff...
Manfred Linking the Real World

Wikipedia (DBpedia): Crowdsourced Data Curation

  • 1. Wikipedia (DBpedia): Crowdsourced Data Curation
      Digital Enterprise Research Institute, www.deri.ie
      Edward Curry, Andre Freitas, Seán O'Riain
      ed.curry@deri.org
      http://www.deri.org/
      http://www.EdwardCurry.org/
      Copyright 2010 Digital Enterprise Research Institute. All rights reserved.
  • 2. Speaker Profile
      - Research Scientist at the Digital Enterprise Research Institute (DERI)
          - Leading international web science research organization
      - Researching how the web of data is changing the way businesses work and interact with information
          - Projects include studies of enterprise linked data, community-based data curation, semantic data analytics, and semantic search
          - Investigating utilization within the pharmaceutical, oil & gas, financial, advertising, media, manufacturing, health care, ICT, and automotive industries
      - Invited speaker at the 2010 MIT Sloan CIO Symposium to an audience of more than 600 CIOs
  • 3. Overview
      - Curation Background
          - The Business Need for Curated Data
          - What is Data Curation?
          - Data Quality and Curation
          - How to Curate Data
      - Wikipedia (DBpedia) Case Study
      - Best Practices from Case Study Learning
  • 4. The Business Need
      - Knowledge workers need:
          - Access to the right information
          - Confidence in that information
      - Working with incomplete, inaccurate, or wrong information can have disastrous consequences
  • 5. The Problems with Data
      - Flawed Data
          - Affects 25% of critical data in the world's top companies (Gartner)
      - Data Quality
          - Recent banking crisis (Economist, Dec '09)
          - Inaccurate figures made it difficult to manage operations (investment exposure and risk)
              - “asset are defined differently in different programs”
              - “numbers did not always add up”
              - “departments do not trust each other's figures”
              - “figures … not worth the pixels they were made of”
  • 6. What is Data Curation?
      - Digital Curation
          - Selection, preservation, maintenance, collection, and archiving of digital assets
      - Data Curation
          - Active management of data over its life-cycle
      - Data Curators
          - Ensure data is trustworthy, discoverable, accessible, reusable, and fit for use
              - Museum cataloguers of the Internet age
  • 7. What is Data Curation?
      - Data Governance
          - Convergence of data quality, data management, business process management, and risk management
      - Data Curation is a complementary activity
          - Part of the overall data governance strategy for an organization
      - Data Curator = Data Steward?
          - Overlapping terms between communities
  • 8. Data Quality and Curation
      - What is Data Quality?
          - Desirable characteristics for an information resource
          - Described as a series of quality dimensions
              - Discoverability, Accessibility, Timeliness, Completeness, Interpretation, Accuracy, Consistency, Provenance & Reputation
      - Data curation can be used to improve these quality dimensions
  • 9. Data Quality and Curation
      - Discoverability & Accessibility
          - Curate to streamline search by storing and classifying data in an appropriate and consistent manner
      - Accuracy
          - Curate to ensure data correctly represents the “real-world” values it models
      - Consistency
          - Curate to ensure data is created and maintained using standardized definitions, calculations, terms, and identifiers
  • 10. Data Quality and Curation
      - Provenance & Reputation
          - Curate to track the source of data and determine its reputation
          - Curate to include the objectivity of the source/producer
              - Is the information unbiased, unprejudiced, and impartial?
              - Or does it come from a reputable but partisan source?
      - Other dimensions are discussed in the book chapter
  • 11. How to Curate Data
      - Data Curation is a large field with sophisticated techniques and processes
      - This section provides a high-level overview of:
          - Should you curate data?
          - Types of Curation
          - Setting up a curation process
      - Additional detail and references available in the book chapter
  • 12. Should You Curate Data?
      - Curation can have multiple motivations
          - Improving accessibility, quality, consistency, …
      - Will the data benefit from curation?
          - Identify the business case
          - Determine if the potential return supports the investment
      - Not all enterprise data should be curated
          - Suits knowledge-centric data rather than transactional operations data
  • 13. Types of Data Curation
      - Multiple approaches to curate data, no single correct way
          - Who?
              - Individual Curators
              - Curation Departments
              - Community-based Curation
          - How?
              - Manual Curation
              - (Semi-)Automated
              - Sheer Curation
  • 14. Types of Data Curation – Who?
      - Individual Data Curators
          - Suitable for a small quantity of infrequently changing data
              - (<1,000 records)
              - Minimal curation effort (minutes per record)
  • 15. Types of Data Curation – Who?
      - Curation Departments
          - Curation experts working with subject matter experts to curate data within a formal process
              - Can deal with large curation effort (thousands of records)
      - Limitations
          - Scalability: Can struggle with large quantities of dynamic data (>1 million records)
          - Availability: Post-hoc nature creates delay in curated data availability
  • 16. Types of Data Curation – Who?
      - Community-Based Data Curation
          - Decentralized approach to data curation
          - Crowd-sourcing the curation process
              - Leverages community of users to curate data
      - Wisdom of the community (crowd)
      - Can scale to millions of records
  • 17. Types of Data Curation – How?
      - Manual Curation
          - Curators directly manipulate data
          - Can tie users up with low-value-add activities
      - (Semi-)Automated Curation
          - Algorithms can (semi-)automate curation activities such as data cleansing, record de-duplication, and classification
          - Can be supervised or approved by human curators
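A minimal sketch of the semi-automated approach on slide 17, assuming a simple string-similarity check is good enough to flag likely duplicates. The function names, the sample records, and the 0.85 threshold are illustrative assumptions, not something the slides specify. The algorithm only proposes merges; a human curator still approves or rejects each one.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name: str) -> str:
    """Basic cleansing: trim, lowercase, and collapse internal whitespace."""
    return " ".join(name.lower().split())


def propose_duplicates(records: list[str], threshold: float = 0.85) -> list[tuple[str, str, float]]:
    """Return candidate duplicate pairs for a human curator to review.

    Nothing is merged automatically; the threshold is an assumed value.
    """
    proposals = []
    for a, b in combinations(records, 2):
        score = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
        if score >= threshold:
            proposals.append((a, b, round(score, 2)))
    return proposals


# Made-up records for illustration only.
records = ["Acme Corp", "ACME  corp", "Globex Inc", "Initech"]
for a, b, score in propose_duplicates(records):
    print(f"review: {a!r} ~ {b!r} (similarity {score})")
```

Keeping the output as a review queue rather than applying merges directly is one way to keep the human curator in the supervising role the slide describes.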
  • 18. Types of Data Curation – How?
      - Sheer curation, or Curation at Source
          - Curation activities integrated in the normal workflow of those creating and managing data
          - Can be as simple as vetting or “rating” the results of a curation algorithm
          - Results can be available immediately
      - Blended Approaches: Best of Both
          - Sheer curation + post hoc curation department
          - Allows immediate access to curated data
          - Ensures quality control with expert curation
  • 19. Setting up a Curation Process
      - 5 steps to set up a curation process:
          1. Identify what data you need to curate
          2. Identify who will curate the data
          3. Define the curation workflow
          4. Identify appropriate data-in & data-out formats
          5. Identify the artifacts, tools, and processes needed to support the curation process
  • 20. Wikipedia
      - The World's Largest Open Digital Curation Community
  • 21. Wikipedia
      - Open-source encyclopedia
          - Collaboratively built by a large community
          - Challenges existing models of content creation
      - More than 19,000,000 articles
          - 270+ languages, 3,200,000+ articles in English
      - More than 157,000 active contributors
      - Studies show accuracy and stylistic formality are equivalent to resources developed in expert-based closed communities
          - i.e. Columbia and Britannica encyclopedias
  • 22. Wikipedia
      - MediaWiki
          - Wiki platform behind Wikipedia
              - Widespread and popular technology
          - Wikis can also support data curation
              - Lowers entry barriers for collaborative data curation
      - Widely used inside organizations
          - Intellipedia, covering 16 U.S. intelligence agencies
          - Wiki Proteins, curated protein data for knowledge discovery and annotation
  • 23. Wikipedia
      - Decentralized environment supports creation of high-quality information with:
          - Social organization
          - Artifacts, tools & processes for cooperative work coordination
      - Wikipedia collaboration dynamics highlight good practices
  • 24. Wikipedia – Social Organization
      - Any user can edit its contents
          - Without prior registration
      - Does not lead to a chaotic scenario
          - In practice, a highly scalable approach for high-quality content creation on the Web
      - Relies on a simple but highly effective way to coordinate its curation process
      - Curation is an activity of Wikipedia admins
          - Responsibility for information quality standards
  • 25. Wikipedia – Social Organization
      - Four main types of accounts:
          - Anonymous users
              - Identified by their associated IP address
          - Registered users
              - Users with an account on the Wikipedia website
          - Administrators/Editors
              - Registered users with additional permissions in the system
              - Access to curation tools
          - Bots
              - Programs that perform repetitive tasks
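The four account types on slide 25 can be sketched as a small permission table. This is an illustrative model only: the action names and the exact permission sets are assumptions made for the example, and do not reproduce MediaWiki's real user-rights configuration.

```python
from enum import Enum, auto


class Account(Enum):
    ANONYMOUS = auto()      # identified only by IP address
    REGISTERED = auto()     # has an account on the site
    ADMINISTRATOR = auto()  # registered user with additional permissions
    BOT = auto()            # program performing repetitive tasks


# Hypothetical permission sets, loosely following the slides:
# anyone may edit, administrators can remove pages and grant admin rights.
PERMISSIONS: dict[Account, set[str]] = {
    Account.ANONYMOUS: {"edit"},
    Account.REGISTERED: {"edit", "watchlist"},
    Account.ADMINISTRATOR: {"edit", "watchlist", "delete_page", "grant_admin"},
    Account.BOT: {"edit"},
}


def can(account: Account, action: str) -> bool:
    """Check whether an account type is allowed to perform an action."""
    return action in PERMISSIONS[account]


print(can(Account.ANONYMOUS, "edit"))
print(can(Account.REGISTERED, "delete_page"))
```

The point of the sketch is the shape of the model: a low barrier to editing for everyone, with critical actions reserved for a small trusted group.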
  • 26. Wikipedia – Social Organization
  • 27. Wikipedia – Social Organization
      - Incentives
          - Improvement of one's reputation
          - Sense of efficacy
              - Contributing effectively to a meaningful project
      - Over time, the focus of editors typically changes
          - From curators of a few articles in specific topics
          - To a more global curation perspective
          - Enforcing quality assessment of Wikipedia as a whole
  • 28. Wikipedia – Artifacts, Tools & Processes
      - Wiki Article Editor (Tool)
          - WYSIWYG or markup text editor
      - Talk Pages (Tool)
          - Public arena for discussions around Wikipedia resources
      - Watchlists (Tool)
          - Help curators actively monitor the integrity and quality of resources they contribute to
      - Permission Mechanisms (Tool)
          - Users with administrator status can perform critical actions such as removing pages and granting administrative permissions to new users
  • 29. Wikipedia – Artifacts, Tools & Processes
      - Automated Editing (Tool)
          - Bots are automated or semi-automated tools that perform repetitive tasks over content
      - Page History and Restore (Tool)
          - Historical trail of changes to a Wikipedia resource
      - Guidelines, Policies & Templates (Artifact)
          - Define curation guidelines for editors to assess article quality
      - Dispute Resolution (Process)
          - Dispute mechanism between editors over the article contents
      - Article Editing, Deletion, Merging, Redirection, Transwiking, Archival (Process)
          - Describe the curation actions over Wikipedia resources
  • 30. Wikipedia – DBpedia
      - DBpedia Knowledge base
          - Inherits massive volume of curated Wikipedia data
          - Built using information from infobox properties
          - Indirectly uses wiki as data curation platform
      - DBpedia provides direct access to data
          - 3.4 million entities and 1 billion RDF triples
          - Comprehensive data infrastructure
              - Concept URIs, definitions, and basic types
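To make "built using infobox properties" concrete, here is a rough sketch of how one infobox could map to RDF-style triples. It assumes the commonly used `http://dbpedia.org/resource/` and `http://dbpedia.org/property/` namespaces and treats every value as a plain literal. The real DBpedia extraction is considerably more involved, and the function name and sample infobox are hypothetical.

```python
from urllib.parse import quote

RESOURCE = "http://dbpedia.org/resource/"
PROPERTY = "http://dbpedia.org/property/"


def infobox_to_triples(title: str, infobox: dict[str, str]) -> list[tuple[str, str, str]]:
    """Map each infobox key/value pair to one (subject, predicate, object) triple."""
    # Wikipedia-style titles use underscores instead of spaces.
    subject = RESOURCE + quote(title.replace(" ", "_"))
    return [(subject, PROPERTY + key, value) for key, value in infobox.items()]


# Sample infobox, for illustration only.
triples = infobox_to_triples("Galway", {"country": "Ireland", "province": "Connacht"})
for s, p, o in triples:
    print(f'<{s}> <{p}> "{o}" .')
```

Every curated infobox edit on the wiki then shows up as a changed triple downstream, which is the sense in which the wiki acts indirectly as the curation platform.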
  • 32. Wikipedia – DBpedia
  • 33. Overview
      - Curation Background
          - The Business Need for Curated Data
          - What is Data Curation?
          - Data Quality and Curation
          - How to Curate Data
      - Wikipedia (DBpedia) Case Study
      - Best Practices from Case Study Learning
  • 34. Best Practices from Case Study Learning
      - Social Best Practices
          - Participation
          - Engagement
          - Incentives
          - Community Governance Models
      - Technical Best Practices
          - Data Representation
          - Human and Automated Curation
          - Track Provenance
  • 35. Social Best Practices
      - Participation
          - Stakeholder involvement for data producers and consumers must occur early in the project
              - Provides insight into basic questions of what they want to do, for whom, and what it will provide
          - White papers are an effective means to present these ideas and solicit opinion from the community
              - Can be used to establish an informal 'social contract' for the community
  • 36. Social Best Practices
      - Engagement
          - Outreach activities essential for promotion and feedback
          - Typical consumers-to-contributors ratios of less than 5%
          - Social communication and networking forums are useful
              - Majority of community may not communicate using these media
              - Communication by email still remains important
  • 37. Social Best Practices
      - Incentives
          - Sheer curation needs a line of sight from the data curating activity to tangible exploitation benefits
          - Lack of awareness of the value proposition will slow emergence of collaborative contributions
          - Recognizing contributing curators through a formal feedback mechanism
              - Reinforces contribution culture
              - Directly increases output quality
  • 38. Social Best Practices
      - Community Governance Models
          - Effective governance structure is vital to ensure success of the community
          - Internal communities and consortia perform well when they leverage traditional corporate and democratic governance models
          - Open communities need to engage the community within the governance process
              - Follow less orthodox approaches using meritocratic and autocratic principles
  • 39. Technical Best Practices
      - Data Representation
          - Must be robust and standardized to encourage community usage and tools development
          - Support for legacy data formats and ability to translate data forward to support new technology and standards
      - Human & Automated Curation
          - Balancing will improve data quality
          - Automated curation should always defer to, and never override, human curation edits
              - Automate validating data deposition and entry
              - Target community at focused curation tasks
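The rule that automated curation should defer to human edits can be sketched as a merge step that skips any field last touched by a person. The `FieldValue` type, the `edited_by` labels, and the sample record are assumptions made for this example, not part of the original material.

```python
from dataclasses import dataclass


@dataclass
class FieldValue:
    value: str
    edited_by: str  # "human" or "bot" (assumed labels)


def apply_bot_edits(record: dict[str, FieldValue], bot_edits: dict[str, str]) -> dict[str, FieldValue]:
    """Apply automated edits, leaving human-curated fields untouched."""
    merged = dict(record)
    for name, new_value in bot_edits.items():
        current = merged.get(name)
        if current is not None and current.edited_by == "human":
            continue  # defer to the human curator's edit
        merged[name] = FieldValue(new_value, "bot")
    return merged


# Made-up record for illustration only.
record = {
    "name": FieldValue("Acme Corporation", "human"),
    "country": FieldValue("IE", "bot"),
}
result = apply_bot_edits(record, {"name": "ACME CORP", "country": "Ireland", "sector": "Manufacturing"})
for name, fv in result.items():
    print(name, "=", fv.value, f"({fv.edited_by})")
```

A bot can still fill gaps and refresh its own earlier output, but a value set by a person stays until a person changes it.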
  • 40. Technical Best Practices
      - Track Provenance
          - All curation activities should be recorded and maintained as part of the data provenance effort
              - Especially where human curators are involved
          - Users can have different perspectives of provenance
              - A scientist may need to evaluate the fine-grained experiment description behind the data
              - For a business analyst, the 'brand' of the data provider can be sufficient for determining quality
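One way to record curation activities is an append-only provenance log, sketched below. The class and field names are hypothetical. The two query methods mirror the two perspectives on the slide: `history` gives the fine-grained trail a scientist might want, while `sources` gives the provider-level view that may be enough for a business analyst.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProvenanceEntry:
    record_id: str
    action: str       # e.g. "create", "edit", "merge"
    curator: str      # person or bot that performed the action
    source: str       # where the data came from
    timestamp: datetime


@dataclass
class ProvenanceLog:
    entries: list[ProvenanceEntry] = field(default_factory=list)

    def record(self, record_id: str, action: str, curator: str, source: str) -> ProvenanceEntry:
        """Append one curation activity; existing entries are never modified."""
        entry = ProvenanceEntry(record_id, action, curator, source, datetime.now(timezone.utc))
        self.entries.append(entry)
        return entry

    def history(self, record_id: str) -> list[ProvenanceEntry]:
        """Fine-grained view: every recorded activity for one record, in order."""
        return [e for e in self.entries if e.record_id == record_id]

    def sources(self, record_id: str) -> set[str]:
        """Coarse view: just the providers that contributed to a record."""
        return {e.source for e in self.history(record_id)}


# Made-up activities for illustration only.
log = ProvenanceLog()
log.record("protein-42", "create", "lab-import-bot", "Lab A")
log.record("protein-42", "edit", "curator-jane", "Lab A")
log.record("protein-77", "create", "curator-raj", "Lab B")
print(len(log.history("protein-42")), log.sources("protein-42"))
```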
  • 41. Conclusions
      - Data curation can ensure the quality of data and its fitness for use
      - Pre-competitive data can be shared without conferring a commercial advantage
      - Pre-competitive data communities
          - Common curation tasks carried out once in the public domain
          - Reduces cost, increases quantity and quality
  • 42. Acknowledgements
      - Collaborators Andre Freitas & Seán O'Riain
      - Insight from Thought Leaders
          - Evan Sandhaus (Semantic Technologist), Rob Larson (Vice President Product Development and Management), and Gregg Fenton (Director Emerging Platforms) from the New York Times
          - Krista Thomas (Vice President, Marketing & Communications), Tom Tague (OpenCalais initiative Lead) from Thomson Reuters
          - Antony Williams (VP of Strategic Development) from ChemSpider
          - Helen Berman (Director), John Westbrook (Product Development) from the Protein Data Bank
          - Nick Lynch (Architect with AstraZeneca) from the Pistoia Alliance
      - The work presented has been funded by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2).
  • 43. Further Information
      - The Role of Community-Driven Data Curation for Enterprises
      - Edward Curry, Andre Freitas, & Seán O'Riain
      - In David Wood (ed.), Linking Enterprise Data, Springer, 2010.
      - Available free at: http://3roundstones.com/led_book/led-curry-et-al.html