Collaborative Data Management: How
Crowdsourcing Can Help To Manage Data
Edward Curry
Enterprise Data World 2013
Digital Enterprise Research Institute www.deri.ie
Enabling networked knowledge
Overview
• Problems with Data
  – Master Data Management
• Crowdsourcing
• Collaborative Data Management
• Setting up a CDM Process
• Future Directions
The Problems with Data
Knowledge workers need:
  – Access to the right data
  – Confidence in that data
Flawed data affects 25% of critical data in the world’s top companies
The role of data quality in the recent financial crisis:
  – “Assets are defined differently in different programs”
  – “Numbers did not always add up”
  – “Departments do not trust each other’s figures”
  – “Figures … not worth the pixels they were made of”
Master Data Management
• Master Data Management is a process that can improve data quality
• What is Data Quality?
  – Desirable characteristics for an information resource
  – Described as a series of quality dimensions
    – Discoverability, Accessibility, Timeliness, Completeness, Interpretation, Accuracy, Consistency, Provenance & Reputation
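One of the quality dimensions listed above, completeness, can be made concrete with a small calculation. The following is a minimal sketch, not a method from the talk: the records, the field names and the `completeness` helper are all hypothetical, and completeness is assumed here to mean the share of cells that hold a value.

```python
# Hypothetical product records; None marks a missing value.
records = [
    {"id": "APNR", "name": "iPod Nano", "color": "Red", "price": 150},
    {"id": "APNS", "name": "iPod Nano", "color": None, "price": 160},
    {"id": "IPN890", "name": "iPod Nano", "color": None, "price": None},
]

FIELDS = ["id", "name", "color", "price"]


def completeness(rows, fields):
    """Fraction of (row, field) cells that hold a non-missing value."""
    total = len(rows) * len(fields)
    if total == 0:
        return 1.0  # assumption: an empty data set counts as complete
    filled = sum(1 for row in rows for f in fields if row.get(f) is not None)
    return filled / total


score = completeness(records, FIELDS)
print(f"completeness = {score:.2f}")
```

Other dimensions, such as consistency or timeliness, would need their own definitions; this sketch only covers one.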
Master Data Management
[Diagram: Product Data sources pass through Data Quality steps (Profile Sources, Define Mappings, Cleanse, Enrich, De-duplicate, Define Rules) to produce Master Data for Applications and Business Users; roles shown: Data Developer, Data Steward, Data Governance.]
Data Quality
Source A
ID    PNAME      PCOLOR  PRICE
APNR  iPod Nano  Red     150
APNS  iPod Nano  Silver  160
Source B
<Product name="iPod Nano">
  <Items>
    <Item code="IPN890">
      <price>150</price>
      <generation>5</generation>
    </Item>
  </Items>
</Product>
Schema Difference?
APNR       iPod Nano  Red     150
APNR       iPod Nano  Silver  160
iPod Nano  IPN890     150     5
Value Conflicts?
Entity Duplication?
Roles: Data Developer, Data Steward, Business Users
Expertise labels: Technical, (Technical) Domain, Domain
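The three questions on this slide (schema difference, value conflicts, entity duplication) can be sketched in code. This is an illustrative sketch only: the unified field names (`code`, `name`, `color`, `price`, `generation`), the helper functions, and the choice to group records by product name are assumptions, not part of the talk. The data values are those shown for Source A and Source B.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Source A: flat rows (values from the slide).
SOURCE_A = [
    {"ID": "APNR", "PNAME": "iPod Nano", "PCOLOR": "Red", "PRICE": 150},
    {"ID": "APNS", "PNAME": "iPod Nano", "PCOLOR": "Silver", "PRICE": 160},
]

# Source B: nested XML (values from the slide).
SOURCE_B = """
<Product name="iPod Nano">
  <Items>
    <Item code="IPN890">
      <price>150</price>
      <generation>5</generation>
    </Item>
  </Items>
</Product>
"""


def from_source_a(row):
    """Map a Source A row onto the (hypothetical) unified schema."""
    return {"code": row["ID"], "name": row["PNAME"], "color": row["PCOLOR"],
            "price": row["PRICE"], "generation": None}


def from_source_b(xml_text):
    """Map each Source B <Item> onto the same unified schema."""
    product = ET.fromstring(xml_text)
    for item in product.iter("Item"):
        yield {"code": item.get("code"), "name": product.get("name"),
               "color": None,
               "price": int(item.findtext("price")),
               "generation": int(item.findtext("generation"))}


def duplicate_candidates(rows):
    """Group records that share a name: candidates for entity duplication."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["name"].lower()].append(row)
    return {name: group for name, group in groups.items() if len(group) > 1}


def distinct_values(group, field):
    """Distinct non-missing values of one field; more than one is a conflict."""
    return {row[field] for row in group if row[field] is not None}


unified = [from_source_a(r) for r in SOURCE_A] + list(from_source_b(SOURCE_B))
candidates = duplicate_candidates(unified)
price_values = distinct_values(candidates["ipod nano"], "price")
print(price_values)
```

Whether the three records describe one product or three is a domain question the code cannot settle, which is where the Data Steward and Business Users come in.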
Master Data Management
• Pros
  – Can create a single version of truth
  – Standardized information creation and management
  – Improves data quality
• Cons
  – Significant upfront costs and effort
  – Participation limited to a few (mostly) technical experts
  – Difficult to scale to large data sources
    – Extended enterprise, e.g. partners, data vendors
  – Small % of data under management (e.g. CRM, Product, …)
Enterprise Data Landscape
• The Managed (MDM): reference data managed through well-defined policies and a governance council
• The Known (Enterprise Data): data directly managed by the enterprise and its departments
• The Reality (Relevant External Data): all data relevant to the enterprise and its operations
CROWDSOURCING
Crowdsourcing Industry Landscape
Introduction to Crowdsourcing
• Coordinating a crowd (a large group of workers) to do micro-work (small tasks) that solves problems (that computers or a single user can’t)
• A collection of mechanisms and associated methodologies for scaling and directing crowd activities to achieve goals
• Related Areas
  – Collective Intelligence
  – Social Computing
  – Human Computation
  – Data Mining
When Computers Were Human
• Maskelyne 1760
  – Used human computers to create an almanac of moon positions
    – Used for shipping/navigation
  – Quality assurance
    – Do calculations twice
    – Compare to third verifier
Human vs Machine Affordances
Human
  ✓ Visual perception
  ✓ Visuospatial thinking
  ✓ Audiolinguistic ability
  ✓ Sociocultural awareness
  ✓ Creativity
  ✓ Domain knowledge
Machine
  ✓ Large-scale data manipulation
  ✓ Collecting and storing large amounts of data
  ✓ Efficient data movement
  ✓ Bias-free analysis
When to Crowdsource?
• Computers cannot do the task
• Single person cannot do the task
• Work can be split into smaller tasks
Tag a Tune
Peekaboom
Foldit
ReCaptcha
• OCR
  – ~1% error rate
  – 20%-30% for 18th and 19th century books
• 40 million ReCAPTCHAs every day (2008)
  – Fixing 40,000 books a day
Generic Architecture
[Diagram: Requestors and Workers interact through a Platform/Marketplace (Publish Task, Task Management) in four numbered steps.]
Amazon Mechanical Turk
CrowdFlower
COLLABORATIVE DATA MANAGEMENT
• Collaborative knowledge base maintained by community of web users
• Users create entity types and their meta-data according to guidelines
• Requires administrative approvals for schema changes by end users
Wikipedia
• Collaboratively built by a large community
  – More than 19,000,000 articles, 270+ languages, 3,200,000+ articles in English
  – More than 157,000 active contributors
• Accuracy and stylistic formality are equivalent to expert-based resources
  – e.g. Columbia and Britannica encyclopedias
• MediaWiki
  – Software behind Wikipedia
  – Widely used inside organizations
  – Intellipedia: 16 U.S. intelligence agencies
  – Wiki Proteins: curated protein data for knowledge discovery
DBpedia Knowledge Base
• DBpedia provides direct access to data
  – Indirectly uses the wiki as a data curation platform
  – Inherits a massive volume of curated Wikipedia data
  – 3.4 million entities and 1 billion RDF triples
  – Comprehensive data infrastructure
    – Concept URIs
    – Definitions
    – Basic types
A Bottom-up Approach to MDM
Engage More Human Workers to Collaboratively Manage Enterprise Data
[Chart: axes are Number of Participants (10s-100s to 10,000s-100,000s) and Data Control (Top-down to Bottom-up); regions labelled MDM and Collaborative Enterprise Data Management.]
Emerging Enterprise Data Landscape
• The Managed (MDM): reference data managed through well-defined policies and a governance council
• The Known (Enterprise Data): data directly managed by the enterprise and its departments
• The Reality (Relevant External Data): all data relevant to the enterprise and its operations
• Collaboratively Managed
Algorithm + Crowd
[Diagram: Data Sources feed Data Quality Algorithms and Human Computation to produce Clean Data; participants shown: Developers, Data Governance, Internal Community, External Crowd.]
Examples of CDM Tasks
• Understanding customer sentiment for the launch of a new product around the world
• Implemented a 24/7 sentiment analysis system with workers from around the world
• Categorize millions of products in eBay’s catalog with accurate and complete attributes
• Combine the crowd with machine learning to create an affordable and flexible catalog quality system
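One common way to combine the crowd with machine learning is to accept machine labels only when the model is confident and to send the rest to human workers. The sketch below illustrates that idea under stated assumptions: the `Prediction` type, the example items, the confidence scores and the 0.9 threshold are all hypothetical, and the talk does not say that any particular system works this way.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    item: str
    category: str
    confidence: float  # 0.0 to 1.0, as reported by a hypothetical classifier


def split_by_confidence(predictions, threshold=0.9):
    """Accept confident machine labels; queue the rest for crowd review."""
    accepted, crowd_queue = [], []
    for p in predictions:
        if p.confidence >= threshold:
            accepted.append(p)
        else:
            crowd_queue.append(p)
    return accepted, crowd_queue


predictions = [
    Prediction("iPod Nano 8GB Red", "MP3 Players", 0.97),
    Prediction("Nano silicone case", "MP3 Players", 0.55),
    Prediction("USB cable 1m", "Cables", 0.92),
]

accepted, crowd_queue = split_by_confidence(predictions)
print([p.item for p in crowd_queue])
```

Raising the threshold sends more items to the crowd, which trades cost for quality; crowd answers could later be used as training data for the model.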
Examples of CDM Tasks
• Natural Language Processing
  – Dialect Identification, Spelling Correction, Machine Translation, Word Similarity
• Computer Vision
  – Image Similarity, Image Annotation/Analysis
• Classification
  – Data attributes, improving taxonomy, search results
• Verification
  – Entity consolidation, de-duplication, cross-checking, validating data
• Enrichment
  – Judgments, annotation
SETTING UP A CDM PROCESS
Core Design Questions of CDM
• What (Goal)
• Who (Workers)
• Why (Incentives)
• How (Process)
Malone, T. W., Laubacher, R., & Dellarocas, C. N. Harnessing crowds: Mapping the genome of collective intelligence. MIT Sloan Research Paper 4732-09, (2009).
Who is doing it? (Workers)
• Hierarchy (Assignment)
  – Someone in authority assigns a particular person or group of people to perform the task
  – Within the Enterprise
• Crowd (Choice)
  – Anyone in a large group who chooses to do so
  – Internal or External Crowds
Why are they doing it? (Incentives)
• Motivation
  – Money ($$££)
  – Glory (reputation/prestige)
  – Love (altruism, socializing, enjoyment)
  – Unintended by-product (e.g. re-Captcha, captured in workflow)
  – Self-serving resources (e.g. Wikipedia, product/customer data)
• Determine pay and time for each task
  – Marketplace: Delicate balance
    – Money does not improve quality but can increase participation
  – Internal Hierarchy: Engineering opportunities for recognition
    – Performance review, prizes for top contributors, badges, leaderboards, etc.
Effect of Payment on Quality
• Cost does not affect quality [Mason and Watts, 2009, AdSafe]
• Similar results for bigger tasks [Ariely et al, 2009]
[Panos Ipeirotis. WWW2011 tutorial]
What is being done? (Goal)
• Creation Tasks
  – Create/Generate
  – Find
  – Improve / Edit / Fix
• Decision (Vote) Tasks
  – Accept / Reject
  – Thumbs Up / Thumbs Down
  – Vote for Best
How is it being done? (How)
• Tasks integrated in the normal workflow of those creating and managing data
  – As simple as vetting or “rating” the results of an algorithm
• Task Design
  – Task Interface
  – Task Assignment/Routing
  – Task Quality Assurance
Task Design
Input
• Task Router: before computation
• Task Interface: during computation
• Output Aggregation: after computation
Output
* Edith Law and Luis von Ahn, Human Computation - Core Research Questions and State of the Art
Pull Routing
• Workers seek tasks and assign them to themselves
  – Search and discovery of tasks supported by the platform
  – Task Recommendation
  – Peer Routing
[Diagram labels: Workers, Tasks, Select, Algorithm, Search & Browse Interface, Result]
Push Routing
• System assigns tasks to workers based on:
  – Past performance
  – Expertise
  – Cost
  – Latency
[Diagram labels: Workers, Tasks, Assign, Algorithm, Task Interface, Result]
* www.mobileworks.com
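Push routing can be sketched as a scoring function over the four criteria named on this slide. Everything specific below is an assumption made for illustration: the `Worker` fields, the linear weighting, the weight values and the normalisation limits are hypothetical and do not describe any real platform.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    name: str
    accuracy: float                           # past performance, 0..1
    skills: set = field(default_factory=set)  # expertise tags
    cost: float = 0.0                         # price per task
    latency: float = 0.0                      # expected minutes to respond


def score(worker, task_skill, max_cost, max_latency,
          weights=(0.4, 0.3, 0.2, 0.1)):
    """Weighted sum of the four criteria; the weights are illustrative only."""
    w_acc, w_skill, w_cost, w_lat = weights
    expertise = 1.0 if task_skill in worker.skills else 0.0
    cheapness = 1.0 - min(worker.cost / max_cost, 1.0)
    speed = 1.0 - min(worker.latency / max_latency, 1.0)
    return (w_acc * worker.accuracy + w_skill * expertise
            + w_cost * cheapness + w_lat * speed)


def assign(task_skill, workers, max_cost=1.0, max_latency=60.0):
    """Push the task to the highest-scoring worker."""
    return max(workers,
               key=lambda w: score(w, task_skill, max_cost, max_latency))


workers = [
    Worker("ana", 0.95, {"products"}, cost=0.50, latency=30),
    Worker("ben", 0.80, {"addresses"}, cost=0.10, latency=5),
    Worker("cho", 0.60, {"products"}, cost=0.20, latency=10),
]

print(assign("products", workers).name)
print(assign("addresses", workers).name)
```

A production router would also need to handle worker availability and load balancing, which this sketch leaves out.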
Managing Task Quality Assurance
• Redundancy: Quorum Votes
  – Replicate the task (e.g. 3 times)
  – Use majority voting to determine the right value (% agreement)
  – Weighted majority vote
• Gold Data / Honey Pots
  – Inject trap questions to test quality
  – Worker fatigue check (habit of saying no all the time)
• Estimation of Worker Quality
  – Redundancy plus gold data
• Qualification Test
  – Use test tasks to determine users’ ability for such tasks
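The redundancy and gold-data techniques listed on this slide can be combined in a few lines. The following is a minimal sketch, assuming that worker quality is estimated as plain accuracy on gold questions and that each vote is weighted by that accuracy; the worker names, questions, labels and the default weight of 0.5 are invented for the example.

```python
from collections import Counter, defaultdict


def majority_vote(answers):
    """answers: list of (worker, label). Returns (label, share of votes)."""
    counts = Counter(label for _, label in answers)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(answers)


def worker_accuracy(gold_answers, gold_truth):
    """Estimate accuracy per worker from answers to gold (trap) questions.

    gold_answers: list of (worker, question, label).
    """
    right, seen = Counter(), Counter()
    for worker, question, label in gold_answers:
        seen[worker] += 1
        right[worker] += int(label == gold_truth[question])
    return {w: right[w] / seen[w] for w in seen}


def weighted_majority_vote(answers, accuracy, default=0.5):
    """Each vote counts in proportion to the worker's estimated accuracy."""
    weights = defaultdict(float)
    for worker, label in answers:
        weights[label] += accuracy.get(worker, default)
    return max(weights, key=weights.get)


gold_truth = {"g1": "yes", "g2": "no"}
gold_answers = [
    ("w1", "g1", "yes"), ("w1", "g2", "no"),
    ("w2", "g1", "no"), ("w2", "g2", "no"),   # always answers "no"
    ("w3", "g1", "no"), ("w3", "g2", "yes"),
]
accuracy = worker_accuracy(gold_answers, gold_truth)

# One real task replicated three times.
answers = [("w1", "duplicate"), ("w2", "distinct"), ("w3", "distinct")]
plain = majority_vote(answers)
weighted = weighted_majority_vote(answers, accuracy)
print(plain, weighted)
```

In this example the plain majority picks "distinct", while the weighted vote follows the one worker who answered every gold question correctly and picks "duplicate".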
Future Directions
• Task Management
  – Task assignment, payment, routing
    – Optimizing for Cost, Quality, Completion Time
• Human–Computer Interaction
  – Payment / incentives
  – User interface and interaction design
  – Worker reputation, recruitment, retention
• Quality Control
  – Trust, reliability, spam detection, consensus
Summary
• Collaborative Data Management
  – Emerging trend for data management in the enterprise
  – Crowdsourcing + Micro Tasks
  – A number of emerging platforms to assist
[Diagram: Dirty Data passes through Data Quality Algorithms and Human Computation to become Clean Data.]
About the Presenter
Edward is a research scientist at the Digital Enterprise Research
Institute. His areas of research include green IT/IS, energy informatics,
linked data, integrated reporting, and cloud computing.
He has worked extensively with industry and government advising on
the adoption patterns, practicalities and benefits of new technologies.
He has published in leading journals and books, and has spoken at
international conferences including the MIT CIO Symposium.
URL: www.edwardcurry.org
Email: edcurry@acm.org
Twitter: @EdwardACurry
Slides: slideshare.net/edwardcurry
Selected References
• Big Data & Data Quality
  – S. Lavalle, E. Lesser, R. Shockley, M. S. Hopkins, and N. Kruschwitz, “Big Data, Analytics and the Path from Insights to Value,” MIT Sloan Management Review, vol. 52, no. 2, pp. 21–32, 2011.
  – A. Haug and J. S. Arlbjørn, “Barriers to master data quality,” Journal of Enterprise Information Management, vol. 24, no. 3, pp. 288–303, 2011.
  – R. Silvola, O. Jaaskelainen, H. Kropsu-Vehkapera, and H. Haapasalo, “Managing one master data – challenges and preconditions,” Industrial Management & Data Systems, vol. 111, no. 1, pp. 146–162, 2011.
  – E. Curry, S. Hasan, and S. O’Riain, “Enterprise Energy Management using a Linked Dataspace for Energy Intelligence,” in Second IFIP Conference on Sustainable Internet and ICT for Sustainability, 2012.
  – D. Loshin, Master Data Management. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2008.
  – B. Otto and A. Reichert, “Organizing Master Data Management: Findings from an Expert Survey,” in Proceedings of the 2010 ACM Symposium on Applied Computing - SAC ’10, 2010, pp. 106–110.
Selected References
• Collective Intelligence, Crowdsourcing & Human Computation
  – A. Doan, R. Ramakrishnan, and A. Y. Halevy, “Crowdsourcing systems on the World-Wide Web,” Communications of the ACM, vol. 54, no. 4, p. 86, Apr. 2011.
  – E. Law and L. von Ahn, “Human Computation,” Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 5, no. 3, pp. 1–121, Jun. 2011.
  – M. J. Franklin, D. Kossmann, T. Kraska, S. Ramesh, and R. Xin, “CrowdDB: Answering Queries with Crowdsourcing,” in Proceedings of the 2011 international conference on Management of data - SIGMOD ’11, 2011, p. 61.
  – P. Wichmann, A. Borek, R. Kern, P. Woodall, A. K. Parlikad, and G. Satzger, “Exploring the ‘Crowd’ as Enabler of Better Information Quality,” in Proceedings of the 16th International Conference on Information Quality, 2011, pp. 302–312.
  – W. A. Mason and D. J. Watts, “Financial incentives and the ‘performance of crowds’,” SIGKDD Explorations, vol. 11, no. 2, pp. 100–108, 2009.
  – P. Ipeirotis, “Managing Crowdsourced Human Computation,” WWW2011 Tutorial.
  – O. Alonso and M. Lease, “Crowdsourcing 101: Putting the WSDM of Crowds to Work for You,” WSDM, Hong Kong, 2011.
  – When Computers Were Human: http://www.youtube.com/watch?v=YwqltwvPnkw
Selected References
• Collaborative Data Management
  – E. Curry, A. Freitas, and S. O’Riain, “The Role of Community-Driven Data Curation for Enterprises,” in Linking Enterprise Data, D. Wood, Ed. Boston, MA: Springer US, 2010, pp. 25–47.
  – U. ul Hassan, S. O’Riain, and E. Curry, “Towards Expertise Modelling for Routing Data Cleaning Tasks within a Community of Knowledge Workers,” in 17th International Conference on Information Quality (ICIQ 2012), Paris, France, 2012.
  – U. ul Hassan, S. O’Riain, and E. Curry, “Effects of Expertise Assessment on the Quality of Task Routing in Human Computation,” in 2nd International Workshop on Social Media for Crowdsourcing and Human Computation, Paris, France, 2013.
  – U. ul Hassan, S. O’Riain, and E. Curry, “Leveraging Matching Dependencies for Guided User Feedback in Linked Data Applications,” in 9th International Workshop on Information Integration on the Web (IIWeb2012), Scottsdale, Arizona: ACM, 2012.
Collaborative Data Management: How Crowdsourcing Can Help To Manage Data

  • 1. Collaborative Data Management: How Crowdsourcing Can Help To Manage Data Edward Curry Enterprise Data World 2013
  • 2. Digital Enterprise Research Institute www.deri.ie Enabling networked knowledge n  Problems with Data ¨ Master Data Management n  Crowdsourcing n  Collaborative Data Management n  Setting up a CDM Process n  Future Directions Overview
  • 3. Digital Enterprise Research Institute www.deri.ie Enabling networked knowledge The Problems with Data Knowledge Workers need: ¨  Access to the right data ¨  Confidence in that data Flawed data effects 25% of critical data in world’s top companies Data quality role in recent financial crisis: ¨  “Asset are defined differently in different programs” ¨  “Numbers did not always add up” ¨  “Departments do not trust each other’s figures” ¨  “Figures … not worth the pixels they were made of”
  • 4. Digital Enterprise Research Institute www.deri.ie Enabling networked knowledge n  Master Data Management is a process that can improve data quality n  What is Data Quality? ¨ Desirable characteristics for information resource ¨ Described as a series of quality dimensions –  Discoverability, Accessibility, Timeliness, Completeness, Interpretation, Accuracy, Consistency, Provenance & Reputation Master Data Management
  • 5. Digital Enterprise Research Institute www.deri.ie Enabling networked knowledge Data Quailty Master Data Management Profile Sources Define Mappings Cleans Enrich De-duplicate Define Rules Master Data Data Developer Data Steward Data Governance Business Users Applications Product DataProduct Data
  • 6. Digital Enterprise Research Institute www.deri.ie Enabling networked knowledge Data Quality 6   ID PNAME PCOLOR PRICE APNR iPod Nano Red 150 APNS iPod Nano Silver 160 <Product  name=“iPod  Nano”>        <Items>                  <Item  code=“IPN890”>                              <price>150</price>                              <genera?on>5</genera?on>                  </Item>          </Items>   </Product>   Source A Source B Schema Difference? Data Developer APNR   iPod  Nano   Red   150   APNR   iPod  Nano   Silver   160   iPod  Nano   IPN890   150   5   Value Conflicts? Entity Duplication? Data Steward Business Users ? Technical Domain (Technical) Domain
  • 7. Digital Enterprise Research Institute www.deri.ie Enabling networked knowledge n  Pros ¨  Can create a single version of truth ¨  Standardized information creation and management ¨  Improves data quality n  Cons ¨  Significant upfront costs and efforts ¨  Participation limited to few (mostly) technical experts ¨  Difficult to scale for large data sources –  Extended Enterprise e.g. partner, data vendors ¨  Small % of data under management (i.e. CRM, Product, …) Master Data Management
  • 8. Digital Enterprise Research Institute www.deri.ie Enabling networked knowledge Enterprise Data Landscape The Managed 8 Reference data managed through well define policies and governance council Data directly managed by enterprise and its departments All data relevant to enterprise and its operationsThe Reality The Known MDM Enterprise Data Relevant External Data
  • 9. Digital Enterprise Research Institute www.deri.ie Enabling networked knowledge CROWDSOURCING
  • 10. Digital Enterprise Research Institute www.deri.ie Enabling networked knowledge Crowdsourcing Industry Landscape
  • 11. Digital Enterprise Research Institute www.deri.ie Enabling networked knowledge n  Coordinating a crowd (a large group of workers)to do micro-work (small tasks) that solves problems (that computers or a single user can’t) n  A collection of mechanisms and associated methodologies for scaling and directing crowd activities to achieve goals n  Related Areas ¨  Collective Intelligence ¨  Social Computing ¨  Human Computation ¨  Data Mining Introduction to Crowdsourcing
  • 12. Digital Enterprise Research Institute www.deri.ie Enabling networked knowledge n  Maskelyne 1760 ¨ Used human computers to created almanac of moon positions – Used for shipping/ navigation ¨ Quality assurance – Do calculations twice – Compare to third verifier When Computers Were Human
  • 13. Digital Enterprise Research Institute www.deri.ie Enabling networked knowledge When Computers Were Human
  • 14. Digital Enterprise Research Institute www.deri.ie Enabling networked knowledge Human ü Visual perception ü Visuospatial thinking ü Audiolinguistic ability ü Sociocultural awareness ü Creativity ü Domain knowledge Machine ü Large-scale data manipulation ü Collecting and storing large amounts of data ü Efficient data movement ü Bias-free analysis Human vs Machine Affordances
  • 15. When to Crowdsource? Computers cannot do the task; a single person cannot do the task; the work can be split into smaller tasks.
  • 16. Tag a Tune
  • 17. Peekaboom
  • 18. Foldit
  • 19. ReCaptcha. OCR: ~1% error rate; 20%-30% for 18th and 19th century books. “40 million ReCAPTCHAs every day” (2008): fixing 40,000 books a day.
  • 20. Generic Architecture (diagram): Workers; Platform/Marketplace (Publish Task, Task Management); Requestors.
  • 21. Amazon Mechanical Turk
  • 22. CrowdFlower
  • 23. COLLABORATIVE DATA MANAGEMENT
  • 24. Collaborative knowledge base maintained by a community of web users. Users create entity types and their meta-data according to guidelines. Requires administrative approvals for schema changes by end users.
  • 28. Wikipedia. Collaboratively built by a large community: more than 19,000,000 articles, 270+ languages, 3,200,000+ articles in English; more than 157,000 active contributors. Accuracy and stylistic formality are equivalent to expert-based resources, i.e. Columbia and Britannica encyclopedias. MediaWiki: software behind Wikipedia; widely used inside organizations; Intellipedia: 16 U.S. intelligence agencies; Wiki Proteins: curated protein data for knowledge discovery.
  • 29. DBpedia Knowledge Base. DBpedia provides direct access to data: indirectly uses the wiki as a data curation platform; inherits a massive volume of curated Wikipedia data; 3.4 million entities and 1 billion RDF triples; comprehensive data infrastructure (concept URIs, definitions, basic types).
  • 31. A Bottom-up Approach to MDM. Engage more human workers to collaboratively manage enterprise data. (Chart: data control, from top-down to bottom-up, against number of participants; MDM at 10s-100s, Collaborative Enterprise Data Management at 10,000s-100,000s.)
  • 32. Emerging Enterprise Data Landscape (diagram). The Managed (MDM): reference data managed through well-defined policies and a governance council. The Known (Enterprise Data): data directly managed by the enterprise and its departments. The Reality (Relevant External Data): all data relevant to the enterprise and its operations. Additional label: Collaboratively Managed.
  • 33. Algorithm + Crowd (diagram): Data Sources; Data Quality Algorithms; Human Computation; Clean Data. Participants: Developers, Data Governance, Internal Community, External Crowd.
  • 34. Examples of CDM Tasks. Understanding customer sentiment for the launch of a new product around the world. Implemented a 24/7 sentiment analysis system with workers from around the world. Categorize millions of products in eBay’s catalog with accurate and complete attributes. Combine the crowd with machine learning to create an affordable and flexible catalog quality system.
  • 35. Examples of CDM Tasks. Natural Language Processing: dialect identification, spelling correction, machine translation, word similarity. Computer Vision: image similarity, image annotation/analysis. Classification: data attributes, improving taxonomy, search results. Verification: entity consolidation, de-duplication, cross-checking, validating data. Enrichment: judgments, annotation.
  • 36. SETTING UP A CDM PROCESS
  • 37. Core Design Questions of CDM: What (Goal), Why (Incentives), Who (Workers), How (Process). Source: Malone, T. W., Laubacher, R., & Dellarocas, C. N. Harnessing crowds: Mapping the genome of collective intelligence. MIT Sloan Research Paper 4732-09, (2009).
  • 38. Who is doing it? (Workers). Hierarchy (Assignment): someone in authority assigns a particular person or group of people to perform the task; within the enterprise. Crowd (Choice): anyone in a large group who chooses to do so; internal or external crowds.
  • 39. Why are they doing it? (Incentives). Motivation: money ($$££); glory (reputation/prestige); love (altruism, socializing, enjoyment); unintended by-product (e.g. ReCAPTCHA, captured in workflow); self-serving resources (e.g. Wikipedia, product/customer data). Determine pay and time for each task. Marketplace: a delicate balance; money does not improve quality but can increase participation. Internal hierarchy: engineering opportunities for recognition (performance review, prizes for top contributors, badges, leaderboards, etc.).
  • 40. Effect of Payment on Quality. Cost does not affect quality [Mason and Watts, 2009, AdSafe]. Similar results for bigger tasks [Ariely et al, 2009]. [Panos Ipeirotis. WWW2011 tutorial]
  • 41. What is being done? (Goal). Creation tasks: create/generate; find; improve/edit/fix. Decision (vote) tasks: accept/reject; thumbs up/thumbs down; vote for best.
  • 42. How is it being done? (How). Tasks are integrated into the normal workflow of those creating and managing data; this can be as simple as vetting or “rating” the results of an algorithm. Task design: task interface, task assignment/routing, task quality assurance.
  • 43. Task Design (diagram): Input; Task Router (before computation); Task Interface (during computation); Output Aggregation (after computation); Output. Source: Edith Law and Luis von Ahn, Human Computation - Core Research Questions and State of the Art.
  • 44. Pull Routing. Workers seek tasks and assign them to themselves: search and discovery of tasks supported by the platform; task recommendation; peer routing. (Diagram: Workers, Tasks, Select, Algorithm, Search & Browse Interface, Result.)
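The pull model on this slide can be sketched in a few lines of Python. This is an illustrative sketch only, not the design of any particular platform: the task dictionary layout and the `search`, `recommend`, and `claim` helpers are hypothetical names, and ranking by tag overlap is an assumed stand-in for whatever recommendation method a platform actually uses.

```python
def search(tasks, keyword):
    """Search & browse: open tasks whose description mentions the keyword."""
    return [t for t in tasks
            if t["assignee"] is None
            and keyword.lower() in t["description"].lower()]

def recommend(tasks, worker_tags, limit=3):
    """Task recommendation: rank open tasks by tag overlap with the worker."""
    open_tasks = [t for t in tasks if t["assignee"] is None]
    ranked = sorted(open_tasks,
                    key=lambda t: len(worker_tags & t["tags"]),
                    reverse=True)
    return ranked[:limit]

def claim(task, worker):
    """Pull routing: the worker, not the system, makes the assignment."""
    if task["assignee"] is not None:
        raise ValueError("task already taken")
    task["assignee"] = worker
    return task

# Hypothetical open tasks on a platform.
tasks = [
    {"id": 1, "description": "Verify product colour",
     "tags": {"products"}, "assignee": None},
    {"id": 2, "description": "Deduplicate customer addresses",
     "tags": {"customers", "addresses"}, "assignee": None},
]

suggested = recommend(tasks, worker_tags={"addresses"})
claimed = claim(suggested[0], "ann")
```

The point of the sketch is that the platform only surfaces candidate tasks; the decision to take one stays with the worker.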
  • 45. Push Routing. System assigns tasks to workers based on past performance, expertise, cost, and latency. (Diagram: Workers, Tasks, Assign, Algorithm, Task Interface, Result.) See www.mobileworks.com.
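A minimal sketch of push routing, assuming a simple weighted score over the four criteria named on the slide. The `Worker` fields, the `score` formula, and the weight values are all illustrative assumptions; the slides do not specify how a real platform combines these factors.

```python
from dataclasses import dataclass

@dataclass
class Worker:
    name: str
    accuracy: float    # past performance, 0..1
    skills: frozenset  # expertise tags
    cost: float        # price per task
    latency: float     # typical response time in minutes

def score(worker, required_skill, weights):
    """Higher is better; cost and latency count against the worker."""
    expertise = 1.0 if required_skill in worker.skills else 0.0
    return (weights["accuracy"] * worker.accuracy
            + weights["expertise"] * expertise
            - weights["cost"] * worker.cost
            - weights["latency"] * worker.latency)

def push_route(tasks, workers, weights):
    """Assign every task to the highest-scoring worker for its skill."""
    return {
        task_id: max(workers, key=lambda w: score(w, skill, weights)).name
        for task_id, skill in tasks.items()
    }

# Hypothetical workers, tasks, and weights.
workers = [
    Worker("ann", 0.95, frozenset({"products"}), 0.10, 5.0),
    Worker("bob", 0.80, frozenset({"addresses"}), 0.05, 2.0),
]
weights = {"accuracy": 1.0, "expertise": 1.0, "cost": 1.0, "latency": 0.01}
tasks = {"t1": "products", "t2": "addresses"}

assignment = push_route(tasks, workers, weights)
```

With these weights expertise dominates, so each task goes to the worker who has the matching skill. A production router would also need to balance load across workers, which this sketch ignores.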
  • 46. Managing Task Quality Assurance. Redundancy (quorum votes): replicate the task (e.g. 3 times); use majority voting to determine the right value (% agreement); weighted majority vote. Gold data / honey pots: inject trap questions to test quality; worker fatigue check (habit of saying no all the time). Estimation of worker quality: redundancy plus gold data. Qualification test: use test tasks to determine a user’s ability for such tasks.
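The quality assurance techniques on this slide can be illustrated with a short Python sketch. It combines redundancy (majority voting), gold data (estimating worker accuracy from trap questions), and a weighted majority vote that uses those accuracy estimates. The function names, the data layout, and the default weight of 0.5 for unseen workers are assumptions made for illustration, not a method prescribed by the slides.

```python
from collections import Counter, defaultdict

def majority_vote(answers):
    """Return (winning answer, share of votes) for one replicated task."""
    counts = Counter(answers)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)

def worker_accuracy(gold, responses):
    """Estimate each worker's accuracy from gold (trap) questions.

    gold: {task_id: correct answer}
    responses: list of (worker, task_id, answer) tuples
    """
    hits, seen = defaultdict(int), defaultdict(int)
    for worker, task, answer in responses:
        if task in gold:
            seen[worker] += 1
            hits[worker] += answer == gold[task]
    return {w: hits[w] / seen[w] for w in seen}

def weighted_vote(task_id, responses, accuracy, default=0.5):
    """Weighted majority vote: each vote counts with the worker's accuracy."""
    scores = defaultdict(float)
    for worker, task, answer in responses:
        if task == task_id:
            scores[answer] += accuracy.get(worker, default)
    return max(scores, key=scores.get)

# Hypothetical data: g1 and g2 are gold questions, t1 is a real task.
gold = {"g1": "yes", "g2": "no"}
responses = [
    ("ann", "g1", "yes"), ("ann", "g2", "no"),  ("ann", "t1", "yes"),
    ("bob", "g1", "no"),  ("bob", "g2", "no"),  ("bob", "t1", "no"),
    ("cat", "g1", "no"),  ("cat", "g2", "yes"), ("cat", "t1", "no"),
]

accuracy = worker_accuracy(gold, responses)
plain = majority_vote([a for _, t, a in responses if t == "t1"])
weighted = weighted_vote("t1", responses, accuracy)
```

In this example the plain majority on `t1` is "no" with two of three votes, while the weighted vote returns "yes", because the only worker who passed both gold questions answered "yes". Here `bob` answers "no" to everything, which is the fatigue pattern the slide mentions.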
  • 47. Future Directions. Task management: task assignment, payment, routing; optimizing for cost, quality, completion time. Human–computer interaction: payment/incentives; user interface and interaction design; worker reputation, recruitment, retention. Quality control: trust, reliability, spam detection, consensus.
  • 48. Summary. Collaborative Data Management: an emerging trend for data management in the enterprise; crowdsourcing + micro tasks; a number of emerging platforms to assist. (Diagram: Dirty Data; Data Quality Algorithms; Human Computation; Clean Data.)
  • 49. About the Presenter. Edward is a research scientist at the Digital Enterprise Research Institute. His areas of research include green IT/IS, energy informatics, linked data, integrated reporting, and cloud computing. He has worked extensively with industry and government advising on the adoption patterns, practicalities and benefits of new technologies. He has published in leading journals and books, and has spoken at international conferences including the MIT CIO Symposium. URL: www.edwardcurry.org Email: edcurry@acm.org Twitter: @EdwardACurry Slides: slideshare.net/edwardcurry
  • 50. Selected References: Big Data & Data Quality. S. Lavalle, E. Lesser, R. Shockley, M. S. Hopkins, and N. Kruschwitz, “Big Data, Analytics and the Path from Insights to Value,” MIT Sloan Management Review, vol. 52, no. 2, pp. 21–32, 2011. A. Haug and J. S. Arlbjørn, “Barriers to master data quality,” Journal of Enterprise Information Management, vol. 24, no. 3, pp. 288–303, 2011. R. Silvola, O. Jaaskelainen, H. Kropsu-Vehkapera, and H. Haapasalo, “Managing one master data – challenges and preconditions,” Industrial Management & Data Systems, vol. 111, no. 1, pp. 146–162, 2011. E. Curry, S. Hasan, and S. O’Riain, “Enterprise Energy Management using a Linked Dataspace for Energy Intelligence,” in Second IFIP Conference on Sustainable Internet and ICT for Sustainability, 2012. D. Loshin, Master Data Management. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2008. B. Otto and A. Reichert, “Organizing Master Data Management: Findings from an Expert Survey,” in Proceedings of the 2010 ACM Symposium on Applied Computing - SAC ’10, 2010, pp. 106–110.
  • 51. Selected References: Collective Intelligence, Crowdsourcing & Human Computation. A. Doan, R. Ramakrishnan, and A. Y. Halevy, “Crowdsourcing systems on the World-Wide Web,” Communications of the ACM, vol. 54, no. 4, p. 86, Apr. 2011. E. Law and L. von Ahn, “Human Computation,” Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 5, no. 3, pp. 1–121, Jun. 2011. M. J. Franklin, D. Kossmann, T. Kraska, S. Ramesh, and R. Xin, “CrowdDB: Answering Queries with Crowdsourcing,” in Proceedings of the 2011 international conference on Management of data - SIGMOD ’11, 2011, p. 61. P. Wichmann, A. Borek, R. Kern, P. Woodall, A. K. Parlikad, and G. Satzger, “Exploring the ‘Crowd’ as Enabler of Better Information Quality,” in Proceedings of the 16th International Conference on Information Quality, 2011, pp. 302–312. Winter A. Mason, Duncan J. Watts: Financial incentives and the "performance of crowds". SIGKDD Explorations (SIGKDD) 11(2):100-108 (2009). Panos Ipeirotis. Managing Crowdsourced Human Computation, WWW2011 Tutorial. O. Alonso & M. Lease. Crowdsourcing 101: Putting the WSDM of Crowds to Work for You, WSDM Hong Kong 2011. When Computers Were Human: http://www.youtube.com/watch?v=YwqltwvPnkw
  • 52. Selected References: Collaborative Data Management. E. Curry, A. Freitas, and S. O’Riain, “The Role of Community-Driven Data Curation for Enterprises,” in Linking Enterprise Data, D. Wood, Ed. Boston, MA: Springer US, 2010, pp. 25–47. ul Hassan, U., O’Riain, S., and Curry, E. 2012. “Towards Expertise Modelling for Routing Data Cleaning Tasks within a Community of Knowledge Workers,” In 17th International Conference on Information Quality (ICIQ 2012), Paris, France. ul Hassan, U., O’Riain, S., and Curry, E. 2013. “Effects of Expertise Assessment on the Quality of Task Routing in Human Computation,” In 2nd International Workshop on Social Media for Crowdsourcing and Human Computation, Paris, France. ul Hassan, U., O’Riain, S., and Curry, E. 2012. “Leveraging Matching Dependencies for Guided User Feedback in Linked Data Applications,” In 9th International Workshop on Information Integration on the Web (IIWeb2012), Scottsdale, Arizona: ACM.