Crowdsourcing Approaches to Big Data Curation - Rio Big Data Meetup

Crowdsourcing Approaches to Big Data Curation

Edward Curry

Insight Centre for Data Analytics,
University College Dublin

Take Home

[Figure: Data, Algorithms and Humans combine to give Better Data]

Talk Overview

•  Part I: Motivation
•  Part II: Data Quality and Data Curation
•  Part III: Crowdsourcing
•  Part IV: Case Studies on Crowdsourced Data Curation
•  Part V: Setting up a Crowdsourced Data Curation Process
•  Part VI: Linked Open Data Example
•  Part VII: Future Research Challenges

BIG
Big Data Public Private Forum
THE BIG PROJECT
Overall objective
Bring the necessary stakeholders into a self-sustainable,
industry-led initiative that will contribute greatly to
enhancing EU competitiveness by taking full advantage of Big
Data technologies.
Work at technical, business and policy levels, shaping the
future through the positioning of IIM and Big Data
specifically in Horizon 2020.

SITUATING BIG DATA IN INDUSTRY
Industry Driven Sectorial Forums: Health; Public Sector; Finance & Insurance; Telco, Media & Entertainment; Manufacturing, Retail, Energy, Transport
Needs / Offerings
Technical Working Groups (Value Chain)
Data Acquisition
•  Structured data
•  Unstructured data
•  Event processing
•  Sensor networks
•  Protocols
•  Real-time
•  Data streams
•  Multimodality
Data Analysis
•  Stream mining
•  Semantic analysis
•  Machine learning
•  Information extraction
•  Linked Data
•  Data discovery
•  ‘Whole world’ semantics
•  Ecosystems
•  Community data analysis
•  Cross-sectorial data analysis
Data Curation
•  Data Quality
•  Trust / Provenance
•  Annotation
•  Data validation
•  Human-Data Interaction
•  Top-down/Bottom-up
•  Community / Crowd
•  Human Computation
•  Curation at scale
•  Incentivisation
•  Automation
•  Interoperability
Data Storage
•  In-Memory DBs
•  NoSQL DBs
•  NewSQL DBs
•  Cloud storage
•  Query Interfaces
•  Scalability and Performance
•  Data Models
•  Consistency, Availability, Partition-tolerance
•  Security and Privacy
•  Standardization
Data Usage
•  Decision support
•  Prediction
•  In-use analytics
•  Simulation
•  Exploration
•  Visualisation
•  Modeling
•  Control
•  Domain-specific usage

SUBJECT MATTER EXPERT INTERVIEWS

KEY INSIGHTS
Key Trends
▶  Lower usability barrier for data tools
▶  Blended human and algorithmic data processing for coping with
data quality
▶  Leveraging large communities (crowds)
▶  Need for semantic, standardized data representation
▶  Significant increase in use of new data models (e.g. graph)
(expressivity and flexibility)
The Data Landscape
▶  Much of (Big Data) technology
is evolving in an evolutionary way
▶  But business process change
must be revolutionary
▶  Data variety and verifiability
are key opportunities
▶  Long tail of data variety is a
major shift in the data landscape
Biggest Blockers
▶  Lack of business-driven Big
Data strategies
▶  Need for format and data
storage technology standards
▶  Data exchange between
companies, institutions,
individuals, etc.
▶  Regulations & markets for data
access
▶  Human resources: lack of
skilled data scientists
Technical White Papers available on:
http://www.big-project.eu

The Internet of Everything:
Connecting the Unconnected

Earth Science – Systems of Systems


Citizen Sensors

“…humans as citizens on the ubiquitous Web, acting as
sensors and sharing their observations and views…”

¨  Sheth, A. (2009). Citizen sensing, social signals, and enriching human
experience. Internet Computing, IEEE, 13(4), 87-92.

Air Pollution

DATA QUALITY AND DATA CURATION
PART II

The Problems with Data
Knowledge Workers need:
¨  Access to the right data
¨  Confidence in that data
Flawed data affects 25%
of critical data in the world’s
top companies
Data quality’s role in the recent
financial crisis:
¨  “Asset are defined differently
in different programs”
¨  “Numbers did not always add
up”
¨  “Departments do not trust
each other’s figures”
¨  “Figures … not worth the
pixels they were made of”

What is Data Quality?
“Desirable characteristics for information
resource”
Described as a series of quality dimensions:
n  Discoverability & Accessibility: storing and classifying in
appropriate and consistent manner
n  Accuracy: Correctly represents the “real-world” values it models
n  Consistency: Created and maintained using standardized
definitions, calculations, terms, and identifiers
n  Provenance & Reputation: Track source & determine reputation
¨  Includes the objectivity of the source/producer
¨  Is the information unbiased, unprejudiced, and impartial?
¨  Or does it come from a reputable but partisan source?
Wang, R. and D. Strong, Beyond Accuracy: What Data Quality Means to Data Consumers.
Journal of Management Information Systems, 1996. 12(4): p. 5-33.

Data Quality
Source A
ID    PNAME      PCOLOR  PRICE
APNR  iPod Nano  Red     150
APNS  iPod Nano  Silver  160
Source B
<Product name="iPod Nano">
  <Items>
    <Item code="IPN890">
      <price>150</price>
      <generation>5</generation>
    </Item>
  </Items>
</Product>
Records from both sources
APNR       iPod Nano  Red     150
APNR       iPod Nano  Silver  160
iPod Nano  IPN890     150     5
Schema Difference?
Data Developer
Value Conflicts?
Entity Duplication?
Data Steward
Business Users
?
Technical Domain
(Technical) Domain
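
The three questions on this slide (schema difference, value conflicts, entity duplication) can be illustrated with a small sketch. It is only a sketch under stated assumptions: the mapping from Source B's nested XML onto flat records, the attribute name GENERATION, and the three helper functions are hypothetical choices made for illustration, and the checks only flag candidates that a data steward would still need to review.

```python
import xml.etree.ElementTree as ET

# Source A: flat table rows, as shown on the slide.
source_a = [
    {"ID": "APNR", "PNAME": "iPod Nano", "PCOLOR": "Red", "PRICE": 150},
    {"ID": "APNS", "PNAME": "iPod Nano", "PCOLOR": "Silver", "PRICE": 160},
]

# Source B: nested XML, as shown on the slide.
source_b_xml = """
<Product name="iPod Nano">
  <Items>
    <Item code="IPN890">
      <price>150</price>
      <generation>5</generation>
    </Item>
  </Items>
</Product>
"""

def flatten_source_b(xml_text):
    """Map Source B's nested schema onto flat records (hypothetical mapping)."""
    product = ET.fromstring(xml_text)
    return [{
        "ID": item.get("code"),
        "PNAME": product.get("name"),
        "PRICE": int(item.findtext("price")),
        "GENERATION": int(item.findtext("generation")),
    } for item in product.iter("Item")]

def schema_difference(rows_a, rows_b):
    """Attributes that appear in only one of the two sources."""
    keys_a = set().union(*(r.keys() for r in rows_a))
    keys_b = set().union(*(r.keys() for r in rows_b))
    return keys_a ^ keys_b

def duplicate_candidates(rows_a, rows_b):
    """Pairs that share a product name but carry different identifiers."""
    return [(a["ID"], b["ID"]) for a in rows_a for b in rows_b
            if a["PNAME"] == b["PNAME"] and a["ID"] != b["ID"]]

def value_conflicts(rows):
    """Product names listed with more than one distinct price."""
    prices = {}
    for r in rows:
        prices.setdefault(r["PNAME"], set()).add(r["PRICE"])
    return {name: sorted(p) for name, p in prices.items() if len(p) > 1}

rows_b = flatten_source_b(source_b_xml)
print(schema_difference(source_a, rows_b))
print(duplicate_candidates(source_a, rows_b))
print(value_conflicts(source_a + rows_b))
```

Whether two differently priced "iPod Nano" records are a real conflict or simply two colours of the same product is exactly the kind of judgement the slide assigns to a person rather than to the algorithm.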

What is Data Curation?
n  Digital Curation
¨ Selection, preservation, maintenance, collection,
and archiving of digital assets
n  Data Curation
¨ Active management of data over its life-cycle
n  Data Curators
¨ Ensure data is trustworthy, discoverable, accessible,
reusable, and fit for use
– Museum cataloguers of the Internet age

Related Activities
n  Data Governance/ Master Data Management
¨ Convergence of data quality, data management,
business process management, and risk
management
¨ Part of overall data governance strategy for
organization
n  Data Curator = Data Steward

Types of Data Curation
n  Multiple approaches to curate data, no
single correct way
¨ Who?
– Individual Curators
– Curation Departments
– Community-based Curation
¨ How?
– Manual Curation
– (Semi-)Automated
– Sheer Curation

Types of Data Curation – Who?
n  Individual Data Curators
¨ Suitable for infrequently changing small quantity
of data
–  (<1,000 records)
–  Minimal curation effort (minutes per record)

Types of Data Curation – Who?
n  Curation Departments
¨ Curation experts working with subject matter
experts to curate data within formal process
–  Can deal with large curation effort (1,000s of records)
n  Limitations
¨ Scalability: Can struggle with large quantities of
dynamic data (>1 million records)
¨ Availability: Post-hoc nature creates delay in
curated data availability

Types of Data Curation - Who?
n  Community-Based Data Curation
¨ Decentralized approach to data curation
¨ Crowd-sourcing the curation process
– Leverages community of users to curate data
¨ Wisdom of the community (crowd)
¨ Can scale to millions of records

Types of Data Curation – How?
n  Manual Curation
¨ Curators directly manipulate data
¨ Can tie users up with low-value-add activities
n  (Semi-)Automated Curation
¨ Algorithms can (semi-)automate curation
activities such as data cleansing, record
de-duplication and classification
¨ Can be supervised or approved by human
curators

Types of Data Curation – How?
n  Sheer curation, or Curation at Source
¨ Curation activities integrated in normal workflow
of those creating and managing data
¨ Can be as simple as vetting or “rating” the
results of a curation algorithm
¨ Results can be available immediately
n  Blended Approaches: Best of Both
¨ Sheer curation + post hoc curation department
¨ Allows immediate access to curated data
¨ Ensures quality control with expert curation

Data Quality
Data Curation Example
[Figure labels: Profile Sources; Define Mappings; Cleanse; Enrich; De-duplicate; Define Rules; Curated Data; Data Developer; Data Curator; Data Governance; Business Users; Applications; Product Data]

Data Curation
n  Pros
¨  Can create a single version of truth
¨  Standardized information creation and management
¨  Improves data quality
n  Cons
¨  Significant upfront costs and efforts
¨  Participation limited to few (mostly) technical experts
¨  Difficult to scale for large data sources
–  Extended Enterprise e.g. partner, data vendors
¨  Small % of data under management (i.e. CRM, Product, …)

The New York Times
100 Years of Expert Data Curation

The New York Times
n  Largest metropolitan and third largest
newspaper in the United States
n  nytimes.com
q  Most popular newspaper
website in US
n  100 year old curated
repository defining its
participation in the
emerging Web of Data

The New York Times
n  Data curation dates back to 1913
¨ Publisher/owner Adolph S. Ochs decided to
provide a set of additions to the newspaper
n  New York Times Index
¨ Organized catalog of article titles and summaries
–  Containing issue, date and column of article
–  Categorized by subject and names
–  Introduced on quarterly then annual basis
n  Transitory content of newspaper became
important source of searchable historical data
¨ Often used to settle historical debates

The New York Times
n  Index Department was created in 1913
¨ Curation and cataloguing of NYT resources
–  Since 1851 NYT had low quality index for internal use
n  Developed a comprehensive catalog using a
controlled vocabulary
¨ Covering subjects, personal names,
organizations, geographic locations and titles of
creative works (books, movies, etc), linked to
articles and their summaries
n  Current Index Dept. has ~15 people

The New York Times
n  Challenges with consistently and accurately
classifying news articles over time
¨ Keywords expressing subjects may show some
variance due to cultural or legal constraints
¨ Identities of some entities, such as organizations
and places, changed over time
n  Controlled vocabulary grew to hundreds of
thousands of categories
¨ Adding complexity to classification process

The New York Times
n  Increased importance of Web drove need to
improve categorization of online content
n  Curation carried out by Index Department
¨ Library-time (days to weeks)
¨ Print edition can handle next-day index
n  Not suitable for real-time online publishing
¨ nytimes.com needed a same-day index

The New York Times
n  Introduced two stage curation process
¨ Editorial staff performed best-effort semi-
automated sheer curation at point of online pub.
–  Several hundred journalists
¨ Index Department follows up with long-term
accurate classification and archiving
n  Benefits:
¨ Non-expert journalist curators provide instant
accessibility to online users
¨ Index Department provides long-term high-
quality curation in a “trust but verify” approach

NYT Curation Workflow
¨ Curation starts with article getting out of the newsroom

¨ Member of editorial staff submits article to web-based
rule based information extraction system (SAS Teragram)

¨ Teragram uses linguistic extraction rules based on subset
of Index Dept’s controlled vocab.

¨ Teragram suggests tags based on the Index vocabulary
that can potentially describe the content of article

¨ Editorial staff member selects terms that best describe
the contents and inserts new tags if necessary

¨ Reviewed by the taxonomy managers with feedback to
editorial staff on classification process

¨ Article is published online at nytimes.com

¨ At a later stage the article receives second-level curation by
the Index Dept.: additional Index tags and a summary

¨ Article is submitted to NYT Index
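
The two-stage workflow above can be sketched in a few lines. This is not the SAS Teragram API, which the slides do not describe; the vocabulary entries, trigger phrases and function names below are hypothetical stand-ins that only mirror the shape of the process: rule-based suggestion, editorial selection at publication time, then later expert curation.

```python
# Hypothetical subset of a controlled vocabulary: tag -> trigger phrases.
VOCABULARY = {
    "Elections": ["election", "ballot", "voters"],
    "Galway (Ireland)": ["galway"],
    "Motion Pictures": ["film", "movie"],
}

def suggest_tags(article_text, vocabulary=VOCABULARY):
    """Stage 1a: rule-based suggestion of tags whose trigger phrases occur."""
    text = article_text.lower()
    return sorted(tag for tag, phrases in vocabulary.items()
                  if any(p in text for p in phrases))

def editorial_review(suggested, accepted, new_tags=()):
    """Stage 1b: the editor keeps fitting suggestions and may add new tags."""
    kept = [t for t in suggested if t in accepted]
    return kept + [t for t in new_tags if t not in kept]

def index_review(tags, extra_tags, summary):
    """Stage 2: later expert curation adds Index tags and a summary."""
    return {"tags": tags + [t for t in extra_tags if t not in tags],
            "summary": summary}

article = "Voters in Galway went to the polls as a new film festival opened."
suggested = suggest_tags(article)
published = editorial_review(suggested,
                             accepted={"Elections", "Galway (Ireland)"},
                             new_tags=["Festivals"])
archived = index_review(published, ["Ireland"], "Election day in Galway.")
print(suggested, published, archived["tags"])
```

The split matters for latency: the first two functions run at publication time, while the third can run days later without delaying the article.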

The New York Times
n  Early adopter of Linked Open Data (June ‘09)

The New York Times
n  Linked Open Data @ data.nytimes.com
¨ Subset of 10,000 tags from index vocabulary
¨ Dataset of people, organizations & locations
– Complemented by search services to consume
data about articles, movies, best sellers,
Congress votes, real estate,…
n  Benefits
¨ Improves traffic by third party data usage
¨ Lowers development cost of new applications
for different verticals inside the website
–  E.g. movies, travel, sports, books

CROWDSOURCING
PART III

Introduction to Crowdsourcing
n  Coordinating a crowd (a large group of workers)to
do micro-work (small tasks) that solves problems
(that computers or a single user can’t)
n  A collection of mechanisms and associated
methodologies for scaling and directing
crowd activities to achieve goals
n  Related Areas
¨  Collective Intelligence
¨  Social Computing
¨  Human Computation
¨  Data Mining
A. J. Quinn and B. B. Bederson, “Human computation: a survey and taxonomy of a growing field,” in
Proceedings of the 2011 Annual Conference on Human Factors in Computing Systems, 2011, pp.
1403–1412.

When Computers Were Human
n  Maskelyne 1760
¨ Used human computers
to create an almanac of
moon positions
– Used for shipping/
navigation
¨ Quality assurance
– Do calculations twice
– Compare to third verifier
D. A. Grier, When Computers Were Human,
vol. 13. Princeton University Press, 2005.
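
The quality-assurance scheme on this slide (compute twice, have a third party compare) is the ancestor of the redundancy checks used in crowdsourcing today. A minimal sketch, assuming each "computer" is simply a function and that disputed entries are sent back for rework; the names and the toy summation task are illustrative only.

```python
def verified_table(tasks, computer_a, computer_b):
    """Play the role of the third verifier: accept a value only when two
    independent computations agree, otherwise mark the entry as disputed."""
    accepted, disputed = {}, []
    for key, inputs in tasks.items():
        a, b = computer_a(inputs), computer_b(inputs)
        if a == b:
            accepted[key] = a
        else:
            disputed.append(key)
    return accepted, disputed

def careful(values):
    return sum(values)

def careless(values):
    # Simulated slip: the last value is dropped.
    return sum(values[:-1])

tasks = {"day1": [1, 2, 3], "day2": [4, 5, 0]}
accepted, disputed = verified_table(tasks, careful, careless)
print(accepted, disputed)
```

Note that agreement is not proof of correctness: two computers making the same mistake would still pass the comparison.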

Human
ü Visual perception
ü Visuospatial thinking
ü Audiolinguistic ability
ü Sociocultural
awareness
ü Creativity
ü Domain knowledge
Machine
ü Large-scale data
manipulation
ü Collecting and storing
large amounts of data
ü Efﬁcient data movement
ü Bias-free analysis
Human vs Machine Affordances
R. J. Crouser and R. Chang, “An affordance-based framework
for human computation and human-computer collaboration,”
IEEE Trans. Vis. Comput. Graph., vol. 18, pp. 2859–2868,
2012.

When to Crowdsource a Task?
n  Computers cannot do the task
n  Single person cannot do the task
n  Work can be split into smaller tasks

Types of Crowds
n  Internal corporate communities
¨ Taps potential of internal workforce
¨ Curate competitive enterprise data that will
remain internal to the company
– May not always be the case e.g. product technical
support and marketing data
n  External communities
¨  Public crowd-sourcing market places
¨  Pre-competitive communities

Generic Architecture
[Figure labels: Requestors; Platform/Marketplace (Publish Task, Task Management); Workers; steps numbered 1 to 4]
CASE STUDIES ON CROWDSOURCED DATA CURATION
PART IV

Crowdsourced Data Curation
DQ Rules & Algorithms: Entity Linking, Data Fusion, Relation Extraction
Human Computation: Relevance Judgment, Data Verification, Disambiguation
Internal Community: Domain Knowledge, High Quality Responses, Trustable
External Crowd: High Availability, Large Scale, Expertise Variety
Web of Data, Databases, Textual Content
Programmers, Managers
Clean Data

Examples of CDM Tasks
n  Understanding customer sentiment for
launch of new product around the world.
n  Implemented 24/7 sentiment analysis
system with workers from around the
world.
n  90% accuracy in 95% on content
n  Categorize millions of products on eBay’s
catalog with accurate and complete
attributes
n  Combine the crowd with machine learning to
create an affordable and flexible catalog
quality system

Examples of CDM Tasks
n  Natural Language Processing
¨  Dialect Identification, Spelling Correction, Machine
Translation, Word Similarity
n  Computer Vision
¨  Image Similarity, Image Annotation/Analysis
n  Classification
¨  Data attributes, Improving taxonomy, search results
n  Verification
¨  Entity consolidation, de-duplicate, cross-check, validate
data
n  Enrichment
¨  Judgments, annotation

Wikipedia
n  Collaboratively built by large community
¨  More than 19,000,000 articles, 270+ languages,
3,200,000+ articles in English
¨  More than 157,000 active contributors
n  Accuracy and stylistic formality are
equivalent to expert-based resources
¨  i.e. Columbia and Britannica encyclopedias
n  WikiMeida
¨  Software behind Wikipedia
¨  Widely used inside organizations
¨  Intellipedia:16 U.S. Intelligence agencies
¨  Wiki Proteins: curated Protein data for
knowledge discovery

Wikipedia – Social Organization
n  Any user can edit its contents
¨ Without prior registration
n  Does not lead to a chaotic scenario
¨ In practice highly scalable approach for high
quality content creation on the Web
n  Relies on simple but highly effective way to
coordinate its curation process
n  Curation is activity of Wikipedia admins
¨ Responsibility for information quality standards

Wikipedia – Social Organization

DBPedia Knowledge base
n  DBPedia provides direct access to data
¨ Indirectly uses wiki as data curation platform
¨ Inherits massive volume of curated
Wikipedia data
¨ 3.4 million entities and 1 billion RDF triples
¨ Comprehensive data infrastructure
– Concept URIs
– Definitions
– Basic types

n  Collaborative knowledge
base maintained by
community of web users
n  Users create entity types
and their meta-data
according to guidelines
n  Requires administrative
approvals for schema
changes by end users

ReCaptcha
n  OCR
¨  ~ 1% error rate
¨  20%-30% for 18th and 19th
century books
n  40 million ReCAPTCHAs
every day” (2008)
¨  Fixing 40,000 books a day

SETTING UP A CROWDSOURCED DATA CURATION PROCESS
PART V

Core Design Questions
What: Goal
Who: Workers
Why: Incentives
How: Process
Malone, T. W., Laubacher, R., & Dellarocas, C. N.
Harnessing crowds: Mapping the genome of collective intelligence. MIT Sloan Research Paper 4732-09, (2009).

1) Who is doing it? (Workers)
n  Hierarchy (Assignment)
¨ Someone in authority assigns a particular person
or group of people to perform the task
¨ Within the Enterprise (i.e. Individuals, specialised
departments)
¨ Within a structured community (i.e. pre-
competitive community)
n  Crowd (Choice)
¨ Anyone in a large group who chooses to do so
¨ Internal or External Crowds

2) Why are they doing it? (Incentives)
n  Motivation
¨  Money ($$££)
¨  Glory (reputation/prestige)
¨  Love (altruism, socialize, enjoyment)
¨  Unintended by-product (e.g. re-Captcha, captured in workflow)
¨  Self-serving resources (e.g. Wikipedia, product/customer data)
¨  Part of their job description (e.g. Data curation as part of role)
n  Determine pay and time for each task
¨  Marketplace: Delicate balance
–  Money does not improve quality but can increase participation
¨  Internal Hierarchy: Engineering opportunities for recognition
–  Performance review, prizes for top contributors, badges,
leaderboards, etc.

Effect of Payment on Quality
n  Cost does not affect quality
n  Similar results for bigger tasks [Ariely et al, 2009]
Mason, W. A., & Watts, D. J. (2009). Financial incentives and the ‘‘performance of crowds.’’
Proceedings of the Human Computation Workshop. Paris: ACM, June 28, 2009.
[Panos Ipeirotis. WWW2011 tutorial]

3) What is being done? (Goal)
3.1 Identify the Data
¨ Newly created data and/or legacy data?
¨ How is new data created?
– Do users create the data, or is it imported from an
external source?
¨ How frequently is new data created/updated?
¨ What quantity of data is created?
¨ How much legacy data exists?
¨ Is it stored within a single source, or scattered
across multiple sources?

3) What is being done? (Goal)
3.2 Identify the Tasks
¨ Creation Tasks
– Create/Generate
– Find
– Improve/ Edit / Fix
¨ Decision (Vote) Tasks
– Accept / Reject
– Thumbs up / Thumbs Down
– Vote for Best

4) How is it being done? (How)
n  Identify the workflow
¨ Tasks integrated in normal workflow of those
creating and managing data
¨ Simple as vetting or “rating” results of algorithm
n  Identify the platform
¨  Internal/Community collaboration platforms
¨  Public crowdsourcing platform
–  Consider the availability of appropriate workers (i.e. experts)
n  Identify the Algorithm
¨  Data quality
¨  Image recognition
¨  etc

Pull Routing
n  Workers seek tasks and assign to themselves
¨  Search and discovery of tasks supported by platform
¨  Task Recommendation
¨  Peer Routing
[Figure labels: Workers; Tasks; Select; Algorithm; Search & Browse Interface; Result]
Push Routing
n  System assigns tasks to workers based on:
¨  Past performance
¨  Expertise
¨  Cost
¨  Latency
[Figure labels: Workers; Tasks; Assign; Algorithm; Task Interface; Result]
* www.mobileworks.com
Managing Task Quality Assurance
n  Redundancy: Quorum Votes
¨  Replicate the task (e.g. 3 times)
¨  Use majority voting to determine right value (% agreement)
¨  Weighted majority vote
n  Gold Data / Honey Pots
¨  Inject trap question to test quality
¨  Worker fatigue check (habit of saying no all the time)
n  Estimation of Worker Quality
¨  Redundancy plus gold data
n  Qualification Test
¨  Use test tasks to determine a user’s ability for such tasks

LINKED OPEN DATA EXAMPLE
PART VI

Linked Open Data (LOD)
n  Expose and interlink datasets on the Web
n  Using URIs to identify “things” in your data
n  Using a graph representation (RDF) to describe URIs
n  Vision: The Web as a huge graph database
Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.n

Linked Data Example
Multiple Identifiers

Identity resolution links

Identity Resolution in LOD
<http://www.freebase.com/view/en/galway>
<http://dbpedia.org/resource/Galway>
<http://sws.geonames.org/2964180/>

owl:sameAs
Publisher
owl:sameAs
Consumer

Multiple Identifiers for ‘Galway’ entity in Linked Open Data Cloud

Different sources of identity resolution links

Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
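
Because owl:sameAs is symmetric and transitive, links contributed by different parties can be merged into groups of identifiers that denote the same entity. A minimal sketch using a union-find structure; which party contributes which link is an assumption made for the example.

```python
def same_as_clusters(links):
    """Group URIs connected by owl:sameAs links (symmetric and transitive)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)

    clusters = {}
    for uri in list(parent):
        clusters.setdefault(find(uri), set()).add(uri)
    return list(clusters.values())

FREEBASE = "http://www.freebase.com/view/en/galway"
DBPEDIA = "http://dbpedia.org/resource/Galway"
GEONAMES = "http://sws.geonames.org/2964180/"

links = [
    (FREEBASE, DBPEDIA),   # e.g. a link supplied by a publisher
    (DBPEDIA, GEONAMES),   # e.g. a link supplied by a consumer
]
clusters = same_as_clusters(links)
print(clusters)
```

A single wrong link merges two entities for every consumer of the data, which is one reason identity links are a natural target for human verification.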

LOD Application Architecture
[Figure labels: Utility Module; Feedback Module; Consolidation Module; Questions; Feedback; Rules; Matching Dependencies; Ranked Feedback Tasks; Data Improvement; Candidate Links]
Tom Heath and Christian Bizer (2011) Linked Data: Evolving the Web into a Global Data Space (1st edition), 1-136. Morgan & Claypool.

FUTURE RESEARCH CHALLENGES
PART VII

Future Research Directions
n  Incentives and social engagement
¨  Better recognition of the data curation role
¨  Understanding of social engagement mechanisms
n  Economic Models
¨  Pre-competitive and public-private partnerships
n  Curation at Scale
¨  Evolution of human computation and crowdsourcing
¨  Instrumenting popular apps for data curation
¨  General-purpose data curation pipelines
¨  Human-data interaction

n  Spatial Crowdsourcing
¨  Matching tasks with workers at right time and location
¨  Balancing workload among workers
¨  Tasks at remote locations
¨  Chaining tasks in same vicinity
¨  Preserving worker privacy
n  Interoperability
¨  Finding semantic similarity of tasks across systems
¨  Defining and measuring worker capability across
heterogeneous systems
¨  Enabling routing middleware for multiple systems
¨  Compatibility of reputation systems
¨  Defining standards for task exchange
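
Two of the spatial crowdsourcing challenges above, matching tasks to nearby workers and balancing workload, can be sketched with a greedy assignment. The coordinates, the 5 km radius and the per-worker limit are hypothetical values, and a greedy pass is only one simple strategy, not the method of any cited paper.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def assign_spatial_tasks(tasks, workers, max_km=5.0, max_load=2):
    """Greedy matching: each task goes to the nearest worker within range
    who still has spare capacity; out-of-range tasks stay unassigned."""
    load = {name: 0 for name in workers}
    assignment = {}
    for task_id, location in tasks.items():
        candidates = [(haversine_km(location, pos), name)
                      for name, pos in workers.items()
                      if load[name] < max_load]
        candidates = [c for c in candidates if c[0] <= max_km]
        if candidates:
            _, chosen = min(candidates)
            assignment[task_id] = chosen
            load[chosen] += 1
    return assignment

# Hypothetical (latitude, longitude) positions.
workers = {"ana": (-22.9068, -43.1729), "ben": (-22.9519, -43.2105)}
tasks = {
    "t1": (-22.9100, -43.1750),   # close to ana
    "t2": (-22.9500, -43.2100),   # close to ben
    "t3": (-23.5505, -46.6333),   # far from both workers
}
assignment = assign_spatial_tasks(tasks, workers)
print(assignment)
```

Worker privacy, also listed above, is not addressed here: the sketch assumes the platform knows exact worker positions.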

Heterogeneous Crowds
n  Multiple requesters, tasks, workers, platform
95
Collaborative
Data Curation
Tasks Workers
Cyber Physical
Social System
Platforms

SLUA Ontology
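Classes: User, Task, Reward, Action, Capability
Relations (User, Task): offers / earns (Reward); includes / performs (Action); requires / possesses (Capability)
Capability subClassOf: Location, Skill, Knowledge, Ability, Availability
Reward subClassOf: Reputation, Money, Fun, Altruism, Learning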
U. ul Hassan, S. O’Riain, E. Curry, “SLUA: Towards Semantic Linking of Users with Actions
in Crowdsourcing,” in International Workshop on Crowdsourcing the Semantic Web, 2013.
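
One way to read the ontology is as a matching rule: a user is a candidate for a task when they possess what is required and care about a reward on offer. The sketch below follows that reading under stated assumptions: capabilities are attached directly to the task, `motivated_by` is a hypothetical stand-in for the rewards a user wants to earn, and plain Python sets replace any real ontology tooling.

```python
def eligible_users(task, users):
    """Users who possess every required capability and value an offered reward."""
    return sorted(
        name for name, profile in users.items()
        if task["requires"] <= profile["possesses"]
        and task["offers"] & profile["motivated_by"]
    )

task = {
    "requires": {"Skill:translation", "Location:Rio"},
    "offers": {"Money"},
}
users = {
    "ana": {"possesses": {"Skill:translation", "Location:Rio",
                          "Knowledge:geography"},
            "motivated_by": {"Money", "Fun"}},
    "ben": {"possesses": {"Skill:translation"},
            "motivated_by": {"Money"}},
    "carla": {"possesses": {"Skill:translation", "Location:Rio"},
              "motivated_by": {"Altruism"}},
}
matched = eligible_users(task, users)
print(matched)
```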

n  Task Routing
¨  Optimizing task completion, quality, and latency
¨  Inferring worker preferences, skills, and knowledge
¨  Balancing exploration-exploitation trade-off between
inference and optimization
¨  Cold-start problem for new workers or tasks
¨  Ensuring worker satisfaction via load balancing & rewards
n  Human–Computer Interaction
¨  Reducing search friction through good browsing interfaces
¨  Presenting requisite information, nothing more
¨  Choosing the level of task granularity for complex tasks
¨  Ensuring worker engagement
¨  Designing games with a purpose to crowdsource with fun

Summary
[Figure: Data, Algorithms and Humans combine to give Better Data]

Selected References
n  Big Data & Data Quality
¨  S. Lavalle, E. Lesser, R. Shockley, M. S. Hopkins, and N. Kruschwitz, “Big Data, Analytics and the
Path from Insights to Value,” MIT Sloan Management Review, vol. 52, no. 2, pp. 21–32, 2011.
¨  A. Haug and J. S. Arlbjørn, “Barriers to master data quality,” Journal of Enterprise Information
Management, vol. 24, no. 3, pp. 288–303, 2011.
¨  R. Silvola, O. Jaaskelainen, H. Kropsu-Vehkapera, and H. Haapasalo, “Managing one master data –
challenges and preconditions,” Industrial Management & Data Systems, vol. 111, no. 1, pp. 146–
162, 2011.
¨  E. Curry, S. Hasan, and S. O’Riain, “Enterprise Energy Management using a Linked Dataspace for
Energy Intelligence,” in Second IFIP Conference on Sustainable Internet and ICT for Sustainability,
2012.
¨  D. Loshin, Master Data Management. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.,
2008.
¨  Pipino, L. L., Lee, Y. W., & Wang, R. Y. (2002). Data quality assessment. Communications of the
ACM, 45(4), 211-2
¨  Batini, C., Cappiello, C., Francalanci, C., & Maurino, A. (2009). Methodologies for data quality
assessment and improvement. ACM Computing Surveys (CSUR), 41(3), 16.
¨  B. Otto and A. Reichert, “Organizing Master Data Management: Findings from an Expert Survey,” in
Proceedings of the 2010 ACM Symposium on Applied Computing - SAC ’10, 2010, pp. 106–110.
¨  Wang, R. and D. Strong, Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of
Management Information Systems, 1996. 12(4): p. 5-33
¨  Ul Hassan, U., O’Riain, S., and Curry, E. 2012. “Leveraging Matching Dependencies for Guided User
Feedback in Linked Data Applications,” In 9th International Workshop on Information Integration
on the Web (IIWeb2012) Scottsdale, Arizona,: ACM.

Selected References
n  Collective Intelligence, Crowdsourcing & Human Computation
¨  Malone, Thomas W., Robert Laubacher, and Chrysanthos Dellarocas. "Harnessing Crowds: Mapping the
Genome of Collective Intelligence." (2009).
¨  A. Doan, R. Ramakrishnan, and A. Y. Halevy, “Crowdsourcing systems on the World-Wide Web,”
Communications of the ACM, vol. 54, no. 4, p. 86, Apr. 2011.
¨  A. J. Quinn and B. B. Bederson, “Human computation: a survey and taxonomy of a growing field,” in
Proceedings of the 2011 Annual Conference on Human Factors in Computing Systems, 2011, pp. 1403–
1412.
¨  Mason, W. A., & Watts, D. J. (2009). Financial incentives and the ‘‘performance of crowds.’’ Proceedings of
the Human Computation Workshop. Paris: ACM, June 28, 2009.
¨  E. Law and L. von Ahn, “Human Computation,” Synthesis Lectures on Artificial Intelligence and Machine
Learning, vol. 5, no. 3, pp. 1–121, Jun. 2011.
¨  M. J. Franklin, D. Kossmann, T. Kraska, S. Ramesh, and R. Xin, “CrowdDB : Answering Queries with
Crowdsourcing,” in Proceedings of the 2011 international conference on Management of data - SIGMOD
’11, 2011, p. 61.
¨  P. Wichmann, A. Borek, R. Kern, P. Woodall, A. K. Parlikad, and G. Satzger, “Exploring the ‘Crowd’ as
Enabler of Better Information Quality,” in Proceedings of the 16th International Conference on Information
Quality, 2011, pp. 302–312.
¨  Winter A. Mason, Duncan J. Watts: Financial incentives and the "performance of crowds". SIGKDD
Explorations (SIGKDD) 11(2):100-108 (2009)
¨  Panos Ipeirotis. Managing Crowdsourced Human Computation, WWW2011 Tutorial
¨  O. Alonso & M. Lease. Crowdsourcing 101: Putting the WSDM of Crowds to Work for You, WSDM Hong Kong
2011.
¨  D. A. Grier, When Computers Were Human, vol. 13. Princeton University Press, 2005.
–  http://www.youtube.com/watch?v=YwqltwvPnkw
¨  Ul Hassan, U., & Curry, E. (2013, October). A capability requirements approach for predicting worker
performance in crowdsourcing. In Collaborative Computing: Networking, Applications and Worksharing
(Collaboratecom), 2013 9th International Conference on (pp. 429-437). IEEE.

Selected References
n  Collaborative Data Management
¨  E. Curry, A. Freitas, and S. O. Riain, “The Role of Community-Driven Data Curation for Enterprises,” in
Linking Enterprise Data, D. Wood, Ed. Boston, MA: Springer US, 2010, pp. 25–47.
¨  Ul Hassan, U., O’Riain, S., and Curry, E. 2012. “Towards Expertise Modelling for Routing Data Cleaning
Tasks within a Community of Knowledge Workers,” In 17th International Conference on Information Quality
(ICIQ 2012), Paris, France.
¨  Ul Hassan, U., O’Riain, S., and Curry, E. 2013. “Effects of Expertise Assessment on the Quality of Task
Routing in Human Computation,” In 2nd International Workshop on Social Media for Crowdsourcing and
Human Computation, Paris, France.
¨  Stonebraker, M., Bruckner, D., Ilyas, I. F., Beskales, G., Cherniack, M., Zdonik, S. B., ... & Xu, S. (2013). Data
Curation at Scale: The Data Tamer System. In CIDR.
¨  Parameswaran, A. G., Park, H., Garcia-Molina, H., Polyzotis, N., & Widom, J. (2012, October). Deco:
declarative crowdsourcing. In Proceedings of the 21st ACM international conference on Information and
knowledge management (pp. 1203-1212). ACM.
¨  Parameswaran, A., Boyd, S., Garcia-Molina, H., Gupta, A., Polyzotis, N., & Widom, J. (2014). Optimal
crowd-powered rating and filtering algorithms. Proceedings Very Large Data Bases (VLDB).
¨  Marcus, A., Wu, E., Karger, D., Madden, S., & Miller, R. (2011). Human-powered sorts and joins. Proceedings
of the VLDB Endowment, 5(1), 13-24.
¨  Guo, S., Parameswaran, A., & Garcia-Molina, H. (2012, May). So who won?: dynamic max discovery with the
crowd. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (pp.
385-396). ACM.
¨  Davidson, S. B., Khanna, S., Milo, T., & Roy, S. (2013, March). Using the crowd for top-k and group-by
queries. In Proceedings of the 16th International Conference on Database Theory (pp. 225-236). ACM.
¨  Chai, X., Vuong, B. Q., Doan, A., & Naughton, J. F. (2009, June). Efficiently incorporating user feedback into
information extraction and integration programs. In Proceedings of the 2009 ACM SIGMOD International
Conference on Management of data (pp. 87-100). ACM.

Selected References
n  Spatial Crowdsourcing
¨  Kazemi, L., & Shahabi, C. (2012, November). Geocrowd: enabling query answering with spatial
crowdsourcing. In Proceedings of the 20th International Conference on Advances in Geographic
Information Systems (pp. 189-198). ACM.
¨  Benouaret, K., Valliyur-Ramalingam, R., & Charoy, F. (2013). CrowdSC: Building Smart Cities with
Large Scale Citizen Participation. IEEE Internet Computing, 1.
¨  Musthag, M., & Ganesan, D. (2013, April). Labor dynamics in a mobile micro-task market. In
Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 641-650).
ACM.
¨  Deng, Dingxiong, Cyrus Shahabi, and Ugur Demiryurek. "Maximizing the number of worker's self-
selected tasks in spatial crowdsourcing." Proceedings of the 21st ACM SIGSPATIAL International
Conference on Advances in Geographic Information Systems. ACM, 2013.
¨  To, H., Ghinita, G., & Shahabi, C. (2014). A Framework for Protecting Worker Location Privacy in
Spatial Crowdsourcing. Proceedings of the VLDB Endowment, 7(10).
¨  Goncalves, J., Ferreira, D., Hosio, S., Liu, Y., Rogstadius, J., Kukka, H., & Kostakos, V. (2013,
September). Crowdsourcing on the spot: altruistic use of public displays, feasibility, performance,
and behaviours. In Proceedings of the 2013 ACM international joint conference on Pervasive and
ubiquitous computing (pp. 753-762). ACM.
¨  Cardone, G., Foschini, L., Bellavista, P., Corradi, A., Borcea, C., Talasila, M., & Curtmola, R. (2013).
Fostering participaction in smart cities: a geo-social crowdsensing platform. Communications
Magazine, IEEE, 51(6).

Books
n  Surowiecki, J. (2005). The wisdom of crowds. Random House LLC.
n  Batini, C., & Scannapieco, M. (2006). Data quality: concepts, methodologies
and techniques. Springer.
n  Michelucci, P. (2013). Handbook of human computation. Springer.
n  Law, E., & Ahn, L. V. (2011). Human computation. Synthesis Lectures on
Artificial Intelligence and Machine Learning, 5(3), 1-121.
n  Heath, T., & Bizer, C. (2011). Linked data: Evolving the web into a global data
space. Synthesis lectures on the semantic web: theory and technology, 1(1),
1-136.
n  Grier, D. A. (2013). When computers were human. Princeton University Press.
n  Easley, D., & Kleinberg, J. Networks, Crowds, and Markets. Cambridge
University.
n  Sheth, A., & Thirunarayan, K. (2012). Semantics Empowered Web 3.0:
Managing Enterprise, Social, Sensor, and Cloud-based Data and Services for
Advanced Applications. Synthesis Lectures on Data Management, 4(6), 1-175.

Tutorials
n  Human Computation and Crowdsourcing
¨  http://research.microsoft.com/apps/video/default.aspx?id=169834
¨  http://www.youtube.com/watch?v=tx082gDwGcM
n  Human-Powered Data Management
¨  http://research.microsoft.com/apps/video/default.aspx?id=185336
n  Crowdsourcing Applications and Platforms: A Data Management
Perspective
¨  http://www.vldb.org/pvldb/vol4/p1508-doan-tutorial4.pdf
n  Human Computation: Core Research Questions and State of the Art
¨  http://www.humancomputation.com/Tutorial.html
n  Crowdsourcing & Machine Learning
¨  http://www.cs.rutgers.edu/~hirsh/icml-2011-tutorial/
n  Data quality and data cleaning: an overview
¨  http://dl.acm.org/citation.cfm?id=872875

Datasets
n  TREC Crowdsourcing Track
¨  https://sites.google.com/site/treccrowd/
n  2010 Crowdsourced Web Relevance Judgments Data
¨  https://docs.google.com/document/d/
1J9H7UIqTGzTO3mArkOYaTaQPibqOTYb_LwpCpu2qFCU/edit
n  Statistical QUality Assurance Robustness Evaluation Data
¨  http://ir.ischool.utexas.edu/square/data.html
n  Crowdsourcing at Scale 2013
¨  http://www.crowdscale.org/
n  USEWOD - Usage Analysis and the Web of Data
¨  http://usewod.org/usewodorg-2.html
n  NAACL 2010 Workshop
¨  https://sites.google.com/site/amtworkshop2010/data-1
n  mturk-tracker.com
n  GalaxyZoo.com
n  CrowdCrafting.com

Credits
Special thanks to Umair ul Hassan for his assistance
with the Tutorial
EarthBiAs2014
