Integrating Crowd & Cloud Resources for Big Data

Michael Franklin

Middleware 2012, Montreal
December 6, 2012

UC BERKELEY / Expeditions in Computing
CROWDSOURCING
WHAT IS IT?
Citizen Science

NASA “Clickworkers”, 2000

Citizen Journalism / Participatory Sensing

4
Communities & Expertise
Data Collection & Curation
e.g., Freebase
An Academic View

From Quinn & Bederson, “Human Computation: A Survey
and Taxonomy of a Growing Field”, CHI 2011.
How Industry Looks At It
Useful Taxonomies
• Doan, Halevy, Ramakrishnan (Crowdsourcing),
  CACM 4/11
  – nature of collaboration (implicit vs. explicit)
  – architecture (standalone vs. piggybacked)
  – must recruit users/workers? (yes or no)
  – what do users/workers do?
• Quinn & Bederson (Human Computation), CHI '11
  – Motivation (pay, altruism, enjoyment, reputation)
  – Quality Control (many mechanisms)
  – Aggregation (how are results combined?)
  – Human Skill (visual recognition, language, …)
  – …
Types of Tasks

Task Granularity   Examples
Complex Tasks      • Build a website
                   • Develop a software system
                   • Overthrow a government?
Simple Projects    • Design a logo and visual identity
                   • Write a term paper
Macro Tasks        • Write a restaurant review
                   • Test a new website feature
                   • Identify a galaxy
Micro Tasks        • Label an image
                   • Verify an address
                   • Simple entity resolution

Inspired by the report “Paid Crowdsourcing”, Smartsheet.com, 9/15/2009
MICRO-TASK MARKETPLACES
Amazon Mechanical Turk (AMT)
Microtasking – Virtualized Humans
• Current leader: Amazon Mechanical Turk
• Requestors place Human Intelligence Tasks
  (HITs)
      – set price per “assignment” (usually cents)
      – specify # of replicas (assignments), expiration, …
      – User Interface (for workers)
      – API-based: “createHit()”, “getAssignments()”,
        “approveAssignments()”, “forceExpire()” (see the sketch below)
• Requestors approve jobs and payment
• Workers (a.k.a. “turkers”) choose jobs, do them,
  get paid
 13
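
The requester lifecycle above maps onto a handful of API calls. Below is a
minimal sketch using boto3's MTurk client (a modern successor to the
2012-era API named on the slide); the reward, replica count, and question
payload are illustrative assumptions.

  # Sketch of the requester lifecycle above, using boto3's MTurk client
  # (the modern successor of the 2012-era API the slide names).
  import boto3

  mturk = boto3.client("mturk", region_name="us-east-1")
  question_xml = "..."  # assumed: an HTMLQuestion/QuestionForm XML payload

  # createHit(): post a HIT with a price per assignment and a replica count.
  hit = mturk.create_hit(
      Title="Verify a company address",
      Description="Check whether the given address matches the company.",
      Reward="0.02",                   # dollars per assignment (illustrative)
      MaxAssignments=3,                # number of replicas
      LifetimeInSeconds=24 * 3600,     # expiration
      AssignmentDurationInSeconds=600,
      Question=question_xml,
  )
  hit_id = hit["HIT"]["HITId"]

  # getAssignments(): collect submitted work for review.
  assignments = mturk.list_assignments_for_hit(
      HITId=hit_id, AssignmentStatuses=["Submitted"]
  )["Assignments"]

  # approveAssignments(): accept the work and release payment.
  for a in assignments:
      mturk.approve_assignment(AssignmentId=a["AssignmentId"])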
AMT Worker Interface
Microtask Aggregators
Crowdsourcing for Data Management
• Relational: data cleaning, data entry, information extraction,
  schema matching, entity resolution, data spaces, building
  structured KBs, sorting, top-k, ...
• Beyond relational: graph search, classification, transcription,
  mobile image search, social media analysis, question answering,
  NLP, text summarization, sentiment analysis, semantic wikis, ...
 18
TOWARDS HYBRID
CROWD/CLOUD COMPUTING
Not Exactly Crowdsourcing, but…

“The hope is that, in not too many years, human brains
and computing machines will be coupled together very
tightly, and that the resulting partnership will think as no
human brain has ever thought and process data in a way
not approached by the information-handling machines
we know today.”
                  J.C.R. Licklider, “Man-Computer Symbiosis”, 1960
AMP: Integrating Diverse Resources

Algorithms: Machine Learning and Analytics
Machines:   Cloud Computing
People:     Crowdsourcing & Human Computation

21
The Berkeley AMPLab
• Goal: data analytics stack integrating A, M & P
  • BDAS: released as BSD/Apache open source
• 6-year duration: 2011-2017
• 8 CS faculty
  • Directors: Franklin (DB), Jordan (ML), Stoica (Sys)
• Industrial support & collaboration
• NSF Expedition and DARPA XData
  22
People in AMP
• Long-term goal: make people an integrated part of the system!
  • Leverage human activity
  • Leverage human intelligence
  (diagram: people exchange questions/answers and activity/data
   with machines + algorithms)
• Current AMP people projects
  – Carat: collaborative energy debugging
  – CrowdDB: “the world's dumbest database system”
  – CrowdER: hybrid computation for entity resolution
  – CrowdQ: hybrid unstructured query answering
  23
Carat: Leveraging Human Activity

~500,000 downloads to date

A. J. Oliner et al. Collaborative Energy Debugging for Mobile Devices.
Workshop on Hot Topics in System Dependability (HotDep), 2012.

24
Carat: How it Works

Collaborative Detection of Energy Bugs

25
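
A minimal sketch of what collaborative detection of energy bugs can look
like: pool drain-rate reports from many devices and flag apps whose
presence correlates with faster drain. The data layout and the naive
significance check are illustrative assumptions, not Carat's actual method.

  # Toy sketch of collaborative energy debugging: compare battery drain
  # rates reported by devices running an app against a no-app baseline,
  # and flag apps whose populations drain clearly faster. The layout and
  # the naive z-style test are illustrative assumptions.
  from statistics import mean, stdev

  def flag_energy_hogs(samples, min_samples=30):
      """samples: list of (app_name or None, drain_rate_pct_per_hour)."""
      baseline = [r for app, r in samples if app is None]
      by_app = {}
      for app, rate in samples:
          if app is not None:
              by_app.setdefault(app, []).append(rate)

      mu, sigma = mean(baseline), stdev(baseline)
      hogs = []
      for app, rates in by_app.items():
          if len(rates) < min_samples:
              continue        # not enough crowd-contributed evidence yet
          if mean(rates) > mu + 2 * sigma / (len(rates) ** 0.5):
              hogs.append((app, mean(rates) - mu))
      return sorted(hogs, key=lambda x: -x[1])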
Leveraging Human Intelligence
First Attempt: CrowdDB

(architecture diagram: a traditional stack of Parser, Optimizer, Executor,
Statistics, Metadata, and Files/Access Methods over disks, extended with
crowd-side components: a Turker Relationship Manager, UI Creation / Form
Editor, a UI Template Manager, and a HIT Manager; CrowdSQL in, results out)

See also:
  Qurk – MIT
  Deco – Stanford

CrowdDB: Answering Queries with Crowdsourcing, SIGMOD 2011
Query Processing with the VLDB Crowd, VLDB 2011
 26
DB-hard Queries

Company_Name              Address                    Market_Cap
Google                    Googleplex, Mtn. View CA   $210Bn
Intl. Business Machines   Armonk, NY                 $200Bn
Microsoft                 Redmond, WA                $250Bn

SELECT Market_Cap
FROM   Companies
WHERE  Company_Name = 'IBM'

Number of rows: 0

Problem: Entity Resolution

27
DB-hard Queries

Company_Name              Address                    Market_Cap
Google                    Googleplex, Mtn. View CA   $210Bn
Intl. Business Machines   Armonk, NY                 $200Bn
Microsoft                 Redmond, WA                $250Bn

SELECT Market_Cap
FROM   Companies
WHERE  Company_Name = 'Apple'

Number of rows: 0

Problem: Closed-World Assumption

28
DB-hard Queries

SELECT Image
FROM   Pictures
WHERE  Image CONTAINS 'Good Looking Dog'

Number of rows: 0

Problem: Subjective Comparison

29
Leveraging Human Intelligence
First Attempt: CrowdDB

Where to use the crowd:
• Cleaning and disambiguation
• Finding missing data
• Making subjective comparisons

(same architecture diagram as slide 26)

CrowdDB: Answering Queries with Crowdsourcing, SIGMOD 2011
Query Processing with the VLDB Crowd, VLDB 2011
 30
CrowdDB – Worker Interface

31

Mobile Platform

32
CrowdSQL

DDL extensions:

Crowdsourced columns:
  CREATE TABLE company (
    name STRING PRIMARY KEY,
    hq_address CROWD STRING);

Crowdsourced tables:
  CREATE CROWD TABLE department (
    university STRING,
    department STRING,
    phone_no STRING,
    PRIMARY KEY (university, department));

DML extensions:

CrowdEqual:
  SELECT *
  FROM companies
  WHERE name ~= "Big Blue"

CROWDORDER operators (currently UDFs):
  SELECT p FROM picture
  WHERE subject = "Golden Gate Bridge"
  ORDER BY CROWDORDER(p, "Which pic shows better %subject");

  33
CrowdDB Query: Picture Ordering

Query:
  SELECT p FROM picture
  WHERE subject = "Golden Gate Bridge"
  ORDER BY CROWDORDER(p, "Which pic shows better %subject");

Data size:    30 subject areas, with 8 pictures each
Batching:     4 orderings per HIT
Replication:  3 assignments per HIT
Price:        1 cent per HIT

(worker UI: “Which picture visualizes better 'Golden Gate Bridge'?”;
 results compared turker votes, turker ranking, and expert ranking)

34
User Interface vs. Quality

Three interface designs for crowdsourcing a professor/department join
(MTJoin/MTProbe plans over Professor and Department, probing on
name = "Carey"):

  Department first:      ≈10% error rate
  Professor first:       ≈10% error rate
  De-normalized probe:   ≈80% error rate

(screenshots: “Please fill out the missing professor/department data”
 forms for each plan)

   35
Turker Affinity and Errors

(plot: worker error behavior by turker rank)

36
A Bigger Underlying Issue

Closed-World  vs.  Open-World

37
What Does This Query Mean?

SELECT COUNT(*) FROM IceCreamFlavors

Trushkowsky et al. Crowdsourcing Enumeration Queries, ICDE 2013 (to appear)

 38
Estimating Completeness
SELECT COUNT(*) FROM US States

US states using Mechanical Turk:
Species estimation techniques perform well on average
• Uniform under-predicts slightly; coefficient of variation = 0.5
• Decent estimate after 100 HITs

(plot: average # of unique answers vs. # of responses (HITs); the curve
climbs toward the 50 unique states over ~300 responses)

39
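
The species-estimation idea can be made concrete with the classic Chao1
estimator. A minimal sketch follows; this is one standard estimator from
that literature, and the ICDE paper's preferred estimator may differ in
details.

  # Chao1 species-richness estimate over crowdsourced answers: the number
  # of distinct answers seen plus a correction driven by how many answers
  # were given exactly once (f1) or exactly twice (f2). Many singletons
  # imply many still-unseen answers.
  from collections import Counter

  def chao1(answers):
      counts = Counter(answers)      # answer -> number of workers giving it
      s_obs = len(counts)            # distinct answers observed so far
      f1 = sum(1 for c in counts.values() if c == 1)
      f2 = sum(1 for c in counts.values() if c == 2)
      if f2 == 0:
          return s_obs + f1 * (f1 - 1) / 2.0   # bias-corrected form
      return s_obs + f1 * f1 / (2.0 * f2)

  # e.g. chao1 over 300 HIT responses naming US states should approach 50.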
Estimating Completeness
SELECT COUNT(*) FROM IceCreamFlavors

• Ice cream flavors:
  – Estimators don't converge
  – Very highly skewed (CV = 5.8)
  – Can detect that the # of HITs is insufficient
    (we are still at the beginning of the curve)

Few, short lists of ice cream flavors (e.g., “alumni swirl, apple cobbler
crunch, arboretum breeze, …” from the Penn State Creamery)

 40
Pay-as-you-go
• “I don't believe it is usually possible to estimate the number of
  species... but only an appropriate lower bound for that number.
  This is because there is nearly always a good chance that there are
  a very large number of extremely rare species.”
                                                      – Good, 1953
• So instead, we can ask: “What is the benefit of m additional HITs?”

  Ice cream after 1,500 HITs:
  m     Actual   Shen   Spline
  10    1        1.79   1.62
  50    7        8.91   8.22
  200   39       35.4   32.9
41
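
The “Shen” column refers to an extrapolation estimator from the
species-estimation literature (Shen, Chao & Lin). A rough sketch of the
pay-as-you-go question, reusing the chao1 helper sketched earlier; the
exact published formula differs in details.

  # Rough sketch of "what is the benefit of m additional HITs?": estimate
  # the unseen-answer count f0, then the chance each unseen answer shows
  # up in m more responses. This follows the shape of Shen-style
  # estimators; the exact published formula differs in details.
  from collections import Counter

  def expected_new_answers(answers, m):
      counts = Counter(answers)
      n = len(answers)                                # responses so far
      f1 = sum(1 for c in counts.values() if c == 1)  # singleton answers
      f0 = chao1(answers) - len(counts)               # estimated unseen answers
      if f0 <= 0 or f1 == 0:
          return 0.0
      p_unseen = f1 / n            # Good-Turing mass of unseen answers
      return f0 * (1.0 - (1.0 - p_unseen / f0) ** m)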
CrowdER – Entity Resolution

(diagram: a DB produces candidate pairs for crowd verification)

42
Hybrid Entity Resolution

Threshold   = 0.2
# Pairs     = 8,315
# HITs      = 508
Cost        = $38.10
Time        = 4.5 h
Time (QT)   = 20 h

J. Wang et al. CrowdER: Crowdsourcing Entity Resolution, PVLDB 2012
43
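
The numbers above come from a hybrid pipeline: a cheap machine pass scores
all pairs, and only pairs above the similarity threshold become HITs. A
minimal sketch of that split, with token Jaccard and the crowd callback
standing in for the paper's actual machinery.

  # Minimal sketch of hybrid entity resolution in the CrowdER style: a
  # machine-side similarity pass prunes the O(n^2) pair space, and only
  # pairs above a threshold are sent to the crowd for verification.
  # Token Jaccard and the 0.2 threshold stand in for the paper's machinery.
  from itertools import combinations

  def jaccard(a: str, b: str) -> float:
      ta, tb = set(a.lower().split()), set(b.lower().split())
      return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

  def candidate_pairs(records, threshold=0.2):
      """Machine pass: keep only pairs plausible enough to ask the crowd."""
      return [(r1, r2) for r1, r2 in combinations(records, 2)
              if jaccard(r1, r2) >= threshold]

  def resolve(records, ask_crowd, threshold=0.2):
      """ask_crowd(r1, r2) -> bool is the crowdsourced verification step."""
      return [(r1, r2) for r1, r2 in candidate_pairs(records, threshold)
              if ask_crowd(r1, r2)]

  # e.g. resolve(["Intl. Business Machines", "IBM Corp.", "Microsoft"],
  #              ask_crowd=lambda a, b: post_pair_hit(a, b))
  # where post_pair_hit is a hypothetical HIT-posting helper.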
CrowdQ – Query Generation
• Help find answers to unstructured queries
  – Approach: generate a structured query via templates
• Machines do parsing and ontology lookup
• People do the rest: verification, entity extraction, etc.

Demartini et al. CrowdQ: Crowdsourced Query Understanding, CIDR 2013 (to appear)

   44
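
One way to picture the machine/people split described above: the machine
proposes a structured template from a keyword query, and workers verify
the interpretation. Everything below (the toy pattern, the verification
question) is an illustrative assumption, not CrowdQ's actual pipeline.

  # Toy illustration of the machine/crowd split: the machine side
  # pattern-matches a keyword query into a structured (SPARQL-like)
  # template, and the crowd verifies the interpretation.
  import re

  def propose_template(keyword_query: str):
      """Machine side: crude parse of 'X of Y' questions into a triple."""
      m = re.match(r"(?:what is the )?(\w+) of (.+)", keyword_query.lower())
      if not m:
          return None
      predicate, entity = m.groups()
      return {"pattern": f"?x <{predicate}> ?y",
              "entity": entity,
              "question": f"Does this query ask for the {predicate} "
                          f"of '{entity}'? (yes/no)"}

  def understand(keyword_query, ask_crowd):
      """People side: workers verify the machine's interpretation."""
      template = propose_template(keyword_query)
      if template and ask_crowd(template["question"]):
          return template["pattern"]
      return None  # fall back, e.g. ask workers to extract entities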
SO, WHERE DOES
MIDDLEWARE FIT IN?
Generic Architecture

  application
       |
  Hybrid Platform

“Middleware is the software that resides between applications and the
underlying architecture. The goal of middleware is to facilitate the
development of applications by providing higher-level abstractions for
better programmability, performance, scalability, security, and a
variety of essential features.”
                                        – Middleware 2012 web page
The Challenge

Some issues:
  Incentives
  Latency & Prediction
  Failure Modes
  Work Conditions
  Interface
  Task Structuring
  Task Routing
  …

 47
Can you incentivize workers?

http://waxy.org/2008/11/the_faces_of_mechanical_turk/

48
Incentives

49
Can you trust the crowd?

On Wikipedia, “any user can change any entry, and if enough users
agree with them, it becomes true.”

“The Elephant population in Africa has tripled over the past six months.” [1]

Wikiality: reality as decided on by majority rule. [2]

[1] http://en.wikipedia.org/wiki/Cultural_impact_of_The_Colbert_Report
[2] http://www.urbandictionary.com/define.php?term=wikiality
Answer Quality Approaches

• Some General Techniques
     – Approval Rate / Demographic Restrictions
     – Qualification Test
     – Gold Sets/Honey Pots
      – Redundancy and Voting (see the sketch below)
     – Statistical Measures and Bias Reduction
     – Verification/Review
• Query Specific Techniques
• Worker Relationship Management
51
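
The redundancy-and-voting technique from the list above is the simplest to
show concretely. A minimal sketch, assuming each task was replicated
across several workers as in the AMT setup earlier; tie-breaking and
worker weighting are deliberately naive here.

  # Minimal sketch of redundancy + voting: replicate each task across
  # several workers and keep the majority answer, using the agreement
  # rate as a crude confidence signal.
  from collections import Counter

  def majority_vote(task_answers):
      """task_answers: {task_id: [answer from each replicated assignment]}"""
      results = {}
      for task_id, answers in task_answers.items():
          winner, votes = Counter(answers).most_common(1)[0]
          results[task_id] = (winner, votes / len(answers))
      return results

  # e.g. majority_vote({"img-17": ["cat", "cat", "dog"]})
  #   -> {"img-17": ("cat", 0.666...)}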
Can you organize the crowd?

Soylent, a prototype...
• Independent agreement to identify patches
• Randomize order of suggestions

52
[Bernstein et al.: Soylent: A Word Processor with a Crowd Inside. UIST, 2010]
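
The “independent agreement” step lends itself to a short sketch: keep only
the text regions that multiple workers flagged independently. The span
representation and the quorum value are illustrative assumptions, not
Soylent's actual implementation.

  # Sketch of "independent agreement": several workers each mark character
  # spans that need editing; only regions flagged by at least `quorum`
  # workers independently become patches.
  def agreed_patches(worker_spans, text_len, quorum=2):
      """worker_spans: list (one per worker) of lists of (start, end)."""
      votes = [0] * text_len
      for spans in worker_spans:
          for start, end in spans:
              for i in range(start, min(end, text_len)):
                  votes[i] += 1
      # Collect maximal runs of positions that reached the quorum.
      patches, run_start = [], None
      for i, v in enumerate(votes + [0]):    # sentinel flushes final run
          if v >= quorum and run_start is None:
              run_start = i
          elif v < quorum and run_start is not None:
              patches.append((run_start, i))
              run_start = None
      return patches

  # Three workers mark overlapping regions of a 40-char paragraph:
  # agreed_patches([[(5, 15)], [(8, 20)], [(30, 35)]], 40) -> [(8, 15)]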
Can You Predict the Crowd?

Streakers        List walking

53
Can you build a low-latency crowd?

from: M. S. Bernstein, J. Brandt, R. C. Miller, D. R. Karger, “Crowds in Two
Seconds: Enabling Realtime Crowdsourced Applications”, UIST 2011.

  54
Can you help the crowd?
For More Information

Crowdsourcing tutorials:
• P. Ipeirotis, Managing Crowdsourced Human Computation, WWW '11,
  March 2011.
• O. Alonso, M. Lease, Crowdsourcing for Information Retrieval:
  Principles, Methods, and Applications, SIGIR, July 2011.
• A. Doan, M. Franklin, D. Kossmann, T. Kraska, Crowdsourcing
  Applications and Platforms: A Data Management Perspective, VLDB 2011.

AMPLab: amplab.cs.berkeley.edu
• Papers
• Project descriptions and pages
• News updates and blogs

56

Editor's Notes

  • #10: Fix me!
  • #14: For the database administrator it is the correct answer, but for the CEO it is not really understandable.
  • #15: Equal is not a good fit.
  • #19: 210 HITs. It took 68 minutes to complete the whole experiment.
  • #42: Lead off by saying a heavily skewed distribution will be difficult to estimate; only a lower bound is possible (see quote). Instead, reason about the cost vs. benefit tradeoff. When you ask a slightly different question, you can still make progress!
  • #52: General techniques (non-DB techniques)