Data and text mining workshop
The role of crowdsourcing
Anna Noel-Storr
Wellcome Trust, London, Friday 6th March 2015
What is crowdsourcing?
“…the practice of obtaining needed services, ideas, or content by soliciting contributions
from a large group of people, and especially from an online community, rather than from
traditional employees…”
Image credit: DesignCareer
What is crowdsourcing?
Brabham’s problem-focused crowdsourcing typology: 4 types
1. Knowledge discovery and management
2. Broadcast search
3. Peer-vetted creative production
4. Distributed human intelligence tasking
Micro-tasking: process
Breaking a large corpus of data down into smaller units
and distributing those units to a large online crowd
“the distribution of small parts of a problem”
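To make that concrete, here is a minimal sketch in Python of how a large set of citations might be chunked into small units and handed out to several screeners each; the function names, unit size and crowd size are illustrative assumptions, not the Cochrane implementation.

```python
import random

def make_microtasks(citations, unit_size=25):
    """Break a large corpus of citations into small screening units."""
    return [citations[i:i + unit_size] for i in range(0, len(citations), unit_size)]

def assign_units(units, screeners, copies=3):
    """Hand each unit to several screeners so their decisions can later be compared."""
    return {
        unit_id: {"records": unit, "screeners": random.sample(screeners, copies)}
        for unit_id, unit in enumerate(units)
    }

# Illustrative numbers only: 10,000 citations and a crowd of 900 volunteers
citations = [f"citation-{n}" for n in range(10_000)]
crowd = [f"screener-{n}" for n in range(900)]
assignments = assign_units(make_microtasks(citations), crowd)
print(f"{len(assignments)} units of up to 25 records each")
```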
Human computation
Humans remain better than machines at certain tasks:
e.g. identifying pizza toppings from a picture of a pizza
e.g. seeing that “preventing obesity without eating like a rabbit”.ti. is not an animal study, despite the machine auto-tag: Animal study
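As a toy illustration of why that second example defeats a simple machine, here is a hypothetical keyword auto-tagger (not any real Embase or Cochrane tagger) that fires on “rabbit” even though a human instantly sees the title is about diet, not animals.

```python
import re

# Naive keyword rule: tag anything mentioning a common lab animal
ANIMAL_TERMS = re.compile(r"\b(rat|mouse|mice|rabbit|pig|dog)s?\b", re.IGNORECASE)

def auto_tag(title):
    """Return 'Animal study' whenever an animal term appears in the title."""
    return "Animal study" if ANIMAL_TERMS.search(title) else "No tag"

print(auto_tag("Preventing obesity without eating like a rabbit"))  # -> Animal study (wrong)
```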
Tools and platforms
What platforms and tools exist and how do they work?
Image credit: ThinkStock
The Zooniverse
“each project uses the efforts and ability of volunteers to help
scientists and researchers deal with the flood of data that confronts them”
Classification and annotation
Galaxy Zoo
Operation War Diary
Health-related evidence production
Can we use crowdsourcing to identify the evidence in a more timely way?
- A known pressure point within the review production process
- Between 2,000 and 5,000 citations per new review, but it can be many more
- A not-much-loved task
Trial identification: the Embase project
Cochrane’s Central Register of Controlled Trials (CENTRAL) is fed from two Embase streams: records identified automatically and records screened by the crowd.
Step 1: Run one very sensitive search for studies in Embase, the largest biomedical database.
Step 2: Use the crowd to screen thousands of the remaining search results and feed the identified reports of RCTs into CENTRAL.
How will the crowd do this?
The screening tool
Three choices: RCT/CCT, Reject or Unsure
You are not alone! (and you can’t go back)
Progress bar
Yellow highlights indicate a likely RCT
Red highlights indicate a likely Reject
The Embase project: recruitment
- 900+ people have signed up to screen citations in 12 months
- 110,000+ citations have been collectively screened
- 4,000 RCTs/q-RCTs identified by the crowd
[Chart: number of participants per month, Feb 2014 to Mar 2015]
Why do people do it?
- Wanting to do something to contribute (healthcare is a strong hook)
- We made it very easy to participate (and equally easy to stop!)
- Gain experience (bulk up the CV)
- Provide feedback, both to the individual and to the community (people are more likely to come back)
How does a record get routed?
Three consecutive RCT decisions send a record to CENTRAL; three consecutive Reject decisions send it to the bin; an Unsure, or any disagreement, sends it to a Resolver (around 5% of all records screened).
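A minimal sketch of that routing rule, assuming a record’s first three crowd decisions are available as a list; the labels and function name are illustrative, not the production screening system.

```python
def route_record(decisions):
    """Route a record from its first three crowd decisions.

    decisions: the first three classifications, each "RCT", "Reject" or "Unsure".
    Returns "CENTRAL", "Bin" or "Resolver".
    """
    first_three = decisions[:3]
    if first_three == ["RCT"] * 3:
        return "CENTRAL"    # three consecutive RCT decisions
    if first_three == ["Reject"] * 3:
        return "Bin"        # three consecutive Reject decisions
    return "Resolver"       # any disagreement, or any Unsure

print(route_record(["RCT", "RCT", "RCT"]))           # CENTRAL
print(route_record(["Reject", "Reject", "Reject"]))  # Bin
print(route_record(["RCT", "Unsure", "RCT"]))        # Resolver
```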
How accurate is the crowd?
Crowd accuracy
Index test: the crowd. Reference standard: the information specialist(s).
Validation 1: TP 1565, FP 9, FN 2, TN 2888
Sensitivity: 99.9% Specificity: 99.7%
(enriched sample; blinded to crowd decision; dual independent screeners as reference standard)
Validation 2: TP 415, FP 5, FN 1, TN 2649
Sensitivity: 99.8% Specificity: 99.8%
(enriched sample; blinded to crowd decision; single independent expert screener (me!) as reference standard; possibility of incorporation bias)
Individual screener accuracy is also carefully monitored
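The sensitivity and specificity figures follow directly from those counts; a quick check in Python reproduces them, using the validation counts shown above.

```python
def sensitivity(tp, fn):
    """Proportion of true RCTs the crowd correctly identified."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Proportion of non-eligible records the crowd correctly rejected."""
    return tn / (tn + fp)

# Validation 1: dual independent screeners as reference standard
print(round(100 * sensitivity(1565, 2), 1), round(100 * specificity(2888, 9), 1))  # 99.9 99.7
# Validation 2: single expert screener as reference standard
print(round(100 * sensitivity(415, 1), 1), round(100 * specificity(2649, 5), 1))   # 99.8 99.8
```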
How fast is the crowd?
Length of time to screen one month’s worth of records: 6 weeks (Jan 2014), 5 weeks (Jul 2014), 2 weeks (Jan 2015).
More screeners, and screeners screening more quickly.
More of the same, and more tasks
As the crowd becomes more efficient, we plan to do two things:
1. Increase the number of databases we search: feed in more citations from other databases
2. Offer other ‘micro-tasks’: beyond the Y/N screen, records could be annotated and appraised, e.g. is the healthcare condition Alzheimer’s disease? Y, N, Unsure
And in these tasks the machine plays a vital and complementary role…
Perfect partnership
Machine-driven probability + collective human decision-making
It’s not one or the other; the ideal is both.
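As a rough sketch of that partnership: a text-mining classifier supplies a probability that a record describes an RCT, and only the uncertain records go to the crowd for a collective decision. The thresholds and the classifier itself are assumptions for illustration, not the project’s actual pipeline.

```python
def triage(records, rct_probability, high=0.99, low=0.01):
    """Split records by a machine-estimated probability of being an RCT.

    rct_probability: a function returning P(RCT) for a record; in practice this
    would come from a trained text-mining classifier (assumed here).
    """
    auto_include, auto_reject, to_crowd = [], [], []
    for record in records:
        p = rct_probability(record)
        if p >= high:
            auto_include.append(record)   # machine is confident it is a trial
        elif p <= low:
            auto_reject.append(record)    # machine is confident it is not
        else:
            to_crowd.append(record)       # uncertain: collective human decision
    return auto_include, auto_reject, to_crowd
```

Records routed to the crowd would then pass through the three-agreement rule sketched earlier.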
In summary
Crowdsourcing:
• An effective method for large-scale study identification
• Identifies more studies, more quickly
• No compromise on quality or accuracy
• Offers meaningful ways to contribute
• Feasible to recruit a crowd
• A highly functional screening tool
• Complements data and text mining
And it enables the move towards the living review.
Editor's Notes
  • #2: I’m going to talk about the role that crowdsourcing can play in the evidence synthesis process and importantly in the move towards the ‘living review’.
  • #3: But first: What is crowdsourcing? Broadly speaking it’s: “…”
  • #4: There are different types, and therefore different approaches and tools are needed, depending on what you need or want from the crowd. Brabham’s problem-focused crowdsourcing typology is made up of four types: 1. Knowledge discovery and management, where you task your crowd with finding and collecting information into a common location and format.
  • #5: 2. Broadcast search: where the organisation tasks the crowd with solving an empirical problem.
  • #6: 3. Peer-vetted creative production: where the organisation tasks a crowd with creating and selecting creative ideas, and…
  • #7: 4. Distributed human-intelligence tasking: where the organisation tasks a crowd with analysing large amounts of information.
  • #8: And it’s this fourth type I primarily want to focus on today.
  • #9: It’s about taking a large corpus of data and breaking it down into much smaller units, which are then distributed via the internet to a community of willing volunteers to process: “the distribution of small parts of a problem”.
  • #10: Some call this kind of work human computation or human intelligence tasking, because these are tasks where human intelligence is still needed and where humans still outperform the machine: identifying pizza toppings from an image of a pizza, or, of more relevance to us perhaps, recognizing very quickly that an article is not actually about rabbits just because it has “rabbit” in the title…
  • #11: So what tools and platforms exist and what data does the crowd help to process?
  • #12: The Zooniverse, maintained and developed by the Citizen Science Alliance, is the largest and most successful citizen science platform. It began with one project, Galaxy Zoo, over eight years ago. The platform now hosts over 30 projects and has a worldwide community of almost 1.3 million people. In their words, each project uses the “efforts and ability of volunteers to help scientists and researchers deal with the flood of data that confronts them”. Their focus began on all things space-related, but they have branched out into the humanities and into aspects of healthcare research.
  • #13: Here are two examples from two different projects hosted on the Zooniverse. The first, Galaxy Zoo, shows the volunteer an image of a galaxy and then asks a series of questions about that image, such as is it in a spiral shape? The second, Operation War Diary, gets volunteers to tag pages of war diaries.
  • #14: In our field, that of evidence appraisal, production and dissemination, we face similar challenges in keeping up with the amount of data produced. Within Cochrane we have been exploring the role of crowdsourcing in helping us to process the flood of data. Our efforts so far have largely focused on one well-known pressure point within the review production process: trial identification. Our traditional model is under increasing strain as research output grows exponentially. The identified ‘micro-task’ within this broader task of trial identification is citation screening. It is estimated that the average new systematic review identifies between 2,000 and 5,000 citations, but this can be much higher for reviews in certain domains or considering certain types of intervention. What if we could find a reliable and fast way to feed all reports of randomised trials into one central repository, thereby removing the need for individual, often small and under-resourced review teams to create and run complex searches across multiple databases and then spend months screening those results for a single review?
  • #15: So I’m part of a team managing a project rather uninspiringly called the Embase project. Our aim is to use the crowd to help us keep up with the deluge of publications. We run one very sensitive search in Embase (the largest biomedical database in the world) for trials. This search identifies thousands of citations, as you would expect. Some of the results of this search we feed directly into Cochrane’s Central Register of Controlled Trials (CENTRAL). What’s left needs human intervention, and it’s these records that we send out to the crowd to classify.
  • #16: We do this using a citation screening tool. This tool is fundamental to the crowd’s ability to perform the task. We wanted to develop something that focused almost entirely on the task in hand, that of screening a citation: as you can see, the screen is mostly taken up with the citation, which is stripped down to just title and abstract. There are some built-in, pre-defined highlighted words and phrases to help guide screeners to the most relevant parts of a citation. Yellow-highlighted words and phrases indicate that the record is likely to be describing an RCT, and red highlights indicate that the record is likely to be a Reject. Screeners can also add their own highlights. There are three decision buttons (RCT/CCT, Reject or Unsure) and screeners have to make a decision on every record. Two other features I want to quickly point out are the all-important progress bar and the feature that tells you how many others are online screening citations at the same time…
  • #17: We have a task, we have a tool, we just need a crowd. We’ve not found this difficult: in the year since going live, over 900 people have signed up to take part. The crowd have screened over 110,000 citations and identified 4,000 reports of RCTs.
  • #18: We’ve been really pleased with those metrics. Personally, I’m not surprised by them, but I do get a lot of people asking me: why do people do it? I don’t think there’s one answer. I think many come to it knowing quite a bit about evidence-based medicine and the pressures and challenges of producing timely and robust evidence, and they therefore want to help in this effort (and this provides a very real and immediate way they can help). Related to that point, I think having made it very easy to contribute has played a significant part in our success. We’ve adopted a rapid onboarding approach, and we also offer rapid disembarking: you can stop doing this whenever you want, you are under no obligation and no pressure; this is to fit around you, not the other way round. And then there are two other factors which I’m well aware we haven’t realised to their full potential (more related to keeping people doing it): gaining experience and getting feedback on your performance, and perhaps being able to offer some progression or more tailored rewards.
  • #19: So we’ve managed to recruit a crowd and they have collectively screened well over 100,000 citations. How do we ensure quality? How do we ensure that records end up in the right place: CENTRAL for RCTs, the ‘bin’ for Rejects? We use a simple yet robust algorithm which goes like this: three consecutive agreements on a record send that record off to CENTRAL or the bin without further intervention. Any disagreements, or any records classified as Unsure, go into a pot for a Resolver-level screener to resolve. Happily, this constitutes only about 5% of all records screened.
  • #20: So how well has this algorithm performed? We’ve performed two validation studies so far and two more are underway. Each of those involved taking a random sample of crowd-screened records and having them re-screened by ‘expert screeners’ blind to the crowd decisions. In both validation studies, crowd sensitivity (the crowd’s ability to identify all the RCTs) and crowd specificity (the crowd’s ability to identify records not eligible for CENTRAL) came out at over 99%. We’re happy with that.
  • #21: So we are pleased with accuracy. What about volume? Are the crowd screening enough? Collectively we are speeding up. [explain graph] This is an exciting place to be….
  • #22: It means we can look to increase the number of databases we search and screen in this way (i.e. using the crowd), and we can provide the crowd with more tasks aimed at contributing significantly to identifying trials in a much more timely way, with no compromise on quality or accuracy. And in any such task, the machine plays a vital and complementary role…
  • #23: As we discover more about the very real role that text mining has to play, we can start to see further efficiencies reached by using both crowd and machine in a way that plays to each one’s strengths: the machine generating the probabilities and the crowd making the accurate collective decisions.