This project has received funding from the
European Union’s Horizon 2020 research
and innovation programme under grant
agreement No. 726992.
NLDB 2018
Classification of intangible social innovation
concepts
Nikola Milošević, Abdullah Gok, Goran Nenadić
School of Computer Science, The University of Manchester
Manchester Institute for Innovation Research, Alliance Manchester Business School
The University of Manchester
nikola.milosevic@manchester.ac.uk
http://inspiratron.org
Twitter: @dreadknight011
Context - KNOWMAK
• EU project KNOWMAK – Knowledge in the Making in
the European Society
• Traditional sources of knowledge (publications, patents)
• Untraditional sources of
knowledge creation –
Social innovations
• Actors and outputs
(publications, patents, projects)
European Social Innovation
Database
• Purpose
– Input into the KNOWMAK database, as the part addressing untraditional knowledge creation
– A unique collection of social innovations that can be used independently by researchers and policy makers
• Aim:
– Create the database using NLP, ML and web crawling
– The database should be comprehensive and systematic
Social Innovations
• Development of new ideas, services and products that
aim to solve certain social problems
• There are many definitions
• Overlaps with:
– Open innovation
– Free innovation
– Grassroots innovation
– User innovation
– Conventional innovation
Examples of Social Innovations
Why create a database of social
innovation?
• Social sciences
• Creation of rules and policies
– Social policy experimentation (EU)
• Evaluation of project funding
– Impact
– Scaling
• Societal trends
Current databases
• There are some social innovation databases, but they are:
– Topic specific
– Small (case studies)
– Limited information
– One source of information
• Expert curated (expensive)
• User curated (unreliable)
– They use different definitions
– Contain large number of false positives
Other challenges
• Blurry definitions (disagreement)
• Huge number of areas
• Unstructured data (text)
• Unstandardized vocabulary
• Unstandardized sources
• Large number of variables
• Lack of labelled data
Disentangling Social Innovation Criteria

1. Objectives
The project primarily or exclusively satisfies (often unmet) societal needs, including the needs of particular social groups, or aims at social value creation.
Often no price is involved for the main social beneficiary, or the innovation is provided to the main beneficiary at cost only. However, there may be cases where a price is involved.

2. Actors and Actor Interactions
The project satisfies one or both of the following:
i. Diversity of Actors: the project involves actors who would not normally be involved in innovation as an economic activity, including formal (e.g. NGOs, public sector organisations) and informal organisations (e.g. grassroots movements, citizen groups). This involvement may range from full partnership (i.e. the project is conducted jointly) to consultation (i.e. there is representation from different actors).
ii. Social Actor Interactions: the project creates collaborations between “social actors”, small and large businesses and the public sector in different combinations. These collaborations usually involve (predominantly new types of) social interactions towards achieving common goals such as user/community participation. Often, projects aim at significantly different action and diffusion processes that will result in social progress. Social innovation projects often rely on trust relationships rather than solely on mutual benefit.

3. Outputs and Outcomes
The project primarily or exclusively creates socially oriented outputs/outcomes. Often these go beyond the outputs of conventional innovative activity (e.g. products, services, new technologies, patents and publications), although conventional outputs/outcomes may also be present. These outputs/outcomes are often intangible and may include, but are not limited to:
– change in the attitudes, behaviours and perceptions of the actors involved and/or beneficiaries
– social technologies (i.e. new configurations of social practices, including new routines, ways of doing things, laws, rules or norms)
– long-term institutional/cultural change

4. Innovativeness
There should be a form of “implementation of a new or significantly improved product (good or service), or process, a new marketing method, or a new organisational method”.
The project needs to include some form of innovative activity (i.e. scientific, technological, organisational, financial and commercial steps intended to lead to the implementation of the innovation in question). Innovation can be technological (involving the use or creation of technologies) as well as non-technological.
The innovation should be at least “new” to the beneficiaries it targets (it does not have to be new to the world).
Architecture of ESID
ESID Workflow
Main methodology parts
Data collection/Crawling
Project classification
Information extraction
Data collection
• Identify a seed group of projects
– Data sources containing social innovation projects/actors
• Obtain additional unstructured and structured
information
• Input for later processing
• Data enrichment
Infrastructure
Data collection
• 46 data sources
– Downloadable
– Web crawling
• Specific crawler for
each data source
– 3451 projects
– 6092 actors
• Web crawling of websites reported in data sources
– Initially crawled whole domains – 1.715M documents,
194.3GB of data
– Later crawled only the referenced page when a specific
page was given, and the whole domain otherwise
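The slides do not show the crawler code itself; as a minimal sketch, a per-source Scrapy spider (the deck elsewhere mentions Python+Scrapy) could look like the following. The start URL, CSS selectors and item fields are hypothetical placeholders, not the actual ESID crawlers.

```python
import scrapy


class ProjectSpider(scrapy.Spider):
    """Illustrative spider for one of the 46 data sources."""
    name = "example_si_source"
    start_urls = ["https://example.org/projects"]  # hypothetical seed URL

    def parse(self, response):
        # Follow each project listed on the index page to its detail page.
        for href in response.css("a.project-link::attr(href)").getall():
            yield response.follow(href, callback=self.parse_project)

    def parse_project(self, response):
        # Fields used later for classification and enrichment.
        yield {
            "title": response.css("h1::text").get(),
            "description": " ".join(response.css("div.description ::text").getall()),
            "website": response.css("a.external::attr(href)").get(),
            "source_url": response.url,
        }
```

Each of the 46 sources would get its own spider of this shape, tuned to that source's page structure.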
Data collection
• 3451 projects
– 2043 have active websites
– 448 inactive websites (404 or server timeout)
• 314 available through WebArchive
– 1072 missing websites
– 1891 have some description from source database
– 577 (30%) have description longer than 500 characters
– 148 from missing website list have decent description
Main methodology parts
Data collection/Crawling
Project classification
Information extraction
Classifying social innovation
• Defined 4 criteria for social innovation:
– Social objectives
– Actors and actor interactions
– Outputs
– Innovativeness
• No previously available annotated data
• Performed annotation workshops
Annotation process
• Two-three annotators
• Qualified annotators (PhD students, experts from ZSI)
• Extensive annotation guidelines developed for the purpose
• Organised 2 workshops
– To ensure common understanding
– Workshop 1 – 6 people
– Workshop 2 – 4 people, mainly for IAA (40 docs each)
Annotation results
• Inter-annotator agreement - a measure of how well two (or more)
annotators can make the same annotation decision for a certain
category.
• Low agreement between human annotators
• Annotations were made binary (from an initial 3-point scale)
• Sentence- and paragraph-level annotation agreement was below
40%, and in most cases below 20%
Inclusion criteria / level of annotation | Paragraph-level annotations | Document-level annotations
Objectives | 37.5% | 76.60%
Actors and Actor interactions | 17.2% | 65.70%
Outputs | 18.9% | 66.73%
Innovativeness | 19.5% | 70.80%
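The agreement figures above are percentages of matching decisions; a simple pairwise percent-agreement computation, shown here as a sketch rather than the project's actual evaluation script, could be:

```python
from itertools import combinations

def percent_agreement(annotators):
    """annotators: list of dicts mapping document id -> binary label."""
    matches = total = 0
    for a, b in combinations(annotators, 2):
        shared = set(a) & set(b)  # documents labelled by both annotators
        matches += sum(a[doc] == b[doc] for doc in shared)
        total += len(shared)
    return matches / total if total else 0.0

# Hypothetical example: two annotators, three shared documents.
ann1 = {"doc1": 1, "doc2": 0, "doc3": 1}
ann2 = {"doc1": 1, "doc2": 1, "doc3": 1}
print(percent_agreement([ann1, ann2]))  # 0.67
```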
Annotation data
• 277 documents
– Objectives – 166 negative, 111 positive instances
– Actors – 189 negative, 88 positive instances
– Outputs – 190 negative, 87 positive instances
– Innovativeness – 190 negative, 87 positive instances
Annotation results
• Annotated data from October/November 2017
• Almost all databases contain some degree of false positives, according
to the annotators

Database | Number of projects | Percentage of false-positive projects (in annotated sample)
European Social Innovation Competition | 90 | 12.9% (8/62)
MoPAct | 140 | 7.3% (3/41)
Innovage | 153 | 30% (6/20)
Digital Social Innovation (as of November 2017) | 2,200 | 58% (105/188)
European Investment Bank social innovation tournament | 72 | 6.8% (6/87)
SIMRA | 9 | 0% (0/2)
Annotation challenges
• Disagreement between expert annotators
– Vague concepts
• Human bias towards certain domains
– Technological innovation vs non-technological innovation
– Some terms and buzz words
• Small dataset
• Imbalanced dataset
• Selection bias
– Datasets
– Domains
Machine learning classification
• Text pre-processing (tokenization, stop-words)
• Attempted several algorithms (Naïve Bayes, Decision trees, SVM,
Neural networks)
• Several ensemble methods (AdaBoost, Random Forests, Voting)
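The slides do not include the preprocessing code; a minimal scikit-learn pipeline in the spirit described (tokenization, stop-word removal, one binary Naïve Bayes classifier per criterion) might look like this. The toy texts and labels are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# One such pipeline would be trained per criterion
# (objectives, actors, outputs, innovativeness).
pipeline = Pipeline([
    ("vec", CountVectorizer(stop_words="english", lowercase=True)),  # tokenize + remove stop words
    ("clf", MultinomialNB()),
])

texts = ["New service co-created with citizen groups",
         "Quarterly financial report of the company"]
labels = [1, 0]  # 1 = criterion satisfied

pipeline.fit(texts, labels)
print(pipeline.predict(["A novel platform built with grassroots movements"]))
```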
Classification workflow
First results
• On the imbalanced dataset
• Naïve Bayes performed best
• F1-scores in the range 0.62–0.71
• Without balancing the data
Criterion | TP | FP | FN | Precision | Recall | F1-score
Actors | 62 | 39 | 26 | 0.614 | 0.705 | 0.656
Objective | 81 | 36 | 30 | 0.692 | 0.730 | 0.711
Output | 61 | 42 | 26 | 0.592 | 0.701 | 0.642
Innovativeness | 58 | 42 | 29 | 0.580 | 0.667 | 0.620
Balancing data
• Data were oversampled to increase the number of positive instances
• Evaluated using 10-fold cross-validation
• Models are evaluated on the limited number of domains present
in the training set
• The training set needs to be expanded to cover the whole dataset
Criterion | TP | FP | FN | Precision | Recall | F1-score
Actors | 118 | 25 | 15 | 0.825 | 0.887 | 0.855
Objective | 121 | 25 | 14 | 0.829 | 0.896 | 0.861
Output | 122 | 30 | 11 | 0.803 | 0.917 | 0.856
Innovativeness | 117 | 29 | 16 | 0.801 | 0.880 | 0.839
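The slides do not state which oversampling method was used; as a sketch, random oversampling of the positive class with replacement (one common choice) could be implemented like this:

```python
import numpy as np
from sklearn.utils import resample

def oversample_positive(texts, labels, random_state=42):
    """Duplicate positive instances at random until the classes are balanced."""
    texts, labels = np.array(texts, dtype=object), np.array(labels)
    pos, neg = texts[labels == 1], texts[labels == 0]
    # Sample positives with replacement up to the size of the negative class.
    pos_up = resample(pos, replace=True, n_samples=len(neg),
                      random_state=random_state)
    balanced_texts = np.concatenate([neg, pos_up])
    balanced_labels = np.array([0] * len(neg) + [1] * len(pos_up))
    return balanced_texts, balanced_labels
```

Note that oversampling before cross-validation lets duplicated positives appear in both training and test folds, which can inflate the scores, so the figures above should be read together with the caveat about the limited training set.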
Discussion
• Components are usually more tangible than the
abstract concept
• Definition becomes modular – the user can choose or
create their own definition of social innovation by
combining concept components
• In this case, Naïve Bayes yielded the best results
because it generalizes well on small datasets. In
machine learning there is no silver-bullet solution (the
no-free-lunch theorem).
Future plans
• Connections between projects and organisations
– Named entity recognition (Stanford NER)
– OrgReg, FirmReg, database of social innovators
• Location of projects and actors
– NER
– Creation of lexical rules
• Topics, Funding,...
• http://esid.eu
• http://knowmak.eu
Questions? Thank you!
nikola.milosevic@manchester.ac.uk
http://inspiratron.org
Twitter: @dreadknight011
Data collection: challenges and proposed solutions

Challenge | Proposed solution
Data sources vary in the wealth of information they contain | Assume the wealth of available information indicates the quality of the project
Different formats of data sources (CSV, PDF, web, etc.) | Create crawlers for the different sources and formats
Changes of format / incorrect inputs | Rely on the data reported in the initial data sources and websites
Projects with no websites | Likely not a serious project; handled manually
Expired websites | Retrieve archived copies from WebArchive.org
Limited descriptions / short texts | Assume the descriptions reflect the project
Websites hiding content behind JS (e.g. wix.com) | Crawlers able to handle content hidden behind JS
Relevance of text | Exclude irrelevant pages
Number of pages / size of website | Tune for the content size
Classification measures used
• Precision (positive predictive value) – the number
of correctly predicted positive instances over the
number of instances predicted as positive
• Recall (sensitivity, true positive rate, hit rate)
– the number of correctly predicted positive instances
over the total number of positive instances
• F1-score – the harmonic mean of the two (see the formulas below)
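In the notation of the result tables (TP, FP, FN), these measures are:

\[
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
\]

For example, the Actors row of the first results table gives Precision = 62/(62+39) ≈ 0.614, Recall = 62/(62+26) ≈ 0.705 and F1 ≈ 0.656, matching the reported values.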
Machine learning with n-grams
• Unigrams, bi-grams and trigrams with Naïve Bayes
• Improving stop-word list
• Innovativeness:
[Bar chart: precision, recall and F1-score for Naïve Bayes with unigram, bigram and trigram features and their combinations (NB Unigram, NB Bi-gram, NB Trigram, NB 1+2, NB 2+3, NB 1+2+3); values range between 0.6 and 0.9]
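The n-gram variants in the chart correspond to different feature extraction settings; in scikit-learn terms (a sketch, assuming the CountVectorizer setup used earlier), the six configurations map to:

```python
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(min_n, max_n) selects which n-grams become features.
nb_unigram = CountVectorizer(ngram_range=(1, 1), stop_words="english")  # NB Unigram
nb_bigram  = CountVectorizer(ngram_range=(2, 2), stop_words="english")  # NB Bi-gram
nb_trigram = CountVectorizer(ngram_range=(3, 3), stop_words="english")  # NB Trigram
nb_1_2     = CountVectorizer(ngram_range=(1, 2), stop_words="english")  # NB 1+2
nb_2_3     = CountVectorizer(ngram_range=(2, 3), stop_words="english")  # NB 2+3
nb_1_2_3   = CountVectorizer(ngram_range=(1, 3), stop_words="english")  # NB 1+2+3
```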
Ensemble methods
• AdaBoost with Naïve Bayes
– P: 0.78, R:0.79, F1:0.78
• Random forests
– P: 0.75, R:0.75, F1:0.75
• Voting Naïve Bayes, Multi-layered perceptron and
random forests
– P: 0.83, R:0.82, F1:0.82
• Coverage increased slightly, to 47% on a sample of the data
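A sketch of the voting ensemble described above, with illustrative hyperparameters (the slides do not specify them):

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

voting = make_pipeline(
    CountVectorizer(stop_words="english"),
    VotingClassifier(
        estimators=[
            ("nb", MultinomialNB()),
            ("mlp", MLPClassifier(hidden_layer_sizes=(100,), max_iter=500)),
            ("rf", RandomForestClassifier(n_estimators=100)),
        ],
        voting="soft",  # average the predicted class probabilities
    ),
)
# voting.fit(texts, labels); voting.predict(new_texts)
```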
Collecting data from existing databases
• Some databases can be downloaded
• Some databases can be accessed through their website
– Web crawling – Python+Scrapy
• Some databases are in PDF format
– Case studies
– Manual entry
• Enrichment with data from project websites
• Normalisation
Classification of social innovations
• Can be reduced to a classification problem
• Supervised learning
• Large number of definitions of social innovation
• Components of the definition identified through a literature review
Annotation
• 6 students annotated 277 documents
– Content of project websites
– Annotations at the sentence level and the whole-document level
– About 30% of documents were annotated by several
annotators
– 4 criteria
– brat rapid annotation tool
Annotation results

Inclusion criteria / level of annotation | Paragraph-level annotations | Document-level annotations
Objectives | 37.5% | 76.6%
Actors and Actor interactions | 17.2% | 65.7%
Outputs | 18.9% | 66.73%
Innovativeness | 19.5% | 70.8%

Database | Number of projects | Percentage of false-positive projects
European Social Innovation Competition | 90 | 12.9% (8/62)
MoPAct | 140 | 7.3% (3/41)
Innovage | 153 | 30% (6/20)
Digital Social Innovation | 2,200 | 58% (105/188)
European Investment Bank social innovation tournament | 72 | 6.8% (6/87)
SIMRA | 9 | 0% (0/2)
Machine learning – overview
• 4 classifiers, one per criterion
• Supervised learning on 277 documents
– 500–20,000 words per document
– Unstructured, textual documents
• Experiments:
– Naive Bayes
• Balancing
– Neural networks
Naive Bayes
• Data:
– Objectives – 166 negative, 111 positive instances
– Actors – 189 negative, 88 positive instances
– Outputs – 190 negative, 87 positive instances
– Innovativeness – 190 negative, 87 positive instances
• Unbalanced
• Stop words
• Stemming
Results – Naive Bayes
• We report the positive class – the negative class is similar or better
• Initial results in the range 62–71%
• Main culprit – class imbalance
• Can balancing fix this?

Criterion | TP | FP | FN | Precision | Recall | F1-score
Social Innovation | 44 | 51 | 25 | 0.463 | 0.638 | 0.537
Actors | 62 | 39 | 26 | 0.614 | 0.705 | 0.656
Objective | 81 | 36 | 30 | 0.692 | 0.730 | 0.711
Output | 61 | 42 | 26 | 0.592 | 0.701 | 0.642
Innovativeness | 58 | 42 | 29 | 0.580 | 0.667 | 0.620
Balancing
• We report the positive class
• Classes were balanced to contain a similar number of examples
• Results in the range of 84–86% F1-score

Criterion | TP | FP | FN | Precision | Recall | F1-score
Social innovation | 83 | 15 | 22 | 0.847 | 0.790 | 0.818
Actors | 118 | 25 | 15 | 0.825 | 0.887 | 0.855
Objective | 121 | 25 | 14 | 0.829 | 0.896 | 0.861
Output | 122 | 30 | 11 | 0.803 | 0.917 | 0.856
Innovativeness | 117 | 29 | 16 | 0.801 | 0.880 | 0.839
Neural networks
• Deep learning has recently shown good results on a variety
of tasks
• Including text classification
• Can also be applied to relatively small datasets
– Transfer learning
GloVe word embeddings
• A technique for knowledge transfer and for creating word vectors
• Neural network trained on Wikipedia and around 10 million
news articles
• Models the vectors so that words appearing in the same context
have similar values – semantics
Architecture and results of the neural network with GloVe
• Kim Yoon (2014) – convolutional networks for text
classification
• GloVe as the embedding layer
• Results do not exceed 70%
• Little data – deep learning requires large datasets
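A minimal Keras sketch of such a model, using a single convolution branch rather than Kim's full multi-width architecture; the vocabulary size, dimensions and `embedding_matrix` (which would hold the pre-trained GloVe vectors) are placeholders:

```python
import numpy as np
from tensorflow.keras import initializers, layers, models

vocab_size, embed_dim = 20000, 100                    # placeholder sizes
embedding_matrix = np.zeros((vocab_size, embed_dim))  # would be filled with GloVe vectors

model = models.Sequential([
    layers.Embedding(vocab_size, embed_dim,
                     embeddings_initializer=initializers.Constant(embedding_matrix),
                     trainable=False),                # frozen GloVe embedding layer
    layers.Conv1D(128, 5, activation="relu"),         # convolution over word windows
    layers.GlobalMaxPooling1D(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),            # binary: criterion satisfied or not
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```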
Discussion
• A relatively ill-defined term can be classified once its
definition is broken down into components
• The definition becomes modular – the user can choose or
create their own definition of social innovation by
combining the components
• Deep learning is not the solution to every problem
– Especially when there is not enough data
Future plans
• Connections between organisations and projects
– Named entity recognition (Stanford NER)
– OrgReg, FirmReg, database of social innovators
• Location of projects and organisations
– NER
– Creation of lexical rules
• Topics, Funding,...
• http://esid.eu
• http://knowmak.eu
Classification of projects
• 2 approaches were designed
– Machine learning
– Rule-based
• Machine learning learns from the annotated data
• Rule-based classification uses heuristic rules about the
appearance of words in the description
• Not everything in the source databases is really social innovation
• The goal is to discover more projects
[Diagram: project descriptions are fed to the Project Classifier, which labels each project as satisfying or not satisfying the criteria]
Rule-based classification
• Rules:
– Appearance of the word “innovation”, “innovativeness” or
“novelty”
– Appearance of the words (technology OR product OR process
OR service OR way OR practice OR etc.) NEARBY (new
OR novel OR improved OR better OR alternative OR etc.)
• Coverage is 77%
• Precision 0.55, Recall 0.715, F1 0.62
(measured on training data; see the sketch below)
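A sketch of how such rules might be implemented; the word lists are taken from the slide, while the NEARBY window size and the helper itself are illustrative assumptions:

```python
import re

INNOVATION_TERMS = {"innovation", "innovativeness", "novelty"}
OBJECTS = {"technology", "product", "process", "service", "way", "practice"}
NOVELTY = {"new", "novel", "improved", "better", "alternative"}
WINDOW = 5  # hypothetical NEARBY distance, in tokens

def rule_innovativeness(text):
    tokens = re.findall(r"[a-z]+", text.lower())
    if INNOVATION_TERMS & set(tokens):       # rule 1: direct mention
        return True
    for i, tok in enumerate(tokens):         # rule 2: object word NEARBY a novelty word
        if tok in OBJECTS:
            context = tokens[max(0, i - WINDOW): i + WINDOW + 1]
            if NOVELTY & set(context):
                return True
    return False

print(rule_innovativeness("They developed a new service for elderly care"))  # True
```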