This project has received funding from the
European Union’s Horizon 2020 research
and innovation programme under grant
agreement No. 726992.
NLDB 2018
Classification of intangible social innovation
concepts
Nikola Milošević, Abdullah Gok, Goran Nenadić
School of Computer Science, The University of Manchester
Manchester Institute for Innovation Research, Alliance Manchester Business School
The University of Manchester
nikola.milosevic@manchester.ac.uk
http://inspiratron.org
Twitter: @dreadknight011
Context - KNOWMAK
• EU project KNOWMAK – Knowledge in the Making in
the European Society
• Traditional sources of knowledge (publications, patents)
• Untraditional sources of
knowledge creation –
Social innovations
• Actors and outputs
(publications, patents, projects)
European Social Innovation
Database
• Purpose
– Input into the KNOWMAK database, as the part addressing untraditional knowledge creation
– A unique collection of social innovations that can be used independently by researchers and policy makers
• Aim:
– Create the database using NLP, ML and web crawling
– The database should be comprehensive and systematic
Social Innovations
• Development of new ideas, services and products that
aim to solve certain social problems
• There are many definitions
• Overlaps with:
– Open innovation
– Free innovation
– Grassroots innovation
– User innovation
– Conventional innovation
Examples of Social Innovations
Why create a database of social
innovation?
• Social sciences
• Creation of rules and policies
– Social policy experimentation (EU)
• Evaluation of project funding
– Impact
– Scaling
• Societal trends
Current databases
• There are some social innovation databases, but they are:
– Topic specific
– Small (case studies)
– Limited information
– One source of information
• Expert curated (expensive)
• User curated (unreliable)
– They use different definitions
– Contain large number of false positives
Other challenges
• Blurry definitions (disagreement)
• Huge number of areas
• Unstructured data (text)
• Unstandardized vocabulary
• Unstandardized sources
• Large number of variables
• Lack of labelled data
Disentangling Social Innovation Criteria

1. Objectives
The project primarily or exclusively satisfies (often unmet) societal needs, including the needs of particular social groups, or aims at social value creation.
Often no price is involved for the main social beneficiary, or the innovation is provided to the main beneficiary at cost only. However, there may be cases where a price is involved.

2. Actors and Actor Interactions
The project satisfies one or both of the following:
i. Diversity of Actors: the project involves actors who would not normally be involved in innovation as an economic activity, including formal (e.g. NGOs, public sector organisations) and informal organisations (e.g. grassroots movements, citizen groups). This involvement may range from full partnership (i.e. the project is conducted jointly) to consultation (i.e. there is representation from different actors).
ii. Social Actor Interactions: the project creates collaborations between “social actors”, small and large businesses and the public sector in different combinations. These collaborations usually involve (predominantly new types of) social interactions towards achieving common goals such as user/community participation. Often, projects aim at significantly different action and diffusion processes that will result in social progress. Social innovation projects often rely on trust relationships rather than solely on mutual benefit.

3. Outputs and Outcomes
The project primarily or exclusively creates socially oriented outputs/outcomes. Often these go beyond the outputs of conventional innovative activity (e.g. products, services, new technologies, patents and publications), although conventional outputs/outcomes may also be present. These outputs/outcomes are often intangible and may include, but are not limited to:
– change in the attitudes, behaviours and perceptions of the actors involved and/or beneficiaries
– social technologies (i.e. new configurations of social practices, including new routines, ways of doing things, laws, rules or norms)
– long-term institutional/cultural change

4. Innovativeness
There should be a form of “implementation of a new or significantly improved product (good or service), or process, a new marketing method, or a new organisational method”.
The project needs to include some form of innovative activity (i.e. scientific, technological, organisational, financial and commercial steps intended to lead to the implementation of the innovation in question). Innovation can be technological (involving the use or creation of technologies) as well as non-technological.
The innovation should be at least “new” to the beneficiaries it targets (it does not have to be new to the world).
Architecture of ESID
ESID Workflow
Main methodology parts
Data collection/Crawling
Project classification
Information extraction
Data collection
• Identify a seed group of projects
– Data sources containing social innovation projects/actors
• Obtain additional unstructured and structured
information
• Input for later processing
• Data enrichment
Infrastructure
Data collection
• 46 data sources
– Downloadable
– Web crawling
• Specific crawler for
each data source
– 3451 projects
– 6092 actors
• Web crawling of websites reported in data sources
– Initially crawled whole domains – 1.715M documents,
194.3GB of data
– Later crawled only the referenced page when a specific
page was given, and the whole domain otherwise
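The slides do not show the crawler code itself; as a minimal sketch, a per-source Scrapy spider (the deck elsewhere mentions Python+Scrapy) could look like the following. The start URL, CSS selectors and item fields are hypothetical placeholders, not the actual ESID crawlers.

```python
import scrapy


class ProjectSpider(scrapy.Spider):
    """Illustrative spider for one of the 46 data sources."""
    name = "example_si_source"
    start_urls = ["https://example.org/projects"]  # hypothetical seed URL

    def parse(self, response):
        # Follow each project listed on the index page to its detail page.
        for href in response.css("a.project-link::attr(href)").getall():
            yield response.follow(href, callback=self.parse_project)

    def parse_project(self, response):
        # Fields used later for classification and enrichment.
        yield {
            "title": response.css("h1::text").get(),
            "description": " ".join(response.css("div.description ::text").getall()),
            "website": response.css("a.external::attr(href)").get(),
            "source_url": response.url,
        }
```

Each of the 46 sources would get its own spider of this shape, tuned to that source's page structure.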
Data collection
• 3451 projects
– 2043 have active websites
– 448 inactive websites (404 or server timeout)
• 314 available through WebArchive
– 1072 missing websites
– 1891 have some description from source database
– 577 (30%) have description longer than 500 characters
– 148 from missing website list have decent description
Main methodology parts
Data collection/Crawling
Project classification
Information extraction
Classifying social innovation
• Defined 4 criteria for social innovation:
– Social objectives
– Actors and actor interactions
– Outputs
– Innovativeness
• No previously available annotated data
• Performed annotation workshops
Annotation process
• Two-three annotators
• Qualified annotators (PhD students, experts from ZSI)
• Extensive annotation guidelines developed for the purpose
• Organised 2 workshops
– To ensure common understanding
– Workshop 1 – 6 people
– Workshop 2 – 4 people, mainly for IAA (40 docs each)
Annotation results
• Inter-annotator agreement - a measure of how well two (or more)
annotators can make the same annotation decision for a certain
category.
• Low agreement between human annotators
• Annotations were made binary (from an initial 3-point scale)
• Sentence- and paragraph-level annotation agreement was below
40%, and in most cases below 20%
Inclusion criteria / level of annotation | Paragraph-level annotations | Document-level annotations
Objectives | 37.5% | 76.60%
Actors and Actor interactions | 17.2% | 65.70%
Outputs | 18.9% | 66.73%
Innovativeness | 19.5% | 70.80%
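The agreement figures above are percentages of matching decisions; a simple pairwise percent-agreement computation, shown here as a sketch rather than the project's actual evaluation script, could be:

```python
from itertools import combinations

def percent_agreement(annotators):
    """annotators: list of dicts mapping document id -> binary label."""
    matches = total = 0
    for a, b in combinations(annotators, 2):
        shared = set(a) & set(b)  # documents labelled by both annotators
        matches += sum(a[doc] == b[doc] for doc in shared)
        total += len(shared)
    return matches / total if total else 0.0

# Hypothetical example: two annotators, three shared documents.
ann1 = {"doc1": 1, "doc2": 0, "doc3": 1}
ann2 = {"doc1": 1, "doc2": 1, "doc3": 1}
print(percent_agreement([ann1, ann2]))  # 0.67
```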
Annotation data
• 277 documents
– Objectives – 166 negative, 111 positive instances
– Actors – 189 negative, 88 positive instances
– Outputs – 190 negative, 87 positive instances
– Innovativeness – 190 negative, 87 positive instances
Annotation results
• Annotated data from October/November 2017
• Almost all databases contain some degree of false positives, according
to the annotators

Database | Number of projects | Percentage of false-positive projects (in annotated sample)
European Social Innovation Competition | 90 | 12.9% (8/62)
MoPAct | 140 | 7.3% (3/41)
Innovage | 153 | 30% (6/20)
Digital Social Innovation (as of November 2017) | 2,200 | 58% (105/188)
European Investment Bank social innovation tournament | 72 | 6.8% (6/87)
SIMRA | 9 | 0% (0/2)
Annotation challenges
• Disagreement between expert annotators
– Vague concepts
• Human bias towards certain domains
– Technological innovation vs non-technological innovation
– Some terms and buzz words
• Small dataset
• Imbalanced dataset
• Selection bias
– Datasets
– Domains
Machine learning classification
• Text pre-processing (tokenization, stop-words)
• Attempted several algorithms (Naïve Bayes, Decision trees, SVM,
Neural networks)
• Several ensemble methods (AdaBoost, Random Forests, Voting)
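The slides do not include the preprocessing code; a minimal scikit-learn pipeline in the spirit described (tokenization, stop-word removal, one binary Naïve Bayes classifier per criterion) might look like this. The toy texts and labels are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# One such pipeline would be trained per criterion
# (objectives, actors, outputs, innovativeness).
pipeline = Pipeline([
    ("vec", CountVectorizer(stop_words="english", lowercase=True)),  # tokenize + remove stop words
    ("clf", MultinomialNB()),
])

texts = ["New service co-created with citizen groups",
         "Quarterly financial report of the company"]
labels = [1, 0]  # 1 = criterion satisfied

pipeline.fit(texts, labels)
print(pipeline.predict(["A novel platform built with grassroots movements"]))
```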
Classification workflow
First results
• On the imbalanced dataset
• Naïve Bayes performed best
• F1-scores in the range 0.62–0.71
• Without balancing the data
Criterion | TP | FP | FN | Precision | Recall | F1-score
Actors | 62 | 39 | 26 | 0.614 | 0.705 | 0.656
Objective | 81 | 36 | 30 | 0.692 | 0.730 | 0.711
Output | 61 | 42 | 26 | 0.592 | 0.701 | 0.642
Innovativeness | 58 | 42 | 29 | 0.580 | 0.667 | 0.620
Balancing data
• Data were oversampled to increase the number of positive instances
• Evaluated using 10-fold cross-validation
• Models are evaluated on the limited number of domains present
in the training set
• The training set needs to be expanded to cover the whole dataset
Criterion | TP | FP | FN | Precision | Recall | F1-score
Actors | 118 | 25 | 15 | 0.825 | 0.887 | 0.855
Objective | 121 | 25 | 14 | 0.829 | 0.896 | 0.861
Output | 122 | 30 | 11 | 0.803 | 0.917 | 0.856
Innovativeness | 117 | 29 | 16 | 0.801 | 0.880 | 0.839
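The slides do not state which oversampling method was used; as a sketch, random oversampling of the positive class with replacement (one common choice) could be implemented like this:

```python
import numpy as np
from sklearn.utils import resample

def oversample_positive(texts, labels, random_state=42):
    """Duplicate positive instances at random until the classes are balanced."""
    texts, labels = np.array(texts, dtype=object), np.array(labels)
    pos, neg = texts[labels == 1], texts[labels == 0]
    # Sample positives with replacement up to the size of the negative class.
    pos_up = resample(pos, replace=True, n_samples=len(neg),
                      random_state=random_state)
    balanced_texts = np.concatenate([neg, pos_up])
    balanced_labels = np.array([0] * len(neg) + [1] * len(pos_up))
    return balanced_texts, balanced_labels
```

Note that oversampling before cross-validation lets duplicated positives appear in both training and test folds, which can inflate the scores, so the figures above should be read together with the caveat about the limited training set.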
Discussion
• Components are usually more tangible than the
abstract concept
• Definition becomes modular – the user can choose or
create their own definition of social innovation by
combining concept components
• In this case, Naïve Bayes yielded the best results
because it generalizes well on small datasets. In
machine learning there is no silver-bullet solution (the
no-free-lunch theorem).
Future plans
• Connections between projects and organisations
– Named entity recognition (Stanford NER)
– OrgReg, FirmReg, database of social innovators
• Location of projects and actors
– NER
– Creation of lexical rules
• Topics, Funding,...
• http://esid.eu
• http://knowmak.eu
Questions? Thank you!
nikola.milosevic@manchester.ac.uk
http://inspiratron.org
Twitter: @dreadknight011
Data collection: challenges and proposed solutions

Challenge | Proposed solution
Data sources vary in the wealth of information they contain | Assume the wealth of available information indicates the quality of the project
Different formats of data sources (CSV, PDF, web, etc.) | Create crawlers for the different sources and formats
Changes of format / incorrect inputs | Rely on the data reported in the initial data sources and websites
Projects with no websites | Likely not a serious project; handled manually
Expired websites | Retrieve archived copies from WebArchive.org
Limited descriptions / short texts | Assume the descriptions reflect the project
Websites hiding content behind JS (e.g. wix.com) | Crawlers able to handle content hidden behind JS
Relevance of text | Exclude irrelevant pages
Number of pages / size of website | Tune for the content size
Classification measures used
• Precision (positive predictive value) – the number
of correctly predicted positive instances over the
number of instances predicted as positive
• Recall (sensitivity, true positive rate, hit rate)
– the number of correctly predicted positive instances
over the total number of positive instances
• F1-score – the harmonic mean of the two (see the formulas below)
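In the notation of the result tables (TP, FP, FN), these measures are:

\[
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
\]

For example, the Actors row of the first results table gives Precision = 62/(62+39) ≈ 0.614, Recall = 62/(62+26) ≈ 0.705 and F1 ≈ 0.656, matching the reported values.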
Machine learning with n-grams
• Unigrams, bi-grams and trigrams with Naïve Bayes
• Improving stop-word list
• Innovativeness:
[Bar chart: precision, recall and F1-score for Naïve Bayes with unigram, bigram and trigram features and their combinations (NB Unigram, NB Bi-gram, NB Trigram, NB 1+2, NB 2+3, NB 1+2+3); values range between 0.6 and 0.9]
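The n-gram variants in the chart correspond to different feature extraction settings; in scikit-learn terms (a sketch, assuming the CountVectorizer setup used earlier), the six configurations map to:

```python
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(min_n, max_n) selects which n-grams become features.
nb_unigram = CountVectorizer(ngram_range=(1, 1), stop_words="english")  # NB Unigram
nb_bigram  = CountVectorizer(ngram_range=(2, 2), stop_words="english")  # NB Bi-gram
nb_trigram = CountVectorizer(ngram_range=(3, 3), stop_words="english")  # NB Trigram
nb_1_2     = CountVectorizer(ngram_range=(1, 2), stop_words="english")  # NB 1+2
nb_2_3     = CountVectorizer(ngram_range=(2, 3), stop_words="english")  # NB 2+3
nb_1_2_3   = CountVectorizer(ngram_range=(1, 3), stop_words="english")  # NB 1+2+3
```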
Ensemble methods
• AdaBoost with Naïve Bayes
– P: 0.78, R:0.79, F1:0.78
• Random forests
– P: 0.75, R:0.75, F1:0.75
• Voting Naïve Bayes, Multi-layered perceptron and
random forests
– P: 0.83, R:0.82, F1:0.82
• Coverage increased slightly, to 47% on a sample of the data
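A sketch of the voting ensemble described above, with illustrative hyperparameters (the slides do not specify them):

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

voting = make_pipeline(
    CountVectorizer(stop_words="english"),
    VotingClassifier(
        estimators=[
            ("nb", MultinomialNB()),
            ("mlp", MLPClassifier(hidden_layer_sizes=(100,), max_iter=500)),
            ("rf", RandomForestClassifier(n_estimators=100)),
        ],
        voting="soft",  # average the predicted class probabilities
    ),
)
# voting.fit(texts, labels); voting.predict(new_texts)
```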
Collecting data from existing databases
• Some databases can be downloaded
• Some databases can be accessed through their website
– Web crawling – Python+Scrapy
• Some databases are in PDF format
– Case studies
– Manual entry
• Enrichment with data from project websites
• Normalisation
Classification of social innovations
• Can be reduced to a classification problem
• Supervised learning
• Large number of definitions of social innovation
• Components of the definition identified through a literature review
Annotation
• 6 students annotated 277 documents
– Content of project websites
– Annotations at the sentence level and the whole-document level
– About 30% of documents were annotated by several
annotators
– 4 criteria
– brat rapid annotation tool
Annotation results

Inclusion criteria / level of annotation | Paragraph-level annotations | Document-level annotations
Objectives | 37.5% | 76.6%
Actors and Actor interactions | 17.2% | 65.7%
Outputs | 18.9% | 66.73%
Innovativeness | 19.5% | 70.8%

Database | Number of projects | Percentage of false-positive projects
European Social Innovation Competition | 90 | 12.9% (8/62)
MoPAct | 140 | 7.3% (3/41)
Innovage | 153 | 30% (6/20)
Digital Social Innovation | 2,200 | 58% (105/188)
European Investment Bank social innovation tournament | 72 | 6.8% (6/87)
SIMRA | 9 | 0% (0/2)
Machine learning – overview
• 4 classifiers, one per criterion
• Supervised learning on 277 documents
– 500–20,000 words per document
– Unstructured, textual documents
• Experiments:
– Naive Bayes
• Balancing
– Neural networks
Naive Bayes
• Data:
– Objectives – 166 negative, 111 positive instances
– Actors – 189 negative, 88 positive instances
– Outputs – 190 negative, 87 positive instances
– Innovativeness – 190 negative, 87 positive instances
• Unbalanced
• Stop words
• Stemming
Results – Naive Bayes
• We report the positive class – the negative class is similar or better
• Initial results in the range 62–71%
• Main culprit – class imbalance
• Can balancing fix this?

Criterion | TP | FP | FN | Precision | Recall | F1-score
Social Innovation | 44 | 51 | 25 | 0.463 | 0.638 | 0.537
Actors | 62 | 39 | 26 | 0.614 | 0.705 | 0.656
Objective | 81 | 36 | 30 | 0.692 | 0.730 | 0.711
Output | 61 | 42 | 26 | 0.592 | 0.701 | 0.642
Innovativeness | 58 | 42 | 29 | 0.580 | 0.667 | 0.620
Balancing
• We report the positive class
• Classes were balanced to contain a similar number of examples
• Results in the range of 84–86% F1-score

Criterion | TP | FP | FN | Precision | Recall | F1-score
Social innovation | 83 | 15 | 22 | 0.847 | 0.790 | 0.818
Actors | 118 | 25 | 15 | 0.825 | 0.887 | 0.855
Objective | 121 | 25 | 14 | 0.829 | 0.896 | 0.861
Output | 122 | 30 | 11 | 0.803 | 0.917 | 0.856
Innovativeness | 117 | 29 | 16 | 0.801 | 0.880 | 0.839
Neural networks
• Deep learning has recently shown good results on a variety
of tasks
• Including text classification
• Can also be applied to relatively small datasets
– Transfer learning
GloVe word embeddings
• A technique for knowledge transfer and for creating word vectors
• Neural network trained on Wikipedia and around 10 million
news articles
• Models the vectors so that words appearing in the same context
have similar values – semantics
Architecture and results of the neural network with GloVe
• Kim Yoon (2014) – convolutional networks for text
classification
• GloVe as the embedding layer
• Results do not exceed 70%
• Little data – deep learning requires large datasets
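A minimal Keras sketch of such a model, using a single convolution branch rather than Kim's full multi-width architecture; the vocabulary size, dimensions and `embedding_matrix` (which would hold the pre-trained GloVe vectors) are placeholders:

```python
import numpy as np
from tensorflow.keras import initializers, layers, models

vocab_size, embed_dim = 20000, 100                    # placeholder sizes
embedding_matrix = np.zeros((vocab_size, embed_dim))  # would be filled with GloVe vectors

model = models.Sequential([
    layers.Embedding(vocab_size, embed_dim,
                     embeddings_initializer=initializers.Constant(embedding_matrix),
                     trainable=False),                # frozen GloVe embedding layer
    layers.Conv1D(128, 5, activation="relu"),         # convolution over word windows
    layers.GlobalMaxPooling1D(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),            # binary: criterion satisfied or not
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```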
Discussion
• A relatively ill-defined term can be classified once its
definition is broken down into components
• The definition becomes modular – the user can choose or
create their own definition of social innovation by
combining the components
• Deep learning is not the solution to every problem
– Especially when there is not enough data
Future plans
• Connections between organisations and projects
– Named entity recognition (Stanford NER)
– OrgReg, FirmReg, database of social innovators
• Location of projects and organisations
– NER
– Creation of lexical rules
• Topics, Funding,...
• http://esid.eu
• http://knowmak.eu
Classification of projects
• 2 approaches were designed
– Machine learning
– Rule-based
• Machine learning learns from the annotated data
• Rule-based classification uses heuristic rules about the
appearance of words in the description
• Not everything in the source databases is really social innovation
• The goal is to discover more projects
[Diagram: project descriptions are fed to the Project Classifier, which labels each project as satisfying or not satisfying the criteria]
Rule-based classification
• Rules:
– Appearance of the word “innovation”, “innovativeness” or
“novelty”
– Appearance of the words (technology OR product OR process
OR service OR way OR practice OR etc.) NEARBY (new
OR novel OR improved OR better OR alternative OR etc.)
• Coverage is 77%
• Precision 0.55, Recall 0.715, F1 0.62
(measured on training data; see the sketch below)
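A sketch of how such rules might be implemented; the word lists are taken from the slide, while the NEARBY window size and the helper itself are illustrative assumptions:

```python
import re

INNOVATION_TERMS = {"innovation", "innovativeness", "novelty"}
OBJECTS = {"technology", "product", "process", "service", "way", "practice"}
NOVELTY = {"new", "novel", "improved", "better", "alternative"}
WINDOW = 5  # hypothetical NEARBY distance, in tokens

def rule_innovativeness(text):
    tokens = re.findall(r"[a-z]+", text.lower())
    if INNOVATION_TERMS & set(tokens):       # rule 1: direct mention
        return True
    for i, tok in enumerate(tokens):         # rule 2: object word NEARBY a novelty word
        if tok in OBJECTS:
            context = tokens[max(0, i - WINDOW): i + WINDOW + 1]
            if NOVELTY & set(context):
                return True
    return False

print(rule_innovativeness("They developed a new service for elderly care"))  # True
```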