Large Scale Vandalism Detection in
Knowledge Bases
by Alexey Grigorev
ods.ai
About Me
● Software Engineer
● BI Masters @ TU Berlin
● Data Scientist
Knowledge Bases
<subject predicate object>
<ML partOf CS>
https://www.wikidata.org/wiki/Q2539
Vandalism in Knowledge Bases
WSDM Cup 2017: Vandalism Detection
● http://www.wsdm-cup-2017.org/vandalism-detection.html
● Goal of the competition:
○ Predict whether a Wikidata revision should be rolled back or not
● Not a usual “kaggle-like” competition:
○ Code to be executed within a VM
○ Read data from socket, predict, write back
○ Hardware: 1 core, 4 GB of RAM
Large Scale Vandalism Detection in Knowledge Bases: PyData Berlin 2017
Dataset
● Wikidata XML dump with edits
○ From 2012 to 2016, 72 mln revisions
○ 24.8 GB compressed, 400 GB uncompressed
● Very skewed:
○ 127k rollbacks (out of 72 mln)
○ i.e. fraction of positives: 0.0025
Validation
Competition split:
● Train: 2012-10-29 to 2016-02-29, 65 mln (available from start)
● Validation: 2016-03-01 to 2016-04-30, 7.2 mln (available towards end)
● Test: from 2016-05-01, 10.4 mln (not available)
My split:
● Train: 2015-01-01 to 2015-12-31, 27.9 mln
● Validation: 2016-01-01 to 2016-02-29, 9.3 mln
● Test: 2016-03-01 to 2016-04-30, 7.2 mln
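A minimal sketch of a time-based split like the smaller one listed above (2015 for training, January and February 2016 for validation, March and April 2016 for testing). The function name `assign_split` is hypothetical, and the train period is assumed to end on 2015-12-31.

```python
from datetime import date

# Boundaries of the local split; the train end date is an assumption.
SPLITS = [
    ("train", date(2015, 1, 1), date(2015, 12, 31)),
    ("validation", date(2016, 1, 1), date(2016, 2, 29)),
    ("test", date(2016, 3, 1), date(2016, 4, 30)),
]

def assign_split(day):
    """Return the split name for a revision date, or None if it is outside all ranges."""
    for name, start, end in SPLITS:
        if start <= day <= end:
            return name
    return None

split_of_example = assign_split(date(2015, 1, 1))
```

Splitting by time rather than at random keeps future revisions out of the training data, which mirrors how the competition itself evaluates.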
<page>
<title>Q5704066</title>
<ns>0</ns>
<id>5468191</id>
<revision>
<id>185142906</id>
<parentid>185051367</parentid>
<timestamp>2015-01-01</timestamp>
<contributor>
<username>username</username>
<id>52267</id>
</contributor>
<comment>/* wbsetdescription-add:1|es */ futbolista
irlandes</comment>
<model>wikibase-item</model>
<format>application/json</format>
<text xml:space="preserve">{{PAGE_JSON}}</text>
</revision>
</page>
IP if anonymous &
geo info in meta file
Title features
User features
JSON (not used)
Comment features
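A rough sketch of how the title, user, and comment fields could be streamed out of such a dump with the standard library. It assumes the simplified, namespace-free XML shown on the slide, wrapped in a hypothetical `<mediawiki>` root; a real dump may declare an XML namespace, in which case the tag names would need the namespace prefix. `iter_revisions` is an illustrative name, not taken from the original solution.

```python
import io
import xml.etree.ElementTree as ET

# Simplified sample mirroring the slide; the <mediawiki> wrapper is an assumption.
SAMPLE = """<mediawiki>
<page>
  <title>Q5704066</title>
  <ns>0</ns>
  <id>5468191</id>
  <revision>
    <id>185142906</id>
    <parentid>185051367</parentid>
    <timestamp>2015-01-01</timestamp>
    <contributor>
      <username>username</username>
      <id>52267</id>
    </contributor>
    <comment>/* wbsetdescription-add:1|es */ futbolista irlandes</comment>
    <model>wikibase-item</model>
    <format>application/json</format>
    <text xml:space="preserve">{{PAGE_JSON}}</text>
  </revision>
</page>
</mediawiki>"""

def iter_revisions(stream):
    """Yield one dict per <revision>, without loading the whole file into memory."""
    title = None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "title":
            title = elem.text
        elif elem.tag == "revision":
            yield {
                "title": title,
                "revision_id": elem.findtext("id"),
                "username": elem.findtext("contributor/username"),
                "comment": elem.findtext("comment"),
            }
        elif elem.tag == "page":
            elem.clear()  # drop the finished page's children to free memory

revisions = list(iter_revisions(io.BytesIO(SAMPLE.encode("utf-8"))))
```

The JSON inside `<text>` is skipped here, matching the "JSON (not used)" note.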
User Features
● Username if logged in; otherwise “anonymous=True”
● IP: 90.219.230.105 →
○ 90, 90_219, 90_219_230, 90_219_230_105
● Geo information from the meta file:
○ country_code=GB
○ continent_code=EU
○ time_zone=GMT
○ region_code=EN
○ city_name=LEEDS
○ county_name=WEST_YORKSHIRE
● Use One Hot Encoding
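The IP expansion above can be sketched in a few lines. The helper name `ip_prefixes` is hypothetical; the idea is that each prefix becomes its own categorical token, so the model can learn from whole address ranges as well as from exact addresses.

```python
def ip_prefixes(ip):
    """Expand a dotted IPv4 address into cumulative prefixes joined by underscores."""
    parts = ip.split(".")
    return ["_".join(parts[:i]) for i in range(1, len(parts) + 1)]

tokens = ip_prefixes("90.219.230.105")
# tokens == ["90", "90_219", "90_219_230", "90_219_230_105"]
```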
User Features
● OHE: put this together as one string
○ anonymous 90 90_219 90_219_230 90_219_230_105 country_code=GB …
○ username
● Use CountVectorizer from sklearn to get a matrix
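A small sketch of one-hot encoding such strings with `CountVectorizer`. The settings are assumptions, not taken from the slides: `token_pattern=r"\S+"` keeps whole whitespace-separated tokens such as `country_code=GB`, `lowercase=False` preserves case, and `binary=True` gives 0/1 indicators instead of counts. The user name `some_user` is made up.

```python
from sklearn.feature_extraction.text import CountVectorizer

# One string per revision: an anonymous edit and a logged-in edit (toy data).
user_strings = [
    "anonymous 90 90_219 90_219_230 90_219_230_105 country_code=GB",
    "some_user",
]

vectorizer = CountVectorizer(token_pattern=r"\S+", lowercase=False, binary=True)
X_users = vectorizer.fit_transform(user_strings)  # sparse matrix, one column per token
```

With the default token pattern, `country_code=GB` would be split into two tokens and single-character tokens would be dropped, so the pattern matters for this use.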
Comment Features
● “Structured” part: inside /* */, split on “:” and “|”:
○ wbsetdescription-add 1 es
○ wbsetsitelink-add 1 idwiki
○ clientsitelink-remove 1 frwiki
● Wiki-Links: [[Property:P31]], [[Q5]]
● Free text: outside of /* */:
○ futbolista irlandes
○ autolist2
○ origyn web browser
/* wbsetdescription-add:1|es */ futbolista irlandes
/* wbcreateclaim-create:1| */ [[Property:P31]]: [[Q5]], #autolist2
/* wbsetsitelink-add:1|idwiki */ Megaloharpya
/* clientsitelink-remove:1||frwiki */ Origyn Web Browser
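The three parts of a comment could be separated roughly as follows. This is a sketch with hypothetical names (`parse_comment` and the two regular expressions); the original solution may tokenize differently, for example in how it treats punctuation or letter case in the free text.

```python
import re

STRUCTURED = re.compile(r"/\*(.*?)\*/")                      # text inside /* */
WIKI_LINK = re.compile(r"\[\[(?:Property:)?([PQ]\d+)\]\]")   # [[Property:P31]], [[Q5]]

def parse_comment(comment):
    """Split a revision comment into structured tokens, wiki links, and free-text tokens."""
    structured = []
    rest = comment
    match = STRUCTURED.search(comment)
    if match:
        pieces = re.split(r"[:|]", match.group(1))
        structured = [p.strip() for p in pieces if p.strip()]
        rest = comment[:match.start()] + " " + comment[match.end():]
    links = WIKI_LINK.findall(rest)
    free_text = re.findall(r"\w+", WIKI_LINK.sub(" ", rest).lower())
    return structured, links, free_text

parsed = parse_comment("/* wbcreateclaim-create:1| */ [[Property:P31]]: [[Q5]], #autolist2")
# parsed == (["wbcreateclaim-create", "1"], ["P31", "Q5"], ["autolist2"])
```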
Models
● High dimensionality? Sparse? Linear SVM!
● LinearSVC with dual=False, penalty='l1'
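A toy sketch of an L1-penalized linear SVM on a sparse one-hot matrix. The rows, labels, and `C=1.0` are made up for illustration; only `LinearSVC`, `dual=False`, and the L1 penalty come from the slide. In scikit-learn the L1 penalty requires `dual=False`, and it tends to drive many weights to exactly zero, which suits very wide sparse inputs.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Toy feature strings and labels (1 = rolled back); purely illustrative.
rows = [
    "anonymous 90 90_219 wbsetdescription-add es",
    "anonymous 90 90_219 wbsetlabel-set en",
    "anonymous 77 77_10 wbsetdescription-add en",
    "username=bot_a wbcreateclaim-create P31 Q5",
    "username=bot_a wbsetsitelink-add idwiki",
    "username=user_b wbsetdescription-add es",
    "username=user_b wbcreateclaim-create P31 Q5",
    "username=user_c wbsetsitelink-add frwiki",
]
labels = [1, 1, 1, 0, 0, 0, 0, 0]

X = CountVectorizer(token_pattern=r"\S+", lowercase=False, binary=True).fit_transform(rows)
model = LinearSVC(penalty="l1", dual=False, C=1.0).fit(X, labels)

n_used = int(np.count_nonzero(model.coef_))  # features that kept a non-zero weight
scores = model.decision_function(X)          # larger score = more likely vandalism
```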
Ensembling
● Usual “kaggle” way to improve:
● Ensembling and stacking
● Not successful - user model still better on test
Imbalance
● The dataset is highly skewed: 0.0025 positive
● Ways to deal with it:
○ Oversampling and undersampling
● Not helpful
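For reference, random undersampling of the majority class can be sketched as below. The function name `undersample` and the `ratio` parameter are hypothetical; the slides report that this kind of resampling did not help on this dataset.

```python
import random

def undersample(labels, ratio=1.0, seed=0):
    """Keep every positive and a random sample of negatives (ratio = negatives per positive)."""
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(negatives), int(len(positives) * ratio))
    return sorted(positives + rng.sample(negatives, n_keep))

# Toy label vector with the same skew as on the slide: 5 / 2000 = 0.0025.
labels = [1] * 5 + [0] * 1995
kept = undersample(labels, ratio=4.0)  # 5 positives + 20 negatives
```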
Final Model
● Put all features together as one string
● title=Q123 username=user wbsetdescription-add 1 es P31 Q5 …
● OHE?
○ CountVectorizer - dictionary too large
○ HashingVectorizer with 10m columns
Hashing for One Hot Encoding
[Figure: CountVectorizer vs. HashingVectorizer]
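The idea behind hashing can be illustrated without any library: instead of looking a token up in a learned dictionary, its column index is computed directly from a hash of the token. This sketch uses MD5 purely for a deterministic illustration; it is an assumption and not the hash function scikit-learn uses internally. `hashed_columns` is a hypothetical name.

```python
import hashlib

def hashed_columns(text, n_columns=10_000_000):
    """Map each whitespace-separated token to a column index in [0, n_columns)."""
    columns = set()
    for token in text.split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        columns.add(int.from_bytes(digest[:8], "big") % n_columns)
    return sorted(columns)

row = "title=Q123 username=user wbsetdescription-add 1 es P31 Q5"
cols = hashed_columns(row)
```

No dictionary has to be built or stored, at the cost of occasional collisions where two tokens share a column.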
Final Model
● OHE with HashingVectorizer (stateless, no dictionary kept in memory)
● SVM on OHE matrix (300 MB)
● ~0.96 AUC on my test set
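A sketch of how the pieces might fit together: hash the combined feature string, fit a linear SVM, and score with AUC. All rows and labels are toy data, and the vectorizer settings (`token_pattern`, `binary`, `norm`, `alternate_sign`) and `C=1.0` are assumptions. `N_FEATURES` is set to 2**20 to keep the example light; the slides mention 10m columns.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.svm import LinearSVC

N_FEATURES = 2 ** 20  # smaller than the 10m columns on the slide, for a quick run

train_rows = [
    "title=Q1 anonymous 90 90_219 wbsetdescription-add 1 es",
    "title=Q2 anonymous 90 90_219 wbsetlabel-set 1 en",
    "title=Q3 anonymous 77 77_10 wbsetdescription-add 1 en",
    "title=Q4 username=bot_a wbcreateclaim-create 1 P31 Q5",
    "title=Q5 username=bot_a wbsetsitelink-add 1 idwiki",
    "title=Q6 username=user_b wbcreateclaim-create 1 P31 Q5",
]
train_labels = [1, 1, 1, 0, 0, 0]
test_rows = [
    "title=Q7 anonymous 90 90_219 wbsetdescription-add 1 en",
    "title=Q8 username=bot_a wbsetsitelink-add 1 frwiki",
]
test_labels = [1, 0]

# Stateless vectorizer: nothing is fitted, so transform() works on any batch.
vectorizer = HashingVectorizer(
    n_features=N_FEATURES, token_pattern=r"\S+", lowercase=False,
    binary=True, norm=None, alternate_sign=False,
)
X_train = vectorizer.transform(train_rows)
X_test = vectorizer.transform(test_rows)

model = LinearSVC(penalty="l1", dual=False, C=1.0).fit(X_train, train_labels)
scores = model.decision_function(X_test)
auc = roc_auc_score(test_labels, scores)  # AUC needs only a ranking, not probabilities
```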
Final Results
That’s me :-)
Conclusions
Lessons Learned:
● Trust your CV
● Sometimes ensembling does not work
● Prefer simpler models to avoid overfitting
New tools I tried:
● Feather - very fast!
Links and Further Info
● Competition website: http://www.wsdm-cup-2017.org/
● Competition platform: http://tira.io/
● My solution:
○ https://github.com/alexeygrigorev/wsdmcup17-vandalism-detection
What’s next?
● Wikimedia plans to integrate the model
● Not as is, but they found many ideas useful
Shameless Promotion
https://www.packtpub.com/big-data-and-business-intelligence/mastering-java-data-science
Thank you
Questions?
