Assurance Scoring: Using Machine Learning and Analytics to Reduce Risk in the Public Sector
Matt Thomson
17/11/2016
2Copyright © Capgemini 2012. All Rights Reserved
Presentation Title | Date
Outline
Introduction
Traditional Fraud Detection
Assurance Scoring
Machine Learning
Business Rules
Anomaly Detection
Graph Links
Who am I?
Matt Thomson
Senior Data Scientist at Capgemini
PhD in Astrophysics (http://arxiv.org/abs/1010.3315)
Several years' experience in fraud detection
Capgemini
Big Data Analytics team
~100 Data Scientists, Big Data Engineers and Data Analysts
Focus on Open Source and Big Data technologies to solve client problems
Sponsor the meetup today!
Introduction to the Problem
The public sector constantly works in an environment of reduced resources
It wants to provide a better service with greater efficiency
It is therefore very important that limited resources are focussed correctly
Assurance Scoring
• Use ML and other analytical methods to identify the least risky people or applications, so that investigators' resources can be targeted on the most risky
Hypothetical Example – 2016 Olympics tickets
Imagine running the application process for selling tickets to the 2016 Olympics
Avoid selling tickets to touts/resellers
• The vast majority of people applying for tickets are genuine
• Fraud detection with a big class imbalance problem (<0.1%)
• Avoid the approach of investigating every person who applies
Let's say we know from the 2012 Olympics which people ended up reselling their tickets: this is the training data
Use ML to identify the 30% (say) least likely to be touts, who are fast-tracked
Investigators focus on the high-risk applications
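A minimal sketch of the fast-track step, assuming a model has already produced a risk score between 0 and 1 for every application. The function name `split_by_risk`, the application IDs and the scores are all hypothetical; the 30% fraction is the illustrative figure from the slide, not a recommended setting.

```python
def split_by_risk(scores, fast_track_fraction=0.30):
    """Split application IDs into fast-tracked and to-be-reviewed groups.

    scores: dict mapping application ID -> risk score (higher = more likely a tout).
    """
    # Rank applications from least to most risky.
    ranked = sorted(scores, key=scores.get)
    # Number of applications that can skip manual investigation.
    n_fast = int(len(ranked) * fast_track_fraction)
    return ranked[:n_fast], ranked[n_fast:]


# Hypothetical scores for ten applications.
scores = {
    "app01": 0.02, "app02": 0.91, "app03": 0.10, "app04": 0.05, "app05": 0.40,
    "app06": 0.33, "app07": 0.01, "app08": 0.75, "app09": 0.20, "app10": 0.60,
}
fast_tracked, to_review = split_by_risk(scores)
print("fast-tracked:", fast_tracked)
print("for investigators:", to_review)
```

Ranking by score rather than applying a fixed cut-off keeps the size of the fast-tracked group predictable, which matters when investigator capacity is the constraint.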
Traditional Fraud Detection
Identify Historical Training Data → Feature Engineering → Model Training and Evaluation → Model Execution → Feedback
Assurance Scoring
Focus on low-risk
Allows resources to be better focussed
Not limited to Machine Learning
Built using Python!
• Pandas, Scikit-learn, etc.
• Scala version using Spark MLlib
POLE ‘Analytical’ Data Layer
Disparate data sources - Atomic Layer
Atomic data is transformed and loaded into POLE
POLE Layer: Person, Object, Location, Event
POLE ‘Analytical’ Data Layer
POLE contains ALL entities from the Atomic Layer, plus their inter-linkages
Machine learning
Framework: structure, flexibility, consistency
Stages: Input Data; Manipulate, Explore Data; Vector Build
Feature extraction and selection: Transform, Selection
Model Building: Model; Training, Validation, Test
Variety of output files: logs, graphics, pickle models, etc.
Testing: unit tests, monitoring tests and integration tests
Machine learning : Feature Engineering
Tools: SQL, Python
Input: Historical Data
Cycle: Transform → Explore → Select
Ask questions, validate; refine features
• Feature Extraction
• Data exploration
• Feature selection
Machine Learning: Model Building
Input: Selected features
Split Datasets: Training, Validation, Test
Build Models, with hyper-parameter tuning
Outputs: Models, Training results, Validation results, Test results
Compare Models
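The split, tune and compare flow on this slide can be sketched with scikit-learn, which the deck names as part of its toolset. This is only an illustration under stated assumptions: the data is synthetic, and the 60/20/20 split, the logistic regression model, the candidate `C` values and the use of ROC AUC as the comparison metric are all choices made for the example rather than anything taken from the talk.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the selected features and fraud labels.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)

# Split datasets: 60% training, 20% validation, 20% test (stratified on the label).
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0
)

# Hyper-parameter tuning: fit one model per candidate value, score it on validation.
models = {}
val_auc = {}
for C in (0.01, 0.1, 1.0, 10.0):
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    models[C] = model
    val_auc[C] = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

# Compare models on validation results, then evaluate the winner once on test.
best_C = max(val_auc, key=val_auc.get)
test_auc = roc_auc_score(y_test, models[best_C].predict_proba(X_test)[:, 1])
print(f"best C={best_C}, validation AUC={val_auc[best_C]:.3f}, test AUC={test_auc:.3f}")
```

Keeping the test set out of the tuning loop means the final test result is an honest estimate of how the chosen model behaves on unseen applications.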
Low risk? High risk? Depends on the classifier’s threshold
• True positives: applications the model correctly classifies as high risk
• True negatives: applications the model correctly classifies as low risk
• False positives: applications the model scores as high risk but are in fact low risk
• False negatives: applications the model scores as low risk but are in fact high risk
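To make the dependence on the threshold concrete, the sketch below counts the four outcomes for the same set of scores at three different thresholds. The scores, the true outcomes and the function name `confusion_counts` are hypothetical; the only assumption about the model is that a higher score means higher risk.

```python
def confusion_counts(scores, is_high_risk, threshold):
    """Count TP, TN, FP, FN when scores >= threshold are classified as high risk."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for score, actual in zip(scores, is_high_risk):
        predicted = score >= threshold
        if predicted and actual:
            counts["TP"] += 1
        elif not predicted and not actual:
            counts["TN"] += 1
        elif predicted and not actual:
            counts["FP"] += 1
        else:
            counts["FN"] += 1
    return counts


# Hypothetical model scores and true outcomes for eight applications.
scores = [0.05, 0.10, 0.30, 0.45, 0.55, 0.70, 0.85, 0.95]
is_high_risk = [False, False, False, True, False, True, True, True]

for threshold in (0.2, 0.5, 0.8):
    print(threshold, confusion_counts(scores, is_high_risk, threshold))
```

Lowering the threshold removes false negatives at the cost of more false positives. For assurance scoring the false negatives are the expensive errors, since they are high-risk applications that would be fast-tracked.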
Business Rules
Fraud identification has often been done using deterministic rules
Look for transactions near a threshold or at the end of the day
Primarily data queries on your feature vector
Olympics example – anyone applying for more than £10,000 of tickets
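Rules like these can be written as simple queries over the feature vector. The pandas sketch below is illustrative only: the column names, the sample rows, the "near threshold" band of £9,500 to £10,000 and the 23:00 cut-off for "end of the day" are assumptions made for the example. Only the £10,000 rule comes from the slide.

```python
import pandas as pd

# Hypothetical feature vector: one row per application.
applications = pd.DataFrame(
    {
        "application_id": ["a1", "a2", "a3", "a4"],
        "total_ticket_value_gbp": [250.0, 12500.0, 9999.0, 10000.0],
        "submitted_hour": [14, 23, 9, 23],
    }
)

# Rule from the slide: anyone applying for more than £10,000 of tickets.
high_value = applications["total_ticket_value_gbp"] > 10_000

# Assumed rules: value just at or under the limit, or submitted late in the day.
near_threshold = applications["total_ticket_value_gbp"].between(9_500, 10_000)
end_of_day = applications["submitted_hour"] >= 23

# Count how many rules each application triggers and keep those with any hit.
applications["rule_hits"] = (
    high_value.astype(int) + near_threshold.astype(int) + end_of_day.astype(int)
)
flagged = applications[applications["rule_hits"] > 0]
print(flagged[["application_id", "rule_hits"]])
```

Counting hits rather than returning a single yes/no lets the rule output be combined with the other components as one more signal.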
Anomaly Detection
Use the training data to create a baseline of applications by postcode (say)
If a particular postcode has a larger than expected number of applications, those cases are pushed into the high-risk bucket
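One way to sketch the postcode baseline uses only the Python standard library. The talk does not say how "larger than expected" is decided, so the test used here (observed count above the expected count plus three times its square root, a Poisson-style rule) is an assumption, as are the function name `flag_anomalous_postcodes` and the sample postcodes.

```python
from collections import Counter
from math import sqrt


def flag_anomalous_postcodes(baseline_postcodes, current_postcodes, n_sigma=3.0):
    """Return postcodes whose current application count is unexpectedly high.

    The baseline gives each postcode's share of historical applications; the
    expected current count is that share times the current total. A postcode
    is flagged when its count exceeds expected + n_sigma * sqrt(expected),
    a Poisson-style assumption made for this sketch.
    """
    baseline = Counter(baseline_postcodes)
    current = Counter(current_postcodes)
    baseline_total = sum(baseline.values())
    current_total = sum(current.values())

    flagged = {}
    for postcode, observed in current.items():
        # Postcodes unseen in the baseline get a floor of one historical application.
        share = max(baseline.get(postcode, 0), 1) / baseline_total
        expected = share * current_total
        if observed > expected + n_sigma * sqrt(expected):
            flagged[postcode] = (observed, expected)
    return flagged


# Hypothetical data: BS1 and BS2 are stable, BS3 jumps from 10% to 30% of applications.
baseline_postcodes = ["BS1"] * 500 + ["BS2"] * 400 + ["BS3"] * 100
current_postcodes = ["BS1"] * 400 + ["BS2"] * 300 + ["BS3"] * 300
flagged = flag_anomalous_postcodes(baseline_postcodes, current_postcodes)
print(flagged)
```

Applications from a flagged postcode would then be routed to the high-risk bucket, whatever the other components say about them.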
Graph Links - Matching
Key part of assurance scoring – bringing data together from disparate sources
Probability of Match: 80%
Attribute        | Data Source 1 | Data Source 2
Name             | Matt Thomson  | Matthew Thosmon
Phone Number     | 07123 456 789 | 07123 456 798
Favourite Sport  | Football      | Cricket
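A rough sketch of how two records like those in the table could be scored, using `difflib.SequenceMatcher` from the Python standard library. This is not the matching method used in the talk, and it is not meant to reproduce the 80% figure: the per-attribute string similarity, the weights and the function names are assumptions chosen to show the idea of combining evidence across attributes.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """String similarity in [0, 1], ignoring case and whitespace."""
    a_clean = "".join(a.lower().split())
    b_clean = "".join(b.lower().split())
    return SequenceMatcher(None, a_clean, b_clean).ratio()


def match_score(record_1, record_2, weights):
    """Weighted average of per-attribute similarities."""
    total_weight = sum(weights.values())
    weighted = sum(
        weight * similarity(record_1[attr], record_2[attr])
        for attr, weight in weights.items()
    )
    return weighted / total_weight


source_1 = {"name": "Matt Thomson", "phone": "07123 456 789", "sport": "Football"}
source_2 = {"name": "Matthew Thosmon", "phone": "07123 456 798", "sport": "Cricket"}

# Assumed weights: identifying attributes count for more than preferences.
weights = {"name": 0.45, "phone": 0.45, "sport": 0.10}
score = match_score(source_1, source_2, weights)
print(f"match score: {score:.2f}")
```

Pairs scoring above a chosen cut-off would become links in the graph, so that applications sharing a likely identity can be assessed together.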
Further Details
Come and find me!
matt.thomson@capgemini.com / @MattGThomson
Assurance Scoring brochure: http://ow.ly/4nbEUI
Blogs:
• Introduction: https://www.capgemini.com/node/1380596
• Integrating multiple techniques: http://bit.ly/24BmszV
• Machine Learning: http://bit.ly/1QTMGnq
• Many more on other topics
We’re Hiring!
Data Science
https://www.uk.capgemini.com/careers/jobs/data-scientist-0
Big Data Engineer
https://www.uk.capgemini.com/careers/jobs/big-data-engineer
matt.thomson@capgemini.com
The information contained in this presentation is proprietary.
© 2012 Capgemini. All rights reserved.
www.capgemini.com
About Capgemini
With more than 120,000 people in 40 countries, Capgemini is one
of the world's foremost providers of consulting, technology and
outsourcing services. The Group reported 2011 global revenues
of EUR 9.7 billion.
Together with its clients, Capgemini creates and delivers
business and technology solutions that fit their needs and drive
the results they want. A deeply multicultural organization,
Capgemini has developed its own way of working, the
Collaborative Business Experience™, and draws on Rightshore®,
its worldwide delivery model.
Rightshore® is a trademark belonging to Capgemini
