Machine Learning and Social Good

© 2 0 2 0 S P L U N K I N C .
Machine Learning for Social Good
Dr. Greg Ainslie-Malik – Machine Learning Architect

Forward-Looking Statements
During the course of this presentation, we may make forward-looking statements regarding
future events or plans of the company. We caution you that such statements reflect our
current expectations and estimates based on factors currently known to us and that actual
events or results may differ materially. The forward-looking statements made in this
presentation are being made as of the time and date of its live presentation. If reviewed after
its live presentation, it may not contain current or accurate information. We do not assume
any obligation to update any forward-looking statements made herein.
In addition, any information about our roadmap outlines our general product direction and is
subject to change at any time without notice. It is for informational purposes only, and shall
not be incorporated into any contract or other commitment. Splunk undertakes no obligation
either to develop the features or functionalities described or to include any such feature or
functionality in a future release.
Splunk, Splunk>, Data-to-Everything, D2E, and Turn Data Into Doing are trademarks and registered trademarks of Splunk Inc. in the United States
and other countries. All other brand names, product names, or trademarks belong to their respective owners. © 2020 Splunk Inc. All rights reserved.

Agenda
1. Introduction to Machine Learning
2. Common challenges with Machine Learning
3. Where have we seen Machine Learning used for social good?
• Anomaly detection
• Fraud detection
• Learning analytics
4. What else are we doing to promote good use of Machine Learning?

Introduction to Machine Learning

What is Machine Learning?
[Diagram: Deep Learning nested inside Machine Learning, nested inside Artificial Intelligence (AI)]
• AI is any type of algorithm or programme that allows computers to mimic human behaviour
• ML is a subset of AI that allows machines to improve over time
• Deep Learning is a type of machine learning that is based on neural networks

What is Machine Learning?
Classic Programming: Data + Rules → Outcomes
Machine Learning: Data + Outcomes (supervised only) → Rules
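
The contrast between the two approaches can be sketched in a few lines of Python. This is only an illustration: the counts, labels and the limit of 100 are hypothetical, and `learn_threshold` is a deliberately tiny stand-in for a real supervised learning algorithm.

```python
# Classic programming: a human writes the rule.
def classic_rule(count: int) -> bool:
    """Hand-written rule: flag any hourly count above a fixed limit."""
    return count > 100  # the limit of 100 is an arbitrary, hypothetical choice


# Machine learning (supervised): the rule is derived from data plus outcomes.
def learn_threshold(counts: list[int], labels: list[bool]) -> float:
    """Pick the threshold that classifies the most training examples correctly.

    Candidates are the midpoints between neighbouring distinct values.
    """
    values = sorted(set(counts))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best_threshold, best_correct = candidates[0], -1
    for threshold in candidates:
        correct = sum((c > threshold) == y for c, y in zip(counts, labels))
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


# Hypothetical training data: hourly event counts and whether each was a problem.
counts = [12, 15, 20, 22, 80, 95, 110]
labels = [False, False, False, False, True, True, True]

learned = learn_threshold(counts, labels)
print(learned)  # 51.0
```

With this data the hand-written rule misses the problem hours at 80 and 95, while the learned threshold separates every labelled example.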

Why Use Machine Learning?
Observation from Splunk customers
• Identify anomalies or ‘unknown unknowns’
• Improve alert accuracy
• Highlight weak relationships

How Machine Learning Fits into Splunk
[Diagram labels]
Search: Every Search Can Use Machine Learning
Real Time
Alert: Send an email, File a ticket, Send a text, Flash lights, Trigger process flow
Third-Party Applications, Smartphones and Devices, Tickets, Email
OT, Industrial Assets, IT, Consumer and Mobile Devices, Security

Common Challenges with Machine Learning

Problem Statement
There is a lack of trust in Machine Learning.
This is largely caused by limited transparency or explainability of most Machine Learning processes.
Therefore it can be difficult to identify negative bias when applying Machine Learning.

MOST ORGANIZATIONS’ DATA IS STILL DARK DATA
UNTAPPED, UNANALYSED, UNOWNED
60% of organizations report that the majority of their data is still dark*
*Splunk Inc., “State of Dark Data Report”, May 2019

Our World Never Stops Evolving.
How can we handle the half-life of data?

Use of AI
Globally, 61%–67% saw value in AI for their organizations.
60%–70% of respondents believe that they will be using AI across IT, operations and talent management in the future.
And yet …
Only 10%–15% say their organizations are deploying AI for use cases today.
While only 12% say that AI is currently guiding their business strategy, 61% expect it to do so in the next five years.
Organizations admit they’re not ready for AI.
Their top four concerns:
1. Lack of trained AI experts (81%)
2. Lack of understanding of AI (80%)
3. Not knowing what can be automated (78%)
4. Difficulty successfully wrangling the data (78%)

Do you know what’s happening?
Can you turn data into action?
How do you build for the future?

Key Takeaways
1. Try to gain as much visibility of your data as possible
2. Minimise the delivery time for that data
3. Invest in data skills

Machine Learning for Social Good
Example case studies

Finding Potential Cyber Security Incidents
Identifying anomalies in massive datasets

Use Case: Proxy Communication Investigation Workflow
[Figure: investigation workflow in five numbered steps]
https://conf.splunk.com/files/2019/slides/SEC1374.pdf

Using the DensityFunction to Find Anomalies
| tstats count WHERE (index=botsv2) BY _time span=60m

| eval HourOfDay=strftime(_time, "%H")
| fit DensityFunction count by "HourOfDay" into df_bots_dns
| table _time count IsOutlier(count)

| apply df_bots_dns threshold=0.03
| table _time count IsOutlier(count)
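
The idea behind fitting a density per hour of day and then flagging low-probability counts can be pictured with a short Python sketch. It is a simplified stand-in, not the toolkit's implementation: it assumes a normal distribution for every hour, treats the threshold as the combined area of the two tails, and uses hypothetical training data.

```python
from collections import defaultdict
from statistics import NormalDist, mean, stdev


def fit_density(rows):
    """Fit one normal distribution per hour of day from (hour, count) rows."""
    by_hour = defaultdict(list)
    for hour, count in rows:
        by_hour[hour].append(count)
    # Each hour needs at least two points with some spread for stdev to be usable.
    return {hour: NormalDist(mean(v), stdev(v)) for hour, v in by_hour.items()}


def is_outlier(model, hour, count, threshold=0.03):
    """Flag counts that fall in the two tails whose combined area is `threshold`."""
    dist = model[hour]
    tail = min(dist.cdf(count), 1 - dist.cdf(count))
    return 2 * tail < threshold


# Hypothetical hourly counts: busy at 09:00, quiet at 03:00.
training = [("09", c) for c in (100, 110, 90, 105, 95)] + \
           [("03", c) for c in (10, 12, 8, 11, 9)]
model = fit_density(training)

print(is_outlier(model, "09", 100))  # False: typical for 09:00
print(is_outlier(model, "03", 100))  # True: the same count is extreme at 03:00
```

Splitting by hour is what lets the same count be normal at one time of day and anomalous at another.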

| summary df_bots_dns
• Much bigger standard deviation
• Much higher mean than the other times of day
• None of the times of day have many training points

| apply df_bots_dns threshold=0.003 show_density=true
| where 'IsOutlier(count)'>0
| join HourOfDay [| summary df_bots_dns | table HourOfDay cardinality mean std]
| table _time count ProbabilityDensity(count) cardinality mean std
| eval distance_from_mean=abs(count-mean), deviations_from_mean=abs(count-mean)/std
• Reduce the threshold and include the probability density in the results (apply)
• Filter the data to only show the anomalies (where)
• Join with the summary data to include the cardinality, mean and standard deviation (join)
• Calculate some additional fields using the mean and standard deviation that describe how extreme the outlier is (eval)
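
The join and the two extra fields can be mirrored in Python as a rough sketch. The rows and summary values below are hypothetical; the field names follow the search above, and rows without a matching hour are dropped, as an inner join would do.

```python
def enrich_outliers(outliers, summary):
    """Join each outlier with its hour's summary stats and score how extreme it is."""
    enriched = []
    for row in outliers:
        stats = summary.get(row["HourOfDay"])
        if stats is None:
            continue  # inner join: skip hours that have no summary row
        distance = abs(row["count"] - stats["mean"])
        enriched.append({
            **row,
            **stats,
            "distance_from_mean": distance,
            "deviations_from_mean": distance / stats["std"],
        })
    return enriched


# Hypothetical inputs: two flagged rows, but a summary for only one hour.
outliers = [{"HourOfDay": "09", "count": 130}, {"HourOfDay": "23", "count": 5}]
summary = {"09": {"cardinality": 5, "mean": 100.0, "std": 10.0}}

result = enrich_outliers(outliers, summary)
print(result[0]["deviations_from_mean"])  # 3.0
```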


Identifying Fraud
Finding anomalies in credit card transactions, prescriptions and accesses to patient data

Credit Card Fraud Example
Group Like with Like
• Common for exploring transactional data
• Data is often “batch” loaded
• Often proactively searching for Unknown Unknowns

Enrich the Transactions
• Region change between card transactions?
• Calculate the time delta between card transactions.
• Merchant change between card transactions?
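
As a rough illustration of this enrichment step, the Python sketch below compares each transaction with the previous one on the same card. The field names (`card`, `time`, `region`, `merchant`) and the sample rows are assumptions made for the example, not the fields used in the original data.

```python
from itertools import groupby
from operator import itemgetter


def enrich(transactions):
    """Add time delta and region/merchant change flags per card transaction."""
    ordered = sorted(transactions, key=itemgetter("card", "time"))
    enriched = []
    for _, group in groupby(ordered, key=itemgetter("card")):
        previous = None
        for txn in group:
            row = dict(txn)
            if previous is None:
                # First transaction on a card has nothing to compare against.
                row.update(time_delta=None, region_changed=False, merchant_changed=False)
            else:
                row.update(
                    time_delta=txn["time"] - previous["time"],
                    region_changed=txn["region"] != previous["region"],
                    merchant_changed=txn["merchant"] != previous["merchant"],
                )
            enriched.append(row)
            previous = txn
    return enriched


# Hypothetical transactions; time is in seconds since the start of the batch.
txns = [
    {"card": "A", "time": 0, "region": "UK", "merchant": "shop1"},
    {"card": "B", "time": 100, "region": "US", "merchant": "shop9"},
    {"card": "A", "time": 600, "region": "UK", "merchant": "shop2"},
    {"card": "A", "time": 900, "region": "US", "merchant": "shop2"},
]
enriched = enrich(txns)
```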

Synthesize More Context
• Too quickly between regions?
• Too quickly between merchants?
• Aggregate counts per card
• Average merchant/region changes by number of transactions
• Standard deviation of time delta and amount by averages

Prep for Clustering and Visualization
1. StandardScaler: normalize distribution
2. Principal Component Analysis (PCA): reduce to 3 dimensions

Finally – Cluster with KMeans
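
A minimal pure-Python sketch of scaling followed by k-means clustering is shown below. It makes several simplifying assumptions: the PCA step is left out, the initial centroids are simply the first k points, and the two-column feature rows are hypothetical per-card aggregates.

```python
from math import dist
from statistics import mean, pstdev


def standardize(rows):
    """Scale every column to zero mean and unit variance."""
    columns = list(zip(*rows))
    mus = [mean(c) for c in columns]
    sigmas = [pstdev(c) or 1.0 for c in columns]  # avoid dividing by zero
    return [tuple((x - m) / s for x, m, s in zip(row, mus, sigmas)) for row in rows]


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm with a naive deterministic start (the first k points)."""
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda j: dist(p, centroids[j])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = tuple(mean(col) for col in zip(*members))
    return labels, centroids


# Hypothetical per-card features, e.g. (average time delta, average amount).
rows = [(1.0, 2.0), (9.0, 10.0), (1.2, 1.8), (0.9, 2.1), (9.2, 9.8), (8.8, 10.1)]
scaled = standardize(rows)
labels, centroids = kmeans(scaled, k=2)
print(labels)  # [0, 1, 0, 0, 1, 1]
```

Scaling first matters because k-means relies on distances, so a feature with a large numeric range would otherwise dominate the clusters.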

https://medcitynews.com/2019/02/splunk-and-newyork-presbyterian/
https://www.healthcareitnews.com/news/newyork-presbyterian-working-machine-learning-analytics-combat-opioid-crisis
“At a time when overdose deaths are at crisis levels across the country and in New York City, largely due to the opioid epidemic, healthcare providers have a responsibility to safeguard against any potential diversion of drugs. NewYork-Presbyterian is taking a leading role in protecting the public by implementing highly effective controls to avoid the illegitimate use of controlled substances. Ultimately, we hope that other hospitals benefit from this new platform as well.”
Jennings Aske, senior vice president and chief information security officer at NewYork-Presbyterian

Together, NewYork-Presbyterian and Splunk are also creating an enhanced data analytics solution that investigates unauthorized access to patient records.

Detect the anomaly…
…drill down into that user…

Predicting Student Outcomes
Predicting student grades based on their digital interactions with university IT and identifying students that are at risk of dropping out

What Data Scientists Really Do
Data Preparation accounts for about 80% of the work of data scientists
“Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says”, Forbes Mar 23, 2016

Predicting Student Outcomes
index=oulad code_module=AAA
| eval weighted_score=score*(weight/100)
| eval student_code=id_student."_".code_module."_".code_presentation
| bin _time span=1mon
| stats sum(sum_click) as sum_clicks sum(weighted_score) as month_score avg(score) as average_score by student_code _time
| streamstats sum(month_score) as cumulative_score last(average_score) as last_average count by student_code
| eventstats max(count) as course_length
| eval average_score=if(average_score>0,average_score,if(last_average>0,last_average,0)),
    cumulative_score=if(cumulative_score>0,cumulative_score,0),
    module_perc_complete=count/course_length
| join student_code [| inputlookup student_info.csv | eval student_code=id_student."_".code_module."_".code_presentation | table student_code age_band highest_education imd_band studied_credits final_result]
| table _time student_code sum_clicks average_score cumulative_score module_perc_complete studied_credits age_band highest_education imd_band final_result
| outputlookup oulad_aaa.csv
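
The core of this preparation (weighted scores, monthly totals and a running cumulative score per student) can be sketched in Python as follows. It is a simplified illustration: the input rows are hypothetical, the month is assumed to be a pre-computed number, and the gap filling and demographic join are left out.

```python
from collections import defaultdict


def monthly_features(assessments):
    """Aggregate clicks and scores per student and month, with a running total."""
    monthly = defaultdict(lambda: {"sum_clicks": 0, "month_score": 0.0, "scores": []})
    for a in assessments:
        bucket = monthly[(a["student_code"], a["month"])]
        bucket["sum_clicks"] += a["sum_click"]
        bucket["month_score"] += a["score"] * (a["weight"] / 100)  # weighted score
        bucket["scores"].append(a["score"])

    rows, running = [], defaultdict(float)
    # Sorting by (student, month) lets the running total accumulate in time order.
    for (student, month), b in sorted(monthly.items()):
        running[student] += b["month_score"]
        rows.append({
            "student_code": student,
            "month": month,
            "sum_clicks": b["sum_clicks"],
            "average_score": sum(b["scores"]) / len(b["scores"]),
            "cumulative_score": running[student],
        })
    return rows


# Hypothetical assessment rows for one student on one module presentation.
assessments = [
    {"student_code": "1_AAA_2013J", "month": 1, "score": 80, "weight": 10, "sum_click": 5},
    {"student_code": "1_AAA_2013J", "month": 1, "score": 60, "weight": 20, "sum_click": 3},
    {"student_code": "1_AAA_2013J", "month": 2, "score": 90, "weight": 50, "sum_click": 7},
]
features = monthly_features(assessments)
```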

| eval weighted_score=score*(weight/100),
    student_code=id_student."_".code_module."_".code_presentation
• Calculate a weighted score and create a unique identifier for each student and module combination (eval)
• Calculate the number of clicks, total score and average score for each student in each month (bin, stats)
• Calculate the cumulative score over time for each student and also get the previous average score for each month and create a rolling count (streamstats)
• Fill in empty average and cumulative results and also calculate the module percentage complete (eventstats, eval)

| inputlookup oulad_aaa.csv
| search final_result!="Withdrawn"
| sample partitions=10 seed=42
| search partition_number<7
| fit RandomForestClassifier final_result from average_score cumulative_score module_perc_complete studied_credits sum_clicks age_band highest_education imd_band into rf_oulad_aaa
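
The partition-based split used before training can be pictured with the Python sketch below. It assumes that partitions are numbered from 0 and assigned at random with a fixed seed, and it treats partitions 7 to 9 as the held-out test set, which is an assumption about how the remaining data would be used rather than something stated in the search.

```python
import random


def assign_partitions(rows, partitions=10, seed=42):
    """Give every row a reproducible random partition number in [0, partitions)."""
    rng = random.Random(seed)
    return [{**row, "partition_number": rng.randrange(partitions)} for row in rows]


# Hypothetical student rows; only the identifier matters for the split.
rows = [{"student_code": f"s{i}"} for i in range(1000)]
labelled = assign_partitions(rows)

# Roughly 70% of rows for training, the rest held back for testing.
train = [r for r in labelled if r["partition_number"] < 7]
test = [r for r in labelled if r["partition_number"] >= 7]
```

Fixing the seed means the same rows land in the training set every time, so model results can be compared between runs.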

Train model
| eval withdrawn=if(final_result="Withdrawn","Yes","No")
| fit RandomForestClassifier withdrawn from average_score cumulative_score module_perc_complete studied_credits sum_clicks age_band highest_education imd_band into rf_withdrawn_oulad_aaa
Test model
| eval withdrawn=if(final_result="Withdrawn","Yes","No")
| apply rf_withdrawn_oulad_aaa

UK Government First to Pilot AI Procurement Guidelines Co-Designed with World Economic Forum
https://www.weforum.org/press/2019/09/uk-government-first-to-pilot-ai-procurement-guidelines-co-designed-with-world-economic-forum/
“Splunk has supported the development of these guidelines and worked closely with the WEF and UK Government. We will help pilot them in the UK and believe the guidance will enable Governments across the world transform citizen services and deliver ethically sound and beneficial AI based solutions.”
Lenny Stein, Senior Vice President, Global Affairs, Splunk

Work with the WEF
Intent
Provide information to non-specialists so that they can assess the suitability of ML for a given problem/solution
Current solution
Procurement guidance for ‘unlocking public sector AI’
• High level procurement processes
• Best practices when evaluating an RFP
• Map for creating AI-related RFPs
Unlocking Public Sector AI go-live
Expected in the coming months
