© 2020 Splunk Inc.
Machine Learning for Social Good
Dr. Greg Ainslie-Malik – Machine Learning Architect
During the course of this presentation, we may make forward‐looking statements regarding
future events or plans of the company. We caution you that such statements reflect our
current expectations and estimates based on factors currently known to us and that actual
events or results may differ materially. The forward-looking statements made in this
presentation are being made as of the time and date of its live presentation. If reviewed after
its live presentation, it may not contain current or accurate information. We do not assume
any obligation to update any forward‐looking statements made herein.
In addition, any information about our roadmap outlines our general product direction and is
subject to change at any time without notice. It is for informational purposes only, and shall
not be incorporated into any contract or other commitment. Splunk undertakes no obligation
either to develop the features or functionalities described or to include any such feature or
functionality in a future release.
Splunk, Splunk>, Data-to-Everything, D2E, and Turn Data Into Doing are trademarks and registered trademarks of Splunk Inc. in the United States
and other countries. All other brand names, product names, or trademarks belong to their respective owners. © 2020 Splunk Inc. All rights reserved.
Forward-Looking Statements
Agenda
1. Introduction to Machine Learning
2. Common challenges with Machine Learning
3. Where have we seen Machine Learning used for social good?
• Anomaly detection
• Fraud detection
• Learning analytics
4. What else are we doing to promote good use of Machine Learning?
Introduction to Machine Learning
What is Machine Learning?
Diagram labels: Artificial Intelligence (AI), Machine Learning, Deep Learning
• AI refers to any type of algorithm or programme that allows computers to mimic human behaviour
• ML is a subset of AI that allows machines to improve over time by learning from data
• Deep Learning is a type of machine learning that is based on neural networks
Classic Programming: Data + Rules → Outcomes
Machine Learning: Data + Outcomes (supervised only) → Rules
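The contrast between the two approaches can be sketched in a few lines of Python. This is a minimal illustration only: the transaction amounts, the hand-picked limit of 1000 and the "midpoint between class means" learning step are all assumptions made for the example, not anything taken from the deck.

```python
from statistics import mean


def classic_rule(amount: float) -> bool:
    """Classic programming: a person writes the rule by hand."""
    return amount > 1000.0  # hypothetical, hand-picked threshold


def learn_threshold(amounts, outcomes) -> float:
    """Supervised learning in miniature: derive the rule from data + outcomes.

    The learned rule is the midpoint between the mean of the flagged
    examples and the mean of the normal examples.
    """
    flagged = [a for a, o in zip(amounts, outcomes) if o]
    normal = [a for a, o in zip(amounts, outcomes) if not o]
    return (mean(flagged) + mean(normal)) / 2


# Hypothetical labelled data: amounts and whether each was flagged.
amounts = [20.0, 35.0, 50.0, 900.0, 1200.0, 1500.0]
outcomes = [False, False, False, True, True, True]
threshold = learn_threshold(amounts, outcomes)


def learned_rule(amount: float) -> bool:
    """Apply the rule that was derived from the labelled data."""
    return amount > threshold
```

With this data the hand-written rule misses the 900 transaction, while the learned rule flags it, because its threshold came from the labelled outcomes rather than from a guess.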
Why Use Machine Learning?
Observation from Splunk customers
• Identify anomalies or ‘unknown unknowns’
• Improve alert accuracy
• Highlight weak relationships
How Machine Learning Fits into Splunk
Every Search Can Use Machine Learning
Diagram labels: OT, Industrial Assets, IT, Consumer and Mobile Devices, Security; Search, Real Time, Alert; Send an email, File a ticket, Send a text, Flash lights, Trigger process flow; Email, Tickets, Smartphones and Devices, Third-Party Applications
Common Challenges with Machine Learning
Problem Statement
There is a lack of trust in
Machine Learning.
This is largely caused by limited transparency or
explainability of most Machine Learning processes.
Therefore it can be difficult to identify
negative bias when applying Machine Learning.
MOST ORGANIZATIONS’ DATA IS STILL DARK DATA
UNTAPPED, UNANALYSED, UNOWNED
60% of organizations report that the majority of their data is still dark*
*Splunk Inc., “State of Dark Data Report”, May 2019
Our World Never Stops Evolving.
How can we handle the half-life of data?
Use of AI
Globally, 61%–67% saw value in AI for their organizations.
60%–70% of respondents believe that they will be using AI across IT, operations and talent management in the future.
And yet …
Only 10%–15% say their organizations are deploying AI for use cases today.
While only 12% say that AI is currently guiding their business strategy, 61% expect it to do so in the next five years.
Organizations admit they’re not ready for AI. Their top four concerns:
1. Lack of trained AI experts
2. Lack of understanding of AI
3. Not knowing what can be automated
4. Difficulty successfully wrangling the data
Chart values shown alongside the concerns, in slide order: 81%, 80%, 78%, 78%
Do you know what’s happening?
Can you turn data into action?
How do you build for the future?
Key Takeaways
1. Try to gain as much visibility of your data as possible
2. Minimise the delivery time for that data
3. Invest in data skills
Machine Learning for Social Good
Example case studies
Finding Potential Cyber Security Incidents
Identifying anomalies in massive datasets
Use Case: Proxy Communication Investigation Workflow
https://conf.splunk.com/files/2019/slides/SEC1374.pdf
Using the DensityFunction to Find Anomalies
Fit a DensityFunction model to the hourly event counts, grouped by hour of day:
| tstats count WHERE (index=botsv2) BY _time span=60m
| eval HourOfDay=strftime(_time, "%H")
| fit DensityFunction count by "HourOfDay" into df_bots_dns
| table _time count IsOutlier(count)
Apply the saved model with an outlier threshold:
| tstats count WHERE (index=botsv2) BY _time span=60m
| eval HourOfDay=strftime(_time, "%H")
| apply df_bots_dns threshold=0.03
| table _time count IsOutlier(count)
Inspect the fitted distributions:
| summary df_bots_dns
What the summary shows:
• At certain times of day, the mean is much higher and the standard deviation much bigger than at the other times of day
• None of the times of day have many training points
Refine the search to describe each anomaly:
| tstats count WHERE (index=botsv2) BY _time span=60m
| eval HourOfDay=strftime(_time, "%H")
| apply df_bots_dns threshold=0.003 show_density=true
| where 'IsOutlier(count)'>0
| join HourOfDay [| summary df_bots_dns | table HourOfDay cardinality mean std]
| table _time count ProbabilityDensity(count) cardinality mean std
| eval distance_from_mean=abs(count-mean), deviations_from_mean=abs(count-mean)/std
1. apply: Reduce the threshold and include the probability density in the results
2. where: Filter the data to only show the anomalies
3. join: Join with the summary data to include the cardinality, mean and standard deviation
4. eval: Calculate some additional fields using the mean and standard deviation that describe how extreme the outlier is
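The two descriptive fields from the final eval, distance_from_mean and deviations_from_mean, can be reproduced outside Splunk. The Python sketch below is a simplification under stated assumptions: it summarises each hour of day with a plain mean and population standard deviation, and does not reproduce how DensityFunction fits distributions or applies its threshold. The function names and the sample counts are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def fit_hourly_stats(samples):
    """Group (hour_of_day, count) pairs and summarise each hour.

    Returns {hour: {"mean": ..., "std": ..., "cardinality": ...}}.
    """
    by_hour = defaultdict(list)
    for hour, count in samples:
        by_hour[hour].append(count)
    return {
        hour: {"mean": mean(counts), "std": pstdev(counts), "cardinality": len(counts)}
        for hour, counts in by_hour.items()
    }


def describe_outlier(hour, count, stats):
    """Compute the same two descriptive fields as the final eval in the search."""
    s = stats[hour]
    distance = abs(count - s["mean"])
    # A zero standard deviation (identical training counts) cannot be divided by.
    deviations = distance / s["std"] if s["std"] > 0 else float("inf")
    return {"distance_from_mean": distance, "deviations_from_mean": deviations}


# Hypothetical training data: (HourOfDay, hourly event count).
training = [("09", 100), ("09", 110), ("09", 90), ("09", 100), ("03", 5), ("03", 7)]
stats = fit_hourly_stats(training)
result = describe_outlier("09", 150, stats)
```

The cardinality matters when reading the result: with only a handful of training points per hour, as in the summary above, a large deviations_from_mean value is weaker evidence than it would be with a well-populated hour.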
Identifying Fraud
Finding anomalies in credit card transactions, prescriptions and accesses to patient data
Credit Card Fraud Example
Group Like with Like
• Common for exploring transactional data
• Data is often “batch” loaded
• Often proactively searching for Unknown Unknowns
Enrich the Transactions
• Region change between card transactions?
• Calculate the time delta between card transactions
• Merchant change between card transactions?
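The enrichment step compares each transaction with the previous one on the same card. The following Python sketch shows one way this could look; the Txn record, its field names and the sample transactions are assumptions for illustration, not the schema used in the original example.

```python
from dataclasses import dataclass


@dataclass
class Txn:
    card: str
    time: int  # seconds since some reference point (hypothetical unit)
    region: str
    merchant: str


def enrich(txns):
    """Add per-card deltas relative to the previous transaction on that card."""
    last_seen = {}
    enriched = []
    # Sort so that each card's transactions are processed in time order.
    for txn in sorted(txns, key=lambda t: (t.card, t.time)):
        prev = last_seen.get(txn.card)
        enriched.append({
            "card": txn.card,
            "time": txn.time,
            "time_delta": None if prev is None else txn.time - prev.time,
            "region_change": prev is not None and txn.region != prev.region,
            "merchant_change": prev is not None and txn.merchant != prev.merchant,
        })
        last_seen[txn.card] = txn
    return enriched


rows = enrich([
    Txn("A", 0, "UK", "m1"),
    Txn("A", 600, "US", "m1"),
    Txn("A", 660, "US", "m2"),
])
```

The first transaction on a card has no predecessor, so its time delta is left empty and both change flags are false.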
Synthesize More Context
• Too quickly between regions?
• Too quickly between merchants?
• Average merchant/region change by number of transactions
• Aggregate counts per card
• Standard deviation of time delta/amount by averages
Prep for Clustering and Visualization
1. StandardScaler – normalize distribution
2. Principal Component Analysis (PCA) – reduce to 3 dimensions
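The first preparation step, standard scaling, rescales every feature to zero mean and unit variance so that no single field dominates the distance calculations used later. A minimal pure-Python sketch is shown below; it covers the scaling step only (PCA is not shown), and the feature rows are made-up values.

```python
from statistics import mean, pstdev


def standard_scale(rows):
    """Scale each column to zero mean and unit variance (population std)."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # A constant column has std 0; fall back to 1.0 to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Hypothetical features per card: [transaction count, average amount].
features = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
scaled = standard_scale(features)
```

After scaling, both columns hold the same values even though the raw amounts were a hundred times larger than the counts.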
Finally – Cluster with KMeans
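For readers unfamiliar with k-means, the idea is to assign each point to its nearest centroid, then move each centroid to the mean of its points, and repeat. The sketch below is a bare-bones version for illustration only: it seeds the centroids with the first k points, which is an assumption made to keep the example deterministic and is not how production implementations usually initialise. The sample points are hypothetical.

```python
from math import dist


def kmeans(points, k, iterations=20):
    """Bare-bones k-means: returns (labels, centroids)."""
    centroids = [tuple(p) for p in points[:k]]  # naive, deterministic seeding
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [
            min(range(k), key=lambda c: dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids


# Hypothetical 2-D points forming two obvious groups.
points = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
```

In a fraud setting the interesting output is often the small or distant clusters, which are candidates for closer inspection rather than proof of fraud.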
“At a time when overdose deaths are at crisis levels across the country and in New York City, largely due to the opioid epidemic, healthcare providers have a responsibility to safeguard against any potential diversion of drugs. NewYork-Presbyterian is taking a leading role in protecting the public by implementing highly effective controls to avoid the illegitimate use of controlled substances. Ultimately, we hope that other hospitals benefit from this new platform as well.”
Jennings Aske, senior vice president and chief information security officer at NewYork-Presbyterian
https://medcitynews.com/2019/02/splunk-and-newyork-presbyterian/
https://www.healthcareitnews.com/news/newyork-presbyterian-working-machine-learning-analytics-combat-opioid-crisis
Together, NewYork-Presbyterian and Splunk are also creating an enhanced data analytics solution that investigates unauthorized access to patient records.
Detect the anomaly…
…drill down into that user…
Predicting Student Outcomes
Predicting student grades based on their digital interactions with university IT and identifying students that are at risk of dropping out
What Data Scientists Really Do
Data Preparation accounts for about 80% of the work of data scientists
“Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says”, Forbes Mar 23, 2016
Predicting Student Outcomes
index=oulad code_module=AAA
| eval weighted_score=score*(weight/100)
| eval student_code=id_student."_".code_module."_".code_presentation
| bin _time span=1mon
| stats sum(sum_click) as sum_clicks sum(weighted_score) as month_score avg(score) as
average_score by student_code _time
| streamstats sum(month_score) as cumulative_score last(average_score) as last_average
count by student_code
| eventstats max(count) as course_length
| eval average_score=if(average_score>0,average_score,if(last_average>0,last_average,0)),
cumulative_score=if(cumulative_score>0,cumulative_score,0),
module_perc_complete=count/course_length
| join student_code [| inputlookup student_info.csv | eval
student_code=id_student."_".code_module."_".code_presentation | table student_code age_band
highest_education imd_band studied_credits final_result]
| table _time student_code sum_clicks average_score cumulative_score module_perc_complete
studied_credits age_band highest_education imd_band final_result
| outputlookup oulad_aaa.csv
The search above works as follows:
1. eval: Calculate a weighted score and create a unique identifier for each student and module combination
2. bin, stats: Calculate the number of clicks, total score and average score for each student in each month
3. streamstats: Calculate the cumulative score over time for each student, get the previous average score for each month and create a rolling count
4. eventstats: Find the highest rolling count to use as the course length
5. eval: Fill in empty average and cumulative results and calculate the module percentage complete
6. join: Enrich the data with additional context for each student
7. table, outputlookup: Select only the fields we are interested in and save to a lookup
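The core of this feature preparation, a weighted score summed per month and then accumulated per student, can be sketched in Python. This is an illustration under assumptions: the input tuple layout and the sample rows are hypothetical, and only the weighted score, monthly click total and cumulative score are reproduced, not the full set of fields from the search.

```python
from collections import defaultdict


def monthly_features(rows):
    """rows: (student_code, month, score, weight, clicks) tuples.

    Returns one dict per student and month, in month order, with the
    monthly click total, monthly weighted score and running cumulative score.
    """
    monthly = defaultdict(lambda: {"sum_clicks": 0, "month_score": 0.0})
    for student, month, score, weight, clicks in rows:
        agg = monthly[(student, month)]
        agg["sum_clicks"] += clicks
        agg["month_score"] += score * (weight / 100)  # weighted score

    running = defaultdict(float)
    features = []
    for student, month in sorted(monthly):
        agg = monthly[(student, month)]
        running[student] += agg["month_score"]
        features.append({
            "student_code": student,
            "month": month,
            "sum_clicks": agg["sum_clicks"],
            "month_score": agg["month_score"],
            "cumulative_score": running[student],
        })
    return features


# Hypothetical activity: an assessment in each month plus a click-only row.
features = monthly_features([
    ("s1", 1, 80, 10, 5),
    ("s1", 2, 50, 20, 3),
    ("s1", 2, 0, 0, 4),
])
```

Accumulating per student keeps each student's running total independent, which mirrors the "by student_code" clause in the streamstats step.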
Predicting Student Outcomes
| inputlookup oulad_aaa.csv
| search final_result!="Withdrawn"
| sample partitions=10 seed=42
| search partition_number<7
| fit RandomForestClassifier final_result from average_score cumulative_score
module_perc_complete studied_credits sum_clicks age_band highest_education imd_band into
rf_oulad_aaa
The search above works as follows:
1. Remove data for withdrawn students
2. Select a random sample of 70% of the data
3. Train a random forest classifier on the data
Predicting Student Outcomes
| inputlookup oulad_aaa.csv
| search final_result!="Withdrawn"
| sample partitions=10 seed=42
| search partition_number>6
| apply rf_oulad_aaa
Apply the random forest classifier to the remaining 30% of the data
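The train/test split relies on assigning every row to one of ten partitions with a fixed seed, so that the training search and the test search see complementary, repeatable subsets. The Python sketch below illustrates that idea only; it assumes partitions numbered 0 to 9 and uses Python's own random generator, so it will not reproduce the exact assignments Splunk makes for seed=42.

```python
import random


def assign_partitions(rows, partitions=10, seed=42):
    """Tag each row with a pseudo-random partition number in [0, partitions)."""
    rng = random.Random(seed)  # fixed seed makes the assignment repeatable
    return [(rng.randrange(partitions), row) for row in rows]


def train_test_split(rows, boundary=7, partitions=10, seed=42):
    """Partitions below the boundary go to training, the rest to testing."""
    tagged = assign_partitions(rows, partitions, seed)
    train = [row for part, row in tagged if part < boundary]
    test = [row for part, row in tagged if part >= boundary]
    return train, test


# Hypothetical row identifiers standing in for the lookup rows.
rows = list(range(1000))
train, test = train_test_split(rows)
```

Because the partition numbers are drawn at random, the split is roughly 70/30 rather than exactly so. Reusing the same seed in both searches is what guarantees that no row lands in both sets.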
Predicting Student Outcomes
Train model
| inputlookup oulad_aaa.csv
| eval withdrawn=if(final_result="Withdrawn","Yes","No")
| sample partitions=10 seed=42
| search partition_number<7
| fit RandomForestClassifier withdrawn from average_score cumulative_score module_perc_complete studied_credits sum_clicks age_band highest_education imd_band into rf_withdrawn_oulad_aaa
Test model
| inputlookup oulad_aaa.csv
| eval withdrawn=if(final_result="Withdrawn","Yes","No")
| sample partitions=10 seed=42
| search partition_number>6
| apply rf_withdrawn_oulad_aaa
What Can Be Done to Promote Good Use of Machine Learning?
UK Government First to Pilot AI Procurement Guidelines Co-Designed with World Economic Forum
https://www.weforum.org/press/2019/09/uk-government-first-to-pilot-ai-procurement-guidelines-co-designed-with-world-economic-forum/
“Splunk has supported the development of these guidelines and worked closely with the WEF and UK Government. We will help pilot them in the UK and believe the guidance will enable Governments across the world transform citizen services and deliver ethically sound and beneficial AI based solutions.”
– Lenny Stein, Senior Vice President, Global Affairs, Splunk
Work with the WEF
Intent
Provide information to non-specialists so that they can assess the suitability of ML for a given problem/solution
Current solution
Procurement guidance for ‘unlocking public sector AI’
High level procurement processes
Best practices when evaluating an RFP
Map for creating AI-related RFPs
Unlocking Public Sector AI go-live
Expected in the coming months
Thank You!
More Related Content

PPTX
Better Threat Analytics: From Getting Started to Cloud Security Analytics and...
PPTX
Splunk Discovery Köln - 17-01-2020 - Willkommen!
PPTX
Splunk Platform 2020 & Beyond
PPTX
Do You Really Need to Evolve From Monitoring to Observability?
PPTX
Splunk Discovery Köln - 17-01-2020 - Accelerate Incident Response
PPTX
Splunk Discovery Köln - 17-01-2020 - Splunk for ITOps
PPTX
Splunk Overview
PPTX
Security Automation & Orchestration
Better Threat Analytics: From Getting Started to Cloud Security Analytics and...
Splunk Discovery Köln - 17-01-2020 - Willkommen!
Splunk Platform 2020 & Beyond
Do You Really Need to Evolve From Monitoring to Observability?
Splunk Discovery Köln - 17-01-2020 - Accelerate Incident Response
Splunk Discovery Köln - 17-01-2020 - Splunk for ITOps
Splunk Overview
Security Automation & Orchestration

What's hot (20)

PPTX
The Risks and Rewards of AI
PPTX
The Top 10 Glasstable Design Principles to Boost Your Career and Your Business
PDF
Splunk AI & Machine Learning Roundtable 2019 - Zurich
PDF
Splunk Artificial Intelligence & Machine Learning Webinar
PPTX
Wie erkenne ich die Auswirkungen von IT Ausfallen auf meine Produktion?
PPTX
SplunkLive! Stockholm 2019 - Customer presentation: Norlys
PPTX
SplunkLive! Stockholm 2019 - Customer presentation: ISS
PPTX
Spliunk Discovery Köln - 17-01-2020 - Intro to Security Analytics Methods
PPTX
Best Practices for Forwarder Hierarchies
PPTX
Splunk4Leaders
PDF
Manufacturing Webinar AMS
PPTX
Catch these Sessions on-demand at .conf Online
PPTX
Worst Splunk practices...and how to fix them
PPTX
IoT Analytics @ splunk
PPTX
Introduction into Security Analytics Methods
PPTX
Extending Splunk to Business Use Cases With Automated Process Mining
PPTX
Clear the Mist from your Clouds with Splunk
PDF
Three Pillars, No Answers: Helping Platform Teams Solve Real Observability Pr...
PPTX
How to justify the economic value of your data investment
PPTX
SplunkLive! Utrecht 2019: NXP
The Risks and Rewards of AI
The Top 10 Glasstable Design Principles to Boost Your Career and Your Business
Splunk AI & Machine Learning Roundtable 2019 - Zurich
Splunk Artificial Intelligence & Machine Learning Webinar
Wie erkenne ich die Auswirkungen von IT Ausfallen auf meine Produktion?
SplunkLive! Stockholm 2019 - Customer presentation: Norlys
SplunkLive! Stockholm 2019 - Customer presentation: ISS
Spliunk Discovery Köln - 17-01-2020 - Intro to Security Analytics Methods
Best Practices for Forwarder Hierarchies
Splunk4Leaders
Manufacturing Webinar AMS
Catch these Sessions on-demand at .conf Online
Worst Splunk practices...and how to fix them
IoT Analytics @ splunk
Introduction into Security Analytics Methods
Extending Splunk to Business Use Cases With Automated Process Mining
Clear the Mist from your Clouds with Splunk
Three Pillars, No Answers: Helping Platform Teams Solve Real Observability Pr...
How to justify the economic value of your data investment
SplunkLive! Utrecht 2019: NXP
Ad

Similar to Machine Learning and Social Good (20)

PDF
The Data Science Process
PDF
A Risk Based Approach to Security Detection and Investigation by Kelby Shelton
PDF
Data Analysis - Making Big Data Work
PDF
Blue Eye Technology
PPTX
SplunkLive! Paris 2018: Splunk And AI 101
PDF
28022017 Simen Munter Mindfields
DOC
Hyperlink
PDF
Splunk 4 Ninja ITSI Workshop
PDF
Fake News and Message Detection
PDF
More Than Monitoring: How Observability Takes You From Firefighting to Fire P...
PDF
H2020 finsec-ort-webinar-ml-dl-cybersecurity-july 2020
PDF
Crime Prediction and Analysis
PDF
CRIME ANALYSIS AND PREDICTION USING MACHINE LEARNING
PDF
Predictive Modeling for Topographical Analysis of Crime Rate
PPTX
Webinar: Building a Business Case for Enterprise Search
PDF
Person Acquisition and Identification Tool
PDF
IRJET- Credit Card Fraud Detection using Machine Learning
PDF
Employment Performance Management Using Machine Learning
PDF
Itsi in-the-wild-why-micron-chose-splunk-it-service-intelligence-and-lessons-...
PPTX
Merging forensics w data analytics
The Data Science Process
A Risk Based Approach to Security Detection and Investigation by Kelby Shelton
Data Analysis - Making Big Data Work
Blue Eye Technology
SplunkLive! Paris 2018: Splunk And AI 101
28022017 Simen Munter Mindfields
Hyperlink
Splunk 4 Ninja ITSI Workshop
Fake News and Message Detection
More Than Monitoring: How Observability Takes You From Firefighting to Fire P...
H2020 finsec-ort-webinar-ml-dl-cybersecurity-july 2020
Crime Prediction and Analysis
CRIME ANALYSIS AND PREDICTION USING MACHINE LEARNING
Predictive Modeling for Topographical Analysis of Crime Rate
Webinar: Building a Business Case for Enterprise Search
Person Acquisition and Identification Tool
IRJET- Credit Card Fraud Detection using Machine Learning
Employment Performance Management Using Machine Learning
Itsi in-the-wild-why-micron-chose-splunk-it-service-intelligence-and-lessons-...
Merging forensics w data analytics
Ad

More from Splunk (20)

PDF
Splunk Leadership Forum Wien - 20.05.2025
PDF
Splunk Security Update | Public Sector Summit Germany 2025
PDF
Building Resilience with Energy Management for the Public Sector
PDF
IT-Lagebild: Observability for Resilience (SVA)
PDF
Nach dem SOC-Aufbau ist vor der Automatisierung (OFD Baden-Württemberg)
PDF
Monitoring einer Sicheren Inter-Netzwerk Architektur (SINA)
PDF
Praktische Erfahrungen mit dem Attack Analyser (gematik)
PDF
Cisco XDR & Splunk SIEM - stronger together (DATAGROUP Cyber Security)
PDF
Security - Mit Sicherheit zum Erfolg (Telekom)
PDF
One Cisco - Splunk Public Sector Summit Germany April 2025
PDF
.conf Go 2023 - Data analysis as a routine
PDF
.conf Go 2023 - How KPN drives Customer Satisfaction on IPTV
PDF
.conf Go 2023 - Navegando la normativa SOX (Telefónica)
PDF
.conf Go 2023 - Raiffeisen Bank International
PDF
.conf Go 2023 - På liv og død Om sikkerhetsarbeid i Norsk helsenett
PDF
.conf Go 2023 - Many roads lead to Rome - this was our journey (Julius Bär)
PDF
.conf Go 2023 - Das passende Rezept für die digitale (Security) Revolution zu...
PDF
.conf go 2023 - Cyber Resilienz – Herausforderungen und Ansatz für Energiever...
PDF
.conf go 2023 - De NOC a CSIRT (Cellnex)
PDF
conf go 2023 - El camino hacia la ciberseguridad (ABANCA)
Splunk Leadership Forum Wien - 20.05.2025
Splunk Security Update | Public Sector Summit Germany 2025
Building Resilience with Energy Management for the Public Sector
IT-Lagebild: Observability for Resilience (SVA)
Nach dem SOC-Aufbau ist vor der Automatisierung (OFD Baden-Württemberg)
Monitoring einer Sicheren Inter-Netzwerk Architektur (SINA)
Praktische Erfahrungen mit dem Attack Analyser (gematik)
Cisco XDR & Splunk SIEM - stronger together (DATAGROUP Cyber Security)
Security - Mit Sicherheit zum Erfolg (Telekom)
One Cisco - Splunk Public Sector Summit Germany April 2025
.conf Go 2023 - Data analysis as a routine
.conf Go 2023 - How KPN drives Customer Satisfaction on IPTV
.conf Go 2023 - Navegando la normativa SOX (Telefónica)
.conf Go 2023 - Raiffeisen Bank International
.conf Go 2023 - På liv og død Om sikkerhetsarbeid i Norsk helsenett
.conf Go 2023 - Many roads lead to Rome - this was our journey (Julius Bär)
.conf Go 2023 - Das passende Rezept für die digitale (Security) Revolution zu...
.conf go 2023 - Cyber Resilienz – Herausforderungen und Ansatz für Energiever...
.conf go 2023 - De NOC a CSIRT (Cellnex)
conf go 2023 - El camino hacia la ciberseguridad (ABANCA)


Machine Learning and Social Good

  • 1. © 2020 Splunk Inc. Machine Learning for Social Good. Dr. Greg Ainslie-Malik – Machine Learning Architect
  • 2. Forward-Looking Statements. During the course of this presentation, we may make forward-looking statements regarding future events or plans of the company. We caution you that such statements reflect our current expectations and estimates based on factors currently known to us and that actual events or results may differ materially. The forward-looking statements made in this presentation are being made as of the time and date of its live presentation. If reviewed after its live presentation, it may not contain current or accurate information. We do not assume any obligation to update any forward-looking statements made herein. In addition, any information about our roadmap outlines our general product direction and is subject to change at any time without notice. It is for informational purposes only, and shall not be incorporated into any contract or other commitment. Splunk undertakes no obligation either to develop the features or functionalities described or to include any such feature or functionality in a future release. Splunk, Splunk>, Data-to-Everything, D2E, and Turn Data Into Doing are trademarks and registered trademarks of Splunk Inc. in the United States and other countries. All other brand names, product names, or trademarks belong to their respective owners. © 2020 Splunk Inc. All rights reserved.
  • 3. Agenda: 1. Introduction to Machine Learning 2. Common challenges with Machine Learning 3. Where have we seen Machine Learning used for social good? (Anomaly detection; Fraud detection; Learning analytics) 4. What else are we doing to promote good use of Machine Learning?
  • 4. Introduction to Machine Learning
  • 5. What is Machine Learning? Artificial Intelligence (AI) contains Machine Learning (ML), which in turn contains Deep Learning. AI is generally taken to mean any type of algorithm or programme that allows computers to mimic human behaviour. ML is a subset of AI that allows machines to make improvements over time. Deep Learning is a type of machine learning that is based on neural networks.
  • 6. What is Machine Learning? Classic Programming: Data + Rules → Outcomes. Machine Learning: Data + Outcomes (supervised only) → Rules.
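The contrast on slide 6 can be illustrated with a minimal Python sketch. It is not from the presentation: the function names, the threshold of 100 and the sample data are all hypothetical, and the "learning" step is deliberately trivial (a midpoint between two separable classes).

```python
def classic_rule(value):
    # Classic programming: a person writes the rule by hand (100 is an assumed cut-off).
    return value > 100


def learn_rule(examples):
    """Supervised learning: derive the rule from data plus known outcomes.

    examples: list of (value, is_anomaly) pairs. Assumes the two classes
    can be separated by a single threshold.
    """
    highest_normal = max(v for v, is_anomaly in examples if not is_anomaly)
    lowest_anomaly = min(v for v, is_anomaly in examples if is_anomaly)
    threshold = (highest_normal + lowest_anomaly) / 2
    return lambda value: value > threshold


# Data and outcomes go in, a rule comes out.
examples = [(10, False), (20, False), (200, True), (300, True)]
learned_rule = learn_rule(examples)
```

In the first function the rule is an input supplied by a person; in the second it is the output of the training data.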
  • 7. Why Use Machine Learning? Observations from Splunk customers: identify anomalies or ‘unknown unknowns’; improve alert accuracy; highlight weak relationships.
  • 8. How Machine Learning Fits into Splunk: Search. Every search can use Machine Learning. Diagram labels: data sources (OT, Industrial Assets, IT, Consumer and Mobile Devices, Security); real-time alerts; alert actions (send an email, file a ticket, send a text, flash lights, trigger process flow); destinations (third-party applications, smartphones and devices, tickets, email).
  • 9. Common Challenges with Machine Learning
  • 10. Problem Statement: There is a lack of trust in Machine Learning. This is largely caused by limited transparency or explainability of most Machine Learning processes. Therefore it can be difficult to identify negative bias when applying Machine Learning.
  • 11. Most organizations’ data is still dark data: untapped, unanalysed, unowned. 60% of organizations report that the majority of their data is still dark. Source: Splunk Inc., “State of Dark Data Report”, May 2019.
  • 12. Our World Never Stops Evolving. How can we handle the half-life of data?
  • 13. Use of AI. Globally, 61%–67% saw value in AI for their organizations. 60%–70% of respondents believe that they will be using AI across IT, operations and talent management in the future. And yet only 10%–15% say their organizations are deploying AI for use cases today. While only 12% say that AI is currently guiding their business strategy, 61% expect it to do so in the next five years. Organizations admit they’re not ready for AI. Their top four concerns (the slide shows figures of 81%, 80%, 78% and 78% alongside them): 1. Lack of trained AI experts 2. Lack of understanding of AI 3. Not knowing what can be automated 4. Difficulty successfully wrangling the data
  • 14. Do you know what’s happening? Can you turn data into action? How do you build for the future?
  • 15. Key Takeaways: 1. Try to gain as much visibility of your data as possible 2. Minimise the delivery time for that data 3. Invest in data skills
  • 16. Machine Learning for Social Good: Example case studies
  • 17. Finding Potential Cyber Security Incidents: Identifying anomalies in massive datasets
  • 18. Use Case: Proxy Communication Investigation Workflow (steps 1 to 5). Source: https://conf.splunk.com/files/2019/slides/SEC1374.pdf
  • 19. Using the DensityFunction to Find Anomalies: | tstats count WHERE (index=botsv2) BY _time span=60m
  • 20. Using the DensityFunction to Find Anomalies: | tstats count WHERE (index=botsv2) BY _time span=60m | eval HourOfDay=strftime(_time, "%H") | fit DensityFunction count by "HourOfDay" into df_bots_dns | table _time count IsOutlier(count)
  • 21. Using the DensityFunction to Find Anomalies: | tstats count WHERE (index=botsv2) BY _time span=60m | eval HourOfDay=strftime(_time, "%H") | apply df_bots_dns threshold=0.03 | table _time count IsOutlier(count)
  • 22. Using the DensityFunction to Find Anomalies: | summary df_bots_dns
  • 23. Using the DensityFunction to Find Anomalies: | summary df_bots_dns (annotations on the summary output: much bigger standard deviation; much higher mean than the other times of day; none of the times of day have many training points)
  • 24. Using the DensityFunction to Find Anomalies. Reduce the threshold and include the probability density in the results: | tstats count WHERE (index=botsv2) BY _time span=60m | eval HourOfDay=strftime(_time, "%H") | apply df_bots_dns threshold=0.003 show_density=true | where 'IsOutlier(count)'>0 | join HourOfDay [| summary df_bots_dns | table HourOfDay cardinality mean std] | table _time count ProbabilityDensity(count) cardinality mean std | eval distance_from_mean=abs(count-mean), deviations_from_mean=abs(count-mean)/std
  • 25. Using the DensityFunction to Find Anomalies (same search). Filter the data to only show the anomalies: | where 'IsOutlier(count)'>0
  • 26. Using the DensityFunction to Find Anomalies (same search). Join with the summary data to include the cardinality, mean and standard deviation: | join HourOfDay [| summary df_bots_dns | table HourOfDay cardinality mean std]
  • 27. Using the DensityFunction to Find Anomalies (same search). Calculate some additional fields using the mean and standard deviation that describe how extreme the outlier is: | eval distance_from_mean=abs(count-mean), deviations_from_mean=abs(count-mean)/std
  • 28. Using the DensityFunction to Find Anomalies: | tstats count WHERE (index=botsv2) BY _time span=60m | eval HourOfDay=strftime(_time, "%H") | apply df_bots_dns threshold=0.003 show_density=true | where 'IsOutlier(count)'>0 | join HourOfDay [| summary df_bots_dns | table HourOfDay cardinality mean std] | table _time count ProbabilityDensity(count) cardinality mean std | eval distance_from_mean=abs(count-mean), deviations_from_mean=abs(count-mean)/std
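For readers less familiar with SPL, the following Python sketch approximates what the search on slides 24 to 28 computes. It is an illustration under stated assumptions, not the DensityFunction implementation: it assumes a normal distribution per hour of day and treats `threshold` as a two-sided tail probability, whereas the real algorithm can fit other distribution types. The function names and sample counts are hypothetical.

```python
from collections import defaultdict
from statistics import NormalDist, mean, stdev


def fit_by_hour(rows):
    """Fit one normal distribution per hour of day.

    rows: iterable of (hour_of_day, count) pairs.
    Returns {hour: (cardinality, mean, std)}.
    """
    groups = defaultdict(list)
    for hour, count in rows:
        groups[hour].append(count)
    # stdev needs at least two points, so hours with fewer are skipped.
    return {
        hour: (len(counts), mean(counts), stdev(counts))
        for hour, counts in groups.items()
        if len(counts) > 1
    }


def find_outliers(rows, model, threshold=0.003):
    """Return rows whose two-sided tail probability falls below threshold."""
    outliers = []
    for hour, count in rows:
        if hour not in model:
            continue
        cardinality, mu, sigma = model[hour]
        if sigma == 0:
            continue  # no spread in the training data for this hour
        deviations = abs(count - mu) / sigma
        tail_probability = 2 * (1 - NormalDist().cdf(deviations))
        if tail_probability < threshold:
            outliers.append({
                "hour": hour,
                "count": count,
                "cardinality": cardinality,
                "mean": mu,
                "std": sigma,
                "distance_from_mean": abs(count - mu),
                "deviations_from_mean": deviations,
            })
    return outliers


# Hypothetical hourly counts for the 09:00 bucket.
training = [("09", c) for c in (100, 102, 98, 101, 99)]
model = fit_by_hour(training)
outliers = find_outliers([("09", 100), ("09", 130)], model)
```

Keeping `cardinality` next to each outlier mirrors the point made on slide 23: with few training points per hour, the fitted mean and standard deviation deserve less trust.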
  • 29. Identifying Fraud: Finding anomalies in credit card transactions, prescriptions and accesses to patient data
  • 30. Credit Card Fraud Example. Common for exploring transactional data: group like with like; data is often “batch” loaded; often proactively searching for unknown unknowns.
  • 31. Enrich the Transactions: Has the region changed between card transactions? Calculate the time delta between card transactions. Has the merchant changed between card transactions?
  • 32. Synthesize More Context: Too quickly between regions? Too quickly between merchants? Aggregate counts per card. Average merchant/region change by number of transactions. Standard deviation of time delta/amount by averages.
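The enrichment steps on slides 31 and 32 can be sketched in Python as follows. The slides do not show the actual search, so the field names (`card`, `time`, `region`, `merchant`) and the one-hour cut-off for "too quickly" are assumptions made for illustration only.

```python
from collections import defaultdict


def enrich(transactions, min_seconds_between_regions=3600):
    """Add per-card context to each transaction.

    transactions: list of dicts with card, time (epoch seconds), region, merchant.
    Returns new dicts with time_delta, region_change, merchant_change and
    too_quick_region_change added.
    """
    by_card = defaultdict(list)
    for txn in transactions:
        by_card[txn["card"]].append(txn)

    enriched = []
    for txns in by_card.values():
        txns.sort(key=lambda t: t["time"])  # compare each txn with the previous one
        previous = None
        for txn in txns:
            row = dict(txn)
            if previous is None:
                row.update(time_delta=None, region_change=False,
                           merchant_change=False, too_quick_region_change=False)
            else:
                delta = txn["time"] - previous["time"]
                region_change = txn["region"] != previous["region"]
                row.update(
                    time_delta=delta,
                    region_change=region_change,
                    merchant_change=txn["merchant"] != previous["merchant"],
                    too_quick_region_change=(
                        region_change and delta < min_seconds_between_regions
                    ),
                )
            enriched.append(row)
            previous = txn
    return enriched


# Hypothetical data: the same card used in two regions ten minutes apart.
sample = [
    {"card": "A", "time": 600, "region": "US", "merchant": "Y"},
    {"card": "A", "time": 0, "region": "UK", "merchant": "X"},
]
result = enrich(sample)
```

Per-card aggregates such as counts, averages and standard deviations could then be computed over these enriched rows before clustering.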
  • 33. Prep for Clustering and Visualization: 1. StandardScaler – normalize distribution 2. Principal Component Analysis (PCA) – reduce to 3 dimensions
  • 34. Finally – Cluster with KMeans
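A minimal pure-Python sketch of two of the three steps named on slides 33 and 34: standard scaling followed by k-means clustering. The PCA step is omitted here for brevity; in the workflow described it would sit between the two. The first-k-points initialisation is a simplification chosen to keep the example deterministic, and the sample rows are invented.

```python
from statistics import mean, pstdev


def standardize(rows):
    """Scale each column to zero mean and unit variance."""
    columns = list(zip(*rows))
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stats = [(mean(col), pstdev(col) or 1.0) for col in columns]
    return [[(v - m) / s for v, (m, s) in zip(row, stats)] for row in rows]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain k-means; returns one cluster label per point."""
    centroids = [list(p) for p in points[:k]]  # simplistic, deterministic init
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [mean(col) for col in zip(*members)]
    return labels


# Hypothetical per-card features: (transaction count, total amount).
rows = [[1, 100], [2, 110], [1.5, 105], [50, 5000], [51, 5100], [49, 4900]]
labels = kmeans(standardize(rows), k=2)
```

Scaling first matters because k-means uses Euclidean distance: without it, the amount column would dominate the count column.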
  • 35. “At a time when overdose deaths are at crisis levels across the country and in New York City, largely due to the opioid epidemic, healthcare providers have a responsibility to safeguard against any potential diversion of drugs. NewYork-Presbyterian is taking a leading role in protecting the public by implementing highly effective controls to avoid the illegitimate use of controlled substances. Ultimately, we hope that other hospitals benefit from this new platform as well.” Jennings Aske, senior vice president and chief information security officer at NewYork-Presbyterian. Sources: https://medcitynews.com/2019/02/splunk-and-newyork-presbyterian/ and https://www.healthcareitnews.com/news/newyork-presbyterian-working-machine-learning-analytics-combat-opioid-crisis
  • 38. Together, NewYork-Presbyterian and Splunk are also creating an enhanced data analytics solution that investigates unauthorized access to patient records.
  • 40. Detect the anomaly…
  • 41. …drill down into that user…
  • 42. Predicting Student Outcomes: Predicting student grades based on their digital interactions with university IT and identifying students who are at risk of dropping out
  • 44. What Data Scientists Really Do: Data preparation accounts for about 80% of the work of data scientists. Source: “Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says”, Forbes, Mar 23, 2016
  • 45. Predicting Student Outcomes: index=oulad code_module=AAA
  • 46. Predicting Student Outcomes: index=oulad code_module=AAA | eval weighted_score=score*(weight/100) | eval student_code=id_student."_".code_module."_".code_presentation | bin _time span=1mon | stats sum(sum_click) as sum_clicks sum(weighted_score) as month_score avg(score) as average_score by student_code _time | streamstats sum(month_score) as cumulative_score last(average_score) as last_average count by student_code | eventstats max(count) as course_length | eval average_score=if(average_score>0,average_score,if(last_average>0,last_average,0)), cumulative_score=if(cumulative_score>0,cumulative_score,0), module_perc_complete=count/course_length | join student_code [| inputlookup student_info.csv | eval student_code=id_student."_".code_module."_".code_presentation | table student_code age_band highest_education imd_band studied_credits final_result] | table _time student_code sum_clicks average_score cumulative_score module_perc_complete studied_credits age_band highest_education imd_band final_result | outputlookup oulad_aaa.csv
  • 47. Predicting Student Outcomes. Calculate a weighted score and create a unique identifier for each student and module combination: index=oulad code_module=AAA | eval weighted_score=score*(weight/100), student_code=id_student."_".code_module."_".code_presentation | bin _time span=1mon | stats sum(sum_click) as sum_clicks sum(weighted_score) as month_score avg(score) as average_score by student_code _time | streamstats sum(month_score) as cumulative_score last(average_score) as last_average count by student_code | eventstats max(count) as course_length | eval average_score=if(average_score>0,average_score,if(last_average>0,last_average,0)), cumulative_score=if(cumulative_score>0,cumulative_score,0), module_perc_complete=count/course_length | join student_code [| inputlookup student_info.csv | eval student_code=id_student."_".code_module."_".code_presentation | table student_code age_band highest_education imd_band studied_credits final_result] | table _time student_code sum_clicks average_score cumulative_score module_perc_complete studied_credits age_band highest_education imd_band final_result | outputlookup oulad_aaa.csv
  • 48. Predicting Student Outcomes (same search). Calculate the number of clicks, total score and average score for each student in each month: | bin _time span=1mon | stats sum(sum_click) as sum_clicks sum(weighted_score) as month_score avg(score) as average_score by student_code _time
  • 49. Predicting Student Outcomes (same search). Calculate the cumulative score over time for each student, get the previous average score for each month and create a rolling count: | streamstats sum(month_score) as cumulative_score last(average_score) as last_average count by student_code
  • 50. Predicting Student Outcomes (same search). Find the highest rolling count to use as the course length: | eventstats max(count) as course_length
  • 51. Predicting Student Outcomes (same search). Fill in empty average and cumulative results and calculate the module percentage complete: | eval average_score=if(average_score>0,average_score,if(last_average>0,last_average,0)), cumulative_score=if(cumulative_score>0,cumulative_score,0), module_perc_complete=count/course_length
  • 52. Predicting Student Outcomes (same search). Enrich the data with additional context for each student: | join student_code [| inputlookup student_info.csv | eval student_code=id_student."_".code_module."_".code_presentation | table student_code age_band highest_education imd_band studied_credits final_result]
  • 53. Predicting Student Outcomes (same search). Select only the fields we are interested in and save to a lookup: | table _time student_code sum_clicks average_score cumulative_score module_perc_complete studied_credits age_band highest_education imd_band final_result | outputlookup oulad_aaa.csv
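The data preparation walked through on slides 47 to 53 can be approximated in plain Python as below. This is a sketch under assumptions rather than a translation of the search: events are assumed to carry a precomputed integer `month` index instead of `_time`, the join with `student_info.csv` is left out, and the function name is hypothetical.

```python
from collections import defaultdict


def monthly_features(events):
    """Build per-student, per-month features.

    events: dicts with student_code, month (integer index) and optionally
    sum_click, score and weight. These field names are assumptions.
    """
    monthly = defaultdict(lambda: {"sum_clicks": 0, "month_score": 0.0, "scores": []})
    for event in events:
        bucket = monthly[(event["student_code"], event["month"])]
        bucket["sum_clicks"] += event.get("sum_click", 0)
        if event.get("score") is not None:
            # Weighted score: the assessment weight is a percentage.
            bucket["month_score"] += event["score"] * (event["weight"] / 100)
            bucket["scores"].append(event["score"])

    rows = []
    running = defaultdict(lambda: {"cumulative": 0.0, "last_average": 0.0, "count": 0})
    for student, month in sorted(monthly):
        bucket = monthly[(student, month)]
        state = running[student]
        state["count"] += 1  # rolling count of months seen for this student
        state["cumulative"] += bucket["month_score"]
        if bucket["scores"]:
            state["last_average"] = sum(bucket["scores"]) / len(bucket["scores"])
        rows.append({
            "student_code": student,
            "month": month,
            "sum_clicks": bucket["sum_clicks"],
            # Months without a score carry the last known average forward.
            "average_score": state["last_average"],
            "cumulative_score": state["cumulative"],
            "month_count": state["count"],
        })

    # The highest rolling count stands in for the course length.
    course_length = max(row["month_count"] for row in rows)
    for row in rows:
        row["module_perc_complete"] = row["month_count"] / course_length
    return rows


# Hypothetical events for two students.
events = [
    {"student_code": "s1", "month": 0, "sum_click": 10, "score": 80, "weight": 10},
    {"student_code": "s1", "month": 1, "sum_click": 5},
    {"student_code": "s2", "month": 0, "sum_click": 3, "score": 50, "weight": 20},
]
rows = monthly_features(events)
```

Carrying the last average forward plays the role of the `if(average_score>0, ..., last_average)` fill-in step in the search.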
  • 54. Predicting Student Outcomes
  • 55. Predicting Student Outcomes: | inputlookup oulad_aaa.csv | search final_result!="Withdrawn" | sample partitions=10 seed=42 | search partition_number<7 | fit RandomForestClassifier final_result from average_score cumulative_score module_perc_complete studied_credits sum_clicks age_band highest_education imd_band into rf_oulad_aaa
  • 56. Predicting Student Outcomes (same search). Remove data for withdrawn students: | search final_result!="Withdrawn"
  • 57. Predicting Student Outcomes (same search). Select a random sample of 70% of the data: | sample partitions=10 seed=42 | search partition_number<7
  • 58. Predicting Student Outcomes (same search). Train a random forest classifier on the data: | fit RandomForestClassifier final_result from average_score cumulative_score module_perc_complete studied_credits sum_clicks age_band highest_education imd_band into rf_oulad_aaa
  • 59. Predicting Student Outcomes: | inputlookup oulad_aaa.csv | search final_result!="Withdrawn" | sample partitions=10 seed=42 | search partition_number>6 | apply rf_oulad_aaa
  • 60. Predicting Student Outcomes (same search). Apply the random forest classifier on the remaining 30% of the data: | search partition_number>6 | apply rf_oulad_aaa
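The 70/30 split used on slides 55 to 60 relies on the same seed producing the same partition numbers in both the training and the test search. A small Python sketch of that idea follows; it mimics the behaviour with Python's own random generator, so the actual partition assignments will not match those of the `sample` command, and the function name is hypothetical.

```python
import random


def split_by_partition(rows, partitions=10, train_below=7, seed=42):
    """Assign each row a random partition number and split on it.

    Rows with partition_number < train_below go to training (about 70%
    with the defaults); the rest go to testing.
    """
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    train, test = [], []
    for row in rows:
        partition_number = rng.randrange(partitions)  # 0 .. partitions-1
        if partition_number < train_below:
            train.append(row)
        else:
            test.append(row)
    return train, test


data = list(range(1000))
train, test = split_by_partition(data)
train_again, test_again = split_by_partition(data)
```

Because the seed is fixed, running the split twice yields identical sets, which is what keeps the test rows out of the training data when the two searches are run separately.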
  • 61. Predicting Student Outcomes
  • 62. Predicting Student Outcomes. Train model: | inputlookup oulad_aaa.csv | eval withdrawn=if(final_result="Withdrawn","Yes","No") | sample partitions=10 seed=42 | search partition_number<7 | fit RandomForestClassifier withdrawn from average_score cumulative_score module_perc_complete studied_credits sum_clicks age_band highest_education imd_band into rf_withdrawn_oulad_aaa Test model: | inputlookup oulad_aaa.csv | eval withdrawn=if(final_result="Withdrawn","Yes","No") | sample partitions=10 seed=42 | search partition_number>6 | apply rf_withdrawn_oulad_aaa
  • 63. Predicting Student Outcomes
  • 64. What Can Be Done to Promote Good Use of Machine Learning?
  • 65. UK Government First to Pilot AI Procurement Guidelines Co-Designed with World Economic Forum. “Splunk has supported the development of these guidelines and worked closely with the WEF and UK Government. We will help pilot them in the UK and believe the guidance will enable Governments across the world transform citizen services and deliver ethically sound and beneficial AI based solutions.” Lenny Stein, Senior Vice President, Global Affairs, Splunk. Source: https://www.weforum.org/press/2019/09/uk-government-first-to-pilot-ai-procurement-guidelines-co-designed-with-world-economic-forum/
  • 66. Work with the WEF. Intent: provide information to non-specialists so that they can assess the suitability of ML for a given problem/solution. Current solution: procurement guidance for ‘unlocking public sector AI’, covering high-level procurement processes, best practices when evaluating an RFP, and a map for creating AI-related RFPs. Unlocking Public Sector AI go-live: expected in the coming months.
  • 67. Thank You!