Cluster, Classify, Associate, Regress:
Satisfy Your Inner Data Scientist
with OML and Analytics
Jim Czuprynski
@JimTheWhyGuy
Zero Defect Computing, Inc.
Photo credit: National Cancer Institute on UnSplash
Who Am I, and How Did I Get Here?
E-mail me at jczuprynski@zerodefectcomputing.com
Follow me on Twitter (@JimTheWhyGuy)
Connect with me on LinkedIn (Jim Czuprynski)
Traveler & public speaker
Summers: Wisconsin
Winters: Illinois
Cyclist
XC skier
Avid amateur bird watcher
Oldest dude in Krav Maga class
Jim Czuprynski
Liron Amitzi
https://www.beyondtechskills.com
The podcast that talks about everything tech – except tech.™
Typical Business Use Cases Driving Machine Learning & Analytics
Are any of our customers at financial risk to default on their investments in solar panel technology?
Are current solar energy program customers likely to recommend our program to colleagues, or maybe even request more alternative energy sources?
Has the 2020 pandemic hurt our best customers’ ability to continue to invest in alternate energy sources?
What’s the long-term return our customers can expect to realize on their solar energy investment?
Which customers would be best candidates to take advantage of our solar panel program?
ML Basics: Clustering, Classification, Association, Regression
Which customers would be best candidates to take advantage of our solar panel program?
Are any of our customers at financial risk to default on their investments in solar panel technology?
What’s the long-term return our customers can expect to realize on their solar energy investment?
Are current solar energy program customers likely to recommend our program to colleagues, or maybe even request more alternative energy sources?
Clustering identifies how certain like items are grouped together into one or more previously undefined categories
Classification lets us predict the probability for a single item to belong to a defined category, or even multiple categories
Regression gives us the power to project trends forward in time within upper and lower confidence boundaries
Association discovers the chance that two or more items are likely to be found within the same collection as well as the relationships between those items
Check out this summary that breaks out all of the inherent algorithms that support these methods
SIMIOT: A Modern Electric Utility’s Information Model
SMART_METERS: Contains information about individual smart meters for which data is being collected
METER_READINGS: Individual readings for each smart meter over (often extremely!) short time intervals
DISPATCH_CENTERS: Utility repair depots from which servicepeople & trucks are dispatched
BUSINESS_DESCRIPTIONS: Describes unique business classifications based on licensing issued
CUSTOMER_DEMOGRAPHICS: Demographics for each customer, including ethnicity, minority status, and employee counts
CUSTOMER_FINANCIALS: Pertinent financial data for each customer, including creditworthiness
CUSTOMER_RESPONSES: Selected social media posts reflecting how customers anonymously view services & products
Probing Data Sources, Wherever They Live And What They Comprise
Captured from an OSS stream - 10K readings every minute
Most recent data within database, but history available as CSV files in Amazon S3 bucket (which cannot be moved!)
Stored in an Excel spreadsheet … “somewhere”
Pulled from City of Chicago public records via API calls, saved as CSVs and (re)loaded quarterly
Captured via Twitter API stream, stored as multiple JSON documents
Converged Database: A Vision for the Future, 21c and Beyond
[Diagram: converged database architecture. Labels recovered from the figure:]
Data sources: Enterprise Applications (ERP, CRM, HCM); Personal / External Datasets; REST-Enabled External APIs; IoT and Edge Computing
Data Integration: OAC Dataflow, Manual or ETL; Ad hoc, Batch or Scheduled
Data Management: ADW, ATP, AJD, AGD (self-sufficient, encrypted, secured data storehouse)
Business Analytics: OAC, OML, APEX, and Graph Studio (self-service analytics via ML)
Users: Business Leaders, Analysts, Data Scientists, Developers
The new kid on the block: Autonomous JSON Database
Recently announced: Autonomous Database for Graph Studio
AutoML features make it simple to apply the right algorithm(s) with confidence
Picking the Right Algorithm: An Algorithm Would Be Nice …
Here’s the good news: There’s at least one powerful Machine Learning algorithm to choose from for any analytic scenario
Here’s the bad news: You’ve got to pick the right one considering its advantages and drawbacks for each scenario
Uncover Associations From Customer Survey Results (1)
CS_ID CS_POSITIVE_RESPONSE
---------- -------------------------
. . .
2633595 PlanningToRenew
2633595 WindTurbineInterest
2635273 PlanningToRenew
2635273 RecommendedToColleague
2635273 WindTurbineInterest
2635712 PlanningToRenew
2637099 PlanningToRenew
2637464 PlanningToRenew
2637464 WindTurbineInterest
2637995 PlanningToRenew
2637995 RecommendedToColleague
2637995 WindTurbineInterest
. . .
Our marketing department recently surveyed the top 100 users of our solar power initiative. Some were excited enough to recommend it to others, and some even want to expand to wind turbines …
… and others, not so much.
Uncover Associations From Customer Survey Results (2)
DECLARE
-- Processing variables:
SQLERRNUM INTEGER := 0;
SQLERRMSG VARCHAR2(255);
vcReturnValue VARCHAR2(256) := 'Failure!';
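-- Model settings, passed to CREATE_MODEL2 as SET_LIST:
v_setlst DBMS_DATA_MINING.SETTING_LIST;
BEGIN
-- Apriori association rules with automatic data preparation
v_setlst('ALGO_NAME') := 'ALGO_APRIORI_ASSOCIATION_RULES';
v_setlst('PREP_AUTO') := 'ON';
-- Keep rules with at least 4% support and 10% confidence, at most 2 items each
v_setlst('ASSO_MIN_SUPPORT') := '0.04';
v_setlst('ASSO_MIN_CONFIDENCE') := '0.1';
v_setlst('ASSO_MAX_RULE_LENGTH') := '2';
-- Column that holds the item (survey response) within each case
v_setlst('ODMS_ITEM_ID_COLUMN_NAME') := 'CS_POSITIVE_RESPONSE';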
. . .
We’ll use the Apriori algorithm to uncover any hidden relationships between different responses to our recent customer satisfaction survey …
Uncover Associations From Customer Survey Results (3)
. . .
DBMS_DATA_MINING.CREATE_MODEL2(
MODEL_NAME => 'CUSTOMER_RENEWABILITY'
,MINING_FUNCTION => 'ASSOCIATION'
,DATA_QUERY => 'SELECT * FROM t_survey_results'
,SET_LIST => v_setlst
,CASE_ID_COLUMN_NAME => 'CS_ID');
EXCEPTION
WHEN OTHERS THEN
SQLERRNUM := SQLCODE;
SQLERRMSG := SQLERRM;
vcReturnValue :=
'Error creating new OML4SQL model: ' ||
SQLERRNUM || ' - ' || SQLERRMSG;
DBMS_APPLICATION_INFO.SET_MODULE(NULL,NULL);
DBMS_OUTPUT.PUT_LINE(vcReturnValue);
END;
/
… using each customer’s ID as the individual case ID for the model
Uncover Associations From Customer Survey Results (4)
SELECT * FROM (
SELECT
rule_id
,antecedent_predicate antecedent
,consequent_predicate consequent
,ROUND(rule_support,3) supp_rtg
,ROUND(rule_confidence,3) conf_rating
,NUMBER_OF_ITEMS num_items
FROM dm$vacustomer_renewability
ORDER BY rule_confidence DESC, rule_support DESC
)
WHERE ROWNUM <= 10
ORDER BY rule_id, antecedent, consequent;
RULE_ID ANTECEDENT             CONSEQUENT               SUPP_RTG CONF_RATING  NUM_ITEMS
------- ---------------------- ------------------------ -------- ----------- ----------
      1 RecommendedToColleague PlanningToRenew              .172        .895          2
      2 PlanningToRenew        RecommendedToColleague       .172        .179          2
      3 WindTurbineInterest    PlanningToRenew              .818        .953          2
      4 PlanningToRenew        WindTurbineInterest          .818        .853          2
      5 WindTurbineInterest    RecommendedToColleague       .162        .188          2
      6 RecommendedToColleague WindTurbineInterest          .162        .842          2
A very simple query against the results of the Association model shows the support and confidence ratings for each rule, that is, for each antecedent / consequent pair of survey responses
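To make the two ratings concrete, here is a minimal Python sketch of how support and confidence are commonly defined for a two-item rule: support is the share of all cases containing both items, and confidence is the share of cases with the antecedent that also contain the consequent. It uses only the rows visible in the survey excerpt shown earlier (the rest is elided on the slide), so its numbers will not match the model output. The function name `two_item_rules` is hypothetical, no minimum support or confidence threshold is applied, and this is not how the database computes its rules internally.

```python
from itertools import permutations

# Visible excerpt of the survey rows (CS_ID, CS_POSITIVE_RESPONSE). The full
# table is elided on the slide, so results here differ from the model's output.
rows = [
    (2633595, "PlanningToRenew"), (2633595, "WindTurbineInterest"),
    (2635273, "PlanningToRenew"), (2635273, "RecommendedToColleague"),
    (2635273, "WindTurbineInterest"),
    (2635712, "PlanningToRenew"),
    (2637099, "PlanningToRenew"),
    (2637464, "PlanningToRenew"), (2637464, "WindTurbineInterest"),
    (2637995, "PlanningToRenew"), (2637995, "RecommendedToColleague"),
    (2637995, "WindTurbineInterest"),
]


def two_item_rules(rows):
    """Return {(antecedent, consequent): (support, confidence)} for item pairs."""
    # Group items by case ID: one "basket" of responses per customer.
    baskets = {}
    for case_id, item in rows:
        baskets.setdefault(case_id, set()).add(item)
    n_cases = len(baskets)
    items = sorted({item for _, item in rows})

    rules = {}
    for a, b in permutations(items, 2):
        n_a = sum(1 for basket in baskets.values() if a in basket)
        n_ab = sum(1 for basket in baskets.values() if a in basket and b in basket)
        if n_ab:
            # support = P(a and b); confidence = P(b given a)
            rules[(a, b)] = (n_ab / n_cases, n_ab / n_a)
    return rules


rules = two_item_rules(rows)
for (a, b), (supp, conf) in sorted(rules.items()):
    print(f"{a} -> {b}: support={supp:.3f} confidence={conf:.3f}")
```

The same definitions explain why each pair of rules in the query output shares one support value but has two different confidence values: support is symmetric, while confidence depends on which item is the antecedent.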
Clustering, Brute Force: Looking for Solar Power Super-Users (1)
CREATE OR REPLACE VIEW solar_superusers AS
SELECT
sm_business_type
,sm_zipcode
,sm_id
,ROUND(AVG(smr_kwh_used),2) avg_kwh_used
,ROUND(AVG(smr_solar_kwh),2) avg_solar_kwh
,ROUND(AVG(smr_solar_kwh) / AVG(smr_kwh_used) ,2) pct_solar
,CASE
WHEN ROUND(AVG(smr_solar_kwh) / AVG(smr_kwh_used) ,2) >= 0.15
THEN 1 ELSE 0
END as solar_superuser
FROM
t_smartmeters
,t_meter_readings
WHERE smr_id = sm_id
GROUP BY sm_business_type, sm_zipcode, sm_id
ORDER BY sm_business_type, sm_zipcode, sm_id
;
We’ll use this metric to identify which SmartMeter customers are solar energy super-users (defined as 15% or more of their energy generated via solar power on-site)
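The super-user metric can also be sketched outside the database. The Python below mirrors the view's arithmetic for a single meter: average the readings, divide average solar kWh by average kWh used, round to two places, and flag the meter when the ratio reaches 0.15. The function name `solar_profile` and the sample readings are hypothetical.

```python
from statistics import mean


def solar_profile(readings, threshold=0.15):
    """Summarize one meter; readings is a list of (kwh_used, solar_kwh) tuples."""
    avg_used = mean(r[0] for r in readings)
    avg_solar = mean(r[1] for r in readings)
    # Same shape as the view: ratio of the two averages, rounded to 2 places.
    pct_solar = round(avg_solar / avg_used, 2)
    return {
        "avg_kwh_used": round(avg_used, 2),
        "avg_solar_kwh": round(avg_solar, 2),
        "pct_solar": pct_solar,
        "solar_superuser": 1 if pct_solar >= threshold else 0,
    }


# Hypothetical readings for two meters (not taken from the deck's data).
meters = {
    "M1": [(100.0, 20.0), (120.0, 22.0)],
    "M2": [(200.0, 10.0), (180.0, 12.0)],
}

for meter_id, readings in meters.items():
    print(meter_id, solar_profile(readings))
```

One caveat: Python's `round` and Oracle's `ROUND` can disagree on values that sit exactly on a rounding boundary, so a meter right at the threshold may be flagged differently here than in the view.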
Clustering, Brute Force: Looking for Solar Power Super-Users (2)
DECLARE
-- Processing variables:
SQLERRNUM INTEGER := 0;
SQLERRMSG VARCHAR2(255);
vcReturnValue VARCHAR2(256) := 'Failure!';
v_setlist DBMS_DATA_MINING.SETTING_LIST;
BEGIN
v_setlist('ALGO_NAME') := 'ALGO_KMEANS';
v_setlist('PREP_AUTO') := 'ON';
v_setlist('KMNS_DISTANCE') := 'KMNS_EUCLIDEAN';
v_setlist('KMNS_DETAILS') := 'KMNS_DETAILS_ALL';
v_setlist('KMNS_ITERATIONS') := '3';
v_setlist('KMNS_NUM_BINS') := '10';
. . .
We’ll use a k-Means algorithm to delve into which Smart Meters are clustered together, based on various attributes of the companies using them …
Clustering, Brute Force: Looking for Solar Power Super-Users (3)
. . .
DBMS_DATA_MINING.CREATE_MODEL2(
model_name => 'SOLAR_SUPERUSERS_CLUSTERING',
mining_function => 'CLUSTERING',
data_query => 'SELECT * FROM solar_superusers',
set_list => v_setlist,
case_id_column_name => 'SM_ID');
EXCEPTION
WHEN OTHERS THEN
SQLERRNUM := SQLCODE;
SQLERRMSG := SQLERRM;
vcReturnValue :=
'Error creating new OML4SQL model: ' ||
SQLERRNUM || ' - ' || SQLERRMSG;
DBMS_APPLICATION_INFO.SET_MODULE(NULL,NULL);
DBMS_OUTPUT.PUT_LINE(vcReturnValue);
END;
/
… and leverage the view we just created as the source for each Smart Meter’s related attributes
Clustering, Brute Force: ML Modeling Results
SELECT
cluster_id
,attribute_name
,operator
,numeric_value
,support
,ROUND(confidence,3) confidence
FROM dm$vrsolar_superusers_clustering
WHERE cluster_id IN (SELECT cluster_id
FROM dm$vdSOLAR_SUPERUSERS_CLUSTERING
WHERE left_child_id IS NULL AND right_child_id IS NULL)
ORDER BY
cluster_id
,attribute_name
,operator
,numeric_value;
Now that the clustering model is complete, we can review which features were grouped together, as well as the support and confidence ratings for their placement in each of the several clusters
Essentially, these are the rules that drive the clustering model to assign each subset of customers to a specific cluster
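For readers new to k-Means, here is a minimal textbook sketch in Python with Euclidean distance and a fixed number of iterations, echoing the Euclidean distance and three-iteration settings used above. It is not the database's implementation, which adds its own enhancements and automatic data preparation. The function name `kmeans`, the sample points, and the choice of starting centroids are all hypothetical.

```python
from math import dist


def kmeans(points, centroids, iterations=3):
    """Plain k-means: assign points to the nearest centroid, then recompute means."""
    centroids = [tuple(c) for c in centroids]
    assignment = []
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for every point.
        assignment = [
            min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[k] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, assignment


# Hypothetical (scaled avg_kwh_used, pct_solar) pairs, not the deck's data.
points = [(1.0, 0.05), (1.2, 0.04), (0.9, 0.06), (3.0, 0.30), (3.2, 0.28), (2.9, 0.33)]
centroids, assignment = kmeans(points, centroids=[points[0], points[3]])
print(assignment)
print(centroids)
```

Real attributes would need to be put on comparable scales first, otherwise the attribute with the largest numeric range dominates the distance.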
Missed It By That Much: Determining ML Model Accuracy
“All models are wrong, but some are useful.”
- George E. P. Box
Founder, Department of Statistics, UW-Madison
The good news: All Oracle Data Mining functions offer transparency into exactly how results have been produced via accuracy determination methods specific to each mining function
Mining Function   Accuracy Determination Methods
---------------   ---------------------------------------------------------
Clustering        Centroid
Classification    Confusion Matrix, Lift, ROC
Regression        Root Mean Squared Error (RMSE), Mean Absolute Error (MAE)
Time Series       Mean Square Error (MSE)
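As a rough illustration of three of the measures named above, the Python sketch below computes RMSE and MAE for a regression and the four cells of a binary confusion matrix for a classification. These are the standard textbook definitions, the function names are hypothetical, and the sample values are made up for the example.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root Mean Squared Error: square root of the mean squared difference."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean Absolute Error: mean of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def confusion_matrix(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts


# Made-up regression values and class labels, for illustration only.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]
print(rmse(y_true, y_pred), mae(y_true, y_pred))

labels_true = [1, 0, 1, 1, 0, 0]
labels_pred = [1, 0, 0, 1, 1, 0]
print(confusion_matrix(labels_true, labels_pred))
```

RMSE penalizes large misses more heavily than MAE because the errors are squared before averaging, which is why the two can rank the same models differently.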
Missed It By That Much: Determining ML Model Accuracy
Here’s an example of using a Radar Area graph to show the top 15 business types that are the most egregious outliers in terms of solar power used (aka distance from centroid)
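One way to read "distance from centroid" is sketched below in Python: compute a centroid as the mean of all vectors, measure each item's Euclidean distance from it, and keep the top N. This assumes a single overall centroid, whereas a clustering model measures distance to each item's own cluster centroid. The function name `rank_outliers`, the business types, and the figures are hypothetical.

```python
from math import dist


def rank_outliers(values_by_name, top_n=15):
    """Rank items by Euclidean distance from the mean of all vectors."""
    names = list(values_by_name)
    vectors = [values_by_name[name] for name in names]
    # Centroid = per-dimension mean across every vector.
    centroid = tuple(sum(axis) / len(vectors) for axis in zip(*vectors))
    scored = [(name, dist(vec, centroid)) for name, vec in zip(names, vectors)]
    # Largest distance first; keep only the top_n most distant items.
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_n]


# Hypothetical business types with (avg_kwh_used, avg_solar_kwh).
usage = {
    "Bakery": (10.0, 1.0),
    "Laundry": (12.0, 1.0),
    "Foundry": (40.0, 9.0),
    "Florist": (10.0, 1.0),
}

for name, distance in rank_outliers(usage, top_n=3):
    print(f"{name}: {distance:.2f}")
```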
AutoML: Let the Database Decide!
Check out the summary of all the latest AutoML enhancements!
This makes it easier for “citizen data scientists” to apply the power of ML & Analytics …
… the new AutoML interface makes selection of the proper algorithms a snap …
… and many more new features, including Graph Studio
Classification Via AutoML: A Demonstration
To illustrate how powerful a tool AutoML is, let’s run another experiment against the same data source … but this time producing models and results aimed at classification rather than regression
What could possibly go wrong?
Building a Data Source for AutoML to Devour
CREATE TABLE t_smartmeter_business_profiles AS
SELECT
sm_id
,CD.cd_minority_owned
,CD.cd_family_generations
,CD.cd_years_in_business
,CD.cd_locale_ownership
,CF.pct_profit_margin
,CF.avg_credit_score
,SM.avg_kwh_used
,SM.avg_solar_kwh
,SM.pct_solar
,SM.solar_superuser
FROM
t_customer_creditscoring CF
. . .
We’re drawing on data summarized from a Hybrid Partitioned table containing financial statistics …
. . .
,t_customer_demographics CD
,(SELECT
sm_id
,ROUND(AVG(smr_kwh_used),2) AS avg_kwh_used
,ROUND(AVG(smr_solar_kwh),2) AS avg_solar_kwh
,ROUND(AVG(smr_solar_kwh) / AVG(smr_kwh_used) ,2) AS pct_solar
,CASE
WHEN ROUND(AVG(smr_solar_kwh) / AVG(smr_kwh_used) ,2) >= 0.15
THEN 1 ELSE 0
END AS solar_superuser
FROM
t_smartmeters
,t_meter_readings
WHERE smr_id = sm_id
GROUP BY sm_id
ORDER BY sm_id) SM
WHERE SM.sm_id = CF.cf_id
AND SM.sm_id = CD.cd_id
ORDER BY sm_id;
… as well as customer demographics and solar energy usage data
Regression Experiments with AutoML (1)
1. First, select an appropriate data source
2. AutoML automatically builds a list of potential features and their key metrics
Regression Experiments with AutoML (2)
3. Review settings for prediction type, run time, model metric, and ML algorithms to apply
4. Start the experiment, choosing either speed or accuracy
Regression Experiments with AutoML (3)
5. AutoML now finishes any sampling needed and moves on to feature selection
6. Next, AutoML begins building the selected models
Regression Experiments with AutoML (4)
7. Model generation is complete! On to Feature Prediction Impact assessment …
Regression Experiments with AutoML (5)
8. Regression(s) complete! Now let’s transform the Neural Network model into a Zeppelin notebook, with just a few mouse clicks
Transform an AutoML Experiment into a NoteBook (1)
1. From the Leader Board, select one of the algorithms and click on Create Notebook
2. Name the new notebook
Transform an AutoML Experiment into a NoteBook (2)
3. The new notebook is ready. Click the link to start building paragraphs and retrieving data
4. Don’t know Python? No worries! The new notebook uses OML4Py to construct paragraphs for data retrieval and modeling
Transform an AutoML Experiment into a NoteBook (3)
5. Et voilà! Here are your first results from a notebook completely generated via AutoML!
Useful Resources and Documentation
OML Algorithms “Cheat Sheet”:
https://www.oracle.com/a/tech/docs/oml4sql-algorithm-cheat-sheet.pdf
Oracle 21c Machine Learning Basics (including AutoML):
https://docs.oracle.com/en/database/oracle/machine-learning/oml4sql/21/dmcon/machine-learning-basics.html
ADW Next Generation Availability Announcement:
https://www.oracle.com/news/announcement/oracle-adds-innovations-to-cloud-data-warehouse-031721.html

Cluster, Classify, Associate, Regress: Satisfy Your Inner Data Scientist with OML and Analytics

  • 1. Cluster, Classify, Associate, Regress: Satisfy Your Inner Data Scientist with OML and Analytics Jim Czuprynski @JimTheWhyGuy Zero Defect Computing, Inc. Photo credit: National Cancer Institute on UnSplash
  • 2. Who Am I, and How Did I Get Here? E-mail me at jczuprynski@zerodefectcomputing.com Follow me on Twitter (@JimTheWhyGuy) Connect with me on LinkedIn (Jim Czuprynski) Traveler & public speaker Summers: Wisconsin Winters: Illinois Cyclist XC skier Avid amateur bird watcher Oldest dude in Krav Maga class
  • 3. Jim Czuprynski Liron Amitzi https://www.beyondtechskills.com The podcast that talks about everything tech – except tech.TM
  • 4. Typical Business Use Cases Driving Machine Learning & Analytics Are any of our customers at financial risk to default on their investments in solar panel technology? Are current solar energy program customers likely to recommend our program to colleagues, or maybe even request more alternative energy sources? Has the 2020 pandemic hurt our best customers’ ability to continue to invest in alternate energy sources? What’s the long-term return our customers can expect to realize on their solar energy investment? Which customers would be best candidates to take advantage of our solar panel program?
  • 5. ML Basics: Clustering, Classification, Association, Regression Which customers would be best candidates to take advantage of our solar panel program? Are any of our customers at financial risk to default on their investments in solar panel technology? What’s the long-term return our customers can expect to realize on their solar energy investment? Are current solar energy program customers likely to recommend our program to colleagues, or maybe even request more alternative energy sources? Clustering identifies how certain like items are grouped together into one or more previously undefined categories Classification lets us predict the probability for a single item to belong to a defined category, or even multiple categories Regression gives us the power to project trends forward in time within upper and lower confidence boundaries Association discovers the chance that two or more items are likely to be found within the same collection as well as the relationships between those items Check out this summary that breaks out all of the inherent algorithms that support these methods
  • 6. SIMIOT: A Modern Electric Utility’s Information Model SMART_METERS Contains information about individual smart meters for which data is being collected METER_READINGS Individual readings for each smart meter over (often extremely!) short time intervals DISPATCH_CENTERS Utility repairs depots from which servicepeople & trucks are dispatched BUSINESS_DESCRIPTIONS Describes unique business classifications based on licensing issued CUSTOMER_DEMOGRAPHICS Demographics for each customer, including ethnicity, minority status, and employee counts CUSTOMER_FINANCIALS Pertinent financial data for each customer, including credit worthiness CUSTOMER_RESPONSES Selected social media posts reflecting how customers anonymously view services & products
  • 7. Probing Data Sources, Wherever They Live And What They Comprise Captured from an OSS stream - 10K readings every minute Most recent data within database, but history available as CSV files in Amazon S3 bucket (which cannot be moved!) Stored in an Excel spreadsheet … “somewhere” Pulled from City of Chicago public records via API calls, saved as CSVs and (re)loaded quarterly Captured via Twitter API stream, stored as multiple JSON documents
  • 8. Converged Database: A Vision for the Future, 21c and Beyond Personal / External Datasets Enterprise Applications Data Integration OAC, OML, APEX, and Graph Studio Ad hoc, Batch or Scheduled Business Leaders Analysts Data Scientists Developers OAC Dataflow, Manual or ETL Data Management ADW Business Analytics ERP CRM HCM Self-sufficient, encrypted, secured data storehouse Self-service analytics via ML REST-Enabled External APIs IoT and Edge Computing ATP AJD AGD The new kid on the block: Autonomous JSON Database Recently announced: Autonomous Database for Graph Studio AutoML features make it simple to apply the right algorithm(s) with confidence
  • 9. Picking the Right Algorithm: An Algorithm Would Be Nice … Here’s the good news: There’s at least one powerful Machine Learning algorithm to choose from for any analytic scenario Here’s the bad news: You’ve got to pick the right one considering its advantages and drawbacks for each scenario
  • 10. Uncover Associations From Customer Survey Results (1) CS_ID CS_POSITIVE_RESPONSE ---------- ------------------------- . . . 2633595 PlanningToRenew 2633595 WindTurbineInterest 2635273 PlanningToRenew 2635273 RecommendedToColleague 2635273 WindTurbineInterest 2635712 PlanningToRenew 2637099 PlanningToRenew 2637464 PlanningToRenew 2637464 WindTurbineInterest 2637995 PlanningToRenew 2637995 RecommendedToColleague 2637995 WindTurbineInterest . . . Our marketing department recently surveyed the top 100 users of our solar power initiative. Some were excited enough to recommend it to others, and some even want to expand to wind turbines … … and others, not so much.
  • 11. Uncover Associations From Customer Survey Results (2) DECLARE -- Processing variables: SQLERRNUM INTEGER := 0; SQLERRMSG VARCHAR2(255); vcReturnValue VARCHAR2(256) := 'Failure!'; v_setlist DBMS_DATA_MINING.SETTING_LIST; BEGIN v_setlst('ALGO_NAME’) := 'ALGO_APRIORI_ASSOCIATION_RULES'; v_setlst('PREP_AUTO’) := 'ON'; v_setlst('ASSO_MIN_SUPPORT') := '0.04'; v_setlst('ASSO_MIN_CONFIDENCE') := '0.1'; v_setlst('ASSO_MAX_RULE_LENGTH'):= '2'; v_setlst('ODMS_ITEM_ID_COLUMN_NAME'):= 'CS_POSITIVE_RESPONSE'; . . . We’ll use the APriori algorithm to uncover any hidden relationships between different responses to our recent customer satisfaction survey …
  • 12. Uncover Associations From Customer Survey Results (3) . . . DBMS_DATA_MINING.CREATE_MODEL2( MODEL_NAME => 'CUSTOMER_RENEWABILITY' ,MINING_FUNCTION => 'ASSOCIATION' ,DATA_QUERY => 'SELECT * FROM t_survey_results' ,SET_LIST => v_setlst ,CASE_ID_COLUMN_NAME => 'CS_ID'); EXCEPTION WHEN OTHERS THEN SQLERRNUM := SQLCODE; SQLERRMSG := SQLERRM; vcReturnValue := 'Error creating new OML4SQL model: ' || SQLERRNUM || ' - ' || SQLERRMSG; DBMS_APPLICATION_INFO.SET_MODULE(NULL,NULL); DBMS_OUTPUT.PUT_LINE(vcReturnValue); END; / … using each customer’s ID as the individual case ID for the model
  • 13. Uncover Associations From Customer Survey Results (4) SELECT * FROM ( SELECT rule_id ,antecedent_predicate antecedent ,consequent_predicate consequent ,ROUND(rule_support,3) supp_rtg ,ROUND(rule_confidence,3) conf_rating ,NUMBER_OF_ITEMS num_items FROM dm$vacustomer_renewability ORDER BY rule_confidence DESC, rule_support DESC ) WHERE ROWNUM <= 10 ORDER BY rule_id, antecedent, consequent; RULE_ID ANTECEDENT CONSEQUENT SUPP_RTG CONF_RATING NUM_ITEMS ------- ---------------------- ------------------------ -------- ----------- ---------- 1 RecommendedToColleague PlanningToRenew .172 .895 2 2 PlanningToRenew RecommendedToColleague .172 .179 2 3 WindTurbineInterest PlanningToRenew .818 .953 2 4 PlanningToRenew WindTurbineInterest .818 .853 2 5 WindTurbineInterest RecommendedToColleague .162 .188 2 6 RecommendedToColleague WindTurbineInterest .162 .842 2 A very simple query against the results of the Association model shows the individual support and confidence ratings for each survey response
  • 14. Clustering, Brute Force: Looking for Solar Power Super-Users (1)
    CREATE OR REPLACE VIEW solar_superusers AS
    SELECT
       sm_business_type
      ,sm_zipcode
      ,sm_id
      ,ROUND(AVG(smr_kwh_used),2) avg_kwh_used
      ,ROUND(AVG(smr_solar_kwh),2) avg_solar_kwh
      ,ROUND(AVG(smr_solar_kwh) / AVG(smr_kwh_used),2) pct_solar
      ,CASE
         WHEN ROUND(AVG(smr_solar_kwh) / AVG(smr_kwh_used),2) >= 0.15 THEN 1
         ELSE 0
       END AS solar_superuser
      FROM
         t_smartmeters
        ,t_meter_readings
     WHERE smr_id = sm_id
     GROUP BY sm_business_type, sm_zipcode, sm_id
     ORDER BY sm_business_type, sm_zipcode, sm_id
    ;
    We’ll use this metric to identify which SmartMeter customers are solar energy super-users (defined as 15% or more of their energy generated via solar power on-site)
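    The super-user metric in the view above can be sketched outside the database as well. The following is a minimal Python illustration, not the author's code: the function names and the sample readings are hypothetical, and returning `None` for a meter with zero average usage is this sketch's own choice (the SQL as written would raise a divide-by-zero error instead). Python's `round` can also differ from Oracle's `ROUND` on exact half-way values.

```python
from statistics import mean
from typing import Optional, Sequence

# 15% threshold taken from the slide's definition of a super-user.
SUPERUSER_THRESHOLD = 0.15


def pct_solar(kwh_used: Sequence[float], solar_kwh: Sequence[float]) -> Optional[float]:
    """Ratio of average solar kWh to average kWh used, rounded to 2 places.

    Returns None when average usage is zero (an assumption of this sketch).
    """
    avg_used = mean(kwh_used)
    if avg_used == 0:
        return None
    return round(mean(solar_kwh) / avg_used, 2)


def is_solar_superuser(kwh_used: Sequence[float], solar_kwh: Sequence[float]) -> int:
    """Return 1 if the meter meets the 15% threshold, else 0 (mirrors the CASE expression)."""
    pct = pct_solar(kwh_used, solar_kwh)
    return 1 if pct is not None and pct >= SUPERUSER_THRESHOLD else 0


# Hypothetical readings for two meters.
meter_a = is_solar_superuser(kwh_used=[100, 200], solar_kwh=[10, 40])
meter_b = is_solar_superuser(kwh_used=[100, 100], solar_kwh=[5, 15])
print(meter_a, meter_b)
```

    Because the flag is derived from `pct_solar`, a clustering model that sees both columns is being handed the answer twice; that is worth keeping in mind when reading the cluster rules later in the deck.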
  • 15. Clustering, Brute Force: Looking for Solar Power Super-Users (2)
    DECLARE
      -- Processing variables:
      SQLERRNUM     INTEGER := 0;
      SQLERRMSG     VARCHAR2(255);
      vcReturnValue VARCHAR2(256) := 'Failure!';
      v_setlist     DBMS_DATA_MINING.SETTING_LIST;
    BEGIN
      v_setlist('ALGO_NAME')       := 'ALGO_KMEANS';
      v_setlist('PREP_AUTO')       := 'ON';
      v_setlist('KMNS_DISTANCE')   := 'KMNS_EUCLIDEAN';
      v_setlist('KMNS_DETAILS')    := 'KMNS_DETAILS_ALL';
      v_setlist('KMNS_ITERATIONS') := '3';
      v_setlist('KMNS_NUM_BINS')   := '10';
      . . .
    We’ll use a k-Means algorithm to delve into which Smart Meters are clustered together, based on various attributes of the companies using them …
  • 16. Clustering, Brute Force: Looking for Solar Power Super-Users (3)
      . . .
      DBMS_DATA_MINING.CREATE_MODEL2(
        model_name          => 'SOLAR_SUPERUSERS_CLUSTERING',
        mining_function     => 'CLUSTERING',
        data_query          => 'SELECT * FROM solar_superusers',
        set_list            => v_setlist,
        case_id_column_name => 'SM_ID');
    EXCEPTION
      WHEN OTHERS THEN
        SQLERRNUM := SQLCODE;
        SQLERRMSG := SQLERRM;
        vcReturnValue := 'Error creating new OML4SQL model: ' || SQLERRNUM || ' - ' || SQLERRMSG;
        DBMS_APPLICATION_INFO.SET_MODULE(NULL,NULL);
        DBMS_OUTPUT.PUT_LINE(vcReturnValue);
    END;
    /
    … and leverage the view we just created as the source for each Smart Meter’s related attributes
  • 17. Clustering, Brute Force: ML Modeling Results
    SELECT
       cluster_id
      ,attribute_name
      ,operator
      ,numeric_value
      ,support
      ,ROUND(confidence,3) confidence
      FROM dm$vrsolar_superusers_clustering
     WHERE cluster_id IN (
             SELECT cluster_id
               FROM dm$vdSOLAR_SUPERUSERS_CLUSTERING
              WHERE left_child_id IS NULL
                AND right_child_id IS NULL)
     ORDER BY cluster_id, attribute_name, operator, numeric_value;
    Now that the clustering model is complete, we can review which features were grouped together, as well as the support and confidence ratings for their placement in each of the several clusters.
    Essentially, these are the rules that drive the clustering model to assign each subset of customers to a specific cluster.
  • 18. Missed It By That Much: Determining ML Model Accuracy
    “All models are wrong, but some are useful.” - George E. P. Box, Founder, Department of Statistics, UW-Madison
    The good news: All Oracle Data Mining functions offer transparency into exactly how results have been produced via accuracy determination methods specific to each mining function

    Mining Function   Accuracy Determination Methods
    ---------------   ---------------------------------------------------------
    Clustering        Centroid
    Classification    Confusion Matrix, Lift, ROC
    Regression        Root Mean Squared Error (RMSE), Mean Absolute Error (MAE)
    Time Series       Mean Square Error (MSE)
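    To make the regression metrics named on this slide concrete, here is a small Python sketch of RMSE and MAE. It is an illustration only: the function names and sample values are hypothetical, and it does not reproduce how Oracle Machine Learning computes or reports these figures internally.

```python
import math
from typing import Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root Mean Squared Error: square root of the mean squared residual."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean Absolute Error: mean of the absolute residuals."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)


# Hypothetical actual vs. predicted values (e.g. kWh), residuals of 2, 2, 3, 3.
actual_values = [10.0, 20.0, 30.0, 40.0]
predicted_values = [12.0, 18.0, 33.0, 37.0]

print(round(rmse(actual_values, predicted_values), 4))
print(mae(actual_values, predicted_values))
```

    RMSE penalizes large misses more heavily than MAE because residuals are squared before averaging, so RMSE is never smaller than MAE on the same data.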
  • 19. Missed It By That Much: Determining ML Model Accuracy Here’s an example of using a Radar Area graph to show the top 15 business types that are the most egregious outliers in terms of solar power used (aka distance from centroid)
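    The "distance from centroid" idea behind that graph can be sketched in a few lines of Python. This is an assumption-laden illustration: the business types, the two pre-scaled features, and the single-cluster setup are all hypothetical, and the actual k-Means model applies its own automatic data preparation and per-cluster centroids.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, ...]


def centroid(points: List[Point]) -> Point:
    """Component-wise mean of a list of equal-length points."""
    count = len(points)
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / count for d in range(dims))


def rank_outliers(items: Dict[str, Point]) -> List[Tuple[str, float]]:
    """Return (name, Euclidean distance to centroid), farthest first."""
    center = centroid(list(items.values()))
    distances = [(name, math.dist(point, center)) for name, point in items.items()]
    return sorted(distances, key=lambda pair: pair[1], reverse=True)


# Hypothetical, already-normalized features: (relative kWh used, solar share).
business_types = {
    "Bakery": (0.10, 0.20),
    "Brewery": (0.12, 0.22),
    "Florist": (0.11, 0.18),
    "Data Center": (0.90, 0.05),
}

ranked = rank_outliers(business_types)
for name, dist in ranked:
    print(f"{name:12s} {dist:.3f}")
```

    Taking the first 15 entries of such a ranking is one way to arrive at a "top 15 outliers" list like the one charted on the slide.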
  • 20. AutoML: Let the Database Decide! Check out the summary of all the latest AutoML enhancements! This makes it easier for “citizen data scientists” to apply the power of ML & Analytics … … the new AutoML interface makes selection of the proper algorithms a snap … … and many more new features, including Graph Studio
  • 21. Classification Via AutoML: A Demonstration To illustrate how powerful a tool AutoML is, let’s run another experiment against the same data source … but this time aimed at producing classification models rather than regression models. What could possibly go wrong?
  • 22. Building a Data Source for AutoML to Devour
    CREATE TABLE t_smartmeter_business_profiles AS
    SELECT
       sm_id
      ,CD.cd_minority_owned
      ,CD.cd_family_generations
      ,CD.cd_years_in_business
      ,CD.cd_locale_ownership
      ,CF.pct_profit_margin
      ,CF.avg_credit_score
      ,SM.avg_kwh_used
      ,SM.avg_solar_kwh
      ,SM.pct_solar
      ,SM.solar_superuser
      FROM
         t_customer_creditscoring CF
      . . .
    We’re drawing on data summarized from a Hybrid Partitioned table containing financial statistics …
      . . .
        ,t_customer_demographics CD
        ,(SELECT
             sm_id
            ,ROUND(AVG(smr_kwh_used),2) AS avg_kwh_used
            ,ROUND(AVG(smr_solar_kwh),2) AS avg_solar_kwh
            ,ROUND(AVG(smr_solar_kwh) / AVG(smr_kwh_used),2) AS pct_solar
            ,CASE
               WHEN ROUND(AVG(smr_solar_kwh) / AVG(smr_kwh_used),2) >= 0.15 THEN 1
               ELSE 0
             END AS solar_superuser
            FROM
               t_smartmeters
              ,t_meter_readings
           WHERE smr_id = sm_id
           GROUP BY sm_id
           ORDER BY sm_id) SM
     WHERE SM.sm_id = CF.cf_id
       AND SM.sm_id = CD.cd_id
     ORDER BY sm_id;
    … as well as customer demographics and solar energy usage data
  • 23. Regression Experiments with AutoML (1) Step 1: Select an appropriate data source. Step 2: AutoML automatically builds a list of potential features and their key metrics.
  • 24. Regression Experiments with AutoML (2) Step 3: Review settings for prediction type, run time, model metric, and ML algorithms to apply. Step 4: Start the experiment, choosing either speed or accuracy.
  • 25. Regression Experiments with AutoML (3) Step 5: AutoML now finishes any sampling needed and moves on to feature selection. Step 6: Next, AutoML begins building the selected models.
  • 26. Regression Experiments with AutoML (4) Step 7: Model generation is complete! On to Feature Prediction Impact assessment …
  • 27. Regression Experiments with AutoML (5) Step 8: Regression(s) complete! Now let’s transform the Neural Network model into a Zeppelin notebook, with just a few mouse clicks.
  • 28. Transform an AutoML Experiment into a NoteBook (1) Step 1: From the Leader Board, select one of the algorithms and click on Create Notebook. Step 2: Name the new notebook.
  • 29. Transform an AutoML Experiment into a NoteBook (2) Step 3: The new notebook is ready. Click the link to start building paragraphs and retrieving data. Step 4: Don’t know Python? No worries! The new notebook uses OML4Py to construct paragraphs for data retrieval and modeling.
  • 30. Transform an AutoML Experiment into a NoteBook (3) Step 5: Et voilà! Here are your first results from a notebook completely generated via AutoML!
  • 31. Useful Resources and Documentation
    OML Algorithms “Cheat Sheet”: https://www.oracle.com/a/tech/docs/oml4sql-algorithm-cheat-sheet.pdf
    Oracle 21c Machine Learning Basics (including AutoML): https://docs.oracle.com/en/database/oracle/machine-learning/oml4sql/21/dmcon/machine-learning-basics.html
    ADW Next Generation Availability Announcement: https://www.oracle.com/news/announcement/oracle-adds-innovations-to-cloud-data-warehouse-031721.html

Editor's Notes

  • #14: Support is the fraction of all transactions in which the antecedent and the consequent appear together. Thus, a support rating of .818 for Rule #3 means that interest in wind turbines and an expectation to renew solar power service occur together in 81.8% of the transactions this model analyzed. Confidence is the fraction of transactions containing the antecedent that also contain the consequent, so unlike support it depends on the direction of the rule. Thus, for Rule #4, 85.3% of the customers who expect to renew their solar power service are also interested in wind turbines, while Rule #3 (the reverse direction) has a confidence of 95.3% on the same support.
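    The two measures are easy to check by hand. Below is a minimal Python sketch, assuming a toy set of five survey "transactions"; the data and function names are hypothetical and only reuse the response labels from the slide, and this is not how the Apriori implementation in the database computes its results.

```python
from typing import FrozenSet, Iterable, List


def count_containing(transactions: List[FrozenSet[str]], items: Iterable[str]) -> int:
    """Number of transactions that contain every item in `items`."""
    wanted = frozenset(items)
    return sum(1 for t in transactions if wanted <= t)


def support(transactions: List[FrozenSet[str]], antecedent: str, consequent: str) -> float:
    """Share of all transactions containing both antecedent and consequent."""
    return count_containing(transactions, [antecedent, consequent]) / len(transactions)


def confidence(transactions: List[FrozenSet[str]], antecedent: str, consequent: str) -> float:
    """Share of antecedent-containing transactions that also contain the consequent."""
    with_antecedent = count_containing(transactions, [antecedent])
    if with_antecedent == 0:
        return 0.0
    return count_containing(transactions, [antecedent, consequent]) / with_antecedent


# Hypothetical survey responses, one set of positive responses per customer.
survey = [
    frozenset({"PlanningToRenew", "WindTurbineInterest"}),
    frozenset({"PlanningToRenew", "WindTurbineInterest", "RecommendedToColleague"}),
    frozenset({"PlanningToRenew"}),
    frozenset({"RecommendedToColleague"}),
    frozenset({"PlanningToRenew", "WindTurbineInterest"}),
]

print(support(survey, "WindTurbineInterest", "PlanningToRenew"))
print(confidence(survey, "WindTurbineInterest", "PlanningToRenew"))
print(confidence(survey, "PlanningToRenew", "WindTurbineInterest"))
```

    In this toy data both rule directions share one support value, but their confidences differ because the denominators differ, which is the same pattern seen between Rules #3 and #4 on the slide.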