From DBA to DE:
Becoming a Data Engineer
Jim Czuprynski
Zero Defect Computing, Inc.
Who Am I, and What Am I Doing Here?
➢E-mail me at jim@jimthewhyguy.com
➢Follow me on Twitter (@JimTheWhyGuy)
➢Connect with me on LinkedIn (Jim Czuprynski)
➢Traveler & public speaker
➢Summers: Wisconsin; Winters: Illinois
➢Cyclist
➢XC skier
➢Avid amateur bird watcher
➢Oldest dude in martial arts class
Jim Czuprynski
Liron Amitzi
https://www.beyondtechskills.com
The podcast that talks about everything tech – except tech.™
What Does a Modern Oracle DBA Spend Her Time On?
• Protecting database health, recoverability and security
• Tuning queries for optimal performance and efficiency
• Building flexible yet resilient data models, thus ensuring data is accurate and trustworthy
• Keeping data sources as pristine as possible to refresh data domains efficiently
Not Everyone Can Be A Data Scientist. Thank Goodness.
Data scientists report that they typically spend as much as 90% of their time cleansing data …
… and that’s when they’re not searching for relevant data, in numerous places, in different formats …
… while ensuring their selected data is sufficiently anonymized to protect subjects’ privacy
What they’d rather be doing: Training models and interpreting results for useful insights
Data Science Is Just Like Application Development. (Not!)
DevOps: CI/CD Process Flow
• Focus: Capturing, retaining, and reporting on data
• Errors are relatively, if not immediately, apparent
• Worst case: Roll back to a prior version of the
application and its objects within the database*
* Assuming you’ve planned for that eventuality!
Credit: Giuliano Liguori (@ingliguori)
Data Science: Data > Useful Model(s)
• Focus: Accurate (and thus useful) models
• Machine Learning / AI involves extremely complex mathematics that devours computing cycles
• Worst case: A perfect model is now utterly inaccurate!
• Underfit: The model is too simple, or its initial training data too poor, to capture the real pattern, so it is a bad model precisely when it’s most needed
• Overfit: The model fits its initial training data too closely, so it looks good initially … and then new, never-before-seen data screws up everything
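To make the underfit / overfit distinction concrete, here is a minimal sketch in plain Python. Everything in it is an assumption chosen for illustration: the made-up data (a true signal of y = 2x with alternating noise) and the three toy models (a constant, a least-squares line, and a memorizing nearest-neighbor lookup). It is not taken from any real project.

```python
# Toy illustration of underfitting vs. overfitting (all data is made up).
# True signal: y = 2x. Training labels carry alternating +1/-1 noise.
train_x = [float(i) for i in range(10)]
train_y = [2.0 * x + (1.0 if i % 2 == 0 else -1.0) for i, x in enumerate(train_x)]

# New, never-before-seen points that follow the true signal.
test_x = [i + 0.4 for i in range(9)]
test_y = [2.0 * x for x in test_x]


def mse(model, xs, ys):
    """Mean squared error of a model (a function x -> prediction)."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_constant(xs, ys):
    """Underfit: ignore x entirely and always predict the mean of y."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Reasonable fit: ordinary least-squares straight line."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_memorizer(xs, ys):
    """Overfit: memorize every training point, noise included (1-nearest neighbor)."""
    points = list(zip(xs, ys))
    return lambda x: min(points, key=lambda p: abs(p[0] - x))[1]


models = {
    "underfit (constant)": fit_constant(train_x, train_y),
    "good fit (line)": fit_line(train_x, train_y),
    "overfit (memorizer)": fit_memorizer(train_x, train_y),
}
# For each model: (error on training data, error on unseen data).
scores = {
    name: (mse(m, train_x, train_y), mse(m, test_x, test_y))
    for name, m in models.items()
}
for name, (train_err, test_err) in scores.items():
    print(f"{name:22s} train MSE={train_err:6.2f}  test MSE={test_err:6.2f}")
```

The memorizer scores a perfect zero on its own training data yet does worse than the plain line on unseen points, while the constant model is poor on both. That is the pattern to watch for when comparing training and hold-out metrics.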
Who Said AI/ML Was Easy?
From a recent seminar with Intel’s AI/ML team:
It turns out that most of the time, technology isn’t the cause of project failure; rather, the human dimension is often the root cause
Of course, it’s more complicated than this. Check out my recent blog post for deeper insights
The Scourge of Bad Data (1)
CREATE TABLE t_patients (
     pa_id             NUMBER       NOT NULL
    ,pa_first_name     VARCHAR2(40) NOT NULL
    ,pa_last_name      VARCHAR2(40) NOT NULL
    ,pa_middle_initial CHAR(01)     NOT NULL
    ,pa_sex            CHAR(01)     NOT NULL
    . . .
);
What should be the CHECK constraint for the pa_sex column? (M)ale and (F)emale are obvious choices …
… but how do we classify trans-sexual people, or those who don’t want to reveal their sex at all?
Note: We haven’t even talked about the concept of gender yet.
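One possible answer, sketched below, is to widen the column's domain with explicit codes for "another value" and "undisclosed" instead of forcing every row into M or F. This is only a sketch under stated assumptions: SQLite (via Python's sqlite3 module) stands in for Oracle, the table is trimmed to a few columns, and the codes X and U are hypothetical. The right code list is a business decision, not a DBA's guess.

```python
import sqlite3

# SQLite stands in for Oracle here; the X and U codes are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE t_patients (
        pa_id         INTEGER PRIMARY KEY,
        pa_first_name TEXT NOT NULL,
        pa_last_name  TEXT NOT NULL,
        pa_sex        TEXT NOT NULL
            CHECK (pa_sex IN ('M', 'F', 'X', 'U'))  -- X = another value, U = undisclosed
    )
    """
)


def add_patient(pa_id, first_name, last_name, sex):
    """Insert one patient; return True if the row passed the constraint."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO t_patients VALUES (?, ?, ?, ?)",
                (pa_id, first_name, last_name, sex),
            )
        return True
    except sqlite3.IntegrityError:
        return False


accepted = add_patient(1, "Pat", "Example", "U")  # explicit "undisclosed" code
rejected = add_patient(2, "Sam", "Example", "?")  # outside the agreed domain
row_count = conn.execute("SELECT COUNT(*) FROM t_patients").fetchone()[0]
print(accepted, rejected, row_count)
```

The point of the explicit U code is that "we don't know" becomes a deliberate, queryable value rather than whatever the data-entry clerk types to get past a NOT NULL column.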
The Scourge of Bad Data (2)
An IT professional wanted to mess with California’s Automatic License Plate Reader system … so he registered his vanity plate as the word NULL
The next year, he got a $35 ticket when he tried to renew his registration … because NULL was no longer acceptable
After he paid the ticket, the 3rd party administrator of the ticket fines collection system apparently connected his personal details to all plates which law enforcement officers (LEOs) had registered as missing or invalid
$12,000 in fines later, he realized the joke was on him
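The mechanism behind this story is only "apparent", so the following is a guess at it rather than a reconstruction: a minimal sketch, using SQLite through Python's sqlite3 module, of what happens when a system records a missing plate as the string 'NULL' instead of a real SQL NULL. The table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE vehicles  (plate TEXT, owner TEXT);
    CREATE TABLE citations (id INTEGER PRIMARY KEY, plate_text TEXT, plate_null TEXT);

    INSERT INTO vehicles VALUES ('NULL', 'Vanity plate owner');

    -- One genuine citation for the vanity plate, plus two where the plate
    -- was unreadable. plate_text uses the string 'NULL' as its "missing"
    -- marker; plate_null uses a real SQL NULL instead.
    INSERT INTO citations VALUES (1, 'NULL', 'NULL');
    INSERT INTO citations VALUES (2, 'NULL', NULL);
    INSERT INTO citations VALUES (3, 'NULL', NULL);
    """
)


def citations_for_owner(column):
    """Count citations that join to a registered vehicle via the given column."""
    # The column name is interpolated only because it comes from the two
    # hard-coded calls below, never from user input.
    sql = f"""
        SELECT COUNT(*)
          FROM citations c
          JOIN vehicles v ON v.plate = c.{column}
    """
    return conn.execute(sql).fetchone()[0]


with_string_marker = citations_for_owner("plate_text")
with_real_null = citations_for_owner("plate_null")
print(with_string_marker, with_real_null)
```

With the string marker, every unreadable plate joins to the one driver whose plate really is NULL. With a true SQL NULL, the comparison is never true, so only his genuine citation matches.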
The Scourge of Bad Data (3)
Any decent clustering ML algorithm would likely produce findings like this when looking for unseen patterns within features
The reason for the identical dates? Placeholder values were entered for birth date (1/1/00) and registration date (1/1/18) in some municipalities’ voting records during conversion to a centralized voter registration system in 2002
And the duplicate phone numbers? They turned out to match a City of Racine office telephone number that had been entered by default because Racine’s voting registration system required a non-NULL value
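A Data Engineer can catch placeholder values like these before any clustering algorithm does. Here is a minimal sketch of one simple check: flag any single value that accounts for a suspiciously large share of a column. The function name, the 20% threshold, and the sample values are all assumptions for illustration. They are invented, not real voter data.

```python
from collections import Counter


def suspicious_defaults(values, min_share=0.2):
    """Return {value: share} for values that make up at least min_share of a column."""
    counts = Counter(values)
    total = len(values)
    return {v: n / total for v, n in counts.items() if n / total >= min_share}


# Hypothetical registration dates and phone numbers (invented for this sketch).
registration_dates = ["1918-01-01", "2004-10-12", "1918-01-01", "2012-08-30",
                      "1918-01-01", "2016-03-05", "1999-11-02", "1918-01-01"]
phone_numbers = ["555-0100", "555-0100", "555-0123", "555-0100",
                 "555-0145", "555-0100", "555-0100", "555-0199"]

flagged_dates = suspicious_defaults(registration_dates)
flagged_phones = suspicious_defaults(phone_numbers)
print(flagged_dates)
print(flagged_phones)
```

A flagged value is not proof of bad data, only a prompt to ask where it came from. A sensible threshold depends on the column: a shared office phone number and a popular surname look identical to this check.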
So What Does a DE Do, Exactly?
Giuliano Liguori, a well-known digital transformation leader, recently described the differences between Data Scientists, Data Engineers, and Data Analysts
So What Does a DE Do, Exactly?
Skills shown: SQL, Excel, Tableau, Python, Cloud, R, Distributed Computing
Guess what? If you’re a DBA or Developer, you’re already doing most of the work of a Data Engineer!
Notice what’s at dead center of these skillsets? That’s right. The sharpest tool you already know.
What Current DE Skills Do I Need?
Note: These are only my impressions of what skills are typically needed across a wide spectrum.
So what skills do your Data Science team really need? Ask them.
• Understand statistics & probability. Remember all that high school math you asked your teacher if you’d ever really use in real life? Yeah. It’s this stuff.
• Know how your DS team extracts & processes data (pandas, etc.). Yep, this means learning at least one other new language: Python
• Learn to clean & transform data before your DS team needs to
• Grasp key metrics of model success
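As a taste of what "key metrics of model success" can mean, here is a short sketch in plain Python of four commonly used ones: accuracy, precision, and recall for a yes/no classifier, and root mean squared error (RMSE) for a regression. The choice of these four is an assumption, as are the made-up labels and predictions. Your Data Science team may care about different metrics, so ask them.

```python
import math


def classification_metrics(actual, predicted):
    """Accuracy, precision, and recall for binary labels (1 = positive)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def rmse(actual, predicted):
    """Root mean squared error for a regression model."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Made-up labels and predictions, purely for illustration.
labels      = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(labels, predictions)
error = rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
print(metrics, round(error, 3))
```

Precision asks "of everything the model flagged, how much was right?" and recall asks "of everything it should have flagged, how much did it find?". Which one matters more depends on the cost of each kind of mistake.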
How Do You Get To Carnegie Hall? Practice, Practice, Practice.
If you’re still a “core” DBA, don’t fret! You can start practicing all the skills you’ll need to become a Data Engineer
It’s easy to leverage the extremely powerful Machine Learning (ML) algorithms and Analytic functions already within the Oracle database …
… because sometimes the only way to acquire the skills for a new career vector is to read > learn > do > teach
Check out the latest features of Autonomous Database, including AutoML, OML4Py, OML4SQL, Property Graph support, and Graph Studio UI
Configuring Your OML Environment (1)
1. Request new ML User creation
2. Specify username, password, and details
Leveraging DBMS_DATA_MINING (1)
1. Access Machine Learning tools …
2. … and choose from a number of available data mining examples and templates
AutoML: Let the Database Decide!
Check out the summary of all the latest AutoML enhancements!
This makes it easier for “citizen data scientists” to apply the power of ML & Analytics …
… the new AutoML interface makes selection of the proper algorithms a snap …
… and many more new features, including Graph Studio
Building a Data Source for AutoML to Devour
CREATE TABLE t_smartmeter_business_profiles AS
SELECT
sm_id
,CD.cd_minority_owned
,CD.cd_family_generations
,CD.cd_years_in_business
,CD.cd_locale_ownership
,CF.pct_profit_margin
,CF.avg_credit_score
,SM.avg_kwh_used
,SM.avg_solar_kwh
,SM.pct_solar
,SM.solar_superuser
FROM
t_customer_creditscoring CF
. . .
We’re drawing on
data summarized
from a Hybrid
Partitioned table
containing financial
statistics …
Building a Data Source for AutoML to Devour
We’re drawing on data summarized from a Hybrid Partitioned table containing financial statistics, as well as customer demographics and solar energy usage data:
CREATE TABLE t_smartmeter_business_profiles AS
SELECT
     sm_id
    ,CD.cd_minority_owned
    ,CD.cd_family_generations
    ,CD.cd_years_in_business
    ,CD.cd_locale_ownership
    ,CF.pct_profit_margin
    ,CF.avg_credit_score
    ,SM.avg_kwh_used
    ,SM.avg_solar_kwh
    ,SM.pct_solar
    ,SM.solar_superuser
  FROM
     t_customer_creditscoring CF
    . . .
    ,t_customer_demographics CD
    ,(SELECT
           sm_id
          ,ROUND(AVG(smr_kwh_used),2) AS avg_kwh_used
          ,ROUND(AVG(smr_solar_kwh),2) AS avg_solar_kwh
          ,ROUND(AVG(smr_solar_kwh) / AVG(smr_kwh_used) ,2) AS pct_solar
          ,CASE
             WHEN ROUND(AVG(smr_solar_kwh) / AVG(smr_kwh_used) ,2) >= 0.15
             THEN 1 ELSE 0
           END AS solar_superuser
        FROM
           t_smartmeters
          ,t_meter_readings
       WHERE smr_id = sm_id
       GROUP BY sm_id
       ORDER BY sm_id) SM
 WHERE SM.sm_id = CF.cf_id
   AND SM.sm_id = CD.cd_id
 ORDER BY sm_id;
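If you want to practice the per-meter aggregation without an Oracle instance handy, here is a sketch of just the inner SM summary, run against SQLite through Python's sqlite3 module. The sample readings are invented, the explicit JOIN syntax is a stylistic choice, and the NULLIF guard is an addition that is not in the original statement: Oracle raises ORA-01476 when a divisor is zero, whereas SQLite quietly returns NULL, so here the guard mainly documents what an Oracle version would need for a meter with no usage.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE t_smartmeters    (sm_id INTEGER PRIMARY KEY);
    CREATE TABLE t_meter_readings (smr_id INTEGER, smr_kwh_used REAL, smr_solar_kwh REAL);

    INSERT INTO t_smartmeters VALUES (1), (2), (3);
    INSERT INTO t_meter_readings VALUES
        (1, 100.0, 30.0), (1, 100.0, 10.0),   -- meter 1: 20% solar on average
        (2, 200.0, 10.0), (2, 200.0, 10.0),   -- meter 2: 5% solar on average
        (3,   0.0,  0.0);                     -- meter 3: no usage at all
    """
)

# Same shape as the SM inline view: one summary row per smart meter.
rows = conn.execute(
    """
    SELECT
         sm.sm_id
        ,ROUND(AVG(smr.smr_kwh_used), 2)  AS avg_kwh_used
        ,ROUND(AVG(smr.smr_solar_kwh), 2) AS avg_solar_kwh
        ,ROUND(AVG(smr.smr_solar_kwh) / NULLIF(AVG(smr.smr_kwh_used), 0), 2) AS pct_solar
        ,CASE
            WHEN ROUND(AVG(smr.smr_solar_kwh) / NULLIF(AVG(smr.smr_kwh_used), 0), 2) >= 0.15
            THEN 1 ELSE 0
         END AS solar_superuser
      FROM t_smartmeters sm
      JOIN t_meter_readings smr ON smr.smr_id = sm.sm_id
     GROUP BY sm.sm_id
     ORDER BY sm.sm_id
    """
).fetchall()

for row in rows:
    print(row)
```

The meter with zero usage comes back with a NULL pct_solar and a solar_superuser flag of 0. Whether that is the right label for "no data to judge by" is exactly the kind of question to settle before a model trains on the flag.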
Regression Experiments with AutoML (1)
1. First, select an appropriate data source
2. AutoML automatically builds a list of potential features and their key metrics
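To get a feel for what "features and their key metrics" might look like, here is a hand-rolled sketch that ranks candidate features by the strength of their linear correlation with a target. To be clear about the assumptions: this is not how Oracle's AutoML scores features, the choice of Pearson correlation is mine, and the sample values (and the use of pct_profit_margin as the target) are invented for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


# Hypothetical rows shaped like t_smartmeter_business_profiles (values invented).
features = {
    "cd_years_in_business": [2, 5, 8, 12, 20, 25],
    "avg_credit_score":     [610, 700, 640, 720, 690, 650],
    "pct_solar":            [0.02, 0.05, 0.09, 0.14, 0.21, 0.30],
}
target = [4.0, 6.5, 9.0, 13.0, 19.5, 27.0]  # stand-in for pct_profit_margin

# Strongest absolute correlation first.
ranking = sorted(
    ((name, pearson(values, target)) for name, values in features.items()),
    key=lambda item: abs(item[1]),
    reverse=True,
)
for name, r in ranking:
    print(f"{name:22s} r = {r:+.3f}")
```

Correlation only captures straight-line relationships, and two features that rank highly may simply be duplicates of each other, so treat a list like this as a starting point for questions rather than a verdict.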
Regression Experiments with AutoML (2)
3. Review settings for prediction type, run time, model metric, and ML algorithms to apply
4. Start the experiment, choosing either speed or accuracy
Regression Experiments with AutoML (3)
5. AutoML now finishes any sampling needed and moves on to feature selection
6. Next, AutoML begins building the selected models
Regression Experiments with AutoML (4)
7. Model generation is complete! On to Feature Prediction Impact assessment …
Regression Experiments with AutoML (5)
8. Regression(s) complete! Now let’s transform the Neural Network model into a Zeppelin notebook, with just a few mouse clicks
Transform an AutoML Experiment into a NoteBook (1)
1. From the Leader Board, select one of the algorithms and click on Create Notebook
2. Name the new notebook
Transform an AutoML Experiment into a NoteBook (2)
3. The new notebook is ready. Click the link to start building paragraphs and retrieving data
4. Don’t know Python? No worries! The new notebook uses OML4Py to construct paragraphs for data retrieval and modeling
Transform an AutoML Experiment into a NoteBook (3)
5. Et voilà! Here are your first results from a notebook completely generated via AutoML!
How Do I Keep My DE Career Relevant?
How did you keep your Developer / DBA career relevant? How is this any different?
✓ Associate with other DEs, and help uplift others to DE status
✓ Attend conferences and training sessions on latest industry trends
✓ Consider certifying your hard-won, newly-acquired skills
They call it life-long learning for a reason - it never, ever stops!
Are There Any DE Professional Organizations? Maybe.
American Statistical Association (ASA)
Offers a wide array of meetings, publications, and training as well as the vaunted PStat and GStat accreditations
Data Science Council of America (DASCA)
Offers several different certifications in Big Data, Analytics, and Data Science
Institute for Operations Research and the Management Sciences (INFORMS)
Offers various trainings, events, publications, and certifications
The Association of Data Scientists (ADaSci)
Based in India, they offer a Chartered Data Scientist (CDS) certification exam and training
Further Reading In the Real World of Data Science
• AI Projects Fail All Too Often. Successful Ones Share a Common Secret
https://gestaltit.com/tech-talks/intel/intel-2021/jimthewhyguy/ai-projects-fail-all-too-often-successful-ones-share-a-common-secret/
• Machine Learning in Production: Why Is It So Hard and So Many Fail?
https://towardsdatascience.com/machine-learning-in-production-why-is-it-so-difficult-28ce74bfc732
• Fact Check-Claims about 23,000 Wisconsin voters with the same phone number and 4,000 voters registered on 1/1/1918
https://www.reuters.com/article/factcheck-wisconsin-numbers/fact-check-claims-about-23000-wisconsin-voters-with-the-same-phone-number-and-4000-voters-registered-on-1-1-1918-missing-context-idUSL1N2RU1WC
• How a 'NULL' License Plate Landed One Hacker in Ticket Hell
https://www.wired.com/story/null-license-plate-landed-one-hacker-ticket-hell/
Useful Oracle Documentation
• What is Data Science?
https://www.oracle.com/data-science/what-is-data-science/
• Machine Learning Solutions with Oracle’s Services and Tools
https://www.oracle.com/a/ocom/docs/build-machine-learning-solutions-cloud-essentials.pdf
• Oracle Cloud Infrastructure Data Catalog
https://www.oracle.com/a/ocom/docs/ebook-cloud-infrastructure-data-catalog.pdf
• OML Algorithms “Cheat Sheet”
https://www.oracle.com/a/tech/docs/oml4sql-algorithm-cheat-sheet.pdf
• Oracle 21c Machine Learning Basics (including AutoML)
https://docs.oracle.com/en/database/oracle/machine-learning/oml4sql/21/dmcon/machine-learning-basics.html
