Welcome
Advanced SQL For Data Scientists
Jean Joseph
Data Engineer/DBA
Blog : bigdatadriven.org
Email: jean.joseph@bigdatadriven.org
Twitter: @garella79/@cloudatadriven
LinkedIn: https://www.linkedin.com/in/jeandjoseph/
In IT: over 18 years
From: New Jersey
Originally from: Haiti
Overview
• Brief intro to SQL
• The five major things to know in RDBMS
• Data preparation in SQL
• SQL advanced filtering
• Preparing data for use with analytics tools
• Key takeaways
What is RDBMS?
• Relational Database Management System
• Tabular
• Row(s), Column(s)
• Objects (Tables, Views, Synonyms, Functions, Procedures, ...)
• Normalization (OLTP)
• De-normalization (OLAP)
• ACID
What Is SQL?
• Stands for Structured Query Language
• SQL lets you control, create, and modify objects and manipulate data
What Can We Do With SQL?
• DDL: CREATE, DROP, TRUNCATE, ALTER, COMMENT, RENAME
• DQL: SELECT
• DML: INSERT, UPDATE, DELETE
• DCL: GRANT, REVOKE
• TCL: COMMIT, ROLLBACK, SAVEPOINT, SET TRANSACTION
Why SQL?
CRUD
Data Scientist Should Master
Join Data Set
Types Of SQL Joins
APPLY (Transact-SQL)
• CROSS
• OUTER
PIVOT (Transact-SQL)
UNION [ALL]
EXCEPT
INTERSECT
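As a rough illustration of the operators listed above, the sketch below contrasts CROSS APPLY with OUTER APPLY and shows EXCEPT and INTERSECT. The table variables, column names, and rows are hypothetical and are not taken from the deck.

```sql
-- Hypothetical sample data, for illustration only.
DECLARE @Customers TABLE (CustomerID int, Name varchar(20));
DECLARE @Orders    TABLE (OrderID int, CustomerID int, TotalDue money);

INSERT INTO @Customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
INSERT INTO @Orders    VALUES (10, 1, 50), (11, 1, 75), (12, 2, 20);

-- CROSS APPLY behaves like an inner join: Cy has no orders and is dropped.
SELECT c.Name, top_order.OrderID, top_order.TotalDue
FROM @Customers AS c
CROSS APPLY (SELECT TOP (1) ord.OrderID, ord.TotalDue
             FROM @Orders AS ord
             WHERE ord.CustomerID = c.CustomerID
             ORDER BY ord.TotalDue DESC) AS top_order;

-- OUTER APPLY behaves like a left join: Cy is kept with NULL order columns.
SELECT c.Name, top_order.OrderID, top_order.TotalDue
FROM @Customers AS c
OUTER APPLY (SELECT TOP (1) ord.OrderID, ord.TotalDue
             FROM @Orders AS ord
             WHERE ord.CustomerID = c.CustomerID
             ORDER BY ord.TotalDue DESC) AS top_order;

-- EXCEPT: customers that never ordered.
SELECT CustomerID FROM @Customers
EXCEPT
SELECT CustomerID FROM @Orders;

-- INTERSECT: customers with at least one order.
SELECT CustomerID FROM @Customers
INTERSECT
SELECT CustomerID FROM @Orders;
```

APPLY is useful when the right-hand side depends on the current left-hand row, as with "top order per customer" here.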
Data Preparation
• Collecting data
• Cleaning data
• Restructuring data
SQL Functions To Prep Data
• Position: CHARINDEX, PATINDEX, LEN, STUFF, STRING_AGG, SUBSTRING, STRING_SPLIT, STRING_ESCAPE, TRANSLATE, CONCAT_WS
• Transformation: CONCAT, LEFT, RIGHT, LOWER, UPPER, LEN, TRIM, REPLACE, REVERSE, REPLICATE
• Character Set: ASCII, CHAR, NCHAR, UNICODE
• Soundex: DIFFERENCE, SOUNDEX
AGGREGATE vs WINDOWING FUNCTIONS
AGGREGATE FUNCTIONS:
• Operate on an entire data set or table, or on each group defined by a GROUP BY clause, and return a single row per group.
WINDOWING FUNCTIONS:
• Do not collapse rows into a single output row. Each row keeps its separate identity, and the aggregated value is added to every row.
Types of Window functions:
• Aggregate Window Functions: SUM(), MAX(), MIN(), AVG(), COUNT()
• Ranking Window Functions: RANK(), DENSE_RANK(), ROW_NUMBER(), NTILE()
• Value Window Functions: LAG(), LEAD(), FIRST_VALUE(), LAST_VALUE()
WINDOW_FUNCTION ( [ ALL ] expression )
OVER ( [ PARTITION BY partition_list ]
[ ORDER BY order_list] )
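To make the aggregate versus windowing distinction concrete, here is a minimal sketch run against a hypothetical three-row table variable (the names and values are made up for illustration).

```sql
-- Hypothetical sample data, for illustration only.
DECLARE @Sales TABLE (Region varchar(10), Amount int);
INSERT INTO @Sales VALUES ('East', 100), ('East', 300), ('West', 200);

-- Aggregate function with GROUP BY: one output row per region (2 rows).
SELECT Region, SUM(Amount) AS RegionTotal
FROM @Sales
GROUP BY Region;

-- Window function with OVER: all 3 rows are kept,
-- and the region total is repeated on each row of that region.
SELECT Region, Amount,
       SUM(Amount) OVER (PARTITION BY Region) AS RegionTotal
FROM @Sales;
```

The same SUM() appears in both queries; only the OVER clause decides whether rows are collapsed or preserved.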
Demo
Preparing Data Using SQL
Data Prep Position Function - CHARINDEX
CHARINDEX ( expressionToFind , expressionToSearch [ , start_location ] )
String: This is a great event
SUBSTRING(string, start, length)
Parameter   Description
string      Required. The string to extract from
start       Required. The start position. The first position in string is 1
length      Required. The number of characters to extract. Must be a positive number
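The two functions are often combined: CHARINDEX finds where a token starts and SUBSTRING extracts it. The sketch below uses the sample string from the slide; the variable name and the search terms are illustrative choices.

```sql
DECLARE @s varchar(50) = 'This is a great event';

SELECT CHARINDEX('great', @s)     AS StartOfGreat,  -- 11
       SUBSTRING(@s, CHARINDEX('great', @s), LEN('great')) AS Extracted,  -- 'great'
       CHARINDEX('is', @s)        AS FirstIs,       -- 3 (the 'is' inside 'This')
       CHARINDEX('is', @s, 4)     AS NextIs;        -- 6 (search starts at position 4)
```

CHARINDEX returns 0 when the expression is not found, so it is worth checking the result before passing it to SUBSTRING.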
Data Prep Position Function - PATINDEX
PATINDEX ( '%pattern%' , expression )
Parameter   Description
%pattern%   Required. The pattern to find. It must be surrounded by %, except when matching the first or last characters of the expression.
expression  Required. The string to be searched
Wildcards:
• % - Match any string of any length (including 0 length)
• _ - Match one single character
• [] - Match any character in the brackets, e.g. [xyz]
• [^] - Match any character not in the brackets, e.g. [^xyz]
Example: find any string that contains 'big' and ends with 'driven.org'
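A possible way to express that search is sketched below. The first e-mail address comes from the speaker slide; the other two rows and the table variable are hypothetical test data.

```sql
-- Hypothetical sample data, except the first address, which appears in the deck.
DECLARE @Contacts TABLE (Email varchar(100));
INSERT INTO @Contacts VALUES
    ('jean.joseph@bigdatadriven.org'),
    ('someone@example.com'),
    ('info@bigdata.org');

-- Leading % lets 'big' appear anywhere; no trailing % anchors 'driven.org' to the end.
SELECT Email,
       PATINDEX('%big%driven.org', Email) AS MatchPosition
FROM @Contacts
WHERE PATINDEX('%big%driven.org', Email) > 0;  -- only the first row qualifies
```

PATINDEX returns 0 when the pattern does not match, which is why the filter tests for a value greater than 0.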
Data Prep Position Function - STRING_AGG
STRING_AGG ( input_string, separator ) [ order_clause ]
• input_string is an expression of any type that can be converted to VARCHAR or NVARCHAR during concatenation.
• separator is the separator for the result string. It can be a literal or a variable.
• order_clause specifies the sort order of the concatenated results using the WITHIN GROUP clause:
WITHIN GROUP ( ORDER BY expression [ ASC | DESC ] )
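A short sketch of the syntax above, assuming SQL Server 2017 or later (where STRING_AGG is available). The table variable and its rows are hypothetical.

```sql
-- Hypothetical sample data, for illustration only.
DECLARE @Skills TABLE (Person varchar(20), Skill varchar(20));
INSERT INTO @Skills VALUES ('Ann', 'SQL'), ('Ann', 'Python'), ('Bob', 'R');

-- One row per person, with skills concatenated in alphabetical order.
SELECT Person,
       STRING_AGG(Skill, ', ') WITHIN GROUP (ORDER BY Skill ASC) AS Skills
FROM @Skills
GROUP BY Person;
-- Ann -> 'Python, SQL'
-- Bob -> 'R'
```

Without WITHIN GROUP, the order of the concatenated values is not guaranteed.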
Data Prep Position Function - STRING_SPLIT
STRING_SPLIT(string, separator)
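STRING_SPLIT does the reverse of STRING_AGG: it turns a delimited string into rows. A minimal sketch, assuming SQL Server 2016 or later with database compatibility level 130 or higher; the input string is made up.

```sql
-- STRING_SPLIT returns a single column named value, one row per token.
-- The separator must be a single character; row order is not guaranteed.
SELECT LTRIM(RTRIM(value)) AS Tag
FROM STRING_SPLIT('sql, window functions, pivot', ',');
```

Trimming is done explicitly here because the function keeps any whitespace around each token.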
Analytic
• CUME_DIST (Transact-SQL)
• FIRST_VALUE (Transact-SQL)
• LAG (Transact-SQL)
• LAST_VALUE (Transact-SQL)
• LEAD (Transact-SQL)
• PERCENT_RANK (Transact-SQL)
• PERCENTILE_CONT (Transact-SQL)
• PERCENTILE_DISC (Transact-SQL)
Aggregate
• APPROX_COUNT_DISTINCT()
• AVG ()
• CHECKSUM_AGG ()
• COUNT ()
• COUNT_BIG ()
• GROUPING ()
• GROUPING_ID
• MAX ()
• MIN ()
• STDEV ()
• STDEVP ()
• SUM ()
• VAR ()
• VARP ()
Windowing Function – Framing
• Rows/Range
• ROWS frames can typically be processed with an in-memory worktable
• RANGE frames use an on-disk worktable in tempdb
• Keywords
• PRECEDING
• FOLLOWING
• UNBOUNDED
• CURRENT ROW
• Ranking
• ROW_NUMBER()
• RANK()
• DENSE_RANK()
• NTILE()
SQL Aggregate & Analytical Functions
Demo
Business Request:
Provide the total sales due, the average sales order amount, the total number of sales orders, and the sales rank for each year, including all fees (tax, shipping, ...).
Task:
• Get all sales orders for each year
• Calculate the:
• Sum of Total Due by year
• Average of sales orders by year
• Number of sales orders by year
• Rank of each year's sales relative to other years
Hint: Each year --> GROUP BY (AGGREGATE FUNCTIONS)
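One way this request could be answered is sketched below. It runs against a hypothetical table variable shaped like SalesOrderHeader, with invented rows, and it assumes TotalDue already includes tax and shipping.

```sql
-- Hypothetical stand-in for SalesOrderHeader; rows are invented.
DECLARE @SalesOrderHeader TABLE
    (SalesOrderID int, CustomerID int, OrderDate date, TotalDue money);
INSERT INTO @SalesOrderHeader VALUES
    (1, 100, '2013-03-01', 500), (2, 101, '2013-07-15', 300),
    (3, 100, '2014-01-10', 900), (4, 102, '2014-06-02', 400),
    (5, 101, '2014-06-03', 250);

SELECT YEAR(OrderDate) AS OrderYear,
       SUM(TotalDue)   AS TotalDue,      -- assumed to include tax and freight
       AVG(TotalDue)   AS AvgOrderDue,
       COUNT(*)        AS OrderCount,
       -- Ranking is applied after grouping, so it can order by the yearly sum.
       RANK() OVER (ORDER BY SUM(TotalDue) DESC) AS SalesRank
FROM @SalesOrderHeader
GROUP BY YEAR(OrderDate)
ORDER BY OrderYear;
```

The GROUP BY produces one row per year, and the window function then ranks those yearly rows against each other.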
Business Request: RUNNING TOTAL
• Provide the daily running Total Due on sales orders, and include CustomerID, SalesOrderID, and OrderDate, for the period from 2014-06-01 onward.
• Order by SalesOrderID, OrderDate
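A running total is a windowed SUM with an explicit frame. The sketch below is one reading of the request (a single running total across the whole period); the table variable and rows are hypothetical.

```sql
-- Hypothetical stand-in for SalesOrderHeader; rows are invented.
DECLARE @SalesOrderHeader TABLE
    (SalesOrderID int, CustomerID int, OrderDate date, TotalDue money);
INSERT INTO @SalesOrderHeader VALUES
    (1, 100, '2014-05-30', 500), (2, 101, '2014-06-01', 300),
    (3, 100, '2014-06-01', 900), (4, 102, '2014-06-02', 400);

SELECT CustomerID, SalesOrderID, OrderDate, TotalDue,
       SUM(TotalDue) OVER (ORDER BY OrderDate, SalesOrderID
                           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           AS RunningTotalDue
FROM @SalesOrderHeader
WHERE OrderDate >= '2014-06-01'          -- the 2014-05-30 order is excluded
ORDER BY SalesOrderID, OrderDate;
```

Stating ROWS explicitly matters: with only ORDER BY, the default frame is RANGE, which treats rows with equal sort values as peers and gives them the same running total.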
Business Request:
The Marketing Team has asked you to return the first three (3) orders, plus the close price, Total Order Due, and Total Orders, for every customer that purchased from us more than 15 times.
Tasks:
• Find all order details per customer
• Return only the first 3 orders per customer with:
• Close Price
• Total Orders
• where the total order count is greater than 15
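The tasks above map onto ROW_NUMBER() for "first three" and a windowed COUNT(*) for "more than 15 orders". The following is a sketch on generated, hypothetical data; the "close price" column is not defined in the slides, so it is left out.

```sql
-- Hypothetical stand-in for SalesOrderHeader.
DECLARE @SalesOrderHeader TABLE
    (SalesOrderID int IDENTITY(1,1), CustomerID int, OrderDate date, TotalDue money);

-- Customer 100 gets 16 orders; customer 101 gets only 2.
INSERT INTO @SalesOrderHeader (CustomerID, OrderDate, TotalDue)
SELECT 100, DATEADD(DAY, n, '2014-01-01'), 100 + n
FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),
             (9),(10),(11),(12),(13),(14),(15),(16)) AS nums(n);
INSERT INTO @SalesOrderHeader (CustomerID, OrderDate, TotalDue)
VALUES (101, '2014-01-05', 50), (101, '2014-02-05', 60);

WITH Ordered AS (
    SELECT CustomerID, SalesOrderID, OrderDate, TotalDue,
           ROW_NUMBER()  OVER (PARTITION BY CustomerID
                               ORDER BY OrderDate, SalesOrderID) AS OrderSeq,
           COUNT(*)      OVER (PARTITION BY CustomerID) AS TotalOrders,
           SUM(TotalDue) OVER (PARTITION BY CustomerID) AS TotalOrderDue
    FROM @SalesOrderHeader
)
SELECT CustomerID, SalesOrderID, OrderDate, TotalDue, TotalOrders, TotalOrderDue
FROM Ordered
WHERE OrderSeq <= 3          -- first three orders per customer
  AND TotalOrders > 15       -- only customers with more than 15 orders
ORDER BY CustomerID, OrderSeq;
```

Only customer 100 is returned, with three rows. Window functions cannot appear in a WHERE clause directly, which is why they are computed in a CTE first.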
Business Request:
A business user has asked you to return all orders from the SalesOrderHeader table for any customer who had over $10,000 in purchases across their first three transactions.
Task:
• Find the first three orders per customer
• Aggregate the first three orders
• Keep the customers whose first three orders total over $10,000
• Return all orders for those customers
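A possible shape for this query chains two CTEs: one numbers each customer's orders, the other aggregates the first three. The table variable and rows below are hypothetical.

```sql
-- Hypothetical stand-in for SalesOrderHeader; rows are invented.
DECLARE @SalesOrderHeader TABLE
    (SalesOrderID int, CustomerID int, OrderDate date, TotalDue money);
INSERT INTO @SalesOrderHeader VALUES
    (1, 100, '2014-01-01', 6000), (2, 100, '2014-02-01', 3000),
    (3, 100, '2014-03-01', 2000), (4, 100, '2014-04-01', 100),
    (5, 101, '2014-01-05', 500),  (6, 101, '2014-02-05', 700);

WITH FirstThree AS (
    SELECT CustomerID, TotalDue,
           ROW_NUMBER() OVER (PARTITION BY CustomerID
                              ORDER BY OrderDate, SalesOrderID) AS OrderSeq
    FROM @SalesOrderHeader
),
BigSpenders AS (
    SELECT CustomerID
    FROM FirstThree
    WHERE OrderSeq <= 3
    GROUP BY CustomerID
    HAVING SUM(TotalDue) > 10000     -- 6000 + 3000 + 2000 = 11000 for customer 100
)
SELECT soh.SalesOrderID, soh.CustomerID, soh.OrderDate, soh.TotalDue
FROM @SalesOrderHeader AS soh
WHERE soh.CustomerID IN (SELECT CustomerID FROM BigSpenders)
ORDER BY soh.CustomerID, soh.OrderDate;
```

All four orders of customer 100 come back, including the fourth one that was not part of the qualifying total.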
Business Request:
Find the first and last close price for each day of every month of the year, while keeping every detail row.
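FIRST_VALUE and LAST_VALUE fit this request because they add the daily first and last price to every detail row. The slides do not show the source table, so the sketch assumes a hypothetical price table with a timestamp and a close price.

```sql
-- Hypothetical price data; table and column names are assumptions.
DECLARE @Prices TABLE (TradeTime datetime2(0), ClosePrice decimal(10,2));
INSERT INTO @Prices VALUES
    ('2014-06-02 09:30', 10.00), ('2014-06-02 12:00', 10.50),
    ('2014-06-02 16:00', 10.25), ('2014-06-03 09:30', 10.30),
    ('2014-06-03 16:00', 10.80);

SELECT TradeTime, ClosePrice,
       FIRST_VALUE(ClosePrice) OVER (PARTITION BY CAST(TradeTime AS date)
                                     ORDER BY TradeTime) AS FirstClose,
       -- LAST_VALUE needs a frame that reaches the end of the partition;
       -- the default frame stops at the current row.
       LAST_VALUE(ClosePrice)  OVER (PARTITION BY CAST(TradeTime AS date)
                                     ORDER BY TradeTime
                                     ROWS BETWEEN UNBOUNDED PRECEDING
                                              AND UNBOUNDED FOLLOWING) AS LastClose
FROM @Prices
ORDER BY TradeTime;
```

For 2014-06-02 every row shows FirstClose 10.00 and LastClose 10.25; for 2014-06-03 the values are 10.30 and 10.80.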
Business Request:
Calculate the total due for each month, and get the subtotal for each year by region and territory. The business wants the result displayed in a specific format (the slide's before and after screenshots are not reproduced here).
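Monthly totals with yearly subtotals can be produced in a single pass with GROUPING SETS. The target layout from the slide is not available, so this sketch only shows the totals and subtotals; the table variable, column names, and rows are hypothetical.

```sql
-- Hypothetical sales data; names and rows are assumptions.
DECLARE @Sales TABLE
    (Region varchar(20), Territory varchar(20), OrderDate date, TotalDue money);
INSERT INTO @Sales VALUES
    ('North America', 'Northeast', '2014-01-10', 100),
    ('North America', 'Northeast', '2014-01-20', 200),
    ('North America', 'Northeast', '2014-02-05', 300),
    ('Europe',        'France',    '2014-01-15', 400);

WITH Monthly AS (
    SELECT Region, Territory,
           YEAR(OrderDate)  AS OrderYear,
           MONTH(OrderDate) AS OrderMonth,
           TotalDue
    FROM @Sales
)
SELECT Region, Territory, OrderYear,
       OrderMonth,                       -- NULL marks the yearly subtotal row
       SUM(TotalDue) AS TotalDue
FROM Monthly
GROUP BY GROUPING SETS (
    (Region, Territory, OrderYear, OrderMonth),   -- monthly totals
    (Region, Territory, OrderYear)                -- yearly subtotals
)
ORDER BY Region, Territory, OrderYear, GROUPING(OrderMonth), OrderMonth;
```

GROUPING(OrderMonth) returns 1 on subtotal rows, so sorting by it places each yearly subtotal after its months.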
The Five Major Things To Know In RDBMS
• CRUD
• ACID
• TCL
• Query Optimizer
• When To Use Index
Bonus
• Exception
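Several of these items meet in one pattern: a transaction (TCL) keeps a multi-statement change atomic (ACID), and TRY...CATCH handles the exception. The sketch below uses a hypothetical temp table; a temp table is used rather than a table variable because table variables are not rolled back.

```sql
-- Hypothetical accounts table; the CHECK constraint forbids negative balances.
CREATE TABLE #Accounts
    (AccountID int PRIMARY KEY, Balance money CHECK (Balance >= 0));
INSERT INTO #Accounts VALUES (1, 100), (2, 50);

BEGIN TRY
    BEGIN TRANSACTION;
    UPDATE #Accounts SET Balance = Balance + 500 WHERE AccountID = 2;
    UPDATE #Accounts SET Balance = Balance - 500 WHERE AccountID = 1;  -- violates CHECK
    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    -- Undo the first UPDATE as well, so the transfer is all-or-nothing.
    IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
    SELECT ERROR_NUMBER() AS ErrorNumber, ERROR_MESSAGE() AS ErrorMessage;
END CATCH;

SELECT AccountID, Balance FROM #Accounts;  -- balances are still 100 and 50
DROP TABLE #Accounts;
```

Without the explicit transaction, the first UPDATE would have been kept even though the second one failed.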
Key Takeaways
• Develop a deep understanding of ACID.
• Master DQL and DML commands.
• Know when to use TCL commands.
Feedback
Your feedback is important to us.
Don’t forget to rate and review
the sessions.
Jean Joseph
Data Engineer/DBA
Blog : bigdatadriven.org
Email: jean.joseph@bigdatadriven.org
Twitter: @garella79/@cloudatadriven
Thank You So much For Your
Participation!
