SQL Bits - the great data heist
Manchester 2019
An R primer for SQL folks
Thomas Hütter
Thomas Hütter, Diplom-Betriebswirt
• Application developer, consultant, accidental DBA, author
• Worked at consultancies, ISVs, end user companies
• Speaker at SQL events around Europe
• SQL Server > 6.5, Dynamics Nav > 3.01, R > 3.1.2
@DerFredo https://twitter.com/DerFredo
de.linkedin.com/in/derfredo
www.xing.com/profile/Thomas_Huetter
Agenda
• History: what is R, how did R come to be, what does the R ecosystem look like today
• Introduction: R IDE, RStudio, basic data types / objects, packages, in-/output, data analysis, visualization
• Business case demo:
• Extracting ‘sales’ data from a Nav DB on SQL Server
• Basic analysis and visualization
• Advanced visualization using the Shiny framework
• Example: data science going wrong, round-up, resources
• This is an introductory walk-through, not a deep dive, so no fancy predictions, regression, big data science :-(
History: R - then and now
• Programming language for statistical computing, analysis and visualization, widely used by statisticians, data miners, analysts, data scientists
• Created by Ross Ihaka and Robert Gentleman at the University of Auckland in 1993 as an open source implementation of the (1970s) S language
• GNU project, maintained by the R Foundation for Statistical Computing, with compiled builds for macOS, Linux and Windows; supported by the R Consortium
• Extensible through user-created packages, > 13700 available on CRAN
• Commercial support, e.g. since 2007 by Revolution Analytics, which was acquired by Microsoft in 2015; Microsoft now provides Microsoft R Open and R Server
• IDEs: R.app, RStudio, R Tools for Visual Studio (deprecated from VS 2019)
• Support for R now in SQL Server, Power BI, Azure ML, Data Science VM
Introduction: data objects
• Data types
- numeric, integer, complex
- character
- logical
- factor
- POSIX types (POSIXct, POSIXlt) for date/time
- NA = Not available (missing value)
• Data structures
- vector: 1 dim, 1 data type
- matrix: 2 dim rect, 1 data type
- list: collection of other objects
- array / table: can have > 2 dimensions
- data frame: 2 dim rect, cols = vectors

DemoBasics1
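The types and structures listed above can be tried out in a few lines of base R. This is only a small sketch, not the DemoBasics1 script; all variable names and values are made up for illustration.

```r
# One value of each basic data type (all values are illustrative)
num <- 3.14                                   # numeric (double)
int <- 42L                                    # integer, note the L suffix
cpx <- 1 + 2i                                 # complex
chr <- "Diesel"                               # character
lgl <- TRUE                                   # logical
fct <- factor(c("north", "south", "north"))   # factor with 2 levels
ts  <- as.POSIXct("2019-03-01 09:30:00", tz = "UTC")  # POSIX date/time

# Data structures
v  <- c(1, 2, NA, 4)                          # vector: 1 dim, 1 data type
m  <- matrix(1:6, nrow = 2)                   # matrix: 2 rows x 3 columns
l  <- list(name = "R", tags = c("stats", "graphics"))  # list: mixed objects
df <- data.frame(station = c("A", "B", "C"),  # data frame: columns are vectors
                 litres  = c(100.5, 80, NA),
                 stringsAsFactors = FALSE)

# Inspect what we built
print(class(ts))
print(dim(m))
str(df)

# NA propagates unless it is removed explicitly
print(mean(v))                 # NA
print(mean(v, na.rm = TRUE))   # mean of 1, 2 and 4
```

A SQL reader may find it helpful to think of a data frame as a table held in memory and of NA as roughly comparable to NULL.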
Introduction: packages
• Extensions to the R base system, containing code, data, documentation. Key factor in the success of R; flexible, user contributable. -> CRAN
• installed.packages() lists all installed packages incl. versions, dependencies, license and other info
• search() lists currently attached packages
• install.packages() downloads and installs packages
• library() loads/attaches packages, also require()
• Hadley Wickham, chief scientist at RStudio, professor of statistics: “Tidyverse” with dplyr, tidyr, lubridate, readr, httr, ggplot2 + many more: hadley.nz
DemoBasics2
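The package functions above can be sketched as follows. This is not the DemoBasics2 script. The install call is commented out because it needs network access, and the choice of dplyr and tools as example packages is an assumption made for illustration.

```r
# List installed packages with their versions (a character matrix)
pkgs <- installed.packages()
print(head(pkgs[, c("Package", "Version")]))

# Show what is currently attached to the search path
print(search())

# Download and install from CRAN (commented out: needs network access)
# install.packages("dplyr")

# Attach a package; 'tools' ships with R, so this works on any installation
library(tools)
print("package:tools" %in% search())

# Check for a package without attaching it; returns TRUE or FALSE
has_dplyr <- requireNamespace("dplyr", quietly = TRUE)
print(has_dplyr)
```

library() stops with an error if the package is missing, while require() only returns FALSE with a warning, which makes library() the safer default in scripts.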
Introduction: basic data in-/output
• Generic functions read.table and write.table
- read.csv / read.csv2: comma/semicolon delimited
- read.delim / read.delim2: tab delimited, decimal point/comma
- read.fwf: fixed width format
• Some additional I/O packages
- readr: functions flexibly load multiple formats fast
- foreign: reads data from Minitab, S, SAS, SPSS, Stata, dBase…
- DBI/odbc: database access via ODBC
- xlsx and readxl: read Excel 97/XP/200X files (xlsx also writes them)
- XML: reads XML and HTML tables from web sites
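A short round trip through temporary files shows how the csv and csv2 variants differ. The data frame and its column names are invented for this sketch.

```r
# Illustrative data; names and numbers are made up
sales <- data.frame(station = c("A", "B"),
                    litres  = c(1234.5, 987.25),
                    stringsAsFactors = FALSE)

f_csv  <- tempfile(fileext = ".csv")
f_csv2 <- tempfile(fileext = ".csv")

# write.csv: comma separator, decimal point
write.csv(sales, f_csv, row.names = FALSE)
# write.csv2: semicolon separator, decimal comma (common in German locales)
write.csv2(sales, f_csv2, row.names = FALSE)

# Look at the raw text to see the difference
print(readLines(f_csv))
print(readLines(f_csv2))

# Read both files back with the matching reader
back  <- read.csv(f_csv, stringsAsFactors = FALSE)
back2 <- read.csv2(f_csv2, stringsAsFactors = FALSE)

print(all.equal(sales, back))
print(all.equal(sales, back2))
```

Reading a csv2 file with read.csv (or the other way round) typically yields a single character column, which is a common first stumbling block.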
Introduction: basic data analysis + visualization
• Analyzing (numeric) data:
- str(): structure = data types and first values
- summary(): min, max, mean, median, quartiles; for factors: counts per level
- head()/tail(): show top/bottom n rows (default = 6)
• Distribution of values:
- hist(): shows frequency distribution
- boxplot(): shows median, quartiles, whiskers (min/max), outliers
- mosaicplot(): contingency mosaic
DemoBasics3
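The functions above can be tried on data sets that ship with R. This sketch uses the built-in iris and Titanic data as stand-ins; it is not the DemoBasics3 script, and the built-in Titanic table is not the Kaggle file named in the credits.

```r
# Structure: column types and the first few values
str(iris)

# Numeric column: min, quartiles, median, mean, max
print(summary(iris$Sepal.Length))

# Factor column: counts per level
print(summary(iris$Species))

# First and last rows (default n = 6, here 3)
print(head(iris, 3))
print(tail(iris, 3))

# Frequency distribution; hist() also returns the bin counts
h <- hist(iris$Sepal.Length,
          main = "Sepal length", xlab = "cm")

# One box per species: median, quartiles, whiskers, outliers
boxplot(Sepal.Length ~ Species, data = iris,
        main = "Sepal length by species")

# Contingency mosaic of a built-in 4-way table
mosaicplot(Titanic, main = "Titanic (built-in table)")
```

When run non-interactively, for example with Rscript, the plots are written to a PDF file instead of a window.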
Continued… data analysis + visualization
• Libraries: tidyr for data tidying/reshaping, ggplot2 implements the grammar of graphics, raster for geo data
• apply() family of functions: apply() applies a function to the margins of an array or a matrix
• gather()/spread() convert between wide/long format
• ggplot() very powerful plot function, plots point, line or bar geometries etc. with versatile parameters
DemoBasics4
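A minimal sketch of the apply() family and of wide/long reshaping, not the DemoBasics4 script. The data are invented, and the tidyr part only runs if that package happens to be installed.

```r
# apply() works on the margins of a matrix: 1 = rows, 2 = columns
m <- matrix(1:6, nrow = 2)          # rows: (1, 3, 5) and (2, 4, 6)
row_sums <- apply(m, 1, sum)        # one sum per row
col_max  <- apply(m, 2, max)        # one maximum per column
print(row_sums)
print(col_max)

# sapply() applies a function to each element and simplifies the result
lens <- sapply(c("SQL", "R"), nchar)
print(lens)

# tapply() is the closest relative of GROUP BY: mean mpg per cylinder count
print(tapply(mtcars$mpg, mtcars$cyl, mean))

# Wide/long conversion with tidyr (skipped if tidyr is not installed)
wide <- data.frame(station = c("A", "B"),
                   jan = c(10, 20),
                   feb = c(30, 40),
                   stringsAsFactors = FALSE)
if (requireNamespace("tidyr", quietly = TRUE)) {
  # wide -> long: one row per station and month
  long <- tidyr::gather(wide, key = "month", value = "litres", jan, feb)
  print(long)
  # long -> wide again
  print(tidyr::spread(long, key = month, value = litres))
}
```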
Business case demo
• We are the distributor for all German petrol stations, with two subsidiaries: NorthTank and SouthFuel
• Business calls: “We need some analysis of our 2015 Diesel sales”, preferably some visualizations, and “maybe something is wrong…”
• Of interest: distribution by post code zones
• Source: Dynamics Nav ERP database, on the customer card (table “Customer”) there’s a field called “Sales (LCY)” (= local currency)
• Publicly available shape- and data files for post code zones



Extracting data & first analysis
• Using ODBC and the DBI package (also available: JDBC, RODBC and others)
• dbConnect() to establish a connection, then dbGetQuery() to query the database
• Calculate aggregates (sums) using ddply() from the plyr package
• Bar plot: ggplot() + geom_bar()
• Line diagram: ggplot() + geom_line()
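The extraction and aggregation steps above might look roughly like the sketch below. It is not the demo script. The DSN name, the table name and the column names are hypothetical, and the database helper is defined but never called, because it needs the DBI and odbc packages plus a reachable server. A small made-up data frame stands in for the query result.

```r
# Hypothetical helper: DSN, table and column names are assumptions.
# Defined only; calling it requires DBI, odbc and a live SQL Server.
fetch_sales <- function(dsn = "NavDemo") {
  con <- DBI::dbConnect(odbc::odbc(), dsn = dsn)
  on.exit(DBI::dbDisconnect(con))
  DBI::dbGetQuery(con, "SELECT post_code, sales_lcy FROM customer_sales")
}

# Made-up stand-in for the query result
sales <- data.frame(
  post_code = c("10115", "20095", "20457", "80331", "80333"),
  sales_lcy = c(1200, 800, 450, 2300, 700),
  stringsAsFactors = FALSE
)

# Post code zone = first digit of the post code
sales$zone <- substr(sales$post_code, 1, 1)

# Sum per zone with base R (comparable to GROUP BY zone)
totals <- aggregate(sales_lcy ~ zone, data = sales, FUN = sum)
print(totals)

# The same aggregate with plyr::ddply(), if plyr is installed
if (requireNamespace("plyr", quietly = TRUE)) {
  print(plyr::ddply(sales, "zone",
                    function(d) data.frame(total = sum(d$sales_lcy))))
}

# Bar plot with ggplot2 if available, otherwise with base graphics
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(totals, ggplot2::aes(x = zone, y = sales_lcy)) +
    ggplot2::geom_bar(stat = "identity")
  print(p)
} else {
  barplot(totals$sales_lcy, names.arg = totals$zone)
}
```

Whether to aggregate in SQL or in R is a design choice. Pulling the raw rows keeps the R side flexible, while a GROUP BY on the server moves less data.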
Analysis & visualization
• Calculate intervals for sales sums: cut()
• Libraries raster, rgeos for visualizing geospatial data
• Shapefiles: open vector data format for GIS software, describes points, lines or polygons in these files: .shp shapes, .shx shape index, .dbf attributes, .prj projection
• Merge shape and sales data: merge()
• Plot maps, colouring post code zones according to sales
DemoTankData
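The cut() and merge() steps can be illustrated without any shapefiles. In this sketch a plain data frame stands in for the attribute table of the post code shapes; zone labels, sales figures and interval breaks are all invented, and the map plotting step is left out.

```r
# Made-up sales totals per post code zone
totals <- data.frame(zone  = c("1", "2", "8"),
                     sales = c(1200, 1250, 3000),
                     stringsAsFactors = FALSE)

# cut() turns a numeric column into labelled intervals (a factor)
totals$band <- cut(totals$sales,
                   breaks = c(0, 1000, 2000, Inf),
                   labels = c("low", "medium", "high"))
print(totals)

# Stand-in for the shape attributes: one row per zone
zones <- data.frame(zone = c("1", "2", "3", "8"),
                    region = c("Berlin", "Hamburg", "Hannover", "Muenchen"),
                    stringsAsFactors = FALSE)

# merge() behaves like a join; all.x = TRUE makes it a LEFT JOIN
merged <- merge(zones, totals, by = "zone", all.x = TRUE)
print(merged)   # zone 3 has no sales, so its sales and band are NA
```

Keeping all zones with all.x = TRUE matters for the map, since zones without sales still need to be drawn.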
Use of Shiny framework
• Framework for interactive web applications in R; apps consist of server.R and ui.R or just app.R
• ui defines screen appearance & controls
• server handles any data processing, plotting etc.
• apps can be run in web browser

DemoShiny/app
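A single-file app.R could look roughly like the sketch below. It is not the DemoShiny app. It uses the built-in faithful data set, the input and output ids are made up, and the UI is only built when the shiny package is installed.

```r
# server: data processing and plotting, reacts to the inputs defined in the ui
server <- function(input, output) {
  output$hist <- shiny::renderPlot({
    hist(faithful$waiting, breaks = input$bins,
         main = "Waiting time between eruptions", xlab = "minutes")
  })
}

if (requireNamespace("shiny", quietly = TRUE)) {
  # ui: screen appearance and controls
  ui <- shiny::fluidPage(
    shiny::titlePanel("Minimal Shiny sketch"),
    shiny::sliderInput("bins", "Number of bins",
                       min = 5, max = 30, value = 10),
    shiny::plotOutput("hist")
  )

  app <- shiny::shinyApp(ui = ui, server = server)

  # Start the app only in an interactive session; it opens in the browser
  if (interactive()) shiny::runApp(app)
}
```

Moving the slider changes input$bins, which causes the renderPlot() expression to run again, so no explicit event handling code is needed.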
Example: data science going wrong?
• Anscombe’s quartet:
• 4 data sets of 11 x-y pairs each, which look completely different when plotted
• yet nearly identical statistical properties
- Mean of x = 9
- Mean of y = 7.5
- Correlation between x and y = 0.816
- Linear regression y = 3 + 0.5 x
Anscombe
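The figures above can be reproduced with the anscombe data set that ships with R. This sketch is not the demo script; it computes the four statistics per data set and then plots the sets side by side.

```r
# anscombe has columns x1..x4 and y1..y4
stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)                      # linear regression y = a + b * x
  c(mean_x    = mean(x),
    mean_y    = mean(y),
    cor_xy    = cor(x, y),
    intercept = unname(coef(fit)[1]),
    slope     = unname(coef(fit)[2]))
})
colnames(stats) <- paste0("set", 1:4)
print(round(stats, 3))

# The plots tell a very different story from the numbers
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  plot(x, y, main = paste("Set", i), xlim = c(3, 20), ylim = c(2, 14))
  abline(lm(y ~ x), col = "red")
}
par(op)
```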
Round-up / conclusions
• With R, a lot is possible in terms of analysis and visualization
• There’s probably always a package for that

But please:
• Know your data
• Look at your data
• Think - does it make sense?
• Consider the influence of outliers
• Don’t blindly rely on R ‘doing the trick’
Resources online
• https://en.wikipedia.org/wiki/R_(programming_language)
• https://www.r-project.org/ -> Mirrors of CRAN = Comprehensive R Archive Network
• https://www.r-consortium.org/
• http://www.r-bloggers.com/
• www.kdnuggets.com
• www.rseek.org Pimped Google search for R-related subjects
• Twitter hashtag #rstats
• LinkedIn groups R Developers and Users Group, R Programming, The R Project for…
• www.swirlstats.com “Learn R, in R”
• www.coursera.org Data Science specialization (10 courses) MOOC
• www.edx.org
Resources offline
• Beginning R, The statistical programming language, Dr. Mark Gardener, Wrox/Wiley, ISBN 978-1118164303
• R Cookbook, Paul Teetor, O’Reilly, ISBN 978-0596809157
• R Graphics Cookbook, Winston Chang, O’Reilly, ISBN 978-1449316952
• R in a Nutshell, Joseph Adler, O’Reilly, ISBN 978-1449312084
• Practical Data Science with R, Nina Zumel + John Mount, Manning publications, ISBN 978-1617291562
Credits
• Titanic data set: www.kaggle.com/c/titanic/data
• SQL Database structure: mbs.microsoft.com Dynamics Nav 2016 demo database
• Customer and “sales” data: www.tankerkoenig.de (license CC BY 4.0)
• Shape files:
- www.suche-postleitzahl.org (Open database license, © OpenStreetMap contributors)
- Bundesamt für Kartographie und Geodäsie, Frankfurt am Main, 2011
• Some icons made by: http://www.flaticon.com/authors/hanan (license CC BY 3.0)
• Anscombe’s quartet: Francis J. Anscombe 1973
An R primer for SQL folks
Time for some Q & A:
That is: questions that might be of common interest, and whose answers might fit into the remaining time
For slides and scripts: follow the link on the final slide or check the SQLBits homepage ;-)
An R primer for SQL folks
Thank you for your interest & keep in touch:
@DerFredo https://twitter.com/DerFredo
de.linkedin.com/in/derfredo
www.xing.com/profile/Thomas_Huetter
Slides and scripts to this presentation will be at https://github.com/SQLThomas/Conferences/tree/master/Bits2019

