1© Cloudera, Inc. All rights reserved.
dplyr Interfaces to Large-Scale Data
Ian Cook
@ianmcook
ian@cloudera.com

Context

Cloudera's mission: provide a platform on which data analysts and data scientists can efficiently query, analyze, and model large-scale data in clusters and cloud storage
• By distributing Apache Spark, Apache Impala, and other tools
• By enabling productive use of these tools
Python and R users often have difficulty moving from smaller data to large-scale distributed data
• Familiar packages and methods don't work the same way on distributed data

Poll question

[Figure: diagram with the labels SQL, PySpark, and SparkR; the label "SQL or DataFrame API" appears six times]

Poll question

[Figure: diagram with the labels SQL, PySpark, SparkR, and dplyr]

dplyr

dplyr provides a set of verbs that perform common data manipulation steps
• select() to select columns
• filter() to filter rows
• arrange() to order rows
• mutate() to create new columns
• summarise() to aggregate
• group_by() to perform operations by group
dplyr works on local data and with remote data sources
• For remote sources, dplyr commands are translated into SQL
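
As a rough illustration of the verbs listed above, here is a minimal sketch on a local data frame. It is not taken from the talk: the built-in mtcars dataset, the column name kpl, and the conversion factor are choices made only for this example.

```r
library(dplyr)

# mtcars ships with base R; mpg is miles per US gallon, cyl is cylinder count.
by_cyl <- mtcars %>%
  filter(cyl %in% c(4, 6)) %>%                    # filter rows
  select(cyl, mpg) %>%                            # select columns
  mutate(kpl = mpg * 0.4251) %>%                  # new column: approx. km per litre
  group_by(cyl) %>%                               # operate by group
  summarise(n = n(), mean_kpl = mean(kpl)) %>%    # aggregate each group
  arrange(desc(mean_kpl))                         # order rows

print(by_cyl)
```

The same pipeline could be pointed at a remote table instead of a local data frame, in which case dplyr would generate SQL rather than compute in R.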

Poll question

Demonstration

Example code at github.com/ianmcook/dplyr-examples

dplyr SQL backends

dplyr
↕
dbplyr
↕
dplyr SQL backend package*
↕
DBI
↕
DBI-compatible interface package
↕
database driver or connector
↕
database/engine
* optional
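
The layers above can be sketched with an in-memory SQLite database. This is an illustration under assumptions, not the talk's demo: RSQLite stands in for the "DBI-compatible interface package" and the driver, SQLite is the engine, and no separate backend package is needed because dbplyr handles SQLite itself. The table name mtcars is arbitrary.

```r
library(dplyr)   # verbs; uses dbplyr behind the scenes for database tables
library(DBI)     # generic database interface

# Assumption: RSQLite and dbplyr are installed.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)

# tbl() returns a lazy reference to the remote table, not the data itself.
cars_db <- tbl(con, "mtcars")
tbl_classes <- class(cars_db)
print(tbl_classes)

# The count is computed by SQLite; only the single value comes back to R.
n_rows <- cars_db %>% summarise(n = n()) %>% pull(n)
print(n_rows)

dbDisconnect(con)
```

For Spark or Impala, the connection line would be replaced by the corresponding backend's connection function; the dplyr code after it would stay largely the same.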

sparklyr

• Provides a SQL backend to dplyr for Spark
• Also exposes the MLlib API and a subset of the Spark DataFrames API
• Developed by RStudio
spark.rstudio.com

implyr

• Provides a SQL backend to dplyr for Impala
• Uses ODBC or JDBC to connect to Impala
• Developed at Cloudera
tiny.cloudera.com/implyr

Five tips for using dplyr with SQL data sources

1. Use show_query()
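
A minimal sketch of this tip, assuming an in-memory SQLite database (via RSQLite) as a stand-in for a large-scale engine. The table and column names come from the built-in mtcars dataset and are not from the talk.

```r
library(dplyr)
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)

query <- tbl(con, "mtcars") %>%
  filter(cyl == 4) %>%
  group_by(gear) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE))

# Prints the SQL that dplyr generated; the query itself is not run here.
show_query(query)

# sql_render() returns the same SQL as an object, which can be logged or inspected.
sql_text <- as.character(dbplyr::sql_render(query))

dbDisconnect(con)
```

Reading the generated SQL is a quick way to confirm that each verb was translated the way you intended before running it against a large table.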

2. filter() early, arrange() late
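
One plausible reading of this tip (an interpretation, not stated on the slide): filtering first reduces the rows every later step must process, and sorting last avoids paying for an ordering that many SQL engines do not guarantee to preserve through later steps. A sketch, again assuming in-memory SQLite and the mtcars dataset:

```r
library(dplyr)
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)

query <- tbl(con, "mtcars") %>%
  filter(cyl == 4) %>%                              # early: shrink the data first
  group_by(gear) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE)) %>%
  arrange(desc(mean_mpg))                           # late: sort only the small result

sql_text <- as.character(dbplyr::sql_render(query))
result <- collect(query)
print(result)

dbDisconnect(con)
```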

3. Check your data types
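
To illustrate why types deserve a check, the sketch below writes a small data frame to SQLite and reads it back. SQLite is used only as a convenient example engine; the table name events and its columns are hypothetical. Other engines map types differently, so the specific outcome here should not be generalized.

```r
library(dplyr)
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

local_df <- data.frame(
  id   = 1:3,
  flag = c(TRUE, FALSE, TRUE),
  day  = as.Date("2018-01-01") + 0:2
)
dbWriteTable(con, "events", local_df)

# Round trip: the engine stores its own types, and R maps them back on collect().
round_trip <- tbl(con, "events") %>% collect()

local_types  <- sapply(local_df,   function(col) class(col)[1])
remote_types <- sapply(round_trip, function(col) class(col)[1])
print(rbind(local_types, remote_types))

dbDisconnect(con)
```

With RSQLite's default settings, the logical and Date columns do not come back with their original R classes, which is the kind of surprise this tip warns about.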

4. Know your SQL engine
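
The same dplyr code can produce different SQL depending on the engine. The sketch below uses dbplyr's simulated connections, so no database is required. The helper render_for() is a hypothetical name introduced for this example, and the exact SQL text may vary between dbplyr versions.

```r
library(dplyr)
library(dbplyr)

# Hypothetical helper: render the same pipeline for a given (simulated) engine.
render_for <- function(con) {
  lazy_frame(x = 1, y = "a", con = con) %>%
    head(5) %>%
    sql_render() %>%
    as.character()
}

impala_sql <- render_for(simulate_impala())   # Impala dialect
mssql_sql  <- render_for(simulate_mssql())    # SQL Server dialect

cat(impala_sql, "\n\n", mssql_sql, "\n")
```

Here head() becomes a LIMIT clause for Impala and a TOP clause for SQL Server. Functions, data types, and performance characteristics differ between engines in similar ways.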

5. Know when to collect()
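
A sketch of one reasonable pattern, assuming in-memory SQLite and the mtcars dataset: do the heavy aggregation in the engine, collect() only the small result, then continue with R-only work locally. The label column is purely illustrative.

```r
library(dplyr)
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)

# Still remote: this only describes a query.
summary_db <- tbl(con, "mtcars") %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE))
is_remote <- inherits(summary_db, "tbl_lazy")

# collect() runs the query and brings the (small) result into R memory.
summary_local <- collect(summary_db)

# From here on, any R function can be used because the data is local.
summary_local <- summary_local %>%
  mutate(label = sprintf("%d cyl: %.1f mpg", as.integer(cyl), mean_mpg))
print(summary_local)

dbDisconnect(con)
```

Collecting too early pulls large tables into local memory; collecting too late limits you to operations the engine can express in SQL.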

Questions?
Ian Cook
@ianmcook
ian@cloudera.com

Cloudera Data Science Workbench

More information: tiny.cloudera.com/cdsw
OnDemand training: tiny.cloudera.com/cdsw-training

Cloudera Data Science Workbench: sparklyr, implyr, and More - dplyr Interfaces to Large-scale Data
