Data Lake to GPUs
@blazingdb
CPUs can no longer handle the growing data demands of data science workloads
· Slow Process: Preparing data and training models can take days or even weeks.
· Suboptimal Infrastructure: Hundreds to tens of thousands of CPU servers are needed in data centers.
GPUs are well known for accelerating the training of machine learning and deep learning models.
Deep Learning (Neural Networks) · Machine Learning
Performance improvements increase at scale.
40x improvement over CPU.
An end-to-end analytics solution on GPUs.
Expertise:
· GPU DBMS
· GPU Columnar Analytics
· Data Lakes
Expertise:
· CUDA
· Machine Learning
· Deep Learning
Expertise:
· Python
· Data Science
· Machine Learning
Expertise:
· Graph Analytics
· Visualization
· Apache Arrow
RAPIDS, the end-to-end GPU analytics ecosystem
cuDF
Data Preparation
cuML
Machine Learning
cuGRAPH
Graph Analytics
Data Preparation · Model Training · Visualization
A set of open-source libraries for GPU-accelerated data preparation and
machine learning.
In GPU Memory
In GPU Memory · Cross language · Zero-copy reads · Columnar
Apache Arrow
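To make the "Columnar" and "Zero-copy reads" properties concrete, here is a minimal CPU-only sketch. It uses Python's standard library as a stand-in and does not call Apache Arrow or any GPU library. The column names and values are made up for illustration. The only point is that a column stored as one contiguous buffer can be sliced through a view without copying the data.

```python
from array import array

# Hypothetical records stored row by row: (event id, bytes transferred).
rows = [(1, 500), (2, 1200), (3, 80), (4, 64)]

# Columnar layout: one contiguous typed buffer per column.
event_id = array("q", (r[0] for r in rows))
n_bytes = array("q", (r[1] for r in rows))

# memoryview exposes the column buffer without copying it.
view = memoryview(n_bytes)
tail = view[2:]  # zero-copy slice: shares memory with n_bytes

# A write to the underlying column is visible through the slice,
# which shows that the slice did not copy anything.
n_bytes[2] = 99
total_tail = sum(tail)
print(tail[0], total_tail)
```

In a row layout the same slice would have to gather one field from every record, which is why columnar formats suit analytic scans.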
In GPU Memory · GPU Compute Kernels · Pandas-like API
CUDA DataFrame (cuDF)
BlazingSQL, GPU-accelerated SQL engine
cuDF
Data Preparation
cuML
Machine Learning
cuGRAPH
Graph Analytics
A SQL engine built on RAPIDS. Query enterprise data lakes lightning fast, with
full interoperability with the RAPIDS stack.
In GPU Memory
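from pyblazing import BlazingContext

# Create the context used to register storage and run queries
bc = BlazingContext()

# Register the HDFS filesystem under the name 'data'
bc.hdfs('data', host='129.13.0.12', port=54310)

# Create a table backed by Parquet files in the registered filesystem
bc.table('performance',
         file_type='parquet',
         path='hdfs://data/performance/')

# Execute the query; the result comes back as a GPU DataFrame (GDF)
result_gdf = bc.sql('SELECT * FROM performance '
                    'WHERE YEAR(maturity_date) > 2005')
print(result_gdf)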
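For readers without a GPU, the same filter can be tried on the CPU. The sketch below uses Python's built-in sqlite3 module as a stand-in; it is not BlazingSQL and says nothing about its performance. The `loan_id` column and the three sample rows are assumptions made up for illustration. SQLite has no `YEAR()` function, so the year is extracted with `strftime` instead.

```python
import sqlite3

# In-memory database standing in for the 'performance' table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE performance (loan_id INTEGER, maturity_date TEXT)")
conn.executemany(
    "INSERT INTO performance VALUES (?, ?)",
    [(1, "2004-06-01"), (2, "2006-01-01"), (3, "2030-12-01")],  # made-up rows
)

# Equivalent of WHERE YEAR(maturity_date) > 2005 in SQLite's dialect.
rows = conn.execute(
    "SELECT loan_id, maturity_date FROM performance "
    "WHERE CAST(strftime('%Y', maturity_date) AS INTEGER) > 2005 "
    "ORDER BY loan_id"
).fetchall()
conn.close()

print(rows)
```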
BlazingSQL + Graphistry Netflow Analysis
Visually analyze the VAST netflow data set inside Graphistry in order to quickly detect anomalous events.
Netflow Data · ETL · Visualization
65M Events
2 Weeks
1,440 Devices
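The ETL step in a demo like this typically reduces raw flow events to graph edges that a visualization tool can draw. The sketch below shows one plausible shape of that reduction in plain Python. It is not the demo's actual pipeline, which ran as SQL on GPUs. The field layout (source IP, destination IP, bytes) and the sample events are assumptions.

```python
from collections import defaultdict

# Hypothetical netflow events: (source IP, destination IP, bytes).
events = [
    ("10.0.0.1", "10.0.0.2", 500),
    ("10.0.0.1", "10.0.0.2", 700),
    ("10.0.0.3", "10.0.0.2", 40),
    ("10.0.0.1", "10.0.0.4", 9000),
]


def to_edges(events):
    """Collapse events into one edge per (src, dst) pair with totals."""
    totals = defaultdict(lambda: [0, 0])  # (src, dst) -> [event count, bytes]
    for src, dst, n_bytes in events:
        totals[(src, dst)][0] += 1
        totals[(src, dst)][1] += n_bytes
    return {pair: tuple(v) for pair, v in totals.items()}


edges = to_edges(events)

# Every IP that appears on either end of an edge is a device (graph node).
devices = {ip for pair in edges for ip in pair}
print(len(edges), len(devices))
```

Aggregating before visualization keeps the graph at the scale of device pairs rather than individual events.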
Benchmarks
Netflow Demo Timings (Load & ETL)
Benchmarks
Netflow Demo Timings (ETL Only)
Upcoming BlazingSQL Releases
Releases: V0.1 V0.2 V0.3 V0.4 V0.5
· Query GDFs: Use the PyBlazing connection to execute SQL queries on GDFs that are loaded by the cuDF API.
· Direct Query Flat Files: Integrate the FileSystem API, adding the ability to directly query flat files (Apache Parquet & CSV) inside distributed file systems.
· Distributed Scheduler: SQL queries are fanned out across multiple GPUs and servers.
· String Support: String support and string operation support.
· Physical Plan Optimizer: Partition culling for WHERE clauses and joins.
External Dependency
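Partition culling for a WHERE clause can be illustrated with a small sketch: if each file carries min/max statistics for a column, any file whose range cannot satisfy the predicate is skipped before it is read. This is a generic illustration of the idea, not BlazingSQL's optimizer. The `Partition` class, the file names, and the year ranges are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    """Hypothetical per-file metadata: min/max of one column."""
    path: str
    min_year: int
    max_year: int


def cull(partitions, year_gt):
    """Keep only partitions that could hold rows with year > year_gt."""
    return [p for p in partitions if p.max_year > year_gt]


partitions = [
    Partition("part-0.parquet", 1999, 2003),
    Partition("part-1.parquet", 2004, 2008),
    Partition("part-2.parquet", 2009, 2015),
]

# For a predicate like year > 2005, part-0 can be skipped entirely.
to_scan = [p.path for p in cull(partitions, 2005)]
print(to_scan)
```

The surviving partitions still need the row-level filter applied; culling only reduces how much data is read.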
Scale Up & Out
Data sizes: ~400 GB · ~200 GB · 5 GB · 32 GB · ~10 TB
Hardware: GCP 1xT4 GPU · DGX-1 8xV100 GPU · DGX-2 16xV100 GPU · DGX-2 16xV100 GPU · GCP 4xT4 GPU
Distributed Scheduler (V0.4) · Physical Plan Optimizer (V0.5)
Get Started
BlazingSQL is quick to get up and running using either
DockerHub or Conda:

BlazingSQL & Graphistry - Netflow Demo
