Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSci
Streamline End-to-End AI Pipelines: Preprocess, Visualize, and Build AI Faster at Scale on Intel® Architecture
Meena Arunachalam, Intel Corporation
Mike Flaxman, OmniSci
Skip Dupree, Databricks
October 2021
Notices and Disclaimers
Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.
Performance results are based on testing as of dates shown in configurations and may not reflect all publicly
available updates. See backup for configuration details. No product or component can be absolutely secure.
Your costs and results may vary.
Intel technologies may require enabled hardware, software or service activation.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries.
Other names and brands may be claimed as the property of others.
Pipeline stages: Data Loading → Data Preprocessing → Feature Engineering → Create ML & DL Models → Deploy
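The stages above can be sketched end-to-end with scikit-learn. This is a minimal illustration on synthetic data; the stage names mirror the slide, but the code itself is not from the deck:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Data loading (synthetic stand-in for a real dataset)
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Data preprocessing / feature engineering
X = StandardScaler().fit_transform(X)

# Create ML model
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Deploy (here, simply score on held-out data)
accuracy = clf.score(X_test, y_test)
print(round(accuracy, 2))
```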
Optimizing End-to-End AI Pipelines on Intel® Xeon® Scalable Processors
▪ Intel AI software spanning the end-to-end pipeline
▪ Large-scale analytics partners: OmniSci, Databricks
▪ Entire end-to-end AI pipeline performant on Xeon®
Intel AI software spans the pipeline: Engineer Data → Create Machine Learning & Deep Learning Models → Deploy
▪ Data analytics at scale, optimized frameworks and middleware: oneDNN, oneMKL, oneDAL, oneCCL
▪ Connect AI to big data: BigDL
▪ Accelerate end-to-end data science and AI: AI Analytics Toolkit
▪ Optimize and deploy models: OpenVINO™ Toolkit (write once, deploy anywhere), Intel® Neural Compressor (automate low-precision optimization)
▪ Automate model tuning: AutoML, SigOpt
▪ MLOps: Cnvrg.io; developer sandbox: DevCloud; container repository with Intel optimizations: oneContainer
Partnerships with 100s of Industry Leading ISVs, SIs, OEMs,
and Enterprise End Users
1 OmniSci. “OmniSci and Intel Collaborate to Bring Accelerated Analytics at Scale to CPUs”. https://www.omnisci.com/company/news/omnisci-and-intel-collaborate...
OmniSci and Intel: Better Together
● Speed: At Converge, OmniSci's user conference, OmniSci launched a CPU-optimized version of OmniSci that lets data scientists run analytics in milliseconds on billion+ row datasets, leveraging the latest Intel hardware.
● Scale: The OmniSci analytics platform can leverage Intel® Xeon® and Optane Persistent Memory to achieve interactive performance at any scale, on everything from a laptop to a multi-node cluster.
● Access: OmniSci is collaborating with Intel to make the OmniSci platform available on all modern Intel processor families, as well as continuing collaboration around Intel Optane and Intel Xe dGPU.
Analytics Tools Today: Interactivity or Scale, Choose One
[Chart: interactivity + agility (milliseconds, seconds, minutes, hours) vs. data scale (thousands, millions, billions of rows). Legacy analytic solutions handle thousands of rows interactively; modern BI (with integrated in-memory compute) reaches millions of rows; data lake & data warehouse platforms (paired with a BI frontend) reach billions of rows but lose interactivity. The open quadrant is scalable, interactive analytics.]
Vertical Integration Yields Unprecedented Interactivity at Scale
● OmniSci Immerse: no-code interactive visual exploration of massive datasets. Sends SQL and Vega requests; receives Apache Arrow SQL results and rendered PNGs.
● OmniSciDB: scalable, ultra-high-performance SQL + rendering engine. Issues compiled queries and Vulkan render calls; returns query + render results.
● Modern hardware: massively parallel, high-bandwidth CPUs and GPUs.
Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy
Intel & OmniSci: Better Together
HPC co-design principles for analytics:
● Runtime compiler using LLVM infrastructure for SQL and user-defined functions
● Columnar data layout and memory management optimized for I/O patterns found in analytics and machine learning
● Engineered specifically to exploit parallelism (vectorization/SIMD/SPMD) for analytic kernels on CPU/GPU
● Result: class-leading performance and efficiency for analytics, making big data truly interactive
Performance that scales up and down on Intel hardware
Using Intel® Xeon® Gold processors: up to 15x faster than Spark.
[Chart: 21-node Spark 2.4 cluster (m3.xlarge) = 1x baseline; OmniSci on a MacBook Pro = 1.7x; OmniSci on a 2x Xeon® Gold workstation = 15.7x.]
Machine spec: 2S Intel Xeon 8276L, 4 TB Optane, 384 GB DDR4-2944 DRAM, Intel 960 SSD.
NYC Taxi: https://tech.marksblogg.com/benchmarks.html (1.2 billion record dataset).
Results may vary. See www.Intel.com/InnovationEventClaims for workloads and configurations.
Speed at Scale, Leveraging Modern Hardware
OmniSci uses modern high-performance computing techniques, including JIT compilation of analytical kernels and vectorization, to achieve near-roofline performance for SQL and analytic kernels.
● OmniSci on Intel Optane DC Persistent Memory (DCPMM): preliminary benchmarks by Intel show significant scaling efficiency in App Direct mode.
● OmniSci on Xeon Gold, and Intel Coffee Lake on laptops: up to 15x faster on a Xeon Gold workstation than a 21-node Spark 2.4 cluster.
NYC Taxi: see https://tech.marksblogg.com/benchmarks.html for workloads and configurations. Results may vary.
OmniSci Demo
Powered by
Collaboration Results
● Adopted Modin as our primary data science tooling; it now ships in our integrated JupyterHub. "We make Pandas fly!"
● We are integrating oneDAL across our platform, including in our no-code Immerse client (Spring 2022).
● We have optimized our forthcoming OmniSciRF extension using TBB (in beta with three major telcos).
● The OmniSci core DB is currently being optimized for Intel Optane technologies.
Intel and Databricks: Journey of Collaboration
▪ Apache Spark open-source contributions and optimizations
▪ Enabling big data analytics and AI developers
▪ Databricks on Intel: better together through engineering collaboration on optimizing and enabling the latest Intel® Xeon® platform analytics and AI technologies
Databricks: the data and AI company
▪ 5000+ customers across the globe
▪ Lakehouse: one simple platform to unify all of your data, analytics, and AI workloads
▪ Original creators
Lakehouse
Data Warehouse + Data Lake → Lakehouse: one platform to unify all of your data, analytics, and AI workloads
Databricks Unified Data Analytics Platform
▪ Data lake for all your data: structured, semi-structured, and unstructured
▪ One platform for every use case: BI reports & dashboards, data science workspace, machine learning lifecycle
▪ Structured transactional layer
▪ High-performance query engine: Delta Engine
Achieve greater Databricks Runtime performance on 2nd Gen Intel® Xeon® Scalable processors vs. 1st Gen Intel® Xeon® Scalable processors
▪ 3 TB dataset benchmark, total processing time (hours; lower is better): 1.49 on Azure Standard_E16s_v3 (1st Gen Intel® Xeon® Platinum 8171M, 10 instances); 1.16 on Azure Standard_E16s_v4 (2nd Gen Intel® Xeon® Platinum 8272CL, 10 instances); 1.2 on Azure Standard_E8s_v4 (2nd Gen Intel® Xeon® Platinum 8272CL, 20 instances). Up to 22% to 25% faster.
▪ 10 TB dataset benchmark, total processing time (hours; lower is better): 3.29, 2.65, and 2.66 on the same three configurations. Up to 22% faster.
Performance varies by use, configuration and other factors. Configurations: see appendix [1], [2]. See www.intel.com/InnovationEventClaims for workloads and configurations. Results may vary.
Achieve model speedup with Intel-optimized AI/ML libraries for Databricks Runtime for Machine Learning on 2nd Gen Intel® Xeon® Scalable processors
▪ Intel-optimized TensorFlow, BERT-Large (speedup vs. stock TensorFlow library; higher is better), on Azure Standard_F32s_v2 / F64s_v2 / F72s_v2 with 2nd Gen Intel® Xeon® Platinum 8272CL processors: training 1.92x / 2.24x / 1.76x, inference 2.12x / 1.93x / 1.84x.
▪ Intel-optimized Scikit-learn vs. stock Scikit-learn (1x baseline) on Azure Standard_F16s_v2 with 2nd Gen Intel® Xeon® Platinum 8272CL processors (higher is better): training and inference speedups ranging from 1.2x to 108.5x across kmeans, ridge_regression, linear_regression, logistic_regression, and svm.
Performance varies by use, configuration and other factors. Configurations: see appendix [3], [4]. See www.intel.com/InnovationEventClaims for workloads and configurations. Results may vary.
AI Cycles Are a Fraction of the End-to-End Pipeline Flow
Census (ML), Document-Level Sentiment Analysis (DL inference), and DL training pipelines:
▪ Machine learning [Census]: Ingest → ETL → Train-Test-Split → Training → Inference. Correlates education with income level on a 50-year US Census dataset.
▪ Deep learning inference [DLSA]: Load Dataset → Tokenize → Load Model → Feature Extractor → Inference. Uses the Hugging Face API (transformer, BERT-Large) on the SST (Stanford Sentiment Treebank) movie-reviews dataset and classifies reviews as positive or negative.
▪ Deep learning training: Load Dataset → Tokenize → Load Model → Feature Extractor → Training.
AI cycles may dominate the pipeline.
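As a toy sketch of the DLSA inference stages, the pipeline can be pictured as load → tokenize → classify. The stand-in functions below are hypothetical; the real pipeline runs a fine-tuned BERT-Large via Hugging Face transformers:

```python
# Toy stand-ins for the DLSA stages; the real pipeline uses Hugging Face
# transformers (BERT-Large) on the SST dataset.
def load_dataset():
    return ["a great and moving film", "a dull, tedious mess"]

def tokenize(text):
    return text.lower().replace(",", " ").split()

def classify(tokens):
    # Placeholder "model": real inference runs a transformer forward pass.
    positive = {"great", "moving", "excellent"}
    negative = {"dull", "tedious", "mess"}
    score = sum(t in positive for t in tokens) - sum(t in negative for t in tokens)
    return "positive" if score > 0 else "negative"

labels = [classify(tokenize(review)) for review in load_dataset()]
print(labels)
```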
Census (ML) and DLSA (NLP) Sentiment Analysis Pipelines
[Charts: Intel® Xeon® 8380 CPU vs. Nvidia A100 GPU; lower is better.
▪ Census: Ingest → ETL → Train-Test-Split → Training → Inference
▪ DLSA (SST, BS=1, multi-instance): Load Dataset → Tokenize → Load Model → Feature Extractor → Inference]
Is Xeon 8380 really >5x faster end-to-end? Is Nvidia A100 really 2x faster end-to-end?
Census (ML) and NLP Sentiment Analysis Pipelines
[Charts: end-to-end time (sec); lower is better; Intel® Xeon® 8380 CPU vs. AMD EPYC 7742 + Nvidia A100 GPU.
▪ Census: Ingest → ETL → Train-Test-Split → Training → Inference
▪ DLSA (SST, BS=1, multi-instance): Load Dataset → Tokenize → Load Model → Feature Extractor → Inference]
Optimized SW Fully Utilizing Modern Parallel HW
Optimized AI/analytics packages (e.g., Intel Extension for Scikit-learn) are tuned to fully utilize modern parallel hardware:
▪ Scale: improve load balancing; reduce synchronization events and all-to-all communications
▪ Parallelize: OpenMP, TBB; reduce synchronization events and serial code; improve load balancing
▪ Vectorize: unit-strided access per SIMD lane; high vector efficiency; data alignment
▪ Memory management: blocking; data reuse; prefetching; memory allocation
▪ Graph optimization: op fusion; batch normalization; memory allocation
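A tiny illustration of the "vectorize" point: a unit-strided NumPy expression dispatches to SIMD-optimized kernels, while an element-by-element Python loop cannot. This example is illustrative only and is not from the deck:

```python
import numpy as np

x = np.arange(1_000_000, dtype=np.float64)

# Scalar loop: one element at a time, no SIMD
def scale_loop(a):
    out = np.empty_like(a)
    for i in range(a.size):
        out[i] = a[i] * 2.0 + 1.0
    return out

# Vectorized: unit-strided access, dispatched to optimized kernels
def scale_vec(a):
    return a * 2.0 + 1.0

# Same result, very different execution: the vectorized form processes
# multiple elements per SIMD lane per instruction.
assert np.allclose(scale_loop(x[:1000]), scale_vec(x[:1000]))
```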
End-to-End ML Optimizations
Pipeline: Read CSV → ETL → Train-Test-Split → Training → Inference (ML time)
Modin
Modin transparently distributes the data and computation across available cores, unlike pandas, which uses only one core at a time. A single-line import change runs Modin instead of pandas. Modin can be installed from PyPI:
pip install modin
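The single-line change looks like this. The sketch below falls back to stock pandas when Modin is not installed, so the same code runs either way:

```python
try:
    import modin.pandas as pd   # distributes work across all available cores
except ImportError:
    import pandas as pd         # stock pandas: single-core fallback

# Identical API either way; only the import line differs.
df = pd.DataFrame({"fare": [5.0, 7.5, 12.25]})
print(df["fare"].mean())
```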
Intel® Extension for Scikit-learn
A foundational library to speed up your Scikit-learn application, highly optimized with low-level HW feature enabling to cover data analytics and machine learning. Available through PyPI:
pip install scikit-learn-intelex

Scikit-learn mainline:
from sklearn.svm import SVC
X, y = get_dataset()
clf = SVC().fit(X, y)
res = clf.predict(X)

Scikit-learn with Intel CPU optimizations (same code after patching):
from sklearnex import patch_sklearn
patch_sklearn()
from sklearn.svm import SVC
X, y = get_dataset()
clf = SVC().fit(X, y)
res = clf.predict(X)
XGBoost
Intel® optimizations are now available as part of the mainline XGBoost repository. Pipeline steps covered: read data; create dataframe; drop columns; type convert; arithmetic ops; create feature set/test set; train-test split; load NumPy arrays into DMatrix objects; model prediction; calculate accuracy.
DL Optimizations: End-to-End DLSA Workload
Multiple instances: the DLSA E2E workload can use multiple instances to fully utilize CPU resources and benefit from NUMA, delivering up to 1.55x performance benefit compared to a single instance.
Intel® oneAPI Math Kernel Library: the fastest and most-used math library for Intel-based systems. The DLSA E2E inference pipeline leverages Intel® oneAPI MKL and Intel® AVX-512 instructions to optimize AI performance on Intel® Xeon® Scalable platforms.
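One common way to launch multiple pinned instances is numactl with per-instance core ranges. This is an illustrative sketch: the script name and instance count are hypothetical, and the commands are only echoed rather than executed:

```shell
# Split a 40-core socket into 4 pinned inference instances (dry run: echo only).
CORES_PER_SOCKET=40
INSTANCES=4
PER=$((CORES_PER_SOCKET / INSTANCES))
for i in $(seq 0 $((INSTANCES - 1))); do
  START=$((i * PER))
  END=$((START + PER - 1))
  # Each instance gets its own contiguous core range and local NUMA memory.
  echo "numactl --physcpubind=${START}-${END} --membind=0 python run_inference.py"
done
```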
Intel® Extension for PyTorch
Intel® optimizations are now available as part of stock PyTorch via the Intel® Math Kernel Library (MKL) and the oneAPI Deep Neural Network Library (oneDNN).
[Pipeline diagram: preprocessing and application (Load Dataset → Tokenization → Feature Extraction, via Hugging Face APIs) → deep learning (Load Model → Transformer Inference → Task Classifier) → results evaluation]
Maximizing DLSA Performance on 3rd Gen Intel® Xeon® Scalable Processors using Multiple Instances
▪ Unrestricted scaling up to the max cores in an Intel Xeon processor socket
▪ Benefits both "real-time" and "batch" inferencing
▪ E2E ICX performance throughput can be HIGHER than A100 with multi-instance streaming
This example runs 10 instances per socket on an Intel Xeon Platinum 8380 processor with 40 cores per socket; Nvidia A100 is limited to only 7 MIG instances per GPU.
Maximizing Performance on the DLSA E2E Pipeline using Multiple Instances
[Chart: inference execution time (sec; lower is better) vs. number of instances per Xeon socket (1, 2, 4, 5, 8, 10, 20, 40), on an Intel Xeon Platinum 8380 processor with 40 cores per socket.]
Each AI configuration (i.e., workload, batch size, data type) can be optimized by varying Xeon cores per instance.
The Payoff: Higher Performance/$
[Chart: relative performance per system cost (higher is better), Intel® Xeon® Scalable 8380 processor vs. Nvidia A100 GPU: 1.11 for Document-Level Sentiment Analysis (DLSA; SST, BS=1, multi-instance) and 1.36 for the Census dataset.]
Call To Action
▪ Holistically optimize all phases of your pipeline to maximize performance
▪ Fully utilize Intel HW features (cores, memory, AVX, VNNI) and SW optimizations
▪ Utilize Intel partners for data visualization and scale-out analytics solutions
▪ Download and develop with Intel optimizations using the Intel AI Analytics Toolkit (intel.com/oneAPI-AIKit) and other channels
Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSci
Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSci

More Related Content

PPTX
AWS & Intel Webinar Series - Accelerating AI Research
ODP
Passing The Joel Test In The PHP World
PDF
AIDC NY: BODO AI Presentation - 09.19.2019
PPTX
AI for All: Biology is eating the world & AI is eating Biology
PPTX
Reducing Deep Learning Integration Costs and Maximizing Compute Efficiency| S...
PDF
Build a Deep Learning Video Analytics Framework | SIGGRAPH 2019 Technical Ses...
PDF
Bring Intelligent Motion Using Reinforcement Learning Engines | SIGGRAPH 2019...
PPTX
N(ot)-o(nly)-(Ha)doop - the DAG showdown
AWS & Intel Webinar Series - Accelerating AI Research
Passing The Joel Test In The PHP World
AIDC NY: BODO AI Presentation - 09.19.2019
AI for All: Biology is eating the world & AI is eating Biology
Reducing Deep Learning Integration Costs and Maximizing Compute Efficiency| S...
Build a Deep Learning Video Analytics Framework | SIGGRAPH 2019 Technical Ses...
Bring Intelligent Motion Using Reinforcement Learning Engines | SIGGRAPH 2019...
N(ot)-o(nly)-(Ha)doop - the DAG showdown

What's hot (20)

PPTX
Python Data Science and Machine Learning at Scale with Intel and Anaconda
PDF
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...
PDF
Optimizing Apache Spark Throughput Using Intel Optane and Intel Memory Drive...
PPTX
oneAPI: Industry Initiative & Intel Product
PDF
AIDC Summit LA- Hands-on Training
PDF
Intel® Xeon® Scalable Processors Enabled Applications Marketing Guide
PDF
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
PDF
Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...
PDF
Intel Itanium Hotchips 2011 Overview
PDF
Machine programming
PDF
Intel's Machine Learning Strategy
PPTX
A Primer on FPGAs - Field Programmable Gate Arrays
PPTX
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
PPTX
AIDC Summit LA: LA Drones Solution Overview
PDF
A Dell Latitude 5420 laptop powered by a four-core Intel Core i5-1145G7 vPro ...
PDF
Amd ces tech day 2018 lisa su
PDF
Intel 2020 Labs Day Keynote Slides
PDF
Accelerate Machine Learning Software on Intel Architecture
PDF
Fast Scalable Easy Machine Learning with OpenPOWER, GPUs and Docker
PDF
A Dell Latitude 5420 laptop powered by a four-core Intel Core i5-1145G7 vPro ...
Python Data Science and Machine Learning at Scale with Intel and Anaconda
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...
Optimizing Apache Spark Throughput Using Intel Optane and Intel Memory Drive...
oneAPI: Industry Initiative & Intel Product
AIDC Summit LA- Hands-on Training
Intel® Xeon® Scalable Processors Enabled Applications Marketing Guide
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...
Intel Itanium Hotchips 2011 Overview
Machine programming
Intel's Machine Learning Strategy
A Primer on FPGAs - Field Programmable Gate Arrays
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
AIDC Summit LA: LA Drones Solution Overview
A Dell Latitude 5420 laptop powered by a four-core Intel Core i5-1145G7 vPro ...
Amd ces tech day 2018 lisa su
Intel 2020 Labs Day Keynote Slides
Accelerate Machine Learning Software on Intel Architecture
Fast Scalable Easy Machine Learning with OpenPOWER, GPUs and Docker
A Dell Latitude 5420 laptop powered by a four-core Intel Core i5-1145G7 vPro ...
Ad

Similar to Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSci (20)

PDF
Unleashing Data Intelligence with Intel and Apache Spark with Michael Greene
PDF
Accelerate Your AI Today
PDF
AIDC India - AI on IA
PDF
Pedal to the Metal: Accelerating Spark with Silicon Innovation
PDF
FPGAs and Machine Learning
PDF
Intel xeon-scalable-processors-overview
PDF
Driving Industrial InnovationOn the Path to Exascale
PPTX
Xeon Azure Local Pitch Deck - 25Q1v3.pptx
PDF
Xeon E5 Making the Business Case PowerPoint
PPTX
E5 Intel Xeon Processor E5 Family Making the Business Case
PPTX
Meg Mude, Intel - Data Engineering Lifecycle Optimized on Intel - H2O World S...
PDF
Accelerating Apache Spark-based Analytics on Intel Architecture-(Michael Gree...
PDF
Microsoft Build 2019- Intel AI Workshop
PDF
“Getting Efficient DNN Inference Performance: Is It Really About the TOPS?,” ...
PDF
“Intel Video AI Box—Converging AI, Media and Computing in a Compact and Open ...
PDF
The Intel Xeon Scalable Processor and IoT
PDF
INTEL® XEON® SCALABLE PROCESSORS
PDF
HPC Platform and Memory Technologies
PDF
HPC DAY 2017 | Accelerating tomorrow's HPC and AI workflows with Intel Archit...
PDF
Intel® Xeon® Processor E5-2600 v3 Product Family Application Showcase – Big D...
Unleashing Data Intelligence with Intel and Apache Spark with Michael Greene
Accelerate Your AI Today
AIDC India - AI on IA
Pedal to the Metal: Accelerating Spark with Silicon Innovation
FPGAs and Machine Learning
Intel xeon-scalable-processors-overview
Driving Industrial InnovationOn the Path to Exascale
Xeon Azure Local Pitch Deck - 25Q1v3.pptx
Xeon E5 Making the Business Case PowerPoint
E5 Intel Xeon Processor E5 Family Making the Business Case
Meg Mude, Intel - Data Engineering Lifecycle Optimized on Intel - H2O World S...
Accelerating Apache Spark-based Analytics on Intel Architecture-(Michael Gree...
Microsoft Build 2019- Intel AI Workshop
“Getting Efficient DNN Inference Performance: Is It Really About the TOPS?,” ...
“Intel Video AI Box—Converging AI, Media and Computing in a Compact and Open ...
The Intel Xeon Scalable Processor and IoT
INTEL® XEON® SCALABLE PROCESSORS
HPC Platform and Memory Technologies
HPC DAY 2017 | Accelerating tomorrow's HPC and AI workflows with Intel Archit...
Intel® Xeon® Processor E5-2600 v3 Product Family Application Showcase – Big D...
Ad

More from Intel® Software (20)

PDF
AI for good: Scaling AI in science, healthcare, and more.
PPTX
Advanced Techniques to Accelerate Model Tuning | Software for AI Optimization...
PPTX
Intel Developer Program
PDF
Intel AIDC Houston Summit - Overview Slides
PDF
AIDC NY: Applications of Intel AI by QuEST Global - 09.19.2019
PDF
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...
PDF
RenderMan*: The Role of Open Shading Language (OSL) with Intel® Advanced Vect...
PDF
AIDC India - Intel Movidius / Open Vino Slides
PDF
AIDC India - AI Vision Slides
PDF
Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 ...
PDF
Intel® Open Image Denoise: Optimized CPU Denoising | SIGGRAPH 2019 Technical ...
PDF
ANYFACE*: Create Film Industry-Quality Facial Rendering & Animation Using Mai...
PDF
Ray Tracing with Intel® Embree and Intel® OSPRay: Use Cases and Updates | SIG...
PDF
Use Variable Rate Shading (VRS) to Improve the User Experience in Real-Time G...
PDF
Bring the Future of Entertainment to Your Living Room: MPEG-I Immersive Video...
PDF
Intel® AI: Parameter Efficient Training
PDF
Intel® AI: Non-Parametric Priors for Generative Adversarial Networks
PDF
Persistent Memory Programming with Pmemkv
PDF
Big Data Uses with Distributed Asynchronous Object Storage
PDF
Debugging Tools & Techniques for Persistent Memory Programming
AI for good: Scaling AI in science, healthcare, and more.
Advanced Techniques to Accelerate Model Tuning | Software for AI Optimization...
Intel Developer Program
Intel AIDC Houston Summit - Overview Slides
AIDC NY: Applications of Intel AI by QuEST Global - 09.19.2019
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...
RenderMan*: The Role of Open Shading Language (OSL) with Intel® Advanced Vect...
AIDC India - Intel Movidius / Open Vino Slides
AIDC India - AI Vision Slides
Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 ...
Intel® Open Image Denoise: Optimized CPU Denoising | SIGGRAPH 2019 Technical ...
ANYFACE*: Create Film Industry-Quality Facial Rendering & Animation Using Mai...
Ray Tracing with Intel® Embree and Intel® OSPRay: Use Cases and Updates | SIG...
Use Variable Rate Shading (VRS) to Improve the User Experience in Real-Time G...
Bring the Future of Entertainment to Your Living Room: MPEG-I Immersive Video...
Intel® AI: Parameter Efficient Training
Intel® AI: Non-Parametric Priors for Generative Adversarial Networks
Persistent Memory Programming with Pmemkv
Big Data Uses with Distributed Asynchronous Object Storage
Debugging Tools & Techniques for Persistent Memory Programming

Recently uploaded (20)

PDF
Wondershare Filmora 15 Crack With Activation Key [2025
PDF
PTS Company Brochure 2025 (1).pdf.......
PDF
T3DD25 TYPO3 Content Blocks - Deep Dive by André Kraus
PDF
Design an Analysis of Algorithms I-SECS-1021-03
PDF
Claude Code: Everyone is a 10x Developer - A Comprehensive AI-Powered CLI Tool
PPTX
L1 - Introduction to python Backend.pptx
PDF
System and Network Administraation Chapter 3
PDF
Adobe Illustrator 28.6 Crack My Vision of Vector Design
PDF
System and Network Administration Chapter 2
PPTX
Online Work Permit System for Fast Permit Processing
PDF
Digital Strategies for Manufacturing Companies
PDF
2025 Textile ERP Trends: SAP, Odoo & Oracle
PDF
Complete React Javascript Course Syllabus.pdf
PPTX
ai tools demonstartion for schools and inter college
PDF
Upgrade and Innovation Strategies for SAP ERP Customers
PDF
Flood Susceptibility Mapping Using Image-Based 2D-CNN Deep Learnin. Overview ...
PPTX
Lecture 3: Operating Systems Introduction to Computer Hardware Systems
PDF
Design an Analysis of Algorithms II-SECS-1021-03
PDF
How Creative Agencies Leverage Project Management Software.pdf
PPTX
Essential Infomation Tech presentation.pptx
Wondershare Filmora 15 Crack With Activation Key [2025
PTS Company Brochure 2025 (1).pdf.......
T3DD25 TYPO3 Content Blocks - Deep Dive by André Kraus
Design an Analysis of Algorithms I-SECS-1021-03
Claude Code: Everyone is a 10x Developer - A Comprehensive AI-Powered CLI Tool
L1 - Introduction to python Backend.pptx
System and Network Administraation Chapter 3
Adobe Illustrator 28.6 Crack My Vision of Vector Design
System and Network Administration Chapter 2
Online Work Permit System for Fast Permit Processing
Digital Strategies for Manufacturing Companies
2025 Textile ERP Trends: SAP, Odoo & Oracle
Complete React Javascript Course Syllabus.pdf
ai tools demonstartion for schools and inter college
Upgrade and Innovation Strategies for SAP ERP Customers
Flood Susceptibility Mapping Using Image-Based 2D-CNN Deep Learnin. Overview ...
Lecture 3: Operating Systems Introduction to Computer Hardware Systems
Design an Analysis of Algorithms II-SECS-1021-03
How Creative Agencies Leverage Project Management Software.pdf
Essential Infomation Tech presentation.pptx

Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSci

  • 2. Streamline End-to-End AI Pipelines: Preprocess, Visualize, and Build AI Faster at-Scale on Intel® Architecture Meena Arunachalam, Intel Corporation Mike Flaxman, Omnisci Skip Dupree, Databricks October 2021
  • 3. 4 Notices and Disclaimers Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex. Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure. Your costs and results may vary. Intel technologies may require enabled hardware, software or service activation. © Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
  • 4. 5
  • 5. 6 Intel Confidential Data Loading Data Preprocessing Feature Engineering Create ML & DL Models Deploy
  • 6. 7 Optimizing End-to-End AI Pipelines on Intel® Xeon® Scalable Processor Intel AI software spanning end-to-end pipeline Large-scale analytics partners OmniSci, Databricks Entire end-to-end AI performant on Xeon®
  • 7. 8 Engineer Data Create Machine Learning & Deep Learning Models Deploy oneDNN oneMKL oneDAL Data Analytics at Scale Optimized Frameworks and Middleware Connect AI to Big Data BigDL oneCCL Accelerate End-to-End Data Science and AI AI Analytics Toolkit Optimize and Deploy Models Automate Low-Precision Optimization OpenVINO™ Toolkit Write Once Deploy Anywhere Intel® Neural Compressor Automate Model Tuning AutoML SigOpt oneContainer Cnvrg.io MLOps Developer Sandbox DevCloud Container Repository w/ Intel Optimizations
  • 8. 9 Partnerships with 100s of Industry Leading ISVs, SIs, OEMs, and Enterprise End Users
  • 9. 1 OmniSci. “OmniSci and Intel Collaborate to Bring Accelerated Analytics at Scale to CPUs”. https://www.omnisci.com/company/news/omnisci-and-intel-collaborate... OmniSci analytics platform is capable of leveraging Intel® Xeon & Optane Persistent Memory to achieve interactive performance at any scale, on everything from a laptop to a multi-node cluster Speed At Converge, OmniSci’s user conference, OmniSci launched a CPU-optimized version of OmniSci allowing data scientists to run analytics in milliseconds on billion+ row datasets, leveraging the latest Intel hardware 10 OmniSci is collaborating with Intel to make the OmniSci platform available on all modern Intel processor families as well as continuing collaboration around Intel Optane and Intel Xe dGPU Scale Access OmniSci and Intel: Better Together
  • 10. Modern BI (with integrated in-memory compute) Analytics Tools Today Interactivity or Scale: Choose One Data Scale Interactivity + Agility Scalable, Interactive Analytics ? Legacy Analytic Solutions Data Lake & Data Warehouse Platforms (paired with BI frontend) Millions of rows Billions of rows Thousands of rows Milliseconds Seconds Hours Minutes
  • 11. 1 2 Vertical Integration Yields Unprecedented Interactivity at Scale OmniSciDB Scalable Ultra-High Performance SQL + Rendering Engine Modern Hardware Massively Parallel and High Bandwidth CPUs and GPUs SQL, Vega requests Compiled queries, Vulkan render calls Apache Arrow SQL results, rendered PNGs Query + Render Results OmniSci Immerse No-Code Interactive Visual Exploration of Massive Datasets
  • 12. Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy HPC co-design principles for analytics ● Runtime compiler using LLVM infrastructure for SQL and User-defined Functions ● Columnar data layout and memory management to optimize for IO patterns found in Analytics and Machine Learning ● Engineered specifically to exploit parallelism (vectorization/SIMD/SPMD) for analytic kernels on CPU/GPU ● Result: Class-leading performance and efficiency for analytics, making big data truly interactive Intel & OmniSci: Better Together 13 21-node Spark 2.4 Cluster m3.xlarge OmniSci on Macbook Pro 2x Core Xeon™ Gold Workstation Performance that scales up and down on Intel Hardware Machine Spec: 2S Intel Xeon 8276L, 4 TB Optane, 384 GB DDR4-2944 DRAM,Intel 960 SSD NYC Taxi - https://tech.marksblogg.com/benchmarks.html (1.2 billion record dataset) Using Intel® Xeon™ Gold Processors Up to 15x faster than Spark 1x 15.7x 1.7x Results may vary​.See www.Intel.com/InnovationEventClaims for workloads and configurations.
  • 13. Speed at Scale, leveraging modern hardware OmniSci uses modern high performance computing techniques including JIT compilation of analytical kernels, and vectorization to achieve near-roofline performance for SQL and analytic kernels. NYC Taxi –See Source https://tech.marksblogg.com/benchmarks.html for Workloads and confirgurations and Results may vary OmniSci performance on Intel Optane Technologies DCPMM - preliminary benchmarks by Intel show significant scaling efficiency in AppDirect mode OmniSci performance on Xeon Gold, and Intel Coffee Lake on Laptops - up to 15x faster on Xeon Gold Workstation than 21 node Spark 2.4 cluster Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy Results may vary​.See www.Intel.com/InnovationEventClaims for workloads and configurations.
  • 14. OmniSci Demo Powered by Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy
  • 15. Collaboration Results ● Adopted Modin as our primary data science tooling; it now ships in our integrated JupyterHub. "We make Pandas fly!" ● We are integrating oneDAL across our platform, including in our no-code Immerse client (Spring 2022) ● We have optimized our forthcoming OmniSciRF extension using TBB (in beta with 3 major telcos) ● The OmniSci core DB is currently being optimized for Intel Optane Technologies
  • 16. Intel and Databricks: A Journey of Collaboration ▪ Open-source contributions and optimizations to Apache Spark ▪ Enabling big data analytics and AI developers ▪ Databricks on Intel: better together through engineering collaboration on optimizing and enabling analytics- and AI-related technologies on the latest Intel® Xeon® platforms
  • 17. Databricks: the data and AI company ● Original creators of Apache Spark ● 5000+ customers across the globe ● Lakehouse: one simple platform to unify all of your data, analytics, and AI workloads
  • 18. Lakehouse = Data Warehouse + Data Lake: one platform to unify all of your data, analytics, and AI workloads
  • 19. Databricks Unified Data Analytics Platform ● A data lake for all your data: structured, semi-structured, and unstructured ● One platform for every use case: BI reports & dashboards, data science workspace, machine learning lifecycle ● Structured transactional layer and high-performance query engine (Delta Engine)
  • 20. Achieve Greater Databricks Runtime Performance on 2nd Gen Intel® Xeon® Scalable Processors vs. 1st Gen Intel® Xeon® Scalable Processors. Total processing time for the 3 TB dataset benchmark (hours, lower is better): ● 1.49 — Azure Standard_E16s_v3 (1st Gen Intel® Xeon® Platinum 8171M processors), 10 instances ● 1.16 — Azure Standard_E16s_v4 (2nd Gen Intel® Xeon® Platinum 8272CL processors), 10 instances (22% faster) ● 1.20 — Azure Standard_E8s_v4 (2nd Gen Intel® Xeon® Platinum 8272CL processors), 20 instances (25% faster). Total processing time for the 10 TB dataset benchmark (hours, lower is better): ● 3.29 — Azure Standard_E16s_v3 (1st Gen Intel® Xeon® Platinum 8171M processors), 10 instances ● 2.65 — Azure Standard_E16s_v4 (2nd Gen Intel® Xeon® Platinum 8272CL processors), 10 instances (22% faster) ● 2.66 — Azure Standard_E8s_v4 (2nd Gen Intel® Xeon® Platinum 8272CL processors), 20 instances (22% faster). Performance varies by use, configuration and other factors; for configurations see appendix [1] and [2]. See www.intel.com/InnovationEventClaims for workloads and configurations. Results may vary.
  • 21. Achieve Model Speedup with Intel-Optimized AI/ML Libraries for Databricks Runtime for Machine Learning on 2nd Gen Intel® Xeon® Scalable Processors. Processing-time speedup with the Intel-optimized TensorFlow library on BERT-large vs. stock TensorFlow (higher is better), on Azure instances with 2nd Gen Intel® Xeon® Platinum 8272CL processors: ● Training: 1.92x (Standard_F32s_v2), 2.12x (Standard_F64s_v2), 2.24x (Standard_F72s_v2) ● Inference: 1.93x, 1.76x, and 1.84x on the same instances. Processing-time speedup with Intel-optimized scikit-learn vs. stock scikit-learn on Azure Standard_F16s_v2 (2nd Gen Intel® Xeon® Platinum 8272CL processors), higher is better: training and inference speedups ranging from 1.2x to 108.5x across kmeans, ridge_regression, linear_regression, logistic_regression, and svm. Performance varies by use, configuration and other factors; for configurations see appendix [3] and [4]. See www.intel.com/InnovationEventClaims for workloads and configurations. Results may vary.
  • 22. AI Cycles Are a Fraction of the End-to-End Pipeline Flow: Census (ML), Document-Level Sentiment Analysis (DL inference), and DL training pipelines. ● Machine learning [Census]: Ingest → ETL → Train-Test-Split → Training → Inference. Correlates education with income level on a 50-year US Census dataset. ● Deep learning inference [DLSA]: Load Dataset → Tokenize → Load Model → Feature Extractor → Inference. Uses the Hugging Face API (BERT-Large transformer) on the SST (Stanford Sentiment Treebank) movie-review dataset and classifies reviews as positive or negative. ● Deep learning training: Load Dataset → Tokenize → Load Model → Feature Extractor → Training. AI cycles may dominate the pipeline. Results may vary. See www.Intel.com/InnovationEventClaims for workloads and configurations.
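To see how much of an end-to-end pipeline the AI stages actually consume, each stage can be timed individually. A minimal stdlib-only sketch — the stage bodies here are toy placeholders, not the actual Census or DLSA code:

```python
import time

def timed(stage, fn, timings, *args):
    """Run one pipeline stage and record its wall-clock time."""
    t0 = time.perf_counter()
    out = fn(*args)
    timings[stage] = time.perf_counter() - t0
    return out

timings = {}
data  = timed("ingest",    lambda: list(range(100_000)), timings)
feats = timed("etl",       lambda d: [x * 0.5 for x in d], timings, data)
model = timed("train",     lambda f: sum(f) / len(f), timings, feats)  # toy "model": the mean
preds = timed("inference", lambda f: [1 if x > model else 0 for x in f], timings, feats)

# Per-stage share of the end-to-end time: this is the breakdown the
# slide's pipeline charts visualize.
total  = sum(timings.values())
shares = {k: v / total for k, v in timings.items()}
```

Instrumenting real ingest/ETL/train/infer stages the same way shows whether AI cycles dominate or are only a fraction of the total.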
  • 23. Census (ML) and DLSA (NLP) Sentiment Analysis Pipelines: Intel® Xeon® 8380 CPU vs. Nvidia A100 GPU. ● Census: Ingest → ETL → Train-Test-Split → Training → Inference (lower is better) ● DLSA: SST, BS=1, multi-instance; Load Dataset → Tokenize → Load Model → Feature Extractor → Inference (lower is better). Is Xeon 8380 really >5x faster end-to-end? Is Nvidia A100 really 2x faster end-to-end? Results may vary. See www.Intel.com/InnovationEventClaims for workloads and configurations.
  • 24. Census (ML) and NLP Sentiment Analysis Pipelines: Intel® Xeon® 8380 CPU vs. AMD EPYC 7742 + Nvidia A100 GPU. ● Census: end-to-end time (sec), Ingest → ETL → Train-Test-Split → Training → Inference (lower is better) ● DLSA: SST, BS=1, multi-instance; end-to-end time (sec), Load Dataset → Tokenize → Load Model → Feature Extractor → Inference (lower is better). Results may vary. See www.Intel.com/InnovationEventClaims for workloads and configurations.
  • 25. Optimized AI/Analytics Packages: SW Fully Utilizing Modern Parallel HW ● Scale: improve load balancing; reduce synchronization events and all-to-all communications ● Parallelize: OpenMP, TBB; reduce synchronization events and serial code; improve load balancing ● Vectorize: unit-strided access per SIMD lane; high vector efficiency; data alignment ● Memory management: blocking; data reuse; prefetching; memory allocation ● Graph optimization: op fusion; batch normalization; memory allocation. These packages, including the Intel Extension for Scikit-learn, are optimized to fully utilize modern parallel hardware.
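The "vectorize" bullet can be illustrated in miniature with NumPy, whose array expressions run as unit-strided loops over contiguous data, much like the SIMD analytic kernels described above (assuming NumPy is available; the arithmetic is illustrative, not one of the benchmarked kernels):

```python
import numpy as np

x = np.linspace(0.0, 1.0, 10_000)

# Scalar loop: one element per iteration, no SIMD-friendly access pattern.
loop_out = np.empty_like(x)
for i in range(x.size):
    loop_out[i] = x[i] * x[i] + 2.0 * x[i]

# Vectorized: one expression over the whole array, dispatched to
# optimized, unit-strided kernels.
vec_out = x * x + 2.0 * x
```

Both forms compute the same result; on large arrays the vectorized form is typically orders of magnitude faster, which is the effect the optimized libraries on this slide exploit at the framework level.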
  • 26. End-to-End ML Optimizations: Read CSV → ETL → Train-Test-Split → Training → Inference ● Modin: transparently distributes the data and computation across available cores, unlike pandas, which uses only one core at a time. A single-line import change runs Modin instead of pandas. Modin can be installed from PyPI: pip install modin ● Intel® Extension for Scikit-learn: a foundational library to speed up your scikit-learn application, highly optimized with low-level HW feature enabling to cover data analytics and machine learning. Available through PyPI: pip install scikit-learn-intelex. Adding two lines ahead of otherwise unchanged scikit-learn code — from sklearnex import patch_sklearn; patch_sklearn() — enables the Intel CPU optimizations for calls such as SVC().fit(X, y) and clf.predict(X) ● XGBoost: Intel® optimizations are now available as part of the mainline XGBoost repository. Pipeline steps covered: read data; create dataframe; drop columns; type convert; arithmetic ops; create feature set/test set; train-test split; load NumPy arrays into DMatrix objects; model prediction; calculate accuracy.
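The two "single line" swaps on this slide can be combined in one script. A hedged sketch: it falls back to stock pandas/scikit-learn when modin or scikit-learn-intelex is not installed, and the tiny inline dataset is illustrative only:

```python
# Drop-in pandas replacement that uses all cores; fall back to stock pandas.
try:
    import modin.pandas as pd
except ImportError:
    import pandas as pd

# Patch scikit-learn with the Intel extension if present; otherwise the
# stock estimators run unchanged.
try:
    from sklearnex import patch_sklearn
    patch_sklearn()
except ImportError:
    pass

from sklearn.svm import SVC

df = pd.DataFrame({"x1": [0.0, 1.0, 0.0, 1.0],
                   "x2": [0.0, 0.0, 1.0, 1.0],
                   "y":  [0, 1, 0, 1]})
X, y = df[["x1", "x2"]], df["y"]

clf = SVC().fit(X, y)      # accelerated by oneDAL when the patch is active
pred = clf.predict(X)
```

Because both swaps are import-level, the rest of an existing pandas/scikit-learn pipeline needs no changes.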
  • 27. DL Optimizations: End-to-End DLSA Workload ● Multiple instances: the DLSA E2E workload can use multiple instances to fully utilize CPU resources and benefit from NUMA, delivering up to a 1.55x performance benefit compared to a single instance ● Intel® oneAPI Math Kernel Library: the fastest and most-used math library for Intel-based systems. The DLSA E2E inference pipeline leverages Intel® oneAPI MKL and Intel® AVX-512 instructions to optimize AI performance on Intel® Xeon® Scalable platforms ● Intel® Extension for PyTorch: Intel® optimizations are now available as part of stock PyTorch via the Intel® Math Kernel Library (MKL) and the oneAPI Deep Neural Network Library (oneDNN). Pipeline: preprocessing and application (Hugging Face APIs: Load Dataset → Tokenization → Feature Extraction → Load Model) → deep learning task (transformer inference, classifier) → results evaluation.
  • 28. Maximizing DLSA Performance on 3rd Gen Intel® Xeon® Scalable Processors Using Multiple Instances ▪ Unrestricted scaling up to the maximum number of cores in an Intel Xeon processor socket, whereas Nvidia A100 is limited to only 7 MIG instances per GPU ▪ Benefits both "real-time" and "batch" inferencing ▪ E2E ICX throughput can be higher than A100 with multi-instance streaming. Example: 10 Xeon instances per socket on an Intel Xeon Platinum 8380 processor with 40 cores per socket.
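The 10-instances-per-socket layout can be derived mechanically. A stdlib-only sketch of the core partitioning — the `numactl` launch strings and the `infer.py` script name are illustrative assumptions, not part of the published configuration:

```python
def instance_core_ranges(cores_per_socket: int, n_instances: int):
    """Split a socket's cores into contiguous, equal slices, one per instance."""
    per = cores_per_socket // n_instances
    assert per * n_instances == cores_per_socket, "instances must divide cores evenly"
    return [(i * per, i * per + per - 1) for i in range(n_instances)]

# The slide's example: 10 instances on a 40-core Xeon 8380 socket -> 4 cores each.
ranges = instance_core_ranges(40, 10)

# Each instance would then be launched pinned to its slice, e.g. via numactl.
launch_cmds = [f"numactl --physcpubind={lo}-{hi} python infer.py"
               for lo, hi in ranges]
```

Varying `n_instances` (1, 2, 4, 5, 8, 10, 20, 40 on a 40-core socket) is exactly the cores-per-instance tuning knob the next slide sweeps.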
  • 29. Maximizing Performance on the DLSA E2E Pipeline Using Multiple Instances. Inference execution time (sec, lower is better) measured across 1, 2, 4, 5, 8, 10, 20, and 40 instances per Xeon socket (Intel Xeon Platinum 8380 processor, 40 cores per socket). Each AI configuration (i.e., workload, batch size, data type) can be optimized by varying the number of Xeon cores per instance. Results may vary. See www.Intel.com/InnovationEventClaims for workloads and configurations.
  • 30. The Payoff: Higher Performance/$. Relative performance per system cost, Intel® Xeon® Scalable 8380 processor vs. Nvidia A100 GPU (higher is better): ● Document-Level Sentiment Analysis (DLSA, SST, BS=1, multi-instance): 1.11 ● Census dataset: 1.36. Results may vary. See www.Intel.com/InnovationEventClaims for workloads and configurations.
  • 31. Call to Action ▪ Holistically optimize all phases of your pipeline to maximize performance ▪ Fully utilize Intel HW features (cores, memory, AVX, VNNI) and SW optimizations ▪ Utilize Intel partners for data visualization and scale-out analytics solutions ▪ Download and develop with Intel optimizations using the Intel AI Analytics Toolkit, available at intel.com/oneAPI-AIKit and through other channels