Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSci

Streamline End-to-End AI
Pipelines: Preprocess,
Visualize, and Build AI Faster
at-Scale on Intel® Architecture
Meena Arunachalam, Intel Corporation
Mike Flaxman, Omnisci
Skip Dupree, Databricks
October 2021

4
Notices and Disclaimers
Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.
Performance results are based on testing as of dates shown in configurations and may not reflect all publicly
available updates. See backup for configuration details. No product or component can be absolutely secure.
Your costs and results may vary.
Intel technologies may require enabled hardware, software or service activation.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries.
Other names and brands may be claimed as the property of others.

6
Intel Confidential
Data Loading
Data Preprocessing
Feature Engineering
Create
ML & DL
Models
Deploy

7
Optimizing End-to-End AI Pipelines on Intel® Xeon® Scalable Processor
Intel AI software spanning end-to-end pipeline
Large-scale analytics partners OmniSci, Databricks
Entire end-to-end AI performant on Xeon®

8
Engineer Data Create Machine Learning &
Deep Learning Models Deploy
oneDNN oneMKL
oneDAL
Data Analytics at Scale Optimized Frameworks and Middleware
Connect AI to Big Data BigDL
oneCCL
Accelerate End-to-End Data Science and AI AI Analytics Toolkit
Optimize and Deploy Models
Automate
Low-Precision
Optimization
OpenVINO™
Toolkit
Write Once
Deploy
Anywhere
Intel® Neural
Compressor
Automate
Model Tuning
AutoML
SigOpt
oneContainer Cnvrg.io
MLOps Developer Sandbox
DevCloud
Container Repository
w/ Intel Optimizations

9
Partnerships with 100s of Industry Leading ISVs, SIs, OEMs,
and Enterprise End Users

1 OmniSci. “OmniSci and Intel Collaborate to Bring Accelerated Analytics at Scale to CPUs”. https://www.omnisci.com/company/news/omnisci-and-intel-collaborate...
OmniSci analytics platform is
capable of leveraging Intel®
Xeon & Optane Persistent
Memory to achieve interactive
performance at any scale, on
everything from a laptop to a
multi-node cluster
Speed
At Converge, OmniSci’s user
conference, OmniSci
launched a CPU-optimized
version of OmniSci allowing
data scientists to run analytics
in milliseconds on billion+
row datasets, leveraging the
latest Intel hardware
10
OmniSci is collaborating with
Intel to make the OmniSci
platform available on all
modern Intel processor
families as well as continuing
collaboration around Intel
Optane and Intel Xe dGPU
Scale Access
OmniSci and Intel: Better Together

Modern BI
(with integrated in-memory compute)
Analytics Tools Today
Interactivity or Scale: Choose One
Data Scale
Interactivity
+
Agility
Scalable, Interactive Analytics
?
Legacy Analytic Solutions Data Lake & Data Warehouse Platforms
(paired with BI frontend)
Millions of rows Billions of rows
Thousands of rows
Milliseconds
Seconds
Hours
Minutes

1
2
Vertical Integration Yields Unprecedented Interactivity at Scale
OmniSciDB
Scalable Ultra-High Performance SQL + Rendering Engine
Modern Hardware
Massively Parallel and High Bandwidth CPUs and GPUs
SQL, Vega
requests
Compiled
queries, Vulkan
render calls
Apache Arrow SQL
results, rendered
PNGs
Query + Render
Results
OmniSci Immerse
No-Code Interactive Visual Exploration of Massive Datasets

Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy
HPC co-design principles for analytics
● Runtime compiler using LLVM infrastructure
for SQL and User-defined Functions
● Columnar data layout and memory
management to optimize for IO patterns found
in Analytics and Machine Learning
● Engineered specifically to exploit
parallelism (vectorization/SIMD/SPMD) for
analytic kernels on CPU/GPU
● Result: Class-leading performance and
efficiency for analytics, making big data truly
interactive
Intel & OmniSci: Better Together
13
21-node Spark
2.4 Cluster
m3.xlarge
OmniSci on
Macbook Pro
2x Core Xeon™ Gold
Workstation
Performance that scales up and down on
Intel Hardware
Machine Spec: 2S Intel Xeon 8276L, 4 TB Optane, 384 GB DDR4-2944 DRAM,Intel 960 SSD
NYC Taxi - https://tech.marksblogg.com/benchmarks.html (1.2 billion record dataset)
Using Intel® Xeon™ Gold
Processors
Up to 15x faster than Spark
1x
15.7x
1.7x
Results may vary.See www.Intel.com/InnovationEventClaims for workloads and configurations.

Speed at Scale, leveraging modern hardware
OmniSci uses modern high performance computing techniques including JIT compilation of analytical kernels, and
vectorization to achieve near-roofline performance for SQL and analytic kernels.
NYC Taxi –See Source https://tech.marksblogg.com/benchmarks.html for Workloads and confirgurations and
Results may vary
OmniSci performance on Intel Optane Technologies DCPMM -
preliminary benchmarks by Intel show significant scaling efficiency in
AppDirect mode
OmniSci performance on Xeon Gold, and Intel Coffee Lake on Laptops -
up to 15x faster on Xeon Gold Workstation than 21 node Spark 2.4
cluster

OmniSci Demo
Powered by

Collaboration Results
● Adopted Modin as our primary data science tooling. Now ships
in our integrated JupyterHub. “We make Pandas fly”!
● We are integrating OneDAL across our platform,
including in our no-code Immerse client (Spring 2022)
● We have optimized our forthcoming OmniSciRF extension using
TBB (in beta with 3 major telcos)
● OmniSci core DB is currently being optimized for Intel Optane
Technologies

Intel and Databricks:
Journey of Collaboration
▪ Apache Spark industry open-source contributions and optimizations
▪ Big Data Analytics and AI developers enabling
▪ Databricks on Intel – better together through engineering collaboration
on optimizing and enabling the latest Intel® Xeon® platform analytics and
AI related technologies

18
Databricks
The data and AI company
5000+
Across the globe
Customers
Lakehouse
One simple platform to unify all of
your data, analytics, and AI workloads
Original Creators

19
Lakehouse
Data
Warehouse
Data
Lake
Lakehouse
One platform to unify all of
your data, analytics, and AI workloads

20
Data Lake for all your data
One platform for every use case
Structured transactional layer
High performance query engine
BI Reports &
Dashboards
Data Science
Workspace
Machine Learning
Lifecycle
Structured, Semi-Structured,
and Unstructured Data
DELTA ENGINE
Databricks Unified Data Analytics Platform

21
Achieve greater
Databricks Runtime
Performance on
Optimized 2nd Gen
Intel® Xeon® Scalable
Processors vs 1st Gen
Intel® Xeon® Scalable
Processors
1.49
1.16 1.2
ProcessingTimeTotalfor3TBDataset Benchmark
(Hours - Lower is Better)
Azure Standard_E16s_V3 with 1st Generation Intel® Xeon® Platinum 8171M processors), 10 instances
Azure Standard_E16s_V4 with 2nd Generation Intel® Xeon® Platinum 8272CL processors), 10 instances
22%
Faster
25%
Faster
3.29
2.65 2.66
ProcessingTimeTotalfor10TBDataset Benchmark
(Hours - Lower is Better)
Azure Standard_E16s_V3 with 1st Generation Intel® Xeon® Platinum 8171M processors), 10 instances
22%
Faster
22%
Faster
Performancevaries by use, configuration and otherfactors. Configurations see appendix [2]
See www.intel.com/InnovationEventClaims for workloads and configurations. Results may vary

22
1.92x
2.12x
2.24x
1.93x
1.76x 1.84x
Azure Standard_F32s_v2 w/ 2nd Gen Intel®
Xeon® Platinum 8272CL processors
ProcessingTimesSpeedupw/ IntelOptimizedTensorFlow/BERT-large
(Higher is better)
Stock TensorFlow Library
Training w/ Intel-Optimized TensorFlow library
Inference w/ Intel-Optimized TensorFlow Library
1x
14.8x
52.4x
11.3x
16.4x
9.7x
1.2x 1.2x
7.4x
108.5x
kmeans ridge_regression linear_regression logistic_regression svm
ProcessingTimesSpeedupw/ IntelOptimizedScikit-learn
(Higher is better)
Stock Scikit-learn on Azure-Standard_F16s_v2 with 2nd Generation Intel® Xeon® Platinum 8272CL Processors
Training w/ Intel-optimized Scikit-learn on Azure-Standard_F16s_v2 with 2nd Generation Intel® Xeon® Platinum 8272CL Processors
Inference w/ Intel-optimized Scikit-learn on Azure-Standard_F16s_v2 with 2nd Generation Intel® Xeon® Platinum 8272CL Processors
Achieve Model
Speedup with Intel-
Optimized AI/ML
Libraries for
Databricks Runtime
for Machine
Learning on 2nd Gen
Intel® Xeon®
Scalable Processors
See www.intel.com/InnovationEventClaims for workloads and configurations. Results may vary

23
Ingest ETL Training
Train-Test-Split Inference
Machine Learning [Census]
Education to income level correlation on 50-year US Census dataset.
Load Dataset Tokenize Load Model
Feature Extractor Inference
Deep Learning INFERENCE [DLSA]
Huggingface API (transformer BERT-Large) and SST Stanford Sentiment
Treebank Movie reviews the dataset and classifies positive or negative reviews
Load
Dataset
Tokenize
Load
Model
Feature
Extractor
Training
Deep Learning TRAINING
AI cycles may dominate the pipeline
AI Cycles are a Fraction of the End-to-End Pipeline Flow
Census (ML), Document Level Sentiment Analysis (DL Inference), and DL Training Pipelines

24
Intel® Xeon® 8380 CPU
Nvidia A100 GPU
Training Inference
Census
LOWER is better
DLSA SST BS=1 multi-instance. LOWER is better
Intel Xeon 8380 CPU
Nvidia A100 GPU
Document Level Sentiment Analysis (DLSA)
Census (ML) and DLSA (NLP) Sentiment Analysis Pipelines
Ingest ETL Train-Test-Split
Is Xeon 8380 really
>5x faster end-to-end?
Is Nvidia A100 really
2x faster end-to-end?

25
Intel® Xeon® 8380 CPU
AMD EPYC 7742
+ Nvidia A100 GPU
Ingest ETL Training
Census
End-to-end time (sec). LOWER is better
Intel Xeon 8380 CPU
AMD EPYC 7742
+ Nvidia A100 GPU
Document Level Sentiment Analysis (DLSA)
DLSA SST BS=1 multi-instance. End-to-end time (sec). LOWER is better
Census (ML) and NLP Sentiment Analysis Pipelines

26
Optimized SW Fully Utilizing Modern Parallel HW
Scale
• Improve load
balancing
• Reduce
synchronization
events, all-to-all
comms
Parallelize
• OpenMP, TBB
• Reduce
synchronization
events, serial
code
• Improve load
balancing
Vectorize
• Unit strided
access per SIMD
lane
• High vector
efficiency
• Data alignment
Memory Mgmt
• Blocking
• Data reuse
• Prefetching
• Memory
allocation
Graph Opt
• Op fusion
• Batch
normalization
• Memory
allocation
Optimized to fully utilize
modern parallel HW
O p t i m i z e d
A I / A n a l y t i c s
P a c k a g e s
IntelExtension
forScikit-learn
+ + + +

27
End-to-End ML Optimizations
Readcsv ETL
Training
ML Time
Transparently distributes the data and
computation across available cores, unlike
Pandas which only uses one core at a time.
Single line import
change to run Modin
instead of pandas
Modin can be installed from PyPI:
pip install modin
Intel® Extension for Scikit-learn
Foundational library to speed up your Scikit-learn application, that is
highly optimized with low-level HW feature enabling to cover data
analytics and machine learning.
from sklearn.svm import SVC
X, Y = get_dataset()
clf = SVC().fit(X, y)
res = clf.predict(X)
from sklearnex import patch_sklearn
patch_sklearn()
from sklearn.svm import SVC
X, Y = get_dataset()
clf = SVC().fit(X, y)
res = clf.predict(X)
Scikit-learn with Intel CPU opts
Available through PyPi
pip install scikit-learn-intelex
Scikit-learn mainline
Modin
Intel® optimizations are
now available as part
of mainline XGBoost
repository.
• Read data
• Create dataframe
• Drop columns
• Type convert
• Arithmetic ops
• Create feature set/
test set
• Train test split
• Load numpy array to dmatrix objects
• Model prediction
• Calculate accuracy

28
DL Optimizations: End-to-End DLSA Workload
Multiple Instances
DLSA E2E workload can use multiple instances to fully utilize CPU resources and benefit from NUMA to deliver
up to 1.55X performance benefit compared to single instance.
Intel® oneAPI Math Kernel Library
The fastest and most-used math library for Intel-based systems. DLSA E2E inference pipeline leverages Intel®
oneAPI MKL and Intel® AVX 512 instructions to optimize AI performance on Intel® Xeon® Scalable platforms.
Intel Extensions for PyTorch
Intel® optimizations are now available as part of stock PyTorch using
Intel® Math Kernel Library (MKL) and oneAPI Deep Neural Network
Library (oneDNN).
Preprocessingand Application Deep Learning
Task Classifier
Transformer Inference Results Evaluation
Tokenization Feature Extraction Load Model
Load Dataset
Huggingface APIs

29
▪ Unrestricted scaling up to
max cores in an Intel Xeon
processor socket
▪ Benefits both “real-time”
and “batch” inferencing
▪ E2E ICX perf throughput
can be HIGHER than A100
with multi-instance
streaming
Instance 1 Instance 2 Instance 3 Instance 4
Instance 5
Instance 6 Instance 7
Instance 8
Instance 9
Instance 10
Nvidia A100 is limited
to only 7 MIG
instances per GPU
Maximizing DLSA Performance on 3rd Gen Intel® Xeon® Scalable
Processors using Multiple Instances
Intel Xeon Platinum 8380 Processor with 40 cores per socket
This example is 10 Xeon
instances per socket

30
1 2 4 5 8 10 20 40
Inference
Execution
Time (sec)
Lower is better
Number of instances per Xeon socket (40 cores)
Maximizing Performance on DLSA E2E Pipeline using Multiple Instances
Intel Xeon Platinum 8380 Processor with 40 cores per socket
Each AI configuration
(i.e. workload, BS, data type)
can be optimized by varying
Xeon cores/instance

31
The Payoff: Higher Performance/$
1.11
1.36
0
1
Document Level Sentiment
Analysis (DLSA)
DLSA SST BS=1 multi-instance
HIGHER is better
Census
Dataset
Intel® Xeon® Scalable 8380 processor
Nvidia A100 GPU
Relative
performance
per system cost

Call To Action
▪ Holistically optimize all phases of your pipeline to maximize performance
▪ Fully utilize Intel HW features (cores, memory, AVX, VNNI) and
SW optimizations
▪ Utilize Intel partners for data visualization and scale out analytics solution
▪ Download and develop with Intel optimizations using Intel AI Analytics
Toolkit intel.com/oneAPI-AIKit and other channels

Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSci

More Related Content

What's hot (20)

Similar to Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSci (20)

More from Intel® Software (20)

Recently uploaded (20)

Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSci