Zhang Zhang, Victoriya Fedotova
Intel Corporation
November 2016
Agenda
Introduction
– A quick intro to Intel® Data Analytics Acceleration Library and Intel® Distribution for Python
– A brief overview of basic machine learning concepts
Lab activities
– Warm-up exercises: Learn the gist of the PyDAAL API
– Linear regression
– Classification with SVM
– K-Means clustering
– PCA
Conclusions
Get Your Hands Dirty with Intel® Distribution for Python*
Data Analytics Flow
Example: Spam Filter (messages classified as "spam" or "not spam")
Collect → Store → Load → Pre-process → Modelling (Train & Validate) → Deploy → Make Decision
Computational Aspects of Big Data
Volume
• Distributed across different nodes/devices
• Huge data size not fitting into node/device memory
Variety
• Non-homogeneous data
• Sparse/Missing/Noisy data
Velocity
• Data arriving over time
[Figure: Distributed Computing (data blocks D1 ... DK combined into a result R); Online Computing (block Di updates partial result Pi to Pi+1 over time, within memory capacity); Converts, Indexing, Repacking; Data Recovery for numeric, categorical, missing, and outlier attributes; dense and sparse algorithms]
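The online and distributed computing modes named on this slide can be sketched in plain Python. This is a minimal illustration, not the Intel DAAL API: the `Partial` class and the `update`, `merge`, and `finalize` functions are hypothetical names, and the statistic (mean and population variance) is chosen only because its partial results are easy to combine.

```python
from dataclasses import dataclass


@dataclass
class Partial:
    """Partial result for one or more data blocks: count, sum, sum of squares."""
    n: int = 0
    total: float = 0.0
    total_sq: float = 0.0


def update(partial, block):
    """Online step: fold a new block D_i into partial result P_i, giving P_{i+1}."""
    return Partial(
        partial.n + len(block),
        partial.total + sum(block),
        partial.total_sq + sum(x * x for x in block),
    )


def merge(a, b):
    """Distributed step: combine partial results computed on different nodes."""
    return Partial(a.n + b.n, a.total + b.total, a.total_sq + b.total_sq)


def finalize(partial):
    """Turn a partial result into the final result R: (mean, population variance)."""
    mean = partial.total / partial.n
    variance = partial.total_sq / partial.n - mean * mean
    return mean, variance


blocks = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# Online computing: blocks arrive over time; only one block is in memory at once.
online = Partial()
for block in blocks:
    online = update(online, block)

# Distributed computing: each node summarizes its own block, then results are merged.
node_partials = [update(Partial(), block) for block in blocks]
distributed = Partial()
for p in node_partials:
    distributed = merge(distributed, p)

online_result = finalize(online)
distributed_result = finalize(distributed)
```

Both paths reach the same final result because the partial results are plain sums. The sum-of-squares form is simple but can lose precision on large values; a production implementation would likely use a more stable update.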
Intel® Data Analytics Acceleration Library (Intel® DAAL)
• Targets both data centers (Intel® Xeon® and Intel® Xeon Phi™) and edge devices (Intel® Atom™)
• Performs analysis close to the data source (sensor/client/server) to optimize response latency, decrease network bandwidth utilization, and maximize security
• Offloads data to a server/cluster for complex and large-scale analytics
(De-)Compression
(De-)Serialization
PCA
Statistical moments
Quantiles
Variance matrix
QR, SVD, Cholesky
Apriori
Outlier detection
Regression
• Linear
• Ridge
Classification
• Naïve Bayes
• SVM
• Classifier boosting
• kNN
Clustering
• K-Means
• EM GMM
Collaborative filtering
• ALS
Neural Networks
Data flow stages: Pre-processing, Transformation, Analysis, Modeling, Validation, Decision Making
Application domains: Scientific/Engineering, Web/Social, Business
Intel® DAAL Main Features
Building end-to-end data applications
Optimized for Intel architectures, from Intel® Atom™, Intel® Core™, and Intel® Xeon® to Intel® Xeon Phi™
A rich set of widely applicable algorithms for data mining and machine learning
Batch, online, and distributed processing
Data connectors to a variety of data sources and formats: KDB*, MySQL*, HDFS, CSV, and user-defined sources/formats
C++, Java, and Python APIs
*Other names and brands may be claimed as the property of others
Python Landscape
Challenge #1: Domain specialists are not professional software programmers.
Adoption of Python continues to grow among domain specialists and developers for its productivity benefits
Challenge #2: Python performance limits migration to production systems
Intel’s solution is to…
• Accelerate Python performance
• Enable easy access
• Empower the community
Highlights: Intel® Distribution for Python* 2017
Focus on advancing Python performance closer to native speeds
Easy, out-of-the-box access to high performance Python
• Prebuilt, accelerated distribution for numerical and scientific computing, data analytics, and HPC. Optimized for IA
• Drop-in replacement for your existing Python. No code changes required
Drive performance with multiple optimization techniques
• Accelerated NumPy/SciPy/scikit-learn with Intel® Math Kernel Library
• Data analytics with pyDAAL, enhanced thread scheduling with TBB, Jupyter* notebook interface, Numba, Cython
• Scale easily with optimized mpi4py and Jupyter notebooks
Faster access to latest optimizations for Intel architecture
• Distribution and individual optimized packages available through conda and Anaconda Cloud
• Optimizations upstreamed back to main Python trunk
Performance Gain from MKL (Compared to “vanilla” SciPy)
Configuration info: Versions: Intel® Distribution for Python 2017 Beta, icc 15.0; Hardware: Intel® Xeon® CPU E5-2698 v3 @ 2.30GHz (2 sockets, 16 cores each, HT=OFF), 64 GB of RAM, 8 DIMMS of 8GB@2133MHz; Operating System: Ubuntu 14.04 LTS.
Linear Algebra
• BLAS
• LAPACK
• ScaLAPACK
• Sparse BLAS
• Sparse Solvers
Fast Fourier Transforms
• Multidimensional
• FFTW interfaces
• Cluster FFT
Vector Math
• Trigonometric
• Hyperbolic
• Exponential
• Log
• Power, Root
Vector RNGs
• Multiple BRNG
• Support methods for independent streams creation
• Support all key probability distributions
Summary Statistics
• Kurtosis
• Variation coefficient
• Order statistics
• Min/max
• Variance-covariance
And More
• Splines
• Interpolation
• Trust Region
• Fast Poisson Solver
Chart callouts: up to 100x faster, up to 10x faster, up to 10x faster, up to 60x faster
PyDAAL (Python API for Intel® DAAL)
Turbocharged machine learning tool for Python developers
Interoperability and composability with the SciPy ecosystem:
– Work directly with NumPy ndarrays
– Faster than scikit-learn
We’ll see how to use it in this lab
Regression
Problems
– A company wants to determine the impact of pricing changes on the number of product sales
– A biologist wants to determine the relationships between the body size, shape, anatomy, and behavior of an organism
Solution: Linear Regression
– A linear model for the relationship between features and the response
Source: Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani. (2014). An Introduction to Statistical Learning. Springer
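A linear model with one feature can be fitted by ordinary least squares in a few lines. This is a plain-Python sketch rather than PyDAAL code; the function name `fit_line` and the price/sales numbers are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y ~ slope * x + intercept for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: sales fall by 2 units for every unit of price increase.
prices = [1.0, 2.0, 3.0, 4.0]
sales = [9.0, 7.0, 5.0, 3.0]

slope, intercept = fit_line(prices, sales)
predicted_sales_at_5 = slope * 5.0 + intercept
```

The fitted slope is the estimated impact of a one-unit price change on sales, which is the quantity the first problem above asks for.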
Classification
Problems
– An emailing service provider wants to build a spam filter for its customers
– A postal service wants to implement handwritten address interpretation
Solution: Support Vector Machine (SVM)
– Works well for non-linear decision boundaries
– Two kernel functions are provided:
  – Linear kernel
  – Gaussian kernel (RBF)
– Multi-class classifier
  – One-vs-One
Source: Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani. (2014). An Introduction to Statistical Learning. Springer
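The two kernels and the one-vs-one scheme can be written out directly. This is an illustrative sketch in plain Python, not the PyDAAL kernel API; the `sigma` parametrization of the Gaussian kernel is an assumption, since libraries differ on how the width is expressed.

```python
import math
from itertools import combinations


def linear_kernel(x, y):
    """Linear kernel: the dot product of the two feature vectors."""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-||x - y||^2 / (2 * sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def one_vs_one_pairs(classes):
    """One-vs-one trains one two-class SVM per unordered pair of classes."""
    return list(combinations(classes, 2))


x = [1.0, 2.0]
y = [3.0, 4.0]
linear_value = linear_kernel(x, y)
rbf_value = rbf_kernel(x, y, sigma=1.0)
pairs = one_vs_one_pairs(["a", "b", "c"])
```

For k classes, one-vs-one needs k(k-1)/2 two-class classifiers, so three classes give three pairs.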
Cluster Analysis
Problems
– A news provider wants to group news items with similar headlines in the same section
– People with similar genetic patterns are grouped together to identify correlation with a specific disease
Solution: K-Means
– Pick k centroids
– Repeat until convergence:
  – Assign data points to the closest centroid
  – Re-calculate centroids as the mean of all points in the current cluster
  – Re-assign data points to the closest centroid
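The K-Means steps listed on this slide can be sketched in plain Python as follows. This is not the PyDAAL implementation; the function names are illustrative, and the initial centroids are passed in explicitly so the run is deterministic.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def closest(point, centroids):
    """Index of the centroid nearest to the point."""
    return min(range(len(centroids)),
               key=lambda j: squared_distance(point, centroids[j]))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, initial_centroids, max_iter=100):
    centroids = list(initial_centroids)
    assignments = None
    for _ in range(max_iter):
        # Assign every point to its closest centroid.
        new_assignments = [closest(p, centroids) for p in points]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Re-calculate each centroid as the mean of its cluster.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = mean_point(members)
    return centroids, assignments


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(points, initial_centroids=[points[0], points[3]])
```

The result depends on the initial centroids, which is why centroid initialization is treated as a separate step in the clustering workflow.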
Dimensionality Reduction
Problems
– A data scientist wants to visualize a multi-dimensional data set
– A classifier built on the whole data set tends to overfit
Solution: Principal Component Analysis
– Compute the eigendecomposition of the correlation matrix
– Use the eigenvectors with the largest eigenvalues to compute the leading principal components, which explain most of the variance in the original data
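The correlation-matrix approach can be sketched in plain Python. This is an illustration, not the PyDAAL PCA algorithm: it finds only the leading principal component, uses power iteration in place of a full eigendecomposition, and assumes no feature is constant. All function names and the data are made up.

```python
import math


def standardize(column):
    """Center a feature column and scale it to unit (population) variance."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    return [(v - mean) / std for v in column]


def correlation_matrix(standardized_columns):
    """Correlation matrix of features given as standardized columns."""
    n = len(standardized_columns[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / n
             for zj in standardized_columns]
            for zi in standardized_columns]


def leading_eigenpair(matrix, iterations=200):
    """Power iteration: (eigenvalue, unit eigenvector) for the largest eigenvalue."""
    v = [1.0] + [0.0] * (len(matrix) - 1)
    for _ in range(iterations):
        w = [sum(m * x for m, x in zip(row, v)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v^T M v gives the eigenvalue for a unit vector v.
    eigenvalue = sum(vi * sum(m * x for m, x in zip(row, v))
                     for vi, row in zip(v, matrix))
    return eigenvalue, v


# Two strongly correlated features, so one component carries most of the variance.
feature_a = [1.0, 2.0, 3.0, 4.0, 5.0]
feature_b = [2.1, 3.9, 6.2, 7.8, 10.1]

z = [standardize(feature_a), standardize(feature_b)]
corr = correlation_matrix(z)
eigenvalue, component = leading_eigenpair(corr)

# The trace of a correlation matrix equals the number of features.
explained_fraction = eigenvalue / len(corr)

# Scores: each observation projected onto the leading principal component.
scores = [sum(c * col[i] for c, col in zip(component, z))
          for i in range(len(feature_a))]
```

Here `explained_fraction` is the share of total variance captured by the first component, which is the quantity used to decide how many components to keep.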
Setup
• Unpack the archive to the local disk
• Run the setup script:
– Linux, OS X: ./setup.sh
– Windows: setup.bat
• Set the path to conda:
– Linux, OS X: export PATH=<path_to_idp>/bin:$PATH
– Windows: set PATH=<path_to_idp>\Scripts;%PATH%
Lab 1: Warm-up Exercise
Learning objectives:
• Understand NumericTable, the main data structure of DAAL
– Create NumericTable from data sources
– Interoperability with NumPy, Pandas, scikit-learn
– Get NumPy ndarray from NumericTable
• Understand the code sequence for using the DAAL API
– Create an algorithm object
– Pass in input data
– Set algorithm-specific parameters
– Compute
– Get results
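The five-step code sequence can be illustrated with a toy stand-in. The `ColumnMeans` class below is hypothetical and is not part of PyDAAL; its method and attribute names are assumptions chosen only to show the order of the calls.

```python
class ColumnMeans:
    """Hypothetical toy algorithm (not a PyDAAL class) following the same call pattern."""

    def __init__(self):
        self.input_rows = None     # set in step 2
        self.skip_missing = False  # algorithm-specific parameter, set in step 3

    def set_input(self, rows):
        self.input_rows = rows

    def compute(self):
        means = []
        for column in zip(*self.input_rows):
            if self.skip_missing:
                values = [v for v in column if v is not None]
            else:
                values = list(column)
            means.append(sum(values) / len(values))
        return {"means": means}


algorithm = ColumnMeans()                    # 1. Create an algorithm object
algorithm.set_input([[1.0, 10.0],
                     [3.0, None],
                     [5.0, 30.0]])           # 2. Pass in input data
algorithm.skip_missing = True                # 3. Set algorithm-specific parameters
result = algorithm.compute()                 # 4. Compute
means = result["means"]                      # 5. Get results
```

Keeping parameters and input on the algorithm object, and returning a separate result object, lets the same object be re-run with different settings.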
Lab 2: Linear Regression
Learning objectives:
• Understand the two regression algorithms currently available in DAAL
– Linear regression without regularization
– Ridge regression
• Learn the supervised learning workflow
– Train a model using known data
– Test the model by making predictions on new data
• Visualize prediction results
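The train-then-test workflow, with and without regularization, can be sketched in plain Python. This is not the PyDAAL API; the function names and data are illustrative, the model has a single feature, and the intercept is left unpenalized in this sketch.

```python
def fit_ridge(xs, ys, lam=0.0):
    """Fit y ~ slope * x + intercept.

    lam = 0 gives linear regression without regularization;
    lam > 0 gives ridge regression (the slope is shrunk toward zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / (sxx + lam)
    return slope, mean_y - slope * mean_x


def predict(model, xs):
    slope, intercept = model
    return [slope * x + intercept for x in xs]


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Train a model using known data (here y = 2x + 1 exactly).
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [1.0, 3.0, 5.0, 7.0]

# Test the model by making predictions on new data.
test_x = [4.0, 5.0]
test_y = [9.0, 11.0]

ols_model = fit_ridge(train_x, train_y, lam=0.0)
ridge_model = fit_ridge(train_x, train_y, lam=5.0)

ols_error = mean_squared_error(test_y, predict(ols_model, test_x))
ridge_error = mean_squared_error(test_y, predict(ridge_model, test_x))
```

On this noise-free toy data the unregularized fit is exact, so the ridge penalty only adds bias; ridge tends to help when features are noisy or correlated.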
Lab 3: Classification with SVM
Learning objectives:
• Understand the SVM algorithm usage model
– Multi-class classification with SVM
– Two-class classification with SVM
• Understand quality metrics in classification
– Confusion matrix
– Metrics computed using the confusion matrix (accuracy, etc.)
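A confusion matrix and the metrics derived from it can be computed as in this plain-Python sketch. It does not use PyDAAL's quality-metrics API; the function names and the labels are made up for illustration.

```python
def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of observations on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def precision_recall(matrix, positive):
    """Precision and recall for the class at index `positive`."""
    true_positive = matrix[positive][positive]
    predicted_positive = sum(row[positive] for row in matrix)
    actual_positive = sum(matrix[positive])
    return true_positive / predicted_positive, true_positive / actual_positive


labels = ["not spam", "spam"]
actual = ["spam", "spam", "spam", "not spam",
          "not spam", "not spam", "not spam", "not spam"]
predicted = ["spam", "spam", "not spam", "not spam",
             "not spam", "not spam", "spam", "not spam"]

matrix = confusion_matrix(actual, predicted, labels)
overall_accuracy = accuracy(matrix)
spam_precision, spam_recall = precision_recall(matrix, positive=1)
```

Accuracy alone can mislead when classes are imbalanced, which is why per-class precision and recall are read from the same matrix.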
Lab 4: Clustering with K-Means
Learning objectives:
• Understand the K-Means algorithm supported in DAAL
• Learn the basic clustering workflow
– Initialize cluster centroids
– Minimize the goal function
• Visualize clusters
Lab 5: Principal Component Analysis
Learning objectives:
• Understand the PCA algorithms supported in DAAL:
– Correlation matrix method
– SVD method
• Evaluate and visualize principal components
References
Intel DAAL User’s Guide and Reference Manual
– https://software.intel.com/sites/products/documentation/doclib/daal/daal-user-and-reference-guides/index.htm
Intel Distribution for Python Documentation
– https://software.intel.com/en-us/intel-distribution-for-python-support/documentation
What’s Next - Takeaways
Learn more about Intel® DAAL
– It supports C++ and Java, too!
– We want you to use DAAL in your data projects
Learn more about Intel® Distribution for Python
– Beyond machine learning, many more benefits
Keep an eye on the tutorial repository
– https://github.com/daaltces/pydaal-tutorials
– I’m adding more labs, samples, etc.
Zhang Zhang (zhang.zhang@intel.com)
Victoriya Fedotova (victoriya.s.fedotova@intel.com)
www.intel.com/hpcdevcon