Profiling and optimizing for Xeon Phi
with Allinea MAP
Discovering bottlenecks without pain
What is happening?
Single-core era. Constraints: power, complexity of algorithms
Multi-core era. Constraints: power, parallel software availability, scalability
Many-core era. Constraints: programming models
[Chart: performance over time (years)]
Allinea MAP
Increase application performance
• Parallel profiler designed for:
‒ C/C++, Fortran
‒ MPI code (interdependent or independent processes)
‒ Multithreaded code (monitor the main threads for each process)
‒ Accelerated codes (GPUs, Intel Xeon Phi)
• Improve productivity:
‒ Helps you detect performance issues quickly and easily
‒ Tells you immediately where your time is spent in your source code
‒ Helps you to optimize your application efficiently
Allinea MAP 4.1
New features at ISC 2013
• Support for I/O metrics
‒ I/O can be a major bottleneck in HPC systems
‒ Find the optimal configuration for your file system
Benefit: Broader profiling and analysis capabilities to solve even more performance issues.
• Support for Intel Xeon Phi
‒ Already supported on Allinea DDT
‒ Officially extended to profiling
Benefit: Ensure you are getting the best performance from new technology.
Intel Xeon Phi and Allinea
2011
• Started architecture and tools discussions with Intel
• Early development prototypes exchanged
2012
• Full debugger support for Intel MIC architecture
• Official 3.2 release
• Feedback from early adopters
2013
• Profiling support for Intel Xeon Phi announced
• #1 Green 500 system, Xeon Phi-powered Beacon chooses Allinea
• Dramatic surge in interest in debugging and profiling on Xeon Phi
Optimizing for the Xeon Phi
Where do you start?
“Code that’s well-optimized for the host
usually performs pretty well on the cards”
- Pretty much everyone
Optimizing for the Xeon Phi
But what matters?
Vectorization
Other stuff
Performance
Optimizing for the Xeon Phi
Is my code well-vectorized?
… maybe?
Not in this loop
(16.5% of total time)
Optimizing for the Xeon Phi
Non-obvious tradeoffs
Here a loop taking 55% of total runtime isn’t vectorized at all.
Taking the unvectorizable rand() out of the loop allows the sqrt workload to be fully vectorized: reverse loop fusion (that is, loop fission)!
Optimizing for the Xeon Phi
Non-obvious tradeoffs
Now the floating-point workload is fully vectorized.
But all the time is being spent in the random number generation, so that’s what really needs to be optimized.
Optimizing for the Xeon Phi
Know your tools
Replace rand() with Intel’s vectorized version and re-fuse the loop
to retain temporal cache locality benefits
Optimizing for the Xeon Phi
The full picture
You need to see the full picture to spot these
tradeoffs – Allinea MAP shows you the way
Optimizing for the Xeon Phi
Running on the card
Allinea MAP runs with full metrics on Xeon Phi cards!
Optimizing for the Xeon Phi
Running on the card
This makes it easy to compare and learn versus the host
Allinea DDT
Unified interface for debugging
• Full, graphical debugger designed for:
‒ C/C++, Fortran, Xeon Phi, UPC, …
‒ MPI, OpenMP and mixed-mode code
• Unified interface with Allinea MAP:
‒ Just what you need when you’ve added OpenMP and now everything segfaults!
‒ One interface eliminates the learning curve
‒ Spend more time on your results
• Slash your development time:
‒ Reproduces and triggers your bugs instantly
‒ Helps you quickly understand where issues come from
‒ Helps you fix them as swiftly as possible
Allinea Software
• Ten years of high-quality development tools
‒ A leader in the worldwide HPC software tools market
‒ Global customer base
• Making parallel programming accessible to the widest range of scientists and programmers
‒ Design an unrivaled, productive and easy-to-use development environment…
‒ …to help you reach the highest level of performance and scalability
‒ Define a new standard of customer support
Summary
• Allinea’s tools are the premier Xeon Phi development environment
– See at a glance which loops to vectorize and which to ignore
– Full profiling metrics available on the Xeon Phi cards
– Unified interface with Allinea DDT keeps you productive, whatever you’re working on
To learn more, visit us at our booth #655!
Thank you
Your contacts :
– Technical Support team : support@allinea.com
– Sales team : sales@allinea.com
