CALLGRAPH ANALYSIS &
PERFORMANCE TOOLS DEVELOPMENTS
Roberto A. Vitillo
Lawrence Berkeley National Laboratory
ATLAS Software and Computing Workshop, 7 April 2011
COMPILER OPTIONS
• The compiler options that bring the greatest benefits¹ are the ones that let the compiler:
‣ omit the frame pointer register in functions that don’t need one (-fomit-frame-pointer in gcc)
‣ inline functions
• Enable Streaming SIMD Extensions (SSE)
‣ Use SSE for scalar floating point math: gcc-4.3.5 still has some alignment issues, which are solved in gcc-4.4.4
‣ Use SSE vector instructions where possible (see the gcc built-in functions)
‣ Use glibc with SSE support (glibc >= 2.10 with IFUNC), e.g.: improved memcpy, memmove
‣ Enable autovectorization
¹ Kapil Vaswani, P. J. Joseph, Matthew J. Thazhuthaveetil, and Y. N. Srikant. 2007. Microarchitecture sensitive empirical models for
compiler optimizations. In Proceedings of the International Symposium on Code Generation and Optimization.
CALLGRAPH ANALYSIS
• Problem:
‣ low ratio of instructions retired to calls retired (few instructions are executed per call)
‣ high ratio of calls retired to branches retired (a large share of the branches are calls)
• Inlining functions that are called millions of times per event can bring considerable
benefits, e.g.:
‣ Trigger/TrigT1/TrigT1RPChardware - 4% reduction in instructions retired
(RecExCommon/bstoesd, 1 function inlined)
‣ TileCalorimeter/TileCalib/TileCalibBlobObjs - 5% reduction in instructions retired
(RecExCommon/bstoesd, 5 functions inlined)
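The two ratios named in the problem statement can be computed directly from counter totals. The following is a minimal sketch; the function name and the counter values are hypothetical and chosen only for illustration, they are not ATLAS measurements.

```python
def call_ratios(instructions_retired, calls_retired, branches_retired):
    """Return (instructions per call, fraction of branches that are calls)."""
    if calls_retired == 0 or branches_retired == 0:
        raise ValueError("counter totals must be non-zero")
    return (instructions_retired / calls_retired,
            calls_retired / branches_retired)


# Hypothetical counter totals, for illustration only.
instr_per_call, call_fraction = call_ratios(
    instructions_retired=1_000_000,
    calls_retired=20_000,
    branches_retired=200_000,
)
print(f"{instr_per_call:.0f} instructions per call")
print(f"{call_fraction:.0%} of retired branches are calls")
```

A small first number together with a large second number is the pattern the slide describes as the problem.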
CALLGRAPH ANALYSIS
• David Levinthal’s proposal:
‣ “Use Last Branch Records (LBR) and static analysis to
evaluate frequency and cost of function calls”
‣ “Use social network analysis / network theory to identify
clusters of active, costly function call activity”
‣ “Order cluster by total cost and inline”
CALLGRAPH ANALYSIS
• Callgraph of a five-event reconstruction job
(d482) built from callgrind output (40K nodes,
~160K edges)
• For visualization purposes we consider only the
following sub-callgraph (~500 nodes, ~600 edges):
‣ Nodes with > 0.5% of the total executed instructions
‣ Arcs with > 0.1% relative frequency
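The filtering step can be sketched as follows. This is an illustration under assumptions, not the code used for the talk: the dictionary layout is hypothetical, and "relative frequency" is assumed here to mean an arc's share of all calls in the graph.

```python
def filter_callgraph(node_cost, edge_calls, node_min=0.005, arc_min=0.001):
    """Keep nodes above node_min of the total cost and arcs above arc_min of all calls.

    node_cost:  {function: executed instructions}
    edge_calls: {(caller, callee): number of calls}
    """
    total_cost = sum(node_cost.values())
    total_calls = sum(edge_calls.values())
    nodes = {f for f, c in node_cost.items() if c > node_min * total_cost}
    # An arc survives only if both endpoints survive and it is frequent enough.
    edges = {
        (u, v): n
        for (u, v), n in edge_calls.items()
        if u in nodes and v in nodes and n > arc_min * total_calls
    }
    return nodes, edges


# Hypothetical toy callgraph.
node_cost = {"main": 500, "reco": 400, "fit": 99, "log": 1}
edge_calls = {("main", "reco"): 100, ("reco", "fit"): 9895, ("reco", "log"): 5}
nodes, edges = filter_callgraph(node_cost, edge_calls)
print(sorted(nodes), edges)
```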
CALLGRAPH ANALYSIS
• Nodes with a higher weighted degree (WD) are
highlighted with a “warmer” color and drawn
larger
• Where are the clusters? Naive approach:
‣ Build a new subgraph containing only the nodes with
WD > threshold and their respective edges
‣ Find the connected components and compute
the cost of each of them
‣ Inline the clusters in order of descending cost
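[Figure: sub-callgraph with nodes colored and sized by weighted degree; a label in the figure reads ~300M]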
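The naive approach can be sketched in a few lines. The sketch assumes that arc weights are call counts, that the weighted degree of a node is the sum of the weights of all arcs touching it, and that the cost of a cluster is the sum of the weights of its internal arcs; the slides do not define these terms precisely, so all three are assumptions.

```python
from collections import defaultdict


def find_clusters(edges, threshold):
    """edges: {(caller, callee): weight}. Returns [(cost, nodes)] by descending cost."""
    # Weighted degree: sum of the weights of all arcs touching a node.
    wd = defaultdict(int)
    for (u, v), w in edges.items():
        wd[u] += w
        wd[v] += w
    hot = {n for n, d in wd.items() if d > threshold}

    # Subgraph induced by the hot nodes, treated as undirected for clustering.
    sub = {e: w for e, w in edges.items() if e[0] in hot and e[1] in hot}
    adj = defaultdict(set)
    for u, v in sub:
        adj[u].add(v)
        adj[v].add(u)

    clusters, seen = [], set()
    for start in sorted(hot):
        if start in seen:
            continue
        # Depth-first search collects one connected component.
        stack, comp = [start], set()
        while stack:
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(adj[n] - comp)
        seen |= comp
        cost = sum(w for (u, _), w in sub.items() if u in comp)
        clusters.append((cost, sorted(comp)))
    return sorted(clusters, reverse=True)


# Hypothetical toy callgraph with two hot clusters and some cold arcs.
toy = {("a", "b"): 300, ("b", "c"): 200, ("x", "y"): 150,
       ("c", "m"): 1, ("m", "n"): 2}
print(find_clusters(toy, threshold=100))
```

Note that a single cold arc between two hot nodes is enough to merge two clusters, which is one reason the approach is called naive.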
CALLGRAPH ANALYSIS
• A force-based algorithm is used to lay out the graph and visualize the clusters:
‣ Nodes act as point charges
‣ Arcs act as springs
• Callgraph cluster analysis and inlining could be embedded in a compiler through the PGO
component
[Animation: graph layout]
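A force-based layout of this kind can be sketched as below. This is a generic illustration, not the tool used for the animation: the inverse-square repulsion, the linear springs, the step cap and all parameter values are assumptions.

```python
import math
import random


def force_layout(nodes, edges, steps=500, repulsion=1.0, stiffness=0.1,
                 rest_length=1.0, step_size=0.05, max_move=0.1, seed=0):
    """Return {node: [x, y]} after a simple force-directed relaxation."""
    rng = random.Random(seed)  # fixed seed keeps the layout reproducible
    pos = {n: [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for n in nodes}
    for _ in range(steps):
        force = {n: [0.0, 0.0] for n in nodes}
        # Nodes act as point charges: every pair repels with strength ~ 1/d^2.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx, dy = pos[a][0] - pos[b][0], pos[a][1] - pos[b][1]
                dist = math.hypot(dx, dy) or 1e-9
                f = repulsion / dist ** 2
                force[a][0] += f * dx / dist
                force[a][1] += f * dy / dist
                force[b][0] -= f * dx / dist
                force[b][1] -= f * dy / dist
        # Arcs act as springs: endpoints are pulled toward the rest length.
        for a, b in edges:
            dx, dy = pos[b][0] - pos[a][0], pos[b][1] - pos[a][1]
            dist = math.hypot(dx, dy) or 1e-9
            f = stiffness * (dist - rest_length)
            force[a][0] += f * dx / dist
            force[a][1] += f * dy / dist
            force[b][0] -= f * dx / dist
            force[b][1] -= f * dy / dist
        # Move each node along its net force, capping the step for stability.
        for n in nodes:
            fx, fy = force[n]
            norm = math.hypot(fx, fy)
            scale = min(step_size, max_move / norm) if norm > 0 else 0.0
            pos[n][0] += scale * fx
            pos[n][1] += scale * fy
    return pos


# Hypothetical toy graph: two caller/callee pairs with no arc between them.
layout = force_layout(["a", "b", "c", "d"], [("a", "b"), ("c", "d")])
for name, (x, y) in layout.items():
    print(f"{name}: ({x:.2f}, {y:.2f})")
```

Connected nodes settle where spring attraction balances charge repulsion, while unconnected groups drift apart, which is what makes the clusters visible.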
CALLGRAPH ANALYSIS
[Figure: clusters in the sub-callgraph; labeled nodes: CaloCluster::eta, std::_Rb_tree_increment, Trk::RungeKuttaPropagator::propagateWithJacobian, TF1_EvalWrapper::DoEval]
• Also notice the chains: usually a function calls only one “heavy” function
CALLGRAPH ANALYSIS
• More clusters in the full graph: ~1% (2K/160K) of the call spots account for ~90% (14.2G/15.8G) of all function
calls!
• Complete sorted list available on lxplus: ~vitillo/public/callspots
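The figures above are consistent with each other: 2K/160K is 1.25% and 14.2G/15.8G is about 89.9%. Given per-call-spot counts, the smallest set of hottest call spots that reaches a target share of all calls can be found as sketched below; the input format and the toy numbers are hypothetical.

```python
def sites_for_coverage(call_counts, target=0.9):
    """Return the hottest call spots that together reach `target` of all calls."""
    total = sum(call_counts.values())
    ranked = sorted(call_counts.items(), key=lambda kv: kv[1], reverse=True)
    selected, covered = [], 0
    for site, count in ranked:
        if covered >= target * total:
            break
        selected.append(site)
        covered += count
    return selected, covered / total


# Hypothetical call counts per call spot (caller -> callee).
counts = {"A->B": 700, "B->C": 250, "C->D": 30, "D->E": 15, "E->F": 5}
hot_sites, coverage = sites_for_coverage(counts)
print(hot_sites, f"{coverage:.0%}")
```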
CALLGRAPH ANALYSIS
• Not every function can be inlined:
‣ Third-party library functions: use Link Time Optimization (LTO) + Profile
Guided Optimization (PGO) if possible (LTO needs static libraries and a
recent compiler)
‣ Virtual functions: use explicit qualification or the final keyword (C++0x or a custom
patch) where possible, and compiler devirtualization support if available
• To inline functions only at specific call spots: use alternative versions; introduce a pragma
in combination with LTO; or use LTO + PGO
• Conclusion: to try to solve the problem automatically we need LTO (gcc-4.6) and
some form of PGO
WHAT NEXT?
• Problem: high ratio of indirect calls to calls retired
• Possible solution: don’t use position independent code (PIC)
• Will the performance gain outweigh the cost of the
library code that can no longer be shared?
• On x86-64 PIC is mandatory for shared libraries
• At this point, should we consider using static libraries, which
can also be used for LTO?
PERFORMANCE TOOLS
• IgProf
‣ simple tool for measuring and analyzing application memory and performance characteristics
‣ no changes to the application or build process required
‣ fix for Athena developed
• SystemTap
‣ useful to "dynamically instrument" specific functions and much more
‣ provides a simple command line interface and scripting language for writing instrumentation
‣ with the uprobes kernel module it can also be used with userspace code
• Perf-events
‣ Kernel subsystem that provides access to software and hardware performance counters
PERFORMANCE TOOLS
• Collaboration with Google: David Levinthal & Stephane Eranian
• Short-term goal: use kcachegrind to visualize perf-events
reports
‣ Benefit: sampling is an order of magnitude faster than
running instrumented code
• Long-term goal: develop an open source visualizer that shows
the collected data with an emphasis on OO applications
PERFORMANCE TOOLS
• LBR is used to evaluate the frequency of function calls
• Sampling is performed on the BR_RET_EXEC events (available on Sandy Bridge architectures)
‣ BR_CALL_EXEC cannot be used directly with trampolines
• Caveat: LBR is currently not available on the user side of perf-events
‣ A kernel patch to dump the LBR is ready; filtering is still missing
‣ The patch cannot be accepted until a useful use case is integrated within the tools
‣ Proposed simple use case: % of branches inside and outside of a module
• Random sampling has been added to the kernel to avoid synchronization issues
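The proposed use case can be illustrated with a small sketch. It assumes branch records are available as (source address, target address) pairs and that a module occupies one contiguous address range; it also reads "inside and outside" as classifying branches that originate in the module by where they land. None of this reflects the actual kernel patch or the perf-events interface.

```python
def branch_locality(records, module_start, module_end):
    """Classify branches that originate in [module_start, module_end).

    records: iterable of (from_addr, to_addr) pairs, as in LBR entries.
    Returns (fraction landing inside the module, fraction leaving it).
    """
    inside = outside = 0
    for src, dst in records:
        if not (module_start <= src < module_end):
            continue  # the branch does not originate in this module
        if module_start <= dst < module_end:
            inside += 1
        else:
            outside += 1
    total = inside + outside
    if total == 0:
        return 0.0, 0.0
    return inside / total, outside / total


# Hypothetical records; the module is assumed to span 0x1000..0x1fff.
records = [(0x1000, 0x1010), (0x1020, 0x2000), (0x1030, 0x1040),
           (0x3000, 0x1000), (0x1050, 0x1008)]
print(branch_locality(records, 0x1000, 0x2000))
```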
PERFORMANCE TOOLS
• Improve basic block counts by:
‣ using branch records to generate a
software instructions retired event
‣ adhering to flow conservation rules
while keeping the changes
to the sample counts to a minimum
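[Figure: basic blocks B1 and B2 both flow into B3; the sampled counts shown are 3 and 4 for B1 and B2, and 1 for B3]
In general, with sampling: #B1 + #B2 != #B3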
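One way to restore flow conservation with a minimal change is a least-squares projection onto the constraint #B1 + #B2 = #B3. The slides do not say which notion of "minimum" is intended, so the squared-error criterion below is an assumption. Minimizing the sum of squared changes subject to the constraint spreads the residual r = #B1 + #B2 - #B3 equally: each predecessor count is reduced by r/3 and the successor count is increased by r/3.

```python
def enforce_flow_conservation(b1, b2, b3):
    """Adjust sampled counts so that b1 + b2 == b3 with the smallest squared change."""
    residual = b1 + b2 - b3   # violation of the conservation rule
    shift = residual / 3.0    # spread equally over the three counts
    return b1 - shift, b2 - shift, b3 + shift


# Counts taken from the figure: B1 = 3, B2 = 4, B3 = 1.
print(enforce_flow_conservation(3, 4, 1))  # prints (1.0, 2.0, 3.0)
```

A real implementation would also have to keep the counts non-negative and handle many coupled constraints across the whole control flow graph; this sketch covers only the single join shown in the figure.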
CONCLUSIONS
• We knew that there was a problem, and now we know what
to fix in order to solve it
• Inlining the clusters will require some manual changes to
the codebase, but a more general solution needs LTO & PGO
• Avoid PIC?
• Next-generation performance tools with an emphasis on OO
software are needed
