A Study of the Scalability of Stop-the-World
Garbage Collectors on Multicores
          Aliya Ibragimova
        University of Fribourg
Agenda
•   Overview
•   Problem Statement
•   Parallel Scavenge description
•   Identifying bottlenecks
•   Methods and solutions
•   Results
•   Conclusion
Overview
• A Stop-the-World Collector performs garbage
  collection while the application is completely
  stopped
• A Parallel Collector uses multiple threads to
  perform garbage collection

Example: Parallel Scavenge, available in
OpenJDK 7
Problem Statement
   The stop-the-world (STW) algorithm degrades badly beyond
8 cores on a 48-core NUMA machine with OpenJDK 7:

  – Does the Stop-the-World design have intrinsic
    limitations?
  – If not, what are the limitations of the STW approach?
  – How can we improve the current design?
Parallel Scavenge
Contended locks: GC monitor’s lock
Beginning of parallel phase

[Diagram: GC monitor’s lock, GC task queue, GC threads]

Solution: use Michael-Scott lock-free queue
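
The Michael-Scott queue named above can be sketched in Java as follows. This is a minimal illustration of the algorithm, not the HotSpot task queue (HotSpot is written in C++); the names `MsQueue` and `MsQueueDemo` are hypothetical. Threads coordinate only through compare-and-set on the head and tail pointers, so no thread has to take a monitor lock to fetch a task.

```java
import java.util.concurrent.atomic.AtomicReference;

// Minimal Michael-Scott lock-free FIFO queue (illustrative sketch, hypothetical names).
// Null values are not supported: dequeue() returns null to signal "empty".
class MsQueue<T> {
    private static final class Node<T> {
        final T value;
        final AtomicReference<Node<T>> next = new AtomicReference<>(null);
        Node(T value) { this.value = value; }
    }

    private final AtomicReference<Node<T>> head;
    private final AtomicReference<Node<T>> tail;

    MsQueue() {
        Node<T> dummy = new Node<>(null);      // the queue always keeps one dummy node
        head = new AtomicReference<>(dummy);
        tail = new AtomicReference<>(dummy);
    }

    void enqueue(T value) {
        Node<T> node = new Node<>(value);
        while (true) {
            Node<T> last = tail.get();
            Node<T> next = last.next.get();
            if (last != tail.get()) continue;           // tail moved, retry
            if (next == null) {
                if (last.next.compareAndSet(null, node)) {
                    tail.compareAndSet(last, node);     // swing tail; may fail harmlessly
                    return;
                }
            } else {
                tail.compareAndSet(last, next);         // help a lagging enqueuer
            }
        }
    }

    T dequeue() {
        while (true) {
            Node<T> first = head.get();
            Node<T> last = tail.get();
            Node<T> next = first.next.get();
            if (first != head.get()) continue;          // head moved, retry
            if (first == last) {
                if (next == null) return null;          // queue is empty
                tail.compareAndSet(last, next);         // tail is lagging, help it
            } else {
                T value = next.value;
                if (head.compareAndSet(first, next)) return value;
            }
        }
    }
}

class MsQueueDemo {
    // Fills one queue from several threads, then drains it and returns the item count.
    static int fillConcurrently(int threads, int perThread) {
        MsQueue<Integer> queue = new MsQueue<>();
        Thread[] producers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            producers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) queue.enqueue(i);
            });
            producers[t].start();
        }
        for (Thread p : producers) {
            try { p.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
        int drained = 0;
        while (queue.dequeue() != null) drained++;
        return drained;
    }
}
```

In a garbage-collected language the ABA problem of the original algorithm does not arise, which is why this sketch needs no version counters; a C++ implementation inside the JVM would have to handle node reclamation itself.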
Contended locks: GC monitor’s lock
End of parallel phase

[Diagram: GC monitor’s lock, global counter]

Solution: remove redundant synchronization;
          use timestamps to avoid race conditions
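
The slides do not say how the timestamps are used, so the following Java sketch is only one possible reading, with hypothetical names throughout (`PhaseBarrier`, `begin`, `finish`). It assumes the global counter tracks the workers still running, and that each parallel phase carries a stamp so that a late report from an earlier phase cannot corrupt the count of the current one. Stamp and counter share one atomic word, so workers update it with compare-and-set instead of taking the monitor lock.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical end-of-phase barrier: high 32 bits hold the phase stamp,
// low 32 bits hold the number of workers that have not finished yet.
class PhaseBarrier {
    private final AtomicLong state = new AtomicLong(0);

    // Called by the single coordinating thread before workers are released.
    // Assumes 0 <= workers < 2^31. Returns the stamp of the new phase.
    int begin(int workers) {
        int stamp = (int) (state.get() >>> 32) + 1;
        state.set(((long) stamp << 32) | workers);
        return stamp;
    }

    // Called by a worker when it is done. Returns true only for the last
    // worker of the phase identified by `stamp`; stale stamps are ignored.
    boolean finish(int stamp) {
        while (true) {
            long current = state.get();
            if ((int) (current >>> 32) != stamp) return false;  // report from an old phase
            int remaining = (int) current;                      // low 32 bits
            if (remaining <= 0) return false;                   // phase already complete
            if (state.compareAndSet(current, current - 1)) {
                return remaining == 1;                          // we were the last worker
            }
        }
    }
}
```

Only the thread for which `finish` returns true would need to notify the coordinator, so the other workers leave the phase without touching any lock.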
Contended locks: GC monitor’s lock
Idea: remove GC monitor’s lock

1. Task queue
     Use lock-free task queue

2. Barrier at the end of parallel phase
     Remove redundant synchronization

3. Condition variable of the GC monitor
     Replace the condition variable with Linux
     futex wait (FUTEX_WAIT) calls.
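
Java code cannot issue futex system calls directly, so the sketch below swaps in `LockSupport.park`/`unpark` as a stand-in to show the same compare-and-block pattern: a waiter sleeps only while a shared word still holds the value it expects, and a waker changes the word and then wakes the sleepers. The class and method names (`PhaseWord`, `awaitChange`, `advance`) are hypothetical and this is not the mechanism used inside HotSpot.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.LockSupport;

// Futex-style wait/wake emulated with park/unpark (illustrative stand-in only).
class PhaseWord {
    private final AtomicInteger word = new AtomicInteger(0);
    private final ConcurrentLinkedQueue<Thread> waiters = new ConcurrentLinkedQueue<>();

    int current() { return word.get(); }

    // Like a futex wait: block only while the word still equals `expected`.
    void awaitChange(int expected) {
        Thread me = Thread.currentThread();
        waiters.add(me);                      // register before checking the word
        try {
            while (word.get() == expected) {
                LockSupport.park(this);       // may return spuriously, so re-check
            }
        } finally {
            waiters.remove(me);
        }
    }

    // Like a futex wake: change the word first, then wake every registered waiter.
    void advance() {
        word.incrementAndGet();
        for (Thread t : waiters) LockSupport.unpark(t);
    }
}

class PhaseWordDemo {
    // Starts `threads` waiters, advances the word once, and counts how many woke up.
    static int wakeAll(int threads) {
        PhaseWord phase = new PhaseWord();
        int expected = phase.current();
        AtomicInteger woken = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                phase.awaitChange(expected);
                woken.incrementAndGet();
            });
            workers[i].start();
        }
        phase.advance();
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
        return woken.get();
    }
}
```

Because a waiter registers itself before it checks the word, a wake-up that races with the check is not lost: either the waiter sees the new value, or the waker sees the waiter and unparks it.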
Lack of NUMA-awareness

[Diagram: two memory banks, each attached to two CPUs]

NUMA – Non-Uniform Memory Access

• Memory access imbalance
• Memory locality
Lack of NUMA-awareness
 • Interleaved spaces
     – pages are mapped from different nodes with a
       round-robin policy
 • Fragmented spaces
     – a thread allocates memory from the fragment
       associated with the node where it is executing
 • Segregated spaces
     – a fragmented space that may only be accessed
       by GC threads running on the same node
Best performance: fragmented spaces for the young space, interleaved
spaces for the others
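
The three placement policies can be modelled with a small Java sketch. It only simulates which node a page or an allocation belongs to; it does not bind real memory, and the names `NumaPlacement` and `FragmentedSpace` are hypothetical. The sketch is not thread-safe and assumes equally sized fragments, one per node.

```java
// Toy model of NUMA placement policies (simulation only, hypothetical names).
class NumaPlacement {
    // Interleaved space: consecutive pages go to consecutive nodes, round-robin.
    static int interleavedNode(long pageIndex, int nodes) {
        return (int) (pageIndex % nodes);
    }
}

class FragmentedSpace {
    private final long fragmentBytes;   // size of the fragment owned by each node
    private final long[] top;           // bump pointer inside each node's fragment

    FragmentedSpace(int nodes, long fragmentBytes) {
        this.fragmentBytes = fragmentBytes;
        this.top = new long[nodes];
    }

    // Fragmented space: a thread running on `node` allocates from that node's
    // fragment. Returns the offset inside the space, or -1 if the fragment is full.
    long allocate(int node, long bytes) {
        if (top[node] + bytes > fragmentBytes) return -1;
        long address = node * fragmentBytes + top[node];
        top[node] += bytes;
        return address;
    }

    // Node that owns the fragment containing `address`.
    int nodeOf(long address) {
        return (int) (address / fragmentBytes);
    }

    // Segregated space: a GC thread may only touch memory owned by its own node.
    boolean segregatedMayAccess(int gcThreadNode, long address) {
        return nodeOf(address) == gcThreadNode;
    }
}
```

With 8 nodes, interleaving sends page 9 to node 1, while a thread on node 3 allocating in a fragmented space always receives memory that maps back to node 3.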
Results
Resulting GC: NAPS (NUMA-Aware Parallel Scavenge)

Effect of the optimizations on 3
benchmarks:
      • SPECjbb2005
      • SPECjvm2008
      • DaCapo
Test machine: 8 memory nodes, 48 cores, 96 GB RAM, 64-bit Linux 3.0
Results
• NAPS improves performance and scalability over
  Parallel Scavenge in almost all cases
• NAPS performance continues to increase up to 48
  cores
• NAPS reduces pause time by a factor of up to 2.8 in
  the best case
• NAPS improves responsiveness of applications
Conclusion

• This slide is about next steps…
Questions
If you have any questions you are welcome to ask.
