Jean-Philippe BEMPEL
WebScale
@jpbempel
Understanding
JVM GC
2 •
• GC basics
• G1
• Shenandoah
• Azul’s C4
• ZGC
• How to choose a GC algorithm?
Understanding JVM GC: Advanced!
GC Basics
4 •
Generations
5 •
• Traversing references to mark live objects
• Stopping when reaching old generation
• From GC roots (static fields, thread stack, JNI)
Marking for Minor GC
Young Old
6 •
Card table tracks old -> young references
Write barrier updates the card table on each reference assignment
X.f = Y
Card Table
Young 0 0 1
CARD_TABLE[&X >> 9] = 1
mov DWORD PTR [r10+0x6c],r8d
mov r11,r10
shr r11,0x9
mov r8d,0x2383000
mov BYTE PTR [r8+r11*1],r12b
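The barrier above boils down to one shift and one byte store. Here is a minimal Java sketch of that idea, assuming 512-byte cards (the `>> 9` from the slide) and a hypothetical `CardTableSketch` class that models heap addresses as plain `long` values. The dirty value `1` follows the slide's pseudocode; the real HotSpot card table lives in native memory and may use different values.

```java
/** Toy model of a card table: one byte per 512-byte "card" of heap. */
final class CardTableSketch {
    static final int CARD_SHIFT = 9;   // 2^9 = 512-byte cards, as in the slide
    static final byte DIRTY = 1;       // illustrative value, following the pseudocode

    private final long heapBase;
    private final byte[] cards;

    CardTableSketch(long heapBase, long heapSizeBytes) {
        this.heapBase = heapBase;
        this.cards = new byte[(int) (heapSizeBytes >>> CARD_SHIFT)];
    }

    /** Index of the card covering the given (simulated) address. */
    int cardIndex(long address) {
        return (int) ((address - heapBase) >>> CARD_SHIFT);
    }

    /** Post-write barrier for X.f = Y: unconditionally dirty the card holding X. */
    void postWriteBarrier(long addressOfX) {
        cards[cardIndex(addressOfX)] = DIRTY;
    }

    boolean isDirty(long address) {
        return cards[cardIndex(address)] == DIRTY;
    }

    public static void main(String[] args) {
        CardTableSketch table = new CardTableSketch(0x1000, 4096);
        table.postWriteBarrier(0x1000 + 1030);   // X lives in the third card
        System.out.println("card index: " + table.cardIndex(0x1000 + 1030));
        System.out.println("neighbour in same card dirty: " + table.isDirty(0x1000 + 1500));
    }
}
```

A minor GC then only needs to scan old-generation memory covered by dirty cards instead of the whole old generation.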
G1
8 •
• Generational
• Region based
• Pause time target (soft real-time)
• -XX:MaxGCPauseMillis=n (default 200)
• Default GC since JDK9
Garbage First
9 •
Heap divided into fixed-size regions
Regions
10 •
Regions
Credit: Kirk Pepperdine
11 •
• Young collection (STW)
• Initial Mark (STW)
• Concurrent Marking
• Final Remark (STW)
• Cleanup (STW)
• Mixed collection (STW)
G1 phases
12 •
• Stop-The-World event
• Evacuates live objects to Survivor or Old regions
• Only objects in young generation are considered
Young GC
13 •
• Card table per region
• Avoid scanning the entire heap
Remembered Sets
14 •
• For each reference assignment (X.f = Y) we need to check:
• X and Y are NOT in the same region
• Y is not null
• => enqueue for Remembered Set processing
• Refinement threads process the queue
• Additional instructions are added after the assignment
Remembered Sets: Post Write Barrier
if (!isInSameRegion(X, Y)
&& Y != null)
RSEnqueue(X)
mov DWORD PTR [rbp+0x74],r10d
mov r11,rbp
mov r8,r10
shl r8,0x3
xor r8,r11
shr r8,0x14
test r8,r8
je cont
test r10d,r10d
je cont
shr r11,0x9
movabs rcx,0x2965ecc3000
add rcx,r11
cmp BYTE PTR [rcx],0x20
je cont
mov r10,QWORD PTR [r15+0x70]
mov r11,QWORD PTR [r15+0x80]
lock add DWORD PTR [rsp-0x40],0x0
cmp BYTE PTR [rcx],0x0
je cont
mov BYTE PTR [rcx],0x0
test r10,r10
jne 0x000002965edc62bc
mov rdx,r15
movabs r10,0x7ffac2febc30
call r10
jmp cont
mov QWORD PTR [r11+r10*1-0x8],rcx
add r10,0xfffffffffffffff8
mov QWORD PTR [r15+0x70],r10
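The filter chain in the listing can be sketched in Java as follows. This is an illustration under assumptions, not G1's code: addresses are simulated `long` values, `0` stands for null, the 1 MB region size is read off the `shr r8,0x14` instruction, the young (`0x20`) and dirty (`0`) card values are read off the `cmp` instructions, and the clean value, the class name and the queue type are hypothetical. The memory fence (`lock add`) is omitted because the sketch is single-threaded.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

/** Toy model of the G1 post-write barrier filter chain. */
final class G1PostBarrierSketch {
    static final int REGION_SHIFT = 20;   // 1 MB regions, matching `shr r8,0x14`
    static final int CARD_SHIFT = 9;      // 512-byte cards
    static final byte YOUNG = 0x20;       // card value compared in the listing
    static final byte DIRTY = 0;          // card value stored in the listing
    static final byte CLEAN = -1;         // assumed value for an untouched card

    final byte[] cards;
    final Deque<Integer> dirtyCardQueue = new ArrayDeque<>(); // stands in for the per-thread buffer

    G1PostBarrierSketch(long heapSizeBytes) {
        cards = new byte[(int) (heapSizeBytes >>> CARD_SHIFT)];
        Arrays.fill(cards, CLEAN);
    }

    static boolean sameRegion(long x, long y) {
        return ((x ^ y) >>> REGION_SHIFT) == 0;
    }

    /** Called after X.f = Y with the simulated addresses of X and Y. */
    void postWriteBarrier(long x, long y) {
        if (sameRegion(x, y)) return;       // intra-region reference: nothing to remember
        if (y == 0) return;                 // storing null creates no reference
        int card = (int) (x >>> CARD_SHIFT);
        if (cards[card] == YOUNG) return;   // young regions are always collected
        if (cards[card] == DIRTY) return;   // already queued
        cards[card] = DIRTY;
        dirtyCardQueue.add(card);           // refinement threads would drain this
    }

    public static void main(String[] args) {
        G1PostBarrierSketch g1 = new G1PostBarrierSketch(4L << 20);
        g1.postWriteBarrier(0x100200L, 0x100400L);  // same region: filtered out
        g1.postWriteBarrier(0x100200L, 0x300000L);  // cross-region: card enqueued
        g1.postWriteBarrier(0x100210L, 0x300000L);  // same card again: filtered out
        System.out.println(g1.dirtyCardQueue);
    }
}
```

The cheap checks come first so that most stores leave the barrier without touching the queue.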
15 •
• Triggered by the Initiating Heap Occupancy Percent flag (IHOP, defaults to 45%)
• Tries to mark the whole object graph concurrently with the running application
• Based on Tri-color abstraction & Snapshot-At-The-Beginning algorithm
Concurrent Marking
16 •
Concurrent Marking: Tri-Color Abstraction
17 •
Concurrent Marking: Issues
• New allocations during the marking phase can be handled by:
• Automatically marking objects at allocation
• Not considering new allocations for the current cycle
• In the tri-color abstraction, an object is missed only if both conditions hold:
1. The mutator stores a reference to a white object into a black object.
2. All paths from any gray objects to that white object are destroyed.
http://www.memorymanagement.org/glossary/s.html#term-snapshot-at-the-beginning
18 •
Concurrent Marking: Issues
A
B
C
A.field1 = C;
B.field2 = null;
OOPS!
19 •
• 2 ways to avoid missing live objects: Incremental Update or Snapshot-At-The-Beginning (SATB)
• SATB uses a pre-write barrier that records the overwritten reference for marking
• The SATB barrier is only active while marking is running (global state)
Concurrent Marking: Resolving misses
if (SATB_WriteBarrier) {
if (X.f != null)
SATB_enqueue(X.f);
}
cmp BYTE PTR [r15+0x30],0x0
jne 0x000002965edc62e5
[...]
mov r11d,DWORD PTR [rbp+0x74]
test r11d,r11d
je 0x000002965edc6253
mov r10,QWORD PTR [r15+0x38]
mov rcx,r11
shl rcx,0x3
test r10,r10
je 0x000002965edc6318
mov r11,QWORD PTR [r15+0x48]
mov QWORD PTR [r11+r10*1-0x8],rcx
add r10,0xfffffffffffffff8
mov QWORD PTR [r15+0x38],r10
jmp 0x000002965edc6253
mov rdx,r15
movabs r10,0x7ffac2febc50
call r10
jmp 0x000002965edc6253
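To see why recording the overwritten value is enough, the A/B/C scenario from the "Issues" slide can be replayed in a small Java simulation. Everything here is a hypothetical toy model (class `SatbSketch`, one reference field per object, a single-threaded "concurrent" marker); it only shows the barrier logic from the pseudocode above.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

/** Toy tri-color marker with a SATB pre-write barrier. */
final class SatbSketch {
    static final class Node {
        final String name;
        Node field;                        // one reference field is enough for the demo
        Node(String name) { this.name = name; }
    }

    boolean markingActive;                         // the global "marking is on" state
    final Set<Node> marked = new HashSet<>();      // gray or black objects
    final Deque<Node> grayStack = new ArrayDeque<>();
    final Deque<Node> satbQueue = new ArrayDeque<>();

    /** obj.field = newValue, preceded by the SATB pre-write barrier. */
    void store(Node obj, Node newValue) {
        if (markingActive && obj.field != null) {
            satbQueue.add(obj.field);      // remember the reference about to be overwritten
        }
        obj.field = newValue;
    }

    /** Turns a white object gray. */
    void shade(Node n) {
        if (n != null && marked.add(n)) grayStack.push(n);
    }

    /** Scans one gray object, turning it black. */
    void scanOne() {
        if (!grayStack.isEmpty()) shade(grayStack.pop().field);
    }

    /** Finishes marking, draining the SATB queue as part of the work. */
    void finishMarking() {
        while (!grayStack.isEmpty() || !satbQueue.isEmpty()) {
            while (!satbQueue.isEmpty()) shade(satbQueue.poll());
            scanOne();
        }
        markingActive = false;
    }

    public static void main(String[] args) {
        SatbSketch gc = new SatbSketch();
        Node a = new Node("A"), b = new Node("B"), c = new Node("C");
        b.field = c;
        gc.markingActive = true;
        gc.shade(b);
        gc.shade(a);
        gc.scanOne();          // A is now black, B is gray, C is white
        gc.store(a, c);        // A.field1 = C
        gc.store(b, null);     // B.field2 = null, the barrier records C
        gc.finishMarking();
        System.out.println("C marked: " + gc.marked.contains(c));
    }
}
```

Without the barrier in `store`, C would stay white: A is already black and is never rescanned, and B no longer points to C.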
20 •
• At the end of marking, we have per-region liveness information
• Regions are sorted by liveness (ascending)
• Regions full of garbage are reclaimed during the Cleanup STW phase
• CollectionSet is built based on:
• Liveness, up to thresholds (G1HeapWastePercent, G1MixedGCLiveThresholdPercent)
• Maximum number of regions (G1OldCSetRegionThresholdPercent)
CollectionSet
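The selection described on this slide can be approximated as "sort by liveness, then take regions until a limit is hit". The Java sketch below is a simplification under assumptions: `Region`, `choose` and both parameters are hypothetical names, the live threshold plays the role of G1MixedGCLiveThresholdPercent, the region cap plays the role of G1OldCSetRegionThresholdPercent, and the heap-waste check is left out.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Toy "garbage first" selection of old regions. */
final class CollectionSetSketch {
    static final class Region {
        final int id;
        final int livePercent;   // share of the region still occupied by live objects
        Region(int id, int livePercent) {
            this.id = id;
            this.livePercent = livePercent;
        }
    }

    /** Picks the emptiest regions first, up to a liveness threshold and a region cap. */
    static List<Region> choose(List<Region> oldRegions, int liveThresholdPercent, int maxRegions) {
        List<Region> candidates = new ArrayList<>(oldRegions);
        candidates.sort(Comparator.comparingInt((Region r) -> r.livePercent));
        List<Region> collectionSet = new ArrayList<>();
        for (Region r : candidates) {
            // Sorted ascending, so every later region is at least as live as this one.
            if (r.livePercent > liveThresholdPercent || collectionSet.size() == maxRegions) break;
            collectionSet.add(r);
        }
        return collectionSet;
    }

    public static void main(String[] args) {
        List<Region> old = List.of(new Region(1, 90), new Region(2, 10),
                                   new Region(3, 50), new Region(4, 30));
        for (Region r : choose(old, 85, 2)) {
            System.out.println("region " + r.id + " live=" + r.livePercent + "%");
        }
    }
}
```

Regions with little live data are cheap to evacuate and free the most memory, which is where the name "Garbage First" comes from.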
21 •
• Based on the CollectionSet, G1 schedules the collection of part of the old regions
• When a young GC is triggered, old regions to collect are piggybacked onto it
• Not all old regions are collected at once, to avoid wasting time and to meet the pause goal
• Several young GCs can be used to collect old regions (mixed events)
Mixed GC
22 •
Mixed GC
23 •
• Still falls back to Full GC (single-threaded before JDK 10)
• Fragmentation can still happen (regions with a lot of live objects)
• Still unpredictable
FullGC
Shenandoah
25 •
• Non-generational (still option for partial collection)
• Region based
• Use Read Barrier: Brooks pointer
• Self-Healing
• Cooperation between mutator threads & GC threads
• Only for concurrent compaction
• Mostly based on G1 but with concurrent compaction
Shenandoah GC
26 •
• Initial Marking (STW)
• Concurrent Marking
• Final Remark (STW)
• Concurrent Cleanup
• Concurrent Evacuation
• Init Update References (STW)
• Concurrent Update References
• Final Update References (STW)
• Concurrent Cleanup
Shenandoah Phases
27 •
• SATB-style (like G1)
• 2 STW pauses for Initial Mark & Final Remark
• Conditional Write Barrier
• To deal with concurrent modification of object graph
Concurrent Marking
28 •
• Same principle as G1:
• Build CollectionSet with Garbage First!
• Evacuate to new regions to release the region for reuse
• Concurrent Evacuation done with the help of:
• 1 Read Barrier : Brooks pointer
• 4 Write Barriers
• Barriers help to keep the to-space invariant:
• All Writes are made into an object in to-space
Concurrent Evacuation
29 •
• All objects have an additional forwarding pointer
• Placed before the regular object
• Dereference the forwarding pointer for each access
• Memory footprint overhead
• Throughput overhead
Brooks pointers
Header
Brooks pointer
mov r13,QWORD PTR [r12+r14*8-0x8]
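The single `mov` above is the whole read barrier: load the forwarding pointer stored just before the object, then use the result. A minimal Java sketch of that indirection, assuming a hypothetical `Obj` class where the Brooks pointer is modelled as an ordinary field that initially points to the object itself:

```java
/** Toy model of Brooks forwarding pointers. */
final class BrooksPointerSketch {
    static final class Obj {
        volatile Obj forwardee = this;   // Brooks pointer: self until the object is copied
        int value;
        Obj(int value) { this.value = value; }
    }

    /** Read barrier: always go through the forwarding pointer. */
    static Obj resolve(Obj ref) {
        return ref.forwardee;
    }

    static int readValue(Obj ref) {
        return resolve(ref).value;
    }

    public static void main(String[] args) {
        Obj fromSpace = new Obj(7);
        System.out.println(readValue(fromSpace));   // not copied yet: reads the original

        Obj toSpace = new Obj(fromSpace.value);     // the GC copies the object...
        fromSpace.forwardee = toSpace;              // ...and installs the forwarding pointer
        toSpace.value = 9;
        System.out.println(readValue(fromSpace));   // stale reference still sees the new copy
    }
}
```

The extra word per object and the extra load per access are the footprint and throughput overheads listed on the slide.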
30 •
Concurrent Copy: GC thread
Header
Brooks pointer
Header
Brooks pointer
From-Space To-Space
GC thread
31 •
Concurrent Copy: Reader threads
Header
Brooks pointer
From-Space To-Space
Reader
thread
Reader
thread
32 •
Concurrent Copy: Writer threads
Header
Brooks pointer
Header
Brooks pointer
From-Space To-Space
Writer
thread
Writer
thread
Header
Brooks pointer
33 •
• Any write (even of primitives) to a from-space object needs to be protected
• Exotic barriers:
• acmp (pointer comparison)
• CAS
• clone
Write Barriers
if (evacInProgress
&& inCollectionSet(obj)
&& notCopyYet(obj)) {
evacuateObject(obj)
}
test BYTE PTR [r15+0x3c0],0x2
jne 0x000000000281bcbc
[...]
mov r10d,DWORD PTR [r13+0xc]
test r10d,r10d
je 0x000000000281bc2b
mov r11,QWORD PTR [r15+0x360]
mov rcx,r10
shl rcx,0x3
test r11,r11
je 0x000000000281bd0d
[...]
mov rdx,r15
movabs r10,0x62d1f660
call r10
jmp 0x000000000281bc2b
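The pseudocode above can be fleshed out into a small Java model that shows how a writer and the GC may race to copy the same object. This is a sketch under assumptions: `EvacBarrierSketch`, `Obj` and the use of `AtomicReference.compareAndSet` to install the forwarding pointer are illustrative choices, and "to-space" is simply a freshly allocated object.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Toy model of a write barrier that evacuates before writing. */
final class EvacBarrierSketch {
    static final class Obj {
        final AtomicReference<Obj> forwardee = new AtomicReference<>();
        volatile int value;
        boolean inCollectionSet;
        Obj(int value) {
            this.value = value;
            forwardee.set(this);                     // not copied yet
        }
    }

    volatile boolean evacInProgress;

    /** Returns the copy that is safe to write to (the to-space invariant). */
    Obj writeBarrier(Obj obj) {
        Obj current = obj.forwardee.get();
        if (evacInProgress && obj.inCollectionSet && current == obj) {
            Obj copy = new Obj(obj.value);           // speculative copy into "to-space"
            if (obj.forwardee.compareAndSet(obj, copy)) {
                return copy;                         // this thread won the race
            }
            return obj.forwardee.get();              // someone else copied first: use theirs
        }
        return current;
    }

    void write(Obj obj, int newValue) {
        writeBarrier(obj).value = newValue;
    }

    int read(Obj obj) {
        return obj.forwardee.get().value;            // read barrier via the forwarding pointer
    }

    public static void main(String[] args) {
        EvacBarrierSketch gc = new EvacBarrierSketch();
        Obj obj = new Obj(1);
        obj.inCollectionSet = true;
        gc.evacInProgress = true;
        gc.write(obj, 42);
        System.out.println("read via barrier: " + gc.read(obj));
        System.out.println("from-space copy:  " + obj.value);
    }
}
```

Because the losing thread discards its copy and adopts the winner's, all writes end up in a single to-space object.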
34 •
• Late memory release
• Only happens when all refs updated (Concurrent Cleanup phase)
• Allocations can overrun the GC
• Failure modes:
• Pacing
• Degenerated GC
• FullGC
Extreme cases
Azul’s C4
36 •
• Generational (young & old)
• Region based (pages)
• Use Read Barrier: Loaded Value Barrier
• Self-Healing
• Cooperation between mutator threads & GC threads
• Pauseless algorithm but implementation requires safepoints
• Pauses are most of the time < 1ms
Continuously Concurrent Compacting Collector
37 •
• Baker-style Barrier
• move objects through forwarding addresses stored aside
• Applied at load time, not when dereferencing
• Ensure C4 invariants:
• Marked Through the current cycle
• Not relocated
• If not => Self-healing process to correct it
• Mark object
• Relocate & correct reference
• Checked for each reference loads
• Benefits from JIT optimization for caching loaded value (registers)
LVB
38 •
• States of objects stored inside reference address => Colored pointers
• NMT bit
• Generation
• Checked against a global expected value during the GC cycle
• Thread local, almost always L1 cache hits
• Register
• Relocated: the x86 implementation uses traps from VM memory translation (guest/host)
• Intel EPT
• AMD NPT
LVB
test r9, rax
jne 0x3001443b
mov r10d, dword ptr [rax + 8]
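The `test`/`jne` pair is the fast path of the barrier: compare the state bits of the freshly loaded reference with the expected value and branch to a slow path only on mismatch. The Java sketch below models only the NMT check and the self-healing store. The bit position, the class name `LvbSketch` and the slot array are assumptions for illustration, and the slow path here merely recolors the reference where a real collector would mark or relocate the object.

```java
import java.util.concurrent.atomic.AtomicLongArray;

/** Toy model of a Loaded Value Barrier with self-healing. */
final class LvbSketch {
    static final long NMT_BIT = 1L << 62;   // hypothetical position of the NMT bit

    long expectedNmt = 0;                    // flipped at the start of each GC cycle
    int slowPathHits = 0;
    final AtomicLongArray heapSlots;         // each slot holds one colored reference

    LvbSketch(int slots) {
        heapSlots = new AtomicLongArray(slots);
    }

    /** Loads the reference stored in `slot`, applying the barrier at load time. */
    long loadReference(int slot) {
        long ref = heapSlots.get(slot);
        if (ref != 0 && (ref & NMT_BIT) != expectedNmt) {
            slowPathHits++;
            long healed = (ref & ~NMT_BIT) | expectedNmt;   // stand-in for mark/relocate
            heapSlots.compareAndSet(slot, ref, healed);      // self-healing: fix the source slot
            ref = healed;
        }
        return ref;
    }

    public static void main(String[] args) {
        LvbSketch lvb = new LvbSketch(1);
        lvb.heapSlots.set(0, 0x1000L);
        lvb.expectedNmt = NMT_BIT;           // a new cycle starts
        lvb.loadReference(0);                // first load takes the slow path and heals the slot
        lvb.loadReference(0);                // second load stays on the fast path
        System.out.println("slow path hits: " + lvb.slowPathHits);
    }
}
```

Healing the memory location, not just the loaded value, is what keeps the slow path rare: each stale reference triggers it at most once.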
39 •
Virtual Memory vs Physical Memory
Virtual Memory
Physical Memory
0 2^64
0 2^37
40 •
• All phases are fully parallel & concurrent
• No "rush" to finish phases
• No constraint about STW pause to be short
• Physical memory released quickly in relocation phase
• Can be reused for new allocations
• Plenty of virtual space vs physical memory
C4 Phases
41 •
• Mark
• Marking all objects in graph
• Relocation
• Moving objects to release pages
• Remap
• Fixup references in object graph
• Folded with next mark cycle
C4 Phases
42 •
• Incremental Update Marking (vs SATB)
• Single pass
• No final mark/remark
• Self-Healing: marks objects that are not marked for the current cycle
Mark Phase
43 •
Mark Phase: Concurrent Modification
A
B
C
A.field1 = C;
B.field2 = null;
LVB
44 •
• Scanning roots (static variables, thread stacks, registers, JNI handles)
• GC threads scan stalled threads
• Running threads scan their own stack, stopping individually at a safepoint
• Scanning the object graph like a parallel collector
• Newly allocated objects go into new pages, not considered for reclaim (relocation)
• For each page, live data bytes are summed and used to select pages to reclaim
Mark Phase
45 •
• Select pages with the greatest number of dead objects (garbage first!)
• Protect the selected pages from being accessed by mutator threads
• Move objects to newly allocated pages
• Build side arrays (off-heap tables) for forwarding information
• Self-Healing: as the page is protected, the LVB triggers a trap to:
• Copy the object to the new location if not done yet
• Use the forwarding pointer to fix the reference
Relocation Phase
46 •
Virtual
Physical
Relocation Phase
Forwarding table
47 •
• Mutators rarely stall on accessing a ref, as mostly dead pages are processed
• Once the object copy is done, physical memory is released (Quick Release)
• Can be immediately reused (remapped) to satisfy new allocations
• Evacuated pages are still mapped & protected to help the remap phase
• Cannot be released until all objects are remapped
• Not a problem as we have a huge virtual address space
Relocation Phase
48 •
• Traverse Object Graph and fixup references
• Execute LVB barrier for each object
• Self-Healing: fixup references using forward information
• As we traverse again, mark for the next phase
• Mark & Remap phases are folded!
Remap Phase
49 •
• Algorithm requires a sustainable rate of remapping operations
• Linux limitations:
• TLB invalidation
• Only 4KB pages can be remapped
• Single-threaded remapping (write lock)
• A kernel module implements an API for the Zing JVM that significantly increases the remapping rate
• It also implements virtual address aliasing for addressing objects with metadata
Remap – Kernel module
50 •
• Young & Old collections are done by the same algorithm and can run concurrently
• Sizes of the generations are dynamically adjusted
• Card Marking with write barrier (Stored Value Barrier)
• Old collection is based on young-to-old roots generated by the previous young cycle
• Young collection performs card scanning per page
• Holds off any concurrent Old collection on the page being scanned
Generational
51 •
• Used by Hadoop Name Node
• 580GB Heap
• Very hard to tune with G1
• No issue so far regarding GC since production roll out (Oct 2017)
C4 @ Criteo
Z GC
53 •
• Non generational
• Region based (zPages, dynamically sized)
• Concurrent Marking, Compaction, Ref processing
• Use Colored Pointers & Read/Load Barrier
• Self-Healing
• Cooperation between mutator threads & GC threads
• Experimental in JDK 11 (-XX:+UnlockExperimentalVMOptions -XX:+UseZGC)
Z GC
mov r10,QWORD PTR [r11+0xb0]
test QWORD PTR [r15+0x20],r10
jne 0x00007f9594cc54b5
54 •
Z GC
55 •
• Initial Mark (STW)
• Concurrent Mark/Remap
• Final Mark (STW)
• Concurrent Prepare for Relocation
• Start Relocate (STW)
• Concurrent Relocate
Z GC phases:
56 •
• Store metadata in unused bits of reference address
• 42 bits for addressing (4TB)
• 4 bits for metadata
• Marked0
• Marked1
• Remapped
• Finalizable
Colored Pointers
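The layout on this slide translates directly into bit masks. The Java sketch below assumes the metadata bits sit right above the 42 address bits in the order listed (Marked0, Marked1, Remapped, Finalizable), which matches the `001`/`010`/`100` prefixes shown on the multi-mapping slide. The class and method names are hypothetical, and `badMaskFor` is a simplified stand-in for the mask tested by the load barrier.

```java
/** Toy model of ZGC-style colored pointers held in a 64-bit long. */
final class ColoredPointerSketch {
    static final int ADDRESS_BITS = 42;                        // 2^42 bytes = 4 TB
    static final long ADDRESS_MASK = (1L << ADDRESS_BITS) - 1;
    static final long MARKED0 = 1L << 42;
    static final long MARKED1 = 1L << 43;
    static final long REMAPPED = 1L << 44;
    static final long FINALIZABLE = 1L << 45;
    static final long METADATA_MASK = MARKED0 | MARKED1 | REMAPPED | FINALIZABLE;

    /** Builds a pointer to `address` carrying the given color bit. */
    static long color(long address, long colorBit) {
        return (address & ADDRESS_MASK) | colorBit;
    }

    /** Strips the metadata bits to recover the plain address. */
    static long address(long pointer) {
        return pointer & ADDRESS_MASK;
    }

    /** Every metadata bit except the currently good color counts as bad. */
    static long badMaskFor(long goodColor) {
        return METADATA_MASK & ~goodColor;
    }

    /** Load-barrier fast-path test: any bad bit set sends the load to the slow path. */
    static boolean needsSlowPath(long pointer, long badMask) {
        return (pointer & badMask) != 0;
    }

    public static void main(String[] args) {
        long p = color(0x1234L, MARKED0);
        System.out.println(Long.toHexString(address(p)));
        System.out.println(needsSlowPath(p, badMaskFor(MARKED0)));
        System.out.println(needsSlowPath(p, badMaskFor(REMAPPED)));
    }
}
```

Switching GC phases then only requires changing the global bad mask; pointers with a stale color fail the test on their next load and get healed.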
57 •
• Colored pointers need to be unmasked before dereferencing
• Some HW supports masking (SPARC, AArch64)
• On Linux/Windows, overhead if done with classical instructions
• Only one view is active at any point
• Plenty of Virtual Space
Multi-Mapping
58 •
Multi-Mapping
Virtual Memory
Physical Memory
0 2^64
0 2^37
(marked0)
001<address>
(marked1)
010<address>
(remapped)
100<address>
59 •
• Page sizes are multiples of 2MB
• 3 different groups
• Small: 2MB pages with object size <= 256KB
• Medium: 32MB pages with object size <= 4MB
• Large: 2MB pages, an object spans several of them
• Objects in the Large group are not meant to be relocated (too expensive)
Page Allocations
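The grouping rule can be written as a small classification function. This Java sketch uses only the thresholds from the slide; the names `ZPageSketch`, `groupFor` and `largePageSize` are hypothetical, and rounding a large object up to the next 2MB multiple is an assumption derived from "pages are multiples of 2MB".

```java
/** Toy model of the page group chosen for an allocation of a given size. */
final class ZPageSketch {
    enum Group { SMALL, MEDIUM, LARGE }

    static final long KB = 1024;
    static final long MB = 1024 * KB;
    static final long GRANULE = 2 * MB;      // all page sizes are multiples of this

    static Group groupFor(long objectSizeBytes) {
        if (objectSizeBytes <= 256 * KB) return Group.SMALL;    // lives in a 2MB page
        if (objectSizeBytes <= 4 * MB) return Group.MEDIUM;     // lives in a 32MB page
        return Group.LARGE;                                     // gets a page of its own
    }

    /** Size of the page backing a large object: rounded up to a multiple of 2MB. */
    static long largePageSize(long objectSizeBytes) {
        return (objectSizeBytes + GRANULE - 1) / GRANULE * GRANULE;
    }

    public static void main(String[] args) {
        System.out.println(groupFor(100));
        System.out.println(groupFor(1 * MB));
        System.out.println(groupFor(5 * MB) + " in a page of " + largePageSize(5 * MB) / MB + "MB");
    }
}
```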
60 •
• Handling remapping
• C4: Memory protection + trap
• Z: mask in colored pointer
• Unmasking ref addresses
• C4: Kernel module aliasing
• Z: Multi-mapping or HW support
• Pages & Relocation
• C4:
• Pages have a fixed size matching the OS page size (memory protection)
• relocation for large objects by remapping
• Z:
• zPages are dynamic, a zPage can be 100MB large
• No relocation for large objects
Difference between C4 & Z GC
How to choose a GC algorithm
62 •
• Case 1:
• Need maximum of work done in a time frame (offline job)
• Can afford FullGC of several seconds
=> Use a throughput collector like ParallelGC or G1
• Case 2:
• Have time constraint per unit of work (online job)
• Cannot afford FullGC of several seconds
=> Use a low latency collector like C4, Shenandoah or Z
Throughput vs Latency
63 •
• You have to run on Windows
• Shenandoah
• Battle-tested GC (maturity)
• C4
• Shenandoah
• Minimizing any kind of JVM pauses
• C4
• Z
• You don’t want to pay for it:
• Shenandoah
• Z
Low latency GCs
References
65 •
• Java Garbage Collection distilled by Martin Thompson
• The Java GC mini book
• Oracle’s white paper on JVM memory management & GC
• What differences JVM makes by Nitsan Wakart
• Memory Management Reference
• IBM Pause-Less GC
References GC Basics
66 •
• Garbage-First Garbage Collection (2004)
• G1 One Garbage Collector to rule them all by Monica Beckwith
• Tips for Tuning The G1 GC by Monica Beckwith
• G1 Garbage Collector Details and Tuning by Simone Bordet
• Write Barriers in Garbage-First Garbage Collector by Monica Beckwith
References G1
67 •
• Shenandoah: An open-source concurrent compacting garbage collector for OpenJDK
• Shenandoah: The Garbage Collector That Could by Aleksey Shipilev
• Shenandoah GC Wiki
References Shenandoah
68 •
• The Pauseless GC algorithm (2005)
• C4: Continuously Concurrent Compacting Collector (2011)
• Azul GC in Detail by Charles Humble
• 2010 version source code
References C4
69 •
• ZGC - Low Latency GC for OpenJDK by Per Liden
• Java's new Z Garbage Collector (ZGC) is very exciting by Richard Warburton
• A first look into ZGC by Dominik Inführ
• Architectural Comparison with C4/Pauseless
References ZGC
Thank You!
@jpbempel

More Related Content

PPTX
Garbage First Garbage Collector (G1 GC) - Migration to, Expectations and Adva...
ODP
G1 Garbage Collector: Details and Tuning
PPT
Real time scheduling - basic concepts
PPTX
Garbage collection algorithms
PPTX
Job sequencing with deadlines(with example)
PPTX
Garbage First Garbage Collector (G1 GC): Current and Future Adaptability and ...
PDF
XPDDS17: Shared Virtual Memory Virtualization Implementation on Xen - Yi Liu,...
PDF
Understanding Memory Management In Spark For Fun And Profit
Garbage First Garbage Collector (G1 GC) - Migration to, Expectations and Adva...
G1 Garbage Collector: Details and Tuning
Real time scheduling - basic concepts
Garbage collection algorithms
Job sequencing with deadlines(with example)
Garbage First Garbage Collector (G1 GC): Current and Future Adaptability and ...
XPDDS17: Shared Virtual Memory Virtualization Implementation on Xen - Yi Liu,...
Understanding Memory Management In Spark For Fun And Profit

What's hot (20)

PPTX
Buddy Memory Allocation system
PDF
Semaphores
PDF
ETL and Event Sourcing
PDF
GDB Rocks!
PDF
A Deep Dive into Query Execution Engine of Spark SQL
PPTX
Deadlock Avoidance in Operating System
PPTX
Low latency in java 8 by Peter Lawrey
PPT
RTOS Basic Concepts
PPTX
Heap Management
PDF
Microservices Tracing With Spring Cloud and Zipkin @Szczecin JUG
PDF
ZGC-SnowOne.pdf
PPTX
Round robin scheduling
PPTX
RTOS- Real Time Operating Systems
PPTX
Real time Operating System
PDF
Os lab final
PPT
pushdown automata
PPTX
Presentation on Breadth First Search (BFS)
PPTX
3. planning in situational calculas
PPT
Chapter 8 - Main Memory
Buddy Memory Allocation system
Semaphores
ETL and Event Sourcing
GDB Rocks!
A Deep Dive into Query Execution Engine of Spark SQL
Deadlock Avoidance in Operating System
Low latency in java 8 by Peter Lawrey
RTOS Basic Concepts
Heap Management
Microservices Tracing With Spring Cloud and Zipkin @Szczecin JUG
ZGC-SnowOne.pdf
Round robin scheduling
RTOS- Real Time Operating Systems
Real time Operating System
Os lab final
pushdown automata
Presentation on Breadth First Search (BFS)
3. planning in situational calculas
Chapter 8 - Main Memory
Ad

Similar to Understanding jvm gc advanced (20)

PDF
Understanding JVM GC: advanced!
PDF
Understanding low latency jvm gcs
PDF
Understanding low latency jvm gcs V2
PDF
Demystifying Garbage Collection in Java
PPTX
OpenJDK Concurrent Collectors
PDF
New Algorithms in Java
PPT
Garbage collection in JVM
PPT
Lp seminar
PPTX
Java garbage collection & GC friendly coding
PDF
Compiler Construction | Lecture 15 | Memory Management
PPTX
Java GC
PPT
Chapter 7 Run Time Environment
PDF
JVM Memory Management Details
PPTX
Intro to Garbage Collection
ODP
Garbage Collection in Hotspot JVM
ODP
Quick introduction to Java Garbage Collector (JVM GC)
ODP
Gc algorithms
PPTX
Garbage collection
PDF
OPENJDK: IN THE NEW AGE OF CONCURRENT GARBAGE COLLECTORS
PDF
“Show Me the Garbage!”, Garbage Collection a Friend or a Foe
Understanding JVM GC: advanced!
Understanding low latency jvm gcs
Understanding low latency jvm gcs V2
Demystifying Garbage Collection in Java
OpenJDK Concurrent Collectors
New Algorithms in Java
Garbage collection in JVM
Lp seminar
Java garbage collection & GC friendly coding
Compiler Construction | Lecture 15 | Memory Management
Java GC
Chapter 7 Run Time Environment
JVM Memory Management Details
Intro to Garbage Collection
Garbage Collection in Hotspot JVM
Quick introduction to Java Garbage Collector (JVM GC)
Gc algorithms
Garbage collection
OPENJDK: IN THE NEW AGE OF CONCURRENT GARBAGE COLLECTORS
“Show Me the Garbage!”, Garbage Collection a Friend or a Foe
Ad

More from Jean-Philippe BEMPEL (15)

PDF
Mastering GC.pdf
PDF
Javaday 2022 - Remèdes aux oomkill, warm-ups, et lenteurs pour des conteneur...
PDF
Devoxx Fr 2022 - Remèdes aux oomkill, warm-ups, et lenteurs pour des conteneu...
PDF
Tools in action jdk mission control and flight recorder
PPTX
Clr jvm implementation differences
PDF
Le guide de dépannage de la jvm
PDF
Out ofmemoryerror what is the cost of java objects
PDF
OutOfMemoryError : quel est le coût des objets en java
PDF
Low latency & mechanical sympathy issues and solutions
PDF
Lock free programming - pro tips devoxx uk
PDF
Lock free programming- pro tips
PDF
Programmation lock free - les techniques des pros (2eme partie)
PDF
Programmation lock free - les techniques des pros (1ere partie)
PDF
Measuring directly from cpu hardware performance counters
PDF
Devoxx france 2014 compteurs de perf
Mastering GC.pdf
Javaday 2022 - Remèdes aux oomkill, warm-ups, et lenteurs pour des conteneur...
Devoxx Fr 2022 - Remèdes aux oomkill, warm-ups, et lenteurs pour des conteneu...
Tools in action jdk mission control and flight recorder
Clr jvm implementation differences
Le guide de dépannage de la jvm
Out ofmemoryerror what is the cost of java objects
OutOfMemoryError : quel est le coût des objets en java
Low latency & mechanical sympathy issues and solutions
Lock free programming - pro tips devoxx uk
Lock free programming- pro tips
Programmation lock free - les techniques des pros (2eme partie)
Programmation lock free - les techniques des pros (1ere partie)
Measuring directly from cpu hardware performance counters
Devoxx france 2014 compteurs de perf

Recently uploaded (20)

PDF
Optimiser vos workloads AI/ML sur Amazon EC2 et AWS Graviton
PDF
Assigned Numbers - 2025 - Bluetooth® Document
PDF
NewMind AI Weekly Chronicles - August'25-Week II
PDF
Electronic commerce courselecture one. Pdf
PDF
Network Security Unit 5.pdf for BCA BBA.
PPTX
20250228 LYD VKU AI Blended-Learning.pptx
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PPTX
ACSFv1EN-58255 AWS Academy Cloud Security Foundations.pptx
PPTX
sap open course for s4hana steps from ECC to s4
DOCX
The AUB Centre for AI in Media Proposal.docx
PDF
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
PPTX
Cloud computing and distributed systems.
PPTX
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
PDF
Diabetes mellitus diagnosis method based random forest with bat algorithm
PDF
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
PDF
Machine learning based COVID-19 study performance prediction
PPTX
MYSQL Presentation for SQL database connectivity
PDF
A comparative analysis of optical character recognition models for extracting...
PPTX
A Presentation on Artificial Intelligence
Optimiser vos workloads AI/ML sur Amazon EC2 et AWS Graviton
Assigned Numbers - 2025 - Bluetooth® Document
NewMind AI Weekly Chronicles - August'25-Week II
Electronic commerce courselecture one. Pdf
Network Security Unit 5.pdf for BCA BBA.
20250228 LYD VKU AI Blended-Learning.pptx
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
ACSFv1EN-58255 AWS Academy Cloud Security Foundations.pptx
sap open course for s4hana steps from ECC to s4
The AUB Centre for AI in Media Proposal.docx
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
Cloud computing and distributed systems.
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
Diabetes mellitus diagnosis method based random forest with bat algorithm
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
Digital-Transformation-Roadmap-for-Companies.pptx
Machine learning based COVID-19 study performance prediction
MYSQL Presentation for SQL database connectivity
A comparative analysis of optical character recognition models for extracting...
A Presentation on Artificial Intelligence

Understanding jvm gc advanced

  • 2. 2 • • GC basics • G1 • Shenandoah • Azul’s C4 • ZGC • How to choose a GC algorithm? Understanding JVM GC: Advanced!
  • 5. 5 • • Traversing references to mark live objects • Stopping when reaching old generation • From GC roots (static fields, thread stack, JNI) Marking for Minor GC Young Old
  • 6. 6 • Card Table for references old -> young references Write barrier to update card table on assignation X.f = Y Card Table Young 0 0 1 CARD_TABLE[&X >> 9] = 1 mov DWORD PTR [r10+0x6c],r8d mov r11,r10 shr r11,0x9 mov r8d,0x2383000 mov BYTE PTR [r8+r11*1],r12b
  • 7. G1
  • 8. 8 • • Generational • Region based • Pause time target (soft real-time) • -XX:MaxGCPauseMillis=n (default 200) • Default GC since JDK9 Garbage First
  • 9. 9 • Heap divided into fixed-size regions Regions
  • 11. 11 • • Young collection (STW) • Initial Mark (STW) • Concurrent Marking • Final Remark (STW) • Cleanup (STW) • Mixed collection (STW) G1 phases
  • 12. 12 • • Stop-The-World event • Evacuates live objects to Survivor or Old regions • Only objects in young generation are considered Young GC
  • 13. 13 • • Card table per region • Avoid scanning the entire heap Remembered Sets
  • 14. 14 • • For each reference assignation (X.f = Y) we need to check: • References (X & Y) are NOT in the same region • Y is not null • => enqueue for Remebered Set processing • Refinement threads to process the queue • Additional instructions added after assignation Remembered Sets: Post Write Barrier if (!isInSameRegion(X, Y) && Y != null) RSEnqueue(X) mov DWORD PTR [rbp+0x74],r10d mov r11,rbp mov r8,r10 shl r8,0x3 xor r8,r11 shr r8,0x14 test r8,r8 je cont test r10d,r10d je cont shr r11,0x9 movabs rcx,0x2965ecc3000 add rcx,r11 cmp BYTE PTR [rcx],0x20 je cont mov r10,QWORD PTR [r15+0x70] mov r11,QWORD PTR [r15+0x80] lock add DWORD PTR [rsp-0x40],0x0 cmp BYTE PTR [rcx],0x0 je cont mov BYTE PTR [rcx],0x0 test r10,r10 jne 0x000002965edc62bc mov rdx,r15 movabs r10,0x7ffac2febc30 call r10 jmp cont mov QWORD PTR [r11+r10*1-0x8],rcx add r10,0xfffffffffffffff8 mov QWORD PTR [r15+0x70],r10
  • 15. 15 • • Triggered based on Initiating Heap Occupancy Percent flag (IHOP default to 45%) • Try to mark the whole object graph concurrently with the application running • Based on Tri-color abstraction & Snapshot-At-The-Beginning algorithm Concurrent Marking
  • 16. 16 • Concurrent Marking: Tri-Color Abstraction
  • 17. 17 • Concurrent Marking: Issues • New allocations during marking phase can be handled by: • Marking automatically object at allocation • Not considering new allocations for the current cycle • Tri-Color abstraction provides 2 properties of missed object: 1. The mutator stores a reference to a white object into a black object. 2. All paths from any gray objects to that white object are destroyed. http://www.memorymanagement.org/glossary/s.html#term-snapshot-at-the-beginning
  • 18. 18 • Concurrent Marking: Issues A B C A.field1 = C; B.field2 = null; OOPS!
  • 19. 19 • • 2 ways to ensure not missing any marking • For SATB, Pre-Write Barriers, recording object for marking • SATB barrier is only active when Marking is on (global state) Concurrent Marking: Resolving misses if (SATB_WriteBarrier) { if (X.f != null) SATB_enqueue(X.f); } cmp BYTE PTR [r15+0x30],0x0 jne 0x000002965edc62e5 [...] mov r11d,DWORD PTR [rbp+0x74] test r11d,r11d je 0x000002965edc6253 mov r10,QWORD PTR [r15+0x38] mov rcx,r11 shl rcx,0x3 test r10,r10 je 0x000002965edc6318 mov r11,QWORD PTR [r15+0x48] mov QWORD PTR [r11+r10*1-0x8],rcx add r10,0xfffffffffffffff8 mov QWORD PTR [r15+0x38],r10 jmp 0x000002965edc6253 mov rdx,r15 movabs r10,0x7ffac2febc50 call r10 jmp 0x000002965edc6253
  • 20. 20 • • At the end of Marking, we have per region liveness information • Regions are sorted by liveness (ascending) • Regions full of garbage are collected during cleanup STW phase • CollectionSet is built based on • Liveness, up until thresholds (G1HeapWastePercent, G1MixedGCLiveThresholdPercent) • Maximum number of regions (G1OldCSetRegionThresholdPercent) CollectionSet
  • 21. 21 • • Based on CollectionSet, G1 schedule to collect part of old regions • When a Young is triggered, old regions to collect are piggy backed • Not all old regions are considered to not waste time and reach the pause goal • Several Young GCs can be used to collect old regions (mixed event) Mixed GC
  • 23. 23 • • Still fallback to FullGC (serial < JDK10) • Fragmentation can still happen (regions with lot of lived objects) • Still unpredictable FullGC
  • 25. 25 • • Non-generational (still option for partial collection) • Region based • Use Read Barrier: Brooks pointer • Self-Healing • Cooperation between mutator threads & GC threads • Only for concurrent compaction • Mostly based on G1 but with concurrent compaction Shenandoah GC
  • 26. 26 • • Initial Marking (STW) • Concurrent Marking • Final Remark (STW) • Concurrent Cleanup • Concurrent Evacuation • Init Update References (STW) • Concurrent Update References • Final Update References (STW) • Concurrent Cleanup Shenandoah Phases
  • 27. 27 • • SATB-style (like G1) • 2 STW pauses for Initial Mark & Final Remark • Conditional Write Barrier • To deal with concurrent modification of object graph Concurrent Marking
  • 28. 28 • • Same principle than G1: • Build CollectionSet with Garbage First! • Evacuate to new regions to release the region for reuse • Concurrent Evacuation done with the help of: • 1 Read Barrier : Brooks pointer • 4 Write Barriers • Barriers help to keep the to-space invariant: • All Writes are made into an object in to-space Concurrent Evacuation
  • 29. 29 • • All objects have an additional forwarding pointer • Placed before the regular object • Dereference the forwarding pointer for each access • Memory footprint overhead • Throughput overhead Brooks pointers Header Brooks pointer mov r13,QWORD PTR [r12+r14*8-0x8]
  • 30. 30 • Concurrent Copy: GC thread Header Brooks pointer Header Brooks pointer From-Space To-Space GC thread
  • 31. 31 • Concurrent Copy: Reader threads Header Brooks pointer From-Space To-Space Reader thread Reader thread
  • 32. 32 • Concurrent Copy: Writer threads Header Brooks pointer Header Brooks pointer From-Space To-Space Writer thread Writer thread Header Brooks pointer
  • 33. 33 • • Any writes (even primitives) to from-space object needs to be protected • Exotic barriers: • acmp (pointer comparison) • CAS • clone Write Barriers if (evacInProgress && inCollectionSet(obj) && notCopyYet(obj)) { evacuateObject(obj) } test BYTE PTR [r15+0x3c0],0x2 jne 0x000000000281bcbc [...] mov r10d,DWORD PTR [r13+0xc] test r10d,r10d je 0x000000000281bc2b mov r11,QWORD PTR [r15+0x360] mov rcx,r10 shl rcx,0x3 test r11,r11 je 0x000000000281bd0d [...] mov rdx,r15 movabs r10,0x62d1f660 call r10 jmp 0x000000000281bc2b
  • 34. 34 • • Late memory release • Only happens when all refs updated (Concurrent Cleanup phase) • Allocations can overrun the GC • Failure modes: • Pacing • Degenerated GC • FullGC Extreme cases
  • 36. 36 • • Generational (young & old) • Region based (pages) • Use Read Barrier: Loaded Value Barrier • Self-Healing • Cooperation between mutator threads & GC threads • Pauseless algorithm but implementation requires safepoints • Pauses are most of the time < 1ms Continuously Concurrent Compacting Collector
  • 37. 37 • • Baker-style Barrier • move objects through forwarding addresses stored aside • Applied at load time, not when dereferencing • Ensure C4 invariants: • Marked Through the current cycle • Not relocated • If not => Self-healing process to correct it • Mark object • Relocate & correct reference • Checked for each reference loads • Benefits from JIT optimization for caching loaded value (registers) LVB
  • 38. 38 • • States of objects stored inside reference address => Colored pointers • NMT bit • Generation • Checked against a global expected value during the GC cycle • Thread local, almost always L1 cache hits • Register • Relocated: x86 Implementation use trap from VM memory translation Guest/Host • Intel EPT • AMD NPT LVB test r9, rax jne 0x3001443b mov r10d, dword ptr [rax + 8]
  • 39. 39 • Virtual Memory vs Physical Memory Virtual Memory Physical Memory 0 2^64 0 2^37
  • 40. 40 • • All phases are fully parallel & concurrent • No "rush" to finish phases • No constraint about STW pause to be short • Physical memory released quickly in relocation phase • Can be reused for new allocations • Plenty of virtual space vs physical memory C4 Phases
  • 41. 41 • • Mark • Marking all objects in graph • Relocation • Moving objects to release pages • Remap • Fixup references in object graph • Folded with next mark cycle C4 Phases
  • 42. 42 • • Incremental Update Marking (vs SATB) • Single pass • No final mark/remark • Self-Healing: Mark object that are not marked for the current cycle Mark Phase
  • 43. 43 • Mark Phase: Concurrent Modification A B C A.field1 = C; B.field2 = null; LVB
  • 44. 44 • • Scanning roots (Static var, Thread stacks, register, JNI handles) • GC threads scans stalled threads • Running threads scans their own stack stopping individually at Safepoint • Scanning object graph like a parallel collector • Newly allocated objects into new pages, not considered for reclaim (relocation) • For each page, summing live data bytes, used to select page to reclaim Mark Phase
  • 45. 45 • • Select pages with the greatest number of dead objects (garbage first!) • Protect page selected from being accessed by mutators thread • Move objects to new allocated pages • Build side arrays (off heap tables) for forwarding information • Self-Healing: As protected, LVB will trigger a trap to: • Copy object to the new location if not done • Use forward pointer to fix the reference Relocation Phase
  • 47. 47 • • Mutators rarely stall when accessing a reference, since mostly dead pages are processed • Once object copy is done, physical memory is released (Quick Release) • Can be immediately reused (remapped) to satisfy new allocations • Evacuated pages are still mapped & protected to help the remap phase • Cannot be released until all objects are remapped • Not a problem as we have a huge virtual address space Relocation Phase
  • 48. 48 • • Traverse Object Graph and fixup references • Execute LVB barrier for each object • Self-Healing: fixup references using forward information • As we traverse again, mark for the next phase • Mark & Remap phases are folded! Remap Phase
  • 49. 49 • • Algorithm requires a sustainable rate of remapping operations • Linux limitations: • TLB invalidation • Only 4KB pages can be remapped • Single threaded remapping (write lock) • Kernel module implements an API for the Zing JVM to significantly increase the remapping rate • Also implements virtual address aliasing for addressing objects with metadata Remap – Kernel module
  • 50. 50 • • Young & Old collections are done by the same algorithm and can run concurrently • Sizes of the generations are dynamically adjusted • Card Marking with write barrier (Stored Value Barrier) • Old collection is based on young-to-old roots generated by the previous young cycle • Young collection performs card scanning per page • holds any concurrent Old collection for each page scanned Generational
  • 51. 51 • • Used by Hadoop Name Node • 580GB Heap • Very hard to tune with G1 • No issue so far regarding GC since production roll out (Oct 2017) C4 @ Criteo
  • 52. Z GC
  • 53. 53 • • Non generational • Region based (zPages, dynamically sized) • Concurrent Marking, Compaction, Ref processing • Uses Colored Pointers & Read/Load Barrier • Self-Healing • Cooperation between mutator threads & GC threads • Experimental in JDK 11 (-XX:+UnlockExperimentalVMOptions -XX:+UseZGC) Z GC mov r10,QWORD PTR [r11+0xb0] test QWORD PTR [r15+0x20],r10 jne 0x00007f9594cc54b5
  • 55. 55 • • Initial Mark (STW) • Concurrent Mark/Remap • Final Mark (STW) • Concurrent Prepare for Relocation • Start Relocate (STW) • Concurrent Relocate Z GC phases:
  • 56. 56 • • Store metadata in unused bits of reference address • 42 bits for addressing (4TB) • 4 bits for metadata • Marked0 • Marked1 • Remapped • Finalizable Colored Pointers
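The layout above can be written down as bit masks. The sketch below assumes the four metadata bits sit directly above the 42 address bits in the order listed on the slide; the class and helper names are hypothetical. `needsSlowPath` mimics a load-barrier test: any color bit other than the currently "good" one sends the load to the slow path, while an uncolored null passes.

```java
final class ColoredPointerSketch {
    static final int ADDRESS_BITS = 42;                         // 2^42 bytes = 4TB
    static final long ADDRESS_MASK = (1L << ADDRESS_BITS) - 1;

    // Assumed bit order, taken from the order of the bullets on the slide.
    static final long MARKED0     = 1L << ADDRESS_BITS;
    static final long MARKED1     = 1L << (ADDRESS_BITS + 1);
    static final long REMAPPED    = 1L << (ADDRESS_BITS + 2);
    static final long FINALIZABLE = 1L << (ADDRESS_BITS + 3);
    static final long COLOR_MASK  = MARKED0 | MARKED1 | REMAPPED;

    // Attaches one color bit to a plain address.
    static long color(long address, long colorBit) {
        return (address & ADDRESS_MASK) | colorBit;
    }

    // Strips the metadata bits to recover the plain address.
    static long address(long pointer) {
        return pointer & ADDRESS_MASK;
    }

    // True when the pointer carries a color other than the good one.
    static boolean needsSlowPath(long pointer, long goodColor) {
        long badMask = COLOR_MASK & ~goodColor;
        return (pointer & badMask) != 0;
    }

    public static void main(String[] args) {
        long stale = color(0x1000L, MARKED0);
        System.out.println(Long.toHexString(address(stale)));
        System.out.println(needsSlowPath(stale, REMAPPED));
        System.out.println(needsSlowPath(color(0x1000L, REMAPPED), REMAPPED));
    }
}
```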
  • 57. 57 • • Colored pointers need to be unmasked for dereferencing • Some HW supports masking (SPARC, AArch64) • On Linux/Windows, overhead if done with classical instructions • Only one view is active at any point • Plenty of Virtual Space Multi-Mapping
  • 58. 58 • Multi-Mapping Virtual Memory Physical Memory 0 2^64 0 2^37 (marked0) 001<address> (marked1) 010<address> (remapped) 100<address>
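As an analogy for the diagram, the same backing storage can be mapped several times from plain Java. The sketch below maps one temporary file three times with `FileChannel.map` and writes through one view, then reads through another. It only illustrates the idea of several virtual views over one piece of physical memory; the file-based setup and the names (`MultiMappingSketch`, `mapViews`, `roundTrip`) are assumptions for the example, and coherence between mappings ultimately depends on the operating system.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MultiMappingSketch {
    static final int SIZE = 4096; // one small region is enough for the demo

    // Maps the same file region several times; each buffer is a separate view.
    static MappedByteBuffer[] mapViews(Path backing, int views) throws IOException {
        try (FileChannel channel = FileChannel.open(
                backing, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer[] result = new MappedByteBuffer[views];
            for (int i = 0; i < views; i++) {
                result[i] = channel.map(FileChannel.MapMode.READ_WRITE, 0, SIZE);
            }
            return result; // mappings stay valid after the channel is closed
        }
    }

    // Writes through view 0 and reads the value back through view 2.
    static long roundTrip(long value) {
        try {
            Path backing = Files.createTempFile("multimap", ".bin");
            backing.toFile().deleteOnExit();
            MappedByteBuffer[] views = mapViews(backing, 3);
            views[0].putLong(0, value);
            return views[2].getLong(0);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(Long.toHexString(roundTrip(0xCAFEL)));
    }
}
```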
  • 59. 59 • • Page sizes are multiples of 2MB • 3 different groups • Small: 2MB pages with object size <= 256KB • Medium: 32MB pages with object size <= 4MB • Large: 2MB pages, objects span several of them • Objects in the Large group are meant not to be relocated (too expensive) Page Allocations
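The three groups translate into a simple size-class lookup. The thresholds below come from the slide; the names (`ZPageSketch`, `groupFor`, `pageSizeFor`) and the rounding of large objects up to the next 2MB multiple are assumptions made for this sketch.

```java
final class ZPageSketch {
    enum Group { SMALL, MEDIUM, LARGE }

    static final long KB = 1024L;
    static final long MB = 1024L * KB;

    // Picks the page group from the object size, using the slide's thresholds.
    static Group groupFor(long objectSize) {
        if (objectSize <= 256 * KB) {
            return Group.SMALL;
        }
        if (objectSize <= 4 * MB) {
            return Group.MEDIUM;
        }
        return Group.LARGE;
    }

    // Size of the page that would hold an object of the given size.
    static long pageSizeFor(long objectSize) {
        switch (groupFor(objectSize)) {
            case SMALL:
                return 2 * MB;
            case MEDIUM:
                return 32 * MB;
            default:
                // Large: assumed to be rounded up to a multiple of 2MB.
                long unit = 2 * MB;
                return ((objectSize + unit - 1) / unit) * unit;
        }
    }

    public static void main(String[] args) {
        System.out.println(groupFor(100 * KB) + " " + pageSizeFor(100 * KB) / MB + "MB");
        System.out.println(groupFor(1 * MB) + " " + pageSizeFor(1 * MB) / MB + "MB");
        System.out.println(groupFor(5 * MB) + " " + pageSizeFor(5 * MB) / MB + "MB");
    }
}
```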
  • 60. 60 • • Handling remapping • C4: Memory protection + trap • Z: mask in colored pointer • Unmasking ref addresses • C4: Kernel module aliasing • Z: Multi-mapping or HW support • Pages & Relocation • C4: • Pages are fixed to match OS size (mem protection) • relocation for large objects by remapping • Z: • zPages are dynamic, a zPage can be 100MB large • No relocation for large objects Difference between C4 & Z GC
  • 61. How to choose a GC algorithm
  • 62. 62 • • Case 1: • Need maximum of work done in a time frame (offline job) • Can afford FullGC of several seconds => Use a throughput collector like ParallelGC or G1 • Case 2: • Have time constraint per unit of work (online job) • Cannot afford FullGC of several seconds => Use a low latency collector like C4, Shenandoah or Z Throughput vs Latency
  • 63. 63 • • You have to run on Windows • Shenandoah • Battle-tested GC (maturity) • C4 • Shenandoah • Minimizing any kind of JVM pauses • C4 • Z • You don’t want to pay for it: • Shenandoah • Z Low latency GCs
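Whichever collector is chosen, it is worth checking that the JVM actually runs with it. The sketch below lists the collectors the running JVM reports through the standard `java.lang.management` API; the class name `ActiveCollectors` is made up, and the names printed depend on the JVM and on flags such as `-XX:+UseParallelGC`, `-XX:+UseG1GC` or the ZGC flags shown earlier.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

final class ActiveCollectors {
    // Names of the collectors registered in the running JVM.
    static List<String> names() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Collection count and time are cumulative since JVM start.
            System.out.println(gc.getName()
                    + " collections=" + gc.getCollectionCount()
                    + " timeMs=" + gc.getCollectionTime());
        }
    }
}
```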