Hoard: A Scalable Memory Allocator for Multithreaded Applications
Emery Berger, Kathryn McKinley*, Robert Blumofe, Paul Wilson
Department of Computer Sciences
* Department of Computer Science

Motivation
  • Parallel multithreaded programs are becoming prevalent
      • web servers, search engines, database managers, etc.
      • run on SMPs for high performance
      • often embarrassingly parallel
  • Memory allocation is a bottleneck
      • prevents scaling with number of processors

Assessment Criteria for Multiprocessor Allocators
  • Speed: competitive with uniprocessor allocators on one processor
  • Scalability: performance linear with the number of processors
  • Fragmentation (= max allocated / max in use): competitive with uniprocessor allocators, worst-case and average-case

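The fragmentation metric above can be computed from a memory trace. The following is a minimal sketch, assuming a hypothetical trace format of (bytes held by the allocator, bytes in use by the program) samples; the sample values are made up for illustration.

```python
def fragmentation(trace):
    """Return max allocated / max in use for a memory trace.

    trace: a list of (held_by_allocator, in_use_by_program) byte counts,
    sampled over the run (a hypothetical format chosen for this sketch).
    """
    max_allocated = max(held for held, _ in trace)
    max_in_use = max(in_use for _, in_use in trace)
    return max_allocated / max_in_use


# Made-up samples: the allocator peaks at 8192 bytes, the program at 6000.
samples = [(4096, 1000), (8192, 6000), (8192, 2000)]
ratio = fragmentation(samples)
print(round(ratio, 3))
```

A value of 1.0 would mean the allocator never held more memory than the program's peak demand; larger values mean more overhead.
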
Uniprocessor Allocators on Multiprocessors
  • Fragmentation: Excellent
      • Very low for most programs [Wilson & Johnstone]
  • Speed & Scalability: Poor
      • Heap contention: a single lock protects the heap
      • Can exacerbate false sharing: different processors can share cache lines

Allocator-Induced False Sharing
  • Allocators cause false sharing!
  • Cache lines can end up spread across a number of processors
  • Practically all allocators do this
  [Figure: processor 1 runs x1 = malloc(s); processor 2 runs x2 = malloc(s); both objects land in one cache line, which thrashes between the processors.]

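To make the problem concrete, the sketch below checks whether two addresses fall in the same cache line. It assumes a 64-byte line (real hardware varies) and uses a toy back-to-back allocator, `bump_allocate`, which is a hypothetical helper and not any real allocator's behavior.

```python
CACHE_LINE = 64  # bytes; an assumed line size, real hardware varies


def same_cache_line(addr_a, addr_b, line=CACHE_LINE):
    """True if both addresses map to the same cache line."""
    return addr_a // line == addr_b // line


def bump_allocate(start, size, count):
    """Hand out `count` back-to-back blocks of `size` bytes (toy allocator)."""
    return [start + i * size for i in range(count)]


# Two small objects handed to two different processors.
x1, x2 = bump_allocate(start=0x1000, size=8, count=2)
print(same_cache_line(x1, x2))  # the two 8-byte objects share a line

# Objects as large as a line cannot share one.
y1, y2 = bump_allocate(start=0x1000, size=CACHE_LINE, count=2)
print(same_cache_line(y1, y2))
```

When `x1` and `x2` are written by different processors, the shared line bounces between their caches even though the program shares no data.
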
Existing Multiprocessor Allocators
  • Speed:
      • One concurrent heap (e.g., concurrent B-tree): too expensive
          • too many locks/atomic updates
          • O(log n) cost per memory operation
      • => Fast allocators use multiple heaps
  • Scalability:
      • Allocator-induced false sharing and other bottlenecks
  • Fragmentation: P-fold increase or even unbounded

Multiprocessor Allocator I: Pure Private Heaps
  • Pure private heaps: one heap per processor.
      • malloc gets memory from the processor's heap or the system
      • free puts memory on the processor's heap
  • Avoids heap contention
  • Examples: STL, ad hoc (e.g., Cilk 4.1)
  [Figure: malloc and free calls for x1 to x4 on processor 1 and processor 2. Legend: allocated by heap 1; free, on heap 2.]

How to Break Pure Private Heaps: Fragmentation
  • Pure private heaps: memory consumption can grow without bound!
  • Producer-consumer:
      • processor 1 allocates
      • processor 2 frees
  [Figure: processor 1 runs x1 = malloc(s), x2 = malloc(s), x3 = malloc(s); processor 2 runs free(x1), free(x2), free(x3).]

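The producer-consumer pattern above can be simulated with a simplified model. This sketch counts blocks rather than bytes and assumes every block has the same size; the function name is hypothetical.

```python
def pure_private_producer_consumer(rounds):
    """Processor 1 allocates one block per round; processor 2 frees it.

    Returns (blocks requested from the system, blocks stranded on heap 2).
    """
    free_blocks = {1: 0, 2: 0}  # free blocks held by each processor's heap
    from_system = 0
    for _ in range(rounds):
        # malloc on processor 1: use its own heap, else ask the system.
        if free_blocks[1] > 0:
            free_blocks[1] -= 1
        else:
            from_system += 1
        # free on processor 2: the block lands on processor 2's heap,
        # where processor 1 can never reuse it.
        free_blocks[2] += 1
    return from_system, free_blocks[2]


print(pure_private_producer_consumer(1000))
```

At most one block is live at any moment, yet the memory taken from the system grows with the number of rounds.
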
Multiprocessor Allocator II: Private Heaps with Ownership
  • Private heaps with ownership: free puts memory back on the originating processor's heap.
  • Avoids unbounded memory consumption
  • Examples: ptmalloc [Gloger], LKmalloc [Larson & Krishnan]
  [Figure: malloc and free calls for x1 and x2 on processor 1 and processor 2.]

How to Break Private Heaps with Ownership: Fragmentation
  • Private heaps with ownership: memory consumption can blow up by a factor of P.
  • Round-robin producer-consumer:
      • processor i allocates
      • processor i+1 frees
  • This really happens (NDS).
  [Figure: processors 1, 2, and 3 allocate x1, x2, and x3; each object is freed by a different processor.]

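A simplified model of the round-robin pattern shows where the factor of P comes from. The sketch assumes equal-sized blocks and runs in phases: in each phase one processor allocates K blocks and the next processor frees all of them. The function name and the phase structure are assumptions made for illustration.

```python
def ownership_round_robin(P, K, phases):
    """Return the number of blocks requested from the system.

    In phase t, processor t % P allocates K blocks and its neighbor
    frees them; ownership returns them to the allocating heap.
    """
    free_blocks = [0] * P  # free blocks on each processor's heap
    from_system = 0
    for phase in range(phases):
        i = phase % P
        # Processor i allocates K blocks, reusing its own free blocks first.
        reused = min(free_blocks[i], K)
        free_blocks[i] -= reused
        from_system += K - reused
        # Processor i+1 frees them; they go back to heap i, not heap i+1.
        free_blocks[i] += K
    return from_system


# At most K = 100 blocks are ever live, but 4 heaps each end up holding 100.
print(ownership_round_robin(P=4, K=100, phases=40))
```

Memory use is bounded, unlike with pure private heaps, but it settles at P times the program's peak demand.
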
So What Do We Do Now?
The Hoard Multiprocessor Memory Allocator
  • Manages memory in page-sized superblocks of same-sized objects
      - Avoids false sharing by not carving up cache lines
      - Avoids heap contention: local heaps allocate & free small blocks from their set of superblocks
  • Adds a global heap that is a repository of superblocks
  • When the fraction of free memory exceeds the empty fraction, moves superblocks to the global heap
      - Avoids blowup in memory consumption

Hoard Example
  • Hoard: one heap per processor + a global heap
  • malloc gets memory from a superblock on its heap.
  • free returns memory to its superblock. If the heap is “too empty”, it moves a superblock to the global heap.
  [Figure: processor 1's heap and the global heap, with empty fraction = 1/3; x1 = malloc(s), some mallocs, some frees, then free(x7).]

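The malloc and free rules above can be sketched as a toy model. This is an illustration under stated assumptions, not Hoard's implementation: it handles a single size class, has no locking, uses a made-up superblock capacity of 4 objects, and applies only the empty-fraction test described on the slide. All class and function names are hypothetical.

```python
SUPERBLOCK_SLOTS = 4    # objects per superblock (toy value)
EMPTY_FRACTION = 1 / 3  # value shown on the slide


class Superblock:
    def __init__(self, owner):
        self.used = 0       # objects currently allocated from this superblock
        self.owner = owner  # heap that currently holds this superblock


class Heap:
    def __init__(self):
        self.superblocks = []

    def free_fraction(self):
        capacity = len(self.superblocks) * SUPERBLOCK_SLOTS
        used = sum(sb.used for sb in self.superblocks)
        return (capacity - used) / capacity if capacity else 0.0


def move(sb, dst):
    """Transfer a superblock from its current heap to `dst`."""
    sb.owner.superblocks.remove(sb)
    dst.superblocks.append(sb)
    sb.owner = dst


def malloc(heap, global_heap):
    """Allocate one object; returns the superblock it came from."""
    candidates = [sb for sb in heap.superblocks if sb.used < SUPERBLOCK_SLOTS]
    if candidates:
        sb = max(candidates, key=lambda s: s.used)  # prefer the fullest
    elif global_heap.superblocks:
        sb = global_heap.superblocks[-1]            # reuse before asking the system
        move(sb, heap)
    else:
        sb = Superblock(heap)                       # fresh memory from the system
        heap.superblocks.append(sb)
    sb.used += 1
    return sb


def free(sb, global_heap):
    """Return one object to its superblock; shed a superblock if too empty."""
    sb.used -= 1
    heap = sb.owner
    if heap is not global_heap and heap.free_fraction() > EMPTY_FRACTION:
        move(min(heap.superblocks, key=lambda s: s.used), global_heap)
```

Because a too-empty heap hands its emptiest superblock to the global heap, another processor's heap can pick that memory up instead of requesting more from the system.
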
Summary of Analytical Results
  • Worst-case memory consumption: O(n log(M/m) + P) [instead of O(P n log(M/m))]
      • n = memory required
      • M = biggest object size
      • m = smallest object size
      • P = number of processors
  • Best possible: O(n log(M/m)) [Robson]
  • Provably low synchronization in most cases

Experiments
  • Run on a dedicated 14-processor Sun Enterprise
      • 300 MHz UltraSparc, 1 GB of RAM
      • Solaris 2.7
  • All programs compiled with g++ version 2.95.1
  • Allocators:
      • Hoard version 2.0.2
      • Solaris (system allocator)
      • ptmalloc (GNU libc: private heaps with ownership)
      • mtmalloc (Sun’s “MT-hot” allocator)

Performance: threadtest
  speedup(x, P) = runtime(Solaris allocator, one processor) / runtime(x on P processors)

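The speedup definition can be written as a one-line function. The timings below are hypothetical values chosen for illustration, not measurements from the experiments.

```python
def speedup(baseline_runtime, runtime_on_p):
    """speedup(x, P): the Solaris allocator's one-processor runtime
    divided by allocator x's runtime on P processors."""
    return baseline_runtime / runtime_on_p


# Hypothetical timings in seconds.
solaris_one_processor = 28.0
allocator_x_on_14 = 2.0
print(speedup(solaris_one_processor, allocator_x_on_14))
```

Because the baseline is always the Solaris allocator on one processor, an allocator that is slower than Solaris on a single processor starts below 1.0.
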
Performance: Larson
  • Server-style benchmark with sharing

Performance: false sharing
  • Each thread reads & writes heap data

Fragmentation Results
  • On most standard uniprocessor benchmarks, Hoard’s fragmentation was low:
      • p2c (Pascal-to-C): 1.20
      • espresso: 1.47
      • LRUsim: 1.05
      • Ghostscript: 1.15
      • Within 20% of Lea’s allocator
  • On the multiprocessor benchmarks and other codes:
      • Fragmentation was between 1.02 and 1.24 for all but one anomalous benchmark (shbench: 3.17).

Hoard Conclusions
  • Speed: Excellent
      • As fast as a uniprocessor allocator on one processor
      • amortized O(1) cost
      • 1 lock for malloc, 2 for free
  • Scalability: Excellent
      • Scales linearly with the number of processors
      • Avoids false sharing
  • Fragmentation: Very good
      • Worst-case is provably close to ideal
      • Actual observed fragmentation is low

Hoard Heap Details
  • “Segregated size class” allocator
      • Size classes are logarithmically-spaced
      • Superblocks hold objects of one size class
      • empty superblocks are “recycled”
  • Approximately radix-sorted:
      • Allocate from mostly-full superblocks
      • Fast removal of mostly-empty superblocks
  [Figure: size class bins (8, 16, 24, 32, 40, 48), each with radix-sorted superblock lists ordered emptiest to fullest.]
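The approximate radix sort by fullness can be sketched as follows. This is a hedged illustration of the idea only: the number of bins (4), the superblock capacity (8 objects), and both function names are assumptions, and each superblock is represented just by its used-slot count.

```python
SLOTS = 8   # objects per superblock (toy value)
GROUPS = 4  # number of fullness bins (assumed)


def fullness_group(used):
    """Map a used-slot count to a bin index; bin 0 is the emptiest."""
    return min(used * GROUPS // SLOTS, GROUPS - 1)


def pick_superblock(bins):
    """Return (bin, index) of a superblock to allocate from.

    Scans from the fullest bin down, skipping completely full superblocks,
    so allocation favors mostly-full superblocks. Returns None if every
    superblock is full.
    """
    for group in range(GROUPS - 1, -1, -1):
        for index, used in enumerate(bins[group]):
            if used < SLOTS:
                return group, index
    return None


# Each entry is the used-slot count of one superblock, already binned.
bins = [[0, 1], [2], [], [8, 7]]
print(pick_superblock(bins))
```

Keeping mostly-empty superblocks in the low bins means they tend to drain completely, and the emptiest candidate for removal can be found without scanning every superblock.
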
