Multi-core architectures Jernej Barbic 15-213, Spring 2007 May 3, 2007
Single-core computer
Single-core CPU chip [diagram: a chip containing the single core]
Multi-core architectures This lecture is about a new trend in computer architecture: replicating multiple processor cores on a single die. [Diagram: multi-core CPU chip with Core 1 through Core 4]
Multi-core CPU chip The cores fit on a single processor socket. Also called CMP (Chip Multi-Processor). [Diagram: cores 1 through 4 on one chip]
The cores run in parallel. [Diagram: one thread running on each of the four cores]
Within each core, threads are time-sliced (just like on a uniprocessor). [Diagram: several threads per core]
Interaction with the Operating System The OS perceives each core as a separate processor. The OS scheduler maps threads/processes to different cores. Most major OSes support multi-core today: Windows, Linux, Mac OS X, …
Why multi-core? It is difficult to make single-core clock frequencies even higher. Deeply pipelined circuits bring heat problems, speed-of-light (signal propagation) limits, difficult design and verification, large design teams, and server farms that need expensive air-conditioning. Many new applications are multithreaded. The general trend in computer architecture is a shift toward more parallelism.
Instruction-level parallelism Parallelism at the machine-instruction level The processor can re-order, pipeline instructions, split them into microinstructions, do aggressive branch prediction, etc. Instruction-level parallelism enabled rapid increases in processor speeds over the last 15 years
Thread-level parallelism (TLP) This is parallelism on a coarser scale. A server can serve each client in a separate thread (Web server, database server). A computer game can do AI, graphics, and physics in three separate threads. Single-core superscalar processors cannot fully exploit TLP. Multi-core architectures are the next step in processor evolution: explicitly exploiting TLP.
General context: Multiprocessors A multiprocessor is any computer with several processors. SIMD (Single Instruction, Multiple Data): modern graphics cards. MIMD (Multiple Instructions, Multiple Data): the Lemieux cluster at the Pittsburgh Supercomputing Center.
Multiprocessor memory types Shared memory: In this model, there is one (large) common shared memory for all processors Distributed memory: In this model, each processor has its own (small) local memory, and its content is not replicated anywhere else
A multi-core processor is a special kind of multiprocessor: all processors are on the same chip. Multi-core processors are MIMD: different cores execute different threads (Multiple Instructions), operating on different parts of memory (Multiple Data). Multi-core is a shared-memory multiprocessor: all cores share the same memory.
What applications benefit from multi-core? Database servers. Web servers (Web commerce). Compilers. Multimedia applications. Scientific applications, CAD/CAM. In general, applications with thread-level parallelism (as opposed to instruction-level parallelism): each thread can run on its own core.
More examples Editing a photo while recording a TV show through a digital video recorder. Downloading software while running an anti-virus program. "Anything that can be threaded today will map efficiently to multi-core." BUT: some applications are difficult to parallelize.
A technique complementary to multi-core: Simultaneous multithreading Problem addressed: the processor pipeline can get stalled, waiting for the result of a long floating-point (or integer) operation, or waiting for data to arrive from memory, while other execution units wait unused. [Diagram: processor pipeline (BTB and I-TLB, Decoder, Trace Cache, Rename/Alloc, Uop queues, Schedulers, Integer and Floating Point units, L1 D-Cache and D-TLB, uCode ROM, L2 Cache and Control, Bus). Source: Intel]
Simultaneous multithreading (SMT) Permits multiple independent threads to execute SIMULTANEOUSLY on the SAME core Weaving together multiple “threads”  on the same core Example: if one thread is waiting for a floating point operation to complete, another thread can use the integer units
Without SMT, only a single thread can run at any given time. [Pipeline diagram: Thread 1 (a floating-point operation) occupies the pipeline alone]
Without SMT, only a single thread can run at any given time. [Pipeline diagram: Thread 2 (an integer operation) occupies the pipeline alone]
SMT processor: both threads can run concurrently. [Pipeline diagram: Thread 1 (floating point) and Thread 2 (integer operation) share the pipeline]
But: the threads can’t simultaneously use the same functional unit. [Pipeline diagram: Thread 1 and Thread 2 both targeting the integer unit, marked IMPOSSIBLE. This scenario is impossible with SMT on a single core (assuming a single integer unit).]
SMT is not a “true” parallel processor. It enables better threading (e.g. up to 30% performance gain). The OS and applications perceive each simultaneous thread as a separate “virtual processor”. The chip has only a single copy of each resource. Compare to multi-core: each core has its own copy of resources.
Multi-core: threads can run on separate cores. [Diagram: two complete pipelines, one per core; Thread 1 on one core, Thread 2 on the other]
Multi-core: threads can run on separate cores. [Diagram: two complete pipelines, one per core; Thread 3 on one core, Thread 4 on the other]
Combining Multi-core and SMT Cores can be SMT-enabled (or not) The different combinations: Single-core, non-SMT: standard uniprocessor Single-core, with SMT  Multi-core, non-SMT Multi-core, with SMT: our fish machines The number of SMT threads: 2, 4, or sometimes 8 simultaneous threads Intel calls them “hyper-threads”
SMT dual-core: all four threads can run concurrently. [Diagram: two SMT pipelines running Threads 1 through 4 concurrently, two per core]
Comparison: multi-core vs SMT Advantages/disadvantages?
Comparison: multi-core vs SMT Multi-core: Since there are several cores, each is smaller and not as powerful (but also easier to design and manufacture) However, great with thread-level parallelism SMT Can have one large and fast superscalar core Great performance on a single thread Mostly still only exploits instruction-level parallelism
The memory hierarchy If simultaneous multithreading only:  all caches shared Multi-core chips: L1 caches private L2 caches private in some architectures and shared in others Memory is always shared
“Fish” machines Dual-core Intel Xeon processors. Each core is hyper-threaded. Private L1 caches, shared L2 cache. [Diagram: Core 0 and Core 1, each with hyper-threads and a private L1 cache, sharing one L2 cache and memory]
Designs with private L2 caches: both L1 and L2 are private. Examples: AMD Opteron, AMD Athlon, Intel Pentium D. [Diagram: each core with its own L1 and L2 cache, sharing memory] A design with L3 caches adds a shared L3 below the private L2s. Example: Intel Itanium 2. [Diagram: each core with private L1 and L2, sharing an L3 cache and memory]
Private vs shared caches? Advantages/disadvantages?
Private vs shared caches Advantages of private: They are closer to core, so faster access Reduces contention Advantages of shared: Threads on different cores can share the same cache data More cache space available if a single (or a few) high-performance thread runs on the system
The cache coherence problem Since we have private caches: How to keep the data consistent across caches? Each core should perceive the memory as a monolithic array, shared by all the cores
The cache coherence problem Suppose variable x initially contains 15213. [Diagram: four cores on a multi-core chip, each with one or more levels of cache; main memory holds x=15213]
Core 1 reads x. [Core 1’s cache: x=15213; memory: x=15213]
Core 2 reads x. [Core 1’s cache: x=15213; Core 2’s cache: x=15213; memory: x=15213]
Core 1 writes to x, setting it to 21660 (assuming write-through caches). [Core 1’s cache: x=21660; Core 2’s cache: x=15213; memory: x=21660]
Core 2 attempts to read x… and gets a stale copy. [Core 1’s cache: x=21660; Core 2’s cache: x=15213 (stale); memory: x=21660]
Solutions for cache coherence This is a general problem with multiprocessors, not limited just to multi-core. There exist many solution algorithms, coherence protocols, etc. A simple solution: an invalidation-based protocol with snooping.
Inter-core bus [Diagram: four cores, each with one or more levels of cache, connected by an inter-core bus to main memory]
Invalidation protocol with snooping Invalidation: if a core writes to a data item, all other copies of this data item in other caches are invalidated. Snooping: all cores continuously “snoop” (monitor) the bus connecting the cores.
The cache coherence problem Revisited: Cores 1 and 2 have both read x. [Core 1’s cache: x=15213; Core 2’s cache: x=15213; memory: x=15213]
Core 1 writes to x, setting it to 21660 (assuming write-through caches). Core 1 sends an invalidation request over the inter-core bus; Core 2’s copy is INVALIDATED. [Core 1’s cache: x=21660; Core 2’s cache: invalidated; memory: x=21660]
After invalidation: [Core 1’s cache: x=21660; Core 2’s cache: empty; memory: x=21660]
Core 2 reads x. Its cache misses, and it loads the new copy. [Core 1’s cache: x=21660; Core 2’s cache: x=21660; memory: x=21660]
Alternative to the invalidation protocol: the update protocol. Core 1 writes x=21660 (assuming write-through caches) and broadcasts the updated value over the inter-core bus; Core 2’s copy is UPDATED. [Core 1’s cache: x=21660; Core 2’s cache: x=21660; memory: x=21660]
Which do you think is better? Invalidation or update?
Invalidation vs update Multiple writes to the same location: with invalidation, bus traffic is generated only on the first write; with update, each write must be broadcast (including the new variable value). Invalidation generally performs better: it generates less bus traffic.
Invalidation protocols This was just the basic invalidation protocol. More sophisticated protocols use extra cache state bits: MSI, MESI (Modified, Exclusive, Shared, Invalid).
Programming for multi-core Programmers must use threads or processes Spread the workload across multiple cores Write parallel algorithms OS will map threads/processes to cores
Thread safety very important Pre-emptive context switching: context switch can happen AT ANY TIME True concurrency, not just uniprocessor time-slicing Concurrency bugs exposed much faster with multi-core
However: you need to use synchronization even if only time-slicing on a uniprocessor.

    int counter = 0;

    void thread1() {
        int temp1 = counter;
        counter = temp1 + 1;
    }

    void thread2() {
        int temp2 = counter;
        counter = temp2 + 1;
    }

If the two threads happen to run back to back:

    temp1 = counter;
    counter = temp1 + 1;
    temp2 = counter;
    counter = temp2 + 1;

this gives counter = 2. But if a context switch lands between the read and the write:

    temp1 = counter;
    temp2 = counter;
    counter = temp1 + 1;
    counter = temp2 + 1;

this gives counter = 1: one increment is lost.
Assigning threads to the cores Each thread/process has an  affinity mask Affinity mask specifies what cores the thread is allowed to run on Different threads can have different masks Affinities are inherited across fork()
Affinity masks are bit vectors. Example: 4-way multi-core, without SMT. The mask 1101 (bits for core 3, core 2, core 1, core 0) means the process/thread is allowed to run on cores 0, 2, and 3, but not on core 1.
Affinity masks when multi-core and SMT are combined: there are separate bits for each simultaneous thread. Example: 4-way multi-core, 2 threads per core. The mask 11 00 10 11 (thread 1 and thread 0 bits, for core 3 down to core 0) means core 2 can’t run the process at all, and core 1 can only use one of its simultaneous threads.
Default Affinities Default affinity mask is all 1s: all threads can run on all processors Then, the OS scheduler decides what threads run on what core OS scheduler detects skewed workloads, migrating threads to less busy processors
Process migration is costly: the execution pipeline must be restarted, and cached data is invalidated. The OS scheduler tries to avoid migration as much as possible: it tends to keep a thread on the same core. This is called soft affinity.
Hard affinities The programmer can prescribe her own affinities (hard affinities). Rule of thumb: use the default scheduler unless there is a good reason not to.
When to set your own affinities Two (or more) threads share data structures in memory: map them to the same core so that they can share the cache. Real-time threads. Example: a thread running a robot controller must not be context-switched, or else the robot can become unstable; dedicate an entire core just to this thread. (Source: Sensable.com)
Kernel scheduler API

    #include <sched.h>
    int sched_getaffinity(pid_t pid, unsigned int len, unsigned long *mask);

Retrieves the current affinity mask of process ‘pid’ and stores it into the space pointed to by ‘mask’. ‘len’ is the size of the mask in bytes: sizeof(unsigned long).
Kernel scheduler API

    #include <sched.h>
    int sched_setaffinity(pid_t pid, unsigned int len, unsigned long *mask);

Sets the current affinity mask of process ‘pid’ to *mask. ‘len’ is the size of the mask in bytes: sizeof(unsigned long). To query the affinity of a running process:

    [barbic@bonito ~]$ taskset -p 3935
    pid 3935's current affinity mask: f
Windows Task Manager [screenshot: per-core CPU usage graphs for core 1 and core 2]
Software licensing issues Will software vendors charge a separate license for each core, or only a single license per chip? Microsoft, Red Hat Linux, and SUSE Linux will license their OS per chip, not per core.
Conclusion Multi-core chips are an important new trend in computer architecture. Several new multi-core chips are in the design phase. Parallel programming techniques are likely to gain importance.

More Related Content

PPTX
Computer architecture multi core processor
PPT
RT linux
PPT
Multicore Processors
PPT
message passing
DOCX
Parallel computing persentation
PPT
Real time scheduling - basic concepts
PPTX
Heterogeneous computing
Computer architecture multi core processor
RT linux
Multicore Processors
message passing
Parallel computing persentation
Real time scheduling - basic concepts
Heterogeneous computing

What's hot (20)

PPTX
Alanoud alqoufi inductive learning
PPT
Communication primitives
DOC
Unit 1 architecture of distributed systems
PDF
Design issues of dos
PPTX
file sharing semantics by Umar Danjuma Maiwada
PPTX
Multiprocessor
PPTX
Single and Multi core processor
PPT
Pipelining
PPTX
Internet congestion
PPT
Smp and asmp architecture.
PDF
CS6003 AD HOC AND SENSOR NETWORKS
PPTX
Cache performance considerations
PDF
Memory consistency models
PPSX
Foult Tolerence In Distributed System
PPTX
CISC & RISC Architecture
PDF
Os services
PPT
Os Threads
PPTX
Single &amp;Multi Core processor
Alanoud alqoufi inductive learning
Communication primitives
Unit 1 architecture of distributed systems
Design issues of dos
file sharing semantics by Umar Danjuma Maiwada
Multiprocessor
Single and Multi core processor
Pipelining
Internet congestion
Smp and asmp architecture.
CS6003 AD HOC AND SENSOR NETWORKS
Cache performance considerations
Memory consistency models
Foult Tolerence In Distributed System
CISC & RISC Architecture
Os services
Os Threads
Single &amp;Multi Core processor
Ad

Viewers also liked (20)

PDF
IBM z/OS V2R2 Networking Technologies Update
PDF
Intel's Presentation in SIGGRAPH OpenCL BOF
PDF
Ludden q3 2008_boston
PDF
Embedded Solutions 2010: Intel Multicore by Eastronics
PDF
IBM z/OS V2R2 Performance and Availability Topics
PDF
z/OS V2R2 Enhancements
PPT
Multicore computers
PPTX
Cache & CPU performance
DOC
Introduction to multi core
PDF
可靠分布式系统基础 Paxos的直观解释
PPT
Multi core-architecture
PPTX
Low Level CPU Performance Profiling Examples
PDF
Linux BPF Superpowers
KEY
SMP/Multithread
PDF
Linux Systems Performance 2016
PPTX
Broken Linux Performance Tools 2016
PDF
Velocity 2015 linux perf tools
PDF
Linux Profiling at Netflix
PDF
Computex 2014 AMD Press Conference
 
PDF
AMD Ryzen CPU Zen Cores Architecture
IBM z/OS V2R2 Networking Technologies Update
Intel's Presentation in SIGGRAPH OpenCL BOF
Ludden q3 2008_boston
Embedded Solutions 2010: Intel Multicore by Eastronics
IBM z/OS V2R2 Performance and Availability Topics
z/OS V2R2 Enhancements
Multicore computers
Cache & CPU performance
Introduction to multi core
可靠分布式系统基础 Paxos的直观解释
Multi core-architecture
Low Level CPU Performance Profiling Examples
Linux BPF Superpowers
SMP/Multithread
Linux Systems Performance 2016
Broken Linux Performance Tools 2016
Velocity 2015 linux perf tools
Linux Profiling at Netflix
Computex 2014 AMD Press Conference
 
AMD Ryzen CPU Zen Cores Architecture
Ad

Similar to Multi-core architectures (20)

PDF
27 multicore
PDF
27 multicore
PPT
multi-core Processor.ppt for IGCSE ICT and Computer Science Students
PPT
Osa-multi-core.ppt
PDF
Trip down the GPU lane with Machine Learning
PPTX
Multicore processor by Ankit Raj and Akash Prajapati
PPTX
Processors and its Types
PPTX
Multi-core processor and Multi-channel memory architecture
PPTX
5.6 Basic computer structure microprocessors
DOCX
Multi-Core on Chip Architecture *doc - IK
PPTX
CA presentation of multicore processor
PPT
Memory Mapping Cache
PPTX
Lec04 gpu architecture
PPTX
Lecture 4.pptx
PPT
Intro To .Net Threads
PPTX
Final draft intel core i5 processors architecture
PDF
fundamentals of digital communication Unit 5_microprocessor.pdf
PPT
Multiprocessor_YChen.ppt
PPT
The Cell Processor
PPT
Paralle programming 2
27 multicore
27 multicore
multi-core Processor.ppt for IGCSE ICT and Computer Science Students
Osa-multi-core.ppt
Trip down the GPU lane with Machine Learning
Multicore processor by Ankit Raj and Akash Prajapati
Processors and its Types
Multi-core processor and Multi-channel memory architecture
5.6 Basic computer structure microprocessors
Multi-Core on Chip Architecture *doc - IK
CA presentation of multicore processor
Memory Mapping Cache
Lec04 gpu architecture
Lecture 4.pptx
Intro To .Net Threads
Final draft intel core i5 processors architecture
fundamentals of digital communication Unit 5_microprocessor.pdf
Multiprocessor_YChen.ppt
The Cell Processor
Paralle programming 2

More from nextlib (20)

PDF
Nio
PDF
Hadoop Map Reduce Arch
PDF
D Rb Silicon Valley Ruby Conference
PPT
Aldous Huxley Brave New World
PDF
Social Graph
PPT
Ajax Prediction
PDF
Closures for Java
PDF
A Content-Driven Reputation System for the Wikipedia
PPT
SVD review
PDF
Mongrel Handlers
PPT
Blue Ocean Strategy
PPT
日本7-ELEVEN消費心理學
PDF
Comparing State-of-the-Art Collaborative Filtering Systems
PPT
Item Based Collaborative Filtering Recommendation Algorithms
PPT
Agile Adoption2007
PPT
Modern Compiler Design
PPT
透过众神的眼睛--鸟瞰非洲
PDF
Improving Quality of Search Results Clustering with Approximate Matrix Factor...
PPT
Support Vector Machines
PPT
Bigtable
Nio
Hadoop Map Reduce Arch
D Rb Silicon Valley Ruby Conference
Aldous Huxley Brave New World
Social Graph
Ajax Prediction
Closures for Java
A Content-Driven Reputation System for the Wikipedia
SVD review
Mongrel Handlers
Blue Ocean Strategy
日本7-ELEVEN消費心理學
Comparing State-of-the-Art Collaborative Filtering Systems
Item Based Collaborative Filtering Recommendation Algorithms
Agile Adoption2007
Modern Compiler Design
透过众神的眼睛--鸟瞰非洲
Improving Quality of Search Results Clustering with Approximate Matrix Factor...
Support Vector Machines
Bigtable

Recently uploaded (20)

PPTX
Spectroscopy.pptx food analysis technology
PPTX
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
PPTX
ACSFv1EN-58255 AWS Academy Cloud Security Foundations.pptx
PDF
Per capita expenditure prediction using model stacking based on satellite ima...
PDF
Chapter 3 Spatial Domain Image Processing.pdf
PDF
Network Security Unit 5.pdf for BCA BBA.
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
PDF
The Rise and Fall of 3GPP – Time for a Sabbatical?
PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
PDF
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
PDF
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
PPTX
Understanding_Digital_Forensics_Presentation.pptx
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PPTX
Big Data Technologies - Introduction.pptx
PDF
cuic standard and advanced reporting.pdf
PDF
Dropbox Q2 2025 Financial Results & Investor Presentation
PDF
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PPT
Teaching material agriculture food technology
Spectroscopy.pptx food analysis technology
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
ACSFv1EN-58255 AWS Academy Cloud Security Foundations.pptx
Per capita expenditure prediction using model stacking based on satellite ima...
Chapter 3 Spatial Domain Image Processing.pdf
Network Security Unit 5.pdf for BCA BBA.
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
The Rise and Fall of 3GPP – Time for a Sabbatical?
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
Understanding_Digital_Forensics_Presentation.pptx
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
Big Data Technologies - Introduction.pptx
cuic standard and advanced reporting.pdf
Dropbox Q2 2025 Financial Results & Investor Presentation
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
Advanced methodologies resolving dimensionality complications for autism neur...
Teaching material agriculture food technology

Multi-core architectures

  • 1. Multi-core architectures Jernej Barbic 15-213, Spring 2007 May 3, 2007
  • 3. Single-core CPU chip the single core
  • 4. Multi-core architectures This lecture is about a new trend in computer architecture: Replicate multiple processor cores on a single die. Core 1 Core 2 Core 3 Core 4 Multi-core CPU chip
  • 5. Multi-core CPU chip The cores fit on a single processor socket Also called CMP (Chip Multi-Processor) core 1 core 2 core 3 core 4
  • 6. The cores run in parallel core 1 core 2 core 3 core 4 thread 1 thread 2 thread 3 thread 4
  • 7. Within each core, threads are time-sliced (just like on a uniprocessor) core 1 core 2 core 3 core 4 several threads several threads several threads several threads
  • 8. Interaction with the Operating System OS perceives each core as a separate processor OS scheduler maps threads/processes to different cores Most major OS support multi-core today: Windows, Linux, Mac OS X, …
  • 9. Why multi-core ? Difficult to make single-core clock frequencies even higher Deeply pipelined circuits: heat problems speed of light problems difficult design and verification large design teams necessary server farms need expensive air-conditioning Many new applications are multithreaded General trend in computer architecture (shift towards more parallelism)
  • 10. Instruction-level parallelism Parallelism at the machine-instruction level The processor can re-order, pipeline instructions, split them into microinstructions, do aggressive branch prediction, etc. Instruction-level parallelism enabled rapid increases in processor speeds over the last 15 years
  • 11. Thread-level parallelism (TLP) This is parallelism on a more coarser scale Server can serve each client in a separate thread (Web server, database server) A computer game can do AI, graphics, and physics in three separate threads Single-core superscalar processors cannot fully exploit TLP Multi-core architectures are the next step in processor evolution: explicitly exploiting TLP
  • 12. General context: Multiprocessors Multiprocessor is any computer with several processors SIMD Single instruction, multiple data Modern graphics cards MIMD Multiple instructions, multiple data Lemieux cluster, Pittsburgh supercomputing center
  • 13. Multiprocessor memory types Shared memory: In this model, there is one (large) common shared memory for all processors Distributed memory: In this model, each processor has its own (small) local memory, and its content is not replicated anywhere else
  • 14. Multi-core processor is a special kind of a multiprocessor: All processors are on the same chip Multi-core processors are MIMD: Different cores execute different threads ( M ultiple I nstructions), operating on different parts of memory ( M ultiple D ata). Multi-core is a shared memory multiprocessor: All cores share the same memory
  • 15. What applications benefit from multi-core? Database servers Web servers (Web commerce) Compilers Multimedia applications Scientific applications, CAD/CAM In general, applications with Thread-level parallelism (as opposed to instruction-level parallelism) Each can run on its own core
  • 16. More examples Editing a photo while recording a TV show through a digital video recorder Downloading software while running an anti-virus program “Anything that can be threaded today will map efficiently to multi-core” BUT: some applications difficult to parallelize
  • 17. A technique complementary to multi-core: Simultaneous multithreading Problem addressed: The processor pipeline can get stalled: Waiting for the result of a long floating point (or integer) operation Waiting for data to arrive from memory Other execution units wait unused Source: Intel BTB and I-TLB Decoder Trace Cache Rename/Alloc Uop queues Schedulers Integer Floating Point L1 D-Cache D-TLB uCode ROM BTB L2 Cache and Control Bus
  • 18. Simultaneous multithreading (SMT) Permits multiple independent threads to execute SIMULTANEOUSLY on the SAME core Weaving together multiple “threads” on the same core Example: if one thread is waiting for a floating point operation to complete, another thread can use the integer units
  • 19. Without SMT, only a single thread can run at any given time BTB and I-TLB Decoder Trace Cache Rename/Alloc Uop queues Schedulers Integer Floating Point L1 D-Cache D-TLB uCode ROM BTB L2 Cache and Control Bus Thread 1: floating point
  • 20. Without SMT, only a single thread can run at any given time BTB and I-TLB Decoder Trace Cache Rename/Alloc Uop queues Schedulers Integer Floating Point L1 D-Cache D-TLB uCode ROM BTB L2 Cache and Control Bus Thread 2: integer operation
  • 21. SMT processor: both threads can run concurrently BTB and I-TLB Decoder Trace Cache Rename/Alloc Uop queues Schedulers Integer Floating Point L1 D-Cache D-TLB uCode ROM BTB L2 Cache and Control Bus Thread 1: floating point Thread 2: integer operation
  • 22. But: Can’t simultaneously use the same functional unit BTB and I-TLB Decoder Trace Cache Rename/Alloc Uop queues Schedulers Integer Floating Point L1 D-Cache D-TLB uCode ROM BTB L2 Cache and Control Bus Thread 1 Thread 2 This scenario is impossible with SMT on a single core (assuming a single integer unit) IMPOSSIBLE
  • 23. SMT not a “true” parallel processor Enables better threading (e.g. up to 30%) OS and applications perceive each simultaneous thread as a separate “virtual processor” The chip has only a single copy of each resource Compare to multi-core: each core has its own copy of resources
  • 24. Multi-core: threads can run on separate cores BTB and I-TLB Decoder Trace Cache Rename/Alloc Uop queues Schedulers Integer Floating Point L1 D-Cache D-TLB uCode ROM BTB L2 Cache and Control Bus BTB and I-TLB Decoder Trace Cache Rename/Alloc Uop queues Schedulers Integer Floating Point L1 D-Cache D-TLB uCode ROM BTB L2 Cache and Control Bus Thread 1 Thread 2
  • 25. Multi-core: threads can run on separate cores BTB and I-TLB Decoder Trace Cache Rename/Alloc Uop queues Schedulers Integer Floating Point L1 D-Cache D-TLB uCode ROM BTB L2 Cache and Control Bus BTB and I-TLB Decoder Trace Cache Rename/Alloc Uop queues Schedulers Integer Floating Point L1 D-Cache D-TLB uCode ROM BTB L2 Cache and Control Bus Thread 3 Thread 4
  • 26. Combining Multi-core and SMT Cores can be SMT-enabled (or not) The different combinations: Single-core, non-SMT: standard uniprocessor Single-core, with SMT Multi-core, non-SMT Multi-core, with SMT: our fish machines The number of SMT threads: 2, 4, or sometimes 8 simultaneous threads Intel calls them “hyper-threads”
  • 27. SMT Dual-core: all four threads can run concurrently BTB and I-TLB Decoder Trace Cache Rename/Alloc Uop queues Schedulers Integer Floating Point L1 D-Cache D-TLB uCode ROM BTB L2 Cache and Control Bus BTB and I-TLB Decoder Trace Cache Rename/Alloc Uop queues Schedulers Integer Floating Point L1 D-Cache D-TLB uCode ROM BTB L2 Cache and Control Bus Thread 1 Thread 3 Thread 2 Thread 4
  • 28. Comparison: multi-core vs SMT Advantages/disadvantages?
  • 29. Comparison: multi-core vs SMT Multi-core: Since there are several cores, each is smaller and not as powerful (but also easier to design and manufacture) However, great with thread-level parallelism SMT Can have one large and fast superscalar core Great performance on a single thread Mostly still only exploits instruction-level parallelism
  • 30. The memory hierarchy If simultaneous multithreading only: all caches shared Multi-core chips: L1 caches private L2 caches private in some architectures and shared in others Memory is always shared
  • 31. “Fish” machines Dual-core Intel Xeon processors Each core is hyper-threaded Private L1 caches Shared L2 caches memory L2 cache L1 cache L1 cache C O R E 1 C O R E 0 hyper-threads
  • 32. Designs with private L2 caches memory L2 cache L1 cache L1 cache C O R E 1 C O R E 0 L2 cache memory L2 cache L1 cache L1 cache C O R E 1 C O R E 0 L2 cache Both L1 and L2 are private Examples: AMD Opteron, AMD Athlon, Intel Pentium D L3 cache L3 cache A design with L3 caches Example: Intel Itanium 2
  • 33. Private vs shared caches? Advantages/disadvantages?
  • 34. Private vs shared caches Advantages of private: They are closer to the core, so access is faster Reduces contention Advantages of shared: Threads on different cores can share the same cache data More cache space is available if a single (or a few) high-performance thread runs on the system
  • 35. The cache coherence problem Since we have private caches: How to keep the data consistent across caches? Each core should perceive the memory as a monolithic array, shared by all the cores
  • 36. The cache coherence problem Suppose variable x initially contains 15213 [Diagram: a multi-core chip with Core 1–Core 4, each with one or more levels of cache; main memory holds x=15213]
  • 37. The cache coherence problem Core 1 reads x [Diagram: Core 1’s cache now holds x=15213; the other caches are empty; main memory holds x=15213]
  • 38. The cache coherence problem Core 2 reads x [Diagram: Core 1’s and Core 2’s caches each hold x=15213; main memory holds x=15213]
  • 39. The cache coherence problem Core 1 writes to x, setting it to 21660 (assuming write-through caches) [Diagram: Core 1’s cache and main memory hold x=21660; Core 2’s cache still holds the stale x=15213]
  • 40. The cache coherence problem Core 2 attempts to read x… gets a stale copy [Diagram: Core 2 reads x=15213 from its own cache, even though Core 1’s cache and main memory hold x=21660]
  • 41. Solutions for cache coherence This is a general problem with multiprocessors, not limited just to multi-core There exist many solution algorithms, coherence protocols, etc. A simple solution: an invalidation-based protocol with snooping
  • 42. Inter-core bus [Diagram: a multi-core chip with Core 1–Core 4, each with one or more levels of cache, connected to each other and to main memory by an inter-core bus]
  • 43. Invalidation protocol with snooping Invalidation: If a core writes to a data item, all other copies of this data item in other caches are invalidated Snooping: All cores continuously “snoop” (monitor) the bus connecting the cores.
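To make the mechanism concrete, here is a minimal sketch in C (an illustration, not code from the slides) of an invalidation-based, write-through protocol for a single variable x: each core tracks whether its cached copy is valid, and a write by one core invalidates every other core’s copy, just as the snooping caches do on the bus.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_CORES 4

/* Per-core cache state for the single variable x. */
typedef struct {
    bool valid;   /* does this core hold a copy of x? */
    int  value;   /* the cached value, if valid */
} CacheLine;

static CacheLine cache[NUM_CORES];
static int memory_x = 15213;   /* main memory's copy of x */

/* A core reads x: hit if its copy is valid, otherwise miss and fetch. */
int core_read(int core) {
    if (!cache[core].valid) {            /* cache miss */
        cache[core].value = memory_x;
        cache[core].valid = true;
    }
    return cache[core].value;
}

/* A core writes x (write-through): update its copy and memory,
 * then invalidate every other core's copy, as snooping would. */
void core_write(int core, int value) {
    cache[core].value = value;
    cache[core].valid = true;
    memory_x = value;                    /* write-through to memory */
    for (int i = 0; i < NUM_CORES; i++)
        if (i != core)
            cache[i].valid = false;      /* invalidation request */
}
```

Replaying the slides’ scenario: after core_read(0), core_read(1), and core_write(0, 21660), core 1’s copy is invalid, so its next read misses and loads 21660 instead of the stale 15213.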
  • 44. The cache coherence problem Revisited: Cores 1 and 2 have both read x [Diagram: Core 1’s and Core 2’s caches each hold x=15213; main memory holds x=15213]
  • 45. The cache coherence problem Core 1 writes to x, setting it to 21660 (assuming write-through caches) [Diagram: Core 1’s cache and main memory hold x=21660; Core 1 sends an invalidation request over the inter-core bus, and Core 2’s copy x=15213 is INVALIDATED]
  • 46. The cache coherence problem After invalidation: [Diagram: only Core 1’s cache holds x=21660; Core 2’s copy is gone; main memory holds x=21660]
  • 47. The cache coherence problem Core 2 reads x. Cache misses, and loads the new copy. [Diagram: Core 1’s and Core 2’s caches each hold x=21660; main memory holds x=21660]
  • 48. Alternative to the invalidation protocol: the update protocol Core 1 writes x=21660 (assuming write-through caches) [Diagram: Core 1 broadcasts the updated value over the inter-core bus; Core 2’s copy is UPDATED to x=21660; main memory holds x=21660]
  • 49. Which do you think is better? Invalidation or update?
  • 50. Invalidation vs update Multiple writes to the same location: invalidation generates bus traffic only on the first write; update must broadcast each write (including the new value) Invalidation generally performs better: it generates less bus traffic
  • 51. Invalidation protocols This was just the basic invalidation protocol More sophisticated protocols use extra cache state bits MSI, MESI (Modified, Exclusive, Shared, Invalid)
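The slides only name MSI and MESI; as a hedged sketch of the MSI idea, each cache line is Modified, Shared, or Invalid, and its state changes on local reads/writes and on snooped bus events. A minimal transition function in C (state and event names are my own labels, not from the slides):

```c
#include <assert.h>

typedef enum { INVALID, SHARED, MODIFIED } MsiState;
typedef enum { LOCAL_READ, LOCAL_WRITE, BUS_READ, BUS_WRITE } MsiEvent;

/* Next state of one cache line under the basic MSI rules. */
MsiState msi_next(MsiState s, MsiEvent e) {
    switch (e) {
    case LOCAL_READ:                       /* a miss fetches a shared copy */
        return (s == INVALID) ? SHARED : s;
    case LOCAL_WRITE:                      /* writing gains exclusive ownership */
        return MODIFIED;
    case BUS_READ:                         /* another core reads: downgrade */
        return (s == MODIFIED) ? SHARED : s;
    case BUS_WRITE:                        /* another core writes: invalidate */
        return INVALID;
    }
    return s;
}
```

MESI refines this by splitting Shared into Exclusive (clean, sole copy) and Shared (possibly replicated), so a write to an Exclusive line needs no bus traffic.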
  • 52. Programming for multi-core Programmers must use threads or processes Spread the workload across multiple cores Write parallel algorithms OS will map threads/processes to cores
  • 53. Thread safety very important Pre-emptive context switching: context switch can happen AT ANY TIME True concurrency, not just uniprocessor time-slicing Concurrency bugs exposed much faster with multi-core
  • 54. However: Need to use synchronization even if only time-slicing on a uniprocessor
int counter = 0;

void thread1() {
    int temp1 = counter;
    counter = temp1 + 1;
}

void thread2() {
    int temp2 = counter;
    counter = temp2 + 1;
}
  • 55. Need to use synchronization even if only time-slicing on a uniprocessor
Interleaving 1 (gives counter = 2):
    temp1 = counter; counter = temp1 + 1; temp2 = counter; counter = temp2 + 1;
Interleaving 2 (gives counter = 1):
    temp1 = counter; temp2 = counter; counter = temp1 + 1; counter = temp2 + 1;
  • 56. Assigning threads to the cores Each thread/process has an affinity mask Affinity mask specifies what cores the thread is allowed to run on Different threads can have different masks Affinities are inherited across fork()
  • 57. Affinity masks are bit vectors Example: 4-way multi-core, without SMT [Mask, bits for cores 3..0: 1 1 0 1] Process/thread is allowed to run on cores 0, 2, 3, but not on core 1
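The bit arithmetic behind such a mask can be sketched in C (an illustration, not slide code): bit i set means the thread may run on core i.

```c
#include <assert.h>

/* Build an affinity mask allowing cores 0, 2, and 3 (bit i = core i). */
unsigned long make_mask(void) {
    unsigned long mask = 0;
    mask |= 1UL << 0;   /* allow core 0 */
    mask |= 1UL << 2;   /* allow core 2 */
    mask |= 1UL << 3;   /* allow core 3 */
    return mask;        /* bits 1101 = 0xD; bit 1 (core 1) stays 0 */
}

/* Test whether a given core is allowed by a mask. */
int core_allowed(unsigned long mask, int core) {
    return (int)((mask >> core) & 1UL);
}
```

Here core_allowed(make_mask(), 1) is 0, matching the example above: the thread may not run on core 1.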
  • 58. Affinity masks when multi-core and SMT combined Separate bits for each simultaneous thread Example: 4-way multi-core, 2 threads per core [Mask, two bits (thread 1, thread 0) per core, cores 3..0: 11 00 10 11] Core 2 can’t run the process; core 1 can only use one of its two simultaneous threads
  • 59. Default Affinities Default affinity mask is all 1s: all threads can run on all processors Then, the OS scheduler decides which threads run on which core The OS scheduler detects skewed workloads and migrates threads to less busy processors
  • 60. Process migration is costly Need to restart the execution pipeline Cached data is invalidated The OS scheduler tries to avoid migration as much as possible: it tends to keep a thread on the same core This is called soft affinity
  • 61. Hard affinities The programmer can prescribe her own affinities (hard affinities) Rule of thumb: use the default scheduler unless there is a good reason not to
  • 62. When to set your own affinities Two (or more) threads share data structures in memory: map them to the same core so that they can share the cache Real-time threads: Example: a thread running a robot controller: - must not be context switched, or else the robot can go unstable - dedicate an entire core just to this thread Source: Sensable.com
  • 63. Kernel scheduler API #include <sched.h> int sched_getaffinity(pid_t pid, unsigned int len, unsigned long *mask); Retrieves the current affinity mask of process ‘pid’ and stores it into the space pointed to by ‘mask’. ‘len’ is the system word size: sizeof(unsigned long)
  • 64. Kernel scheduler API #include <sched.h> int sched_setaffinity(pid_t pid, unsigned int len, unsigned long *mask); Sets the current affinity mask of process ‘pid’ to *mask ‘len’ is the system word size: sizeof(unsigned long) To query the affinity of a running process: [barbic@bonito ~]$ taskset -p 3935 pid 3935's current affinity mask: f
  • 65. Windows Task Manager [Screenshot: per-core CPU usage graphs for core 1 and core 2]
  • 66. Legal licensing issues Will software vendors charge a separate license per core, or only a single license per chip? Microsoft, Red Hat Linux, and Suse Linux license their OS per chip, not per core
  • 67. Conclusion Multi-core chips are an important new trend in computer architecture Several new multi-core chips are in the design phase Parallel programming techniques are likely to gain importance