Multiprocessor Systems

      Chapter 8, 8.1




CPU clock-rate increase slowing

[Figure: CPU clock rate in MHz (log scale, 0.1–10,000) versus year, 1960–2020]
Multiprocessor System
• We will look at shared-memory multiprocessors
  – More than one processor sharing the same memory
• A single CPU can only go so fast
  – Use more than one CPU to improve performance
  – Assumes
     • Workload can be parallelised
     • Workload is not I/O-bound or memory-bound
• Disks and other hardware can be expensive
  – Can share hardware between CPUs


Amdahl’s law
• Given a proportion P of a program that
  can be made parallel, and the
  remaining serial portion (1 − P), the
  speedup from using N processors is

      speedup = 1 / ((1 − P) + P/N)

• Example: P = 0.5, N = 2
  – 1 processor: 50 serial + 50 parallel = 100 time units
  – 2 processors: 50 serial + 25 parallel = 75 time units
  – Speedup = 1/(0.5 + 0.5/2) = 1.33…
Amdahl’s law
• With infinitely many processors, the parallel
  portion takes no time, so the speedup is
  bounded by the serial portion

      speedup = 1 / ((1 − P) + P/N)

• Example: P = 0.5, N = ∞
  – 1 processor: 50 serial + 50 parallel = 100 time units
  – ∞ processors: 50 serial + 0 parallel = 50 time units
  – Speedup = 1/(0.5 + 0) = 2
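
A quick numeric check of the formula; this tiny C helper is ours, not from the slides:

#include <stdio.h>

/* Amdahl's law: speedup = 1 / ((1 - P) + P/N) */
static double amdahl_speedup(double p, double n) {
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void) {
    printf("P=0.5, N=2:   %.2f\n", amdahl_speedup(0.5, 2.0));  /* 1.33 */
    printf("P=0.5, N=1e9: %.2f\n", amdahl_speedup(0.5, 1e9));  /* ~2.00 */
    return 0;
}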
Types of Multiprocessors (MPs)
• UMA MP
  – Uniform Memory Access
    • Access to all memory occurs at the same speed
      for all processors.
• NUMA MP
  – Non-uniform memory access
    • Access to some parts of memory is faster
      than to others, depending on the processor
• We will focus on UMA
Bus Based UMA
Simplest MP is more than one processor on
  a single bus connected to memory (a)
  – Bus bandwidth becomes a bottleneck with
    more than just a few CPUs




Bus Based UMA
• Each processor has a cache to reduce its
  need for access to memory (b)
  – The hope is that most accesses hit the local cache
  – Bus bandwidth still becomes a bottleneck with
    many CPUs




Cache Consistency
• What happens if one CPU writes to
  address 0x1234 (and it is stored in its
  cache) and another CPU reads from the
  same address (and gets what is in its
  cache)?




Cache Consistency
• Cache consistency is usually handled by
  the hardware.
  – Writes to one cache propagate to, or
    invalidate, the corresponding entries in other caches
  – Cache transactions also consume bus
    bandwidth




Bus Based UMA
• To further scale the number of processors, we give
  each processor private local memory
  – Keeps private data local and off the shared memory bus
  – Bus bandwidth still becomes a bottleneck with many
    CPUs with shared data
  – Complicates application development
     • We have to partition between private and shared variables




Multi-core Processor




Bus Based UMA
• With only a single shared bus, scalability is
  limited by the bus bandwidth of the single
  bus
  – Caching only helps so much
• Alternative bus architectures do exist.




UMA Crossbar Switch




UMA Crossbar Switch
• Pro
  – Any CPU can access any
    available memory with
    less blocking
• Con
  – Number of switches
    required scales with n².
     • 1000 CPUs need 1,000,000
       switches
Summary
• Multiprocessors can
  – Increase computation power beyond that available
    from a single CPU
  – Share resources such as disk and memory
• However
  – Shared buses (bus bandwidth) limit scalability
     • Can be reduced via hardware design
     • Can be reduced by carefully crafted software behaviour
        – Good cache locality together with private data where possible
• Question
  – How do we construct an OS for a multiprocessor?
     • What are some of the issues?
Each CPU has its own OS
• Statically allocate physical memory to
  each CPU
• Each CPU runs its own independent OS
• Share peripherals
• Each CPU (OS) handles its own processes’
  system calls




Each CPU has its own OS
• Used in early multiprocessor systems to
  ‘get them going’
  – Simpler to implement
  – Avoids concurrency issues by not sharing




Issues
• Each processor has its own scheduling queue
  – We can have one processor overloaded, and the rest
    idle
• Each processor has its own memory partition
  – We can have one processor thrashing, and the others
    with free memory
     • No way to move free memory from one OS to another
• Consistency is an issue with independent disk
  buffer caches and potentially shared files




Master-Slave Multiprocessors
• OS (mostly) runs on a single fixed CPU
  – All OS tables, queues, buffers are
    present/manipulated on CPU 1
• User-level apps run on the other CPUs
  – And CPU 1 if there is spare CPU time
• All system calls are passed to CPU 1 for
  processing




Master-Slave Multiprocessors
• Very little synchronisation required
   – Only one CPU accesses majority of kernel data
• Simple to implement
• Single, centralised scheduler
   – Keeps all processors busy
• Memory can be allocated as needed to all CPUs




Issue
• Master CPU can become the bottleneck
• Cross-CPU traffic is slow compared to local traffic




Symmetric Multiprocessors (SMP)
• OS kernel runs on all processors
   – Load and resources are balanced across all processors
       • Including kernel execution
• Issue: Real concurrency in the kernel
   – Need carefully applied synchronisation primitives to avoid
     disaster




Symmetric Multiprocessors (SMP)
• One alternative: a single mutex that makes the entire
  kernel a large critical section
   – Only one CPU can be in the kernel at a time
   – Only a slightly better solution than master-slave
       • Better cache locality
       • The “big lock” becomes a bottleneck when in-kernel processing
         exceeds what can be done on a single CPU




Symmetric Multiprocessors (SMP)
• Better alternative: identify largely independent parts of
  the kernel and make each of them their own critical
  section
   – Allows more parallelism in the kernel

• Issue: Difficult task
   – Code is mostly similar to uniprocessor code
   – Hard part is identifying independent parts that don’t interfere with
     each other




Symmetric Multiprocessors (SMP)
• Example:
  – Associate a mutex with independent parts of the kernel
  – Some kernel activities require more than one part of the kernel
      • Need to acquire more than one mutex
      • Great opportunity to deadlock!!!!
  – Results in potentially complex lock-ordering schemes that
    must be adhered to (sketched below)
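
A user-space illustration of such an ordering rule using pthreads; the subsystem locks and the function are our own invention, not from the slides:

#include <pthread.h>

/* Global rule (ours): vm_lock is always taken before fs_lock.
   If every code path that needs both follows the same order,
   a wait cycle, and hence deadlock, cannot form. */
pthread_mutex_t vm_lock = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t fs_lock = PTHREAD_MUTEX_INITIALIZER;

void page_out_file_backed(void) {
    pthread_mutex_lock(&vm_lock);     /* first, per the global order */
    pthread_mutex_lock(&fs_lock);     /* second */
    /* ... activity spanning both parts of the kernel ... */
    pthread_mutex_unlock(&fs_lock);
    pthread_mutex_unlock(&vm_lock);
}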




Symmetric Multiprocessors (SMP)
• Example:
  – Given a “big lock” kernel, we divide the kernel into two
    independent parts with a lock each
      • Good chance that one of those locks will become the next
        bottleneck
      • Leads to more subdivision, more locks, more complex lock
        acquisition rules
           – Subdivision in practice means making more of the kernel code
             multithreaded (parallelised)




Real life Scalability Example
• In the early 1990s, CSE wanted to run 80 X-Terminals off one
  or more server machines
• Winning tender was a 4-CPU bar-fridge-sized machine
  with 256M of RAM
   – Eventual config 6-CPU and 512M of RAM
   – Machine ran fine in all pre-session testing




Real life Scalability Example
• Students + assignment deadline = machine unusable




Real life Scalability Example
• To fix the problem, the tenderer supplied more CPUs to
  improve performance (number increased to 8)
   – No change????


• Eventually, machine was replaced with
   –   Three 2-CPU pizza-box-sized machines, each with 256M RAM
   –   Cheaper overall
   –   Performance was dramatically improved!!!!!
   –   Why?




Real life Scalability Example
• Paper:
   – Ramesh Balan and Kurt Gollhardt, “A Scalable Implementation
     of Virtual Memory HAT Layer for Shared Memory Multiprocessor
     Machines”, Proc. 1992 Summer USENIX conference


• The 4-8 CPU machine hit a bottleneck in the single
  threaded VM code
   – Adding more CPUs simply added them to the wait queue for the
     VM locks, and made others wait longer
• The 2 CPU machines did not generate that much lock
  contention and performed proportionally better.

Lesson Learned
• Building scalable multiprocessor kernels is
  hard
• Lock contention can limit overall system
  performance




SMP Linux similar evolution
•   Linux 2.0 Single kernel big lock
•   Linux 2.2 Big lock with interrupt handling locks
•   Linux 2.4 Big lock plus some subsystem locks
•   Linux 2.6 most code now outside the big lock,
    data-based locking, lots of scalability tuning, etc.




Multiprocessor Synchronisation
• Given we need synchronisation, how can
  we achieve it on a multiprocessor
  machine?
  – Unlike a uniprocessor, disabling interrupts
    does not work.
    • It does not prevent other CPUs from running in
      parallel
  – Need special hardware support


Recall Mutual Exclusion
        with Test-and-Set




[Figure: entering and leaving a critical region using the TSL instruction]
Test-and-Set
• Hardware guarantees that the instruction
  executes atomically.
     • Atomically: As an indivisible unit.
  – The instruction cannot stop halfway through
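
In code, the missing figure amounts to something like this hedged sketch (test_and_set here stands for the hardware TSL primitive; an x86 version appears two slides on):

/* Sketch: entering/leaving a critical region with test-and-set.
   test_and_set() atomically writes 1 and returns the old value,
   so a return of 0 means we took the lock. */
extern int test_and_set(volatile int *l);   /* hardware primitive */

volatile int lock = 0;

void enter_region(void) {
    while (test_and_set(&lock) != 0)
        ;                        /* lock was already held: retry */
}

void leave_region(void) {
    lock = 0;                    /* a plain store releases the lock */
}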




Test-and-Set on SMP
• It does not work without some extra
  hardware support




Test-and-Set on SMP
• A solution:
  – Hardware locks the bus during the TSL instruction to
    prevent memory accesses by any other CPU
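
On x86, for instance, the XCHG instruction with a memory operand implicitly asserts the bus lock; a hedged GCC inline-assembly sketch of test_and_set built on it:

/* x86 sketch: XCHG with a memory operand is implicitly locked,
   so the read-modify-write is atomic across all CPUs. */
static inline int test_and_set(volatile int *l) {
    int v = 1;
    __asm__ __volatile__("xchgl %0, %1"
                         : "+r"(v), "+m"(*l)
                         :
                         : "memory");
    return v;    /* previous value: 0 means we acquired the lock */
}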




Test-and-Set on SMP
• Test-and-Set is a busy-wait
  synchronisation primitive
  – Called a spinlock
• Issue:
  – Lock contention leads to spinning on the lock
     • Spinning on a lock requires bus locking which
       slows all other CPUs down
        – Independent of whether other CPUs need a lock or not
        – Causes bus contention

Test-and-Set on SMP
• Caching does not help reduce bus contention
  – Either TSL still locks the bus
  – Or TSL requires exclusive access to an entry in the
    local cache
     • Requires invalidation of same entry in other caches, and
       loading entry into local cache
     • Many CPUs performing TSL simply bounce a single
       exclusive entry between all caches using the bus




Reducing Bus Contention
• Read before TSL
   – Spin reading the lock variable,
     waiting for it to change
   – When it does, use TSL to acquire
     the lock
• Allows the lock to be shared read-only
  in all caches until it is released
   – No bus traffic until the actual release
• No race conditions, as acquisition
  is still with TSL

start:
    while (lock == 1) ;   /* spin on cached reads only */
    r = TSL(lock);
    if (r == 1)
        goto start;       /* someone beat us to it: spin again */
Thomas Anderson, “The Performance of
 Spin Lock Alternatives for Shared-Memory
 Multiprocessors”, IEEE Transactions on
 Parallel and Distributed Systems, Vol 1,
 No. 1, 1990




Compares Simple Spinlocks
• Test and Set
void lock (volatile lock_t *l) {
    while (test_and_set(l)) ;                 /* every attempt is a bus-locking RMW */
}

• Read before Test and Set
void lock (volatile lock_t *l) {
    while (*l == BUSY || test_and_set(l)) ;   /* spin on cached reads first */
}



Benchmark
for i = 1 .. 1,000,000 {
   lock(l)
   crit_section()
   unlock()
   compute()
}
• compute() drawn from a uniform random
  distribution with mean 5 times the critical section
• Measured elapsed time on a Sequent Symmetry
  (20-CPU 80386, coherent write-back invalidate
  caches)

Results
• Test-and-set performs poorly once there are enough
  CPUs to cause contention for the lock
   – Expected
• Test-and-test-and-set performs better
   – Though performance is less than expected
   – Still significant contention on the lock when CPUs notice release
     and all attempt acquisition
• Critical-section performance degenerates
   – Critical section requires bus traffic to modify the shared structure
   – Lock holder competes for the bus with the CPUs that missed, as
     they test-and-set ⇒ lock holder is slower
   – Slower lock holder results in more contention


• John Mellor-Crummey and Michael Scott,
  “Algorithms for Scalable Synchronization
  on Shared-Memory Multiprocessors”, ACM
  Transactions on Computer Systems, Vol.
  9, No. 1, 1991




MCS Locks
• Each CPU enqueues its own private lock variable into a queue and
  spins on it
    – No contention
• On lock release, the releaser unlocks the next lock in the queue
    – Only have bus contention on actual unlock
    – No starvation (order of lock acquisitions defined by the list)




MCS Lock
• Requires
  – compare_and_swap()
  – exchange()
    • Also called fetch_and_store()
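
A minimal C11 sketch of an MCS lock built from exactly these two primitives; the struct and function names are ours (atomic_exchange plays the role of exchange(), atomic_compare_exchange_strong that of compare_and_swap()):

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Each CPU spins on its own qnode, so the only cross-CPU traffic
   is the enqueue (exchange) and the handover on unlock. */
typedef struct qnode {
    struct qnode *_Atomic next;
    atomic_bool locked;
} qnode_t;

typedef struct {
    qnode_t *_Atomic tail;        /* NULL when the lock is free */
} mcs_lock_t;

void mcs_acquire(mcs_lock_t *l, qnode_t *me) {
    atomic_store(&me->next, (qnode_t *)NULL);
    atomic_store(&me->locked, true);
    /* exchange(): atomically append ourselves to the queue */
    qnode_t *prev = atomic_exchange(&l->tail, me);
    if (prev != NULL) {                       /* lock is held: get in line */
        atomic_store(&prev->next, me);
        while (atomic_load(&me->locked))
            ;                                 /* spin on our private flag only */
    }
}

void mcs_release(mcs_lock_t *l, qnode_t *me) {
    qnode_t *succ = atomic_load(&me->next);
    if (succ == NULL) {
        /* compare_and_swap(): if we are still the tail, the queue is empty */
        qnode_t *expected = me;
        if (atomic_compare_exchange_strong(&l->tail, &expected, (qnode_t *)NULL))
            return;
        while ((succ = atomic_load(&me->next)) == NULL)
            ;                                 /* a successor is mid-enqueue */
    }
    atomic_store(&succ->locked, false);       /* hand the lock to the next CPU */
}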




Selected Benchmark
• Compared
  – test and test and set
  – Others in paper
     • Anderson’s array based queue
     • test and set with exponential back-off
  – MCS




Confirmed Trade-off
• Queue locks scale well but have higher
  overhead
• Spin Locks have low overhead but don’t
  scale well




Other Hardware Provided SMP
        Synchronisation Primitives
• Atomic Add/Subtract
   – Can be used to implement counting semaphores
• Exchange
• Compare and Exchange
• Load-linked / store-conditional
   – Two separate instructions
      • Load value using load-linked
      • Modify, and store using store-conditional
      • If the value was changed by another processor, or an interrupt
        occurred, then the store-conditional fails
   – Can be used to implement all of the above primitives (see the
     sketch below)
   – Implemented without bus locking
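
As one instance, a hedged C11 sketch of atomic add; on LL/SC architectures the compare-exchange loop below compiles to a load-linked/store-conditional retry loop:

#include <stdatomic.h>

/* Atomic add via a retry loop. On LL/SC machines (e.g. MIPS, ARM,
   RISC-V) the compare-exchange is lowered to a load-linked /
   store-conditional pair that retries if another CPU intervened. */
int atomic_add(_Atomic int *counter, int n) {
    int old = atomic_load(counter);
    while (!atomic_compare_exchange_weak(counter, &old, old + n))
        ;   /* store-conditional failed: old was refreshed, try again */
    return old + n;
}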

Spinning versus Switching
• Remember spinning (busy-waiting) on a lock
  made little sense on a uniprocessor
  – There was no other running process to release the lock
  – Blocking and (eventually) switching to the lock holder
    is the only option.
• On SMP systems, the decision to spin or block is
  not as clear.
  – The lock is held by another running CPU and will be
    freed without necessarily blocking the requestor


Spinning versus Switching
  – Blocking and switching
     • to another process takes time
         – Save one context, restore another
         – Cache contains the current process’s working set, not the
           new process’s
             » Adjusting the cache working set also takes time
         – The TLB behaves similarly to the cache
     • Switching back when the lock is free incurs the same costs again
  – Spinning wastes CPU time directly
• Trade-off
  – If the lock is held for less time than the overhead of switching
    away and back
  ⇒ It’s more efficient to spin
⇒ Spinlocks expect critical sections to be short
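
A hedged sketch of the resulting spin-then-switch policy, using C11 atomics with sched_yield() standing in for blocking; SPIN_LIMIT is our own tuning knob:

#include <stdatomic.h>
#include <sched.h>

#define SPIN_LIMIT 1000   /* assumption: roughly the cost of a context switch */

typedef atomic_flag spinlock_t;   /* initialise with ATOMIC_FLAG_INIT */

void adaptive_lock(spinlock_t *l) {
    for (;;) {
        for (int i = 0; i < SPIN_LIMIT; i++)
            if (!atomic_flag_test_and_set(l))
                return;                   /* acquired while spinning */
        sched_yield();                    /* held too long: give up the CPU */
    }
}

void adaptive_unlock(spinlock_t *l) {
    atomic_flag_clear(l);
}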
Preemption and Spinlocks
• Critical sections synchronised via spinlocks are expected
  to be short
   – Avoid other CPUs wasting cycles spinning
• What happens if the spinlock holder is preempted
  at the end of its timeslice?
   – Mutual exclusion is still guaranteed
   – Other CPUs will spin until the holder is scheduled again!!!!!
⇒ Spinlock implementations disable interrupts in addition to
  acquiring locks to avoid lock-holder preemption
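
Linux packages this pattern as spin_lock_irqsave(); a sketch of typical kernel usage (my_lock and update_shared are our own names):

#include <linux/spinlock.h>   /* kernel code only */

static DEFINE_SPINLOCK(my_lock);

void update_shared(void) {
    unsigned long flags;
    spin_lock_irqsave(&my_lock, flags);   /* take lock + disable local IRQs */
    /* ... short critical section: holder cannot be preempted here ... */
    spin_unlock_irqrestore(&my_lock, flags);
}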



Multiprocessor Scheduling
• Given X processes (or threads) and Y
  CPUs,
  – how do we allocate them to the CPUs?




A Single Shared Ready Queue
• When a CPU goes idle, it takes the highest-
  priority process from the shared ready queue




Single Shared Ready Queue
• Pros
  – Simple
  – Automatic load balancing
• Cons
  – Lock contention on the ready queue can be a
    major bottleneck
    • Due to frequent scheduling or many CPUs or both
  – Not all CPUs are equal
    • The last CPU a process ran on is likely to have
      more related entries in the cache.
Affinity Scheduling
• Basic Idea
  – Try hard to run a process on the CPU it ran
    on last time


• One approach: Two-level scheduling




Two-level Scheduling
• Each CPU has its own ready queue
• Top-level algorithm assigns process to CPUs
  – Defines their affinity, and roughly balances the load
• The bottom-level scheduler:
  – Is the frequently invoked scheduler (e.g. on blocking
    on I/O, a lock, or exhausting a timeslice)
  – Runs on each CPU and selects from its own ready
    queue
     • Ensures affinity
  – If nothing is available from the local ready queue, it
    runs a process from another CPU’s ready queue rather
    than go idle (see the sketch below)
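
A minimal sketch of the bottom-level scheduler’s queue handling (C11; all names and the stealing order are ours, not from the slides):

#include <stdatomic.h>
#include <stddef.h>

#define NCPUS 4

typedef struct task { struct task *next; } task_t;

typedef struct {
    atomic_flag lock;   /* per-queue spinlock: contended only when stealing */
    task_t *head;
} ready_queue_t;

static ready_queue_t queue[NCPUS];   /* assumed initialised, locks clear */

static task_t *pop(ready_queue_t *q) {
    while (atomic_flag_test_and_set(&q->lock))
        ;                              /* spin: queue operations are short */
    task_t *t = q->head;
    if (t != NULL)
        q->head = t->next;
    atomic_flag_clear(&q->lock);
    return t;
}

/* Bottom-level scheduler: prefer the local queue (affinity), then
   steal from another CPU's queue rather than go idle. */
task_t *pick_next(int cpu) {
    task_t *t = pop(&queue[cpu]);
    for (int i = 1; t == NULL && i < NCPUS; i++)
        t = pop(&queue[(cpu + i) % NCPUS]);
    return t;                          /* NULL: nothing runnable anywhere */
}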
Two-level Scheduling
• Pros
  – No lock contention on per-CPU ready queues
    in the (hopefully) common case
  – Load balancing to avoid idle queues
  – Automatic affinity to a single CPU for more
    cache friendly behaviour




