Multicore Computers
Mr. A. B. Shinde
Electronics Engineering
Contents…
 Hardware performance issues,
 Software performance issues,
 Multicore organization,
 Intel x86 multicore organization,
 ARM11 MPCore
2
Introduction
 A multicore computer combines two or more processors on a
single computer chip.
 The use of more complex single-processor chips has reached a
limit due to hardware performance issues (limits in instruction-level
parallelism and power constraints).
 The multicore architecture poses challenges to software developers
to exploit the capability for multithreading across multiple cores.
 The main variables in a multicore organization are the number of
processors on the chip, the number of levels of cache memory, and
the extent to which cache memory is shared.
3
Introduction
 A multicore computer, also known as a chip multiprocessor,
combines two or more processors (called cores) on a single piece of
silicon (called a die).
 Each core consists of all of the components of an independent
processor, such as registers, ALU, pipeline hardware, and control
unit, plus L1 instruction and data caches.
 In addition to the multiple cores, contemporary multicore chips also
include L2 cache and, in some cases, L3 cache.
4
Hardware Performance Issues
5
Hardware Performance Issues
 Microprocessor systems have experienced a steady, exponential
increase in execution performance for decades.
 This increase is due to refinements in the organization of the
processor on the chip and to increases in clock frequency.
6
Intel Microprocessor Performance
Hardware Performance Issues
 Increase in Parallelism:
 The organizational changes in processor design have primarily been
focused on increasing instruction-level parallelism, so that more work
could be done in each clock cycle.
7
Superscalar
Hardware Performance Issues
 Increase in Parallelism:
 The organizational changes in processor design have primarily been
focused on increasing instruction-level parallelism, so that more work
could be done in each clock cycle.
8
Simultaneous multithreading
Hardware Performance Issues
 Increase in Parallelism:
 The organizational changes in processor design have primarily been
focused on increasing instruction-level parallelism, so that more work
could be done in each clock cycle.
9
Multicore
Hardware Performance Issues
 Increase in Parallelism:
 Pipelining: Individual instructions are executed through a pipeline of
stages so that while one instruction is executing in one stage of the
pipeline, another instruction is executing in another stage of the
pipeline.
 Superscalar: Multiple pipelines are constructed by replicating
execution resources. This enables parallel execution of instructions in
parallel pipelines, so long as hazards are avoided.
 Simultaneous multithreading (SMT): Register banks are replicated
so that multiple threads can share the use of pipeline resources.
10
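The idealized throughput of pipelining and superscalar execution can be sketched with simple cycle counts (a back-of-the-envelope model with assumed parameters; real pipelines lose cycles to hazards and stalls, which is exactly the limitation discussed on the next slides):

```python
# Idealized cycle counts for executing n instructions (illustrative only;
# hazards, stalls, and drain effects are ignored).

def unpipelined_cycles(n, stages):
    # Without overlap, each instruction occupies all stages in turn.
    return n * stages

def pipelined_cycles(n, stages):
    # A k-stage pipeline completes one instruction per cycle once full:
    # k cycles to fill, then n - 1 further completions.
    return stages + (n - 1)

def superscalar_cycles(n, stages, width):
    # With `width` parallel pipelines and no hazards, up to `width`
    # instructions complete per cycle after the initial fill.
    return stages + (n // width - 1)

n = 1_000_000
print(unpipelined_cycles(n, 5))     # 5000000
print(pipelined_cycles(n, 5))       # 1000004  (~5x better)
print(superscalar_cycles(n, 5, 4))  # 250004   (~4x better again)
```

The gap between the ideal numbers and reality is what motivates the move to multiple cores.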
Hardware Performance Issues
 Increase in Parallelism:
 In the case of pipelining, simple three-stage pipelines were replaced
by pipelines with five stages, and then many more stages, with some
implementations having over a dozen stages.
 There is a practical limit to how far this trend can be taken, because
with more stages, there is the need for more logic (hardware), more
interconnections and more control signals.
11
Hardware Performance Issues
 Increase in Parallelism:
 With superscalar organization, performance increase can be
achieved by increasing the number of parallel pipelines.
 There are limitations, as the number of pipelines increases. More logic
is required to manage hazards and to stage instruction resources.
 A single thread of execution reaches the point where hazards and
resource dependencies prevent the full use of the multiple pipelines
available.
 Also, the complexity of managing multiple threads over a set of
pipelines limits the number of threads and the number of pipelines
that can be effectively utilized.
12
Hardware Performance Issues
 Increase in Parallelism:
 The increase in complexity to deal with all of the logical issues related
to very long pipelines, multiple superscalar pipelines, and multiple SMT
register banks means that an increasing amount of the chip area is
occupied with coordination and signal-transfer logic.
 This increases the difficulty of designing, fabricating, and
debugging the chips.
 Power issues are another big challenge.
13
Hardware Performance Issues
 Power Consumption:
 To maintain high performance, the number of transistors per chip has
risen, and clock frequencies have increased. Unfortunately, power
requirements have grown exponentially as chip density and clock
frequency have risen.
 One way to control power density is to use more of the chip area for
cache memory.
 Memory transistors are smaller and have a power density an order of
magnitude lower than that of logic.
14
Hardware Performance Issues
 Power Consumption:
15
Figure shows where the power consumption trend is leading.
Assuming about 50–60% of the chip area is devoted to memory, the chip
will support cache memory of about 100 MB and leave over 1 billion
transistors available for logic.
Hardware Performance Issues
 Power Consumption:
 In recent decades, Pollack’s rule has been observed, which states that
performance increase is roughly proportional to the square root of the
increase in complexity.
 If you double the logic in a processor core, then it delivers only about
40% more performance.
 The use of multiple cores has the potential to provide near-linear
performance improvement with the increase in the number of cores.
16
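Pollack's rule can be checked with a one-line calculation; the 40% figure above is the rounded value of √2 − 1:

```python
import math

def pollack_speedup(complexity_factor):
    # Pollack's rule: performance grows roughly as the square root of the
    # increase in single-core complexity (logic area / transistor count).
    return math.sqrt(complexity_factor)

# Doubling the logic in one core buys only ~41% more performance,
# whereas doubling the number of cores can approach 2x for workloads
# with sufficient thread-level parallelism.
print(f"{pollack_speedup(2.0):.2f}")   # prints 1.41
```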
Hardware Performance Issues
 Power Consumption:
 Power considerations provide another motive for moving toward a
multicore organization.
 With such a huge amount of cache memory on the chip, it becomes
unlikely that any one thread of execution can effectively use all that
memory.
 With SMT, a number of relatively independent threads or processes
have a greater opportunity to take full advantage of the cache memory.
17
Software Performance Issues
18
Software Performance Issues
 A detailed examination of the software performance issues related to
multicore organization is a huge task.
19
Software Performance Issues
 Software on Multicore:
 The performance benefits of a multicore organization depend on the
ability to effectively exploit the parallel resources.
 Consider a single application running on a multicore system.
20
Amdahl’s law assumes a program in which a fraction (1 − f) of the
execution time involves code that is inherently serial and a fraction f
involves code that is infinitely parallelizable with no scheduling overhead.
Software Performance Issues
 A number of classes of applications benefit directly from the ability to
scale throughput with the number of cores.
 Multithreaded native applications: Multithreaded applications are
characterized by having a small number of highly threaded processes.
 Examples of threaded applications include Lotus Domino or Siebel
CRM (Customer Relationship Manager).
 Multiprocess applications: Multiprocess applications are characterized
by the presence of many single-threaded processes.
 Examples of multi-process applications include the Oracle database,
SAP, and PeopleSoft.
21
Software Performance Issues
 Java applications:
 The Java language greatly facilitates multithreaded applications.
 The Java Virtual Machine is a multithreaded process that provides
scheduling and memory management for Java applications.
 Java applications that can benefit directly from multicore resources
include application servers such as Sun’s Java Application Server, BEA’s
Weblogic, IBM’s Websphere etc.
 All applications that use a Java 2 Platform, Enterprise Edition (J2EE
platform) application server can immediately benefit from multicore
technology.
22
Software Performance Issues
 Multi-instance applications:
 Even if an individual application does not scale to take advantage of a
large number of threads, it is still possible to gain advantage from
multicore architecture by running multiple instances of the
application in parallel.
 If multiple application instances require some degree of isolation,
then virtualization technology can be used to provide each of them
with its own separate and secure environment.
23
Multicore Organization
 The main variables in a multicore organization are as follows:
 The number of core processors on the chip
 The number of levels of cache memory
 The amount of cache memory that is shared
24
Multicore Organization
25
Dedicated L1 cache Dedicated L2 cache
Shared L2 cache Shared L3 cache
General
Organizations of
Multicore systems
Multicore Organization
 Figure shows an organization found in
some of the earlier multicore computer
chips and is still seen in embedded
chips.
 In this organization, the only on-chip
cache is L1 cache, each core having
its own dedicated L1 cache.
 Almost invariably, the L1 cache is
divided into instruction and data
caches.
 An example of this organization is the
ARM11 MPCore.
26
Dedicated L1 cache
Multicore Organization
 This organization is also one in which
there is no on-chip cache sharing.
 In this case, there is enough area
available on the chip to allow for
dedicated L2 caches.
 An example of this organization is the
AMD Opteron.
27
Dedicated L2 cache
Multicore Organization
 Figure shows allocation of chip space to
memory, but with the use of a shared L2
cache.
 The Intel Core Duo has this organization.
 As the amount of cache memory available
on the chip continues to grow,
performance considerations dictate
splitting off a separate, shared L3 cache,
with dedicated L1 and L2 caches for
each core processor.
 The Intel Core i7 is an example of this
organization.
28
Shared L2 cache
Shared L3 cache
Multicore Organization
 Advantages of shared L2/L3 cache on the chip:
1. Constructive interference can reduce overall miss rates.
If a thread on one core accesses a main memory location, this brings the
frame containing the referenced location into the shared cache.
If a thread on another core soon thereafter accesses the same memory
block, the memory locations will already be available in the shared on-
chip cache.
2. Data shared by multiple cores is not replicated at the shared cache
level.
29
Multicore Organization
 Advantages of shared L2 cache on the chip:
3. With proper frame replacement algorithms, the amount of shared
cache allocated to each core is dynamic, so that threads that have less
locality can employ more cache.
4. Interprocessor communication is easy to implement, via shared
memory locations.
5. The use of a shared L2 cache confines the cache coherency
problem to the L1 cache level, which may provide some additional
performance advantage.
30
Multicore Organization
 An advantage of having only dedicated L2 caches on the chip is that
each core enjoys more rapid access to its private L2 cache.
 As both the amount of memory available and the number of cores
grow, the use of a shared L3 cache combined with either a shared
L2 cache or dedicated per-core L2 caches provides better performance
than simply a massive shared L2 cache.
31
Multicore Organization
 Another organizational design decision in a multicore system is
whether the individual cores will be superscalar or will implement
simultaneous multithreading (SMT).
 For example, the Intel Core Duo uses superscalar cores, whereas the
Intel Core i7 uses SMT cores.
 SMT scales up the number of hardware level threads that the
multicore system supports.
 As software is developed to fully exploit parallel resources, an SMT
approach appears to be more attractive than a superscalar approach.
32
Intel X86 Multicore Organization
33
Intel X86 Multicore Organization
 Intel has introduced a number of multicore products in recent years.
 Examples:
1. The Intel Core Duo and
2. The Intel Core i7.
34
Intel X86: Intel Core Duo
 The general structure of the Intel
Core Duo is shown in Figure.
 As in many multicore systems, each core
has its own dedicated L1 cache.
 In this case, each core has a 32-KB
instruction cache and a 32-KB
data cache.
35
Intel Core Duo Block Diagram
Advanced Programmable Interrupt Controller
(APIC)
Intel X86: Intel Core Duo
 Each core has an independent thermal control unit. Thermal
management is a fundamental capability, especially for laptop and mobile
systems.
 The Core Duo thermal control unit is designed to manage chip heat
dissipation to maximize processor performance within thermal
constraints.
36
Intel X86: Intel Core Duo
 The thermal management unit monitors digital sensors for high-
accuracy die temperature measurements.
 The maximum temperature for each thermal zone is reported
separately via dedicated registers that can be polled by software.
 If the temperature in a core exceeds a threshold, the thermal control
unit reduces the clock rate for that core to reduce heat generation.
37
Intel X86: Intel Core Duo
 Advanced Programmable Interrupt Controller (APIC) is the second
key element.
 The APIC performs a number of functions, including the following:
1. The APIC can provide interprocessor interrupts, which allow any
process to interrupt any other processor or set of processors.
In the case of the Core Duo, a thread in one core can generate an
interrupt, which is accepted by the local APIC, routed to the APIC of the
other core, and communicated as an interrupt to the other core.
2. The APIC accepts I/O interrupts and routes these to the appropriate
core.
3. Each APIC includes a timer, which can be set by the OS to generate
an interrupt to the local core.
38
Intel X86: Intel Core Duo
 The power management logic is responsible for reducing power
consumption when possible.
 The power management logic monitors thermal conditions and CPU
activity and adjusts voltage levels and power consumption
appropriately.
 It has an advanced power-gating capability that allows for
ultra-fine-grained logic control, turning on individual processor logic
subsystems only if and when they are needed.
39
Intel X86: Intel Core Duo
 The Core Duo chip includes a shared 2-MB L2 cache.
 The cache logic allows for a dynamic allocation of cache space
based on current core needs, so that one core can be assigned up to
100% of the L2 cache.
 The L2 cache includes logic to support the MESI
(Modified/Exclusive/Shared/Invalid) protocol for the attached L1 caches.
40
Intel X86: Intel Core Duo
 A cache line gets the M state when a processor writes to it; if the line
is not in the E or M state prior to the write, the cache first sends a
Read-For-Ownership (RFO) request.
 The Intel Core Duo extends this protocol to take into account the
case when there are multiple Core Duo chips organized as a
symmetric multiprocessor (SMP) system.
41
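The write behavior described above can be sketched as a toy transition function (an illustration of the MESI write rule only, not the Core Duo's actual controller logic):

```python
# MESI states: M(odified), E(xclusive), S(hared), I(nvalid).

def on_local_write(state, send_rfo):
    # A write always leaves the line in the M state. From E or M the
    # write proceeds silently; from S or I the cache must first gain
    # exclusive ownership via a Read-For-Ownership (RFO) request,
    # which invalidates any other copies of the line.
    if state not in ("E", "M"):
        send_rfo()
    return "M"

rfo_log = []
new_state = on_local_write("S", lambda: rfo_log.append("RFO"))
print(new_state, rfo_log)   # prints: M ['RFO']
```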
Intel X86: Intel Core Duo
 The L2 cache controller allows the system to distinguish between a
situation in which data are shared only by the two local cores and a
situation in which the data are shared by one or more caches on the die as
well as by an agent on the external bus (which can be another processor).
 When a core issues an RFO, if the line is shared only by the other
cache within the local die, the request can be resolved internally.
 The bus interface connects to the external bus, known as the Front
Side Bus, which connects to main memory, I/O controllers, and other
processor chips.
42
Intel X86: Intel Core i7
 The general structure of
the Intel Core i7 is shown
in Figure.
 Each core has its own
dedicated L2 cache and
the four cores share an
8-MB L3 cache.
43
Intel Core i7 Block Diagram
Intel X86: Intel Core i7
 Table shows the cache access latency, in terms of clock cycles for two
Intel multicore systems running at the same clock frequency.
 The Core 2 Quad has a shared L2 cache, similar to the Core Duo.
 The Core i7 improves on L2 cache performance with the use of the
dedicated L2 caches, and provides a relatively high-speed access to the
L3 cache.
44
Intel X86: Intel Core i7
 The Core i7 chip supports two forms of external communications to
other chips.
 The DDR3 memory controller brings the memory controller for the
DDR main memory onto the chip.
 The QuickPath Interconnect (QPI) is a cache-coherent, point-to-point,
link-based electrical interconnect specification for Intel processors
and chipsets.
 It enables high-speed communications among connected processor
chips. The QPI link operates at 6.4 GT/s (gigatransfers per second).
45
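A back-of-the-envelope check of what 6.4 GT/s implies for bandwidth (this assumes the common QPI configuration of a 16-bit data payload per direction per transfer, which the slide does not state):

```python
# QPI bandwidth estimate at 6.4 GT/s (assumed 16-bit payload per direction).
transfers_per_sec = 6.4e9        # 6.4 billion transfers per second
payload_bits = 16                # data bits moved per transfer, per direction
bytes_per_direction = transfers_per_sec * payload_bits / 8

print(bytes_per_direction / 1e9)  # 12.8 GB/s each way (25.6 GB/s combined)
```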
ARM11 MPCore
46
ARM11 MPCore
 The ARM11 MPCore is a multicore product based on the ARM11
processor family.
 The ARM11 MPCore can be configured with up to four processors,
each with its own L1 instruction and data caches, per chip.
47
ARM11 MPCore
48
ARM11 MPCore Processor Block Diagram
 Distributed interrupt controller
(DIC)
 Timer
 Watchdog
 CPU interface
 CPU
 Vector floating-point (VFP) unit
 L1 cache
 Snoop control unit (SCU)
ARM11 MPCore
 The key elements of the system are as follows:
 Distributed interrupt controller (DIC): Handles interrupt detection
and interrupt prioritization. The DIC distributes interrupts to individual
processors.
 Timer: Each CPU has its own private timer that can generate
interrupts.
 Watchdog: Issues warning alerts in the event of software failures. If
the watchdog is enabled, it is set to a predetermined value and counts
down to 0. If the watchdog value reaches zero, an alert is issued.
 CPU interface: Handles interrupt acknowledgement, interrupt
masking, and interrupt completion acknowledgement.
49
ARM11 MPCore
 The key elements of the system are as follows:
 CPU: A single ARM11 processor. Individual CPUs are referred to as
MP11 CPUs.
 Vector floating-point (VFP) unit: A coprocessor that implements
floating point operations in hardware.
 L1 cache: Each CPU has its own dedicated L1 data cache and L1
instruction cache.
 Snoop control unit (SCU): Responsible for maintaining coherency
among L1 data caches.
50
ARM11 MPCore
51
ARM11 MPCore Configurable Options
ARM11 MPCore
 Interrupt Handling:
 The Distributed Interrupt Controller (DIC) collates interrupts from a large
number of sources.
 It provides:
 Masking of interrupts
 Prioritization of the interrupts
 Distribution of the interrupts to the target MP11 CPUs
 Tracking the status of interrupts
 Generation of interrupts by software
52
ARM11 MPCore
 Interrupt Handling:
 The DIC enables the number of interrupts supported in the system to
be independent of the MP11 CPU design.
 The DIC is memory mapped; that is, control registers for the DIC are
defined relative to a main memory base address.
 The DIC is designed to satisfy two functional requirements:
 Provide a means of routing an interrupt request to a single CPU or
multiple CPUs, as required.
 Provide a means of interprocessor communication so that a thread
on one CPU can cause activity by a thread on another CPU.
53
ARM11 MPCore
 Interrupt Handling:
 The DIC can route an interrupt to one or more CPUs in the following
three ways:
 An interrupt can be directed to a specific processor only.
 An interrupt can be directed to a defined group of processors.
 An interrupt can be directed to all processors.
54
ARM11 MPCore
 Interrupt Handling:
 From the point of view of an MP11 CPU, an interrupt can be:
 Inactive: An Inactive interrupt is one that is nonasserted.
 Pending: A Pending interrupt is one that has been asserted, and for
which processing has not started on that CPU.
 Active: An Active interrupt is one that has been started on that CPU, but
processing is not complete.
55
ARM11 MPCore
 Interrupt Handling:
 Interrupts come from the following sources:
 Interprocessor interrupts (IPIs): Each CPU has private interrupts
(ID0–ID15) that can be triggered by software. The priority of an IPI
depends on the receiving CPU, not the sending CPU.
 Private timer and/or watchdog interrupts: These use interrupt IDs 29
and 30.
 Legacy FIQ line: In legacy IRQ mode, the legacy FIQ pin bypasses the
Interrupt Distributor logic and directly drives interrupt requests into the
CPU.
 Hardware interrupts: Hardware interrupts are triggered by
programmable events on associated interrupt input lines. Hardware
interrupts start at ID32.
56
ARM11 MPCore
57
Block diagram of the DIC
ARM11 MPCore
 The DIC is configurable to support between 0 and 255 hardware
interrupt inputs.
 The DIC maintains a list of interrupts, showing their priority and
status.
 The Interrupt Distributor transmits to each CPU Interface the
highest-priority Pending interrupt for that interface. It receives back the
interrupt acknowledgement and can then change the status of the
corresponding interrupt.
 The CPU Interface also transmits end-of-interrupt (EOI) information,
which enables the Interrupt Distributor to update the status of that
interrupt from Active to Inactive.
58
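The Inactive → Pending → Active lifecycle managed by the Distributor and CPU Interface can be sketched as a toy state machine (class and method names are illustrative, not ARM's implementation):

```python
class Interrupt:
    # Tracks one interrupt's state as seen by the Distributor for one CPU.
    def __init__(self, int_id):
        self.int_id = int_id
        self.state = "Inactive"

    def assert_line(self):
        # Interrupt asserted but not yet serviced on this CPU.
        if self.state == "Inactive":
            self.state = "Pending"

    def acknowledge(self):
        # CPU Interface acknowledges: servicing begins.
        if self.state == "Pending":
            self.state = "Active"

    def end_of_interrupt(self):
        # EOI from the CPU Interface lets the Distributor retire it.
        if self.state == "Active":
            self.state = "Inactive"

irq = Interrupt(34)                  # a hardware interrupt ID (>= 32)
irq.assert_line(); irq.acknowledge(); irq.end_of_interrupt()
print(irq.state)                     # prints: Inactive
```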
ARM11 MPCore
 Cache Coherency:
 The MPCore’s Snoop Control Unit (SCU) is designed to resolve most
of the traditional bottlenecks related to access to shared data.
 The SCU introduces three types of optimization:
 Direct data intervention,
 Duplicated tag RAMs, and
 Migratory lines.
59
ARM11 MPCore
 Cache Coherency:
 Direct data intervention (DDI): Enables copying clean data from one
CPU L1 data cache to another CPU L1 data cache without accessing
external memory.
 This reduces read-after-read activity from the Level 1 cache to the
Level 2 cache.
 Thus, a local L1 cache miss can be resolved in a remote L1 cache
rather than by an access to the shared L2 cache.
60
ARM11 MPCore
 Cache Coherency:
 The main memory location of each line within a cache is identified by a
tag for that line.
 The tags can be implemented as a separate block of RAM of the same
length as the number of lines in the cache.
 In the SCU, the duplicated tag RAMs are duplicated versions of the L1
tag RAMs.
 Coherency commands are sent only to CPUs that must update their
coherent data cache.
 This reduces power consumption.
 Because the tag data is available locally, the SCU limits cache
manipulations to processors that have cache lines in common.
61
62
This presentation is published only for educational purposes.
shindesir.pvp@gmail.com

More Related Content

PPTX
CPU Scheduling in OS Presentation
PPTX
Cache coherence
PPT
Webservices
PPT
pipelining
PPTX
Hardware Multi-Threading
PPTX
English 5 types of viewing materials.pptx
PDF
QUANTITATIVE METHODS NOTES.pdf
PPTX
Classes objects in java
CPU Scheduling in OS Presentation
Cache coherence
Webservices
pipelining
Hardware Multi-Threading
English 5 types of viewing materials.pptx
QUANTITATIVE METHODS NOTES.pdf
Classes objects in java

What's hot (20)

PPT
Memory management
PPT
Introduction to Computer Architecture
PPT
Multicore computers
PPT
Memory Management in OS
PPTX
Computer architecture virtual memory
PPT
OS Process and Thread Concepts
PDF
Multithreading
PDF
Disk allocation methods
PPTX
Instruction Set Architecture
PPT
Virtual memory
PPTX
Operating system components
PDF
Linux Memory Management
PPT
Sequential consistency model
PPT
Operating System 2
PPT
Thrashing allocation frames.43
PPTX
Operating System-Memory Management
PPTX
System calls
PPTX
Unit 4-Memory Management - operating systems.pptx
PPTX
Process management os concept
Memory management
Introduction to Computer Architecture
Multicore computers
Memory Management in OS
Computer architecture virtual memory
OS Process and Thread Concepts
Multithreading
Disk allocation methods
Instruction Set Architecture
Virtual memory
Operating system components
Linux Memory Management
Sequential consistency model
Operating System 2
Thrashing allocation frames.43
Operating System-Memory Management
System calls
Unit 4-Memory Management - operating systems.pptx
Process management os concept
Ad

Similar to Multicore Computers (20)

PPTX
Study of various factors affecting performance of multi core processors
PPTX
Slot29-CH18-MultiCoreComputers-18-slides (1).pptx
DOCX
1.multicore processors
PPT
Intel new processors
PDF
IEEExeonmem
PDF
Michael Gschwind, Cell Broadband Engine: Exploiting multiple levels of parall...
DOC
Analysis of Multicore Performance Degradation of Scientific Applications
PDF
I understand that physics and hardware emmaded on the use of finete .pdf
PPTX
Multi processor
PPTX
Trends in computer architecture
PDF
Hyper threading technology
PDF
From Rack scale computers to Warehouse scale computers
PPTX
Multicore processor by Ankit Raj and Akash Prajapati
DOCX
Multi-Core on Chip Architecture *doc - IK
PDF
Ef35745749
PPT
Chap2 slides
PPT
Chapter_1.ppt Peter S Pacheco, Matthew Malensek – An Introduction to Parallel...
PPT
Chapter_1_16_10_2024.pptPeter S Pacheco, Matthew Malensek – An Introduction t...
PPT
Massively Parallel Architectures
PDF
Co question bank LAKSHMAIAH
Study of various factors affecting performance of multi core processors
Slot29-CH18-MultiCoreComputers-18-slides (1).pptx
1.multicore processors
Intel new processors
IEEExeonmem
Michael Gschwind, Cell Broadband Engine: Exploiting multiple levels of parall...
Analysis of Multicore Performance Degradation of Scientific Applications
I understand that physics and hardware emmaded on the use of finete .pdf
Multi processor
Trends in computer architecture
Hyper threading technology
From Rack scale computers to Warehouse scale computers
Multicore processor by Ankit Raj and Akash Prajapati
Multi-Core on Chip Architecture *doc - IK
Ef35745749
Chap2 slides
Chapter_1.ppt Peter S Pacheco, Matthew Malensek – An Introduction to Parallel...
Chapter_1_16_10_2024.pptPeter S Pacheco, Matthew Malensek – An Introduction t...
Massively Parallel Architectures
Co question bank LAKSHMAIAH
Ad

More from Dr. A. B. Shinde (20)

PDF
Python Programming Laboratory Manual for Students
PPSX
OOPS Concepts in Python and Exception Handling
PPSX
Python Functions, Modules and Packages
PPSX
Python Data Types, Operators and Control Flow
PPSX
Introduction to Python programming language
PPSX
Communication System Basics
PPSX
MOSFETs: Single Stage IC Amplifier
PPSX
PPSX
Color Image Processing: Basics
PPSX
Edge Detection and Segmentation
PPSX
Image Processing: Spatial filters
PPSX
Image Enhancement in Spatial Domain
DOCX
Resume Format
PDF
Digital Image Fundamentals
PPSX
Resume Writing
PPSX
Image Processing Basics
PPSX
Blooms Taxonomy in Engineering Education
PPSX
ISE 7.1i Software
PDF
VHDL Coding Syntax
PDF
VHDL Programs
Python Programming Laboratory Manual for Students
OOPS Concepts in Python and Exception Handling
Python Functions, Modules and Packages
Python Data Types, Operators and Control Flow
Introduction to Python programming language
Communication System Basics
MOSFETs: Single Stage IC Amplifier
Color Image Processing: Basics
Edge Detection and Segmentation
Image Processing: Spatial filters
Image Enhancement in Spatial Domain
Resume Format
Digital Image Fundamentals
Resume Writing
Image Processing Basics
Blooms Taxonomy in Engineering Education
ISE 7.1i Software
VHDL Coding Syntax
VHDL Programs

Recently uploaded (20)

PPTX
UNIT-1 - COAL BASED THERMAL POWER PLANTS
DOCX
573137875-Attendance-Management-System-original
PPTX
Foundation to blockchain - A guide to Blockchain Tech
PDF
Evaluating the Democratization of the Turkish Armed Forces from a Normative P...
PPTX
CH1 Production IntroductoryConcepts.pptx
PDF
Model Code of Practice - Construction Work - 21102022 .pdf
PDF
Embodied AI: Ushering in the Next Era of Intelligent Systems
PDF
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
PPTX
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
PPTX
bas. eng. economics group 4 presentation 1.pptx
PDF
composite construction of structures.pdf
PDF
TFEC-4-2020-Design-Guide-for-Timber-Roof-Trusses.pdf
PDF
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
PDF
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
PDF
Digital Logic Computer Design lecture notes
PPTX
Sustainable Sites - Green Building Construction
PDF
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
PPT
CRASH COURSE IN ALTERNATIVE PLUMBING CLASS
PPTX
Internet of Things (IOT) - A guide to understanding
PPTX
Construction Project Organization Group 2.pptx
UNIT-1 - COAL BASED THERMAL POWER PLANTS
573137875-Attendance-Management-System-original
Foundation to blockchain - A guide to Blockchain Tech
Evaluating the Democratization of the Turkish Armed Forces from a Normative P...
CH1 Production IntroductoryConcepts.pptx
Model Code of Practice - Construction Work - 21102022 .pdf
Embodied AI: Ushering in the Next Era of Intelligent Systems
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
bas. eng. economics group 4 presentation 1.pptx
composite construction of structures.pdf
TFEC-4-2020-Design-Guide-for-Timber-Roof-Trusses.pdf
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
Digital Logic Computer Design lecture notes
Sustainable Sites - Green Building Construction
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
CRASH COURSE IN ALTERNATIVE PLUMBING CLASS
Internet of Things (IOT) - A guide to understanding
Construction Project Organization Group 2.pptx

Multicore Computers

  • 1. Multicore Computers Mr. A. B. Shinde Electronics Engineering
  • 2. Contents…  Hardware performance issues,  Software performance issues,  Multicore organization,  Intel x 86 multicore organizations,  ARM11 MPC core 699 2
  • 3. Introduction  A multicore computer, combines two or more processors on a single computer chip.  The use of more complex single-processor chips has reached a limit due to hardware performance issues, (limits in instruction-level parallelism & power limitations).  The multicore architecture poses challenges to software developers to exploit the capability for multithreading across multiple cores.  The main variables in a multicore organization are the number of processors on the chip, the number of levels of cache memory, and the extent to which cache memory is shared. 3
  • 4. Introduction  A multicore computer, also known as a chip multiprocessor, combines two or more processors (called cores) on a single piece of silicon (called a die).  Each core consists of all of the components of an independent processor, such as registers, ALU, pipeline hardware, and control unit, plus L1 instruction and data caches.  In addition to the multiple cores, contemporary multicore chips also include L2 cache and, in some cases, L3 cache. 4
  • 6. Hardware Performance Issues  Microprocessor systems have experienced a steady, exponential increase in execution performance for decades.  This increase is due to refinements in the organization of the processor on the chip, and the increase in the clock frequency. 6 Intel Microprocessor Performance
  • 7. Hardware Performance Issues  Increase in Parallelism:  The organizational changes in processor design have primarily been focused on increasing instruction-level parallelism, so that more work could be done in each clock cycle. 7 Superscalar
  • 8. Hardware Performance Issues  Increase in Parallelism:  The organizational changes in processor design have primarily been focused on increasing instruction-level parallelism, so that more work could be done in each clock cycle. 8 Simultaneous multithreading
  • 9. Hardware Performance Issues  Increase in Parallelism:  The organizational changes in processor design have primarily been focused on increasing instruction-level parallelism, so that more work could be done in each clock cycle. 9 Multicore
  • 10. Hardware Performance Issues  Increase in Parallelism:  Pipelining: Individual instructions are executed through a pipeline of stages so that while one instruction is executing in one stage of the pipeline, another instruction is executing in another stage of the pipeline.  Superscalar: Multiple pipelines are constructed by replicating execution resources. This enables parallel execution of instructions in parallel pipelines, so long as hazards are avoided.  Simultaneous multithreading (SMT): Register banks are replicated so that multiple threads can share the use of pipeline resources. 10
  • 11. Hardware Performance Issues  Increase in Parallelism:  In the case of pipelining, simple three-stage pipelines were replaced by pipelines with five stages, and then many more stages, with some implementations having over a dozen stages.  There is a practical limit to how far this trend can be taken, because with more stages, there is the need for more logic (hardware), more interconnections and more control signals. 11
  • 12. Hardware Performance Issues  Increase in Parallelism:  With a superscalar organization, performance can be increased by adding parallel pipelines.  There are limitations as the number of pipelines increases: more logic is required to manage hazards and to stage instruction resources.  A single thread of execution eventually reaches the point where hazards and resource dependencies prevent full use of the multiple pipelines available.  Likewise, the complexity of managing multiple threads over a set of pipelines limits the number of threads and the number of pipelines that can be effectively utilized. 12
  • 13. Hardware Performance Issues  Increase in Parallelism:  The increase in complexity needed to deal with all of the logical issues related to very long pipelines, multiple superscalar pipelines, and multiple SMT register banks means that an increasing amount of the chip area is occupied by coordination and signal transfer logic.  This increases the difficulty of designing, fabricating, and debugging the chips.  Power issues are another big challenge. 13
  • 14. Hardware Performance Issues  Power Consumption:  To maintain high performance, designers have increased both the number of transistors per chip and the clock frequency. Unfortunately, power requirements have grown exponentially as chip density and clock frequency have risen.  One way to control power density is to use more of the chip area for cache memory.  Memory transistors are smaller and have a power density an order of magnitude lower than that of logic. 14
  • 15. Hardware Performance Issues  Power Consumption: 15 Figure shows where the power consumption trend is leading. Assuming about 50–60% of the chip area is devoted to memory, the chip will support cache memory of about 100 MB and leave over 1 billion transistors available for logic.
  • 16. Hardware Performance Issues  Power Consumption:  In recent decades, Pollack's rule has been observed to hold: performance increase is roughly proportional to the square root of the increase in complexity.  If you double the logic in a processor core, it delivers only about 40% more performance.  The use of multiple cores, in contrast, has the potential to provide near-linear performance improvement with the increase in the number of cores. 16
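Pollack's rule can be checked with quick arithmetic (a sketch only; the rule is an empirical approximation, and the complexity factor here is an example value):

```python
import math

def pollack_speedup(complexity_factor):
    # Pollack's rule: single-core performance grows roughly with the
    # square root of the increase in core complexity (logic area).
    return math.sqrt(complexity_factor)

def ideal_multicore_speedup(n_cores):
    # Idealized multicore scaling: near-linear in the number of cores,
    # ignoring software limits such as Amdahl's law.
    return float(n_cores)

# Doubling the logic in one core yields only ~41% more performance...
print(round(pollack_speedup(2.0), 2))   # 1.41
# ...whereas spending those transistors on a second core could,
# ideally, double throughput.
print(ideal_multicore_speedup(2))       # 2.0
```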
  • 17. Hardware Performance Issues  Power Consumption:  Power considerations provide another motive for moving toward a multicore organization.  Because the chip has such a huge amount of cache memory, it becomes unlikely that any one thread of execution can effectively use all of that memory.  A number of relatively independent threads or processes, as with SMT or multiple cores, has a greater opportunity to take full advantage of the cache memory. 17
  • 19. Software Performance Issues  A detailed examination of the software performance issues related to multicore organization is a huge task. 19
  • 20. Software Performance Issues  Software on Multicore:  The performance benefits of a multicore organization depend on the ability to effectively exploit the parallel resources.  Consider a single application running on a multicore system. 20 Amdahl's law applies: it assumes a program in which a fraction (1 - f) of the execution time involves code that is inherently serial and a fraction f involves code that is infinitely parallelizable with no scheduling overhead.
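The speedup implied by Amdahl's law can be sketched directly (a minimal illustration; the values of f and the core count are examples):

```python
def amdahl_speedup(f, n_cores):
    # Amdahl's law: fraction f of execution time is perfectly
    # parallelizable across n_cores; the rest, (1 - f), stays serial.
    return 1.0 / ((1.0 - f) + f / n_cores)

# Even 90%-parallel code falls well short of 8x on 8 cores:
print(round(amdahl_speedup(0.9, 8), 2))  # 4.71
# Only a fully parallel program (f = 1) scales linearly:
print(amdahl_speedup(1.0, 8))            # 8.0
```

This is why the application classes on the next slides matter: only software with enough independent work approaches linear scaling.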
  • 21. Software Performance Issues  A number of classes of applications benefit directly from the ability to scale throughput with the number of cores.  Multithreaded native applications: Multithreaded applications are characterized by having a small number of highly threaded processes.  Examples of threaded applications include Lotus Domino or Siebel CRM (Customer Relationship Manager).  Multiprocess applications: Multiprocess applications are characterized by the presence of many single-threaded processes.  Examples of multi-process applications include the Oracle database, SAP, and PeopleSoft. 21
  • 22. Software Performance Issues  Java applications:  The Java language greatly facilitates multithreaded applications.  The Java Virtual Machine is a multithreaded process that provides scheduling and memory management for Java applications.  Java applications that can benefit directly from multicore resources include application servers such as Sun's Java Application Server, BEA's WebLogic, and IBM's WebSphere.  All applications that use a Java 2 Platform, Enterprise Edition (J2EE platform) application server can immediately benefit from multicore technology. 22
  • 23. Software Performance Issues  Multi-instance applications:  Even if an individual application does not scale to take advantage of a large number of threads, it is still possible to gain advantage from a multicore architecture by running multiple instances of the application in parallel.  If multiple application instances require some degree of isolation, virtualization technology can be used to provide each of them with its own separate and secure environment. 23
  • 24. Multicore Organization  The main variables in a multicore organization are as follows:  The number of core processors on the chip  The number of levels of cache memory  The amount of cache memory that is shared 24
  • 25. Multicore Organization 25 Dedicated L1 cache Dedicated L2 cache Shared L2 cache Shared L3 cache General Organizations of Multicore systems
  • 26. Multicore Organization  Figure shows an organization found in some of the earlier multicore computer chips and is still seen in embedded chips.  In this organization, the only on-chip cache is L1 cache, each core having its own dedicated L1 cache.  Almost invariably, the L1 cache is divided into instruction and data caches.  An example of this organization is the ARM11 MPCore. 26 Dedicated L1 cache
  • 27. Multicore Organization  In this organization, there is also no on-chip cache sharing; here, there is enough area available on the chip to give each core a dedicated L2 cache.  An example of this organization is the AMD Opteron. 27 Dedicated L2 cache
  • 28. Multicore Organization  The figure shows a similar allocation of chip space to memory, but with the use of a shared L2 cache.  The Intel Core Duo has this organization.  As the amount of cache memory available on the chip continues to grow, performance considerations dictate splitting off a separate, shared L3 cache, with dedicated L1 and L2 caches for each core processor.  The Intel Core i7 is an example of this organization. 28 Shared L2 cache Shared L3 cache
  • 29. Multicore Organization  Advantages of shared L2/L3 cache on the chip: 1. Constructive interference can reduce overall miss rates. If a thread on one core accesses a main memory location, this brings the frame containing the referenced location into the shared cache. If a thread on another core soon thereafter accesses the same memory block, the memory locations will already be available in the shared on-chip cache. 2. Data shared by multiple cores is not replicated at the shared cache level. 29
  • 30. Multicore Organization  Advantages of shared L2 cache on the chip: 3. With proper frame replacement algorithms, the amount of shared cache allocated to each core is dynamic, so that threads that have less locality can employ more cache. 4. Interprocessor communication is easy to implement, via shared memory locations. 5. The use of a shared L2 cache confines the cache coherency problem to the L1 cache level, which may provide some additional performance advantage. 30
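The constructive-interference advantage can be sketched with a toy shared-cache model (the class and block names are illustrative, not real cache hardware): a block fetched by one core is already present when another core touches it.

```python
class SharedCache:
    """Toy model of a shared L2: one line store visible to all cores."""
    def __init__(self):
        self.lines = set()
        self.hits = 0
        self.misses = 0

    def access(self, block):
        if block in self.lines:
            self.hits += 1      # already brought in, perhaps by another core
        else:
            self.misses += 1    # fetched from main memory
            self.lines.add(block)

l2 = SharedCache()
l2.access("block_A")   # core 0: miss, block loaded into the shared cache
l2.access("block_A")   # core 1: hit -- constructive interference
print(l2.hits, l2.misses)  # 1 1
```

With dedicated per-core L2 caches, the second access would have been a miss in the other core's private cache.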
  • 31. Multicore Organization  An advantage of having only dedicated L2 caches on the chip is that each core enjoys more rapid access to its private L2 cache.  As both the amount of memory available and the number of cores grow, the use of a shared L3 cache combined with either a shared L2 cache or dedicated per-core L2 caches provides better performance than simply a massive shared L2 cache. 31
  • 32. Multicore Organization  Another organizational design decision in a multicore system is whether the individual cores will be superscalar or will implement simultaneous multithreading (SMT).  For example, the Intel Core Duo uses superscalar cores, whereas the Intel Core i7 uses SMT cores.  SMT scales up the number of hardware-level threads that the multicore system supports.  As software is developed to fully exploit parallel resources, an SMT approach appears to be more attractive than a superscalar approach. 32
  • 33. Intel X86 Multicore Organization 33
  • 34. Intel X86 Multicore Organization  Intel has introduced a number of multicore products in recent years.  Examples: 1. The Intel Core Duo and 2. The Intel Core i7. 34
  • 35. Intel X86: Intel Core Duo  The general structure of the Intel Core Duo is shown in the figure.  As in many multicore systems, each core has its own dedicated L1 cache.  In this case, each core has a 32-KB instruction cache and a 32-KB data cache. 35 Intel Core Duo Block Diagram Advanced Programmable Interrupt Controller (APIC)
  • 36. Intel X86: Intel Core Duo  Each core has an independent thermal control unit. Thermal management is a fundamental capability, especially for laptop and mobile systems.  The Core Duo thermal control unit is designed to manage chip heat dissipation to maximize processor performance within thermal constraints. 36
  • 37. Intel X86: Intel Core Duo  The thermal management unit monitors digital sensors for high-accuracy die temperature measurements.  The maximum temperature for each thermal zone is reported separately via dedicated registers that can be polled by software.  If the temperature in a core exceeds a threshold, the thermal control unit reduces the clock rate for that core to reduce heat generation. 37
  • 38. Intel X86: Intel Core Duo  The Advanced Programmable Interrupt Controller (APIC) is the second key element.  The APIC performs a number of functions, including the following: 1. The APIC can provide interprocessor interrupts, which allow any processor to interrupt any other processor or set of processors. In the case of the Core Duo, a thread in one core can generate an interrupt, which is accepted by the local APIC, routed to the APIC of the other core, and communicated as an interrupt to the other core. 2. The APIC accepts I/O interrupts and routes these to the appropriate core. 3. Each APIC includes a timer, which can be set by the OS to generate an interrupt to the local core. 38
  • 39. Intel X86: Intel Core Duo  The power management logic is responsible for reducing power consumption when possible.  The power management logic monitors thermal conditions and CPU activity and adjusts voltage levels and power consumption appropriately.  It has an advanced power-gating capability that allows for an ultra fine-grained logic control that turns on individual processor logic subsystems only if and when they are needed. 39
  • 40. Intel X86: Intel Core Duo  The Core Duo chip includes a shared 2-MB L2 cache.  The cache logic allows for a dynamic allocation of cache space based on current core needs, so that one core can be assigned up to 100% of the L2 cache.  The L2 cache includes logic to support the MESI (modified/exclusive/shared/invalid) protocol for the attached L1 caches. 40
  • 41. Intel X86: Intel Core Duo  A cache line gets the M state when a processor writes to it; if the line is not in the E or M state prior to the write, the cache sends a Read-For-Ownership (RFO) request.  The Intel Core Duo extends this protocol to take into account the case when there are multiple Core Duo chips organized as a symmetric multiprocessor (SMP) system. 41
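The write rule just described can be sketched as a tiny transition function (a simplified model of the MESI write path, not the Core Duo's actual controller logic):

```python
# MESI states: Modified, Exclusive, Shared, Invalid
def write_line(state):
    """Return (new_state, rfo_issued) for a processor write to a line."""
    if state in ("E", "M"):
        return "M", False    # exclusive ownership: write silently
    # In S or I the cache must first gain ownership: it issues a
    # Read-For-Ownership (RFO), which invalidates other copies.
    return "M", True

print(write_line("E"))  # ('M', False)
print(write_line("S"))  # ('M', True)
```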
  • 42. Intel X86: Intel Core Duo  The L2 cache controller allows the system to distinguish between a situation in which data are shared by the two local cores and a situation in which the data are shared by one or more caches on the die as well as by an agent on the external bus (which can be another processor).  The core issues an RFO only if the line is shared only by the other cache within the local die.  The bus interface connects to the external bus, known as the Front Side Bus, which connects to main memory, I/O controllers, and other processor chips. 42
  • 43. Intel X86: Intel Core i7  The general structure of the Intel Core i7 is shown in Figure.  Each core has its own dedicated L2 cache and the four cores share an 8-MB L3 cache. 43 Intel Core i7 Block Diagram
  • 44. Intel X86: Intel Core i7  Table shows the cache access latency, in terms of clock cycles for two Intel multicore systems running at the same clock frequency.  The Core 2 Quad has a shared L2 cache, similar to the Core Duo.  The Core i7 improves on L2 cache performance with the use of the dedicated L2 caches, and provides a relatively high-speed access to the L3 cache. 44
  • 45. Intel X86: Intel Core i7  The Core i7 chip supports two forms of external communications to other chips.  The DDR3 memory controller brings the memory controller for the DDR main memory onto the chip.  The QuickPath Interconnect (QPI) is a cache-coherent, point-to-point, link-based electrical interconnect specification for Intel processors and chipsets.  It enables high-speed communications among connected processor chips. The QPI link operates at 6.4 GT/s (gigatransfers per second). 45
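As a back-of-the-envelope check (assuming the commonly cited 16 data bits per transfer in each direction; the exact figure should be confirmed against the part's datasheet), 6.4 GT/s translates into link bandwidth as follows:

```python
transfers_per_sec = 6.4e9       # QPI signaling rate: 6.4 GT/s
payload_bytes = 2               # assumed 16 data bits per transfer
per_direction = transfers_per_sec * payload_bytes

print(per_direction / 1e9)      # 12.8 GB/s in each direction
print(2 * per_direction / 1e9)  # 25.6 GB/s total (the link is bidirectional)
```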
  • 47. ARM11 MPCore  The ARM11 MPCore is a multicore product based on the ARM11 processor family.  The ARM11 MPCore can be configured with up to four processors, each with its own L1 instruction and data caches, per chip. 47
  • 48. ARM11 MPCore 48 ARM11 MPCore Processor Block Diagram  Distributed interrupt controller (DIC)  Timer  Watchdog  CPU interface  CPU interface  CPU  Vector floating-point (VFP) unit  L1 cache  Snoop control unit (SCU)
  • 49. ARM11 MPCore  The key elements of the system are as follows:  Distributed interrupt controller (DIC): Handles interrupt detection and interrupt prioritization. The DIC distributes interrupts to individual processors.  Timer: Each CPU has its own private timer that can generate interrupts.  Watchdog: Issues warning alerts in the event of software failures. If the watchdog is enabled, it is set to a predetermined value and counts down to 0. If the watchdog value reaches zero, an alert is issued.  CPU interface: Handles interrupt acknowledgement, interrupt masking, and interrupt completion acknowledgement. 49
  • 50. ARM11 MPCore  The key elements of the system are as follows:  CPU: A single ARM11 processor. Individual CPUs are referred to as MP11 CPUs.  Vector floating-point (VFP) unit: A coprocessor that implements floating point operations in hardware.  L1 cache: Each CPU has its own dedicated L1 data cache and L1 instruction cache.  Snoop control unit (SCU): Responsible for maintaining coherency among L1 data caches. 50
  • 51. ARM11 MPCore 51 ARM11 MPCore Configurable Options
  • 52. ARM11 MPCore  Interrupt Handling:  The Distributed Interrupt Controller (DIC) collates interrupts from a large number of sources.  It provides:  Masking of interrupts  Prioritization of the interrupts  Distribution of the interrupts to the target MP11 CPUs  Tracking the status of interrupts  Generation of interrupts by software 52
  • 53. ARM11 MPCore  Interrupt Handling:  The DIC enables the number of interrupts supported in the system to be independent of the MP11 CPU design.  The DIC is memory mapped; that is, control registers for the DIC are defined relative to a main memory base address.  The DIC is designed to satisfy two functional requirements:  Provide a means of routing an interrupt request to a single CPU or multiple CPUs, as required.  Provide a means of interprocessor communication so that a thread on one CPU can cause activity by a thread on another CPU. 53
  • 54. ARM11 MPCore  Interrupt Handling:  The DIC can route an interrupt to one or more CPUs in the following three ways:  An interrupt can be directed to a specific processor only.  An interrupt can be directed to a defined group of processors.  An interrupt can be directed to all processors. 54
  • 55. ARM11 MPCore  Interrupt Handling:  From the point of view of an MP11 CPU, an interrupt can be  Inactive: An Inactive interrupt is one that is nonasserted.  Pending: A Pending interrupt is one that has been asserted, and for which processing has not started on that CPU.  Active: An Active interrupt is one that has been started on that CPU, but processing is not complete. 55
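The per-CPU interrupt life cycle above can be sketched as a small state machine (the event names are illustrative, not ARM register-level operations):

```python
# DIC view of one interrupt on one CPU:
# Inactive -> Pending -> Active -> Inactive
TRANSITIONS = {
    ("inactive", "assert"): "pending",           # interrupt asserted
    ("pending", "acknowledge"): "active",        # CPU begins processing
    ("active", "end_of_interrupt"): "inactive",  # EOI completes it
}

def step(state, event):
    # Events that do not apply in the current state are ignored.
    return TRANSITIONS.get((state, event), state)

s = "inactive"
for ev in ("assert", "acknowledge", "end_of_interrupt"):
    s = step(s, ev)
print(s)  # inactive
```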
  • 56. ARM11 MPCore  Interrupt Handling:  Interrupts come from the following sources:  Interprocessor interrupts (IPIs): Each CPU has private interrupts (ID0-ID15) that can be triggered by software. The priority of an IPI depends on the receiving CPU, not the sending CPU.  Private timer and/or watchdog interrupts: These use interrupt IDs 29 and 30.  Legacy FIQ line: In legacy IRQ mode, the legacy FIQ pin bypasses the Interrupt Distributor logic and directly drives interrupt requests into the CPU.  Hardware interrupts: Hardware interrupts are triggered by programmable events on associated interrupt input lines. Hardware interrupts start at ID32. 56
  • 58. ARM11 MPCore  The DIC is configurable to support between 0 and 255 hardware interrupt inputs.  The DIC maintains a list of interrupts, showing their priority and status.  The Interrupt Distributor transmits to each CPU Interface the highest Pending interrupt for that interface. It receives back the interrupt acknowledgement, and can then change the status of the corresponding interrupt.  The CPU Interface also transmits End of Interrupt Information (EOI), which enables the Interrupt Distributor to update the status of this interrupt from Active to Inactive. 58
  • 59. ARM11 MPCore  Cache Coherency:  The MPCore’s Snoop Control Unit (SCU) is designed to resolve most of the traditional bottlenecks related to access to shared data.  The SCU introduces three types of optimization:  Direct data intervention,  Duplicated tag RAMs, and  Migratory lines. 59
  • 60. ARM11 MPCore  Cache Coherency:  Direct data intervention (DDI): Enables copying clean data from one CPU L1 data cache to another CPU L1 data cache without accessing external memory.  This reduces read after read activity from the Level 1 cache to the Level 2 cache.  Thus, a local L1 cache miss is resolved in a remote L1 cache rather than from access to the shared L2 cache. 60
  • 61. ARM11 MPCore  Cache Coherency:  The main memory location of each line within a cache is identified by a tag for that line.  The tags can be implemented as a separate block of RAM of the same length as the number of lines in the cache.  The SCU's duplicated tag RAMs are duplicated versions of the L1 tag RAMs.  Coherency commands are sent only to CPUs that must update their coherent data cache.  This reduces power consumption.  Because the tag data is available locally, the SCU limits cache manipulations to processors that have cache lines in common. 61
  • 62. 62 This presentation is published only for educational purpose shindesir.pvp@gmail.com