Intel® Xeon Phi™ coprocessor
(codename Knights Corner)


     George Chrysos
     Senior Principal Engineer
     Hot Chips, August 28, 2012
Legal Disclaimers
Copyright © 2012 Intel Corporation. All rights reserved.
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT.
EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF
INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU
SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND
REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL
OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall
have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.
The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm
Intel, the Intel logo, Xeon, Intel Core and Intel Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. Other names and brands may be claimed as the property of others.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and
functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other
products.
For more complete information about performance and benchmark results, visit Performance Test Disclosure
This document contains information on products in the design phase of development.
All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not
guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel.
Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides
for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804
WARNING: Altering clock frequency and/or voltage may: (i) reduce system stability and useful life of the system and processor; (ii) cause the processor and other system components to fail; (iii) cause reductions in system performance; (iv) cause additional heat or other
damage; and (v) affect system data integrity. Intel has not tested, and does not warranty, the operation of the processor beyond its specifications. Intel assumes no responsibility that the processor, including if used with altered clock frequencies and/or voltages, will be fit
for any particular purpose. For more information, visit Overclocking Intel Processors
Warning: Altering PC memory frequency and/or voltage may (i) reduce system stability and useful life of the system, memory and processor; (ii) cause the processor and other system components to fail; (iii) cause reductions in system performance; (iv) cause additional
heat or other damage; and (v) affect system data integrity. Intel assumes no responsibility that the memory, including if used with altered clock frequencies and/or voltages, will be fit for any particular purpose. Check with memory manufacturer for warranty and additional
details.
Available on select Intel® Core™, Intel® Xeon® and Intel® Xeon Phi™ processors. Requires an Intel® HT Technology-enabled system. Consult your PC manufacturer. Performance will vary depending on the specific hardware and software used. For more information
including details on which processors support HT Technology, visit http://www.intel.com/info/hyperthreading.
Requires a system with a 64-bit enabled processor, chipset, BIOS and software. Performance will vary depending on the specific hardware and software you use. Consult your PC manufacturer for more information. For more information, visit
http://www.intel.com/info/em64t
Requires a system with Intel® Turbo Boost Technology. Intel Turbo Boost Technology and Intel Turbo Boost Technology 2.0 are only available on select Intel® processors. Consult your PC manufacturer. Performance varies depending on hardware, software, and
system configuration. For more information, visit http://www.intel.com/go/turbo
ENERGY STAR is a system-level energy specification, defined by the Environmental Protection Agency, that relies on all system components (such as processor, chipset, power supply, etc.). For more information, visit http://www.intel.com/technology/epa/index.html
Intel® Many Integrated Core (Intel MIC) Architecture

                                 Targeted at highly parallel HPC workloads
                                          • Physics, Chemistry, Biology, Financial Services

                                 Power efficient cores, support for parallelism
                                          • Cores: less speculation, threads, wider SIMD
                                          • Scalability: high BW on die interconnect and memory


                                 General Purpose Programming Environment
                                          • Runs Linux (full service, open source OS)
                                          • Runs applications written in Fortran, C, C++, …
                                          • Supports x86 memory model, IEEE 754
                                          • x86 collateral (libraries, compilers, Intel® VTune™, debuggers, etc.)

3   Visual and Parallel Computing Group                         Copyright © 2012 Intel Corporation. All rights reserved.
Knights Corner Coprocessor

[Figure: system diagram. An Intel® Xeon® processor with system memory connects to one or more KNC cards; the link is labeled TCP/IP and PCIe x16. Each KNC card has > 50 cores, runs a Linux OS, and has >= 8GB GDDR5 memory spread over multiple GDDR5 channels.]



Knights Corner – Power Efficient
Performance per Watt of a prototype Knights Corner cluster compared to the 2 top graphics-accelerated clusters (MFLOPS/Watt, higher is better). Source: www.green500.org

Site                              Accelerator         Top500 rank   Power     MFLOPS/Watt
Intel Corp                        Knights Corner      #150          72.9 kW   1381
Nagasaki Univ.                    ATI Radeon          #456          47 kW     1380
Barcelona Supercomputing Center   Nvidia Tesla 2090   #177          81.5 kW   1266
Knights Corner Micro-architecture


[Figure: die block diagram. Cores, each with its own L2, and tag directories (TD) are arranged in rows, together with the PCIe client logic and the GDDR memory controllers (GDDR MC).]




Knights Corner Core
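[Figure: core block diagram, summarized as a list]
Pipeline stages: PPF, PF, D0, D1, D2, E, WB
• 4 threads, in-order, one instruction pointer per thread (T0 IP, T1 IP, T2 IP, T3 IP)
• L1 TLB and 32KB Code Cache, 16B/Cycle (2 IPC)
• Decode and uCode, feeding Pipe 0 and Pipe 1
• Register files: VPU RF, X87 RF, Scalar RF
• Execution units: VPU (512b SIMD), X87, ALU 0, ALU 1
• L1 TLB and 32KB Data Cache
• Miss paths shown: Code Cache Miss, TLB Miss, DCache Miss
• L2 TLB and TLB Miss Handler
• L2 Ctl with HWP, 512KB L2 Cache, link to the On-Die Interconnect

X86 specific logic < 2% of core + L2 area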

Vector Processing Unit
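[Figure: vector pipeline and VPU block diagram, summarized as a list]
Core pipeline:   PPF, PF, D0, D1, D2, E, WB
Vector pipeline: D2, E, VC1, VC2, V1-V4, WB
• DEC and VPU RF (3R, 1W)
• Mask RF
• LD, EMU, ST
• Scatter / Gather
• Vector ALUs: 16 Wide x 32 bit, 8 Wide x 64 bit
• Fused Multiply Add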




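To make the "16 Wide x 32 bit", "Mask RF" and "Fused Multiply Add" labels concrete, here is a minimal scalar model in C. It is a sketch, not Knights Corner code: the function name `vpu_fma_masked`, the array-based "register" and the rule that masked-off lanes keep their old value are all assumptions made for illustration, and it uses no real intrinsics.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VPU_LANES 16  /* 512 bits / 32-bit elements, as on the slide */

/*
 * Hypothetical scalar model of a masked multiply-add over one 512-bit vector.
 * For every lane whose mask bit is set: dst = a * b + c.
 * Lanes whose mask bit is clear keep their previous dst value (assumed policy).
 * Written as multiply-then-add; a hardware FMA would round only once.
 */
void vpu_fma_masked(float dst[VPU_LANES], uint16_t mask,
                    const float a[VPU_LANES], const float b[VPU_LANES],
                    const float c[VPU_LANES])
{
    assert(dst != NULL && a != NULL && b != NULL && c != NULL);
    for (int lane = 0; lane < VPU_LANES; lane++) {
        if (mask & (1u << lane)) {
            dst[lane] = a[lane] * b[lane] + c[lane];
        }
    }
}
```

Calling it with mask `0x00FF` updates lanes 0 to 7 and leaves lanes 8 to 15 untouched, which is the behaviour a 16-bit mask register would need to provide for loops whose trip count is not a multiple of 16.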
Interconnect

[Figure: interconnect linking the cores, L2s and tag directories (TD). The diagram shows two sets of three rings:]
• BL: Data, 64 Bytes
• AD: Command and Address
• AK: Coherence and Credits




Distributed Tag Directories


[Figure: cores, L2s and tag directories (TD) on the interconnect. Each tag directory entry is drawn with the fields TAG, Core Valid Mask, State.]

Tag Directories track cache-lines in all L2s




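The entry layout on the slide (TAG, Core Valid Mask, State) can be sketched as a small C structure. Only the three field names come from the slide. The field widths, the 64-core limit, the MESI-like state names and the `td_*` helper functions are assumptions for illustration, and the protocol is deliberately simplified.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TD_MAX_CORES 64  /* assumption: one mask bit per core */

/* Assumed MESI-like states; the slide only names a "State" field. */
typedef enum { TD_INVALID, TD_SHARED, TD_EXCLUSIVE, TD_MODIFIED } td_state;

/* One tag directory entry: TAG, Core Valid Mask, State. */
typedef struct {
    uint64_t tag;              /* identifies the cache line */
    uint64_t core_valid_mask;  /* bit n set: core n's L2 holds the line */
    td_state state;
} td_entry;

/* Simplified read: add `core` to the holders and mark the line Shared. */
void td_record_read(td_entry *e, unsigned core)
{
    assert(e != NULL && core < TD_MAX_CORES);
    e->core_valid_mask |= (uint64_t)1 << core;
    e->state = TD_SHARED;
}

/*
 * Simplified write: returns the mask of other cores whose copies would
 * need invalidating, and leaves `core` as the only holder, Modified.
 */
uint64_t td_record_write(td_entry *e, unsigned core)
{
    assert(e != NULL && core < TD_MAX_CORES);
    uint64_t self = (uint64_t)1 << core;
    uint64_t to_invalidate = e->core_valid_mask & ~self;
    e->core_valid_mask = self;
    e->state = TD_MODIFIED;
    return to_invalidate;
}

/* Does the directory believe `core`'s L2 holds this line? */
bool td_core_has_line(const td_entry *e, unsigned core)
{
    assert(e != NULL && core < TD_MAX_CORES);
    return (e->core_valid_mask >> core) & 1u;
}
```

The point of the sketch is the bitmask: one directory lookup tells the requester exactly which L2s hold a line, so a write only has to contact those cores rather than broadcast to all of them.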
Interleaved Memory Access
[Figure: ring layout in which the GDDR memory controllers (GDDR MC) are interleaved among the cores, each with its L2 and tag directory (TD).]




Interconnect: 2X AD/AK

[Figure: the same interconnect diagram with the AD and AK rings marked 2x; the BL (64 Bytes) data rings are unchanged.]




Multi-threaded Triad – Saturation for 1 AD/AK Ring



[Figure: Performance (y-axis) vs. Cores Running (x-axis, 0 to 50). Annotation: Simulation Data indicates saturation for a single AD/AK ring.]

Results measured in development labs at Intel on Knights Corner prototype hardware and systems. For more information go to http://www.intel.com/performance




Multi-threaded Triad – Benefit of Doubling AD/AK



[Figure: Performance (y-axis) vs. Cores Running (x-axis, 0 to 50). Two curves: Silicon Data for 2 AD + AK rings, and Simulation Data that indicates saturation for a single AD/AK ring. The gap between them is marked > 40%.]

Results measured in development labs at Intel on Knights Corner prototype hardware and systems. For more information go to http://www.intel.com/performance




Streaming Stores

                                           Streams Triad
                                            for (i=0; i<HUGE; i++)
                                                       A[i] = k*B[i] + C[i];

                                           Without Streaming Stores
                                            Read A, B, C, Write A
                                            256 Bytes transferred to/from memory per 64-byte line of A (4 lines x 64 Bytes)

                                           With Streaming Stores
                                            Read B, C, Write A
                                            192 Bytes transferred to/from memory per 64-byte line of A (3 lines x 64 Bytes)
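The byte counts above follow from counting 64-byte cache lines. The sketch below restates the triad loop as a C function and does that arithmetic. The 64-byte line size is taken to match the deck's 64-byte BL data ring; the function names and the `double` element type are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_BYTES 64  /* assumed line size, matching the 64-byte BL ring */

/* Triad from the slide: A[i] = k*B[i] + C[i]. */
void triad(double *a, const double *b, const double *c, double k, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++) {
        a[i] = k * b[i] + c[i];
    }
}

/*
 * Memory traffic per 64-byte line of A produced.
 * Without streaming stores, A's line is read into the cache before it is
 * overwritten, so four lines move: read A, B, C and write A.
 * With streaming stores the read of A is skipped, so three lines move.
 */
unsigned triad_bytes_per_line(int streaming_stores)
{
    unsigned lines_moved = streaming_stores ? 3u : 4u;
    return lines_moved * CACHE_LINE_BYTES;
}
```

`triad_bytes_per_line(0)` gives 256 and `triad_bytes_per_line(1)` gives 192. If the loop is purely bandwidth-bound, the best-case speedup from streaming stores is therefore 256/192, or about 1.33x.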




Multi-threaded Triad – with Streaming Stores

[Figure: Performance (y-axis) vs. Cores Running (x-axis, 0 to 50). Silicon Data with Streaming Stores shows a gain marked > 30%.]

Results measured in development labs at Intel on Knights Corner prototype hardware and systems. For more information go to http://www.intel.com/performance




Cache Hierarchy Micro-architecture Choices

        L2 TLB
            64 entry, holds PTEs and PDEs vs. no L2 TLB

        Dcache Capability
            Simultaneous 512b load and 512b store vs. 1 load or store per cycle

        L2 Cache
            512 KB vs. 256 KB

        Hardware Prefetcher
            16 stream detectors, prefetch into the L2 vs. no HWP (rely only on software prefetching)
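The slide does not describe how the 16 stream detectors work. As a rough illustration only, the C sketch below shows one common way such a detector table can behave: each entry remembers the last cache line seen on an ascending stream, and an access to the next line triggers a prefetch a few lines ahead. The 64-byte line size, the prefetch distance, the round-robin replacement and all names are assumptions, not Knights Corner behaviour.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_DETECTORS 16   /* from the slide */
#define LINE_BYTES 64      /* assumed cache-line size */
#define PREFETCH_AHEAD 4   /* hypothetical prefetch distance, in lines */

typedef struct {
    bool valid;
    uint64_t last_line;    /* most recent line number seen on this stream */
} stream_detector;

typedef struct {
    stream_detector det[NUM_DETECTORS];
    unsigned victim;       /* round-robin replacement pointer (assumed policy) */
} hw_prefetcher;

void hwp_init(hw_prefetcher *p)
{
    assert(p != NULL);
    for (int i = 0; i < NUM_DETECTORS; i++) {
        p->det[i].valid = false;
        p->det[i].last_line = 0;
    }
    p->victim = 0;
}

/*
 * Observe one demand access. If it touches the next sequential line of a
 * tracked stream, write the address to prefetch into *prefetch_addr and
 * return true. A repeat access to a tracked line returns false. Anything
 * else starts tracking a new stream and returns false.
 */
bool hwp_observe(hw_prefetcher *p, uint64_t addr, uint64_t *prefetch_addr)
{
    assert(p != NULL && prefetch_addr != NULL);
    uint64_t line = addr / LINE_BYTES;

    for (int i = 0; i < NUM_DETECTORS; i++) {
        stream_detector *d = &p->det[i];
        if (!d->valid) continue;
        if (line == d->last_line) return false;   /* same line, nothing new */
        if (line == d->last_line + 1) {           /* stream continues */
            d->last_line = line;
            *prefetch_addr = (line + PREFETCH_AHEAD) * LINE_BYTES;
            return true;
        }
    }

    /* No match: reuse the next detector in round-robin order. */
    stream_detector *v = &p->det[p->victim];
    v->valid = true;
    v->last_line = line;
    p->victim = (p->victim + 1) % NUM_DETECTORS;
    return false;
}
```

With 16 entries, up to 16 independent sequential streams can be followed at once before new streams start evicting old ones. A loop such as the triad needs only three.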




Per-Core ST Performance Improvement (per cycle)
[Figure: bar chart, Spec FP 2006. Performance impact of KNC core uArch improvements, y-axis 0.0 to 3.0.]

>1.8x Average Performance/Cycle Improvement: 1 Core, 1 Thread

Results measured in development labs at Intel on Knights Corner and Knights Ferry prototype hardware and systems. For more information go to http://www.intel.com/performance



Caches – For or Against?
[Figure: bar chart of Relative BW and Relative BW/Watt (y-axis 0 to 50) for Memory BW, L2 Cache BW and L1 Cache BW.]

Caches:
• high data BW
• low energy per byte of data supplied
• programmer friendly (coherence just works)

Coherent Caches are a key MIC Architecture Advantage

Results have been simulated and are provided for informational purposes only. Results were derived using simulations run on an architecture simulator or model. Any difference in system hardware or software design or configuration may affect actual performance.

Example: Stencils
Spatial time-step simulation of a physical system

[Figure: stencil grid with an L2$-sized block highlighted.]

Cache blocking promotes much higher performance and performance/watt vs. memory streaming
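As a minimal sketch of what cache blocking looks like in code, the C function below performs one time step of a 5-point stencil, visiting the grid tile by tile. The 2D grid, the averaging stencil, the copied boundary and the tile parameters are all assumptions for illustration; the slide does not specify the stencil or the block shape. The intent is that `tile_y` and `tile_x` are chosen so that a tile's working set fits in the 512 KB L2.

```c
#include <assert.h>
#include <stddef.h>

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/*
 * One Jacobi-style time step of a 5-point stencil on an ny-by-nx row-major
 * grid. Interior points are visited in tiles of at most tile_y x tile_x
 * points; boundary points are copied unchanged. `in` and `out` must not alias.
 */
void stencil_step_blocked(const double *in, double *out,
                          size_t ny, size_t nx,
                          size_t tile_y, size_t tile_x)
{
    assert(in != NULL && out != NULL && in != out);
    assert(ny >= 2 && nx >= 2 && tile_y > 0 && tile_x > 0);

    /* Copy the boundary so every element of `out` is defined. */
    for (size_t x = 0; x < nx; x++) {
        out[x] = in[x];
        out[(ny - 1) * nx + x] = in[(ny - 1) * nx + x];
    }
    for (size_t y = 0; y < ny; y++) {
        out[y * nx] = in[y * nx];
        out[y * nx + nx - 1] = in[y * nx + nx - 1];
    }

    /* Walk the interior one tile at a time. */
    for (size_t y0 = 1; y0 < ny - 1; y0 += tile_y) {
        size_t y1 = min_sz(y0 + tile_y, ny - 1);
        for (size_t x0 = 1; x0 < nx - 1; x0 += tile_x) {
            size_t x1 = min_sz(x0 + tile_x, nx - 1);
            for (size_t y = y0; y < y1; y++) {
                for (size_t x = x0; x < x1; x++) {
                    size_t i = y * nx + x;
                    out[i] = 0.25 * (in[i - 1] + in[i + 1] +
                                     in[i - nx] + in[i + nx]);
                }
            }
        }
    }
}
```

Because every point reads only from `in`, the result is identical for any tile size; blocking changes the order of memory accesses, not the answer. That makes it easy to check a blocked version against an unblocked one (tile sizes larger than the grid) before tuning the tile shape.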

Power Management: All On and Running
[Figure: die block diagram with cores, L2 slices, and tag directories (TD), plus PCIe IO, PCIe client logic, GDDR memory controllers (MC), GDDR IO, and GDDR5 devices]
Core C1: Clock Gate Core
[Figure: die block diagram with cores, L2 slices, and tag directories (TD), plus PCIe IO, PCIe client logic, GDDR memory controllers (MC), GDDR IO, and GDDR5 devices]

When all four threads (4T) on a core have halted, the core clock gates itself


Core C6: Power Gate Core
[Figure: die block diagram with cores, L2 slices, and tag directories (TD), plus PCIe IO, PCIe client logic, GDDR memory controllers (MC), GDDR IO, and GDDR5 devices]

After a C1 time-out, power gate the core: saves leakage, requires core re-init


Package Auto C3
[Figure: die block diagram with cores, L2 slices, and tag directories (TD), plus PCIe IO, PCIe client logic, GDDR memory controllers (MC), GDDR IO, and GDDR5 devices]

After a time-out with all cores in C6, clock gate the L2 and interconnect

Package C6
[Figure: die block diagram with cores, L2 slices, and tag directories (TD), plus PCIe IO, PCIe client logic, GDDR memory controllers (MC), GDDR IO, and GDDR5 devices]

                                                                    Host Driver can initiate Package C6 –
                                                                  Uncore Voltage Off, requires partial restart
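
The idle-state progression on the preceding slides (core C1, core C6, Package Auto C3, Package C6) can be pictured as a small state machine. The sketch below is a toy model only: the tick-based timing, the timeout values, the class and method names, and the rule that the host request is honoured only when every core is already in C6 are all assumptions for illustration, not details from the presentation.

```python
from dataclasses import dataclass, field

C1_TIMEOUT = 3   # hypothetical: idle ticks in C1 before the core is power gated
PKG_TIMEOUT = 5  # hypothetical: ticks with all cores in C6 before Package Auto C3


@dataclass
class Core:
    halted: list = field(default_factory=lambda: [False] * 4)  # 4 threads per core
    state: str = "C0"
    idle_ticks: int = 0

    def tick(self):
        if not all(self.halted):          # any running thread keeps the core in C0
            self.state, self.idle_ticks = "C0", 0
        elif self.state == "C0":          # all 4 threads halted: clock gate (C1)
            self.state, self.idle_ticks = "C1", 0
        elif self.state == "C1":          # C1 time-out: power gate (C6)
            self.idle_ticks += 1
            if self.idle_ticks >= C1_TIMEOUT:
                self.state = "C6"


@dataclass
class Package:
    cores: list
    state: str = "PC0"
    all_c6_ticks: int = 0

    def tick(self):
        for core in self.cores:
            core.tick()
        if not all(core.state == "C6" for core in self.cores):
            self.state, self.all_c6_ticks = "PC0", 0   # simplistic wake-up
        elif self.state == "PC0":
            self.all_c6_ticks += 1
            if self.all_c6_ticks >= PKG_TIMEOUT:       # clock gate L2 and interconnect
                self.state = "Auto-C3"

    def host_request_pc6(self):
        """Host-driver request for Package C6 (assumed to need all cores in C6)."""
        if all(core.state == "C6" for core in self.cores):
            self.state = "PC6"
            return True
        return False
```

With these made-up timeouts, a package whose threads all halt reaches `Auto-C3` after eight ticks, and a single thread resuming returns its core to `C0` and the package to `PC0` on the next tick. The model ignores the re-initialization and partial-restart costs the slides mention for leaving C6 and Package C6.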

Summary

Intel® Xeon Phi™ coprocessor provides:

  • Performance and performance/watt for highly parallel HPC, with cores, threads, wide SIMD, caches, and memory BW
  • Intel Architecture
      – general purpose programming environment
      – advanced power management technology

KNC delivers programmability and performance/watt for highly parallel HPC


Thank You

                                           Knights Corner brought to you by:
                                            IAG (Intel Architecture Group)
                                               • DCSG (Data Center and Systems Group)
                                               • VPG (Visual and Parallel Group) MIC
                                                    – HW Architecture
                                                    – HW Design
                                                    – SW
                                            SSG (Software and Services Group) MIC
                                            IL PCL (Intel Labs – Parallel Computing Lab)


Vector Processor: 512b SIMD Width

[Figure: vector unit datapath with 16 SP lanes (SP0 to SP15) and 8 DP lanes (DP0 to DP7), arranged over four register-file blocks RF0 to RF3, with a shared multiplier circuit for SP/DP]

                                                                  16 wide SP SIMD, 8 wide DP SIMD
                                                                  2:1 Ratio good for circuit optimization
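
The lane counts follow directly from the 512-bit width and the IEEE 754 element sizes (32-bit single precision, 64-bit double precision). The short sketch below works through that arithmetic. The mapping of each DP lane onto a pair of adjacent SP lanes is an assumption read off the figure layout (DP7 drawn next to SP15 and SP14, DP0 next to SP1 and SP0); the function name is hypothetical.

```python
VECTOR_BITS = 512
SP_BITS = 32   # IEEE 754 single precision
DP_BITS = 64   # IEEE 754 double precision

sp_lanes = VECTOR_BITS // SP_BITS   # 16-wide SP SIMD
dp_lanes = VECTOR_BITS // DP_BITS   # 8-wide DP SIMD


def sp_lanes_sharing_dp(dp_lane):
    """SP lanes assumed to share circuitry with the given DP lane (2:1 ratio)."""
    ratio = sp_lanes // dp_lanes
    return [dp_lane * ratio + k for k in range(ratio)]
```

For example, `sp_lanes_sharing_dp(7)` gives `[14, 15]` under this assumed pairing.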

Gather/Scatter Address Machinery
Gather Instruction Loop
      gather-prime
loop: gather-step; jump-mask-not-zero loop

[Figure: gather/scatter address generation. The Base Address from a scalar register is added to Index0 through Index7 from a vector register to form Addr0 through Addr7. A Find First stage on the mask register (shown as 1 1 1 1 1 1 1 1) selects the Access Address that goes to the TLB/DCACHE. "=" comparators and Clear signals feed back to the mask register.]

                                                                           Gather/Scatter machine takes advantage
                                                                                      of cache-line locality
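
One way to read the gather loop and the figure is sketched below: each step finds the first lane still pending in the mask, accesses the cache line holding that lane's address, services every other pending lane whose address falls in the same line, and clears those mask bits; the loop repeats until the mask is zero. This is an interpretation, not the documented hardware behaviour. The 64-byte line size, the `elem_bytes` scaling of indices, the dictionary standing in for memory, and all names are assumptions for illustration.

```python
CACHE_LINE_BYTES = 64  # assumed cache-line size


def gather(memory, base, indices, mask, elem_bytes=4):
    """Simulate a masked gather; returns (values, number_of_gather_steps).

    memory  : dict mapping byte address -> value (stand-in for the data cache)
    indices : per-lane element indices (the vector register in the figure)
    mask    : per-lane booleans; lanes that are False are left as None
    """
    addrs = [base + idx * elem_bytes for idx in indices]
    pending = list(mask)               # treated here as the result of "gather-prime"
    result = [None] * len(indices)
    steps = 0
    while any(pending):                # jump-mask-not-zero
        first = pending.index(True)    # Find First
        line = addrs[first] // CACHE_LINE_BYTES
        for lane, addr in enumerate(addrs):
            # "=" compare: lanes in the same cache line are served together
            if pending[lane] and addr // CACHE_LINE_BYTES == line:
                result[lane] = memory[addr]
                pending[lane] = False  # Clear the mask bit
        steps += 1                     # one gather-step per distinct line
    return result, steps
```

Under these assumptions, eight 4-byte elements spread over three cache lines complete in three steps, while eight elements in eight different lines take eight steps, which is the locality benefit the slide points to.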

Package Deep C3
[Figure: die block diagram with cores, L2 slices, and tag directories (TD), plus PCIe IO, PCIe client logic, GDDR memory controllers (MC), GDDR IO, and GDDR5 devices]

Host driver initiated: L2/ring/TDs dropped to retention voltage, memory in self-refresh


Intel Xeon Phi Hotchips Architecture Presentation

  • 1. Intel® Xeon Phi™ coprocessor (codename Knights Corner) George Chrysos Senior Principal Engineer Hot Chips, August 28, 2012
  • 2. Legal Disclaimers Copyright © 2012 Intel Corporation. All rights reserved. INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS. Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. 
The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm%20 Intel, the Intel logo, Xeon, Intel Core and Intel Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries Other names and brands may be claimed as the property of others. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information about performance and benchmark results, visit Performance Test Disclosure This document contains information on products in the design phase of development. All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice. Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSE3 instruction sets and other optimizations. 
Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804 WARNING: Altering clock frequency and/or voltage may: (i) reduce system stability and useful life of the system and processor; (ii) cause the processor and other system components to fail; (iii) cause reductions in system performance; (iv) cause additional heat or other damage; and (v) affect system data integrity. Intel has not tested, and does not warranty, the operation of the processor beyond its specif ications. Intel assumes no responsibility that the processor, including if used with altered clock frequencies and/or voltages, will be fit for any particular purpose. For more information, visit Overclocking Intel Processors Warning: Altering PC memory frequency and/or voltage may (i) reduce system stability and use life of the system, memory and processor; (ii) cause the processor and other system components to fail; (iii) cause reductions in system performance; (iv) cause additional heat or other damage; and (v) affect system data integrity. Intel assumes no responsibility that the memory, included if used with altered clock frequencies and/or voltages, will be fit for any particular purpose. Check with memory manufacturer for warranty and additional details Available on select Intel® Core™ Intel® Xeon® and Intel® Xeon Phi™ processors. Requires an Intel® HT Technology-enabled system. Consult your PC manufacturer. Performance will vary depending on the specific hardware and software used. 
For more information including details on which processors support HT Technology, visit http://www.intel.com/info/hyperthreading. Requires a system with a 64-bit enabled processor, chipset, BIOS and software. Performance will vary depending on the specific hardware and software you use. Consult your PC manufacturer for more information. For more information, visit http://www.intel.com/info/em64t Requires a system with Intel® Turbo Boost Technology. Intel Turbo Boost Technology and Intel Turbo Boost Technology 2.0 are only available on select Intel® processors. Consult your PC manufacturer. Performance varies depending on hardware, software, and system configuration. For more information, visit http://www.intel.com/go/turbo ENERGY STAR is a system-level energy specification, defined by the Environmental Protection Agency, that relies on all system components (such as processor, chipset, power supply, etc.). For more information, visit http://www.intel.com/technology/epa/index.html
  • 3. Intel® Many Integrated Core (Intel MIC) Architecture Targeted at highly parallel HPC workloads • Physics, Chemistry, Biology, Financial Services Power efficient cores, support for parallelism • Cores: less speculation, threads, wider SIMD • Scalability: high BW on die interconnect and memory General Purpose Programming Environment • Runs Linux (full service, open source OS) • Runs applications written in Fortran, C, C++, … • Supports X86 memory model, IEEE 754 • x86 collateral (libraries, compilers, Intel® VTune™ debuggers, etc) 3 Visual and Parallel Computing Group Copyright © 2012 Intel Corporation. All rights reserved.
  • 4. Knights Corner Coprocessor [Diagram: an Intel® Xeon® processor with system memory, linked by PCIe x16 and TCP/IP to KNC cards. Each KNC card has > 50 cores, runs Linux, and has multiple GDDR5 channels with >= 8GB GDDR5 memory.]
  • 5. Knights Corner – Power Efficient. Performance per Watt of a prototype Knights Corner cluster compared to the 2 top graphics accelerated clusters (MFLOPS/Watt, higher is better; source: www.green500.org). Intel Corp (Knights Corner): 1381 MFLOPS/Watt, Top500 #150, 72.9 kW. Nagasaki Univ. (ATI Radeon): 1380 MFLOPS/Watt, Top500 #456, 47 kW. Barcelona Supercomputing Center (Nvidia Tesla 2090): 1266 MFLOPS/Watt, Top500 #177, 81.5 kW.
  • 6. Knights Corner Micro-architecture [Diagram: cores, each with its own L2, placed alongside tag directories (TD), GDDR memory controllers (GDDR MC), and the PCIe client logic.]
  • 7. Knights Corner Core [Diagram of the core pipeline.] Pipeline stages: PPF, PF, D0, D1, D2, E, WB. Front end: 4 threads (T0 to T3 instruction pointers), in-order; L1 TLB and 32KB code cache; decode at 16B/cycle (2 IPC); uCode. Execution: Pipe 0 and Pipe 1; VPU RF, X87 RF, and scalar RF; VPU (512b SIMD), X87, ALU 0, ALU 1. Memory: L1 TLB and 32KB data cache; L2 TLB; TLB miss handler; 512KB L2 cache with L2 Ctl and HWP, connected to the on-die interconnect. X86 specific logic < 2% of core + L2 area.
  • 8. Vector Processing Unit [Diagram of the VPU pipeline.] Core pipeline: PPF, PF, D0, D1, D2, E, WB. Vector pipeline: D2, E, VC1, VC2, V1 to V4, WB. Blocks: DEC; VPU RF (3R, 1W); Mask RF; LD and ST; Scatter/Gather; EMU; Vector ALUs (16 wide x 32 bit, 8 wide x 64 bit, fused multiply add).
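The vector ALU description (16 lanes of 32 bits, fused multiply add, a mask register file, three register reads and one write) can be illustrated with a scalar model. This is only a sketch: the function name `masked_fma_sp` is hypothetical, the assumption that masked-off lanes keep their old destination value is ours rather than the slide's, and the plain `a * b + c` expression does not model the single rounding of a true fused operation.

```c
#include <assert.h>
#include <stdint.h>

#define VLEN_SP 16 /* 512 bits / 32-bit single precision = 16 lanes */

/* Hypothetical scalar model of one masked vector multiply-add.
 * Three source operands (a, b, c) and one destination mirror the
 * "3R, 1W" register file ports named on the slide.
 * Assumption: lanes whose mask bit is clear keep their old dst value. */
static void masked_fma_sp(float dst[VLEN_SP], const float a[VLEN_SP],
                          const float b[VLEN_SP], const float c[VLEN_SP],
                          uint16_t mask)
{
    assert(dst != 0 && a != 0 && b != 0 && c != 0);
    for (int lane = 0; lane < VLEN_SP; lane++) {
        if (mask & (1u << lane)) {
            /* Rounding of a real fused operation is not modeled here. */
            dst[lane] = a[lane] * b[lane] + c[lane];
        }
    }
}
```

With a mask of 0x00FF only lanes 0 to 7 are updated; the double precision case would use 8 lanes of 64 bits in the same way.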
  • 9. Interconnect [Diagram: cores with L2s and TDs attached to the interconnect.] BL: data, 64 bytes. AD: command and address. AK: coherence and credits.
  • 10. Distributed Tag Directories [Diagram: TDs distributed along the interconnect; each TD entry holds TAG, Core Valid Mask, and State.] Tag Directories track cache-lines in all L2s.
  • 11. Interleaved Memory Access [Diagram: GDDR MCs interleaved among the cores, L2s, and TDs on the interconnect.]
  • 12. Interconnect: 2X AD/AK [Diagram: cores with L2s and TDs on the interconnect, with AD and AK doubled (2x) alongside the 64-byte BL.]
  • 13. Multi-threaded Triad – Saturation for 1 AD/AK Ring [Chart: performance vs. cores running, 0 to 50.] Simulation data indicates saturation for a single AD/AK ring. Results measured in development labs at Intel on Knights Corner prototype hardware and systems. For more information go to http://www.intel.com/performance
  • 14. Multi-threaded Triad – Benefit of Doubling AD/AK [Chart: performance vs. cores running, 0 to 50.] Silicon data for 2 AD + AK rings: > 40% performance over the simulation data, which indicates saturation for a single AD/AK ring.
  • 15. Streaming Stores. Streams Triad: for (i=0; i<HUGE; i++) A[i] = k*B[i] + C[i]; Without Streaming Stores: Read A, B, C, Write A; 256 Bytes transferred to/from memory per iteration. With Streaming Stores: Read B, C, Write A; 192 Bytes transferred to/from memory per iteration.
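The 256 and 192 byte totals on the streaming stores slide work out if every array access moves one 64-byte cache line. The sketch below reproduces that accounting; the helper name `triad_bytes_per_line` and the `CACHE_LINE_BYTES` constant are illustrative assumptions, not part of the presentation.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_BYTES 64 /* assumption: one 64-byte line per array access */

/* Bytes moved to/from memory for each cache line of A produced by
 * the triad A[i] = k*B[i] + C[i]. */
static size_t triad_bytes_per_line(int streaming_stores)
{
    assert(streaming_stores == 0 || streaming_stores == 1);
    /* Without streaming stores, A is read before it is overwritten. */
    size_t reads = streaming_stores ? 2 : 3; /* B and C, plus A if not streaming */
    size_t writes = 1;                       /* A */
    return (reads + writes) * CACHE_LINE_BYTES;
}
```

Under this accounting the streaming-store version moves 192/256 = 75% of the bytes, i.e. 25% less memory traffic for the same result.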
  • 16. Multi-threaded Triad – with Streaming Stores [Chart: performance vs. cores running, 0 to 50.] Silicon data: streaming stores give > 30% performance.
  • 17. Cache Hierarchy Micro-architecture Choices. L2 TLB: 64 entry, holds PTEs and PDEs vs. no L2 TLB. Dcache Capability: simultaneous 512b load and 512b store vs. 1 load or store per cycle. L2 Cache: 512 KB vs. 256 KB. Hardware Prefetcher: 16 stream detectors, prefetch into the L2 vs. no HWP (rely only on software prefetching).
  • 18. Per-Core ST Performance Improvement (per cycle) [Chart: Spec FP 2006, performance impact of KNC core uArch improvements, y-axis 0.0 to 3.0.] >1.8x Average Performance/Cycle Improvement – 1 Core, 1 Thread. Results measured in development labs at Intel on Knights Corner and Knights Ferry prototype hardware and systems. For more information go to http://www.intel.com/performance
  • 19. Caches – For or Against? [Chart: Relative BW and Relative BW/Watt for Memory BW, L2 Cache BW, and L1 Cache BW, y-axis 0 to 50.] Caches: • high data BW • low energy per byte of data supplied • programmer friendly (coherence just works). Coherent Caches are a key MIC Architecture Advantage. Results have been simulated and are provided for informational purposes only. Results were derived using simulations run on an architecture simulator or model. Any difference in system hardware or software design or configuration may affect actual performance.
  • 20. Example: Stencils. Spatial time-step simulation of a physical system. L2$-sized cache blocking promotes much higher performance and performance/watt vs. memory streaming.
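As a rough illustration of the cache blocking the stencil slide refers to, the sketch below computes one time step of a 5-point stencil twice: once in plain row order and once tile by tile. The stencil shape, the 0.2 weight, the function names, and the `tile` parameter are all assumptions made for the example; the slide does not specify a kernel, and a real code would pick the tile so that its working set fits in the 512 KB L2.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 5-point update for interior point (y, x) of an n x n grid. */
static float point(const float *in, size_t n, size_t y, size_t x)
{
    return 0.2f * (in[y * n + x] + in[y * n + x - 1] + in[y * n + x + 1] +
                   in[(y - 1) * n + x] + in[(y + 1) * n + x]);
}

/* One time step, streaming through the grid row by row. */
static void stencil_naive(const float *in, float *out, size_t n)
{
    for (size_t y = 1; y + 1 < n; y++)
        for (size_t x = 1; x + 1 < n; x++)
            out[y * n + x] = point(in, n, y, x);
}

/* The same time step, visited in tile x tile blocks so that each
 * block's inputs can stay cache resident while it is processed. */
static void stencil_blocked(const float *in, float *out, size_t n, size_t tile)
{
    assert(tile > 0);
    for (size_t by = 1; by + 1 < n; by += tile) {
        for (size_t bx = 1; bx + 1 < n; bx += tile) {
            size_t y_end = (by + tile < n - 1) ? by + tile : n - 1;
            size_t x_end = (bx + tile < n - 1) ? bx + tile : n - 1;
            for (size_t y = by; y < y_end; y++)
                for (size_t x = bx; x < x_end; x++)
                    out[y * n + x] = point(in, n, y, x);
        }
    }
}
```

Both functions write identical values; only the traversal order changes, which is what lets a blocked version trade memory streaming for cache reuse.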
  • 21. Power Management: All On and Running [Chip diagram: cores, L2s, TDs, GDDR MCs, GDDR IO, GDDR5 devices, PCIe client logic, and PCIe IO.]
  • 22. Core C1: Clock Gate Core [Chip diagram: cores, L2s, TDs, GDDR MCs, GDDR IO, GDDR5 devices, PCIe client logic, and PCIe IO.] When all 4 threads on a core have halted, the core clock gates itself.
  • 23. Core C6: Power Gate Core [Chip diagram: cores, L2s, TDs, GDDR MCs, GDDR IO, GDDR5 devices, PCIe client logic, and PCIe IO.] On C1 time-out, the core is power gated to save leakage; this requires core re-init.
  • 24. Package Auto C3 [Chip diagram: cores, L2s, TDs, GDDR MCs, GDDR IO, GDDR5 devices, PCIe client logic, and PCIe IO.] On a timeout after all cores have been in C6, the L2 and interconnect are clock gated.
  • 25. Package C6 [Chip diagram: cores, L2s, TDs, GDDR MCs, GDDR IO, GDDR5 devices, PCIe client logic, and PCIe IO.] Host driver can initiate Package C6: uncore voltage off, requires partial restart.
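The core and package states on the power management slides can be summarized as a small decision model. This is a sketch under stated assumptions: the type and function names, the tick-based timeouts, and the rule that Package C6 is only entered once every core is already in C6 are hypothetical; the slides give the triggers but not the timeout values or the exact preconditions.

```c
#include <assert.h>

#define THREADS_PER_CORE 4

typedef enum { CORE_C0, CORE_C1, CORE_C6 } core_state_t;
typedef enum { PKG_C0, PKG_AUTO_C3, PKG_C6 } pkg_state_t;

/* Core state: C1 (clock gated) once all 4 threads have halted,
 * C6 (power gated) after a hypothetical C1 time-out in ticks. */
static core_state_t core_state(int halted_threads, unsigned idle_ticks,
                               unsigned c1_timeout)
{
    assert(halted_threads >= 0 && halted_threads <= THREADS_PER_CORE);
    if (halted_threads < THREADS_PER_CORE)
        return CORE_C0;
    return (idle_ticks >= c1_timeout) ? CORE_C6 : CORE_C1;
}

/* Package state: Auto C3 after a timeout with every core in C6;
 * Package C6 when the host driver requests it.
 * Assumption: a host request is honored only when all cores are in C6. */
static pkg_state_t package_state(const core_state_t *cores, int n_cores,
                                 unsigned all_c6_ticks, unsigned c3_timeout,
                                 int host_requests_c6)
{
    assert(cores != 0 && n_cores > 0);
    for (int i = 0; i < n_cores; i++) {
        if (cores[i] != CORE_C6)
            return PKG_C0;
    }
    if (host_requests_c6)
        return PKG_C6;
    return (all_c6_ticks >= c3_timeout) ? PKG_AUTO_C3 : PKG_C0;
}
```

The deeper states save more power but cost more to leave: C6 needs core re-init and Package C6 needs a partial restart, which is why each step is gated on a timeout or an explicit host request.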
  • 26. Summary. Intel® Xeon Phi™ coprocessor provides: • Performance and Performance/Watt for highly parallel HPC with cores, threads, wide-SIMD, caches, memory BW • Intel Architecture general purpose programming environment • advanced power management technology. KNC delivers programmability and performance/watt for highly parallel HPC.
  • 27. Thank You Knights Corner brought to you by: IAG (Intel Architecture Group) • DCSG (Data Center and Systems Group) • VPG (Visual and Parallel Group) MIC – HW Architecture – HW Design – SW SSG (Software and Services Group) MIC IL PCL (Intel Labs – Parallel Computing Lab)
  • 29. Vector Processor: 512b SIMD Width [Diagram: register file slices RF0 to RF3 feeding SP lanes 0 to 15 and DP lanes 0 to 7, with a shared multiplier circuit for SP/DP.] 16 wide SP SIMD, 8 wide DP SIMD. 2:1 ratio good for circuit optimization.
  • 30. Gather/Scatter Address Machinery. Gather instruction loop: gather-prime loop: gather-step; jump-mask-not-zero loop. [Diagram: a scalar register holds the base address, which is added to Index0 to Index7 from a vector register to form Addr0 to Addr7. A mask register holds one bit per element (all 1s in the example). Find First selects the access address sent to the TLB/DCACHE, and comparators (=) drive Clear signals back to the mask.] Gather/Scatter machine takes advantage of cache-line locality.
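The gather loop on this slide (gather-step repeated until the mask is zero) can be modeled in software roughly as follows. This is a behavioral sketch, not the hardware algorithm: the function names are hypothetical, the table is assumed to start on a 64-byte boundary, and the rule that one step services every pending element in the same cache line as the first pending element is our reading of the "takes advantage of cache-line locality" note.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GATHER_LANES 8 /* Index0..Index7 on the slide */
#define LINE_BYTES 64  /* assumption: 64-byte cache lines, line-aligned base */

/* One gather-step: find the first lane whose mask bit is set, then load
 * every pending lane that falls in the same cache line and clear its
 * mask bit. Returns the updated mask. */
static uint8_t gather_step(int32_t dst[GATHER_LANES], const int32_t *base,
                           const uint32_t index[GATHER_LANES], uint8_t mask)
{
    if (mask == 0)
        return 0;
    int first = 0;
    while (!(mask & (1u << first)))
        first++; /* "find first" */
    size_t line = ((size_t)index[first] * sizeof(int32_t)) / LINE_BYTES;
    for (int lane = first; lane < GATHER_LANES; lane++) {
        size_t lane_line = ((size_t)index[lane] * sizeof(int32_t)) / LINE_BYTES;
        if ((mask & (1u << lane)) && lane_line == line) {
            dst[lane] = base[index[lane]];
            mask &= (uint8_t)~(1u << lane); /* "clear" */
        }
    }
    return mask;
}

/* Gather loop: repeat gather-step until no mask bits remain
 * (the jump-mask-not-zero branch). Returns the number of steps taken. */
static int gather_loop(int32_t dst[GATHER_LANES], const int32_t *base,
                       const uint32_t index[GATHER_LANES], uint8_t mask)
{
    assert(dst != 0 && base != 0 && index != 0);
    int steps = 0;
    while (mask != 0) {
        mask = gather_step(dst, base, index, mask);
        steps++;
    }
    return steps;
}
```

In this model, eight indices that touch three distinct cache lines finish in three steps rather than eight, which is the locality benefit the slide points at.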
  • 31. Package Deep C3 [Chip diagram: cores, L2s, TDs, GDDR MCs, GDDR IO, GDDR5 devices, PCIe client logic, and PCIe IO.] Host driver initiated: L2/Ring/TDs dropped to retention V, memory in self refresh.