Intel’s Larrabee
Vipin P. Nair
S7-EC, Roll no: 24CEK

Introduction
• Larrabee is a multicore general-purpose graphics processing unit (GPGPU) that combines the functions of a multicore CPU and a GPU.
• Larrabee is based on Intel’s x86 architecture.

Architectural convergence

Features
• Rasterization, depth testing and alpha blending run entirely in software (texture filtering keeps fixed-function logic)
• Implements a binned renderer to increase parallelism
• Reduced memory bandwidth
• Parallel processing for image processing, physical simulation, and medical and financial analysis
• GDDR5 memory support
• Each core can execute 32 GFLOPS at a 1 GHz clock (16 lanes, each doing a fused multiply-add), giving teraflops-scale throughput across many cores

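The per-core figure above follows from simple arithmetic, assuming the 16-lane vector unit completes one fused multiply-add per lane per clock and each fused multiply-add counts as two floating-point operations. The sketch below works it through. The function name and the 32-core count are assumptions chosen for illustration, not figures from these slides.

```python
def peak_gflops(lanes: int, flops_per_lane: int, clock_ghz: float, cores: int = 1) -> float:
    """Theoretical peak in GFLOPS: lanes x flops per lane per clock x clock (GHz) x cores."""
    return lanes * flops_per_lane * clock_ghz * cores

# One core: 16 single-precision lanes, fused multiply-add = 2 flops, 1 GHz clock.
per_core = peak_gflops(lanes=16, flops_per_lane=2, clock_ghz=1.0)

# Hypothetical 32-core part (the core count is an assumption for illustration).
chip = peak_gflops(lanes=16, flops_per_lane=2, clock_ghz=1.0, cores=32)

print(per_core)  # 32.0
print(chip)      # 1024.0
```

These are theoretical peaks; sustained throughput depends on how well a workload keeps all lanes and cores busy.
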
Differences with CPU
• In-order execution (no out-of-order execution)
• Vector processing unit operates on 16 single-precision floating-point numbers at a time
• Texture sampling units: trilinear and anisotropic filtering, texture decompression
• 1024-bit ring bus between cores
• Cache control instructions
• 4-way multithreading

Differences with GPU
• x86 instruction set with Larrabee-specific extensions
• Cache coherency across all its cores
• Z-buffering, clipping, and blending without using graphics hardware

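To make "z-buffering and blending without graphics hardware" concrete, here is a minimal sketch of a per-pixel depth test followed by an alpha blend, written as ordinary software. It is not Larrabee code: the function name, the buffer layout, the convention that a smaller z is closer, and the choice to write depth on every passing fragment are all assumptions made for illustration.

```python
def shade_pixel(color_buf, depth_buf, x, y, src_rgb, src_alpha, src_z):
    """Depth-test one fragment, then alpha-blend it over the stored color.

    Returns True if the fragment passed the depth test and was written.
    """
    if src_z >= depth_buf[y][x]:      # assumed convention: smaller z is closer
        return False                  # fragment is hidden, leave buffers untouched
    dst_rgb = color_buf[y][x]
    # Standard "source over" blend: src * alpha + dst * (1 - alpha)
    color_buf[y][x] = tuple(
        src_alpha * s + (1.0 - src_alpha) * d for s, d in zip(src_rgb, dst_rgb)
    )
    depth_buf[y][x] = src_z           # simplification: depth is always written
    return True

WIDTH, HEIGHT = 2, 2
color = [[(0.0, 0.0, 0.0)] * WIDTH for _ in range(HEIGHT)]
depth = [[1.0] * WIDTH for _ in range(HEIGHT)]

first = shade_pixel(color, depth, 0, 0, (1.0, 0.0, 0.0), 0.5, 0.4)   # passes
second = shade_pixel(color, depth, 0, 0, (0.0, 1.0, 0.0), 1.0, 0.8)  # behind, rejected

print(first, second)   # True False
print(color[0][0])     # (0.5, 0.0, 0.0)
```

Because both steps are plain code, a software renderer is free to change the depth comparison or the blend equation without waiting for new hardware.
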
Larrabee – Block Diagram
Architecture
• Cores communicate on a 1024-bit wide ring bus
    - Fast access to memory, I/O interfaces and fixed-function blocks
    - Fast access for cache coherency
• L2 cache is partitioned among the cores
    - Provides high aggregate bandwidth
    - Allows data replication and sharing
• Optimized for highly parallel workloads using the vector processor

In-order CPU Core
• Separate scalar and vector units with separate registers
• Vector unit: 16 32-bit ops/clock
• In-order instruction execution
• Fast access from 64 KB L1 cache
• Direct connection to each core’s 256 KB subset of the L2 cache
• Prefetch instructions load L1 and L2 caches

Vector Unit
• Vector-complete instruction set
    - Scatter/gather for vector load/store
    - Mask registers select lanes to write, which allows data-parallel flow control
    - Masks also support data compaction
• Vector instructions support
    - Full speed when data is in L1 cache
    - Fused multiply-add (three arguments)
    - Int32, Float32 and Float64 data
    - Can read 8-bit unorm, 8-bit uint, 16-bit sint and 16-bit float data and convert it into 32-bit floats or integers

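The vector-unit features listed above can be modelled lane by lane in plain Python. This is a rough sketch of the ideas only (masked writes, compaction, gather, and unorm conversion), not the Larrabee instruction set; every function name here is hypothetical. Python also rounds the multiply and the add separately, so `masked_fma` models the three-argument form of a fused multiply-add but not its single rounding.

```python
LANES = 16  # 16 lanes of 32-bit data, as on the slide

def masked_fma(dst, a, b, c, mask):
    """Per lane: write a*b + c where the mask bit is set, else keep dst."""
    return [x * y + z if m else d for d, x, y, z, m in zip(dst, a, b, c, mask)]

def compact(values, mask):
    """Data compaction: pack the mask-selected lanes into a contiguous list."""
    return [v for v, m in zip(values, mask) if m]

def gather(memory, indices):
    """Gather: load one element per lane from arbitrary addresses."""
    return [memory[i] for i in indices]

def unorm8_to_float(byte):
    """Convert an 8-bit unsigned normalized value (0..255) to a float in [0, 1]."""
    return byte / 255.0

# Data-parallel flow control for: if x > 7: y = 2*x + 1
x = [float(i) for i in range(LANES)]
mask = [v > 7.0 for v in x]
y = masked_fma([0.0] * LANES, x, [2.0] * LANES, [1.0] * LANES, mask)

packed = compact(y, mask)
loaded = gather(list(range(100, 200)), [0, 5, 10])

print(packed)  # [17.0, 19.0, 21.0, 23.0, 25.0, 27.0, 29.0, 31.0]
print(loaded)  # [100, 105, 110]
```

The mask lets all 16 lanes execute the same instruction while only some of them commit a result, which is how a branch inside a loop body can be handled without leaving vector code.
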
Fixed Function Logic
• Software routines take the place of fixed-function logic for post-shader alpha blending, rasterization and interpolation
• Includes fixed-function texture filter logic
• Virtual memory for textures

Larrabee’s Binning Renderer
• Binning pipeline
    - Reduces synchronization
    - Front end processes vertex and geometry shading
    - Back end processes pixel shading, stencil testing, blending
    - Bin FIFO between them
• Multi-tasking by cores
    - Each orange box in the diagram is a core
    - Cores run independently
    - Other cores can run other tasks, e.g. physics

Back-end Rendering a Tile
• Orange boxes represent work on separate threads
• Three work threads do Z, pixel shader, and blending
• Setup thread reads from bins and does pre-processing
• Combines task-parallel, data-parallel, and sequential work

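The front-end/back-end split can be sketched as follows: the front end assigns each triangle to the bins of the screen tiles it may touch, and the back end then works through one tile's bin at a time. This is a simplified illustration, not Larrabee's renderer. The 64-pixel tile size, the conservative bounding-box test, the function name and the sample triangles are all assumptions, and coordinates are assumed to be non-negative screen positions.

```python
from collections import defaultdict

TILE = 64  # tile edge in pixels; an assumed size for illustration

def bin_triangles(triangles, tile=TILE):
    """Front end: add each triangle's index to every tile its bounding box overlaps."""
    bins = defaultdict(list)
    for idx, tri in enumerate(triangles):
        xs = [x for x, _ in tri]
        ys = [y for _, y in tri]
        for ty in range(int(min(ys)) // tile, int(max(ys)) // tile + 1):
            for tx in range(int(min(xs)) // tile, int(max(xs)) // tile + 1):
                bins[(tx, ty)].append(idx)
    return dict(bins)

triangles = [
    [(10, 10), (50, 20), (30, 60)],     # fits inside tile (0, 0)
    [(60, 10), (130, 30), (100, 70)],   # bounding box spans 3 x 2 tiles
]
bins = bin_triangles(triangles)

# Back end: each tile's bin is independent work, so tiles can go to different cores.
for tile_xy in sorted(bins):
    print(tile_xy, bins[tile_xy])
```

Because a tile's bin holds everything that can affect that tile, the back end can keep the tile's color and depth data on chip while it works, which is where the bandwidth saving comes from.
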
Pipeline can be changed
• Parts can move between front end and back end
    - Vertex shading, tessellation, rasterization, etc.
    - Allows balancing computation vs. bandwidth
• New features
    - Transparency, shadowing, ray tracing, etc.
    - Each of these needs irregular data structures
    - Also helps to be able to “repack” the data

Transparency
• Transparency with and without pre-resolve effects

Examples of using Tasks
• Applications
    - Scene traversal and culling
    - Procedural geometry synthesis
    - Physics contact group solve
    - Data-parallel strand groups
    - Distribute across threads/cores using task system
    - Exploit core resources with SIMD
• Larrabee can submit work to itself!
    - Tasks can spawn other tasks
    - Exposed in the Larrabee Native programming interface (C/C++ compiler)

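The idea that tasks can spawn other tasks can be shown with a small work queue. The sketch below is single-threaded and purely illustrative: it is not the Larrabee Native interface, and `run_tasks`, `traverse` and the sample scene are hypothetical names invented here. A real task system would hand queued tasks to many threads and cores.

```python
from collections import deque

def run_tasks(root):
    """Run tasks from a FIFO queue; each task returns (result, list of new tasks)."""
    queue = deque([root])
    results = []
    while queue:
        task = queue.popleft()
        result, spawned = task()
        results.append(result)
        queue.extend(spawned)   # work submitted by the task itself
    return results

def traverse(node):
    """Hypothetical scene-traversal task: visit a node, spawn one task per child."""
    def task():
        name, children = node
        return name, [traverse(child) for child in children]
    return task

scene = ("root", [("car", [("wheel", [])]), ("tree", [])])
visited = run_tasks(traverse(scene))

print(visited)  # ['root', 'car', 'tree', 'wheel']
```

Scene traversal fits this pattern because the amount of work is not known up front: each visited node decides how many further tasks to create.
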
Application scaling studies
Scalability Studies
• Based on memory bandwidth and texture filtering speed

Performance Breakdowns

Binning & Bandwidth Studies
• Immediate mode uses more bandwidth than binning
    - 2.4 to 7 times more for F.E.A.R.
    - 1.5 to 2.6 times more for Gears of War
    - 1.6 to 1.8 times more for Half-Life 2: Episode 2

Overall performance
Conclusion
• The Larrabee architecture opens a rich set of opportunities for both graphics rendering and throughput computing, and is an appropriate platform for the convergence of GPU and CPU.
