Introduction
Sony Corporation develops technologies for products used in
graphics operations and scientific calculation using the Cell
Broadband Engine™ (Cell/B.E.) and the RSX employed in the
PLAYSTATION® 3, which are developed by Sony Computer
Entertainment Incorporated (SCEI). This document explains
the principal technologies.
Trend toward multi-core processors
Drastic improvements in performance in recent years have
enabled microprocessors to handle large amounts of data
at high speed. The target applications are being extended,
and the amount of computation required in each field is
increasing. In addition to conventional general-purpose
processors, recently developed application-specific processors
exploit the full potential of their performance for a given
application. For example, the Graphics Processing Unit (GPU)
is a special processor for processing 2D/3D graphics at high
speed. This provides very high computational performance
that cannot be achieved by conventional general-purpose
processors.
The trend in the microprocessor market is moving to
asymmetric multi-core processors, where different types of
processors optimized for a specific use coexist. Considering
this technology trend, Sony Corporation has developed the
“Cell Computing Unit”, which uses the high computational
performance provided by the Cell/B.E. and the RSX
processors.
This technology provides solutions for multimedia computing,
which requires a large number of computations. Examples
include image-processing applications, such as computer
graphics, and scientific computations for non-image-processing
applications, such as encryption technology for security
(Fig. 1).
Multi-core processor with powerful
computational performance - Cell/B.E. -
The Cell/B.E. is an asymmetric multi-core processor jointly
developed by Sony/SCEI, Toshiba Corporation, and IBM
Corporation. This processor alone delivers floating-point
performance of 230 GFLOPS. In anticipation of real-time
operation on 4K x 2K or High Definition images, the Cell/B.E.
is designed for multimedia processing and for use in
distributed computing environments.
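The 230 GFLOPS figure can be reproduced with a short calculation. The sketch below is one common way to account for it and rests on assumptions that are not stated in this paper: a 3.2 GHz clock, single-precision arithmetic, four 32-bit lanes per 128-bit vector, and one multiply-add (counted as two operations) per lane per cycle on each of the eight SPEs and on the PPE's vector unit.

```python
# Assumed values (not given in this paper): the clock rate, and one
# multiply-add per lane per cycle counted as two floating-point operations.
CLOCK_GHZ = 3.2
LANES = 128 // 32        # four single-precision lanes per 128-bit vector
FLOPS_PER_LANE = 2       # one multiply plus one add


def peak_gflops(units: int) -> float:
    """Peak single-precision GFLOPS for `units` identical vector units."""
    return units * LANES * FLOPS_PER_LANE * CLOCK_GHZ


spe_total = peak_gflops(8)   # eight SPEs
ppe_total = peak_gflops(1)   # the PPE's vector unit
total = spe_total + ppe_total
print(f"SPEs: {spe_total:.1f}  PPE: {ppe_total:.1f}  total: {total:.1f} GFLOPS")
```

Under these assumptions the SPEs contribute 204.8 GFLOPS and the PPE 25.6 GFLOPS, which sums to about 230 GFLOPS. This is a theoretical peak; sustained performance depends on the program.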
Architecture of the Cell/B.E.
Two types of processor cores
Two types of processor cores are mounted in the Cell/B.E.:
PowerPC Processor Element (PPE), a control-intensive
processor core for general-purpose processing; and Synergistic
Processor Element (SPE), a processor core for high-speed
data processing. The coordinated operation of these processor
cores gives rise to the high level of computational performance
of the Cell/B.E.
The Basics of Cell Computing
Technology
Fig. 1 Target area of Cell Computing Technology
(Axes: monolithic to parallel processing; non-graphic to graphic
processing. Regions: GPU; Cell/B.E. target area. Application areas
shown: 3D Volume Rendering, Game, CG Rendering, Image Processing,
Search, AV Codec, Vision Computing, Security (AES/DES), Science
Computation, Science Visualization, CAD, GUI, Web, Document Editor,
File System, Network Protocol.)
WHITE PAPER
High-speed internal bus
Four ring-shaped broadband buses, together called the Element
Interconnect Bus (EIB), connect the cores of the Cell/B.E. and
peripheral chips. The PPE, the SPEs, main memory (XDR™ DRAM),
and external I/O devices (the RSX, SouthBridge, etc.) are all
connected to the EIB (Fig. 2).
General-purpose processor core - PPE -
The PPE is a general-purpose processor core with a 64-bit Power
Architecture™ and can run existing software compatible with
the PowerPC chip. One PPE is mounted in the Cell/B.E. The
PPE not only executes the operating system (input/output
control to main memory, external devices, etc.) but also
controls the SPEs.
L1/L2 caches
The PPE is equipped with a 32 KB L1 instruction cache, a 32 KB
L1 data cache, and a 512 KB unified L2 cache for both
instructions and data.
Hardware multithreading
Simultaneous multithreading technology is implemented in
the PPE to provide two-way hardware multithreading.
Vector operation unit
A 128-bit vector operation unit supports the Single Instruction
Multiple Data (SIMD) operations by Vector/SIMD Multimedia
Extensions (VMX) technology.
Computation-intensive processor core - SPE -
The SPE is a computation-intensive processor core that excels
at multimedia processing, especially repeated calculations.
The PPE, on the other hand, is optimized for complicated
program control. Eight SPEs are employed in the Cell/B.E.
Each SPE operates independently of the PPE and the other SPEs
and provides high computational performance.
Vector oriented RISC processor
The SPU instruction set architecture (ISA) supports 128-bit
SIMD operations. All basic data processing is performed by
these operations. Because of its very simple architecture, the
SPE achieves high-speed processing whose timing can be
predicted. It is well suited to applications that require
real-time processing.
Abundant register files
Each SPE has as many as 128 registers of 128 bits each,
because it is expected to process large amounts of data.
By taking advantage of the abundant register files, high-speed
arithmetic operations can be performed on a large volume of
data.
SPE-dedicated local memory
Each SPE is equipped with 256 KB of scratch pad type
memory called Local Store (LS). With no cache mounted, an
SPE accesses LS directly at high speed. DMA transfer is used
for data transfer between main memory and LS.
Memory flow controller
Access from SPE to main memory, PPE and other external
I/O devices is performed through a unit called the Memory
Flow Controller (MFC). The MFC is equipped with a DMA
Controller to exchange data with the main memory. Because
the MFC operates independently of program execution in an
SPE, it can transfer data to external devices in parallel.
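Because the MFC transfers data while the SPE keeps computing, a program can fetch the next block into one Local Store buffer while it processes the current block in another (double buffering). The following is a minimal simulation of that idea in ordinary Python, not Cell/B.E. code: the names `dma_get`, `compute`, and `process` and the 16 KB buffer size are hypothetical, and a worker thread stands in for the MFC.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK = 16 * 1024  # bytes per buffer (hypothetical; two buffers fit easily in 256 KB)


def dma_get(main_memory: bytes, offset: int) -> bytes:
    """Stand-in for an MFC transfer from main memory into a local buffer."""
    return main_memory[offset:offset + CHUNK]


def compute(buf: bytes) -> int:
    """Stand-in for the SPE kernel: here it only sums the bytes."""
    return sum(buf)


def process(main_memory: bytes) -> int:
    total = 0
    with ThreadPoolExecutor(max_workers=1) as mfc:  # the "MFC" runs on its own
        pending = None
        for offset in range(0, len(main_memory), CHUNK):
            nxt = mfc.submit(dma_get, main_memory, offset)  # start next transfer
            if pending is not None:
                total += compute(pending.result())          # work on previous block
            pending = nxt
        if pending is not None:
            total += compute(pending.result())              # last block
    return total


data = bytes(range(256)) * 400
print(process(data) == sum(data))
```

The loop always requests block i+1 before it computes on block i, so in a real system the transfer time can be hidden behind the computation.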
Graphics engine with high-speed I/O function
- RSX -
The RSX is a graphics engine jointly developed by NVIDIA
Corporation and SCEI. This device can be connected to the
Cell/B.E. using a 128-bit memory interface called FlexIO™.
Programmable shader engine
The RSX processes graphics with vertex shader programs
created by the user and computes fragment colors with
fragment shader programs. The RSX has multiple fragment
shader engines for precise graphics operation without
overloading the Cell/B.E., which is difficult in fixed pipelined
graphics engines.
Rendering pipeline
The rendering pipeline in the RSX consists of the host/frontend,
geometry, raster, fragment shader/texture, L2 texture cache,
2D raster, and Raster Operation (ROP) blocks. It also includes
a frame buffer to control memory access and a display block to
scan out from local memory (Fig. 3).
Fig. 2 Overview of the Cell/B.E. and Peripheral Chips
(Blocks shown: PPE with PPU, L1, and L2; SPE0 to SPE7, each with
SPU, LS, and MFC; EIB; MIC/XIO; IOIF0/FlexIO™; IOIF1/FlexIO™;
XDR™ DRAM; RSX®; South Bridge.)
High-speed I/O function
The Cell/B.E. and the RSX are connected via a high-speed
bus called FlexIO for efficient graphics operations. The RSX
also features an external I/O function with performance
exceeding 4 GB/s. Therefore, it can easily handle the I/O
of high-resolution video signals, such as 4K x 2K and High
Definition. This high-speed I/O performance also makes it
usable as an interface (I/F) for professional applications.
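A rough calculation shows why an I/O rate above 4 GB/s is sufficient for such video signals. The frame sizes, the 4 bytes per pixel, and the 60 frames per second used below are illustrative assumptions; the paper does not specify the signal format.

```python
def video_rate_gb_s(width: int, height: int, bytes_per_pixel: int, fps: int) -> float:
    """Uncompressed video data rate in GB/s (1 GB = 10**9 bytes)."""
    return width * height * bytes_per_pixel * fps / 1e9


# Assumed formats: 4096 x 2160 for "4K x 2K" and 1920 x 1080 for High Definition.
rate_4k = video_rate_gb_s(4096, 2160, 4, 60)
rate_hd = video_rate_gb_s(1920, 1080, 4, 60)
print(f"4K x 2K: {rate_4k:.2f} GB/s, HD: {rate_hd:.2f} GB/s")
```

With these assumptions, an uncompressed 4K x 2K stream needs about 2.12 GB/s and an HD stream about 0.50 GB/s, both below 4 GB/s.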
Sony's New Development
-"Cell Computing Unit"
Features of Cell Computing Unit
The newly developed “Cell Computing Unit” is equipped with
the high-performance Cell/B.E. processor, which delivers
floating-point performance of 230 GFLOPS. Additional use
of the RSX graphics engine enables much faster operation.
An open system is offered by adopting the Linux operating
system. Its eight SPEs can be freely used to support
computation-heavy multimedia applications.
Software configuration
The Cell Computing Unit is equipped with two systems: the main
system for general applications and the mini system for
maintenance. Both use Linux as the operating system, and the
system to run can be selected at system startup. Under normal
operation, the main system executes user applications. The mini
system for maintenance is mainly used to update firmware and
drivers.
Graphics API
Graphics programs for the RSX can use a subset of OpenGL
1.5, which is an industry standard for graphics APIs. They
can also use the Cg shader language developed by NVIDIA
Corporation for RSX-native graphics control. Therefore, the
RSX acceleration can be programmed at a low level.
Software development
Applications running on the main system of the Cell
Computing Unit can be developed under Linux on Intel
architecture (x86). A cross-platform toolchain (binutils,
compilers, etc.), debuggers, performance analysis tools, and
integrated development environments are used.
The SPE runtime management library for the Cell/B.E. and the
graphics library for RSX hardware acceleration are currently
being prepared.
Porting and optimization of the existing programs
When porting and optimizing existing programs, it is important
to understand how multiple SPEs are used. The basic
optimization process for existing programs on the Cell/B.E.
consists of three steps: (1) porting to the PPE, (2) performance
measurement and profiling, and (3) offloading to SPEs
(Fig. 4). SPE offloading moves part of the processing in a
program to the various SPEs. Optimization generally proceeds
as a spiral: profiling is performed again after each round of
optimization, which reveals where further performance
improvement is possible.
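The profiling step can be illustrated with a small sketch that times each stage of a program and picks the most expensive one as the candidate for SPE offloading. It is ordinary Python rather than Cell/B.E. tooling, and the stage functions `load`, `transform`, and `summarize` are hypothetical examples.

```python
import time


def load(n):
    """Control-oriented stage (hypothetical): build the input."""
    return list(range(n))


def transform(xs):
    """Compute-oriented stage (hypothetical): repeated arithmetic."""
    return [x * x % 7 for x in xs for _ in range(20)]


def summarize(ys):
    """Final reduction stage (hypothetical)."""
    return sum(ys)


def profile(stages, arg):
    """Run the stages in order and record the wall-clock time of each."""
    timings = {}
    for stage in stages:
        start = time.perf_counter()
        arg = stage(arg)
        timings[stage.__name__] = time.perf_counter() - start
    return arg, timings


result, timings = profile([load, transform, summarize], 50_000)
hotspot = max(timings, key=timings.get)  # the stage that took longest
print("offload candidate:", hotspot)
```

After the chosen stage has been offloaded, running the same measurement again shows which stage dominates next, which is the spiral described above.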
Faster programs
The processing performance of programs on the Cell/B.E. can
be improved drastically by using a variety of optimization
techniques. The most important factor is how parallel a
program is. Programs should be designed and implemented so
that each of the following three kinds of parallelism is
increased.
Fig. 3 Major blocks in rendering pipeline
(Blocks shown: Cell/B.E., IOIF/FlexIO™, Host/FE, Geometry, Raster,
2D Raster, Fragment Shader & Texture, L2 Texture Cache, ROP (Raster
Operation), FB.)
Fig. 4 Basic process of optimization
(Steps shown: Existing Programs, Porting to PPE, Performance
Analysis/Profiling, Decide optimization strategies, SPE Offloading.
Architecture labels: Other Architectures, Power Architecture™,
Cell/B.E. Architecture.)
Parallelism at the thread level
The PPE can execute two threads in parallel with
simultaneous multithreading. When eight SPEs are used, the
Cell/B.E. can execute up to ten threads in parallel. In
general, because the PPE thread can directly access external
I/O devices and handle interrupts such as hardware
exceptions, it controls the entire system, while the SPE
threads mainly handle data calculations. The kind of
processing performed in each thread must be considered in
order to balance the load between the PPE and SPE threads.
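The division of roles between a controlling thread and worker threads can be sketched as follows. This is ordinary Python, not Cell/B.E. code: a pool of eight worker threads stands in for the SPEs, and the names `ppe_main` and `spe_kernel` are hypothetical. It shows how the work is split and collected, not the speed-up a real Cell/B.E. would achieve.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_SPES = 8  # worker threads standing in for the eight SPEs


def spe_kernel(block):
    """Data calculation run by one worker: sum of squares of its block."""
    return sum(x * x for x in block)


def ppe_main(data):
    """Controlling thread: split the data, dispatch blocks, combine results."""
    size = max(1, -(-len(data) // NUM_SPES))  # ceiling division, at least 1
    blocks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=NUM_SPES) as spes:
        partials = list(spes.map(spe_kernel, blocks))
    return sum(partials)


print(ppe_main(list(range(1000))) == sum(x * x for x in range(1000)))
```

Splitting the data into equally sized blocks is the simplest way to balance the load; uneven workloads would call for smaller blocks handed out on demand.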
Parallelism at the data level
Streaming data operations repeat the same processing on
multiple data elements. In such operations, one SIMD
instruction can process multiple data elements in parallel
(Fig. 5). All instructions in the SPE, including arithmetic,
comparison, and bit operation instructions, support SIMD
operation. SIMD operation is also available in the PPE by
using VMX instructions extended from AltiVec technology.
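The four-lane ADD can be modeled in a few lines. This is a plain Python illustration of the semantics only; the function names `simd_add` and `stream_add` are hypothetical, and real code would use the vector instructions of the SPE or PPE.

```python
def simd_add(a, b):
    """Add two 4-lane vectors element by element, as one SIMD ADD would."""
    if len(a) != 4 or len(b) != 4:
        raise ValueError("expected four lanes per operand")
    return tuple(x + y for x, y in zip(a, b))


def stream_add(xs, ys):
    """Add two arrays four elements at a time.

    Assumes both arrays have the same length and that it is a multiple of 4.
    """
    out = []
    for i in range(0, len(xs), 4):
        out.extend(simd_add(xs[i:i + 4], ys[i:i + 4]))
    return out


print(simd_add((1, 2, 3, 4), (10, 20, 30, 40)))
print(stream_add(list(range(8)), [1] * 8))
```

An array of eight elements needs two such operations instead of eight scalar additions, which is where the data-level speed-up comes from.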
Parallelism at the instruction level
An SPE has two long pipelines (Even/Odd) with 26 stages.
The instructions executed in these two pipelines are
predetermined: arithmetic operations are executed in the
Even pipeline, and load/store instructions are executed in the
Odd pipeline. By optimizing the instruction sequence of a
program, two instructions can be issued simultaneously per
cycle (Dual Issue).
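The effect of instruction ordering on dual issue can be shown with a toy model. It is a deliberate simplification and an assumption of this sketch, not the actual SPE issue logic: two adjacent instructions issue together only when the first uses the Even pipeline and the second uses the Odd pipeline, and data dependencies and latencies are ignored.

```python
def issue_cycles(program):
    """Count issue cycles for a sequence of 'E' (Even) and 'O' (Odd) instructions."""
    i = 0
    cycles = 0
    while i < len(program):
        pair = (
            i + 1 < len(program)
            and program[i] == "E"
            and program[i + 1] == "O"
        )
        i += 2 if pair else 1  # a pair issues together in one cycle
        cycles += 1
    return cycles


unscheduled = list("EEEEOOOO")  # four arithmetic, then four load/store
scheduled = list("EOEOEOEO")    # the same instructions, interleaved
print(issue_cycles(unscheduled), issue_cycles(scheduled))
```

In this toy model the interleaved order needs 4 issue cycles instead of 7 for the same eight instructions, which illustrates why instruction scheduling matters on the SPE.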
Conclusion
Sony Corporation will continue to develop technologies that
enable large-scale calculation at high speed, which could not
be obtained from traditional servers and workstations. We will
strive for development of state-of-the-art technologies in high
performance computing, such as CG rendering and scientific
calculations by utilizing the maximum performance of the
high-end Cell/B.E. processor and the high-speed GPU RSX.
Trademarks
Sony is a registered trademark of Sony Corporation.
PLAYSTATION is a registered trademark of Sony Computer Entertainment Inc.
mental images, mental ray and mental mill are registered trademarks and
MetaSL is a trademark of mental images GmbH.
All other trademarks are the properties of their respective owners.
Disclaimer
© 2008 Sony Corporation, All rights reserved.
Reproduction in whole or in part without written permission of Sony
Corporation is prohibited.
Features and specifications are subject to change without notice.
Fig. 5 SIMD operation (ADD instruction)
(a3, a2, a1, a0) + (b3, b2, b1, b0) = (a3+b3, a2+b2, a1+b1, a0+b0)
