DOTE

Chain And Rate

Wednesday, April 30, 2014

The Cell Architecture: An Overview

In Randall Hyde's fine series of books, Write Great Code, one of his fundamental lessons is that, for optimal performance, you need to know how your code runs on the target processor. Nowhere is this truer than when programming the Cell. It isn't enough to learn the C/C++ commands for the different cores; you need to understand how the elements communicate with memory and with one another. This way, you'll have a bubble-free instruction pipeline, an increased probability of cache hits, and an orderly, nonintersecting communication flow between processing elements. What more could anyone ask?

The primary building blocks of the Cell are the Memory Interface Controller (MIC), the PowerPC Processor Element (PPE), the eight Synergistic Processor Elements (SPEs), the Element Interconnect Bus (EIB), and the Input/Output Interface (IOIF). It's a good idea to see how they function individually and how they interact as a whole.
Figure: Top-level anatomy of the Cell processor

The Memory Interface Controller (MIC)
The MIC connects the Cell's system memory to the rest of the chip. It provides two channels to system memory, but because you can't control its operation through code, the discussion of the MIC is limited to this brief treatment. However, you should know that, like the PlayStation 2's Emotion Engine, the first-generation Cell supports connections only to Rambus memory. This memory, called eXtreme Data Rate Dynamic Random Access Memory, or XDR DRAM, differs from conventional DRAM in that it makes eight data transfers per clock cycle rather than the usual two or four. This way, the memory can provide high data bandwidth without needing very high clock frequencies. The XDR interface can support different memory sizes, and the PlayStation 3 uses 256MB of XDR DRAM as its system memory.
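
To see why eight transfers per clock cycle matter, here is a minimal sketch of the peak-bandwidth arithmetic. The function name and the sample inputs in the note below (a 400MHz memory clock and an 8-byte combined interface width) are assumptions chosen for illustration; they do not come from the text above.

```c
#include <assert.h>

/* Peak bandwidth in bytes per second:
 * clock rate (Hz) x transfers per clock x bytes moved per transfer.
 * Hypothetical helper for illustration only. */
double peak_bandwidth_bytes_per_s(double clock_hz,
                                  unsigned transfers_per_clock,
                                  unsigned bus_width_bytes)
{
    assert(clock_hz > 0.0);
    return clock_hz * (double)transfers_per_clock * (double)bus_width_bytes;
}
```

With the assumed inputs, `peak_bandwidth_bytes_per_s(400e6, 8, 8)` works out to 25.6e9 bytes per second, while the same clock and width at two transfers per cycle gives 6.4e9. The clock frequency is unchanged; only the number of transfers per cycle differs.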

The PowerPC Processor Element (PPE)
The PPE is the Cell's control center. It runs the operating system, responds to interrupts, and contains and manages the 512KB L2 cache. It also distributes the processing workload among the SPEs and coordinates their operation. Comparing the Cell to an eight-horse coach, the PPE is the coachman, controlling the coach by feeding the horses and keeping them in line. The PPE consists of two operational blocks. The first is the PowerPC Processor Unit, or PPU. This processor's instruction set is based on the 64-bit PowerPC 970 architecture, used most prominently as the CPU of Apple Computer's Power Mac G5. The PPU executes PPC 970 instructions in addition to other Cell-specific commands, and is the only general-purpose processing unit in the Cell. This is why Linux is installed to run on the PPU and not on the other processing units. The second block is the PowerPC Processor Storage Subsystem, or PPSS, which holds the L2 cache and handles the PPU's memory requests.
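
As a rough illustration of what "distributing the workload" can mean, the sketch below splits a number of work items as evenly as possible among eight workers, one per SPE. This is plain C, not Cell SDK code; `work_slice` and `slice_for_worker` are hypothetical names, and a real PPE program would also have to start the SPE programs and hand each one its slice.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_SPES 8  /* the Cell described here has eight SPEs */

typedef struct {
    size_t start;  /* index of the first item assigned to this worker */
    size_t count;  /* number of items assigned to this worker */
} work_slice;

/* Split `total` items across NUM_SPES workers. The first
 * (total % NUM_SPES) workers each receive one extra item, so the
 * slices are contiguous and differ in size by at most one. */
work_slice slice_for_worker(size_t total, unsigned worker)
{
    assert(worker < NUM_SPES);
    size_t base  = total / NUM_SPES;
    size_t extra = total % NUM_SPES;
    work_slice s;
    s.count = base + (worker < extra ? 1 : 0);
    s.start = worker * base + (worker < extra ? worker : extra);
    return s;
}
```

For 100 items, workers 0 through 3 each get 13 items and workers 4 through 7 each get 12, with no gaps or overlaps between slices.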

The Synergistic Processor Element (SPE)
The PPU is a powerful processor, but it's the Synergistic Processor Unit (SPU) in each SPE that makes the Cell such a groundbreaking device. These processors are designed for one purpose only: high-speed SIMD operations. Each SPU contains two parallel pipelines that execute instructions at 3.2GHz. In only a handful of cycles, one pipeline can multiply and accumulate 128-bit vectors while the other loads more vectors from memory. SPUs weren't designed for general-purpose processing and aren't well suited to run operating systems. Instead, they receive instructions from the PPU, which also starts and stops their execution. The SPU's instructions, like its data, are stored in a unified 256KB local store (LS). The LS is not cache; it's the SPU's own individual memory for instructions and data. This, along with the SPU's large register file (128 registers of 128 bits each), is the only memory the SPU can directly access, so it's important to have a deep understanding of how the LS works and how to transfer its contents to other elements.
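
To make "multiply and accumulate 128-bit vectors" concrete, here is a portable C sketch that emulates the operation on four 32-bit floats at a time. It is not SPU code: `vec4f`, `vec4f_madd`, and `dot_product` are hypothetical names, and the loop over lanes stands in for what an SPU does in a single instruction (SPU C code would typically use an intrinsic such as `spu_madd` instead).

```c
#include <assert.h>
#include <stddef.h>

/* Four 32-bit floats, standing in for one 128-bit SIMD register. */
typedef struct { float lane[4]; } vec4f;

/* Lane-wise multiply-accumulate: r = a * b + c. */
vec4f vec4f_madd(vec4f a, vec4f b, vec4f c)
{
    vec4f r;
    for (int i = 0; i < 4; ++i)
        r.lane[i] = a.lane[i] * b.lane[i] + c.lane[i];
    return r;
}

/* Dot product of two float arrays, four elements per step.
 * For simplicity, n must be a multiple of 4. */
float dot_product(const float *x, const float *y, size_t n)
{
    assert(n % 4 == 0);
    vec4f acc = {{0.0f, 0.0f, 0.0f, 0.0f}};
    for (size_t i = 0; i < n; i += 4) {
        vec4f a = {{x[i], x[i + 1], x[i + 2], x[i + 3]}};
        vec4f b = {{y[i], y[i + 1], y[i + 2], y[i + 3]}};
        acc = vec4f_madd(a, b, acc);  /* four multiply-adds per step */
    }
    /* Sum the four partial results held in the accumulator lanes. */
    return acc.lane[0] + acc.lane[1] + acc.lane[2] + acc.lane[3];
}
```

The accumulator keeps four independent partial sums, which is why the final horizontal add is needed. Keeping data in this four-wide shape for as long as possible is the general idea behind writing code for a SIMD-only processor.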

The Element Interconnect Bus (EIB)
The EIB serves as the infrastructure underlying the DMA requests and interelement communication. Functionally, it consists of four rings, two that carry data in the clockwise direction (PPE > SPE1 > SPE3 > SPE5 > SPE7 > IOIF1 > IOIF0 > SPE6 > SPE4 > SPE2 > SPE0 > MIC) and two that transfer data in the counterclockwise direction. Each ring is 16 bytes wide and can support three data transfers simultaneously. Each DMA transfer can hold payload sizes of 1, 2, 4, 8, and 16 bytes, and multiples of 16 bytes up to a maximum of 16KB. Whatever its payload size, a DMA transfer travels over the bus in units of eight 16-byte bus transfers (128 bytes), so a large transfer is carried as a sequence of these units.
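
The payload-size rule above is easy to get wrong in practice, so here is a small sketch that encodes it. The names `dma_size_is_valid` and `dma_transfers_needed` are hypothetical helpers written for this post, not part of any Cell SDK, and the second function simply assumes a buffer is moved in maximum-size pieces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DMA_MAX_BYTES (16u * 1024u)  /* 16KB upper limit per DMA transfer */

/* True if n is a legal DMA payload size: 1, 2, 4, 8, or 16 bytes,
 * or a multiple of 16 bytes no larger than 16KB. */
bool dma_size_is_valid(size_t n)
{
    if (n == 1 || n == 2 || n == 4 || n == 8)
        return true;
    return n >= 16 && n <= DMA_MAX_BYTES && n % 16 == 0;
}

/* Number of DMA transfers needed to move `total` bytes when each
 * transfer carries as much as the 16KB limit allows.
 * `total` must be a multiple of 16 so that every piece is legal. */
size_t dma_transfers_needed(size_t total)
{
    assert(total % 16 == 0);
    return (total + DMA_MAX_BYTES - 1) / DMA_MAX_BYTES;
}
```

For example, 24 bytes is not a valid payload (it is neither a small power of two nor a multiple of 16), and a 64KB buffer needs four maximum-size transfers.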

The Input/Output Interface (IOIF)
As the name implies, the IOIF connects the Cell to external peripherals. Like the memory interface, it is based on a Rambus technology: FlexIO. The FlexIO connections can be configured for data rates between 400MHz and 8GHz, and with the high number of connections on the Cell, its maximum I/O bandwidth approaches 76.8GB/s. In the PlayStation 3, the I/O is connected to Nvidia's RSX graphics processor. The IOIF can be accessed only by privileged applications.