TPC (Texture Processing Cluster)

The TPC is a concept found on NVIDIA GPUs. On the G80 and GT200 architectures, a TPC, or Texture / Processor Cluster, is a group made up of several SMs, a texture unit and some control logic.

An SM is a Streaming Multiprocessor and is made up of several SPs (Streaming Processors) and several SFUs (Special Function Units, the units used for transcendental functions such as sine or cosine). A Streaming Processor is also called a CUDA core (in the new Fermi terminology).

The TPC of a G80 GPU has 2 SMs while the TPC of a GT200 has 3 SMs.

An SP contains an integer ALU and an FPU. An ALU is an Arithmetic Logic Unit and an FPU is a Floating Point Unit. The SP is the actual processing element that operates on vertex or pixel data.

Several TPCs can be grouped into a higher-level entity called a Streaming Processor Array (SPA).

In OpenCL terminology, an SM is called a Compute Unit, or CU.

But in NVIDIA’s new GPU, the GF100 / Fermi, the TPC is no longer a valid concept: only the SMs remain. We can also say that on the Fermi architecture, a TPC = an SM.

In the Fermi architecture, an SM is made up of two 16-way SIMD units. Each 16-way SIMD unit has 16 SPs, so a Fermi SM has 32 SPs, or 32 CUDA cores.
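The hierarchy described above (TPC, then SM, then SP) can be turned into a small back-of-the-envelope calculation. The sketch below is only illustrative: the SMs-per-TPC values and the 2 x 16 SPs of a Fermi SM come from the text, while the TPC counts (8, 10, 16) and the 8 SPs per SM on G80 and GT200 are assumptions taken from NVIDIA's published specifications for the full chips, not from this article. The `GpuLayout` class and its field names are hypothetical, and GF100 is modeled with one SM per "TPC", following the TPC = SM convention used here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuLayout:
    """Hypothetical description of a GPU's TPC -> SM -> SP hierarchy."""
    name: str
    tpcs: int          # assumed TPC count of the full chip
    sms_per_tpc: int   # from the text: 2 on G80, 3 on GT200, 1 on Fermi
    sps_per_sm: int    # SPs (CUDA cores) per SM

    @property
    def sm_count(self) -> int:
        return self.tpcs * self.sms_per_tpc

    @property
    def sp_count(self) -> int:
        return self.sm_count * self.sps_per_sm


LAYOUTS = [
    GpuLayout("G80", tpcs=8, sms_per_tpc=2, sps_per_sm=8),
    GpuLayout("GT200", tpcs=10, sms_per_tpc=3, sps_per_sm=8),
    # Fermi: two 16-way SIMD units per SM, and a TPC is just one SM.
    GpuLayout("GF100", tpcs=16, sms_per_tpc=1, sps_per_sm=2 * 16),
]

for gpu in LAYOUTS:
    print(f"{gpu.name}: {gpu.sm_count} SMs, {gpu.sp_count} SPs")
```

Under these assumptions the totals come out to 128 SPs for G80, 240 for GT200 and 512 for a full GF100.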

Prior GPUs used IEEE 754-1985 floating point arithmetic. The Fermi architecture implements the new IEEE 754-2008 floating-point standard, providing the fused multiply-add (FMA) instruction for both single and double precision arithmetic. FMA improves over a multiply-add (MAD) instruction by doing the multiplication and addition with a single final rounding step, with no loss of precision in the addition. FMA is more accurate than performing the operations separately. GT200 implemented double precision FMA.
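To see why a single final rounding matters, here is a minimal sketch in Python. It is not GPU code: it works in double precision on the CPU, and the `fma` function below is an emulation (an assumption of this example) that computes a*b + c exactly with `fractions.Fraction` and then rounds once when converting back to a float. The unfused `mad` function rounds the product first and the sum second, which is a simplification of what an actual MAD unit does.

```python
from fractions import Fraction


def mad(a: float, b: float, c: float) -> float:
    """Unfused multiply-add: the product is rounded, then the sum is rounded."""
    return a * b + c


def fma(a: float, b: float, c: float) -> float:
    """Emulated fused multiply-add: exact a*b + c, then one final rounding."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)  # single rounding step, back to double precision


# (1 + 2^-30) * (1 - 2^-30) = 1 - 2^-60 exactly, which is too fine for a double.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

print("mad:", mad(a, b, c))  # the product rounds to 1.0, so the tiny term is lost
print("fma:", fma(a, b, c))  # the tiny term -2^-60 survives the single rounding
```

With these inputs the unfused version returns 0.0, while the fused version returns -2^-60: the information lost by rounding the intermediate product is exactly what FMA preserves.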

In GT200, the integer ALU was limited to 24-bit precision for multiply operations; as a result, multi-instruction emulation sequences were required for integer arithmetic. In Fermi, the newly designed integer ALU supports full 32-bit precision for all instructions, consistent with standard programming language requirements. The integer ALU is also optimized to efficiently support 64-bit and extended precision operations. Various instructions are supported, including Boolean, shift, move, compare, convert, bit-field extract, bit-reverse, and population count.
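The idea of a "multi-instruction emulation sequence" can be sketched as follows. This is only a model, not the actual instruction sequence emitted for GT200: `mul24` is a hypothetical stand-in for a multiplier that only looks at the low 24 bits of each operand, and `mul32_emulated` shows one classic way to build a full 32-bit multiply out of such a unit, by splitting the operands into 16-bit halves so that every partial product fits.

```python
MASK24 = (1 << 24) - 1
MASK32 = (1 << 32) - 1


def mul24(a: int, b: int) -> int:
    """Model of a 24-bit multiplier: ignores operand bits above bit 23
    and returns the low 32 bits of the product."""
    return ((a & MASK24) * (b & MASK24)) & MASK32


def mul32_emulated(a: int, b: int) -> int:
    """Low 32 bits of a * b, built from three 24-bit multiplies."""
    a_lo, a_hi = a & 0xFFFF, (a >> 16) & 0xFFFF
    b_lo, b_hi = b & 0xFFFF, (b >> 16) & 0xFFFF
    low = mul24(a_lo, b_lo)                          # 16 x 16 bits, fits in 32 bits
    cross = mul24(a_hi, b_lo) + mul24(a_lo, b_hi)    # terms weighted by 2^16
    # The a_hi * b_hi term is weighted by 2^32 and drops out of the low word.
    return (low + (cross << 16)) & MASK32


a, b = 0x01000001, 3
print(hex(mul24(a, b)))            # bit 24 of a is ignored by the 24-bit unit
print(hex(mul32_emulated(a, b)))   # matches (a * b) & 0xFFFFFFFF
```

One multiply thus becomes several multiplies, shifts and adds, which is the overhead a native 32-bit integer ALU removes.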
