GPU microarchitecture by Nvidia
Kepler is the codename for a GPU microarchitecture developed by Nvidia, first introduced at retail in April 2012,[1] as the successor to the Fermi microarchitecture. Kepler was Nvidia's first microarchitecture to focus on energy efficiency. Most GeForce 600 series, most GeForce 700 series, and some GeForce 800M series GPUs were based on Kepler, all manufactured on a 28 nm process. Kepler was also used in the GK20A, the GPU component of the Tegra K1 SoC, and in the Quadro Kxxx series, the Quadro NVS 510, and Tesla computing modules.
Kepler was followed by the Maxwell microarchitecture and used alongside Maxwell in the GeForce 700 series and GeForce 800M series.
The architecture is named after Johannes Kepler, a German mathematician and key figure in the 17th century scientific revolution.
Overview
The goal of Nvidia's previous architecture, Fermi, was to increase performance in compute and tessellation. With the Kepler architecture, Nvidia shifted its focus to efficiency, programmability, and performance.[2][3] The efficiency aim was achieved through the use of a unified GPU clock, simplified static scheduling of instructions, and a greater emphasis on performance per watt.[4] Abandoning the shader clock found in Nvidia's previous GPU designs increases efficiency, even though it requires additional cores to achieve higher levels of performance. This is not only because the cores are more power-friendly (two Kepler cores use 90% of the power of one Fermi core, according to Nvidia's numbers), but also because the change to a unified GPU clock scheme delivers a 50% reduction in power consumption in that area.[5]
The programmability aim was achieved with Kepler's Hyper-Q, Dynamic Parallelism, and several new Compute Capability 3.x features. These allowed higher GPU utilization and simpler code management on GK GPUs, giving programmers more flexibility on Kepler.[6]
Finally, the performance aim was met with additional execution resources (more CUDA cores, registers, and cache) and with Kepler's ability to reach an effective memory clock speed of 7 GHz, both of which increase Kepler's performance compared to previous Nvidia GPUs.[5][7]
Features
The GK series GPUs contain features from both the older Fermi and newer Kepler generations. Kepler-based members add the following standard features:
- PCI Express 3.0 interface
- DisplayPort 1.2
- HDMI 1.4a 4K x 2K video output
- PureVideo VP5 hardware video acceleration (up to 4K x 2K H.264 decode)
- Hardware H.265 decoding[8]
- Hardware H.264 encoding acceleration block (NVENC)
- Support for up to 4 independent 2D displays, or 3 stereoscopic/3D displays (NV Surround)
- Next Generation Streaming Multiprocessor (SMX)
- Polymorph-Engine 2.0
- Simplified Instruction Scheduler
- Bindless Textures
- CUDA Compute Capability 3.0 to 3.5
- GPU Boost (upgraded to 2.0 on GK110)
- TXAA support
- Manufactured by TSMC on a 28 nm process
- New shuffle instructions
- Dynamic Parallelism
- Hyper-Q (Hyper-Q's MPI functionality is reserved for Tesla only)
- Grid Management Unit
- Nvidia GPUDirect (GPUDirect's RDMA functionality is reserved for Tesla only)
Next Generation Streaming Multiprocessor (SMX)
Kepler employs a new streaming multiprocessor architecture called SMX. The number of CUDA execution cores was increased from 32 in each of 16 SMs to 192 in each of 8 SMXs, while the register file was only doubled per SMX, to 65,536 x 32-bit, giving a lower overall ratio of registers to cores. Because of this and other compromises, the actual performance gains in most operations were well under 3x, despite the 3x overall increase in CUDA cores and a higher clock (on the GTX 680 compared with the Fermi-based GTX 580). Dedicated FP64 CUDA cores are used rather than treating two FP32 cores as a single unit as was done previously, and very few were included on the consumer models, resulting in FP64 calculation at 1/24 the speed of FP32.[9]
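The core and register figures quoted above can be checked with a few lines of arithmetic. The sketch below is only illustrative: the Fermi register file size of 32,768 entries per SM is inferred from the statement that Kepler doubled it, and the dictionary and function names are made up for this example.

```python
# Figures quoted in the text (GTX 580 "Fermi" vs. GTX 680 "Kepler").
# Fermi's 32,768 registers per SM is inferred from "only doubled to 65,536".
FERMI = {"sm_count": 16, "cores_per_sm": 32, "registers_per_sm": 32_768}
KEPLER = {"sm_count": 8, "cores_per_sm": 192, "registers_per_sm": 65_536}


def total_cores(gpu):
    """Total CUDA cores on the chip."""
    return gpu["sm_count"] * gpu["cores_per_sm"]


def registers_per_core(gpu):
    """32-bit registers available per CUDA core within one SM/SMX."""
    return gpu["registers_per_sm"] / gpu["cores_per_sm"]


core_ratio = total_cores(KEPLER) / total_cores(FERMI)
print(total_cores(FERMI), total_cores(KEPLER), core_ratio)              # 512 1536 3.0
print(registers_per_core(FERMI), round(registers_per_core(KEPLER), 1))  # 1024.0 341.3
```

Under these numbers each Kepler core has roughly a third of the register capacity that a Fermi core had, which is the "lower overall ratio" the paragraph refers to.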
On the HPC models, the GK110 and GK210, the SMX count was raised to 13-15 depending on the product, and more FP64 cores were included to bring the FP64 compute rate up to 1/3 of FP32. On the GK110, the per-thread register limit was quadrupled over Fermi to 255, but a kernel whose threads each use half of that limit can still fill only 1/4 of an SMX's thread slots. The GK210 doubled the register file per SMX to 512 KB to improve performance in high register pressure situations like this. Texture cache, which programmers had already been using for compute as a read-only buffer in previous generations, was increased in size, and its data path was optimized for faster throughput when used this way. All levels of memory, including the register file, also have single-bit ECC.
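The relationship between per-thread register use and how many threads fit on an SMX can be worked through as follows. This is a simplified sketch: it assumes a limit of 2,048 resident threads per SMX (the published figure for compute capability 3.x, not stated in this article) and ignores the allocation granularity that real hardware applies per warp.

```python
REGISTERS_PER_SMX = 65_536        # 32-bit registers per SMX on GK110 (from the text)
MAX_REGISTERS_PER_THREAD = 255    # GK110 per-thread limit (from the text)
MAX_THREADS_PER_SMX = 2_048       # assumption: resident-thread limit, compute capability 3.x


def resident_threads(registers_per_thread):
    """Threads that fit on one SMX when registers are the only constraint."""
    by_registers = REGISTERS_PER_SMX // registers_per_thread
    return min(by_registers, MAX_THREADS_PER_SMX)


def occupancy(registers_per_thread):
    """Fraction of the SMX's thread slots that can be filled."""
    return resident_threads(registers_per_thread) / MAX_THREADS_PER_SMX


half_limit = (MAX_REGISTERS_PER_THREAD + 1) // 2             # 128 registers per thread
print(resident_threads(half_limit), occupancy(half_limit))   # 512 0.25
print(resident_threads(32), occupancy(32))                   # 2048 1.0
```

With 128 registers per thread only 512 threads fit, a quarter of the assumed 2,048 slots, while a kernel using 32 registers per thread can fill the SMX.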
Another notable feature is that while Fermi GPUs could only be accessed by one CPU thread at a time, the HPC Kepler GPUs added multithreading support so high core count processors could open 32 connections and more easily saturate the compute capability.
[10]
Simplified Instruction Scheduler
Additional die space reduction and power savings were achieved by removing a complex hardware block that handled the prevention of data hazards.[3][5][11][12]
GPU Boost
GPU Boost is a new feature which is roughly analogous to turbo boosting of a CPU. The GPU is always guaranteed to run at a minimum clock speed, referred to as the "base clock". This clock speed is set to the level which will ensure that the GPU stays within TDP specifications, even at maximum loads.[3] When loads are lower, however, there is room for the clock speed to be increased without exceeding the TDP. In these scenarios, GPU Boost gradually increases the clock speed in steps until the GPU reaches a predefined power target, which is 170 W by default on the GTX 680.[5] By taking this approach, the GPU ramps its clock up or down dynamically, so that it provides the maximum possible speed while remaining within TDP specifications.
The power target, as well as the size of the clock increase steps that the GPU will take, are both adjustable via third-party utilities and provide a means of overclocking Kepler-based cards.
[3]
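The stepping behaviour described in this section can be pictured with a toy model. Everything here apart from the 170 W power target is hypothetical: the clock values, the step size, and the linear power models are invented for illustration, and real hardware measures power rather than computing it from a formula.

```python
def boost_clock(base_mhz, max_mhz, step_mhz, power_target_w, power_at):
    """Raise the clock in fixed steps while estimated power stays within the target.

    power_at is a hypothetical callable mapping a clock in MHz to estimated
    board power in watts. The base clock is always returned at minimum.
    """
    clock = base_mhz
    while clock + step_mhz <= max_mhz and power_at(clock + step_mhz) <= power_target_w:
        clock += step_mhz
    return clock


# Hypothetical linear power models for a light and a heavy workload.
def light_load(mhz):
    return 0.13 * mhz   # watts


def heavy_load(mhz):
    return 0.17 * mhz   # watts


POWER_TARGET_W = 170    # default target quoted for the GTX 680

print(boost_clock(1000, 1100, 13, POWER_TARGET_W, light_load))  # 1091
print(boost_clock(1000, 1100, 13, POWER_TARGET_W, heavy_load))  # 1000
```

In this model the light workload leaves enough power headroom to climb to the highest step below the cap, while the heavy workload already sits at the power target and stays at the base clock. Raising `power_target_w` or `step_mhz` corresponds to the adjustments that overclocking utilities expose.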
Microsoft Direct3D Support
Nvidia Fermi and Kepler GPUs in the GeForce 600 series support the Direct3D 11.0 specification. Nvidia originally stated that the Kepler architecture has full DirectX 11.1 support, which includes the Direct3D 11.1 path.[13] The following "Modern UI" Direct3D 11.1 features, however, are not supported:[14][15]
- Target-Independent Rasterization (2D rendering only).
- 16xMSAA Rasterization (2D rendering only).
- Orthogonal Line Rendering Mode.
- UAV (Unordered Access View) in non-pixel-shader stages.
According to Microsoft's definition, Direct3D feature level 11_1 must be complete; otherwise the Direct3D 11.1 path cannot be executed.[16] The integrated Direct3D features of the Kepler architecture are the same as those of the GeForce 400 series Fermi architecture.[15]
Next Microsoft Direct3D Support
Nvidia Kepler GPUs of the GeForce 600/700 series support Direct3D 12 feature level 11_0.
[17]
TXAA Support
Exclusive to Kepler GPUs, TXAA is a new anti-aliasing method from Nvidia that is designed for direct implementation into game engines. TXAA is based on the MSAA technique and custom resolve filters. It is designed to address a key problem in games known as shimmering or temporal aliasing. TXAA addresses this by smoothing out the scene in motion, reducing aliasing and shimmering in in-game scenes.[3]
Shuffle Instructions
The GK110 added a small number of instructions to further improve performance. New shuffle instructions allow threads within a warp to share data among themselves in a single instruction, replacing the separate store and load operations that previously required two accesses to local memory. This makes the process around 6% faster than using local data storage. Atomic operations were also improved, with 9x increases in speed for some instructions and the addition of more atomic 64-bit operations, namely min, max, and, or, and xor.[11]
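To show the kind of data exchange a shuffle instruction performs, the following sketch emulates a warp-wide sum in plain Python. It is an emulation only, not GPU code: the list stands in for one register per lane of a 32-thread warp, and `shfl_down` is a hypothetical helper modelled on the behaviour of CUDA's shuffle-down intrinsic, where each lane reads a register from the lane a fixed distance above it without going through memory.

```python
WARP_SIZE = 32


def shfl_down(values, delta):
    """Emulate a shuffle-down: lane i reads the register of lane i + delta.

    Lanes whose source would fall outside the warp keep their own value.
    """
    return [values[i + delta] if i + delta < WARP_SIZE else values[i]
            for i in range(WARP_SIZE)]


def warp_reduce_sum(values):
    """Tree reduction over one warp; lane 0 ends up holding the total."""
    lanes = list(values)
    assert len(lanes) == WARP_SIZE
    offset = WARP_SIZE // 2
    while offset > 0:
        received = shfl_down(lanes, offset)
        lanes = [own + other for own, other in zip(lanes, received)]
        offset //= 2
    return lanes[0]


print(warp_reduce_sum(range(32)))   # 496
```

The reduction finishes in five exchange steps (offsets 16, 8, 4, 2, 1). Before shuffle instructions existed, each of those steps would have needed a store to and a load from on-chip memory shared by the threads.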
Hyper-Q
Hyper-Q expands the GK110 hardware work queues from 1 to 32. This matters because, with a single work queue, Fermi could be under-occupied at times when there was not enough work in that queue to fill every SM. With 32 work queues, GK110 can in many scenarios achieve higher utilization by placing different task streams on what would otherwise be an idle SMX. Hyper-Q also maps easily to MPI, a message passing interface frequently used in HPC. Legacy MPI-based algorithms that were originally designed for multi-CPU systems, and that became bottlenecked by false dependencies, now have a solution: by increasing the number of MPI jobs, Hyper-Q can improve the efficiency of these algorithms without changing the code itself.[11]
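The effect of a single hardware queue versus one queue per stream can be illustrated with a deliberately simplified scheduling model. The assumptions are all ours: every kernel takes one time step on one execution slot, kernels of the same stream must run one after another, and the dispatcher can only see the kernel at the head of each queue. The function name `makespan` and the stream labels are hypothetical.

```python
def makespan(queues, sm_slots):
    """Time steps needed to drain the hardware work queues in this toy model.

    queues:   list of queues; each queue lists stream ids, one entry per
              kernel, in issue order.
    sm_slots: kernels that can execute concurrently in one step (at least 1).
    """
    queues = [list(q) for q in queues]
    steps = 0
    while any(queues):
        busy_streams = set()
        free = sm_slots
        for q in queues:
            # Dispatch from the head until it reaches a kernel whose stream
            # already has a kernel running in this step (a dependency).
            while q and free > 0 and q[0] not in busy_streams:
                busy_streams.add(q.pop(0))
                free -= 1
        steps += 1
    return steps


streams = ["A", "B", "C", "D"]
kernels_per_stream = 3

# One shared queue: independent streams end up waiting behind each other.
single_queue = [[s for s in streams for _ in range(kernels_per_stream)]]
# One queue per stream: each stream advances independently.
per_stream_queues = [[s] * kernels_per_stream for s in streams]

print(makespan(single_queue, sm_slots=8))       # 9
print(makespan(per_stream_queues, sm_slots=8))  # 3
```

In the single-queue case, stream B cannot start until stream A's last kernel reaches the head of the queue, even though the two streams have nothing to do with each other. That is the false dependency; with separate queues the four streams finish in three steps.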
Dynamic Parallelism
Dynamic Parallelism is the ability of kernels to dispatch other kernels. With Fermi, only the CPU could dispatch a kernel, which incurs a certain amount of overhead from having to communicate back to the CPU. By giving kernels the ability to dispatch their own child kernels, GK110 can both save time by not having to go back to the CPU and, in the process, free up the CPU to work on other tasks.[11]
Grid Management Unit
Enabling Dynamic Parallelism requires a new grid management and dispatch control system. The new Grid Management Unit (GMU) manages and prioritizes grids to be executed. The GMU can pause the dispatch of new grids and queue pending and suspended grids until they are ready to execute, providing the flexibility to enable powerful runtimes, such as Dynamic Parallelism. The CUDA Work Distributor in Kepler holds grids that are ready to dispatch, and is able to dispatch 32 active grids, which is double the capacity of the Fermi CWD. The Kepler CWD communicates with the GMU via a bidirectional link that allows the GMU to pause the dispatch of new grids and to hold pending and suspended grids until needed. The GMU also has a direct connection to the Kepler SMX units to permit grids that launch additional work on the GPU via Dynamic Parallelism to send the new work back to GMU to be prioritized and dispatched. If the kernel that dispatched the additional workload pauses, the GMU will hold it inactive until the dependent work has completed.
[12]
Nvidia GPUDirect
Nvidia GPUDirect is a capability that enables GPUs within a single computer, or GPUs in different servers located across a network, to directly exchange data without needing to go to CPU/system memory. The RDMA feature in GPUDirect allows third party devices such as SSDs, NICs, and IB adapters to directly access memory on multiple GPUs within the same system, significantly decreasing the latency of MPI send and receive messages to/from GPU memory.
[18]
It also reduces demands on system memory bandwidth and frees the GPU DMA engines for use by other CUDA tasks. The Kepler GK110 die also supports other GPUDirect features, including Peer-to-Peer and GPUDirect for Video.
Video decompression/compression
NVDEC
NVENC
NVENC is Nvidia's power-efficient fixed-function encoder that can take codecs, decode, preprocess, and encode H.264-based content. NVENC's output format is limited to H.264; even so, it can support encoding at resolutions of up to 4096x4096.[19] Like Intel's QuickSync, NVENC is currently exposed through a proprietary API, though Nvidia does have plans to provide NVENC usage through CUDA.[19]
Performance
The theoretical single-precision processing power of a Kepler GPU in GFLOPS is computed as 2 (operations per FMA instruction per CUDA core per cycle) × number of CUDA cores × core clock speed (in GHz). Note that, like the previous-generation Fermi, Kepler cannot gain additional processing power by dual-issuing MAD+MUL as the Tesla microarchitecture could.
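As a worked instance of the formula, the sketch below computes the theoretical figure for a GK104 with all 1536 CUDA cores enabled. The 1.006 GHz value is the GTX 680's base clock, brought in here as an assumption for illustration since the article does not state it; the function name is likewise made up.

```python
def peak_sp_gflops(cuda_cores, core_clock_ghz):
    """Theoretical single-precision throughput: 2 FLOPs per FMA per core per cycle."""
    return 2 * cuda_cores * core_clock_ghz


# GK104: 1536 CUDA cores; 1.006 GHz is an assumed clock (GTX 680 base clock).
print(round(peak_sp_gflops(1536, 1.006), 1))   # 3090.4
```

The result is an upper bound that assumes every core issues an FMA on every cycle, so measured throughput will be lower.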
The theoretical double-precision processing power of a Kepler GK110/210 GPU is 1/3 of its single-precision performance. This double-precision processing power is, however, only available on professional Quadro, Tesla, and high-end Titan-branded GeForce cards, while drivers for consumer GeForce cards limit the performance to 1/24 of the single-precision performance.[20] The lower-performance GK10x dies are similarly capped to 1/24 of the single-precision performance.[21]
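The 1/24 figure for consumer GK110 cards follows from two factors given in the cited review: a 3:1 ratio of single- to double-precision cores and FP64 units run at 1/8 of the GPU clock. The short sketch below reproduces that arithmetic; the 1 GHz clock used for the example GPU is hypothetical and chosen only to keep the numbers round.

```python
from fractions import Fraction

FP64_CORE_RATIO = Fraction(1, 3)       # GK110/210: one FP64 core per three FP32 cores
CONSUMER_CLOCK_SCALE = Fraction(1, 8)  # driver-limited FP64 clock on consumer GK110 cards


def peak_dp_gflops(sp_gflops, ratio):
    """Theoretical double-precision throughput as a fraction of single precision."""
    return sp_gflops * ratio


consumer_ratio = FP64_CORE_RATIO * CONSUMER_CLOCK_SCALE
print(consumer_ratio)   # 1/24

# Hypothetical GK110 with all 2880 cores at 1 GHz: 2 * 2880 * 1 = 5760 GFLOPS.
sp = 2 * 2880 * 1
print(peak_dp_gflops(sp, FP64_CORE_RATIO))   # 1920
print(peak_dp_gflops(sp, consumer_ratio))    # 240
```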
Kepler dies
| Kepler | GK104 | GK106 | GK107 | GK110 |
|---|---|---|---|---|
| Variant(s) | GK104-200-A2, GK104-300-A2, GK104-325-A2, GK104-400-A2, GK104-425-A2, GK104-850-A2 | GK106-240-A1, GK107-400-A1 | GK107-300-A2, GK107-301-A2, GK107-320-A2, GK107-400-A2, GK107-425-A2, GK107-450-A2, GK107-810-A2 | GK110-300-A1, GK110-400-A1, GK110-425-B1, GK110-885-A1 |
| Release date | Apr 3, 2012 | Sep 6, 2012 | Sep 6, 2012 | Nov 12, 2012 |
| CUDA cores | 1536 | 960 | 384 | 2880 |
| TMUs | 128 | 80 | 32 | 240 |
| ROPs | 32 | 24 | 16 | 48 |
| Streaming multiprocessors | 8 | 5 | 2 | 15 |
| GPCs | 4 | 3 | 1 | 5 |
| L1 cache | 128 KB | 80 KB | 32 KB | 240 KB |
| L2 cache | 512 KB | 512 KB | 256 KB | 1.5 MB |
| Memory interface | 256-bit | 192-bit | 192-bit | 384-bit |
| Die size | 294 mm² | 221 mm² | 118 mm² | 561 mm² |
| Transistor count | 3.54 bn | 2.54 bn | 1.27 bn | 7.08 bn |
| Transistor density | 12.0 MTr/mm² | 11.5 MTr/mm² | 10.8 MTr/mm² | 12.6 MTr/mm² |
| Package socket | BGA 1745 | BGA 1425 | BGA 908 | BGA 2152 |
| Consumer products, desktop | GTX 660, GTX 660 Ti, GTX 670, GTX 680, GTX 690, GTX 760, GTX 760 Ti, GTX 770 | GTX 650, GTX 650 Ti, GTX 660, GTX 750 Ti | GT 630, GTX 650, GT 720, GT 730, GT 740, GT 1030 | GTX 780, GTX Titan |
| Consumer products, mobile | GTX 670MX, GTX 675MX, GTX 680M, GTX 680MX, GTX 775M, GTX 780M, GTX 860M, GTX 870M, GTX 880M | GTX 765M, GTX 770M | GT 640M, GTX 640M LE, GT 645M, GT 650M, GTX 660M, GT 740M, GT 745M, GT 750M, GT 755M, GTX 810M, GTX 820M | ? |
| Workstation products, desktop | Quadro K4200, Quadro K5000 | Quadro K4000, Quadro K5000 | Quadro K410, Quadro K420, Quadro K600, Quadro K2000, Quadro K2000D | Quadro K5200, Quadro K6000 |
| Workstation products, mobile | Quadro K3000M, Quadro K3100M, Quadro K4000M, Quadro K4100M, Quadro K5000M, Quadro K5100M | ? | Quadro K100M, Quadro K200M, Quadro K500M, Quadro K1000M, Quadro K1100M, Quadro K2000M | ? |
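The transistor density row in the table can be derived from the transistor count and die size rows. The following sketch recomputes it as a consistency check; the dictionary and function names are illustrative only.

```python
# (transistors in billions, die area in mm^2), as listed in the table
DIES = {
    "GK104": (3.54, 294),
    "GK106": (2.54, 221),
    "GK107": (1.27, 118),
    "GK110": (7.08, 561),
}


def density_mtr_per_mm2(transistors_bn, area_mm2):
    """Millions of transistors per square millimetre."""
    return transistors_bn * 1000 / area_mm2


densities = {name: round(density_mtr_per_mm2(t, a), 1) for name, (t, a) in DIES.items()}
print(densities)
```

The computed values agree with the table: 12.0, 11.5, 10.8, and 12.6 MTr/mm² for GK104, GK106, GK107, and GK110 respectively.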
Kepler 2.0
See also
References
1. Mujtaba, Hassan (18 February 2012). "Nvidia Expected to launch Eight New 28nm Kepler GPU's in April 2012".
2. "Inside Kepler" (PDF). Retrieved 2015-09-19.
3. "Introducing The GeForce GTX 680 GPU". Nvidia. March 22, 2012. Retrieved 2015-09-19.
4. "Nvidia's Next Generation CUDA Compute Architecture: Kepler™ GK110" (PDF). Nvidia.
5. Smith, Ryan (March 22, 2012). "Nvidia GeForce GTX 680 Review: Retaking The Performance Crown". AnandTech. Retrieved November 25, 2012.
6. "Efficiency Through Hyper-Q, Dynamic Parallelism, & More". Nvidia. November 12, 2012. Retrieved 2015-09-19.
7. "GeForce GTX 770 | Specifications | GeForce". Nvidia. Retrieved 2022-06-07.
8. https://bluesky-soft.com/en/dxvac/deviceInfo/decoder/nvidia.html
9. "GeForce 680 (Kepler) Whitepaper" (PDF). Nvidia. Retrieved March 22, 2024.
10. "Nvidia Kepler GK210/110 Architecture White Paper" (PDF). Nvidia. Retrieved 22 March 2024.
11. Smith, Ryan (November 12, 2012). "Nvidia Launches Tesla K20 & K20X: GK110 Arrives At Last". AnandTech. Retrieved September 19, 2015.
12. "Nvidia Kepler GK110 Architecture Whitepaper" (PDF). Nvidia. Retrieved 2015-09-19.
13. "Nvidia Launches First GeForce GPUs Based on Next-Generation Kepler Architecture". Nvidia. March 22, 2012. Archived from the original on June 14, 2013.
14. Edward, James (November 22, 2012). "Nvidia claims partially support DirectX 11.1". TechNews. Archived from the original on June 28, 2015. Retrieved 2015-09-19.
15. "Nvidia Doesn't Fully Support DirectX 11.1 with Kepler GPUs, But… (Web Archive Link)". BSN. Archived from the original on December 29, 2012.
16. "D3D_FEATURE_LEVEL enumeration (Windows)". MSDN. Retrieved 2015-09-19.
17. Moreton, Henry (March 20, 2014). "DirectX 12: A Major Stride for Gaming". Nvidia. Retrieved 2015-09-19.
18. "Nvidia GPUDirect". Nvidia Developer. October 6, 2015. Retrieved February 5, 2019.
19. Angelini, Chris (March 22, 2012). "Benchmark Results: NVEnc And MediaEspresso 6.5". Tom's Hardware. Retrieved September 19, 2015.
20. Angelini, Chris (November 7, 2013). "Nvidia GeForce GTX 780 Ti Review: GK110, Fully Unlocked". Tom's Hardware. p. 1. Retrieved December 6, 2015. "The card's driver deliberately operates GK110's FP64 units at 1/8 of the GPU's clock rate. When you multiply that by the 3:1 ratio of single- to double-precision CUDA cores, you get a 1/24 rate"
21. Smith, Ryan (13 September 2012). "The Nvidia GeForce GTX 660 Review: GK106 Fills Out The Kepler Family". AnandTech. p. 1. Retrieved 6 December 2015.