VGPR: vector general-purpose register
LDS atomics are performed in the LDS hardware. Although the ALUs are not used directly for these operations, the LDS incurs latency while it executes them. If the algorithm does not require write-to-read reuse (the data is read only), it is usually better to use the image dataflow (see the right side of Figure 5.5), because that path benefits from the cache hierarchy.
Buffer reads may use L1 and L2. When caching is not used for a buffer, reads from that buffer bypass L2. After a buffer read, the cache line is invalidated, so the next read fetches it again (whether that read comes from the same wavefront or from a different clause). After a buffer write, the changed parts of the cache line are written to memory.
Buffers and images are written through the texture L2 cache, but this cache is flushed immediately after an image write.
In GCN devices, both reads and writes go through L1 and L2.
Data in private memory is placed in registers first. If a kernel uses more private memory than fits in registers, or uses dynamic indexing on private arrays, the overflow data is placed (spilled) into scratch memory. Scratch memory is a private subset of global memory, so spilling can degrade performance dramatically.
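The placement rule above can be sketched as a toy host-side model. This is only an illustration under stated assumptions: the function name `place_private_data`, the split into scalar and array dwords, and the `register_budget` parameter are all hypothetical, and a real compiler's register allocation is far more involved. In particular, the model simply treats a dynamically indexed private array as living entirely in scratch memory.

```python
def place_private_data(scalar_dwords, array_dwords, register_budget,
                       dynamic_indexing=False):
    """Toy model: return (dwords_in_registers, dwords_in_scratch).

    Assumptions (not taken from any compiler):
    - a dynamically indexed private array is placed entirely in scratch;
    - everything else fills registers first, and the excess spills.
    """
    scratch = 0
    candidates = scalar_dwords
    if dynamic_indexing:
        scratch += array_dwords      # modelled as never held in registers
    else:
        candidates += array_dwords   # constant indices: eligible for registers
    in_registers = min(candidates, register_budget)
    scratch += candidates - in_registers  # overflow is spilled
    return in_registers, scratch


# Fits in the (hypothetical) budget: nothing spills.
fits = place_private_data(20, 16, register_budget=64)
# Too much private data: the excess goes to scratch.
overflow = place_private_data(20, 64, register_budget=64)
# Dynamic indexing: the array goes to scratch even though it would fit.
dynamic = place_private_data(20, 16, register_budget=64, dynamic_indexing=True)
```

Any nonzero scratch figure in this model corresponds to traffic that goes to global memory rather than registers, which is why spilling is costly.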
Global memory can reside in the high-speed GPU memory (VRAM) or in host memory, which is accessed over the PCIe bus. A work-item can access global memory either as a buffer object or as an image object. Buffer objects are generally read and written directly by the work-items. Data is accessed through the L2 and L1 data caches on the GPU. This limited form of caching provides read coalescing among the work-items in a wavefront. Similarly, writes are executed through the texture L2 cache.
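To make the idea of read coalescing concrete, the following sketch counts how many distinct cache lines one wavefront touches for a given access stride. It is a simplified model, not a description of the hardware: the 64-byte line size and 4-byte element size are assumptions chosen for illustration, and the function name `cache_lines_touched` is hypothetical.

```python
def cache_lines_touched(num_work_items, stride_bytes, element_bytes=4,
                        line_bytes=64, base=0):
    """Count distinct cache lines read when work-item i loads
    element_bytes bytes starting at base + i * stride_bytes."""
    lines = set()
    for i in range(num_work_items):
        first = base + i * stride_bytes
        last = first + element_bytes - 1
        # An element may straddle a line boundary, so record every line it covers.
        for line in range(first // line_bytes, last // line_bytes + 1):
            lines.add(line)
    return len(lines)


WAVEFRONT = 64  # work-items per wavefront on GCN

# Adjacent work-items read adjacent 4-byte elements: 256 bytes in total.
contiguous = cache_lines_touched(WAVEFRONT, stride_bytes=4)
# Each work-item reads from a different line: no sharing between work-items.
strided = cache_lines_touched(WAVEFRONT, stride_bytes=64)
```

Under these assumptions the contiguous pattern touches 4 lines while the strided pattern touches 64, so the same number of useful bytes costs sixteen times as many line fetches.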
Global atomic operations are executed through the texture L2 cache. Atomic instructions that return a value to the kernel are handled similarly to fetch instructions: the kernel must use S_WAITCNT to ensure the results have been written to the destination GPR before using the data.
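A common reason for an atomic to return a value is slot allocation: each worker atomically increments a shared counter and uses the returned old value as its output index. The sketch below is a host-side Python analogue using threads and a lock, not GPU code; `AtomicCounter`, `fetch_add`, and `compact` are hypothetical names. It shows why the returned value must be available before the next step: the store depends on it, which is the dependency that S_WAITCNT enforces on the GPU.

```python
import threading


class AtomicCounter:
    """Lock-based stand-in for a global atomic add that returns the old value."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def fetch_add(self, delta):
        with self._lock:
            old = self._value
            self._value += delta
            return old


def compact(values, predicate, num_threads=4):
    """Copy the values that satisfy predicate into a dense output list.
    The output order depends on thread scheduling."""
    counter = AtomicCounter()
    out = [None] * len(values)

    def worker(chunk):
        for v in chunk:
            if predicate(v):
                slot = counter.fetch_add(1)  # the result must arrive first...
                out[slot] = v                # ...because the store uses it

    threads = [threading.Thread(target=worker, args=(values[i::num_threads],))
               for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out[:counter.fetch_add(0)]  # adding 0 just reads the final count


multiples_of_three = compact(list(range(100)), lambda v: v % 3 == 0)
```

Because each returned slot index is unique, no two workers write to the same position, even though the order of the output is not deterministic.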