GPGPU, since nVidia introduced it with CUDA in 2007, has been continuously changing the paradigm of software/hardware computing. It has already proven to be a very competitive candidate in supercomputing compared to conventional CPU processors. Why? Because of its excellent cost/performance ratio: invest your precious $1K into a GPGPU card and you will likely get the performance of a $3K-4K high-end CPU-only machine. Good bargain.
So parallelism is the way to go. The more parallel, the better. AMD and nVidia are locked in a rocket-speed performance race. The world is changing. But before we praise the new hardware god, there is a big problem -- software is still having a hard time shifting to parallel. The algorithms and optimizations we use today were mostly crafted and tuned for single-threaded environments. To exploit this new parallel computing environment, people need to either: 1. develop new algorithms for the new machines, or 2. port the old algorithms over.
Okay, enough chit-chat about the parallelism revolution. Today let's take a look at an interesting paper that deals with the second case. Oh, btw, it is one of the two winners of the best paper award at CGO 2013:
Bin Ren, Gagan Agrawal, Jim Larus, Todd Mytkowicz, Tomi Poutanen (Microsoft), and Wolfram Schulte for “SIMD Parallelization of Applications that Traverse Irregular Data Structures”
www.cse.ohio-state.edu/~ren/papers/CGO13_BinRen.pdf
For those who are not familiar with them: a vector instruction is like issuing a bunch of copies of the same scalar instruction in parallel (a vector). Suppose we want to add array A to array B and store the results into array C. Usually we write:
for (int i = 0; i < n; ++i)
c[i] = a[i] + b[i];
Without optimization, each element addition will roughly be translated into:
__asm {
MOV EAX, [header_A + i * 4]; // load a[i]
ADD EAX, [header_B + i * 4]; // add b[i]
MOV [header_C + i * 4], EAX; // store into c[i]
}
A vectorizing compiler will first unroll it into:
for (int i = 0; i < n; i+=4)
{
c[i+0] = a[i+0] + b[i+0];
c[i+1] = a[i+1] + b[i+1];
c[i+2] = a[i+2] + b[i+2];
c[i+3] = a[i+3] + b[i+3];
}
This eliminates 3/4 of the branches in the loop. The compiler then transforms it into the following (or, usually, idiom recognition will do both steps in one go):
for (int i = 0; i < n/4; ++i)
__asm {
MOVUPS XMM0, [header_A + i * 16]; // load 128 bits of A
MOVUPS XMM1, [header_B + i * 16]; // load 128 bits of B
ADDPS XMM0, XMM1; // add 4 elements at once
MOVUPS [header_C + i * 16], XMM0; // store 128 bits into C
}
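In practice you would rarely hand-write the assembly; compilers expose the same instructions through intrinsics. Here is a minimal sketch of the equivalent loop with SSE intrinsics (my own illustration, assuming float arrays and n divisible by 4):

#include <xmmintrin.h>

void vec_add(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);  // load 4 floats (MOVUPS)
        __m128 vb = _mm_loadu_ps(b + i);
        __m128 vc = _mm_add_ps(va, vb);   // add 4 elements (ADDPS)
        _mm_storeu_ps(c + i, vc);         // store 4 floats
    }
}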
These instructions are already exploited extensively at the low level, e.g. in compilers' loop vectorization (gcc supports a whole bunch of them; see the reference below *), video compression, graphics rendering, signal processing, etc. Unlike conventional scalar instructions, vector instructions only work with dedicated vector registers (XMM0, XMM1 in the above example), and they require the vector operands to be contiguous (MOVUPS loads 128 bits of data at a time). In order to exploit SIMD instructions, programmers have to make the data layout of an application contiguous.
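To make the contiguity point concrete (my own toy example, not from the paper): an array of structs strides each field across memory, while a struct of arrays keeps each field packed so a single MOVUPS-style load can grab four consecutive values:

struct PointAoS { float x, y, z; };     // x values sit 12 bytes apart
struct PointAoS points_aos[1024];       // hard to load 4 x's in one instruction

struct PointsSoA { float x[1024], y[1024], z[1024]; };
struct PointsSoA points_soa;            // x[0..3] are contiguous: SIMD-friendly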
But what about data structures with pointers? Usually an abstract data structure, such as a linked list, has no data contiguity and is therefore not eligible for SIMD acceleration. This paper tries to tackle exactly that problem -- vectorizing the execution of operations on abstract data structures.
A vector operation requires not only data conformity but also the absence of control flow branches. Consider this code:
for (int i = 0; i < n; ++i)
{
if (a[i] > b[i]) c[i] = a[i] - b[i];
else c[i] = a[i] + 0;
}
A branch in the loop body makes the compiler's vectorization answer a flat NO (though, heh, we can actually still do it manually). What should one vector instruction do if a[0] > b[0] while a[1] < b[1]? A vector instruction is an atomic operation, and within a single thread there is no way to execute two different paths in one instruction. Too bad: as written, there is no way to generate SIMD instructions for this loop.
Let's see how nVidia's CUDA deals with this branching problem. The programming paradigm of nVidia's GPGPU, SIMT (single instruction, multiple threads), is a slightly different flavor of SIMD, in that it is scalable. Here is how the branching array addition example works there: for an array of length n, the GPU issues n concurrent threads, and each thread is given an individual, unique id ranging from 0 to n-1. With that id, each thread can identify the addresses of its own a[id], b[id] and c[id]; each thread is responsible for operating on a single element, and the n threads accomplish the task collectively.
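In CUDA-ish code, a minimal sketch of that per-thread view (the kernel name and launch configuration are mine, purely for illustration):

__global__ void branchy_add(const float *a, const float *b, float *c, int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;       // unique thread id
    if (id < n)
        c[id] = (a[id] > b[id]) ? a[id] - b[id] : a[id];  // one element per thread
}

// launched as, e.g.: branchy_add<<<(n + 255) / 256, 256>>>(a, b, c, n);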
Looks amazing, but the GPGPU hardware is still a SIMD machine underneath. Given a group of threads, when a branch is encountered, the group's execution is serialized: first each thread evaluates a[i] > b[i]; then the threads whose result was true execute their path while the rest idle; then vice versa. GPGPUs are designed like this for performance reasons -- a SIMD architecture means fewer transistors, more parallelism.
So in short, SIMD hates control flow dependency. The solution, as stated in the paper, is to turn control dependencies into data dependencies. Looking back at the last example, the difficulty is that the computation the program performs differs between the two branch paths: one is f(c[i]) = a[i] - b[i] and the other is g(c[i]) = a[i] + 0. If we want to vectorize it, we have to unify the two into one interface, such as:
h(c[i]) = a[i] + ( b[i] * (a[i] > b[i] ? -1 : 0) )
And the branching part, (a[i] > b[i] ? -1 : 0), can be evaluated with a vector compare (e.g. _mm_cmpgt_epi32, which yields all-ones, i.e. -1, where the predicate holds and all-zeros elsewhere). Voila, the branch goes away, and we can continue to generate a SIMD version of it!
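Putting it together, here is a minimal branchless sketch with SSE2 intrinsics (my own illustration, assuming 32-bit integer arrays and n divisible by 4; it uses the compare mask with an AND and a subtract instead of the multiply above, which is the natural instruction-level form of the same h):

#include <emmintrin.h>

void branchless_sub(const int *a, const int *b, int *c, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128i va   = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb   = _mm_loadu_si128((const __m128i *)(b + i));
        __m128i mask = _mm_cmpgt_epi32(va, vb);  // -1 where a[i] > b[i], else 0
        __m128i vbm  = _mm_and_si128(vb, mask);  // keep b[i] only where predicate holds
        __m128i vc   = _mm_sub_epi32(va, vbm);   // a[i] - b[i], or a[i] - 0
        _mm_storeu_si128((__m128i *)(c + i), vc);
    }
}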
Of course I made up this silly example myself, but it at least conveys the idea: unify the interface to eliminate branches. In the paper the authors give two applications, random forests (traversing binary trees) and NFA traversal, both of which involve decision making (jump to the left or right child, evaluate a non-leaf node or a leaf node, change automata state). I found it amusing that the SIMD NFA is like a flattened GPGPU implementation, but still, the idea is brilliant. The rest of the paper is not very hard to follow; I recommend that those interested in codegen take a look. A tiny sketch of the tree case follows.
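Just to give the flavor of the tree case (my own sketch, not the paper's actual data layout): if the nodes live in an array, the left-or-right decision itself becomes index arithmetic, with no control-flow branch left to break vectorization:

struct Node {
    float threshold;  // split value tested at this node
    int   feature;    // which input feature to test
    int   child[2];   // child[0] = left child index, child[1] = right child index
};

int traverse(const struct Node *nodes, const float *x, int root, int depth)
{
    int cur = root;
    for (int d = 0; d < depth; ++d) {
        int go_right = x[nodes[cur].feature] > nodes[cur].threshold;  // 0 or 1
        cur = nodes[cur].child[go_right];  // data dependency, not control dependency
    }
    return cur;
}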
* Auto-vectorization in GCC: http://gcc.gnu.org/projects/tree-ssa/vectorization.html