Parallel Programming with Hints (reposted)
Wouldn’t it be nice to be able to write sequential programs and let the compiler or the runtime automatically find opportunities for parallel execution? Something like this is already being done on a micro scale inside processors. As much as possible, they try to execute individual instructions in parallel. Of course they have to figure out data dependencies and occasionally stall the pipeline or idle while waiting for a memory fetch. More sophisticated processors are even able to speculate–they guess a value that hasn’t been calculated or fetched yet and run speculative execution. When the value is finally available, they compare it with their guess and either discard or commit the execution.
If processors can do it, why can’t a language runtime do the same on a larger scale? It would solve the problem of effectively using all those cores that keep multiplying like rabbits on a chip.
The truth is, we haven’t yet figured out how to do it. Automatic parallelization is, in general, too complex. But if the programmer is willing to provide just enough hints to the compiler, the runtime might figure things out. Such a programming model is called semi-implicit parallelism, and it has been implemented in two very different environments: Haskell and .NET. The two relevant papers I’m going to discuss are Runtime Support for Multicore Haskell and The Design of a Task Parallel Library.
In both cases the idea is to tell the compiler that certain calculations may be done in parallel. It doesn’t necessarily mean that the code will be executed in multiple threads–the runtime makes this decision depending on the number of cores and their availability. The important thing is that, other than providing those hints, the programmer doesn’t have to deal with threads or, at least in Haskell, with synchronization. I will start with Haskell but, if you’re not into functional programming, you may skip to .NET and the Task Parallel Library (and hopefully come back to Haskell later).
Multicore Haskell
In Haskell, you hint at parallel execution using the par combinator, which you insert between two expressions, like this: e1 `par` e2. The runtime then creates a spark for the left hand side expression (here, e1). A spark is a deferred calculation that may be executed in parallel (in the .NET implementation a spark is called a task). Notice that, in Haskell, which is a lazy programming language, all calculations are, by design, deferred until their results are needed, at which point their evaluation is forced. The same mechanism kicks in when the result of a spark is needed and it hasn’t been calculated in parallel yet: in such a case the spark is immediately evaluated in the current thread (thus forfeiting the chance for parallel execution). The hope, though, is that enough sparks will be ready before their results are needed, leading to an overall speedup.

To further control when sparks are evaluated (whether in parallel or not), Haskell provides another combinator, pseq, which enforces sequencing. You insert it between two expressions, e1 `pseq` e2, to make sure that the left hand side, e1, is evaluated before the evaluation of e2 is started.
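To make the two combinators concrete, here is a minimal sketch of the usual idiom for evaluating two independent computations side by side. It is my own illustration rather than code from the paper; pfib is a deliberately naive, made-up work function:

    import Control.Parallel (par, pseq)

    -- A deliberately naive Fibonacci, used only to generate CPU work.
    pfib :: Int -> Integer
    pfib n | n < 2     = fromIntegral n
           | otherwise = pfib (n - 1) + pfib (n - 2)

    -- Spark the evaluation of x, evaluate y in the current thread,
    -- and only then combine the two results.
    twoInParallel :: Integer
    twoInParallel = x `par` (y `pseq` x + y)
      where x = pfib 33
            y = pfib 32

The pseq matters here: writing x `par` (x + y) would risk demanding x immediately, in which case the spark fizzles and the hint is wasted.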
I’ll show you how to parallelize the standard map function (in C++ it would be called std::transform). Map applies a function, passed to it as the first argument, to each element of a list, which is passed to it as the second argument. As a Haskell refresher, let me walk you through the implementation of map.
    map f []     = []
    map f (x:xs) = y:ys
      where y  = f x
            ys = map f xs
Map is implemented recursively, so its definition is split into the base case and the recursive case. The base case just states that map applied to an empty list, [], returns an empty list (it ignores the first argument, f).
If, on the other hand, the list is non-empty, it can be split into its head and tail. This is done through pattern matching–the pattern being (x:xs), where x matches the head element and xs the (possibly empty) tail of the list.
In that case, map is defined to return a new list, (y:ys), whose head is y and tail is ys. The where clause defines those two: y is the result of the application of the function f to x, and ys is the result of the recursive application of map to the tail of the list, xs.
The parallel version does the same (it is semantically equivalent to the sequential version), but it gives the runtime the opportunity to perform function applications in parallel. It also waits for the evaluation to finish.
    parMap f []     = []
    parMap f (x:xs) = y `par` (ys `pseq` y:ys)
      where y  = f x
            ys = parMap f xs
The important changes are: y, the new head, may be evaluated in parallel with the tail (the use of the par combinator). The result, y:ys, is returned only when the tail part, ys, has been evaluated (the use of the pseq combinator).
The tail calculation is also split into parallel computations through recursive calls to parMap. The net result is that all applications of f to elements of the list are potentially done in parallel. Because of the use of pseq, all the elements (except for the very first one) are guaranteed to have been evaluated before parMap returns.
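If you want to try this yourself, the following self-contained sketch (my addition, not part of the original text) wraps the parMap above in a small program; expensive is just a made-up CPU-bound stand-in for f:

    import Control.Parallel (par, pseq)

    parMap :: (a -> b) -> [a] -> [b]
    parMap f []     = []
    parMap f (x:xs) = y `par` (ys `pseq` y:ys)
      where y  = f x
            ys = parMap f xs

    -- A made-up CPU-bound function standing in for real work.
    expensive :: Int -> Int
    expensive n = sum [1 .. 200000 + n]

    main :: IO ()
    main = print (sum (parMap expensive [1 .. 64]))

Compile with something like ghc -threaded -rtsopts ParMapDemo.hs and run with ./ParMapDemo +RTS -N4 to give the runtime four cores to turn sparks into parallel work; without -threaded the sparks are simply evaluated in the main thread.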
It’s instructive to walk through the execution of parMap step-by-step. For simplicity, let’s perform parMap on a two-element list, [a, b].
First we pattern-match this list to x = a and xs = [b]. We create the first spark for the evaluation of (y = f a) and then proceed with the evaluation of the right hand side of par, (ys `pseq` y:ys). Here ys = parMap f [b].
Because of the `pseq`, we must evaluate ys next. To do that, we call (parMap f [b]). Now the list [b] is split into the head, b, and the empty tail, []. We create a spark to evaluate y' = f b and proceed with the right-hand side, (ys' `pseq` y':ys').
Again, the `pseq` waits for the evaluation of ys' = parMap f []. But this one is easy: we apply the base definition of parMap, which returns an empty list.
Now we are ready to retrace our steps. The right hand side of the last `pseq` re-forms the list y':[]. But that’s the ys the previous `pseq` was waiting for. It can now proceed, producing y:(y':[]), which is the same as [y, y'] or [f a, f b], which is what we were expecting.
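The same walkthrough can be compressed into a purely illustrative reduction trace:

    -- parMap f [a, b]
    --   = y `par` (ys `pseq` y:ys)        where y  = f a, ys  = parMap f [b]
    -- parMap f [b]
    --   = y' `par` (ys' `pseq` y':ys')    where y' = f b, ys' = parMap f []
    -- parMap f []   = []
    -- hence ys' = [], y':ys' = [f b], and finally y:ys = [f a, f b]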
Notice the complete absence of explicit synchronization in this code. This is due to the functional nature of Haskell. There’s no shared mutable state, so no locking or atomic operations are needed. (More explicit concurrency models are also available in Haskell, using MVars or transactional memory.)
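As a side note, here is a small sketch (mine, not from the article) of the more explicit style alluded to above: two threads updating a shared counter through an MVar, which serializes access much like a lock would:

    import Control.Concurrent (forkIO)
    import Control.Concurrent.MVar

    main :: IO ()
    main = do
      counter <- newMVar (0 :: Int)
      done    <- newEmptyMVar
      let worker xs = do
            -- modifyMVar_ takes the value, applies the update, and puts it
            -- back, so concurrent increments cannot be lost.
            mapM_ (\x -> modifyMVar_ counter (\c -> return (c + x))) xs
            putMVar done ()
      _ <- forkIO (worker [1 .. 500])
      _ <- forkIO (worker [501 .. 1000])
      takeMVar done
      takeMVar done
      readMVar counter >>= print   -- prints 500500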
Task Parallel Library in .NET
It’s no coincidence that many ideas from Haskell end up in Microsoft languages. Many Haskell programmers work for Microsoft Research, including the ultimate guru, Simon Peyton Jones. The Microsoft Task Parallel Library (TPL) translates the ideas from Multicore Haskell to .NET. One of its authors, Daan Leijen, is a Haskell programmer who, at some point, collaborated with Simon Peyton Jones. Of course, a .NET language like C# presents a different set of obstacles to parallel programming. It operates on mutable state which needs protection from concurrent access. This protection (which, incidentally, is the hardest part of multithreaded programming) is left to the programmer.
Here’s an example of an algorithm in C# with hidden opportunities for parallel implementation. MatrixMult multiplies two matrices. It iterates over the columns and rows of the result matrix. The value that goes at their intersection is calculated by the innermost loop.
    void MatrixMult(int size, double[,] m1, double[,] m2, double[,] result)
    {
        for (int i = 0; i < size; i++) {    // calculate the i'th column
            for (int j = 0; j < size; j++) {
                result[i, j] = 0;
                for (int k = 0; k < size; k++) {
                    result[i, j] += m1[i, k] * m2[k, j];
                }
            }
        }
    }
Each column of the result could potentially be evaluated in parallel. The problem is, the size of the array and the number of processor cores might be unknown until the program is run. Creating a large number of threads when there are only a few cores may lead to a considerable slowdown, which is the opposite of what we want. So the goal of TPL is to let the programmer express the potential for parallel execution but leave it to the runtime to create an optimal number of threads.
The programmer splits the calculation into tasks (the equivalent of Haskell sparks) by making appropriate library calls; and the runtime maps those tasks into OS threads–many-to-one, if necessary.
Here’s how the same function looks with parallelization hooks.
    void ParMatrixMult(int size, double[,] m1, double[,] m2, double[,] result)
    {
        Parallel.For(0, size, delegate(int i)
        {
            for (int j = 0; j < size; j++) {
                result[i, j] = 0;
                for (int k = 0; k < size; k++) {
                    result[i, j] += m1[i, k] * m2[k, j];
                }
            }
        });
    }
Because of clever formatting, this version looks very similar to the original. The outer loop is replaced by the call to Parallel.For, which is one of the parallelizing TPL functions. The inner loops are packed into a delegate.

This delegate is assigned to a task (the analog of a Haskell spark) that is potentially run in a separate thread. Here the delegate is actually a closure–it captures the local variables size, m1, m2 and result. The latter is actually modified inside the delegate. This is how shared mutable state sneaks into potentially multi-threaded execution. Luckily, in this case, such sharing doesn’t cause races. Consider, however, what would happen if we changed the types of the matrices from double[,] to char[,]. Parallel updates to neighboring byte-sized array elements may inadvertently encroach on each other and lead to incorrect results. Programmer beware! (This is not a problem in Haskell because of the absence of mutable state.)
But even if the programmer is aware of potential sharing and protects shared variables with locks, it’s not the end of the story. Consider this example:
    int sum = 0;
    Parallel.For(0, 10000, delegate(int i)
    {
        if (isPrime(i)) {
            lock (this) {
                sum += i;
            }
        }
    });
The captured variable, sum, is protected by the lock, so data races are avoided. This lock, however, becomes a performance bottleneck–it is taken for every prime number in the range.

Now consider the fact that, on a 4-core machine, we’ll be running 10000 tasks distributed between about 4 threads. It would be much more efficient to accumulate the sum in four local variables–no locking necessary–and add them together only at the end of the calculation. This recipe can be expressed abstractly as a map/reduce type of algorithm (a generalization of the C++ std::accumulate). The tasks are mapped into separate threads, which work in parallel, and the results are then reduced into the final answer.
Here’s how map/reduce is expressed in TPL:
    int sum = Parallel.Aggregate(
        0, 10000,   // domain
        0,          // initial value
        delegate(int i)        { return (isPrime(i) ? i : 0); },
        delegate(int x, int y) { return x + y; }
    );
The first delegate, which is run by 10000 tasks, does not modify any shared state–it just returns its result, which is internally accumulated in some hidden local variable. The second delegate–the “reduce” part of the algorithm–is called when there’s a need to combine results from two different tasks.
The Place for Functional Programming
Notice that the last example was written in a very functional style. In particular, you don’t see any mutable state. The delegates are pure functions. This is no coincidence: functional programming has many advantages in parallel programming.
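To make that point concrete in Haskell terms, here is a hedged sketch (my own, not from the article) of the same prime-summing computation written with the Strategies layer that builds on par; the names isPrime and sumOfPrimes and the chunk size of 1000 are illustrative choices:

    import Control.Parallel.Strategies (parListChunk, rseq, using)

    -- A simple trial-division test, standing in for the C# isPrime.
    isPrime :: Int -> Bool
    isPrime n = n > 1 && all (\d -> n `mod` d /= 0) [2 .. isqrt n]
      where isqrt m = floor (sqrt (fromIntegral m :: Double))

    -- Pure map/reduce: map each i to its contribution, evaluate the list
    -- in parallel chunks of 1000 elements, then reduce with sum. No locks.
    sumOfPrimes :: Int
    sumOfPrimes = sum (map contribution [0 .. 9999] `using` parListChunk 1000 rseq)
      where contribution i = if isPrime i then i else 0

The chunking plays the same role as TPL’s mapping of many tasks onto a few threads: each spark gets enough work to be worth scheduling.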
I’ve been doing a lot of multi-threaded programming in C++ lately and I noticed how my style is gradually shifting from object-oriented towards functional. This process is accelerating as functional features keep seeping into the C++ standard. Obviously, lambdas are very useful, but so is the move semantics made possible by rvalue references, especially when passing data between threads. It’s becoming more and more obvious that, in order to be a good C++ programmer, one needs to study other languages. I recommend Haskell and Scala in particular. I’ll be blogging about them in the future.
Bibliography
- Simon Marlow, Simon Peyton Jones, and Satnam Singh, Runtime Support for Multicore Haskell
- Daan Leijen, Wolfram Schulte, and Sebastian Burckhardt, The Design of a Task Parallel Library
- A video introduction to Haskell by Simon Peyton Jones: Part I, Part II, together with the slides.