Parallel Programming with Hints

Wouldn’t it be nice to be able to write sequential programs and let the compiler or the runtime automatically find opportunities for parallel execution? Something like this is already being done on a micro scale inside processors. As much as possible, they try to execute individual instructions in parallel. Of course they have to figure out data dependencies and occasionally stall the pipeline or idle while waiting for a memory fetch. More sophisticated processors are even able to speculate–they guess a value that hasn’t been calculated or fetched yet and run speculative execution. When the value is finally available, they compare it with their guess and either discard or commit the execution.

If processors can do it, why can’t a language runtime do the same on a larger scale? It would solve the problem of effectively using all those cores that keep multiplying like rabbits on a chip.

The truth is, we haven’t yet figured out how to do it. Automatic parallelization is, in general, too complex. But if the programmer is willing to provide just enough hints to the compiler, the runtime might figure things out. Such a programming model is called semi-implicit parallelism and has been implemented in two very different environments, in Haskell and in .NET. The two relevant papers I’m going to discuss are Runtime Support for Multicore Haskell and The Design of a Task Parallel Library.

In both cases the idea is to tell the compiler that certain calculations may be done in parallel. It doesn’t necessarily mean that the code will be executed in multiple threads–the runtime makes this decision depending on the number of cores and their availability. The important thing is that, other than providing those hints, the programmer doesn’t have to deal with threads or, at least in Haskell, with synchronization. I will start with Haskell but, if you’re not into functional programming, you may skip to .NET and the Task Parallel Library (and hopefully come back to Haskell later).

Multicore Haskell

In Haskell, you hint at parallel execution using the par combinator, which you insert between two expressions, like this: e1 `par` e2. The runtime then creates a spark for the left hand side expression (here, e1). A spark is a deferred calculation that may be executed in parallel (in the .NET implementation, a spark is called a task). Notice that, in Haskell, which is a lazy programming language, all calculations are, by design, deferred until their results are needed, at which point their evaluation is forced. The same mechanism kicks in when the result of a spark is needed–and it hasn’t been calculated in parallel yet. In such a case the spark is immediately evaluated in the current thread (thus forfeiting the chance for parallel execution). The hope, though, is that enough sparks will be ready before their results are needed, leading to an overall speedup.

To further control when the sparks are evaluated (whether in parallel or not), Haskell provides another combinator, pseq, which enforces sequencing. You insert it between two expressions, e1 `pseq` e2, to make sure that the left hand side, e1, is evaluated before the evaluation of e2 is started.
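
To make this concrete, here’s a minimal sketch of the two combinators working together. This is just an illustration, not code from the paper; it assumes the par and pseq exported by Control.Parallel in the standard parallel package, and fib is nothing more than a deliberately expensive toy function.

import Control.Parallel (par, pseq)

-- a deliberately slow, exponential Fibonacci
fib :: Int -> Int
fib n | n < 2     = n
      | otherwise = fib (n - 1) + fib (n - 2)

main :: IO ()
main = print (a `par` (b `pseq` a + b))
  where a = fib 33  -- sparked: may be evaluated on another core
        b = fib 32  -- forced first, in the current thread, by pseq

The spark for a may run in parallel while the current thread evaluates b; by the time a + b demands a, its value is either already computed or is evaluated on the spot.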

I’ll show you how to parallelize the standard map function (in C++ it would be called std::transform). Map applies a function, passed to it as the first argument, to each element of a list, which is passed to it as the second argument. As a Haskell refresher, let me walk you through the implementation of map.

map f []     = []
map f (x:xs) = y:ys
    where y  = f x
          ys = map f xs

Map is implemented recursively, so its definition is split into the base case and the recursive case. The base case just states that map applied to an empty list, [], returns an empty list (it ignores the first argument, f).

If, on the other hand, the list is non-empty, it can be split into its head and tail. This is done through pattern matching–the pattern being (x:xs), where x matches the head element and xs the (possibly empty) tail of the list.

In that case, map is defined to return a new list, (y:ys) whose head is y and tail is ys. The where clause defines those two: y is the result of the application of the function f to x, and ys is the result of the recursive application of map to the tail of the list, xs.

The parallel version does the same (it is semantically equivalent to the sequential version), but it gives the runtime the opportunity to perform function applications in parallel. It also waits for the evaluation to finish.

parMap f []     = []
parMap f (x:xs) = y `par` (ys `pseq` y:ys)
    where y  = f x
          ys = parMap f xs

The important changes are: y, the new head, may be evaluated in parallel with the tail (the use of the par combinator). The result, y:ys, is returned only when the tail part, ys, has been evaluated (the use of the pseq combinator).

The tail calculation is also split into parallel computations through recursive calls to parMap. The net result is that all applications of f to elements of the list are potentially done in parallel. Because of the use of pseq, the entire spine of the list is traversed, and a spark created for every element, before parMap returns. (Strictly speaking, pseq only evaluates its left hand side to weak head normal form, so the element values themselves are computed by the sparks, or on demand when the caller consumes them.)
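
Here’s a short usage sketch (mine, not the paper’s) showing how one might actually run parMap; it reuses the toy fib function from the earlier snippet, and the file name and core count are arbitrary. The program must be compiled with GHC’s threaded runtime for the sparks to be turned into parallel work.

-- compile with: ghc -O2 -threaded parmap.hs
-- run with:     ./parmap +RTS -N4    (let the runtime use up to 4 cores)
main :: IO ()
main = print (sum (parMap fib [30, 31, 32, 33]))

The final sum demands every element of the result, so each spark is either picked up by an idle core in the meantime or evaluated on demand in the main thread.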

It’s instructive to walk through the execution of parMap step-by-step. For simplicity, let’s perform parMap on a two-element list, [a, b].

First we pattern-match this list to x = a and xs = [b]. We create the first spark for the evaluation of (y = f a) and then proceed with the evaluation of the right hand side of par, (ys `pseq` y:ys). Here ys = parMap f [b].

Because of the `pseq`, we must evaluate ys next. To do that, we call (parMap f [b]). Now the list [b] is split into the head, b, and the empty tail, []. We create a spark to evaluate y' = f b and proceed with the right-hand side, (ys' `pseq` y':ys').

Again, the `pseq` waits for the evaluation of ys' = parMap f []. But this one is easy: we apply the base definition of parMap, which returns an empty list.

Now we are ready to retrace our steps. The right hand side of the last `pseq` re-forms the list y':[]. But that’s the ys the previous `pseq` was waiting for. It can now proceed, producing y:(y':[]), which is the same as [y, y'] or [f a, f b], which is what we were expecting.

Notice the complete absence of explicit synchronization in this code. This is due to the functional nature of Haskell: there’s no shared mutable state, so no locking or atomic operations are needed. (More explicit concurrency models are also available in Haskell, using MVars or transactional memory.)
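
As a glimpse of that explicit model (again a sketch of mine, not code from the paper): an MVar is a synchronized box that may be empty or full, and takeMVar blocks until some thread fills it.

import Control.Concurrent

main :: IO ()
main = do
  box <- newEmptyMVar                  -- an empty synchronized cell
  _ <- forkIO (putMVar box (sum [1 .. 1000000 :: Int]))  -- worker fills it
  result <- takeMVar box               -- blocks until the worker is done
  print result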

Task Parallel Library in .NET

It’s no coincidence that many ideas from Haskell end up in Microsoft languages. Many Haskell programmers work for Microsoft Research, including the ultimate guru, Simon Peyton Jones. The Microsoft Task Parallel Library (TPL) translates the ideas from Multicore Haskell to .NET. One of its authors, Daan Leijen, is a Haskell programmer who, at some point, collaborated with Simon Peyton Jones. Of course, a .NET language like C# presents a different set of obstacles to parallel programming. It operates on mutable state, which needs protection from concurrent access. This protection (which, incidentally, is the hardest part of multithreaded programming) is left to the programmer.

Here’s an example of an algorithm in C# with hidden opportunities for parallel implementation. MatrixMult multiplies two matrices. It iterates over the rows and columns of the result matrix; the value that goes at their intersection is calculated by the innermost loop.

void MatrixMult(int size, double[,] m1, double[,] m2, double[,] result)
{
   for(int i = 0; i < size; i++){
      // calculate the i'th row of the result
      for(int j = 0; j < size; j++){
         result[i, j] = 0;
         for(int k = 0; k < size; k++){
              result[i, j] += m1[i, k] * m2[k, j];
         }
      }
   }
}

Each row of the result could potentially be evaluated in parallel. The problem is, the size of the array and the number of processor cores might be unknown until the program is run. Creating a large number of threads when there are only a few cores may lead to a considerable slowdown, which is the opposite of what we want. So the goal of TPL is to let the programmer express the potential for parallel execution but leave it to the runtime to create an optimal number of threads.

The programmer splits the calculation into tasks (the equivalent of Haskell sparks) by making appropriate library calls; and the runtime maps those tasks into OS threads–many-to-one, if necessary.

Here’s how the same function looks with parallelization hooks.

void ParMatrixMult(int size, double[,] m1, double[,] m2, double[,] result)
{
   Parallel.For(0, size, delegate(int i)
   {
      for(int j = 0; j < size; j++){
         result[i, j] = 0;
         for(int k = 0; k < size; k++){
              result[i, j] += m1[i, k] * m2[k, j];
         }
      }
   });
}

Because of clever formatting, this version looks very similar to the original. The outer loop is replaced by the call to Parallel.For, which is one of the parallelizing TPL functions. The inner loops are packed into a delegate.

This delegate is assigned to a task (the analog of a Haskell spark) that is potentially run in a separate thread. Here the delegate is actually a closure–it captures the local variables size, m1, m2, and result. The latter is actually modified inside the delegate. This is how shared mutable state sneaks into potentially multi-threaded execution. Luckily, in this case, such sharing doesn’t cause races. Consider, however, what would happen if we changed the type of the matrices from double[,] to char[,]. Parallel updates to neighboring byte-sized array elements might inadvertently encroach on each other and lead to incorrect results. Programmer beware! (This is not a problem in Haskell because of the absence of mutable state.)

But even if the programmer is aware of potential sharing and protects shared variables with locks, it’s not the end of the story. Consider this example:

int sum = 0;
Parallel.For(0, 10000, delegate(int i)
{
   if(isPrime(i)){
      lock(this) { sum += i; }
   }
});

The captured variable, sum, is protected by the lock, so data races are avoided. This lock, however, becomes a performance bottleneck–it is taken for every prime number in the range.

Now consider the fact that, on a 4-core machine, we’ll be running 10000 tasks distributed between about 4 threads. It would be much more efficient to accumulate the sum in four local variables–no locking necessary–and add them together only at the end of the calculation. This recipe can be expressed abstractly as a map/reduce type of algorithm (a generalization of the C++ std::accumulate). The tasks are mapped into separate threads, which work in parallel, and the results are then reduced into the final answer.

Here’s how map/reduce is expressed in TPL:

int sum = Parallel.Aggregate(
  0, 10000, // domain
  0, // initial value
  delegate(int i){ return (isPrime(i) ? i : 0); },
  delegate(int x, int y){ return x+y; }
);

The first delegate, which is run by 10000 tasks, does not modify any shared state–it just returns its result, which is internally accumulated in some hidden local variable. The second delegate–the “reduce” part of the algorithm–is called when there’s a need to combine results from two different tasks.

The Place for Functional Programming

Notice that the last example was written in a very functional style. In particular, you don’t see any mutable state: the delegates are pure functions. This is no coincidence: functional programming has many advantages in parallel programming.

I’ve been doing a lot of multi-threaded programming in C++ lately and I’ve noticed how my style is gradually shifting from object-oriented towards functional. This process is accelerating as functional features keep seeping into the C++ standard. Obviously, lambdas are very useful, but so is the move semantics made possible by rvalue references, especially when passing data between threads. It’s becoming more and more obvious that, in order to be a good C++ programmer, one needs to study other languages. I recommend Haskell and Scala in particular. I’ll be blogging about them in the future.

Bibliography

  1. Simon Marlow, Simon Peyton Jones, and Satnam Singh, Runtime Support for Multicore Haskell
  2. Daan Leijen, Wolfram Schulte, and Sebastian Burckhardt, The Design of a Task Parallel Library
  3. A video introduction to Haskell by Simon Peyton Jones: Part I, Part II, together with the slides.