
Cray Reminiscences

Kirk Pepperdine's attendance at AMD's performance talk at JavaOne produced a cascade of fascinating memories about Cray optimizations. Here, Kirk relates some of the most interesting optimizations that helped make Crays superfast, and how that relates to your Java programs.
Published July 2007, Author Kirk Pepperdine

 

Traditionally JavaOne has offered more performance-related talks than any other Java conference. This year was no exception; with so many performance-related talks it was impossible to attend all of them. One of the more interesting sessions was put on by AMD's Azeem Jiva. The timeless theme of the talk was: make sure your programs are good to your hardware and your hardware will be good to you. I say timeless because as I watched Azeem stroll through the demos, my mind was deluged with memories of my days programming on Cray supercomputers.

The Cray series of supercomputers was an engineering marvel in its prime. The brilliance of the machine architecture wasn't only about speed; the scalar processors were not much faster than what would be found on any other server. Cray's brilliance was in the balance within the machine. As they saw it, there was no point in having a superfast CPU if it was only going to be starved for work, so much of the extreme engineering went into making sure that the CPUs were never hung on wait conditions.

One of my long-time recommendations for Windows users, to eliminate virtual memory from their machines (don't do this unless you've got plenty of real RAM), is based on the lack of virtual memory on Cray systems. In a time when memory was both in short supply and expensive, Cray recognized that getting data from disk created huge wait conditions, so they eliminated virtual memory. To help with the I/O they introduced solid state memory devices and multiple separate channels to move data from one place to another.

Most of these optimizations were performed under the hood and, aside from a few rules of thumb such as "don't do I/O and processing in the same loop", one's coding style had little effect on performance. That said, there were other optimizations that could be obliterated if the developer ignored, or didn't understand, how the underlying hardware was architected and functioned. Out of the many optimizations that a developer's coding style could directly affect, I'd like to mention three: instruction buffer faults, striding through memory, and the ability to utilize the vector processors.

Though rare at the time, some form of the technologies found in Cray's vector processors is now commonplace in modern processors. For example, pipelining intermediate results through various stages of computation, so that the processor can work on multiple pieces of data at the same time, is quite common. Things like branch prediction are much more advanced now than they were when I programmed Crays. Back in the late 80s and early 90s it was fairly easy (and it still is) to obfuscate what you want to do next. In the worst case, the Cray would run your code in scalar mode instead of utilizing the much faster and more efficient vector processors. The most common way to obfuscate was to put branch statements in a for loop (vector processors worked best with large for loops). To get code to vectorize, one would often separate the data based on the condition in the branch prior to entering the processing loop. Each dataset would then be run through its own separate loop with the branch removed.
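To make the idea concrete in Java terms, here is a minimal sketch (the class, method, and array names are purely illustrative, not from the talk): the data is partitioned on the branch condition first, so the processing loops themselves are tight and branch-free.

// A sketch only: partition the data on the branch condition before the
// processing loops, so each processing loop contains no branch.
public class BranchSplit {

    // Branchy version: the if/else sits inside the hot loop.
    static long sumBranchy(int[] values) {
        long positiveSum = 0, negativeSum = 0;
        for (int i = 0; i < values.length; i++) {
            if (values[i] > 0) {
                positiveSum += values[i];
            } else {
                negativeSum += values[i];
            }
        }
        return positiveSum - negativeSum;
    }

    // Partitioned version: separate the data first, then run two
    // branch-free loops, one per partition.
    static long sumPartitioned(int[] values) {
        int[] positives = new int[values.length];
        int[] negatives = new int[values.length];
        int p = 0, n = 0;
        for (int v : values) {
            if (v > 0) positives[p++] = v; else negatives[n++] = v;
        }
        long positiveSum = 0, negativeSum = 0;
        for (int i = 0; i < p; i++) positiveSum += positives[i];
        for (int i = 0; i < n; i++) negativeSum += negatives[i];
        return positiveSum - negativeSum;
    }
}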

Cray's instruction buffer was big enough to hold 40 instructions. The system would load the next 40 instructions to be executed and, when they were exhausted, it would load the next 40. It did have the ability to do a predictive pre-fetch but, in general, fetching the next set of instructions would most likely create a hold condition (CPU goes hungry). This is yet another case where a developer's coding style could have adverse effects on performance. Of course code that randomly jumped to instructions not in the buffer would have the biggest impact, but there were more subtle conditions than that. Again, loops become important. Loops that were larger than 40 instructions, and those that spilled over an instruction buffer boundary, would result in some (sometimes significant) performance degradation. The obvious solution to the former problem was to write very small, tight loops, even if that meant looping twice over the same dataset. Crays were very well tuned for doing this, so quite often several single passes worked much better than a single "do all" pass over the dataset.
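The same shape of transformation, sometimes called loop fission, is easy to sketch in Java. This is an illustration under my own names and assumptions (three same-length arrays, arbitrary arithmetic), not code from the talk: one do-everything loop versus several short passes with small bodies.

// A sketch of splitting one large loop body into several tight passes.
public class LoopFission {

    // Single pass that does everything: the loop body is comparatively large.
    static void processAllInOne(double[] a, double[] b, double[] c) {
        for (int i = 0; i < a.length; i++) {
            a[i] = a[i] * 2.0;
            b[i] = Math.sqrt(a[i]) + b[i];
            c[i] = a[i] + b[i] + c[i];
        }
    }

    // The same work split into three short loops, each with a small body.
    static void processInPasses(double[] a, double[] b, double[] c) {
        for (int i = 0; i < a.length; i++) a[i] = a[i] * 2.0;
        for (int i = 0; i < a.length; i++) b[i] = Math.sqrt(a[i]) + b[i];
        for (int i = 0; i < a.length; i++) c[i] = a[i] + b[i] + c[i];
    }
}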

In retrospect the latter problem should have been handled automatically by having the optimizer align loops on instruction buffer boundaries. Cray's solution at the time was to introduce a pragma statement. The pragma told the compiler/linker to align the code following the statement on an instruction buffer boundary. The programmer's role in all of this, other than recognizing where to put the pragmas, was to ensure that loops did not span more than 40 instructions. Done right, a couple of short loops would outperform a single loop that did everything.

The most interesting optimization was the one supporting the feeding of the vector processors from main memory. The vector processor was capable of both accepting a single piece of data and returning a result in the same clock tick. The electronic reality is that memory, once strobed to be read, requires some time before it can be read again. Cray was always careful to make sure that the bank cool-off time was 4 clock cycles. They were also careful to arrange memory into 4 different banks and, rather than have contiguous memory in the same bank, adjacent memory locations were arranged in different memory banks. The consequence of this design was that one bank of memory would always be ready to be read.
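Java gives us no control over physical memory banks, but the arithmetic behind the rule is easy to illustrate. On the Cray, a common fix was to pad an array's leading dimension so that the effective stride was not a multiple of the bank count. The toy sketch below (the bank count of 4 simply mirrors the description above) just prints which bank each access would land in for a few strides.

// Illustration only: with 4 interleaved banks, address % 4 picks the bank.
public class BankStride {
    public static void main(String[] args) {
        int banks = 4; // hypothetical interleave
        for (int stride : new int[] {1, 2, 4, 5}) {
            StringBuilder hits = new StringBuilder("stride " + stride + " hits banks:");
            for (int i = 0; i < 8; i++) {
                hits.append(' ').append((i * stride) % banks);
            }
            System.out.println(hits);
        }
    }
}

A stride of 4 hammers the same bank on every access and pays the cool-off penalty each time, while strides of 1 or 5 rotate through all four banks.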

The developer's responsibility in this case was to ensure that any stride through memory hit a ready bank on every clock tick. To do this, you might have to adjust the data structures being used. Again, coding style counts. So by now you may be asking: what does all of this have to do with Java? The answer is: more than one would think.

Right now you may be wondering why on earth anyone would be interested in hardware-level counters when they are looking at Java. After all, Java runs in an abstraction commonly known as the Java Virtual Machine, which places some distance between our code and the hardware. Aside from taking care with our choice of algorithm, what could we possibly do, short of implementing some dangerous premature optimizations, that would affect how our code utilizes the underlying hardware? Surprisingly, there are some easy changes you can make to your coding style that should help you to better utilize your hardware. More surprisingly, these style optimizations have been with us for longer than Java has.

The style optimization pointed to by Azeem was with respect to striding through a doubly indexed array.

The example presented looked something like this:

public void transform(int[][] matrix) {
    for (int k = 0; k < matrix[0].length; k++) {
        for (int j = 0; j < matrix.length; j++) {
            // do stuff with matrix[j][k]
        }
    }
}

According to the JLS, array dimension expressions are evaluated from left to right. So we can write int[][] matrix = new int[3][] and follow that up with matrix[0] = new int[3];. This implies that it is the rightmost index that points into a single-dimensional array whose elements are held in a contiguous block of memory. So the above code "jumps" through memory, creating a situation that thrashes the CPU's onboard cache. Of course the fix, sketched below, is to reverse the for loops so that the code runs through memory in a more predictable manner. Now, this example is a toy, so the problem is quite obvious. The question is: do you have some obfuscated code lurking in your application that is doing the same thing?
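The reversed version would look something like this (a sketch only): the inner loop now walks a single row, which is one contiguous int[].

public void transform(int[][] matrix) {
    for (int j = 0; j < matrix.length; j++) {
        for (int k = 0; k < matrix[j].length; k++) {
            // do stuff with matrix[j][k], walking one contiguous row at a time
        }
    }
}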

Another important feature the Analyzer was able to detect was lock contention. Lock contention can have some pretty devastating effects on your application's ability to perform. Aside from preventing threads from obtaining the CPU, lock contention puts pressure on the operating system. Even more interesting was watching Azeem use the Analyzer to point out how disruptive it was to the processor as well. What I got from this demo is that, just as the Cray processors worked best when our code behaved in a predictable manner, so too do our modern processors. And there is nothing quite as disruptive to a processor as having to execute code to acquire a lock. This isn't to say that we shouldn't lock when we need to, but it does suggest that something I've known to be true in the past is still true today, even in Java: your coding style can have positive effects on performance.
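As a small, hedged illustration of how coding style affects contention (the class and method names below are mine, not from the talk): a counter guarded by a coarse synchronized method forces every thread through the same monitor, while java.util.concurrent.atomic.AtomicLong leans on the processor's compare-and-swap support instead of a lock.

import java.util.concurrent.atomic.AtomicLong;

public class Counters {

    // Contended: every increment acquires and releases the object's monitor.
    static class SynchronizedCounter {
        private long count;
        synchronized void increment() { count++; }
        synchronized long get() { return count; }
    }

    // Less contended: each increment is a single compare-and-swap, no monitor.
    static class AtomicCounter {
        private final AtomicLong count = new AtomicLong();
        void increment() { count.incrementAndGet(); }
        long get() { return count.get(); }
    }
}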

Azeem's talk at JavaOne was TS-9363, "Java Platform Performance on Multicore: Better Performance or Bigger Headache?" Related tools are Intel's VTune and AMD's Analyzer.
