Kirk Pepperdine's attendance at AMD's performance talk at JavaOne produced a cascade of fascinating memories about Cray optimizations. Here, Kirk relates some of the most interesting optimizations that helped make Cray's machines superfast, and how that relates to your Java programs. Published July 2007, Author Kirk Pepperdine
Traditionally JavaOne has offered more performance-related talks than any other Java conference. This year was no exception; with so many performance-related talks it was impossible to attend all of them. One of the more interesting performance sessions was put on by AMD's Azeem Jiva. The timeless theme of the talk was: make sure your programs are good to your hardware and your hardware will be good to you. I say timeless because as I watched Azeem stroll through the demos, my mind was deluged with memories of my days programming on Cray supercomputers.
The Cray series of supercomputers were an engineering marvel in their prime. The brilliance of the machine architecture wasn't only about speed; the scalar processors were not much faster than what would be found on any other server. Cray's brilliance was about the balance within the machine. As they saw it, there is no point in having a superfast CPU if it is only going to be starved for work. So, much of the extreme engineering that took place was in making sure that the CPUs were never hung on wait conditions.
One of my long-time recommendations for Windows users, to eliminate virtual memory from their machines (don't do this unless you've got plenty of real RAM), is based on the lack of virtual memory on Cray systems. In a time when memory was both in short supply and expensive, Cray recognized that getting data from a disk created huge wait conditions, so they eliminated virtual memory. To help with the I/O they introduced solid state memory devices and multiple separate channels to move data from one place to another.
Most of these optimizations were performed under the hood and, aside from a few rules of thumb such as don't do I/O and processing in the same loop, one's coding style had little effect on performance. That said, there were other optimizations that could be obliterated if the developer ignored or didn't understand how the underlying hardware was architected and functioned. Out of the many optimizations that a developer's coding style had the direct ability to affect, I'd like to mention three: instruction buffer faults, striding through memory, and the ability to utilize the vector processors.
Though rare at the time, some form of the technologies found in Cray's vector processors are now commonplace in modern processors. For example, pipelining intermediate results through various stages of computation so that the processor can work on multiple pieces of data at the same time is quite common. Things like path prediction are much more advanced now than they were when I programmed Crays. Back in the late 80s and early 90s, it was fairly easy (and it still is) to obfuscate what you may want to do next. In the worst case, the Cray would run your code in scalar mode instead of being able to utilize the much faster and more efficient vector processors. The most common way to obfuscate was to put branch statements in a for loop (vector processors worked best with large for loops). In order to get code to vectorize, one would often separate the data based on the condition in the branch prior to entering the processing loop. Each dataset would then be run through its own separate loop with the branch removed.
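The same transformation still reads naturally in Java. Below is a minimal sketch of the idea; the data, the threshold, and the per-element arithmetic are all invented for illustration, and whether a given JVM and CPU actually reward the split is something only measurement can tell you.

    // Branchy version: the condition inside the loop body obscures the access pattern.
    double branchy(double[] values, double threshold) {
        double sum = 0.0;
        for (double v : values) {
            if (v > threshold) {
                sum += v * 2.0;   // "hot" path
            } else {
                sum += v * 0.5;   // "cold" path
            }
        }
        return sum;
    }

    // Separated version: partition the data once, then run two tight, branch-free loops.
    double separated(double[] values, double threshold) {
        double[] hot  = java.util.Arrays.stream(values).filter(v -> v > threshold).toArray();
        double[] cold = java.util.Arrays.stream(values).filter(v -> v <= threshold).toArray();
        double sum = 0.0;
        for (double v : hot)  { sum += v * 2.0; }
        for (double v : cold) { sum += v * 0.5; }
        return sum;
    }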
Cray's instruction buffer was big enough to hold 40 instructions. The system would load the next 40 instructions to be executed and, when they were exhausted, it would load the next 40. It did have the ability to do a predictive pre-fetch, but in general fetching the next set of instructions would most likely create a hold condition (the CPU goes hungry). This is yet another case where a developer's coding style could have adverse effects on performance. Of course, code that randomly jumped to instructions not in the buffer would have the biggest impact on performance, but there were more subtle conditions than that. Again, loops become important. Loops that were larger than 40 instructions, and those that spilled over an instruction buffer boundary, would result in some (sometimes significant) performance degradation. The obvious solution for the former problem was to write very small, tight loops even if that meant looping twice over the same dataset. Crays were very well tuned for doing this, so quite often several single passes worked much better than a single "do all" pass over the dataset.
In retrospect, the latter problem should have been handled automatically by having the optimizer align loops on instruction buffer boundaries. Cray's solution at the time was to introduce a pragma statement. The pragma told the compiler/linker to align the code following the statement on an instruction buffer boundary. The programmer's role in all of this, other than recognizing where to put the pragmas, was to ensure that loops did not span more than 40 instructions. Done right, a couple of short loops will outperform a single loop that does everything.
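As a rough sketch of what "several single passes" means in shape (the arrays and the work done on them are made up for the example), the transformation is simply splitting one loop that does everything into a few short, focused loops:

    // One "do all" pass: a longer loop body that mixes unrelated work.
    void doAllPass(double[] a, double[] b, double[] scaled, double[] sums) {
        for (int i = 0; i < a.length; i++) {
            scaled[i] = a[i] * 1.5;
            sums[i]   = a[i] + b[i];
        }
    }

    // Several short passes: each loop body stays small and tight.
    void shortPasses(double[] a, double[] b, double[] scaled, double[] sums) {
        for (int i = 0; i < a.length; i++) {
            scaled[i] = a[i] * 1.5;
        }
        for (int i = 0; i < a.length; i++) {
            sums[i] = a[i] + b[i];
        }
    }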
The most interesting optimization was in support of feeding the vector processor from main memory. The vector processor was capable of both accepting a single piece of data and returning a result in the same clock tick. The electronic reality is that memory, once strobed to be read, requires some time before it can be read again. Cray was always careful to keep the bank cool-off time to 4 clock cycles. They were also careful to arrange memory into 4 different banks and, rather than have contiguous memory in the same bank, adjacent memory locations were arranged in different memory banks. The consequence of this design is that one bank of memory would always be ready to be read.
The developer's responsibility in this case was to ensure that any strides through memory hit a cold bank on every clock tick. To do this, you may have to adjust the data structures being used. Again, coding style counts. So by now you may be asking, what does all of this have to do with Java? The answer is: more than one would think.
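To make the stride point concrete, here is a tiny, hypothetical illustration (4 interleaved banks, bank = word index mod 4, all numbers invented): a unit stride rotates through all four banks and always finds a cold one, while a stride of 4 hammers the same bank on every access and runs straight into the cool-off time.

    public class BankStride {
        public static void main(String[] args) {
            int banks = 4;
            System.out.println("stride 1:");
            for (int i = 0; i < 8; i += 1) {
                System.out.println("  word " + i + " -> bank " + (i % banks)); // banks 0,1,2,3,0,...
            }
            System.out.println("stride 4:");
            for (int i = 0; i < 32; i += 4) {
                System.out.println("  word " + i + " -> bank " + (i % banks)); // always bank 0
            }
        }
    }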
Right now you may be wondering why on earth anyone would be interested in hardware-level counters when they are looking at Java. After all, Java runs in an abstraction commonly known as the Java Virtual Machine, which places some distance between our code and the hardware. Aside from taking care with our choice of algorithm, what could we possibly do, short of implementing some dangerous premature optimizations, that would affect how our code utilizes the underlying hardware? Surprisingly, there are some easy changes you can make to your coding style that should help you to better utilize your hardware. More surprisingly, these style optimizations have been with us for longer than Java has.
The style optimization pointed to by Azeem was in respect to striding through a doubly indexed array. The example presented looked something like this:
    public void transform(int[][] matrix) {
        for (int k = 0; k < matrix[0].length; k++) {
            for (int j = 0; j < matrix.length; j++) {
                // do stuff with matrix[j][k]
            }
        }
    }
According to the JLS, array dimension expressions are evaluated from left to right. So we can write int[][] matrix = new int[3][] and follow that up with matrix[0] = new int[3];. This implies that varying the rightmost index walks through a single-dimensional array whose elements are held in a contiguous block of memory. So the above code "jumps" through memory, creating a situation that thrashes the CPU's onboard cache. Of course the fix is to reverse the for loops so that the code runs through memory in a more predictable manner. Now, this example is a toy, so the problem is quite obvious. The question is: do you have some obfuscated code lurking in your application that is doing the same thing?
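For completeness, here is what the reversed, cache-friendlier version of the toy above looks like (still a sketch; the loop body stands in for whatever your real code does):

    public void transform(int[][] matrix) {
        for (int j = 0; j < matrix.length; j++) {          // rows in the outer loop
            for (int k = 0; k < matrix[j].length; k++) {   // walk each row contiguously
                // do stuff with matrix[j][k]
            }
        }
    }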
Another important feature the Analyzer was able to detect was lock contention. Lock contention can have some pretty devastating effects on your application's ability to perform. Aside from starving threads of the CPU, lock contention puts pressure on the operating system. Even more interesting was watching Azeem use the Analyzer to point out how disruptive it was to the processor as well. What I got from this demo is that just as the Cray processors worked best when our code behaved in a predictable manner, so too do our modern processors. And there is nothing quite as disruptive to a processor as having to execute code to acquire a contended lock. This isn't to say that we shouldn't lock when we need to, but it does suggest that something that I've known to be true in the past is still true today, even in Java, namely that your coding style can have positive effects on performance.
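A minimal sketch of the kind of style change that reduces contention (the class and the workload are invented for illustration): do the expensive work outside the lock and hold the lock only for the shared-state update.

    import java.util.ArrayList;
    import java.util.List;

    class ResultCollector {
        private final List<String> results = new ArrayList<>();

        // Contended style: the lock is held while the expensive work runs,
        // so every other thread queues up behind it.
        public synchronized void addContended(int input) {
            String formatted = expensiveFormat(input);
            results.add(formatted);
        }

        // Friendlier style: the expensive work runs with no lock held,
        // and the lock protects only the brief shared update.
        public void addBriefLock(int input) {
            String formatted = expensiveFormat(input);  // no lock held here
            synchronized (this) {
                results.add(formatted);
            }
        }

        private String expensiveFormat(int input) {
            return String.format("result-%08d", input); // stand-in for real work
        }
    }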
Azeem's talk at JavaOne was TS-9363, "Java Platform Performance on Multicore: Better Performance or Bigger Headache?" Related tools are Intel's VTune and AMD's Analyzer.