Locked old thread. Topic: My Java and C++ performance tests
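Posted: 2011-05-24
This kind of test doesn't really prove much.
On my machine, using gcc, I changed the new'ed Test to a stack-allocated object, and at -O3 the doTest call inside the loop was optimized away entirely.
As for object creation: this implementation produces no memory fragmentation at all, there is no cost of searching for a free block, and the construction cost is marginal compared with the cost of the double test computations.
Once the JNI overhead is removed, the speed difference between the two is actually very small.
What is meaningful is benchmarking queries or real algorithms, making sure every line of code actually does work.
The result also depends heavily on the OS, CPU, compiler, compiler options and JVM version. For pure computation code, C++ generally still keeps a 10%-20% advantage on my laptop, while on my home desktop it is basically within 5%. For string-heavy code, if the C++ side uses string rather than char*, it is often slower than Java.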
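The point about making sure every line actually does work can be sketched in Java. This is a minimal sketch, not the OP's benchmark: the class name SinkBenchmark, the workload in work(), and the iteration counts are all hypothetical. It accumulates every result into a value that is returned and printed, so the compiler cannot simply treat the loop body as dead code, and it runs a few warm-up rounds before timing.

```java
public class SinkBenchmark {
    // Hypothetical workload: mixes integer and floating-point arithmetic.
    static double work(int i) {
        int shifted = (i << 3) ^ (i >>> 2);
        return shifted / 7.0 + Math.sqrt(i);
    }

    // Runs the workload and returns the accumulated result so it stays live.
    static double run(int iterations) {
        double sink = 0;
        for (int i = 0; i < iterations; i++) {
            sink += work(i);
        }
        return sink;
    }

    public static void main(String[] args) {
        // Warm-up rounds give the JIT a chance to compile run() before timing.
        for (int round = 0; round < 5; round++) {
            run(1_000_000);
        }
        long start = System.nanoTime();
        double result = run(1_000_000);
        long end = System.nanoTime();
        // Printing the result keeps the computation observable.
        System.out.println("result=" + result + ", time="
                + (end - start) / 1_000_000.0 + " ms");
    }
}
```

Even with a live result, a loop like this remains a microbenchmark; the numbers it prints say little about a real application.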
Posted: 2011-05-24
> RednaxelaFX wrote:
> > jellyfish wrote:
> > Java's sin() function is slow, well known.
>
> Slow, compared to what? That's a well-known myth, which is not true for modern high performance JVMs like HotSpot, JRockit and J9. Unless you specify strictfp (which you would seldom see anyone do), these JVMs will take advantage of the floating point instructions of the underlying hardware for maximum performance.
>
> For example, this is what Math.sin() looks like when it's called from C2 compiled code, on x64 (C2 is the name of HotSpot's server compiler):
>
>     StubRoutines::sin [0x00007f89ea1dcf11, 0x00007f89ea1dd029[ (280 bytes)
>     [Disassembling for mach='i386:x86-64']
>     0x00007f89ea1dcf11: sub $0x8,%rsp
>     0x00007f89ea1dcf15: movsd %xmm0,(%rsp)
>     0x00007f89ea1dcf1a: fldl (%rsp)
>     0x00007f89ea1dcf1d: fldl 0x496451d(%rip) # 0x00007f89eeb41440
>     0x00007f89ea1dcf23: fld %st(1)
>     0x00007f89ea1dcf25: fabs
>     0x00007f89ea1dcf27: fucomip %st(1),%st
>     0x00007f89ea1dcf29: ffree %st(0)
>     0x00007f89ea1dcf2b: fincstp
>     0x00007f89ea1dcf2d: ja Stub::sin+41 0x0x7f89ea1dcf3a
>     0x00007f89ea1dcf33: fsin
>     0x00007f89ea1dcf35: jmpq Stub::sin+267 0x0x7f89ea1dd01c
>     0x00007f89ea1dcf3a: mov %rsp,-0x28(%rsp)
>     0x00007f89ea1dcf3f: sub $0x80,%rsp
>     0x00007f89ea1dcf46: mov %rax,0x78(%rsp)
>     0x00007f89ea1dcf4b: mov %rcx,0x70(%rsp)
>     0x00007f89ea1dcf50: mov %rdx,0x68(%rsp)
>     0x00007f89ea1dcf55: mov %rbx,0x60(%rsp)
>     0x00007f89ea1dcf5a: mov %rbp,0x50(%rsp)
>     0x00007f89ea1dcf5f: mov %rsi,0x48(%rsp)
>     0x00007f89ea1dcf64: mov %rdi,0x40(%rsp)
>     0x00007f89ea1dcf69: mov %r8,0x38(%rsp)
>     0x00007f89ea1dcf6e: mov %r9,0x30(%rsp)
>     0x00007f89ea1dcf73: mov %r10,0x28(%rsp)
>     0x00007f89ea1dcf78: mov %r11,0x20(%rsp)
>     0x00007f89ea1dcf7d: mov %r12,0x18(%rsp)
>     0x00007f89ea1dcf82: mov %r13,0x10(%rsp)
>     0x00007f89ea1dcf87: mov %r14,0x8(%rsp)
>     0x00007f89ea1dcf8c: mov %r15,(%rsp)
>     0x00007f89ea1dcf90: sub $0x8,%rsp
>     0x00007f89ea1dcf94: fstpl (%rsp)
>     0x00007f89ea1dcf97: movsd (%rsp),%xmm0
>     0x00007f89ea1dcf9c: test $0xf,%esp
>     0x00007f89ea1dcfa2: je Stub::sin+169 0x0x7f89ea1dcfba
>     0x00007f89ea1dcfa8: sub $0x8,%rsp
>     0x00007f89ea1dcfac: callq 0x00007f89eea45d76
>     0x00007f89ea1dcfb1: add $0x8,%rsp
>     0x00007f89ea1dcfb5: jmpq Stub::sin+174 0x0x7f89ea1dcfbf
>     0x00007f89ea1dcfba: callq 0x00007f89eea45d76
>     0x00007f89ea1dcfbf: movsd %xmm0,(%rsp)
>     0x00007f89ea1dcfc4: fldl (%rsp)
>     0x00007f89ea1dcfc7: add $0x8,%rsp
>     0x00007f89ea1dcfcb: mov (%rsp),%r15
>     0x00007f89ea1dcfcf: mov 0x8(%rsp),%r14
>     0x00007f89ea1dcfd4: mov 0x10(%rsp),%r13
>     0x00007f89ea1dcfd9: mov 0x18(%rsp),%r12
>     0x00007f89ea1dcfde: mov 0x20(%rsp),%r11
>     0x00007f89ea1dcfe3: mov 0x28(%rsp),%r10
>     0x00007f89ea1dcfe8: mov 0x30(%rsp),%r9
>     0x00007f89ea1dcfed: mov 0x38(%rsp),%r8
>     0x00007f89ea1dcff2: mov 0x40(%rsp),%rdi
>     0x00007f89ea1dcff7: mov 0x48(%rsp),%rsi
>     0x00007f89ea1dcffc: mov 0x50(%rsp),%rbp
>     0x00007f89ea1dd001: mov 0x60(%rsp),%rbx
>     0x00007f89ea1dd006: mov 0x68(%rsp),%rdx
>     0x00007f89ea1dd00b: mov 0x70(%rsp),%rcx
>     0x00007f89ea1dd010: mov 0x78(%rsp),%rax
>     0x00007f89ea1dd015: add $0x80,%rsp
>     0x00007f89ea1dd01c: fstpl (%rsp)
>     0x00007f89ea1dd01f: movsd (%rsp),%xmm0
>     0x00007f89ea1dd024: add $0x8,%rsp
>     0x00007f89ea1dd028: retq
>
> In the fastest case, the code above boils down to:
>
>     StubRoutines::sin [0x00007f89ea1dcf11, 0x00007f89ea1dd029[ (280 bytes)
>     [Disassembling for mach='i386:x86-64']
>     0x00007f89ea1dcf11: sub $0x8,%rsp
>     0x00007f89ea1dcf15: movsd %xmm0,(%rsp)
>     0x00007f89ea1dcf1a: fldl (%rsp)
>     0x00007f89ea1dcf1d: fldl 0x496451d(%rip) # 0x00007f89eeb41440
>     0x00007f89ea1dcf23: fld %st(1)
>     0x00007f89ea1dcf25: fabs
>     0x00007f89ea1dcf27: fucomip %st(1),%st
>     0x00007f89ea1dcf29: ffree %st(0)
>     0x00007f89ea1dcf2b: fincstp
>     0x00007f89ea1dcf2d: ja Stub::sin+41 0x0x7f89ea1dcf3a
>     0x00007f89ea1dcf33: fsin
>     0x00007f89ea1dcf35: jmpq Stub::sin+267 0x0x7f89ea1dd01c
>     # ...
>     0x00007f89ea1dd01c: fstpl (%rsp)
>     0x00007f89ea1dd01f: movsd (%rsp),%xmm0
>     0x00007f89ea1dd024: add $0x8,%rsp
>     0x00007f89ea1dd028: retq
>
> Even better, it doesn't have to invoke a stub all the time; the code may be inlined to the call site, resulting in code like this:
>
>     0x00007f89ea1fa92b: sub $0x8,%rsp
>     0x00007f89ea1fa92f: movsd %xmm0,(%rsp)
>     0x00007f89ea1fa934: data16
>     0x00007f89ea1fa935: fldl (%rsp)
>     0x00007f89ea1fa938: fsin
>     0x00007f89ea1fa93a: fstpl 0x0(%rsp)
>     0x00007f89ea1fa93e: movsd (%rsp),%xmm0
>     0x00007f89ea1fa943: add $0x8,%rsp
>
> which isn't as slow as you might guess it is.

Java's sin() is slow compared to C calls. Several years ago, I saw an article posting comparison results, and my own runs came out almost the same: it was about 60 times slower. Someone did some digging on the C side and found out that MS did some optimization using assembly.
On the other hand, after digging into the code on the Java side, folks initiated a discussion with Java's grandfather (you could google his name and sin function), and then the debate came out as big news. Not sure about the performance of your code. Of course, there is always hardware acceleration, such as GPU, but I am assuming we are in the context of the normal PC, since that's what I work on every day. You miles may vary.
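One way to check such claims on your own machine, rather than relying on old articles, is to time Math.sin() directly. The following is a rough sketch only: the class name SinTiming, the input range, and the iteration counts are assumptions, and a hand-rolled loop like this is itself a microbenchmark with all the pitfalls discussed in this thread. It compares Math.sin(), which a JIT is allowed to replace with a platform-specific implementation, against StrictMath.sin(), which is specified to produce the same results as the fdlibm algorithms.

```java
import java.util.function.DoubleUnaryOperator;

public class SinTiming {
    // Sums f(x) over a fixed set of inputs; the sum is returned so the work stays live.
    static double sumOver(DoubleUnaryOperator f, int iterations) {
        double sum = 0;
        for (int i = 0; i < iterations; i++) {
            sum += f.applyAsDouble(i * 1e-3);
        }
        return sum;
    }

    // Times one call of sumOver in milliseconds after a few warm-up rounds.
    static double timeMillis(DoubleUnaryOperator f, int iterations) {
        for (int round = 0; round < 5; round++) {
            sumOver(f, iterations);
        }
        long start = System.nanoTime();
        double sum = sumOver(f, iterations);
        long end = System.nanoTime();
        if (Double.isNaN(sum)) {
            throw new IllegalStateException("unexpected NaN");
        }
        return (end - start) / 1_000_000.0;
    }

    public static void main(String[] args) {
        int n = 2_000_000;
        System.out.println("Math.sin:       " + timeMillis(Math::sin, n) + " ms");
        System.out.println("StrictMath.sin: " + timeMillis(StrictMath::sin, n) + " ms");
    }
}
```

The absolute numbers depend on the JVM, its version and flags, and the CPU, so they are only meaningful relative to each other on the same setup.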
Posted: 2011-05-24
> skzr.org wrote:
> I was out for a meal just now. Actually I just tweaked the OP's code a bit.

Running your code here on Windows XP with JDK 6 update 25 gives the following (in the program output, 一次new means "new once" and 每次new means "new every time"):

    C:\x>java -server PerformTest
    start test[一次new]...
    Program run druation[一次new]: 592.789 ms.
    start test [每次new]...
    Program run druation[每次new]: 587.895 ms.
    start test[一次new]...
    Program run druation[一次new]: 2100.829 ms.
    start test [每次new]...
    Program run druation[每次new]: 346.865 ms.
    start test[一次new]...
    Program run druation[一次new]: 335.599 ms.
    start test [每次new]...
    Program run druation[每次new]: 42.258 ms.
    start test[一次new]...
    Program run druation[一次new]: 335.614 ms.
    start test [每次new]...
    Program run druation[每次new]: 42.216 ms.
    start test[一次new]...
    Program run druation[一次new]: 335.665 ms.
    start test [每次new]...
    Program run druation[每次new]: 42.218 ms.

Adding one more parameter changes the picture:

    C:\x>java -server -XX:LoopUnrollLimit=60 PerformTest
    start test[一次new]...
    Program run druation[一次new]: 598.998 ms.
    start test [每次new]...
    Program run druation[每次new]: 596.406 ms.
    start test[一次new]...
    Program run druation[一次new]: 560.527 ms.
    start test [每次new]...
    Program run druation[每次new]: 350.091 ms.
    start test[一次new]...
    Program run druation[一次new]: 42.605 ms.
    start test [每次new]...
    Program run druation[每次new]: 42.393 ms.
    start test[一次new]...
    Program run druation[一次new]: 42.391 ms.
    start test [每次new]...
    Program run druation[每次new]: 42.694 ms.
    start test[一次new]...
    Program run druation[一次new]: 42.259 ms.
    start test [每次new]...
    Program run druation[每次new]: 42.226 ms.
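I'd say this example once again confirms what is so tragic about microbenchmarks. In your environment the default value of the LoopUnrollLimit parameter is 60, while in mine it is 50. The difference is tiny, but it happens to affect one optimization: loop unrolling.

As for why things become especially fast from the third round on: taking main1() as an example, after compilation it turns into roughly this:

    public static void main1() {
        System.out.println("start test [每次new]...");
        long start = System.nanoTime(); // note: no JNI overhead
        int j = 0;
        int i = 2;
        for (; i < 20001; i += 16) { // stride of 16 per iteration
        }
        for (; i < 20001; i++) { // the remainder, with a stride of 1
        }
        for (; j < 10000; j++) {
        }
        long end = System.nanoTime(); // note: no JNI overhead
        System.out.println("Program run druation[每次new]: " + (end - start)/1000000d + " ms.");
    }

As you can see, the calls from main1() to doTest(), testInt() and the like really were inlined, escape analysis plus scalar replacement really did eliminate the need to allocate space for new PerformTest(), and the JNI call overhead of System.nanoTime() is gone as well (because it is an intrinsic in the HotSpot VM). In fact, none of the time spent on shifts, divisions and so on was measured at all, so of course it is fast.
The only pity is that the loops are still there...

If you're interested in the actual generated code, I can post it. It's just that the control flow is rather messy, so it may be a bit hard to read...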
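To make the loop-unrolling shape described above more concrete, here is a hand-written sketch of what unrolling does to a counted loop: a main loop with a larger stride, followed by a remainder loop with a stride of 1. It only imitates the shape by hand; the class name UnrollSketch, the stride of 4, and the summing workload are hypothetical, and the real C2 output is machine code rather than Java source.

```java
public class UnrollSketch {
    // Straightforward loop: one element per iteration.
    static long sumPlain(int[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i];
        }
        return sum;
    }

    // Hand-unrolled loop: four elements per iteration, then a remainder loop.
    static long sumUnrolled(int[] data) {
        long sum = 0;
        int i = 0;
        int limit = data.length - 3; // a full group of 4 fits only while i < limit
        for (; i < limit; i += 4) {
            sum += data[i];
            sum += data[i + 1];
            sum += data[i + 2];
            sum += data[i + 3];
        }
        for (; i < data.length; i++) { // remaining 0 to 3 elements
            sum += data[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = new int[20001];
        for (int i = 0; i < data.length; i++) {
            data[i] = i;
        }
        System.out.println(sumPlain(data) + " " + sumUnrolled(data));
    }
}
```

Both methods return the same sum; the unrolled version simply performs fewer loop-condition checks per element, which is the effect a limit such as LoopUnrollLimit can switch on or off.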
Posted: 2011-05-24
Last edited: 2011-05-24
> jellyfish wrote:
> Java's sin() is slow compared to C calls. Several years ago, I saw an article posting comparison results, and my own runs came out almost the same: it was about 60 times slower. Someone did some digging on the C side and found out that MS did some optimization using assembly.
> On the other hand, after digging into the code on the Java side, folks initiated a discussion with Java's grandfather (you could google his name and sin function), and then the debate came out as big news. Not sure about the performance of your code. Of course, there is always hardware acceleration, such as GPU, but I am assuming we are in the context of the normal PC, since that's what I work on every day. You miles may vary.

That's "your mileage may vary". Learn, YMMV.

It's bad micro benchmarks that give people the wrong impression that Java is slow. A Java program may be slower than a well-tuned C program in a real world scenario, but that's got nothing to do with what those bad benchmarks are trying to tell you.

Go ahead and disassemble your /lib64/libm.so.6 equivalent, and see what "<sin>" turns into for yourself.

As for articles on Java's floating point arithmetic, I guess you're referring to something like this: How Java's Floating-Point Hurts Everyone Everywhere. That's from more than a decade ago, and the statements in it don't hold anymore now.
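Posted: 2011-05-24
I don't understand the low-level details; I've fallen behind.
I have always advocated not hoisting temporary variables outside the for loop, and modern JVMs have taken care of that concern. Nowadays I just declare temporaries directly inside the for loop.
Advantages in practice:
1. More readable
2. The JVM does the optimizing for you ^ ^
This test bears that out as well : )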
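The two styles being compared look roughly like this. It is a minimal sketch with hypothetical names (LoopTemp, insideLoop, outsideLoop): both methods compute the same result, and the version with the temporary declared inside the loop keeps its scope as narrow as possible. Whether a particular JVM ends up generating identical machine code for both is something to verify rather than assume.

```java
public class LoopTemp {
    // Temporary declared inside the loop: its scope is a single iteration.
    static long insideLoop(int n) {
        long total = 0;
        for (int i = 0; i < n; i++) {
            long square = (long) i * i;
            total += square;
        }
        return total;
    }

    // Temporary hoisted outside the loop: same result, wider scope.
    static long outsideLoop(int n) {
        long total = 0;
        long square;
        for (int i = 0; i < n; i++) {
            square = (long) i * i;
            total += square;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(insideLoop(10_000) == outsideLoop(10_000));
    }
}
```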
Posted: 2011-05-24
> RednaxelaFX wrote:
> > jellyfish wrote:
> > Java's sin() is slow compared to C calls. Several years ago, I saw an article posting comparison results, and my own runs came out almost the same: it was about 60 times slower. Someone did some digging on the C side and found out that MS did some optimization using assembly.
> > On the other hand, after digging into the code on the Java side, folks initiated a discussion with Java's grandfather (you could google his name and sin function), and then the debate came out as big news. Not sure about the performance of your code. Of course, there is always hardware acceleration, such as GPU, but I am assuming we are in the context of the normal PC, since that's what I work on every day. You miles may vary.
>
> That's "your mileage may vary". Learn, YMMV.
>
> It's bad micro benchmarks that give people the wrong impression that Java is slow. A Java program may be slower than a well-tuned C program in a real world scenario, but that's got nothing to do with what those bad benchmarks are trying to tell you.
>
> Go ahead and disassemble your /lib64/libm.so.6 equivalent, and see what "<sin>" turns into for yourself.
>
> As for articles on Java's floating point arithmetic, I guess you're referring to something like this: How Java's Floating-Point Hurts Everyone Everywhere. That's from more than a decade ago, and the statements in it don't hold anymore now.

So, given that JVMs are released per operating system and can be tuned through native calls, the choice of JVM implementation, parameters and so on, is there actually still any performance loss caused by being cross-platform? The first thing that comes to mind is Swing; has that become a myth as well?
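Posted: 2011-05-24
> yangyi wrote:
> So, given that JVMs are released per operating system and can be tuned through native calls, the choice of JVM implementation, parameters and so on, is there actually still any performance loss caused by being cross-platform? The first thing that comes to mind is Swing; has that become a myth as well?

Not quite. It depends on what you compare against when you talk about "performance loss", and on how you look at the efficiency of the whole system in all its aspects.

First, parameter tuning cannot solve every performance problem. Once you have chosen a particular JVM implementation, performance is capped by that implementation's limits.

Second, for a Java program running on a JVM to reach the same level of speed as C and the like, it usually has to consume more memory. This footprint problem is not something you can turn a blind eye to in every scenario.

Then there is what I consider the most important point, which actually lies with Java programmers: because Java encapsulates the low-level details, many Java programmers do not scrutinize every line of code they write. By contrast, many people who write C care a great deal about every bit of performance, and things are also more controllable there. The result is that Java makes writing so-called business code easier, but performance often gets lost, without anyone noticing, in code written without enough thought. That is not necessarily a bad thing; plenty of Java programs don't need to be all that fast anyway, so of course it is more comfortable to save some effort.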
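Posted: 2011-05-24
I tried it with the JVM and gcc I have at home; Java was even slightly faster than the code built with gcc -O2 / -O3.

    Java(TM) SE Runtime Environment (build 1.6.0_21-b06)
    Java HotSpot(TM) 64-Bit Server VM (build 17.0-b16, mixed mode)
    gcc (GCC) 4.5.2

    gcc 47641 ms. -O3 release
    java 45515 ms.

At -O3, gcc also inlines doTest, testInt and testDouble. If the test object is created on the stack, -O3 skips doTest altogether and does not even run the loop; apart from the system call code, all the meaningless code is filtered out.
For pure algorithmic code with no redundancy, the difference between Java and gcc at O1 or O3 is not large; Java and C++ are very close in speed.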
Posted: 2011-05-25
> RednaxelaFX wrote:
> > jellyfish wrote:
> > Java's sin() is slow compared to C calls. Several years ago, I saw an article posting comparison results, and my own runs came out almost the same: it was about 60 times slower. Someone did some digging on the C side and found out that MS did some optimization using assembly.
> > On the other hand, after digging into the code on the Java side, folks initiated a discussion with Java's grandfather (you could google his name and sin function), and then the debate came out as big news. Not sure about the performance of your code. Of course, there is always hardware acceleration, such as GPU, but I am assuming we are in the context of the normal PC, since that's what I work on every day. You miles may vary.
>
> That's "your mileage may vary". Learn, YMMV.
>
> It's bad micro benchmarks that give people the wrong impression that Java is slow. A Java program may be slower than a well-tuned C program in a real world scenario, but that's got nothing to do with what those bad benchmarks are trying to tell you.
>
> Go ahead and disassemble your /lib64/libm.so.6 equivalent, and see what "<sin>" turns into for yourself.
>
> As for articles on Java's floating point arithmetic, I guess you're referring to something like this: How Java's Floating-Point Hurts Everyone Everywhere. That's from more than a decade ago, and the statements in it don't hold anymore now.

That reference is really a wild guess. NO, it's not.

http://blogs.oracle.com/jag/entry/transcendental_meditation

A simple google on "java sin cos slow" turns up a lot of interesting entries, especially in the game arena, so it should be classified as "well known".

While I am saying Java's sin() is slower than the C version, I am not saying Java is slow in general, am I?

In fact, if you can make Java's sin() nearly as fast as C's, I would take it. I did quite some coding on how to make those special functions as fast as possible, such as gamma and log gamma. It's just hard. It's so hard that sometimes people accept the inaccuracy as the cost.

I've done a lot of performance tuning as well, and have seen so many cases of premature optimization. The most common case is that people don't understand the problem itself and still try to optimize/profile it.
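Accepting inaccuracy as the cost of speed, as mentioned above, is often done with a lookup table. The following sketch is an illustration only, not code from any of the posts: the class name TableSin, the table size of 4096, and the nearest-entry lookup are all assumptions. With this table size the absolute error stays below roughly 0.001 for moderate inputs, which may or may not be acceptable for a given use, and whether it is actually faster than Math.sin() on a given JVM has to be measured.

```java
public class TableSin {
    private static final int SIZE = 4096;           // hypothetical table size (power of two)
    private static final int MASK = SIZE - 1;       // wraps indices around one full period
    private static final double SCALE = SIZE / (2 * Math.PI);
    private static final double[] TABLE = new double[SIZE];

    static {
        // Precompute one full period of the sine function.
        for (int i = 0; i < SIZE; i++) {
            TABLE[i] = Math.sin(i * 2 * Math.PI / SIZE);
        }
    }

    // Approximates sin(x) by the nearest table entry; accuracy is limited by the table step.
    static double sin(double x) {
        int index = (int) Math.round(x * SCALE) & MASK;
        return TABLE[index];
    }

    // Largest absolute difference from Math.sin over evenly spaced samples in [-10, 10].
    static double maxError(int samples) {
        double max = 0;
        for (int i = 0; i <= samples; i++) {
            double x = -10 + 20.0 * i / samples;
            max = Math.max(max, Math.abs(sin(x) - Math.sin(x)));
        }
        return max;
    }

    public static void main(String[] args) {
        System.out.println("max abs error: " + maxError(100_000));
    }
}
```

A larger table or interpolation between neighbouring entries would reduce the error at the cost of more memory or more work per call.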
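Posted: 2011-05-25
The test method isn't a fair comparison.
Why doesn't the OP test Java calling a local Java component via RMI against C++ calling a local Java component through a web service, and see which one is faster?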