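Original article: http://www.lshift.net/blog/2010/02/28/memory-matters-even-in-erlang
I really admire how the author worked through this problem! I never expected that after hibernation, because objects get moved, memory accesses become non-contiguous, the CPU cache stops helping, and things can slow down this much!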
Some time ago we got an interesting bug report for RabbitMQ. Surprisingly, unlike other complex bugs, this one is easy to describe:
At some point basic.get suddenly starts being very slow - about 9 times slower!
Basic.get doesn’t do anything complex - it just pops a message from a queue. This behaviour was quite unexpected. Our initial tests confirmed that we have a problem when a queue contains thousands of elements:
queue_length: 90001 basic_get 3333 times took: 1421.250ms
queue_length: 83335 basic_get 3333 times took: 1576.664ms
queue_length: 60004 basic_get 3333 times took: 1403.086ms
queue_length: 53338 basic_get 3333 times took: 9659.434ms [ look at that! ]
queue_length: 50005 basic_get 3333 times took: 9885.598ms
queue_length: 46672 basic_get 3333 times took: 8562.136ms
Let me repeat that. Usually popping a message from a queue takes Xms. At some point, it slows down to 9*Xms.
It turned out that the problem is with the queue:len() function, which is executed during basic.get. In fact, queue:len() just calls the erlang:length() builtin, and at some point that call switches into a “slow” mode.
erlang:length() is a builtin that iterates through a linked list and counts its length. Its complexity is O(N), where N is the length of the list. This function is implemented in the VM so it’s expected to be very, very fast.
The problem is not with erlang:length() being slow. It’s about being unpredictably slow. Let’s take a look at the Erlang interpreter source code (erl_bif_guard.c:erts_gc_length_1). Here’s the main loop for erlang:length():
i = 0;
while (is_list(list)) {
    i++;
    list = CDR(list_val(list));
}
It does nothing unusual - it just iterates through list elements. However, recompiling Erlang with some debugging information confirms that the problem is indeed here:
clock_gettime(CLOCK_REALTIME, &t0);
while (is_list(list)) {
    i++;
    list = CDR(list_val(list));
}
clock_gettime(CLOCK_REALTIME, &t1);
td_ms = TIMESPEC_NSEC_SUBTRACT(t1, t0) / 1000000.0;
if (i > 200000 || td_ms > 2.0) {
    fprintf(stderr, "gc_length_1(%p)=%i %.3fms\n\r", reg[live], i, td_ms);
}
gc_length_1(0x7f4dbfa7fc19)=499999 2.221ms
gc_length_1(0x7f4dbfa7fc19)=499999 2.197ms
gc_length_1(0x7f4dbfa7fc19)=499999 2.208ms
(hibernation)
gc_length_1(0x7f4db0572049)=499999 13.793ms
gc_length_1(0x7f4db0572049)=499999 12.806ms
gc_length_1(0x7f4db0572049)=499999 12.531ms
This confirms Matthias’ initial guess - the slowdown starts after Erlang process hibernation.
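The instrumented loop above calls TIMESPEC_NSEC_SUBTRACT, which the snippet does not define. Here is a minimal sketch of what such a helper could look like, assuming it returns the difference between two struct timespec values in nanoseconds. The names timespec_diff_ns and timespec_diff_ms are made up for this example and are not taken from the Erlang sources:

```c
#include <assert.h>
#include <time.h>

/* Hypothetical helper (not from ERTS): nanoseconds elapsed from `start`
 * to `end`. Assumes both values are normalized, i.e. tv_nsec is in
 * [0, 1e9), as clock_gettime() returns them. */
static long long timespec_diff_ns(struct timespec end, struct timespec start)
{
    assert(end.tv_nsec >= 0 && end.tv_nsec < 1000000000L);
    assert(start.tv_nsec >= 0 && start.tv_nsec < 1000000000L);
    return (long long)(end.tv_sec - start.tv_sec) * 1000000000LL
         + (long long)(end.tv_nsec - start.tv_nsec);
}

/* Same interval in milliseconds, matching the td_ms value printed above. */
static double timespec_diff_ms(struct timespec end, struct timespec start)
{
    return (double)timespec_diff_ns(end, start) / 1000000.0;
}
```

For measuring intervals, CLOCK_MONOTONIC is usually a safer choice than CLOCK_REALTIME, since the latter can jump when the wall clock is adjusted.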
For those who aren’t Erlang experts: hibernation is an operation that compacts an Erlang process. It does aggressive garbage collection and reduces the memory footprint of a process to the absolute minimum.
The intended result of hibernation is recovering free memory from the process. However, its side effect is a new memory layout of objects allocated on the heap.
Ah, how could I have forgotten! Memory is slow nowadays! What happens is that before hibernation the list elements are packed densely, whereas after hibernation they are sparse. It’s easy to test - let’s count the average distance between pointers to list elements:
gc_length_1(0x7f5c626fbc19)=499999 2.229ms avg=16.000 dev=0.023
gc_length_1(0x7f5c626fbc19)=499999 3.349ms avg=16.000 dev=0.023
gc_length_1(0x7f5c626fbc19)=499999 3.345ms avg=16.000 dev=0.023
(hibernation)
gc_length_1(0x7f5c61f7d049)=499999 13.800ms avg=136.000 dev=0.266
gc_length_1(0x7f5c61f7d049)=499999 12.726ms avg=136.000 dev=0.266
gc_length_1(0x7f5c61f7d049)=499999 12.367ms avg=136.000 dev=0.266
Confirmed! The standard deviation is surprisingly small, so we can read the numbers as:
* Before hibernation, list elements are laid out exactly one after another; the values are somewhere else.
* After hibernation, list elements are interleaved with values.
This behavior does make sense. In most cases when you traverse a list, you actually do something with the values. After hibernation, when you access a list item, its value will already be loaded into the CPU cache.
Knowing the mechanism, it’s easy to write a test case that reproduces the problem.
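One way to sketch the two layouts outside the VM, in plain C, is shown below. The cell and slot types are hypothetical stand-ins for cons cells, not ERTS structures. The 120-byte payload is an assumption chosen so that, on a typical 64-bit platform, the strides come out at 16 and 136 bytes, as in the measurements above. The sketch reports the variance rather than the standard deviation, to avoid linking the math library:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a cons cell: two machine words. */
typedef struct cell {
    intptr_t head;       /* the value, or a reference to it */
    struct cell *tail;   /* next cell, or NULL at the end   */
} cell;

/* Sparse layout: each cell is followed by payload bytes (assumed size). */
typedef struct slot {
    cell c;
    char payload[120];
} slot;

typedef struct stride_stats {
    double mean;         /* average distance between consecutive cells */
    double variance;     /* 0.0 when every stride is identical         */
    size_t steps;        /* number of distances measured               */
} stride_stats;

/* Link n cells that sit `stride` bytes apart inside the buffer `base`. */
static cell *link_cells(void *base, size_t stride, size_t n)
{
    assert(n > 0 && stride >= sizeof(cell));
    char *p = base;
    for (size_t i = 0; i < n; i++) {
        cell *c = (cell *)(p + i * stride);
        c->head = (intptr_t)i;
        c->tail = (i + 1 < n) ? (cell *)(p + (i + 1) * stride) : NULL;
    }
    return (cell *)p;
}

/* Walk the list and collect statistics on the pointer distances. */
static stride_stats measure_strides(const cell *list)
{
    stride_stats s = {0.0, 0.0, 0};
    double sum = 0.0, sum_sq = 0.0;
    for (const cell *c = list; c != NULL && c->tail != NULL; c = c->tail) {
        double d = (double)((intptr_t)c->tail - (intptr_t)c);
        sum += d;
        sum_sq += d * d;
        s.steps++;
    }
    if (s.steps > 0) {
        s.mean = sum / (double)s.steps;
        s.variance = sum_sq / (double)s.steps - s.mean * s.mean;
    }
    return s;
}
```

Linking a calloc'ed buffer with a stride of sizeof(cell) gives the dense layout and a stride of sizeof(slot) gives the sparse one. Timing a plain length loop over both lists should then show whether the slowdown reproduces on a given machine.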
The average distance between pointers in my case is constant - the standard deviation is negligible. This information has a practical implication - we can “predict” where the next pointer will be. Let’s use that information to “fix” the Erlang VM by prefetching memory!
while (is_list(list)) {
    i++;
    list2 = CDR(list_val(list));
    __builtin_prefetch((char*)list2 + 128*((long)list2-(long)list));
    list = list2;
}
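The same idea can be tried on an ordinary C linked list. This is a rough sketch, assuming GCC or Clang (for __builtin_prefetch) and a hypothetical node type rather than the VM's tagged terms. As in the patch, the guessed address is the next cell's address plus a multiple of the last observed stride. The function name and the lookahead parameter are made up for this example:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical list node; not the VM's representation. */
typedef struct node {
    intptr_t head;
    struct node *tail;
} node;

/* Count the nodes, hinting the CPU to load the memory `lookahead`
 * strides past the next node. The guess is only right when the nodes
 * are evenly spaced, which is the assumption behind the patch. */
static size_t length_with_prefetch(const node *list, intptr_t lookahead)
{
    assert(lookahead >= 0);
    size_t n = 0;
    while (list != NULL) {
        const node *next = list->tail;
        n++;
        if (next != NULL) {
            /* Unsigned arithmetic keeps the address computation well
             * defined even for negative strides or wild guesses. */
            uintptr_t stride = (uintptr_t)next - (uintptr_t)list;
            uintptr_t guess = (uintptr_t)next + (uintptr_t)lookahead * stride;
            __builtin_prefetch((const void *)guess);
        }
        list = next;
    }
    return n;
}
```

Calling length_with_prefetch(list, 128) mirrors the multiplier used in the patch. A prefetch is only a hint and does not fault on an invalid address, so a wrong guess wastes memory bandwidth but does not change the result.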
Test script running on original Erlang VM:
length: 300001 avg:0.888792ms dev:0.061587ms
length: 300001 avg:0.881030ms dev:0.040961ms
length: 300001 avg:0.875158ms dev:0.019436ms
hibernate
length: 300001 avg:14.861762ms dev:0.150635ms
length: 300001 avg:14.833733ms dev:0.017405ms
length: 300001 avg:14.884861ms dev:0.220119ms
Patched Erlang VM:
length: 300001 avg:0.742822ms dev:0.029322ms
length: 300001 avg:0.739149ms dev:0.012897ms
length: 300001 avg:0.739465ms dev:0.014417ms
hibernate
length: 300001 avg:7.543693ms dev:0.284355ms
length: 300001 avg:7.342802ms dev:0.330158ms
length: 300001 avg:7.265960ms dev:0.053176ms
The test runs only a tiny bit faster for the “fast” case (dense conses) and twice as fast for the “slow” case (sparse conses).
Should this patch be merged into mainline Erlang? Not really. I have set the prefetch multiplier value to 128 and I don’t even know if it’s optimal. This was only an experiment. But it was fun to see how low-level system architecture can affect high-level applications.
Comments
#5
pizigou
2010-04-20
Hacking has to reach this level... that is really deep.
#4
litaocheng
2010-03-11
mryufeng wrote:
    litaocheng wrote:
        Heh, without knowing the internals, when you run into a problem like this all you can do is call it weird.
        But why doesn't RabbitMQ keep the queue length in a variable? Calling queue:len/1 on every reply really seems unnecessary. It is O(N), after all.
        reply({ok, queue:len(BufferTail), Msg},
              State#q{message_buffer = BufferTail,
                      next_msg_id = NextId + 1});
    Does it depend on how often this command is called?
The basic.get command is probably called very frequently. The reply returns the corresponding msg and the number of remaining msgs.
#3
mryufeng
2010-03-10
litaocheng wrote:
    Heh, without knowing the internals, when you run into a problem like this all you can do is call it weird.
    But why doesn't RabbitMQ keep the queue length in a variable? Calling queue:len/1 on every reply really seems unnecessary. It is O(N), after all.
    reply({ok, queue:len(BufferTail), Msg},
          State#q{message_buffer = BufferTail,
                  next_msg_id = NextId + 1});
Does it depend on how often this command is called?
#2
litaocheng
2010-03-10
Heh, without knowing the internals, when you run into a problem like this all you can do is call it weird.
But why doesn't RabbitMQ keep the queue length in a variable? Calling queue:len/1 on every reply really seems unnecessary. It is O(N), after all.
reply({ok, queue:len(BufferTail), Msg},
      State#q{message_buffer = BufferTail,
              next_msg_id = NextId + 1});
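#1
iso1600
2010-03-10
Other places besides erlang:length() probably show similar behaviour, so patching this one spot treats the symptom rather than the cause.
Wouldn't it be better to solve this by optimizing hibernate itself?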