- 浏览: 307316 次
文章分类
最新评论
-
shoru:
最新的netbeans scala插件无法安装到6.9中呀,求 ...
Scala-对Java的修正和超越-Presentation -
alanwu:
night_stalker 写道讲一次多少钱啊,能透露下么…… ...
Scala-对Java的修正和超越-Presentation -
wzjin:
暂时没有考虑学习,等有一定市场再看!
Scala-对Java的修正和超越-Presentation -
dcaoyuan:
jasspier 写道对scala有点兴趣,有没有入门的推荐教 ...
Scala-对Java的修正和超越-Presentation -
jasspier:
草原兄,听了你的课,对scala有点兴趣,有没有入门的推荐教程 ...
Scala-对Java的修正和超越-Presentation
Tim Bray's Erlang Exercise on Large Dataset Processing - Round II
- 博客分类:
- /Erlang
Updated Oct 09: Added more benchmark results under linux on other machines.
Updated Oct 07: More concise code.
Updated Oct 06: Fixed bugs: 1. Match "GET /ongoing/When/" instead of "/ongoing/When/"; 2. split_on_last_newline should not reverse Tail.
Backed from a short vacation, and sit down in front of my computer, I'm thinking about Tim Bray's exercise again.
As I realized, the most expensive procedure is splitting dataset to lines. To get the multiple-core benefit, we should parallelize this procedure instead of reading file to binary or macthing process only.
In my previous solution, there are at least two issues:
- Since the file reading is fast in Erlang, then, parallelizing the file reading is not much helpful.
- The buffered_read actually can be merged with the buffered file reading.
And, Per's solution parallelizes process_match procedure only, based on a really fast divide_to_lines, but with hacked binary matching syntax.
After a couple of hours working, I finially get the second version of tbray.erl (with some code from Per's solution).
- Read file to small pieces of binary (about 4096 bytes each chunk), then convert to list.
- Merge the previous tail for each chunk, search this chunk from tail, find the last new line mark, split this chunk to line-bounded data part, and tail part for next chunk.
- The above steps are difficult to parallelize. If we try, there will be about 30 more LOC, and not so readable.
- Spawn a new process at once to split line-bounded chunk to lines, process match and update dict.
- Thus we can go on reading file with non-stop.
- A collect_loop will receive dicts from each process, and merge them.
What I like of this version is, it scales on mutiple-core almost linearly! On my 2.0G 2-core MacBook, it took about 13.522 seconds with non-smp, 7.624 seconds with smp enabled (for a 200M data file, with about 50,000 processes spawned). The 2-core smp result achieves about 77% faster than non-smp result. I'm not sure how will it achieve on an 8-core computer, but we'll finally reach the limit due to the un-parallelized procedures.
The Erlang time results:
$ erlc tbray.erl $ time erl -noshell -run tbray start o1000k.ap -s erlang halt > /dev/null real 0m13.522s user 0m12.265s sys 0m1.199s $ erlc -smp tbray.erl $ time erl -smp +P 60000 -noshell -run tbray start o1000k.ap -s erlang halt > /dev/null real 0m7.624s user 0m13.302s sys 0m1.602s # For 5 million lines, 958.4M size: $ time erl -smp +P 300000 -noshell -run tbray start o5000k.ap -s erlang halt > /dev/null real 0m37.085s user 1m5.605s sys 0m7.554s
And the original Tim's Ruby version:
$ time ruby tbray.rb o1000k.ap > /dev/null real 0m2.447s user 0m2.123s sys 0m0.306s # For 5 million lines, 958.4M size: $ time ruby tbray.rb o5000k.ap > /dev/null real 0m12.115s user 0m10.494s sys 0m1.473s
Erlang time result on 2-core 1.86GHz CPU RedHat linux box, with kernel:
Linux version 2.6.18-1.2798.fc6 (brewbuilder@hs20-bc2-4.build.redhat.com) (gcc v
ersion 4.1.1 20061011 (Red Hat 4.1.1-30)) #1 SMP Mon Oct 16 14:37:32 EDT 2006
is 7.7 seconds.
Erlang time result on 2.80GHz 4-cpu xeon debian box, with kernel:
Linux version 2.6.15.4-big-smp-tidy (root@test) (gcc version 4.0.3 20060128 (prerelease) (Debian 4.0
.2-8)) #1 SMP Sat Feb 25 21:24:23 CST 2006
The smp result on this 4-cpu computer is questionable. It speededup only 50% than non-smp, even worse than my 2.0GHz 2-core MacBook. I also tested the Big Bang on this machine, it speedup less than 50% too.
$ erlc tbray.erl $ time erl -noshell -run tbray start o1000k.ap -s erlang halt > /dev/null real 0m22.279s user 0m21.597s sys 0m0.676s $ erlc -smp tbray.erl $ time erl -smp +S 4 +P 60000 -noshell -run tbray start o1000k.ap -s erlang halt > /dev/null real 0m14.765s user 0m28.722s sys 0m0.840s
Notice:
- All tests run several times to have the better result expressed, so, the status of disk/io cache should be near.
- You may need to compile tbray.erl to two different BEAMs, one for smp version, and one for no-smp version.
- If you'd like to process bigger file, you can use +P processNum to get more simultaneously alive Erlang processes. For BUFFER_SIZE=4096, you can set +P arg as FileSize / 4096, or above. From Erlang's Efficiency Guide:
Processes
The maximum number of simultaneously alive Erlang processes is by default 32768. This limit can be raised up to at most 268435456 processes at startup (see documentation of the system flag +P in the erl(1) documentation). The maximum limit of 268435456 processes will at least on a 32-bit architecture be impossible to reach due to memory
To evaluate with smp enable: (Erlang/OTP R11B-5 for Windows may not support smp yet)
erl -smp +P 60000 > tbray:start("o1000k.ap").
The code: (pretty formatted by ErlyBird 0.15.1)
-module(tbray_blog). -compile([native]). -export([start/1]). %% The best Bin Buffer Size is 4096 -define(BUFFER_SIZE, 4096). start(FileName) -> Start = now(), Main = self(), Collector = spawn(fun () -> collect_loop(Main) end), {ok, File} = file:open(FileName, [raw, binary]), read_file(File, Collector), %% don't terminate, wait here, until all tasks done. receive stop -> io:format("Time: ~10.2f ms~n", [timer:now_diff(now(), Start) / 1000]) end. read_file(File, Collector) -> read_file_1(File, [], 0, Collector). read_file_1(File, PrevTail, I, Collector) -> case file:read(File, ?BUFFER_SIZE) of eof -> Collector ! {chunk_num, I}, file:close(File); {ok, Bin} -> {Data, NextTail} = split_on_last_newline(PrevTail ++ binary_to_list(Bin)), spawn(fun () -> Collector ! {dict, scan_lines(Data)} end), read_file_1(File, NextTail, I + 1, Collector) end. split_on_last_newline(List) -> split_on_last_newline_1(lists:reverse(List), []). split_on_last_newline_1(List, Tail) -> case List of [] -> {lists:reverse(List), []}; [$\n|Rest] -> {lists:reverse(Rest), Tail}; [C|Rest] -> split_on_last_newline_1(Rest, [C | Tail]) end. collect_loop(Main) -> collect_loop_1(Main, dict:new(), undefined, 0). collect_loop_1(Main, Dict, ChunkNum, ChunkNum) -> print_result(Dict), Main ! stop; collect_loop_1(Main, Dict, ChunkNum, ProcessedNum) -> receive {chunk_num, ChunkNumX} -> collect_loop_1(Main, Dict, ChunkNumX, ProcessedNum); {dict, DictX} -> Dict1 = dict:merge(fun (_, V1, V2) -> V1 + V2 end, Dict, DictX), collect_loop_1(Main, Dict1, ChunkNum, ProcessedNum + 1) end. print_result(Dict) -> SortedList = lists:reverse(lists:keysort(2, dict:to_list(Dict))), [io:format("~p\t: ~s~n", [V, K]) || {K, V} <- lists:sublist(SortedList, 10)]. scan_lines(List) -> scan_lines_1(List, [], dict:new()). scan_lines_1(List, Line, Dict) -> case List of [] -> match_and_update_dict(lists:reverse(Line), Dict); [$\n|Rest] -> scan_lines_1(Rest, [], match_and_update_dict(lists:reverse(Line), Dict)); [C|Rest] -> scan_lines_1(Rest, [C | Line], Dict) end. match_and_update_dict(Line, Dict) -> case process_match(Line) of false -> Dict; {true, Word} -> dict:update_counter(Word, 1, Dict) end. process_match(Line) -> case Line of [] -> false; "GET /ongoing/When/"++[_,_,_,$x,$/,Y1,Y2,Y3,Y4,$/,M1,M2,$/,D1,D2,$/|Rest] -> case match_until_space(Rest, []) of [] -> false; Word -> {true, [Y1,Y2,Y3,Y4,$/,M1,M2,$/,D1,D2,$/] ++ Word} end; [_|Rest] -> process_match(Rest) end. match_until_space(List, Word) -> case List of [] -> []; [$.|_] -> []; [$ |_] -> lists:reverse(Word); [C|Rest] -> match_until_space(Rest, [C | Word]) end.
Lessons learnt:
- Split large binary to proper size chunks, then convert to list for further processing
- Parallelize the most expensive part (of course)
- We need a new or more complete Efficent Erlang
发表评论
-
Updated "From Rails to Erlyweb" Part I and Part II
2007-07-30 00:13 868I've updated "From Rails t ... -
From Rails to Erlyweb - Part IV
2007-07-30 00:14 916IV. Support Mysql Spatial Exten ... -
HTML Entity Refs and xmerl
2007-07-30 00:14 970According to [erlang-bugs] xme ... -
From Rails to Erlyweb - Part III
2007-07-30 00:14 8853. The Magic Behind Erlyweb Wit ... -
From Rails to Erlyweb - Part I
2007-08-05 06:34 1236Updated July 20 2007: new param ... -
ErlyBird 0.12.0 released - Erlang IDE based on NetBeans
2007-08-08 18:40 922I'm pleased to announce ErlyBir ... -
A Simple POET State Machine Accepting SAX Events to Build Plain Old Erlang Term
2007-08-20 07:19 1001Per previous blogs: A Simple ... -
A Simple XML State Machine Accepting SAX Events to Build xmerl Compitable XML Tree: icalendar demo
2007-08-20 07:23 1140xmerl is a full XML functionali ... -
recbird - An Erlang Dynamic Record Inferring Parse Transform
2007-08-23 08:31 1062You should have read Yariv's re ... -
From Rails to Erlyweb - Part II
2007-08-23 21:16 1276Updated Aug 23: Please see Fro ... -
From Rails to Erlyweb - Part II Manage Project - Reloaded
2007-08-24 19:20 1354The migrating from Rails to Erl ... -
Parse JSON to xmerl Compitable XML Tree via A Simple XML State Machine
2007-09-02 20:00 1538Updated Aug 16: Fix bugs when j ... -
Tim Bray's Erlang Exercise on Large Dataset Processing
2007-10-10 20:12 1253Updated Oct 10: pread_file/5 s ... -
Reading File in Parallel in Erlang (Was Tim Bray's Erlang Exercise - Round III)
2007-10-15 19:56 1920My first solution for Tim's ex ... -
The Erlang Way (Was Tim Bray's Erlang Exercise - Round IV)
2007-10-17 07:45 1800Playing with Tim's Erlang Exerc ... -
Learning Coding Parallelization (Was Tim's Erlang Exercise - Round V)
2007-10-17 21:58 1569Updated Oct 16: After testing ...
相关推荐
之前我们介绍了计算beta-NTI(beta nearest taxon index)来进行群落构建分析。|beta-NTI| >2说明决定性过程主导,其中beta-NTI >2说明OTU的遗传距离发散,为生物交互作用主导,beta-NTI < -2则说明OUT的遗传距离...
根据提供的文件信息,这份BRAY VAL-SRS系列电动执行器说明书包含了该系列产品的详细参数、特点、接线指导、认证信息、安装指南和维护指南。以下是针对这些方面的详细知识点: 1. 产品特点与优势: - 高输出力和关闭...
Self-modeling as a treatment for increasing on-task behavior SELF-MODELING AS A TREATMENT FOR INCREASING ON-TASK BEHAVIOR susan k. clare Clovis Unified School District, Clovis, California ...
We obtain the isoperimetric ...ing recent work of Bray and Morgan on isoperimetric comparison. We then discuss these results in the context of Bray’s isoperimetric approach to the Penrose inequality.
### Bluetooth Application Developer's Guide 关键知识点解析 #### 标题:Bluetooth Application Developer's Guide - **核心主题**:本指南旨在为蓝牙应用开发者提供全面、深入的技术指导。 - **目标受众**:...
免费下载:Marvel Incredible Records Amazing Powers and Astonishing Stats (Melanie Scott, Adam Bray,
### 代数拓扑学概览 #### 一、引言 代数拓扑是数学的一个分支,它利用代数工具研究拓扑空间的性质。本书由Allen Hatcher撰写,是一本广泛使用的教材,适用于拥有坚实数学背景的研究生以及其它科学领域的专业人士。...
《蓝牙应用开发指南》是一本关于蓝牙技术开发的专业书籍,由David Kammer、Gordon McNutt、Brian Senese和Jennifer Bray共同编写,Syngress Publishing出版。本书旨在为读者提供全面而深入的蓝牙应用开发知识,并...
- **开发者**:XML规范是由W3C(万维网联盟)制定的,主要贡献者包括Tim Bray、Jean Paoli以及C. M. Sperberg-McQueen等专家。 - **开发目标**:XML的设计旨在简化SGML(标准通用标记语言),使其更加适用于互联网上...
它解析相对URL(例如Tim Bray的“进行中”看到的URL)。 它可以正确处理XML名称空间(包括在非常规feed中为主要feed元素定义非默认名称空间的XML名称空间)。 安装 npm install feedparser 用法 本示例只是为了...
- **开发者**: Tim Bray (Textuality and Netscape)、Jean Paoli (Microsoft) 和 C. M. Sperberg-McQueen (University of Illinois at Chicago)。 - **目标**: 设计一种简单且易于实现的标准,同时确保与现有的SGML...
- **开发者与目标**:XML是由Tim Bray(Textuality and Netscape)、Jean Paoli(Microsoft)以及C. M. Sperberg-McQueen(University of Illinois at Chicago)共同发起的项目。其主要目标在于提供一种在Web上能够...
Self-modeling as a treatment for increasing on-task behavior SELF-MODELING AS A TREATMENT FOR INCREASING ON-TASK BEHAVIOR susan k. clare Clovis Unified School District, Clovis, California ...
它由天文学家和软件开发者Tim Bray发起,并以其强大的类型安全、API简洁和高性能而受到开发者的欢迎。XOM的核心设计理念是提供一个既简单又强大的接口来处理XML文档,使得XML解析、操作和序列化变得更加直观。 在...
Web 2.0人物访谈录。Wiley Web 2.0 Heroes Interviews ...19 Bob Brewin & Tim Bray: Sun Microsystems . . . . . . . . . . . . . . 229 20 Michele Turner:Adobe Systems Incorporated . . . . . . . . . . . . 243
Tim Bray在他的博客“Ongoing”中提及Adam Bosworth,认为他是全球顶尖20位软件人士之一。Bosworth因其在Quattro Pro、Microsoft Access和Internet Explorer 4等项目中的贡献而闻名,之后在BEA和Google担任重要职务...
- **Tim Bray**:Textuality 和 Netscape 的成员 (tbray@textuality.com) - **Jean Paoli**:Microsoft 的成员 (jeanpa@microsoft.com) - **C.M. Sperberg-McQueen**:伊利诺伊大学芝加哥分校的成员 (cmsmcq@uic.edu...
Bertha是用于高端存储子系统的I / O基准测试工具。 基于Tim Bray的Bonnie基准测试工具,并进行了三项增强:1)生成更多的I / O; 2)提供重播I / O事务的工具;以及3)广泛的指标报告。