- 浏览: 306388 次
文章分类
最新评论
-
shoru:
最新的netbeans scala插件无法安装到6.9中呀,求 ...
Scala-对Java的修正和超越-Presentation -
alanwu:
night_stalker 写道讲一次多少钱啊,能透露下么…… ...
Scala-对Java的修正和超越-Presentation -
wzjin:
暂时没有考虑学习,等有一定市场再看!
Scala-对Java的修正和超越-Presentation -
dcaoyuan:
jasspier 写道对scala有点兴趣,有没有入门的推荐教 ...
Scala-对Java的修正和超越-Presentation -
jasspier:
草原兄,听了你的课,对scala有点兴趣,有没有入门的推荐教程 ...
Scala-对Java的修正和超越-Presentation
Learning Coding Parallelization (Was Tim's Erlang Exercise - Round V)
- 博客分类:
- /Erlang
Updated Oct 16: After testing my code on different machines, I found that disk/io performed varyingly, for some very large files, reading file in parallel may cause longer elapsed time (typically on non-server machine, which is not equipped for fast disk/io). So, I added another version tbray4b.erl, in this version, only reading file is not parallalized, all other code is the same. If you'd like to have a test on your machine, please try both.
Well, I think I've learned a lot from doing Tim's exercise, not only the List vs Binary in Erlang, but also computing in parallel. Coding Concurrency is farely easy in Erlang, but coding Parallelization is not only about the Languages, it's also a real question.
I wrote tbray3.erl in The Erlang Way (Was Tim Bray's Erlang Exercise - Round IV) and got a fairly good result by far on my 2-core MacBook. But things always are a bit complex. As Steve pointed in the comment, when he tried tbray3.erl on his 8-core linux box:
"I ran it in a loop 10 times, and the best time I saw was 13.872 sec, and user/CPU time was only 16.150 sec, so it’s apparently not using the multiple cores very well."
I also encoutered this issue on my 4-CPU Intel Xeon CPU 2.80GHz debian box, it runs even worse (8.420s) than my 2-core MacBook (4.483s).
I thought about my code a while, and found that my code seems spawning too many processes for scan_chunk, as the scan_chunk's performance has been improved a lot, each process will finish its task very quickly, too quick to the file reading, the inceasing CPUs have no much chance to play the game, the cycled 'reading'-'spawning scan process' is actually almost sequential now, there has been very few simultaneously alive scanning processes. I think I finally meet the file reading bound.
But wait, as I claimed before, that reading file to memory is very fast in Erlang, for a 200M log file, it takes less than 800ms. The time elapsed for tbray3.erl is about 4900ms, far away from 800ms, why I say the file reading is the bound now?
The problem here is: since I suspect the performance of traversing binary byte by byte, I choose to convert binary to list to scan the world. Per my testing results, list is better than binary when is not too longer, in many cases, not longer than several KBytes. And, to make the code clear and readable, I also choose splitting big binary when read file in the meanwhile, so, I have to read file in pieces of no longer than n KBytes. For a very big file, the reading procedure is broken to several ten-thousands steps, which finally cause the whole file reading time elapsed is bit long. That's bad.
So, I decide to write another version, which will read file in parallel (Round III), and split each chunk on lastest new-line (Round II), scan the words using pattern match (Round IV), and yes, I'll use binary instead of list this time, try to solve the worse performance of binary-traverse by parallel, on multiple cores.
The result is interesting, it's the first time I achieved around 10 sec in my 2-core MacBook when use binary match only, and it's also the first time, on my dummy 4-CPU Intel Xeon CPU 2.80GHz debian box, I got better result than my MacBook.
(Updated Oct 15: Steve run the code on his 8-core 2.33 GHz Intel Xeon Linux box, with the best time was 4.920 sec, which was exactly 100% speedup to my 4-core box (Although, they are two different machines, we can not compare the results linearly) :
"the best time I saw for your newest version was 4.920 sec on my 8-core Linux box. Fast! However, user time was only 14.751 sec, so I’m not sure it’s using all the cores that well. Perhaps you’re getting down to where I/O is becoming a more significant factor."
Please see Steve's One More Erlang Wide Finder and his widefinder attempts.)
Result on 2.0GHz 2-core MacBook:
$ time erl -smp -noshell -run tbray4_bin start o1000k.ap 4 -s erlang halt 8900 : 2006/09/29/Dynamic-IDE 2000 : 2006/07/28/Open-Data 1300 : 2003/07/25/NotGaming 800 : 2003/10/16/Debbie 800 : 2003/09/18/NXML 800 : 2006/01/31/Data-Protection 700 : 2003/06/23/SamsPie 600 : 2006/09/11/Making-Markup 600 : 2003/02/04/Construction 600 : 2005/11/03/Cars-and-Office-Suites Time: 10375.53 ms real 0m10.788s user 0m11.216s sys 0m3.851s
Result on 4-CPU Intel Xeon CPU 2.80GHz debian box,:
# When process number is set to 20: $ time erl -smp -noshell -run tbray4_bin start o1000k.ap 20 -s erlang halt real 0m9.894s user 0m20.521s sys 0m1.668s # When process number is set to 1: $ time erl -smp -noshell -run tbray4_bin start o1000k.ap 1 -s erlang halt real 0m28.193s user 0m27.218s sys 0m0.984s # On a 940M 5 million lines log file: $ time erl -smp -noshell -run tbray4_bin start o5000k.ap 400 -s erlang halt 44500 : 2006/09/29/Dynamic-IDE 10000 : 2006/07/28/Open-Data 6500 : 2003/07/25/NotGaming 4000 : 2003/10/16/Debbie 4000 : 2003/09/18/NXML 4000 : 2006/01/31/Data-Protection 3500 : 2003/06/23/SamsPie 3000 : 2006/09/11/Making-Markup 3000 : 2003/02/04/Construction 3000 : 2005/11/03/Cars-and-Office-Suites Time: 66456.95 ms real 1m6.767s user 2m7.512s sys 0m8.489s
On the 4-CPU linux box, comparing the elapsed time between ProcNum = 20 and ProcNum = 1, the elapsed time of parallelized one was only 35% of un-parallelized one, speedup about 185%. The ratio was almost the same as my pread_file.erl testing on the same machine.
It's actually a combination of code in my four previous blogs. Although the performance is not so good as tbray3.erl on my MacBook, but I'm happy that this version is a fully parallelized one, from reading file, scanning words etc. it should scale better than all my previous versions.
The code: tbray4.erl
-module(tbray4). -compile([native]). -export([start/1, start/2]). -include_lib("kernel/include/file.hrl"). start([FileName, ProcNum]) when is_list(ProcNum) -> start(FileName, list_to_integer(ProcNum)). start(FileName, ProcNum) -> Start = now(), Main = self(), Counter = spawn(fun () -> count_loop(Main) end), Collector = spawn(fun () -> collect_loop(Counter) end), pread_file(FileName, ProcNum, Collector), %% don't terminate, wait here, until all tasks done. receive stop -> io:format("Time: ~10.2f ms~n", [timer:now_diff(now(), Start) / 1000]) end. pread_file(FileName, ProcNum, Collector) -> ChunkSize = get_chunk_size(FileName, ProcNum), pread_file_1(FileName, ChunkSize, ProcNum, Collector). pread_file_1(FileName, ChunkSize, ProcNum, Collector) -> [spawn(fun () -> Length = if I == ProcNum - 1 -> ChunkSize * 2; %% lastest chuck true -> ChunkSize end, {ok, File} = file:open(FileName, [read, binary]), {ok, Bin} = file:pread(File, ChunkSize * I, Length), {Data, Tail} = split_on_last_newline(Bin), Collector ! {seq, I, Data, Tail}, file:close(File) end) || I <- lists:seq(0, ProcNum - 1)], Collector ! {chunk_num, ProcNum}. collect_loop(Counter) -> collect_loop_1([], <<>>, -1, Counter). collect_loop_1(Chunks, PrevTail, LastSeq, Counter) -> receive {chunk_num, ChunkNum} -> Counter ! {chunk_num, ChunkNum}, collect_loop_1(Chunks, PrevTail, LastSeq, Counter); {seq, I, Data, Tail} -> SortedChunks = lists:keysort(1, [{I, Data, Tail} | Chunks]), {Chunks1, PrevTail1, LastSeq1} = process_chunks(SortedChunks, [], PrevTail, LastSeq, Counter), collect_loop_1(Chunks1, PrevTail1, LastSeq1, Counter) end. count_loop(Main) -> count_loop_1(Main, dict:new(), undefined, 0). count_loop_1(Main, Dict, ChunkNum, ChunkNum) -> print_result(Dict), Main ! stop; count_loop_1(Main, Dict, ChunkNum, ProcessedNum) -> receive {chunk_num, ChunkNumX} -> count_loop_1(Main, Dict, ChunkNumX, ProcessedNum); {dict, DictX} -> Dict1 = dict:merge(fun (_, V1, V2) -> V1 + V2 end, Dict, DictX), count_loop_1(Main, Dict1, ChunkNum, ProcessedNum + 1) end. process_chunks([], ChunkBuf, PrevTail, LastSeq, _) -> {ChunkBuf, PrevTail, LastSeq}; process_chunks([{I, Data, Tail}=Chunk|T], ChunkBuf, PrevTail, LastSeq, Counter) -> case LastSeq + 1 of I -> spawn(fun () -> Counter ! {dict, scan_chunk(<<PrevTail/binary, Data/binary>>)} end), process_chunks(T, ChunkBuf, Tail, I, Counter); _ -> process_chunks(T, [Chunk | ChunkBuf], PrevTail, LastSeq, Counter) end. print_result(Dict) -> SortedList = lists:reverse(lists:keysort(2, dict:to_list(Dict))), [io:format("~b\t: ~s~n", [V, K]) || {K, V} <- lists:sublist(SortedList, 10)]. get_chunk_size(FileName, ProcNum) -> {ok, #file_info{size=Size}} = file:read_file_info(FileName), Size div ProcNum. split_on_last_newline(Bin) -> split_on_last_newline_1(Bin, size(Bin)). split_on_last_newline_1(Bin, Offset) when Offset > 0 -> case Bin of <<Data:Offset/binary,$\n,Tail/binary>> -> {Data, Tail}; _ -> split_on_last_newline_1(Bin, Offset - 1) end; split_on_last_newline_1(Bin, _) -> {Bin, <<>>}. scan_chunk(Bin) -> scan_chunk_1(Bin, 0, dict:new()). scan_chunk_1(Bin, Offset, Dict) when Offset =< size(Bin) - 34 -> case Bin of <<_:Offset/binary,"GET /ongoing/When/",_,_,_,$x,$/,Y1,Y2,Y3,Y4,$/,M1,M2,$/,D1,D2,$/,Rest/binary>> -> case match_until_space_newline(Rest, 0) of {Rest1, <<>>} -> scan_chunk_1(Rest1, 0, Dict); {Rest1, Word} -> Key = <<Y1,Y2,Y3,Y4,$/,M1,M2,$/,D1,D2,$/, Word/binary>>, scan_chunk_1(Rest1, 0, dict:update_counter(Key, 1, Dict)) end; _ -> scan_chunk_1(Bin, Offset + 1, Dict) end; scan_chunk_1(_, _, Dict) -> Dict. match_until_space_newline(Bin, Offset) when Offset < size(Bin) -> case Bin of <<Word:Offset/binary,$ ,Rest/binary>> -> {Rest, Word}; <<_:Offset/binary,$.,Rest/binary>> -> {Rest, <<>>}; <<_:Offset/binary,10,Rest/binary>> -> {Rest, <<>>}; _ -> match_until_space_newline(Bin, Offset + 1) end; match_until_space_newline(_, _) -> {<<>>, <<>>}.
=====> Updated Oct 16:After testing my code on different machines, I found that disk/io performed varyingly, for some very large files, reading file in parallel may cause longer elapsed time (typically on non-server machine, which is not equipped for fast disk/io). So, I wrote another version: tbray4b.erl, in this version, only reading file is not parallalized, all other code is the same. Here's a result for this version on a 940M file with 5 million lines, with ProcNum set to 200 and 400)
# On 2-core MacBook: $ time erl -smp -noshell -run tbray4b start o5000k.ap 200 -s erlang halt real 0m50.498s user 0m49.746s sys 0m11.979s # On 4-cpu linux box: $ time erl -smp -noshell -run tbray4b start o5000k.ap 400 -s erlang halt real 1m2.136s user 1m59.907s sys 0m7.960s
The code: tbray4b.erl
-module(tbray4b). -compile([native]). -export([start/1, start/2]). -include_lib("kernel/include/file.hrl"). start([FileName, ProcNum]) when is_list(ProcNum) -> start(FileName, list_to_integer(ProcNum)). start(FileName, ProcNum) -> Start = now(), Main = self(), Counter = spawn(fun () -> count_loop(Main) end), Collector = spawn(fun () -> collect_loop(Counter) end), read_file(FileName, ProcNum, Collector), %% don't terminate, wait here, until all tasks done. receive stop -> io:format("Time: ~10.2f ms~n", [timer:now_diff(now(), Start) / 1000]) end. read_file(FileName, ProcNum, Collector) -> ChunkSize = get_chunk_size(FileName, ProcNum), {ok, File} = file:open(FileName, [raw, binary]), read_file_1(File, ChunkSize, 0, Collector). read_file_1(File, ChunkSize, I, Collector) -> case file:read(File, ChunkSize) of eof -> file:close(File), Collector ! {chunk_num, I}; {ok, Bin} -> spawn(fun () -> {Data, Tail} = split_on_last_newline(Bin), Collector ! {seq, I, Data, Tail} end), read_file_1(File, ChunkSize, I + 1, Collector) end. collect_loop(Counter) -> collect_loop_1([], <<>>, -1, Counter). collect_loop_1(Chunks, PrevTail, LastSeq, Counter) -> receive {chunk_num, ChunkNum} -> Counter ! {chunk_num, ChunkNum}, collect_loop_1(Chunks, PrevTail, LastSeq, Counter); {seq, I, Data, Tail} -> SortedChunks = lists:keysort(1, [{I, Data, Tail} | Chunks]), {Chunks1, PrevTail1, LastSeq1} = process_chunks(SortedChunks, [], PrevTail, LastSeq, Counter), collect_loop_1(Chunks1, PrevTail1, LastSeq1, Counter) end. count_loop(Main) -> count_loop_1(Main, dict:new(), undefined, 0). count_loop_1(Main, Dict, ChunkNum, ChunkNum) -> print_result(Dict), Main ! stop; count_loop_1(Main, Dict, ChunkNum, ProcessedNum) -> receive {chunk_num, ChunkNumX} -> count_loop_1(Main, Dict, ChunkNumX, ProcessedNum); {dict, DictX} -> Dict1 = dict:merge(fun (_, V1, V2) -> V1 + V2 end, Dict, DictX), count_loop_1(Main, Dict1, ChunkNum, ProcessedNum + 1) end. process_chunks([], ChunkBuf, PrevTail, LastSeq, _) -> {ChunkBuf, PrevTail, LastSeq}; process_chunks([{I, Data, Tail}=Chunk|T], ChunkBuf, PrevTail, LastSeq, Counter) -> case LastSeq + 1 of I -> spawn(fun () -> Counter ! {dict, scan_chunk(<<PrevTail/binary, Data/binary>>)} end), process_chunks(T, ChunkBuf, Tail, I, Counter); _ -> process_chunks(T, [Chunk | ChunkBuf], PrevTail, LastSeq, Counter) end. print_result(Dict) -> SortedList = lists:reverse(lists:keysort(2, dict:to_list(Dict))), [io:format("~b\t: ~s~n", [V, K]) || {K, V} <- lists:sublist(SortedList, 10)]. get_chunk_size(FileName, ProcNum) -> {ok, #file_info{size=Size}} = file:read_file_info(FileName), Size div ProcNum. split_on_last_newline(Bin) -> split_on_last_newline_1(Bin, size(Bin)). split_on_last_newline_1(Bin, Offset) when Offset > 0 -> case Bin of <<Data:Offset/binary,$\n,Tail/binary>> -> {Data, Tail}; _ -> split_on_last_newline_1(Bin, Offset - 1) end; split_on_last_newline_1(Bin, _) -> {Bin, <<>>}. scan_chunk(Bin) -> scan_chunk_1(Bin, 0, dict:new()). scan_chunk_1(Bin, Offset, Dict) when Offset =< size(Bin) - 34 -> case Bin of <<_:Offset/binary,"GET /ongoing/When/",_,_,_,$x,$/,Y1,Y2,Y3,Y4,$/,M1,M2,$/,D1,D2,$/,Rest/binary>> -> case match_until_space_newline(Rest, 0) of {Rest1, <<>>} -> scan_chunk_1(Rest1, 0, Dict); {Rest1, Word} -> Key = <<Y1,Y2,Y3,Y4,$/,M1,M2,$/,D1,D2,$/, Word/binary>>, scan_chunk_1(Rest1, 0, dict:update_counter(Key, 1, Dict)) end; _ -> scan_chunk_1(Bin, Offset + 1, Dict) end; scan_chunk_1(_, _, Dict) -> Dict. match_until_space_newline(Bin, Offset) when Offset < size(Bin) -> case Bin of <<Word:Offset/binary,$ ,Rest/binary>> -> {Rest, Word}; <<_:Offset/binary,$.,Rest/binary>> -> {Rest, <<>>}; <<_:Offset/binary,10,Rest/binary>> -> {Rest, <<>>}; _ -> match_until_space_newline(Bin, Offset + 1) end; match_until_space_newline(_, _) -> {<<>>, <<>>}.
=======
发表评论
-
Updated "From Rails to Erlyweb" Part I and Part II
2007-07-30 00:13 863I've updated "From Rails t ... -
From Rails to Erlyweb - Part IV
2007-07-30 00:14 912IV. Support Mysql Spatial Exten ... -
HTML Entity Refs and xmerl
2007-07-30 00:14 965According to [erlang-bugs] xme ... -
From Rails to Erlyweb - Part III
2007-07-30 00:14 8773. The Magic Behind Erlyweb Wit ... -
From Rails to Erlyweb - Part I
2007-08-05 06:34 1232Updated July 20 2007: new param ... -
ErlyBird 0.12.0 released - Erlang IDE based on NetBeans
2007-08-08 18:40 917I'm pleased to announce ErlyBir ... -
A Simple POET State Machine Accepting SAX Events to Build Plain Old Erlang Term
2007-08-20 07:19 995Per previous blogs: A Simple ... -
A Simple XML State Machine Accepting SAX Events to Build xmerl Compitable XML Tree: icalendar demo
2007-08-20 07:23 1136xmerl is a full XML functionali ... -
recbird - An Erlang Dynamic Record Inferring Parse Transform
2007-08-23 08:31 1057You should have read Yariv's re ... -
From Rails to Erlyweb - Part II
2007-08-23 21:16 1268Updated Aug 23: Please see Fro ... -
From Rails to Erlyweb - Part II Manage Project - Reloaded
2007-08-24 19:20 1348The migrating from Rails to Erl ... -
Parse JSON to xmerl Compitable XML Tree via A Simple XML State Machine
2007-09-02 20:00 1534Updated Aug 16: Fix bugs when j ... -
Tim Bray's Erlang Exercise on Large Dataset Processing
2007-10-10 20:12 1244Updated Oct 10: pread_file/5 s ... -
Tim Bray's Erlang Exercise on Large Dataset Processing - Round II
2007-10-15 15:37 1621Updated Oct 09: Added more ben ... -
Reading File in Parallel in Erlang (Was Tim Bray's Erlang Exercise - Round III)
2007-10-15 19:56 1915My first solution for Tim's ex ... -
The Erlang Way (Was Tim Bray's Erlang Exercise - Round IV)
2007-10-17 07:45 1795Playing with Tim's Erlang Exerc ...
相关推荐
网络-coding-v3.0.zip网络-coding-v3.0.zip网络-coding-v3.0.zip网络-coding-v3.0.zip网络-coding-v3.0.zip网络-coding-v3.0.zip网络-coding-v3.0.zip网络-coding-v3.0.zip
开源很好的入门资料,带你进入开源的世界
#_--_coding_UTF-8_--_import_sys,os,dlib,glob,nu_PythonFace
模型学习(model-based learning)是指在强化学习中建立一个模型来模拟环境的动态特性。这样的模型可以预测智能体执行某个动作之后环境的状态转移和奖励。结合模型学习与Tile Coding编码的Actor-Critic方法能够同时...
《AISG天线端口颜色编码v3.1》标准详细解读 AISG(Antenna Interface Standards Group)是一个致力于制定天线接口标准的组织,其发布的《AISG天线端口颜色编码v3.1》是通信行业中一个重要的规范。此标准主要规定了...
在这个“coding-and-decodin-of-convolutional.rar”压缩包中,包含的是利用MATLAB实现的卷积码编译码程序,特别是采用了维特比算法进行译码。 维特比算法是解卷积码的最常用方法,由Arthur Viterbi在1967年提出。...
已经打包好的,可直接导入burp使用
lambda编码回合评估器 该项目的目标是实现一个代码评估器,组织可以使用该评估器来自动进行轮回访谈的编码。 这是受到Coursera的Scala Progfun代码评估器的... lambda-coding-round-evaluator用于自动执行编码回合提
13818-1-GENERIC CODING OF MOVING PICTURES AND ASSOCIATED AUDIO-SYSTEMS Mpeg标准,这是第一部分,另外两部分见我的下载资源
构建和安装 ./gradlew build installDist跑步 ./build/install/coding-exercise-java-robot/bin/coding-exercise-java-robot a.txt b.txt c.txtSimulating file: a.txtVector<0>Simulating file: b.txtVector<0>...
100_Days_of_Artificial_Intelligence_Coding。人工智能100_100-Days-Of-AI-Code
A visual step-by-step guide to writing code in Python. Beginners and experienced ...With coding theory interwoven into the instructions for building each game, learning coding is made effortless and fun.
取回奖励带回家测试-Mobile-Engineer-实习生 Java中的本机Android应用程序,可从检索数据 要求 显示按“ listId”分组的所有项目\ 显示时,先按“ listId”对结果进行排序,然后按“名称”对结果进行排序\ ...
《Zen.Coding-Dreamweaver.v0.7》是一款针对Dreamweaver的高效编码插件,旨在提升开发者的代码编写效率。Zen Coding,现更名为Emmet,是前端开发者中的一个广受欢迎的工具,它通过简短的缩写来快速生成复杂的HTML和...
burpsuite分块传输插件,一键生成分块传输请求,用于bypass waf等
《Everything You Need to Ace Computer Science and Coding》是一本全面、系统的学习指南,涵盖计算机科学和编程的所有方面,为中学生提供了详细的学习资源。 计算机系统 计算机系统是计算机科学的基础,包括硬件...
《sei-cert-cpp-coding-standard-2016-v01_c_安全编程_sei-cert-c-coding_S》这份文档是SEI CERT C++编码标准的2016年版本,专注于C和C++编程语言的安全实践。SEI CERT(Software Engineering Institute's Computer ...