Parsing CSV in Erlang (repost)

mryufeng, posted 2008-03-05, last modified 2009-03-03:

Original article (may require a proxy to reach): http://ppolv.wordpress.com/2008/02/25/parsing-csv-in-erlang/

So I need to parse a CSV file in Erlang.

Although CSV files have a very simple structure, simply calling string:tokens(Line, ",") for each line in the file won't do the trick, as there could be quoted fields that span more than one line and contain commas or escaped quotes.

A detailed discussion of string parsing in Erlang can be found in the excellent article "Parsing text and binary files with Erlang" by Joel Reymont. And the very first example is parsing a CSV file! But being the first example, it was written with simplicity rather than completeness in mind, so it doesn't take quoted or multi-line fields into account.

Now, we will write a simple parser for RFC 4180 documents (which is way cooler than parsing plain old CSV files ;-) ). As the format is really simple, we won't use yecc or leex, but parse the input file by hand using binaries, lists and lots of pattern matching.

Our goals are:

* Recognize fields delimited by commas, records delimited by line breaks
* Recognize quoted fields
* Be able to parse quotes, commas and line breaks inside quoted fields
* Ensure that all records have the same number of fields
* Provide a fold-like callback interface, in addition to a return-all-records-in-file function

What the parser won't do:

* Unicode. We will treat the file as binary and consider each character as ASCII, 1 byte wide. To parse Unicode files, you can use xmerl_ucs:from_utf8/1, and then process the resulting list instead of the raw binary.

A quick look suggests that the parser will pass through the following states:

[Figure: CSV parsing states]

* Field start: at the beginning of each field.
  Whitespace must be kept for unquoted fields, but any whitespace before a quoted field is discarded.
* Normal: an unquoted field.
* Quoted: inside a quoted field.
* Post Quoted: after a quoted field. Whitespace could appear between a quoted field and the next field/record, and should be discarded.

Parsing state

While parsing, we will use the following record to keep track of the current state:

    %% Module header: not shown in the original post; the name ecsv is assumed.
    -module(ecsv).
    -export([parse_file/1, parse_file/3, parse/1, parse/3]).

    -record(ecsv,{
       state = field_start,  %%field_start|normal|quoted|post_quoted
       cols = undefined,     %%how many fields per record
       current_field = [],
       current_record = [],
       fold_state,
       fold_fun              %%user supplied fold function
       }).

API functions

    parse_file(FileName,InitialState,Fun) ->
        {ok, Binary} = file:read_file(FileName),
        parse(Binary,InitialState,Fun).

    parse_file(FileName) ->
        {ok, Binary} = file:read_file(FileName),
        parse(Binary).

    parse(X) ->
        R = parse(X,[],fun(Fold,Record) -> [Record|Fold] end),
        lists:reverse(R).

    parse(X,InitialState,Fun) ->
        do_parse(X,#ecsv{fold_state=InitialState,fold_fun = Fun}).

The three-argument functions provide the fold-like interface, while the single-argument ones return a list with all the records in the file.

Parsing

Now the fun part! The transitions (State x Input -> NewState) are almost 1:1 derived from the diagram, with minor changes (like the handling of field and record delimiters, common to both the normal and post_quoted states). Inside a quoted field, a double quote must be escaped by preceding it with another double quote. It's really easy to distinguish this case by matching against <<$",$",_/binary>>, a sort of "lookahead" in yacc's lexicon.
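The escaped-quote lookahead described above can be tried in isolation. The following is only a minimal sketch, not the post's parser: the module name quote_lookahead_demo and the function read_quoted/1 are made up for illustration, and the opening quote is assumed to have been consumed already.

```erlang
-module(quote_lookahead_demo).
-export([read_quoted/1]).

%% Hypothetical helper (not part of the original post).
%% Reads the body of a quoted field whose opening quote has already
%% been consumed. Returns {FieldString, RestOfInput}.
read_quoted(Bin) ->
    read_quoted(Bin, []).

%% Two quotes in a row: an escaped quote. Keep one and stay in the field.
read_quoted(<<$", $", Rest/binary>>, Acc) ->
    read_quoted(Rest, [$" | Acc]);
%% A single quote: the field ends here.
read_quoted(<<$", Rest/binary>>, Acc) ->
    {lists:reverse(Acc), Rest};
%% Any other byte belongs to the field, commas and line breaks included.
read_quoted(<<C, Rest/binary>>, Acc) ->
    read_quoted(Rest, [C | Acc]);
%% The input ended before the closing quote.
read_quoted(<<>>, _Acc) ->
    erlang:error(unclosed_quote).
```

Clause order does the work here: the two-quote clause is tried before the one-quote clause, so a doubled quote is never mistaken for the end of the field. For example, quote_lookahead_demo:read_quoted(<<"a\"\"b\",c">>) returns {"a\"b", <<",c">>}.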
    %% --------- Field_start state ---------------------
    %%whitespace, loop in field_start state
    do_parse(<<32,Rest/binary>>,S = #ecsv{state=field_start,current_field=Field})->
        do_parse(Rest,S#ecsv{current_field=[32|Field]});

    %%its a quoted field, discard previous whitespaces
    do_parse(<<$",Rest/binary>>,S = #ecsv{state=field_start})->
        do_parse(Rest,S#ecsv{state=quoted,current_field=[]});

    %%anything else, is a unquoted field
    do_parse(Bin,S = #ecsv{state=field_start})->
        do_parse(Bin,S#ecsv{state=normal});

    %% --------- Quoted state ---------------------
    %%Escaped quote inside a quoted field
    do_parse(<<$",$",Rest/binary>>,S = #ecsv{state=quoted,current_field=Field})->
        do_parse(Rest,S#ecsv{current_field=[$"|Field]});

    %%End of quoted field
    do_parse(<<$",Rest/binary>>,S = #ecsv{state=quoted})->
        do_parse(Rest,S#ecsv{state=post_quoted});

    %%Anything else inside a quoted field
    do_parse(<<X,Rest/binary>>,S = #ecsv{state=quoted,current_field=Field})->
        do_parse(Rest,S#ecsv{current_field=[X|Field]});

    do_parse(<<>>, #ecsv{state=quoted})->
        throw({ecsv_exception,unclosed_quote});

    %% --------- Post_quoted state ---------------------
    %%consume whitespaces after a quoted field
    do_parse(<<32,Rest/binary>>,S = #ecsv{state=post_quoted})->
        do_parse(Rest,S);

    %%---------Comma and New line handling.
    %%---------Common code for post_quoted and normal state---

    %%EOF in a new line, return the records
    do_parse(<<>>, #ecsv{current_record=[],fold_state=State})->
        State;

    %%EOF in the last line, add the last record and continue
    %%(pass the empty binary on, so that the EOF clause above matches)
    do_parse(<<>>,S)->
        do_parse(<<>>,new_record(S));

    %% skip carriage return (windows files uses CRLF)
    do_parse(<<$\r,Rest/binary>>,S = #ecsv{})->
        do_parse(Rest,S);

    %% new record
    do_parse(<<$\n,Rest/binary>>,S = #ecsv{}) ->
        do_parse(Rest,new_record(S));

    %% field delimiter: close the current field and start a new one
    do_parse(<<$, ,Rest/binary>>,S = #ecsv{current_field=Field,current_record=Record})->
        do_parse(Rest,S#ecsv{state=field_start,
                             current_field=[],
                             current_record=[lists:reverse(Field)|Record]});

    %%A double quote in any other place than the already managed is an error
    do_parse(<<$",_Rest/binary>>, #ecsv{})->
        throw({ecsv_exception,bad_record});

    %%Anything other than whitespace or line ends in post_quoted state is an error
    do_parse(<<_X,_Rest/binary>>, #ecsv{state=post_quoted})->
        throw({ecsv_exception,bad_record});

    %%Accumulate Field value
    do_parse(<<X,Rest/binary>>,S = #ecsv{state=normal,current_field=Field})->
        do_parse(Rest,S#ecsv{current_field=[X|Field]}).

Record assembly and callback

Convert each record to a tuple, and check that it has the same number of fields as the previous records. Invoke the callback function with the new record and the previous state.

    %%check the record size against the previous, and update state.
    new_record(S=#ecsv{cols=Cols,current_field=Field,current_record=Record,fold_state=State,fold_fun=Fun}) ->
        NewRecord = list_to_tuple(lists:reverse([lists:reverse(Field)|Record])),
        if
            (tuple_size(NewRecord) =:= Cols) or (Cols =:= undefined) ->
                NewState = Fun(State,NewRecord),
                S#ecsv{state=field_start,cols=tuple_size(NewRecord),
                       current_record=[],current_field=[],fold_state=NewState};
            (tuple_size(NewRecord) =/= Cols) ->
                throw({ecsv_exception,bad_record_size})
        end.

Final notes

We used a single function, do_parse/2, with many clauses to do the parsing.
In a more complex scenario, you will probably use different functions for different sections of the grammar you are parsing. You could also first tokenize the input and then parse the resulting token stream; this could make your work simpler even if you aren't using a parser generator like yecc (this is the approach I'm using to parse LDAP filters).
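As a usage note for the fold-like interface from the API section: the callback is invoked as Fun(FoldState, Record), which is the reverse of the (Elem, Acc) order used by lists:foldl/3. The sketch below is self-contained and does not call the parser; the module name ecsv_fold_demo, the callbacks and the feed/3 stand-in are all made up for illustration, and the sample records are invented.

```erlang
-module(ecsv_fold_demo).
-export([count_records/2, collect_first_column/2, demo/0]).

%% Hypothetical callbacks (not part of the original post).
%% Both take (FoldState, Record), the order the post's parse/3 uses.

%% Counts records without keeping them in memory.
count_records(Count, _Record) ->
    Count + 1.

%% Collects the first field of every record (in reverse order).
collect_first_column(Acc, Record) ->
    [element(1, Record) | Acc].

%% Stand-in for the parser: feeds already-parsed record tuples to a
%% callback, swapping the arguments that lists:foldl/3 passes.
feed(Fun, Init, Records) ->
    lists:foldl(fun(Rec, State) -> Fun(State, Rec) end, Init, Records).

demo() ->
    %% Invented sample data, shaped like the tuples new_record/1 builds.
    Records = [{"name", "city"}, {"ann", "Oslo"}, {"bob", "Lima"}],
    3 = feed(fun count_records/2, 0, Records),
    ["bob", "ann", "name"] = feed(fun collect_first_column/2, [], Records),
    ok.
```

With the parser from the post, the same callbacks would presumably be passed as the third argument of parse_file/3, with 0 or [] as the initial state.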

mryufeng, posted 2009-03-03:
A really distinctive way to handle parsing. Writing a state machine can be this easy.

sw2wolf, posted 2009-03-04:
Could the Erlang code be syntax-highlighted? That would be easier on the eyes.

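Trustno1, posted 2009-03-04:
mryufeng wrote: "A really distinctive way to handle parsing. Writing a state machine can be this easy."
This is a very classic parsing pattern in FP. The lists:token approach is really simulating the stack of an imperative language, and a stack can be converted into a fully equivalent continuation. Continuations are the most natural thing in the world in FP.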

mryufeng, posted 2009-03-04:
Development with this model really is very productive. Once the state diagram is drawn, the code is nearly done too.