Original post (may require a proxy to reach): http://ppolv.wordpress.com/2008/02/25/parsing-csv-in-erlang/
So I need to parse a CSV file in Erlang.
Although CSV files have a very simple structure, simply calling string:tokens(Line, ",") for each line of the file won't do the trick, as there can be quoted fields that span more than one line and contain commas or escaped quotes.
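To see why, consider this made-up example, where the second field of the record spans two lines and contains both a comma and an escaped quote:

name,notes
"Smith, John","said ""hello""
and left"

Splitting each line on commas would cut the record in the wrong places and never stitch the two lines back together.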
A detailed discussion of string parsing in Erlang can be found in the excellent article Parsing text and binary files with Erlang by Joel Reymont, whose very first example is parsing a CSV file! But being the first example, it was written with simplicity rather than completeness in mind, so it doesn't take quoted or multi-line fields into account.
Now we will write a simple parser for RFC 4180 documents (which is way cooler than parsing plain old CSV files ;-) ). As the format is really simple, we won't use yecc or leex; instead we'll parse the input by hand using binaries, lists and lots of pattern matching.
Our goals are:
* Recognize fields delimited by commas and records delimited by line breaks
* Recognize quoted fields
* Be able to parse quotes, commas and line breaks inside quoted fields
* Ensure that all records have the same number of fields
* Provide a fold-like callback interface, in addition to a return-all-records-in-file function
What the parser won't do:
* Unicode. We will treat the file as a binary and consider each character to be ASCII, one byte wide. To parse Unicode files, you can use xmerl_ucs:from_utf8/1 and then process the resulting list instead of the raw binary (a sketch follows).
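A rough sketch of that route (mine, not the author's; on modern OTP releases unicode:characters_to_list/1 does the same job):

{ok, Bin} = file:read_file("data.csv"),
%% decode the UTF-8 bytes into a list of Unicode code points
Chars = xmerl_ucs:from_utf8(binary_to_list(Bin)),
%% a list-based variant of do_parse/2 would then match on [C | Rest]
%% instead of <<C, Rest/binary>>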
A quick look suggests that the parser will pass through the following states (a transition summary follows the list):
[figure: CSV parsing states]
* Field start
at the beginning of each field. Whitespace is kept, since it may be part of an unquoted field, but any whitespace preceding a quoted field is discarded
* Normal
an unquoted field
* Quoted
inside a quoted field
* Post Quoted
after a quoted field. Whitespace can appear between a quoted field and the next field/record, and should be discarded
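Read off the parser clauses below, the transitions can be summarized as:
* field_start: whitespace loops here (and is kept); a quote moves to quoted, dropping the kept whitespace; anything else moves to normal
* normal: a comma ends the field, a newline ends the record (both returning to field_start); anything else is accumulated
* quoted: a doubled quote "" accumulates a literal quote; a single quote moves to post_quoted; anything else, including commas and newlines, is accumulated; EOF is an unclosed_quote error
* post_quoted: whitespace is dropped; commas, newlines and EOF behave as in normal; anything else is a bad_record error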
Parsing state
While parsing, we will use the following record to keep track of the current state
-record(ecsv, {
    state = field_start,  %% field_start | normal | quoted | post_quoted
    cols = undefined,     %% how many fields per record
    current_field = [],   %% field being accumulated (in reverse)
    current_record = [],  %% record being accumulated (fields in reverse)
    fold_state,           %% user-supplied accumulator
    fold_fun              %% user-supplied fold function
}).
API functions
parse_file(FileName, InitialState, Fun) ->
    {ok, Binary} = file:read_file(FileName),
    parse(Binary, InitialState, Fun).

parse_file(FileName) ->
    {ok, Binary} = file:read_file(FileName),
    parse(Binary).

parse(X) ->
    R = parse(X, [], fun(Fold, Record) -> [Record | Fold] end),
    lists:reverse(R).

parse(X, InitialState, Fun) ->
    do_parse(X, #ecsv{fold_state = InitialState, fold_fun = Fun}).
The three-argument functions provide the fold-like interface, while the single-argument one returns a list of all the records in the file.
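For example, a hypothetical session (it assumes the code above lives in a module named ecsv, and that test.csv contains the two records a,b and 1,2):

1> ecsv:parse_file("test.csv").
[{"a","b"},{"1","2"}]
2> ecsv:parse_file("test.csv", 0, fun(Count, _Record) -> Count + 1 end).
2

The second call uses the fold interface to count records without building the full list.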
Parsing
Now the fun part!
The transitions (State x Input -> NewState) are derived almost 1:1 from the diagram, with minor changes (like the handling of field and record delimiters, which is common to both the normal and post_quoted states).
Inside a quoted field, a double quote must be escaped by preceding it with another double quote. It's really easy to distinguish this case by matching against
<<$",$",_/binary>>
a sort of "lookahead", in yacc parlance.
%% --------- field_start state ---------------------
%% whitespace: stay in field_start, keeping the whitespace
%% (it may turn out to be part of an unquoted field)
do_parse(<<32, Rest/binary>>, S = #ecsv{state = field_start, current_field = Field}) ->
    do_parse(Rest, S#ecsv{current_field = [32 | Field]});
%% a quote: it's a quoted field, discard any whitespace seen so far
do_parse(<<$", Rest/binary>>, S = #ecsv{state = field_start}) ->
    do_parse(Rest, S#ecsv{state = quoted, current_field = []});
%% anything else: it's an unquoted field
do_parse(Bin, S = #ecsv{state = field_start}) ->
    do_parse(Bin, S#ecsv{state = normal});

%% --------- quoted state ---------------------
%% escaped quote inside a quoted field
do_parse(<<$", $", Rest/binary>>, S = #ecsv{state = quoted, current_field = Field}) ->
    do_parse(Rest, S#ecsv{current_field = [$" | Field]});
%% end of quoted field
do_parse(<<$", Rest/binary>>, S = #ecsv{state = quoted}) ->
    do_parse(Rest, S#ecsv{state = post_quoted});
%% anything else (including commas and line breaks) inside a quoted field
do_parse(<<X, Rest/binary>>, S = #ecsv{state = quoted, current_field = Field}) ->
    do_parse(Rest, S#ecsv{current_field = [X | Field]});
%% EOF inside a quoted field
do_parse(<<>>, #ecsv{state = quoted}) ->
    throw({ecsv_exception, unclosed_quote});

%% --------- post_quoted state ---------------------
%% consume whitespace after a quoted field
do_parse(<<32, Rest/binary>>, S = #ecsv{state = post_quoted}) ->
    do_parse(Rest, S);

%% --------- comma and newline handling ------------------
%% --------- common to the normal and post_quoted states ---
%% EOF right after a record delimiter: nothing pending, return the fold state
do_parse(<<>>, #ecsv{current_record = [], current_field = [], fold_state = State}) ->
    State;
%% EOF with a pending record (no trailing newline): emit it, then finish
do_parse(<<>>, S) ->
    do_parse(<<>>, new_record(S));
%% skip carriage returns (Windows files use CRLF line endings)
do_parse(<<$\r, Rest/binary>>, S = #ecsv{}) ->
    do_parse(Rest, S);
%% newline: the current record is complete
do_parse(<<$\n, Rest/binary>>, S = #ecsv{}) ->
    do_parse(Rest, new_record(S));
%% comma: the current field is complete
do_parse(<<$, , Rest/binary>>, S = #ecsv{current_field = Field, current_record = Record}) ->
    do_parse(Rest, S#ecsv{state = field_start,
                          current_field = [],
                          current_record = [lists:reverse(Field) | Record]});
%% a double quote anywhere other than the cases already handled is an error
do_parse(<<$", _Rest/binary>>, #ecsv{}) ->
    throw({ecsv_exception, bad_record});
%% anything other than whitespace or line ends in post_quoted state is an error
do_parse(<<_X, _Rest/binary>>, #ecsv{state = post_quoted}) ->
    throw({ecsv_exception, bad_record});
%% accumulate the field value
do_parse(<<X, Rest/binary>>, S = #ecsv{state = normal, current_field = Field}) ->
    do_parse(Rest, S#ecsv{current_field = [X | Field]}).
Record assembly and callback
Convert each record to a tuple, and check that it has the same number of fields as the previous records. Then invoke the callback function with the new record and the previous fold state.
%% check the record size against the previous ones, and update the state
new_record(S = #ecsv{cols = Cols, current_field = Field, current_record = Record,
                     fold_state = State, fold_fun = Fun}) ->
    NewRecord = list_to_tuple(lists:reverse([lists:reverse(Field) | Record])),
    if
        (tuple_size(NewRecord) =:= Cols) or (Cols =:= undefined) ->
            NewState = Fun(State, NewRecord),
            S#ecsv{state = field_start, cols = tuple_size(NewRecord),
                   current_record = [], current_field = [], fold_state = NewState};
        (tuple_size(NewRecord) =/= Cols) ->
            throw({ecsv_exception, bad_record_size})
    end.
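To see the hard cases working, take a made-up file quoted.csv containing:

a,"line one
line two","has ""quotes"" and, commas"

The quoted field keeps its embedded newline, comma and quotes, so a session would look roughly like:

1> ecsv:parse_file("quoted.csv").
[{"a","line one\nline two","has \"quotes\" and, commas"}]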
Final notes
We used a single function, do_parse/2, with many clauses to do the parsing. In a more complex scenario you would probably use different functions for different sections of the grammar you are parsing. You could also tokenize the input first and then parse the resulting token stream; that can simplify the work even if you aren't using a parser generator like yecc (this is the approach I'm using to parse LDAP filters).
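A rough illustration of that tokenize-then-parse idea (my sketch, not the author's code; quoting is left out to keep it short): the first pass flattens the binary into a token stream, the second folds the tokens into records.

tokenize(Bin) -> tokenize(Bin, [], []).

tokenize(<<>>, [], Toks) ->
    lists:reverse(Toks);
tokenize(<<>>, Field, Toks) ->
    lists:reverse([{field, lists:reverse(Field)} | Toks]);
tokenize(<<$, , Rest/binary>>, Field, Toks) ->
    tokenize(Rest, [], [comma, {field, lists:reverse(Field)} | Toks]);
tokenize(<<$\n, Rest/binary>>, Field, Toks) ->
    tokenize(Rest, [], [newline, {field, lists:reverse(Field)} | Toks]);
tokenize(<<C, Rest/binary>>, Field, Toks) ->
    tokenize(Rest, [C | Field], Toks).

%% second pass: fold the token stream into records
records(Toks) -> records(Toks, [], []).

records([{field, F} | T], Rec, Recs) -> records(T, [F | Rec], Recs);
records([comma | T], Rec, Recs) -> records(T, Rec, Recs);
records([newline | T], Rec, Recs) -> records(T, [], [lists:reverse(Rec) | Recs]);
records([], [], Recs) -> lists:reverse(Recs);
records([], Rec, Recs) -> lists:reverse([lists:reverse(Rec) | Recs]).

With the quoting rules added to tokenize/3 only, records/1 would not need to change at all.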
Comments
#4 mryufeng 2009-03-04
This style of development really is efficient: draw the state diagram and the code is practically done.
#3 Trustno1 2009-03-04
mryufeng wrote:
A very distinctive way to handle parsing. Who knew writing a state machine could be this easy.
This is a classic parsing pattern in FP. The lists:tokens approach is really just simulating an imperative language's stack, and a stack can be converted into a fully equivalent continuation; continuations, in turn, are the most natural thing there is in FP.
#2 sw2wolf 2009-03-04
Could the Erlang code get syntax highlighting? It would be easier on the eyes.
#1 mryufeng 2009-03-03
A very distinctive way to handle parsing. Who knew writing a state machine could be this easy.