mryufeng

Parsing CSV in Erlang (repost)

Original article (access may require a proxy): http://ppolv.wordpress.com/2008/02/25/parsing-csv-in-erlang/

So I need to parse a CSV file in Erlang.

Although CSV files have a very simple structure, simply calling string:tokens(Line,",") for each line in the file won't do the trick, as there could be quoted fields that span more than one line and contain commas or escaped quotes.
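To make the problem concrete, here is a minimal sketch of the naive approach. The module name naive_split is only an illustration. string:tokens/2 is part of the standard library (marked as obsolete in recent OTP releases, but still available).

```erlang
-module(naive_split).
-export([demo/0]).

%% Naive approach: split a line on every comma.
%% string:tokens/2 knows nothing about quoting, so a quoted field that
%% contains a comma is cut in two. It also silently drops empty fields.
demo() ->
    Quoted = string:tokens("a,\"b,c\",d", ","),
    Empty  = string:tokens("a,,d", ","),
    {Quoted, Empty}.
```

The first line has three fields but comes back as four tokens, and the second line loses its empty middle field, so neither result can be trusted as a record.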

A detailed discussion of string parsing in Erlang can be found in the excellent article Parsing text and binary files with Erlang by Joel Reymont. Its very first example is parsing a CSV file! But being the first example, it was written with simplicity rather than completeness in mind, so it doesn't take quoted or multi-line fields into account.

Now we will write a simple parser for RFC 4180 documents (which is way cooler than parsing plain old CSV files ;-) ). As the format is really simple, we won't use yecc or leex, but parse the input file by hand using binaries, lists and lots of pattern matching.

Our goals are

    * Recognize fields delimited by commas, records delimited by line breaks
    * Recognize quoted fields
    * Parse quotes, commas and line breaks inside quoted fields
    * Ensure that all records have the same number of fields
    * Provide a fold-like callback interface, in addition to a return-all-records-in-file function

What the parser won't do:

    * Unicode. We will treat the file as binary and consider each character as ASCII, 1 byte wide. To parse Unicode files, you can use xmerl_ucs:from_utf8/1, and then process the resulting list instead of the raw binary
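As a hedged sketch of that decoding step, the standard unicode module can be used in place of xmerl_ucs. The module and function names below (utf8_demo, to_codepoints/1) are illustrative only. Note that the parser in this article matches on binaries, so feeding it a list would require list-pattern versions of its clauses.

```erlang
-module(utf8_demo).
-export([to_codepoints/1]).

%% Decode a UTF-8 binary into a list of Unicode code points.
%% unicode:characters_to_list/2 returns a list on success, or an
%% {error, ...} / {incomplete, ...} tuple when the input is not valid UTF-8.
to_codepoints(Bin) when is_binary(Bin) ->
    case unicode:characters_to_list(Bin, utf8) of
        Chars when is_list(Chars)     -> {ok, Chars};
        {error, _Decoded, _Rest}      -> {error, invalid_utf8};
        {incomplete, _Decoded, _Rest} -> {error, truncated_utf8}
    end.
```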

A quick look suggests that the parser will pass through the following states:
[Figure: CSV parsing states]

    * Field start

      At the beginning of each field. Whitespace is kept for unquoted fields, but any whitespace before a quoted field is discarded.
    * Normal

      An unquoted field.
    * Quoted

      Inside a quoted field.
    * Post Quoted

      After a quoted field. Whitespace may appear between a quoted field and the next field or record, and is discarded.

Parsing state

While parsing, we will use the following record to keep track of the current state

-record(ecsv,{
   state = field_start,  %%field_start|normal|quoted|post_quoted
   cols = undefined, %%how many fields per record
   current_field = [],
   current_record = [],
   fold_state,
   fold_fun  %%user supplied fold function
   }).

API functions

%%Fold over the records of a file with a user supplied function
parse_file(FileName,InitialState,Fun) ->
    {ok, Binary} = file:read_file(FileName),
    parse(Binary,InitialState,Fun).

%%Return all the records of a file as a list
parse_file(FileName)  ->
    {ok, Binary} = file:read_file(FileName),
    parse(Binary).

parse(X) ->
    R = parse(X,[],fun(Fold,Record) -> [Record|Fold] end),
    lists:reverse(R).

parse(X,InitialState,Fun) ->
    do_parse(X,#ecsv{fold_state=InitialState,fold_fun = Fun}).

The three-argument functions provide the fold-like interface, while the single-argument ones return a list with all the records in the file.
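The shape of the fold callback can be sketched without the parser itself. In the example below, feed/3 is a hypothetical stand-in that hands records to the callback one at a time, and the module name fold_demo and the sample records are made up for illustration. The point to notice is the argument order: the callback receives (Accumulator, Record), which is the reverse of the (Elem, Acc) order used by lists:foldl/3.

```erlang
-module(fold_demo).
-export([count/2, collect_first/2, run/0]).

%% Count records without keeping them in memory.
count(N, _Record) -> N + 1.

%% Collect the first field of every record (newest first).
collect_first(Acc, Record) -> [element(1, Record) | Acc].

%% Stand-in for the parser: calls Fun(Acc, Record) for each record,
%% in order, threading the accumulator through.
feed(Records, Init, Fun) ->
    lists:foldl(fun(Rec, Acc) -> Fun(Acc, Rec) end, Init, Records).

run() ->
    Records = [{"name", "age"}, {"ann", "31"}, {"bob", "27"}],
    {feed(Records, 0, fun count/2),
     lists:reverse(feed(Records, [], fun collect_first/2))}.
```

With the API functions above, the same callbacks would be passed as the third argument, for example parse_file(FileName, 0, fun fold_demo:count/2) to count the records of a file.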
Parsing

Now the fun part!
The transitions (State x Input -> NewState) are derived almost 1:1 from the diagram, with minor changes (like the handling of field and record delimiters, which is common to both the normal and post_quoted states).
Inside a quoted field, a double quote must be escaped by preceding it with another double quote. It's really easy to distinguish this case by matching against

<<$",$",_/binary>>

which is a sort of "lookahead", in yacc's terms.
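In isolation, the two-byte match can be sketched as follows. The module quote_demo and the function quoted_body/1 are illustrative names, not part of the parser in this article. The function reads the body of a quoted field, starting just after the opening quote.

```erlang
-module(quote_demo).
-export([quoted_body/1]).

%% Returns {FieldString, RestOfInput}.
quoted_body(Bin) -> quoted_body(Bin, []).

%% Two quotes in a row: an escaped quote. This clause must come first,
%% otherwise the single-quote clause below would end the field too early.
quoted_body(<<$", $", Rest/binary>>, Acc) -> quoted_body(Rest, [$" | Acc]);
%% A single quote: end of the field.
quoted_body(<<$", Rest/binary>>, Acc) -> {lists:reverse(Acc), Rest};
%% Any other byte belongs to the field, including commas and line breaks.
quoted_body(<<C, Rest/binary>>, Acc) -> quoted_body(Rest, [C | Acc]);
%% Input ended before the closing quote.
quoted_body(<<>>, _Acc) -> erlang:error(unclosed_quote).
```

Clause order does the work here: because the doubled-quote pattern is tried before the single-quote one, no explicit lookahead state is needed.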

%% --------- Field_start state ---------------------
%%whitespace, loop in field_start state
do_parse(<<32,Rest/binary>>,S = #ecsv{state=field_start,current_field=Field})->
    do_parse(Rest,S#ecsv{current_field=[32|Field]});

%%it's a quoted field, discard previous whitespace
do_parse(<<$",Rest/binary>>,S = #ecsv{state=field_start})->
    do_parse(Rest,S#ecsv{state=quoted,current_field=[]});

%%anything else is an unquoted field
do_parse(Bin,S = #ecsv{state=field_start})->
    do_parse(Bin,S#ecsv{state=normal});

%% --------- Quoted state ---------------------
%%Escaped quote inside a quoted field
do_parse(<<$",$",Rest/binary>>,S = #ecsv{state=quoted,current_field=Field})->
    do_parse(Rest,S#ecsv{current_field=[$"|Field]});

%%End of quoted field
do_parse(<<$",Rest/binary>>,S = #ecsv{state=quoted})->
    do_parse(Rest,S#ecsv{state=post_quoted});

%%Anything else inside a quoted field
do_parse(<<X,Rest/binary>>,S = #ecsv{state=quoted,current_field=Field})->
    do_parse(Rest,S#ecsv{current_field=[X|Field]});

%%EOF inside a quoted field is an error
do_parse(<<>>, #ecsv{state=quoted})->
    throw({ecsv_exception,unclosed_quote});

%% --------- Post_quoted state ---------------------
%%consume whitespace after a quoted field
do_parse(<<32,Rest/binary>>,S = #ecsv{state=post_quoted})->
    do_parse(Rest,S);

%%---------Comma and New line handling. ------------------
%%---------Common code for post_quoted and normal state---

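%%EOF in a new line (no pending field or record), return the fold result
do_parse(<<>>, #ecsv{current_field=[],current_record=[],fold_state=State})->
    State;
%%EOF in the last line, add the last record and continue.
%%The input must stay a binary: passing [] here matches no later clause.
do_parse(<<>>,S)->
    do_parse(<<>>,new_record(S));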

%% skip carriage return (Windows files use CRLF)
do_parse(<<$\r,Rest/binary>>,S = #ecsv{})->
    do_parse(Rest,S);

%% new record
do_parse(<<$\n,Rest/binary>>,S = #ecsv{}) ->
    do_parse(Rest,new_record(S));

%% field delimiter, store the current field and start a new one
do_parse(<<$, ,Rest/binary>>,S = #ecsv{current_field=Field,current_record=Record})->
    do_parse(Rest,S#ecsv{state=field_start,
                         current_field=[],
                         current_record=[lists:reverse(Field)|Record]});

%%A double quote in any place other than those already handled is an error
do_parse(<<$",_Rest/binary>>, #ecsv{})->
    throw({ecsv_exception,bad_record});

%%Anything other than whitespace or line ends in post_quoted state is an error
do_parse(<<_X,_Rest/binary>>, #ecsv{state=post_quoted})->
    throw({ecsv_exception,bad_record});

%%Accumulate Field value
do_parse(<<X,Rest/binary>>,S = #ecsv{state=normal,current_field=Field})->
    do_parse(Rest,S#ecsv{current_field=[X|Field]}).

Record assembly and callback

Convert each record to a tuple, and check that it has the same number of fields as the previous records. Then invoke the callback function with the previous fold state and the new record.

%%check the record size against the previous one, and update the state.
new_record(S=#ecsv{cols=Cols,current_field=Field,current_record=Record,fold_state=State,fold_fun=Fun}) ->
    NewRecord = list_to_tuple(lists:reverse([lists:reverse(Field)|Record])),
    if
        (tuple_size(NewRecord) =:= Cols) or (Cols =:= undefined) ->
            NewState = Fun(State,NewRecord),
            S#ecsv{state=field_start,cols=tuple_size(NewRecord),
                   current_record=[],current_field=[],fold_state=NewState};

        (tuple_size(NewRecord) =/= Cols) ->
            throw({ecsv_exception,bad_record_size})
    end.

Final notes

We used a single function, do_parse/2, with many clauses to do the parsing. In a more complex scenario, you will probably use different functions for different sections of the grammar you are parsing. You could also tokenize the input first and then parse the resulting token stream. This could make your work simpler even if you aren't using a parser generator like yecc (this is the approach I'm using to parse LDAP filters).
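A rough sketch of the tokenize-then-parse idea, applied to CSV for continuity. This is not the parser developed above: the module name csv_tokens and the token shapes ({field, String}, comma, newline) are assumptions made for the example, and it handles unquoted fields only to stay short.

```erlang
-module(csv_tokens).
-export([tokenize/1, records/1]).

%% Pass 1: binary -> [{field, String} | comma | newline].
tokenize(Bin) -> tokenize(Bin, [], []).

tokenize(<<$,, Rest/binary>>, Field, Acc) ->
    tokenize(Rest, [], [comma, {field, lists:reverse(Field)} | Acc]);
tokenize(<<$\n, Rest/binary>>, Field, Acc) ->
    tokenize(Rest, [], [newline, {field, lists:reverse(Field)} | Acc]);
tokenize(<<$\r, Rest/binary>>, Field, Acc) ->
    tokenize(Rest, Field, Acc);
tokenize(<<C, Rest/binary>>, Field, Acc) ->
    tokenize(Rest, [C | Field], Acc);
%% Empty input: no tokens at all.
tokenize(<<>>, [], []) ->
    [];
%% Input ended right after a line break: nothing is pending.
tokenize(<<>>, [], [newline | _] = Acc) ->
    lists:reverse(Acc);
%% Input ended mid-line: close the last field and the last record.
tokenize(<<>>, Field, Acc) ->
    lists:reverse([newline, {field, lists:reverse(Field)} | Acc]).

%% Pass 2: token list -> list of record tuples.
records(Tokens) -> records(Tokens, [], []).

records([{field, F}, comma | Rest], Row, Rows) ->
    records(Rest, [F | Row], Rows);
records([{field, F}, newline | Rest], Row, Rows) ->
    records(Rest, [], [list_to_tuple(lists:reverse([F | Row])) | Rows]);
records([], [], Rows) ->
    lists:reverse(Rows).
```

The second pass never looks at bytes, only at tokens, which is what keeps each pass small. Quoting rules would live entirely in the tokenizer.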
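Comments
#4 mryufeng 2009-03-04
This model really is very productive to develop with: once the state diagram is drawn, the code is nearly written.
#3 Trustno1 2009-03-04
mryufeng wrote:
A very distinctive way of handling parsing; writing a state machine can be this easy.

This is a very classic parsing pattern in FP. The lists:token approach is really simulating the stack of an imperative language, and a stack can be converted into a fully equivalent continuation. And continuations are the most natural thing in FP.
#2 sw2wolf 2009-03-04
Could the Erlang code be syntax-highlighted? That would be easier on the eyes.
#1 mryufeng 2009-03-03
A very distinctive way of handling parsing; writing a state machine can be this easy.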
