RegexSerDe


The official example is at:

https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-ApacheWeblogData

Apache Weblog Data

The format of the Apache weblog is customizable, but most webmasters use the default.
For the default Apache weblog format, we can create a table with the following command.

More about RegexSerDe can be found here: http://issues.apache.org/jira/browse/HIVE-662

add jar ../build/contrib/hive_contrib.jar;

CREATE TABLE apachelog (
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
STORED AS TEXTFILE;
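
To see which columns this table definition would produce, the pattern can be tried against a log line outside Hive. The following is a minimal sketch, not RegexSerDe's own code. It assumes the first three groups are `[^ ]*` and the timestamp group is `\[[^\]]*\]`, and that a row must match the whole line. The class name ApacheLogRegexDemo, the helper parse, and the sample log line are all illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ApacheLogRegexDemo {
    // Assumed pattern: a Java string literal uses the same backslash escaping
    // as the quoted value of "input.regex" in the CREATE TABLE statement.
    static final String INPUT_REGEX =
        "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)"
        + "(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?";

    // Returns one string per capturing group (9 in total), or null when the
    // line does not match as a whole. Unmatched optional groups are null.
    static String[] parse(String line) {
        Matcher m = Pattern.compile(INPUT_REGEX).matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] cols = new String[m.groupCount()];
        for (int i = 0; i < cols.length; i++) {
            cols[i] = m.group(i + 1);
        }
        return cols;
    }

    public static void main(String[] args) {
        // Illustrative line in the common log format (no referer or agent).
        String line = "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "
            + "\"GET /apache_pb.gif HTTP/1.0\" 200 2326";
        String[] names = {"host", "identity", "user", "time", "request",
                          "status", "size", "referer", "agent"};
        String[] cols = parse(line);
        for (int i = 0; i < names.length; i++) {
            System.out.println(names[i] + " = " + (cols == null ? null : cols[i]));
        }
    }
}
```

Quoted fields such as request keep their surrounding double quotes, because the quotes sit inside the capturing group. For a line without referer and agent, the last two columns come back as null.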

 

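The official issue is https://issues.apache.org/jira/browse/HIVE-167

The official unit test is in contrib/src/test/queries/clientnegative/serde_regex.q.

RegexSerDe parses each record (row) with a regular expression, using Java's Pattern. input.regex is the pattern used for parsing. output.format.string describes how a record is serialized; it is passed to Java's String.format(outputFormatString, outputFields), and the format string is read from the table properties: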

outputFormatString = tbl.getProperty("output.format.string");
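
The serialization side can be sketched in the same way. The example below only illustrates how a format string such as the one in output.format.string combines with an array of column values through String.format. The class name OutputFormatDemo, the helper serialize, and the sample values are hypothetical and are not taken from the Hive source.

```java
public class OutputFormatDemo {
    // Same value as "output.format.string" in the table definition:
    // %N$s picks the N-th argument and formats it as a string.
    static final String OUTPUT_FORMAT =
        "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s";

    // Formats one row; the Object[] is expanded into the varargs of String.format.
    static String serialize(String outputFormatString, Object[] outputFields) {
        return String.format(outputFormatString, outputFields);
    }

    public static void main(String[] args) {
        // Hypothetical column values for one row of the apachelog table.
        Object[] row = {
            "127.0.0.1", "-", "frank", "[10/Oct/2000:13:55:36 -0700]",
            "\"GET /apache_pb.gif HTTP/1.0\"", "200", "2326", "-", "-"
        };
        System.out.println(serialize(OUTPUT_FORMAT, row));
    }
}
```

Because the positions are explicit (%1$s, %2$s, ...), the format string can reorder or repeat columns without changing the order of the fields array.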
