Overview of SerDe in Hive 0.5

xq0804200134

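I. Background

1. When processes communicate remotely, they can send each other data of any type, but whatever the type, it travels over the network as a binary sequence. The sender must convert an object into a byte sequence before it can be transmitted, which is called object serialization. The receiver must restore the byte sequence into an object, which is called object deserialization.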
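The round trip described above can be sketched in plain Java. This is a minimal illustration, not Hive code: the `PageView` class and its field layout are hypothetical, and the JDK's `DataOutputStream`/`DataInputStream` stand in for whatever wire format a real system would use.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical record used only for this illustration. */
final class PageView {
    final long time;
    final int userId;
    final String url;

    PageView(long time, int userId, String url) {
        this.time = time;
        this.userId = userId;
        this.url = url;
    }

    /** Serialization: object -> byte sequence. */
    byte[] toBytes() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeLong(time);
            out.writeInt(userId);
            out.writeUTF(url);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Deserialization: byte sequence -> object. Fields are read in the order they were written. */
    static PageView fromBytes(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            return new PageView(in.readLong(), in.readInt(), in.readUTF());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        PageView sent = new PageView(1234567891012L, 123456,
                "http://wiki.apache.org/hadoop/Hive/LanguageManual/UDF");
        PageView received = PageView.fromBytes(sent.toBytes());
        System.out.println(received.time + " " + received.userId + " " + received.url);
    }
}
```

The receiver only recovers the object because both sides agree on the field order and types. A Hive SerDe plays the same role of fixing that agreement between stored bytes and table columns.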

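2. In Hive, deserialization turns a key/value pair into the values of the columns of a Hive table.

3. Hive can load data into a table without transforming it first, which saves a great deal of time when processing massive data sets.

II. Technical Details

1. SerDe is short for Serializer/Deserializer. Its purpose is to serialize and deserialize data.

2. When creating a table, users can specify a custom SerDe or one of the SerDes that ship with Hive. The SerDe defines the table's columns and supplies the value of each column.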

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type
[COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...)
[SORTED BY (col_name [ASC|DESC], ...)]
INTO num_buckets BUCKETS]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path]
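To create a table with a specific SerDe, use the ROW FORMAT row_format clause. For example:

a. Add the jar. In the Hive client, enter: hive> add jar /run/serde_test.jar;
Alternatively, from a Linux shell, run: ${HIVE_HOME}/bin/hive --auxpath /run/serde_test.jar
b. Create the table: create table serde_table row format serde 'hive.connect.TestDeserializer';
3. Write the deserializer class TestDeserializer. It implements the three methods of the Deserializer interface:

a) Initialization: initialize(Configuration conf, Properties tbl).

b) Deserialize a Writable and return an Object: deserialize(Writable blob).

c) Return the ObjectInspector for the Object returned by deserialize(Writable blob): getObjectInspector().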

public interface Deserializer {

/**
   * Initialize the HiveDeserializer.
   * @param conf System properties
   * @param tbl table properties
   * @throws SerDeException
   */
public void initialize(Configuration conf, Properties tbl) throws SerDeException;

/**
   * Deserialize an object out of a Writable blob.
   * In most cases, the return value of this function will be constant since the function
   * will reuse the returned object.
   * If the client wants to keep a copy of the object, the client needs to clone the
   * returned value by calling ObjectInspectorUtils.getStandardObject().
   * @param blob The Writable object containing a serialized object
   * @return A Java object representing the contents in the blob.
   */
public Object deserialize(Writable blob) throws SerDeException;

/**
   * Get the object inspector that can be used to navigate through the internal
   * structure of the Object returned from deserialize(...).
   */
public ObjectInspector getObjectInspector() throws SerDeException;

}
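The Javadoc on deserialize(...) warns that the returned object is usually reused between calls, so a caller that wants to keep a row must copy it. The sketch below shows that pitfall in plain Java. `ReusingRowReader` is a hypothetical stand-in and not a Hive class; in Hive the copy would be made with ObjectInspectorUtils.getStandardObject(), as the comment above says.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a deserializer that reuses its return object; not a Hive class. */
final class ReusingRowReader {
    // One row object is allocated once and refilled on every call.
    private final List<Object> row = new ArrayList<Object>();

    /** Splits a tab-separated line into the single, reused row object. */
    List<Object> read(String line) {
        row.clear();
        row.addAll(Arrays.asList(line.split("\t")));
        return row;
    }

    public static void main(String[] args) {
        ReusingRowReader reader = new ReusingRowReader();

        List<Object> first = reader.read("a\tb");
        List<Object> firstCopy = new ArrayList<Object>(first); // defensive copy
        List<Object> second = reader.read("c\td");

        System.out.println(first == second); // true: same object, now holding the second row
        System.out.println(first);           // [c, d]
        System.out.println(firstCopy);       // [a, b]
    }
}
```

Reusing one row object avoids an allocation per record, which matters when scanning very large tables. The cost is that any reference kept across calls silently changes.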
The following deserializer class splits one line of data into the four fields time, userid, host and path of a Hive table. For example:

package hive.connect;

import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde2.Deserializer;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory.ObjectInspectorOptions;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class TestDeserializer implements Deserializer {
   private static List<String> FieldNames = new ArrayList<String>();
   private static List<ObjectInspector> FieldNamesObjectInspectors = new ArrayList<ObjectInspector>();
   static {
     FieldNames.add("time");
     FieldNamesObjectInspectors.add(ObjectInspectorFactory
          .getReflectionObjectInspector(Long.class,
               ObjectInspectorOptions.JAVA));
     FieldNames.add("userid");
     FieldNamesObjectInspectors.add(ObjectInspectorFactory
          .getReflectionObjectInspector(Integer.class,
               ObjectInspectorOptions.JAVA));
     FieldNames.add("host");
     FieldNamesObjectInspectors.add(ObjectInspectorFactory
          .getReflectionObjectInspector(String.class,
               ObjectInspectorOptions.JAVA));

     FieldNames.add("path");
     FieldNamesObjectInspectors.add(ObjectInspectorFactory
          .getReflectionObjectInspector(String.class,
               ObjectInspectorOptions.JAVA));

}

   @Override
   public Object deserialize(Writable blob) {
     // Only Text rows are supported; anything else yields a NULL row.
     if (!(blob instanceof Text)) {
        return null;
     }
     // Input fields are tab-separated: time, userid, url.
     String[] field = ((Text) blob).toString().split("\t");
     if (field.length != 3) {
        return null;
     }
     try {
        URL url = new URL(field[2]);
        Long time = Long.valueOf(field[0]);
        Integer userid = Integer.valueOf(field[1]);
        // The order must match FieldNames: time, userid, host, path.
        List<Object> result = new ArrayList<Object>();
        result.add(time);
        result.add(userid);
        result.add(url.getHost());
        result.add(url.getPath());
        return result;
     } catch (MalformedURLException e) {
        // Unparseable URL: treat the row as NULL.
        return null;
     } catch (NumberFormatException e) {
        // Non-numeric time or userid: treat the row as NULL.
        return null;
     }
   }

   @Override
   public ObjectInspector getObjectInspector() throws SerDeException {
     return ObjectInspectorFactory.getStandardStructObjectInspector(
          FieldNames, FieldNamesObjectInspectors);
   }

   @Override
   public void initialize(Configuration arg0, Properties arg1)
        throws SerDeException {
   }

}
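The parsing step inside deserialize(...) can be tried without a Hive installation. The sketch below lifts that logic into a standalone class that uses only the JDK. The `LineParser` name is hypothetical, and the code assumes the input fields are tab-separated.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical standalone version of the row-parsing logic; not part of Hive. */
final class LineParser {

    /** Returns [time, userid, host, path], or null if the line is malformed. */
    static List<Object> parse(String line) {
        String[] field = line.split("\t");
        if (field.length != 3) {
            return null;
        }
        try {
            URL url = new URL(field[2]);
            List<Object> row = new ArrayList<Object>();
            row.add(Long.valueOf(field[0]));    // time
            row.add(Integer.valueOf(field[1])); // userid
            row.add(url.getHost());             // host
            row.add(url.getPath());             // path
            return row;
        } catch (MalformedURLException | NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String line = "1234567891012\t123456\thttp://wiki.apache.org/hadoop/Hive/LanguageManual/UDF";
        // prints [1234567891012, 123456, wiki.apache.org, /hadoop/Hive/LanguageManual/UDF]
        System.out.println(parse(line));
    }
}
```

Returning null for a malformed line mirrors the deserializer's choice to produce a NULL row rather than fail the whole query.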
Test against the Hive table's data on HDFS. Below is one test record (the fields are expected to be tab-separated):

1234567891012 123456 http://wiki.apache.org/hadoop/Hive/LanguageManual/UDF

hive> add jar /run/jar/merg_hua.jar;
Added /run/jar/merg_hua.jar to class path
hive> create table serde_table row format serde 'hive.connect.TestDeserializer';
Found class for hive.connect.TestDeserializer
OK
Time taken: 0.028 seconds
hive> describe serde_table;
OK
time    bigint from deserializer
userid int     from deserializer
host    string from deserializer
path    string from deserializer
Time taken: 0.042 seconds
hive> select * from serde_table;
OK
1234567891012   123456 wiki.apache.org /hadoop/Hive/LanguageManual/UDF
Time taken: 0.039 seconds
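III. Summary
1. To create a Hive table that uses custom deserialization, write a class that implements Deserializer and specify it with the ROW FORMAT clause of the CREATE TABLE command.

2. When processing massive data sets, if the data format matches the table structure, Hive can deserialize the data directly without converting it first, which saves a great deal of time.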


2013-04-19 09:23