samuschen

hive serde

Category: hive

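I. Background

1. When processes communicate remotely, they can send each other data of any type, but whatever the type, it travels over the network as a binary sequence. The sender has to convert an object into a byte sequence before it can be transmitted, which is called object serialization; the receiver has to restore the byte sequence into an object, which is called object deserialization.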
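The round trip described above can be sketched in a few lines. This is only an illustration: the `Point` class and the byte layout (two big-endian 32-bit integers, a 16-bit length prefix, then UTF-8 bytes) are assumptions made up for the example, not a format used by Hive or Hadoop.

```python
import struct
from dataclasses import dataclass

HEADER = ">iiH"  # assumed layout: two signed 32-bit ints + unsigned 16-bit length


@dataclass
class Point:
    """Hypothetical object that a sender wants to ship to a receiver."""
    x: int
    y: int
    label: str


def serialize(p: Point) -> bytes:
    """Sender side: turn the object into a byte sequence."""
    encoded = p.label.encode("utf-8")
    return struct.pack(HEADER, p.x, p.y, len(encoded)) + encoded


def deserialize(data: bytes) -> Point:
    """Receiver side: restore the object from the byte sequence."""
    x, y, length = struct.unpack_from(HEADER, data, 0)
    offset = struct.calcsize(HEADER)
    label = data[offset:offset + length].decode("utf-8")
    return Point(x, y, label)
```

Calling `deserialize(serialize(p))` should give back an object equal to `p`; the bytes in between are what would actually cross the network.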

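2. In Hive, deserialization turns a key/value pair into the values of the individual columns of a Hive table.

3. Hive can load data into a table directly, without converting it first, which saves a great deal of time when processing very large data sets.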

SerDe

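SerDe is short for Serialize/Deserialize; its purpose is serialization and deserialization. Serialized formats include:

Delimited text (tab, comma, CTRL-A)
Thrift protocol

Deserialized (in-memory) formats:

Java Integer/String/ArrayList/HashMap
Hadoop Writable classes
User-defined classes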
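As a rough illustration of the delimited format, the sketch below converts between a CTRL-A separated line and an in-memory row. The column list, the helper names, and the use of a Python dict as the in-memory form are all assumptions for the example; Hive itself works with Java objects and Writables.

```python
CTRL_A = "\x01"  # the CTRL-A separator mentioned in the list above

# Hypothetical table schema: (column name, converter from text)
COLUMNS = [("id", int), ("name", str), ("score", float)]


def deserialize_row(line: bytes, columns):
    """Split a delimited line and convert each field to its column type."""
    fields = line.decode("utf-8").split(CTRL_A)
    row = {}
    for (name, convert), raw in zip(columns, fields):
        row[name] = convert(raw)
    return row


def serialize_row(row, columns) -> bytes:
    """Join the column values back into one delimited line."""
    text = CTRL_A.join(str(row[name]) for name, _ in columns)
    return text.encode("utf-8")
```

For example, `deserialize_row(b"7\x01alice\x013.5", COLUMNS)` yields a row with an integer `id`, a string `name`, and a float `score`, and `serialize_row` reverses the step.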

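The SerDes currently available are shown in the figure below:

[Figure: the existing Hive SerDes]

Among them, LazyObject deserializes a column only when that column is accessed. BinarySortable is a binary format that preserves sort order.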
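To make "a binary format that preserves sort order" concrete, here is a minimal sketch for 32-bit signed integers only. It assumes a big-endian encoding with the sign bit flipped, so that comparing the encoded bytes gives the same order as comparing the numbers. The function names are hypothetical, and the real BinarySortable SerDe handles many more types and details than this.

```python
import struct

SIGN_BIT = 0x80000000


def encode_sortable_int(value: int) -> bytes:
    """Encode a 32-bit signed int so that byte order equals numeric order."""
    if not -(1 << 31) <= value < (1 << 31):
        raise ValueError("value does not fit in 32 bits")
    # Two's complement bits with the sign bit flipped, written big-endian.
    return struct.pack(">I", (value & 0xFFFFFFFF) ^ SIGN_BIT)


def decode_sortable_int(data: bytes) -> int:
    """Invert encode_sortable_int."""
    (raw,) = struct.unpack(">I", data)
    unsigned = raw ^ SIGN_BIT
    return unsigned - (1 << 32) if unsigned >= (1 << 31) else unsigned
```

Sorting a list of integers by their encoded bytes should then give the same result as sorting them numerically, which is what lets the encoded form be compared without decoding it.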

Input processing

Hive's execution engine (referred to as just engine henceforth) first uses the configured InputFormat to read in a record of data (the value object returned by the RecordReader of the InputFormat).
The engine then invokes SerDe.deserialize() to deserialize the record. Nothing requires the object returned by this method to be fully deserialized. For instance, Hive has a LazyStruct object, which LazySimpleSerDe uses to represent the deserialized object. This object does not deserialize its bytes up front; it does so when a field is accessed.
The engine also gets hold of the ObjectInspector to use by invoking SerDe.getObjectInspector(). This has to be a subclass of StructObjectInspector, since a record representing a row of input data is essentially a struct type.
The engine passes the deserialized object (e.g. LazyStruct) and the object inspector to all operators, which use them to get the data they need from the record. The object inspector knows how to construct individual fields out of a deserialized record. For example, StructObjectInspector has a method called getStructFieldData() which returns a certain field in the record. This is the mechanism for accessing individual fields. For instance, the ExprNodeColumnEvaluator class, which can extract a column from the input row, uses this mechanism to get the real column object from the serialized row object. This real column object can in turn be a complex type (like a struct). To access subfields in such complex-typed objects, an operator uses the object inspector associated with that field (the top-level StructObjectInspector for the row maintains a list of field-level object inspectors which can be used to interpret individual fields).
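The input steps above can be mimicked with two small stand-in classes. This is a sketch of the idea only: `LazyRow` and `RowInspector` are hypothetical names, the method `get_struct_field_data` merely echoes getStructFieldData(), and none of the signatures are Hive's actual Java API.

```python
class LazyRow:
    """Hypothetical stand-in for a lazily deserialized row (cf. LazyStruct)."""

    def __init__(self, raw: bytes, separator: bytes = b"\x01"):
        self.raw = raw
        self.separator = separator
        self._fields = None   # nothing is parsed at construction time
        self.parse_count = 0  # only here to make the laziness observable

    def field(self, index: int) -> bytes:
        if self._fields is None:  # parse on first access, then reuse
            self._fields = self.raw.split(self.separator)
            self.parse_count += 1
        return self._fields[index]


class RowInspector:
    """Hypothetical stand-in for a struct-level object inspector."""

    def __init__(self, columns):
        self.columns = columns  # list of (name, converter) pairs
        self._index = {name: i for i, (name, _) in enumerate(columns)}

    def get_struct_field_data(self, row: LazyRow, name: str):
        i = self._index[name]
        _, convert = self.columns[i]
        return convert(row.field(i).decode("utf-8"))


def deserialize(raw: bytes) -> LazyRow:
    """Returns immediately; the bytes are not interpreted yet."""
    return LazyRow(raw)
```

An operator in this model never touches the bytes itself: it receives the row together with the inspector and calls `inspector.get_struct_field_data(row, "name")`, which triggers the parse on first use.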

P.S.: ExprNodeColumnEvaluator is the structure that actually deserializes the data, whereas the ObjectInspector works with the result of that deserialization.

For UDFs, the new GenericUDF abstract class provides the ObjectInspectors associated with the UDF arguments in the initialize() method, so the engine first initializes the UDF by calling this method. The UDF can then use these ObjectInspectors to interpret complex arguments (for simple arguments, the object handed to the UDF is already the right primitive object, such as LongWritable or IntWritable).
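The initialize-then-evaluate pattern can be sketched as follows. Both classes are hypothetical: `DelimitedListInspector` pretends that a list argument arrives as delimited text, and `ArraySizeUDF` only mirrors the shape of the pattern, not the real GenericUDF signatures.

```python
class DelimitedListInspector:
    """Hypothetical inspector that knows how to read a list argument."""

    def __init__(self, separator: str):
        self.separator = separator

    def get_list(self, obj: str):
        return [] if obj == "" else obj.split(self.separator)


class ArraySizeUDF:
    """Hypothetical UDF that returns the number of elements in a list."""

    def initialize(self, argument_inspectors):
        # Called once by the engine, before any row is evaluated.
        if len(argument_inspectors) != 1:
            raise ValueError("exactly one argument expected")
        self._inspector = argument_inspectors[0]
        return "int"  # placeholder for the inspector of the return type

    def evaluate(self, arguments):
        # Called per row; the stored inspector interprets the raw argument.
        value = arguments[0]
        if value is None:
            return None
        return len(self._inspector.get_list(value))
```

The point of the split is that type-dependent setup happens once in `initialize`, while `evaluate` stays a cheap per-row call that relies on the stored inspector.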

Output processing

Output is analogous to input. The engine passes the deserialized object representing a record and the corresponding ObjectInspector to SerDe.serialize(). In this context, serialization means converting the record object to an object of the type expected by the OutputFormat that will perform the write. To do this conversion, the serialize() method can use the passed ObjectInspector to get the individual fields of the record and convert the record to the appropriate type.
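A minimal sketch of the output side, under stated assumptions: the row is a plain tuple, `TupleRowInspector` is a hypothetical inspector, and the target type is a tab-delimited text line with `\N` standing for NULL. The separator and the NULL marker are parameters chosen for the example rather than taken from the text above.

```python
class TupleRowInspector:
    """Hypothetical inspector for rows stored as plain tuples."""

    def __init__(self, field_names):
        self.field_names = list(field_names)

    def get_all_struct_field_refs(self):
        return self.field_names

    def get_struct_field_data(self, row, name):
        return row[self.field_names.index(name)]


def serialize(row, inspector, separator="\t", null_text="\\N") -> str:
    """Convert a row object into the text line a writer would expect."""
    parts = []
    for name in inspector.get_all_struct_field_refs():
        # The inspector, not the serializer, knows how to reach each field.
        value = inspector.get_struct_field_data(row, name)
        parts.append(null_text if value is None else str(value))
    return separator.join(parts) + "\n"
```

Because `serialize` reaches every field through the inspector, the same function would work unchanged for a different in-memory row representation, provided a matching inspector is passed in.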


2011-04-13 15:34