`
tmrp
  • 浏览: 45448 次
  • 性别: Icon_minigender_1
  • 来自: 深圳
社区版块
存档分类
最新评论

RPC and Serialization with Hadoop, Thrift, and Protocol Buffers

阅读更多
http://www.cloudera.com/resources/rpc-and-serialization-in-hadoop

RPC and Serialization with Hadoop, Thrift, and Protocol Buffers
This post was originally written by Tom White, published here.

Hadoop and related projects like Thrift provide a choice of protocols and formats for doing RPC and serialization. In this post I'll briefly run through them and explain where they came from, how they relate to each other and how Google's newly released Protocol Buffers might fit in.

RPC and Writables
Hadoop has its own RPC mechanism that dates back to when Hadoop was a part of Nutch. It's used throughout Hadoop as the mechanism by which daemons talk to each other. For example, a DataNode communicates with the NameNode using the RPC interface DatanodeProtocol.

Protocols are defined using Java interfaces whose arguments and return types are primitives, Strings, Writables, or arrays. These types can all be serialized using Hadoop's specialized serialization format, based on Writable. Combined with the magic of Java dynamic proxies, we get a simple RPC mechanism which for the caller appears to be a Java interface.

MapReduce and Writables
Hadoop uses Writables for another, quite different, purpose: as a serialization format for MapReduce programs. If you've ever written a Hadoop MapReduce program you will have used Writables for the key and value types. For example:

public class MapClassimplements Mapper<LongWritable, Text, Text, IntWritable> {// ...}(Text is just a Writable version of Java String.)

The primary benefit of using Writables is in their efficiency. Compared to Java serialization, which would have been an obvious alternative choice, they have a more compact representation. Writables don't store their type in the serialized representation, since at the point of deserialization it is known which type is expected. For the MapReduce code above, the input key is a LongWritable, so an empty LongWritable instance is asked to populate itself from the input data stream.

More flexible MapReduce
There are downsides of having to use Writables for MapReduce types, however. For a newcomer to Hadoop it's another hurdle: something else to learn ("why can't I just use a String?"). More seriously, perhaps, is that it's hard to use different binary storage formats for MapReduce input and output. For example, Apache Thrift (see below) is an increasingly popular way of storing binary data. It's possible, but cumbersome and inefficient, to read or write Thrift data from MapReduce.

From Hadoop 0.17.0 onwards you no longer have to use Writables for key and value types in MapReduce programs. You can use any serialization framework. (Note that this is change is completely independent of Hadoop's RPC mechanism, which still uses Writables - and can only use Writables - as its on-wire format.) So it's easier to use Thrift types, say, throughout your MapReduce program. Or you can even use Java serialization (with some limitations which will be fixed). What's more, you can add your own serialization framework if you like.

Record I/O, Thrift and Protocol Buffers

Another problem with Writables, at least for the MapReduce programmer, is that creating new types is a burden. You have to implement the Writable interface, which means designing the on-wire format, and writing two methods: one to write the data in that format and one to read it back.

Hadoop's Record I/O was created to solve this problem. You write a definition of your types using a record definition language, then run a record compiler to generate Java source code representations of your types. All Record I/O types are Writable, so they plug into Hadoop very easily. As a bonus, you can generate bindings for other languages, so it's easy to read your data files from other programs.

For whatever reason, Record I/O never really took off. It's used in ZooKeeper, but that's about it (and ZooKeeper will move away from it someday). Momentum has switched to Thrift (from Facebook, now in the Apache Incubator), which offers a very similar proposition, but in more languages. Thrift also makes it easy to build a (cross-language) RPC mechanism.

Yesterday, Google open sourced Protocol Buffers, its "language-neutral, platform-neutral, extensible mechanism for serializing structured data". Record I/O, Thrift and Protocol Buffers are really solving the same problem, so it will be interesting to see how this develops. Of course, since we're talking about persistent data formats, nothing's going to go away in the short or medium term while people have significant amounts of data locked up in these formats.

That's why it makes sense to add support in Hadoop for MapReduce using Thrift and Protocol Buffers: so people can process data in the format they have it in. This will be a relatively simple addition.

What Next?
For RPC, where a message is short-lived, changing the mechanism is more viable in the short term. Going back to Hadoop's RPC mechanism, now that both Thrift and Protocol Buffers offer an alternative, it may well be time to evaluate them to see if either can offer a performance boost. It would be a big job to retrofit RPC in Hadoop with another implementation, but if there are significant performance gains to be had, then it would be worth doing.
分享到:
评论

相关推荐

    the programmer's guide to apache thrift

    Apache Thrift is an open source cross language serialization and RPC framework. With support for over 15 programming languages, Apache Thrift can play an important role in a range of distributed ...

    深入浅析Java Object Serialization与 Hadoop 序列化

    深入浅析Java Object Serialization与 Hadoop 序列化 序列化是指将结构化对象转化为字节流以便在网络上传输或者写到磁盘永久存储的过程。Java 中的序列化是通过实现 Serializable 接口来实现的,而 Hadoop 序列化则...

    Hadoop: The Definitive Guide, 4th Edition

    With the fourth edition of this comprehensive guide, you’ll learn how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. This book is ideal for programmers looking to ...

    java操作hadoop的RPC,源码

    - `ProtobufRpcEngine`:Hadoop使用Google的Protocol Buffers进行序列化和反序列化,此引擎负责处理这些操作。 - `RPC.Server`:服务器端的实现,负责接收和处理客户端的请求。 - `RPC.Client`:客户端的实现,...

    Hadoop- The Definitive Guide, 3rd Edition.pdf

    serialization, and file-based data structures. The next four chapters cover MapReduce in depth. Chapter 5 goes through the practical steps needed to develop a MapReduce application. Chapter 6 looks ...

    Hadoop The Definitive Guide 3rd Edition

    Use Hadoop’s data and I/O building blocks for compression, data integrity, serialization (including Avro), and persistence Discover common pitfalls and advanced features for writing real-world ...

    C++RPC基于boost.asio、boost.serialization等boost库进行反射.zip

    本项目“C++ RPC基于boost.asio、boost.serialization等boost库进行反射”是针对C++编程语言实现的一套RPC框架,利用了Boost库的强大功能来提高其效率和可维护性。 首先,让我们详细了解一下关键组件: 1. **Boost...

    Hadoop: The Definitive Guide [Paperback]

    Become familiar with Hadoop’s data and I/O building blocks for compression, data integrity, serialization, and persistence Discover common pitfalls and advanced features for writing real-world ...

    Hadoop in Practice(2012)

    It balances conceptual foundations with practical recipes for key problem areas like data ingress and egress, serialization, and LZO compression. You'll explore each technique step by step, learning ...

    Pro C# 7: With .NET and .NET Core

    Chapter 20: File I/O and Object Serialization Chapter 21: Data Access with ADO.NET Chapter 22: Introducing Entity Framework 6 Chapter 23: Introducing Windows Communication Foundation Part VII: ...

    java Streams and Serialization 详解

    ### Java Streams 和 Serialization 详解 在Java编程语言中,数据的读取与写入操作是通过流(Streams)实现的。此外,为了保存对象的状态,Java提供了序列化(Serialization)机制。本文将深入探讨Java中的流操作...

    gwt-rpc-serialization:重用 gwt-storage 和 gwt-rpc 序列化技术在客户端序列化对象的概念证明

    在“gwt-rpc-serialization”项目中,开发者可能已经实现了一个概念证明,演示了如何将GWT-RPC的序列化技术应用于GWT-Storage。这意味着,不仅可以使用GWT-RPC技术进行客户端与服务器之间的通信,还可以利用相同的...

    hadoop_the_definitive_guide_3nd_edition

    Avro, REST, and Thrift 468 Example 469 Schemas 470 Loading Data 471 Web Queries 474 HBase Versus RDBMS 477 Successful Service 478 HBase 479 Use Case: HBase at Streamy.com 479 Table of Contents | ix ...

Global site tag (gtag.js) - Google Analytics