Hadoop: How to Define Custom Types

snwz


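These are my notes on the Hadoop data types chapter, kept for later reference. I am writing them as I learn, so this post will be updated over time.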

The commonly used built-in Hadoop data types and their Java counterparts are listed below.

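Hadoop type	Java type	Description
BooleanWritable	boolean	Boolean
IntWritable	int	32-bit integer
FloatWritable	float	Single-precision floating point
DoubleWritable	double	Double-precision floating point
ByteWritable	byte	Single byte
Text	String	String (stored as UTF-8)
ArrayWritable	array	Array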

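First, a precise definition of serialization.
From Baidu Baike:
Serialization is the process of converting an object's state into a form that can be stored or transmitted. During serialization, the object writes its current state to temporary or persistent storage. The object can later be recreated by reading, or deserializing, its state from that storage.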

Every custom Hadoop type must implement the Writable interface, shown below:

public interface Writable {

  void write(DataOutput out) throws IOException;

  void readFields(DataInput in) throws IOException;
}

write: serializes the fields of this object to out.

readFields: deserializes the fields of this object from in.

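The type also needs a no-argument constructor, because Hadoop creates instances reflectively before calling readFields. By convention, implementations additionally provide a static read method that constructs a new instance with the no-argument constructor, calls readFields on it, and returns it.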

For example:

public static MyWritable read(DataInput in) throws IOException {
    MyWritable w = new MyWritable();
    w.readFields(in);
    return w;
}

The complete example from the official documentation:

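import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class MyWritable implements Writable {
    // Some data
    private int counter;
    private long timestamp;

    // Serialize the fields in a fixed order
    public void write(DataOutput out) throws IOException {
        out.writeInt(counter);
        out.writeLong(timestamp);
    }

    // Deserialize the fields in the same order they were written
    public void readFields(DataInput in) throws IOException {
        counter = in.readInt();
        timestamp = in.readLong();
    }

    // Convenience factory: create an instance and populate it from the stream
    public static MyWritable read(DataInput in) throws IOException {
        MyWritable w = new MyWritable();
        w.readFields(in);
        return w;
    }
}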
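A quick way to check a Writable implementation is to round-trip it through a byte array. The following is a minimal sketch rather than Hadoop code: `SimpleWritable` is a local stand-in for `org.apache.hadoop.io.Writable` so that the example compiles without Hadoop on the classpath, and `CounterRecord` and `WritableRoundTrip` are hypothetical names introduced only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Stand-in for org.apache.hadoop.io.Writable (assumption: same two methods).
interface SimpleWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical value type with the same two fields as MyWritable.
class CounterRecord implements SimpleWritable {
    int counter;
    long timestamp;

    CounterRecord() { }  // no-argument constructor for reflective creation

    CounterRecord(int counter, long timestamp) {
        this.counter = counter;
        this.timestamp = timestamp;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(counter);
        out.writeLong(timestamp);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Fields are read in exactly the order they were written.
        counter = in.readInt();
        timestamp = in.readLong();
    }

    static CounterRecord read(DataInput in) throws IOException {
        CounterRecord r = new CounterRecord();
        r.readFields(in);
        return r;
    }
}

class WritableRoundTrip {
    // Serialize a record into a byte array.
    static byte[] toBytes(SimpleWritable w) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            w.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Rebuild a record from the bytes produced by toBytes.
    static CounterRecord fromBytes(byte[] bytes) {
        try {
            return CounterRecord.read(new DataInputStream(new ByteArrayInputStream(bytes)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = toBytes(new CounterRecord(7, 123L));
        CounterRecord copy = fromBytes(bytes);
        System.out.println(bytes.length + " bytes, counter=" + copy.counter
                + ", timestamp=" + copy.timestamp);
    }
}
```

The serialized form is 12 bytes here (4 for the int, 8 for the long); nothing else, such as field names or a class name, is written by the type itself.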

WritableComparables can be compared to each other, typically via Comparators. Any type which is to be used as a key in the Hadoop Map-Reduce framework should implement this interface.

If the custom type is used as a key, it must implement the WritableComparable interface, which extends two interfaces: Comparable<T> and Writable.

The code is similar to the previous example; the main addition is the compareTo method:

// Assumes the key class MyWritableComparable has an int field named value
public int compareTo(MyWritableComparable o) {
    int thisValue = this.value;
    int thatValue = o.value;
    return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
}
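To see how compareTo determines key order, here is a small self-contained sketch. It does not use Hadoop: `KeyWritable` is a local stand-in for `org.apache.hadoop.io.WritableComparable<T>`, and `EventKey` and `EventKeySortDemo` are hypothetical names. The key is ordered by counter first and by timestamp second, which is one possible choice and not something the original example prescribes.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Stand-in for WritableComparable<T>: serialization methods plus Comparable<T>.
interface KeyWritable<T> extends Comparable<T> {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical key type ordered by counter, then by timestamp.
class EventKey implements KeyWritable<EventKey> {
    int counter;
    long timestamp;

    EventKey() { }

    EventKey(int counter, long timestamp) {
        this.counter = counter;
        this.timestamp = timestamp;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(counter);
        out.writeLong(timestamp);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        counter = in.readInt();
        timestamp = in.readLong();
    }

    @Override
    public int compareTo(EventKey other) {
        int byCounter = Integer.compare(counter, other.counter);
        return byCounter != 0 ? byCounter : Long.compare(timestamp, other.timestamp);
    }

    // equals and hashCode are kept consistent with compareTo so that equal
    // keys are treated as the same key wherever hashing is involved.
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof EventKey)) {
            return false;
        }
        EventKey other = (EventKey) o;
        return counter == other.counter && timestamp == other.timestamp;
    }

    @Override
    public int hashCode() {
        return 31 * counter + Long.hashCode(timestamp);
    }

    @Override
    public String toString() {
        return "(" + counter + ", " + timestamp + ")";
    }
}

class EventKeySortDemo {
    // Return a sorted copy, using the natural order defined by compareTo.
    static List<EventKey> sorted(List<EventKey> keys) {
        List<EventKey> copy = new ArrayList<>(keys);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        List<EventKey> keys = new ArrayList<>();
        keys.add(new EventKey(2, 5L));
        keys.add(new EventKey(1, 9L));
        keys.add(new EventKey(1, 3L));
        System.out.println(sorted(keys));
    }
}
```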

Special type: NullWritable

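NullWritable is a special Writable whose serialized length is 0. Its write and readFields methods are empty: it reads nothing from the stream and writes nothing to it, and serves only as a placeholder. In MapReduce, for example, if you do not need the key or the value, you can declare it as NullWritable. NullWritable is an immutable singleton.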
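The behaviour described above can be imitated in a few lines. This is only a sketch of the idea, not the Hadoop class: `EmptyPlaceholder` is a hypothetical stand-in that follows the same pattern (private constructor, single shared instance, empty serialization methods).

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in that mimics the described behaviour of NullWritable.
final class EmptyPlaceholder {
    private static final EmptyPlaceholder INSTANCE = new EmptyPlaceholder();

    private EmptyPlaceholder() { }  // no public constructor: single instance only

    static EmptyPlaceholder get() {
        return INSTANCE;
    }

    // Writes nothing to the stream.
    public void write(DataOutput out) throws IOException { }

    // Reads nothing from the stream.
    public void readFields(DataInput in) throws IOException { }

    // Helper for the demo: number of bytes produced by write.
    static int serializedLength() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            get().write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size();
    }

    public static void main(String[] args) {
        System.out.println("serialized length: " + serializedLength());
    }
}
```

In a real job the shared instance is obtained with `NullWritable.get()`, for example `context.write(key, NullWritable.get())` when only the key matters.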

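Special type: ObjectWritable
ObjectWritable is a general-purpose wrapper for Java primitive types, Strings, enums, Writables, and arrays of these. It is used for objects passed between client and server, and it wraps the objects sent over RPC, because RPC can only exchange Java primitives, Strings, and Writable types.
ObjectWritable writes the following to the stream:

the class name, followed by the object's own serialized form

Its serialization and deserialization methods are as follows: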

public void readFields(DataInput in) throws IOException {
    readObject(in, this, this.conf);
  }
   
  public void write(DataOutput out) throws IOException {
    writeObject(out, instance, declaredClass, conf);
  }

public static void writeObject(DataOutput out, Object instance,
                               Class declaredClass, 
                               Configuration conf) throws IOException {
  // a null object is represented by the nested type NullInstance
  if (instance == null) {                       // null
    instance = new NullInstance(declaredClass, conf);
    declaredClass = Writable.class;
  }
  // write the declared class name first
  UTF8.writeString(out, declaredClass.getName()); // always write declared
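  /*
   * If the wrapped object is an array, serialize its elements one by one
   * (the length, followed by each element's serialized form),
   * calling writeObject recursively for each element
   */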
   
  if (declaredClass.isArray()) {                // array
    int length = Array.getLength(instance);
    out.writeInt(length);
    for (int i = 0; i < length; i++) {
      writeObject(out, Array.get(instance, i),
                  declaredClass.getComponentType(), conf);
    }
    // a String is written directly
  } else if (declaredClass == String.class) {   // String
    UTF8.writeString(out, (String)instance);
     
  } // write primitive types
  else if (declaredClass.isPrimitive()) {     // primitive type
 
    if (declaredClass == Boolean.TYPE) {        // boolean
      out.writeBoolean(((Boolean)instance).booleanValue());
    } else if (declaredClass == Character.TYPE) { // char
      out.writeChar(((Character)instance).charValue());
    } else if (declaredClass == Byte.TYPE) {    // byte
      out.writeByte(((Byte)instance).byteValue());
    } else if (declaredClass == Short.TYPE) {   // short
      out.writeShort(((Short)instance).shortValue());
    } else if (declaredClass == Integer.TYPE) { // int
      out.writeInt(((Integer)instance).intValue());
    } else if (declaredClass == Long.TYPE) {    // long
      out.writeLong(((Long)instance).longValue());
    } else if (declaredClass == Float.TYPE) {   // float
      out.writeFloat(((Float)instance).floatValue());
    } else if (declaredClass == Double.TYPE) {  // double
      out.writeDouble(((Double)instance).doubleValue());
    } else if (declaredClass == Void.TYPE) {    // void
    } else {
      throw new IllegalArgumentException("Not a primitive: "+declaredClass);
    }
    // write enum types by name
  } else if (declaredClass.isEnum()) {         // enum
    UTF8.writeString(out, ((Enum)instance).name());
    // write Hadoop Writable types: actual class name, then the object itself
  } else if (Writable.class.isAssignableFrom(declaredClass)) { // Writable
    UTF8.writeString(out, instance.getClass().getName());
    ((Writable)instance).write(out);
 
  } else {
    throw new IOException("Can't write: "+instance+" as "+declaredClass);
  }
}

public static Object readObject(DataInput in, Configuration conf)
    throws IOException {
    return readObject(in, null, conf);
  }
     
  /** Read a {@link Writable}, {@link String}, primitive type, or an array of
   * the preceding. */
  @SuppressWarnings("unchecked")
  public static Object readObject(DataInput in, ObjectWritable objectWritable, Configuration conf)
    throws IOException {
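    // read the declared class name
    String className = UTF8.readString(in);
    // first assume it names a primitive type
    Class<?> declaredClass = PRIMITIVE_NAMES.get(className);
    /*
     * If the name is not a primitive, the lookup returns null;
     * in that case load the class by name through the Configuration.
     * Only the Class is resolved here; no object is deserialized yet
     */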
     
    if (declaredClass == null) {
      try {
        declaredClass = conf.getClassByName(className);
      } catch (ClassNotFoundException e) {
        throw new RuntimeException("readObject can't find class " + className, e);
      }
    }    
    // holds the deserialized value
    Object instance;
    // primitive types: deserialize according to the exact type
    if (declaredClass.isPrimitive()) {            // primitive types
 
      if (declaredClass == Boolean.TYPE) {             // boolean
        instance = Boolean.valueOf(in.readBoolean());
      } else if (declaredClass == Character.TYPE) {    // char
        instance = Character.valueOf(in.readChar());
      } else if (declaredClass == Byte.TYPE) {         // byte
        instance = Byte.valueOf(in.readByte());
      } else if (declaredClass == Short.TYPE) {        // short
        instance = Short.valueOf(in.readShort());
      } else if (declaredClass == Integer.TYPE) {      // int
        instance = Integer.valueOf(in.readInt());
      } else if (declaredClass == Long.TYPE) {         // long
        instance = Long.valueOf(in.readLong());
      } else if (declaredClass == Float.TYPE) {        // float
        instance = Float.valueOf(in.readFloat());
      } else if (declaredClass == Double.TYPE) {       // double
        instance = Double.valueOf(in.readDouble());
      } else if (declaredClass == Void.TYPE) {         // void
        instance = null;
      } else {
        throw new IllegalArgumentException("Not a primitive: "+declaredClass);
      }
 
    } else if (declaredClass.isArray()) {              // array
      int length = in.readInt();
      instance = Array.newInstance(declaredClass.getComponentType(), length);
      for (int i = 0; i < length; i++) {
        Array.set(instance, i, readObject(in, conf));
      }
       
    } else if (declaredClass == String.class) {        // String
      instance = UTF8.readString(in);
    } else if (declaredClass.isEnum()) {         // enum
      instance = Enum.valueOf((Class<? extends Enum>) declaredClass, UTF8.readString(in));
    } else {                                      // Writable
      Class instanceClass = null;
      String str = "";
      try {
          // otherwise read the actual class name and load it through the Configuration
        str = UTF8.readString(in);
        instanceClass = conf.getClassByName(str);
      } catch (ClassNotFoundException e) {
        throw new RuntimeException("readObject can't find class " + str, e);
      }
      /*
       * Use WritableFactories to create an instance of instanceClass
       * (which implements Writable), then call that object's own
       * readFields method to deserialize it
       */
      
      Writable writable = WritableFactories.newInstance(instanceClass, conf);
      writable.readFields(in);
      instance = writable;
 
      if (instanceClass == NullInstance.class) {  // null
        declaredClass = ((NullInstance)instance).declaredClass;
        instance = null;
      }
    }
    // finally, store the result in the wrapping ObjectWritable, if one was supplied
    if (objectWritable != null) {                 // store values
      objectWritable.declaredClass = declaredClass;
      objectWritable.instance = instance;
    }
 
    return instance;
       
  }
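The core idea of the two methods above, a type name written ahead of the payload so the reader knows how to decode it, can be shown in a much smaller form. The sketch below is a simplified illustration and not the Hadoop implementation: `TaggedObjectCodec` is a hypothetical name, it supports only int, long, String, and Object[] values, it uses short tags of its own instead of real class names, and it uses `writeUTF`/`readUTF` in place of `UTF8.writeString`/`UTF8.readString`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Simplified illustration of "type tag first, payload second".
class TaggedObjectCodec {

    static void writeObject(DataOutput out, Object value) throws IOException {
        if (value instanceof Integer) {
            out.writeUTF("int");
            out.writeInt((Integer) value);
        } else if (value instanceof Long) {
            out.writeUTF("long");
            out.writeLong((Long) value);
        } else if (value instanceof String) {
            out.writeUTF("string");
            out.writeUTF((String) value);
        } else if (value instanceof Object[]) {
            Object[] items = (Object[]) value;
            out.writeUTF("array");
            out.writeInt(items.length);
            for (Object item : items) {
                writeObject(out, item);  // recursive call per element
            }
        } else {
            throw new IOException("Can't write: " + value);
        }
    }

    static Object readObject(DataInput in) throws IOException {
        String tag = in.readUTF();  // the tag decides how the payload is decoded
        switch (tag) {
            case "int":
                return in.readInt();
            case "long":
                return in.readLong();
            case "string":
                return in.readUTF();
            case "array": {
                int length = in.readInt();
                Object[] items = new Object[length];
                for (int i = 0; i < length; i++) {
                    items[i] = readObject(in);
                }
                return items;
            }
            default:
                throw new IOException("Unknown type tag: " + tag);
        }
    }

    // Serialize a value to memory and read it back.
    static Object roundTrip(Object value) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            writeObject(out, value);
            out.flush();
            return readObject(new DataInputStream(
                    new ByteArrayInputStream(buffer.toByteArray())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Object[] copy = (Object[]) roundTrip(new Object[] {1, "a", 2L});
        System.out.println(copy.length + " elements, first = " + copy[0]);
    }
}
```

Writing the tag before every value makes the stream self-describing, at the cost of extra bytes per value.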

Special type: GenericWritable
Suppose a reducer receives input from several mappers whose output value types differ. GenericWritable handles this case. On the map side it is used as follows:

context.write(new Text(str), new MyGenericWritable(new LongWritable(1)));
context.write(new Text(str), new MyGenericWritable(new Text("1")));

On the reduce side it is used as follows:

long count = 0;  // running total (declared here so the fragment is complete)
for (MyGenericWritable time : values) {
    // unwrap the value held by the MyGenericWritable
    Writable writable = time.get();
    // the wrapped value is a LongWritable
    if (writable instanceof LongWritable) {
        count += ((LongWritable) writable).get();
    }
    // the wrapped value is a Text
    if (writable instanceof Text) {
        count += Long.parseLong(((Text) writable).toString());
    }
}

The custom MyGenericWritable is defined as follows:

class MyGenericWritable extends GenericWritable {

    // no-argument constructor
    public MyGenericWritable() {
    }

    // constructor wrapping a Text
    public MyGenericWritable(Text text) {
        super.set(text);
    }

    // constructor wrapping a LongWritable
    public MyGenericWritable(LongWritable longWritable) {
        super.set(longWritable);
    }

    // the types this wrapper is allowed to hold
    @SuppressWarnings("unchecked")
    @Override
    protected Class<? extends Writable>[] getTypes() {
        return new Class[]{LongWritable.class, Text.class};
    }
}
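The mechanism behind this wrapper is a small type index written in front of the wrapped value, with the index pointing into the array of registered types. The sketch below imitates that idea without Hadoop. All names are hypothetical stand-ins: `Payload` for Writable, `LongPayload` and `TextPayload` for LongWritable and Text, and `TaggedPayload` for the GenericWritable subclass.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Stand-in for Writable.
interface Payload {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

class LongPayload implements Payload {
    long value;
    LongPayload() { }
    LongPayload(long value) { this.value = value; }
    public void write(DataOutput out) throws IOException { out.writeLong(value); }
    public void readFields(DataInput in) throws IOException { value = in.readLong(); }
}

class TextPayload implements Payload {
    String value = "";
    TextPayload() { }
    TextPayload(String value) { this.value = value; }
    public void write(DataOutput out) throws IOException { out.writeUTF(value); }
    public void readFields(DataInput in) throws IOException { value = in.readUTF(); }
}

// Wrapper that writes a one-byte type index followed by the wrapped value.
class TaggedPayload implements Payload {
    // The position in this array is the index written to the stream.
    private static final Class<?>[] TYPES = {LongPayload.class, TextPayload.class};

    private Payload instance;

    TaggedPayload() { }

    TaggedPayload(Payload instance) {
        indexOf(instance.getClass());  // rejects unregistered types early
        this.instance = instance;
    }

    Payload get() {
        return instance;
    }

    private static int indexOf(Class<?> type) {
        for (int i = 0; i < TYPES.length; i++) {
            if (TYPES[i] == type) {
                return i;
            }
        }
        throw new IllegalArgumentException("Unregistered type: " + type.getName());
    }

    public void write(DataOutput out) throws IOException {
        if (instance == null) {
            throw new IOException("No wrapped value set");
        }
        out.writeByte(indexOf(instance.getClass()));
        instance.write(out);
    }

    public void readFields(DataInput in) throws IOException {
        int index = in.readByte();
        if (index < 0 || index >= TYPES.length) {
            throw new IOException("Bad type index: " + index);
        }
        try {
            // Create an empty instance of the registered type, then fill it.
            instance = (Payload) TYPES[index].getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IOException("Cannot instantiate type index " + index, e);
        }
        instance.readFields(in);
    }

    // Serialize a wrapper to memory and read it back into a fresh wrapper.
    static TaggedPayload roundTrip(TaggedPayload original) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            original.write(out);
            out.flush();
            TaggedPayload copy = new TaggedPayload();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(buffer.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the index depends on the position of each class in the array, the order of the registered types has to be the same on the writing and the reading side.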

2015-07-15 09:37