bit1129

浏览: 1077582 次
性别:
来自: 北京

最近访客更多访客>>

xiaoyaohen24

yuxin8000

abc951654

zhongqi2513

博主相关

博客

微博

相册

留言

关于我

文章分类

社区版块

存档分类

【Spark七十八】Spark Kyro序列化

博客分类：

Spark

当使用SparkContext的saveAsObjectFile方法将对象序列化到文件，以及通过objectFile方法将对象从文件反序列出来的时候，Spark默认使用Java的序列化以及反序列化机制，通常情况下，这种序列化机制是很低效的，Spark支持使用Kyro作为对象的序列化和反序列化机制，序列化的速度比java更快，但是使用Kyro时要注意，Kyro目前还是有些bug。

Spark默认是使用Java的ObjectOutputStream框架，它支持所有的继承于java.io.Serializable序列化,如果想要进行调优的话，可以通过继承java.io.Externalizable。这种格式比较大，而且速度慢。
Spark还支持这种方式Kryo serialization，它的速度快，而且压缩比高于Java的序列化，但是它不支持所有的Serializable格式，并且需要在程序里面注册。它需要在实例化SparkContext之前进行注册

When Spark is transferring data over the network or spilling data to disk, it needs to
serialize objects into a binary format. This comes into play during shuffle operations,
where potentially large amounts of data are transferred. By default Spark will use
Java’s built-in serializer. Spark also supports the use of Kryo, a third-party serialization
library that improves on Java’s serialization by offering both faster serialization
times and a more compact binary representation, but cannot serialize all types of
objects “out of the box.” Almost all applications will benefit from shifting to Kryo for
serialization.

代码示例：

package spark.examples.kryo

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.serializer.KryoRegistrator

//两个成员变量name和age，同时必须实现java.io.Serializable接口
class MyClass1(val name: String, val age: Int) extends java.io.Serializable {
}

//两个成员变量name和age，同时必须实现java.io.Serializable接口
class MyClass2(val name: String, val age: Int) extends java.io.Serializable {

}

//注册使用Kryo序列化的类，要求MyClass1和MyClass2必须实现java.io.Serializable
class MyKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[MyClass1]);
    kryo.register(classOf[MyClass2]);
  }
}

object SparkKryo {
  def main(args: Array[String]) {
    //设置序列化器为KryoSerializer,也可以在配置文件中进行配置
    System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    System.setProperty("spark.kryo.registrator", "spark.examples.kryo.MyKryoRegistrator")
    val conf = new SparkConf()
    conf.setAppName("SparkKryo")
    conf.setMaster("local[3]")

    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(List(new MyClass1("Tom", 31), new MyClass1("Jack", 23), new MyClass1("Mary", 19)))
    val fileDir = "file:///d:/wordcount" + System.currentTimeMillis()
    //将rdd中的对象通过kyro进行序列化，保存到fileDir目录中
    rdd.saveAsObjectFile(fileDir)
    //读取part-00000文件中的数据，对它进行反序列化，，得到对象集合rdd1
    val rdd1 = sc.objectFile[MyClass1](fileDir + "/" + "part-00000")


    rdd1.foreachPartition(iter => {
      while (iter.hasNext) {
        val objOfMyClass1 = iter.next();
        println(objOfMyClass1.name)
      }
    })

    sc.stop
  }
}

查看保存到文件中的内容，是个二进制数据：

SEQ!org.apache.hadoop.io.NullWritable"org.apache.hadoop.io.BytesWritable      蓑_xi??蛔?z汲   i       e ur [Lspark.examples.kryo.MyClass1;?
独#v?  xp   sr spark.examples.kryo.MyClass1z 峌#   xp

问题：

对于普通字符，数字，字符串写入到object文件，是否也是序列化的过程？明确指定使用kvro序列化Int之后，保存的文件确实是二进制的。去掉对Int的注册之后，结果还是一样，序列化的结果完全一样，结果都是：

SEQ!org.apache.hadoop.io.NullWritable"org.apache.hadoop.io.BytesWritable      F脗?庻籡陭姯&?  '       # ur [IM篳&v瓴?  xp

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.serializer.KryoRegistrator

//注册使用Kryo序列化的类,对Int进行序列化
class MyKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[Int]);
  }
}

object SparkKryoPrimitiveType {
  def main(args: Array[String]) {
    //设置序列化器为KryoSerializer,也可以在配置文件中进行配置
    System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    System.setProperty("spark.kryo.registrator", "spark.examples.kryo.MyKryoRegistrator")
    val conf = new SparkConf()
    conf.setAppName("SparkKryoPrimitiveType")
    conf.setMaster("local[3]")

    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(List(1, 3, 7, 9, 11, 22))
    val fileDir = "file:///d:/wordcount" + System.currentTimeMillis()
    //将rdd中的对象通过kyro进行序列化，保存到fileDir目录中
    rdd.saveAsObjectFile(fileDir)
    //读取part-00000文件中的数据，对它进行反序列化，，得到对象集合rdd1
    val rdd1 = sc.objectFile[Int](fileDir + "/" + "part-00000")

    rdd1.foreachPartition(iter => {
      while (iter.hasNext) {
        println(iter.next())
      }
    })

    sc.stop
  }
}

其它：

指定使用Kyro序列化，以及注册Kyro序列化类，可可以使用如下方式

val conf = new SparkConf()
//这句是多于的，调用conf的registerKryoClasses时，已经设置了序列化方法
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// Be strict about class registration
///如果一个要序列化的类没有进行Kryo注册，则强制Spark报错
conf.set("spark.kryo.registrationRequired", "true")
conf.registerKryoClasses(Array(classOf[MyClass], classOf[MyOtherClass]))

  /**
   * Use Kryo serialization and register the given set of classes with Kryo.
   * If called multiple times, this will append the classes from all calls together.
   */
  def registerKryoClasses(classes: Array[Class[_]]): SparkConf = {
    val allClassNames = new LinkedHashSet[String]()
    allClassNames ++= get("spark.kryo.classesToRegister", "").split(',').filter(!_.isEmpty)
    allClassNames ++= classes.map(_.getName)

    set("spark.kryo.classesToRegister", allClassNames.mkString(","))
    set("spark.serializer", classOf[KryoSerializer].getName)
    this
  }

分享到：

【Scala七】Scala核心一：函数 | 【Kafka四】Kakfa伪分布式安装

2015-02-22 19:25
浏览 10340
评论(0)
分类:开源软件
查看更多

发表评论

您还没有登录,请您登录后再发表评论

最近访客更多访客>>

博主相关

文章分类

社区版块

存档分类

最新评论

【Spark七十八】Spark Kyro序列化

问题：

评论

发表评论

相关推荐

最近访客 更多访客>>

博主相关

文章分类

社区版块

存档分类

最新评论

【Spark七十八】Spark Kyro序列化

问题：

评论

发表评论

相关推荐

【Spark109】Windows上运行spark-shell

【Spark108】Spark SQL动态代码生成四

【Spark107】Spark SQL动态代码生成三

【Spark106】Spark SQL动态代码生成二

【Spark105】Spark SQL动态代码生成一

【Spark105】Spark任务调度

【Spark104】Spark源代码构建打包

【Spark103】Task not serializable

【Spark102】Spark存储模块BlockManager剖析

【Spark101】Scala Promise/Future在Spark中的应用

【Spark100】Spark Streaming Checkpoint的一个坑

【Spark九十九】Spark Streaming的batch interval时间内的数据流转源码分析

【Spark九十八】Standalone Cluster Mode下的资源调度源代码分析

【Spark九十七】RDD API之aggregateByKey

【Spark九十六】RDD API之combineByKey

【Spark九十五】Spark Shell操作Spark SQL

【Spark九十四】spark-sql工具的使用

【Spark九十三】Spark读写Sequence File

【Spark九十二】Spark SQL操作Parquet格式的数据

【Spark九十一】Spark Streaming整合Kafka一些值得关注的问题

最近访客更多访客>>