Spark Program Performance Tuning in Practice

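Beginners writing their first Spark programs tend to focus only on getting the functionality right, and overlook which implementation would be the most efficient. The rest of this article describes in detail the Spark tuning problems the author ran into in real projects.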

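1. The background of the code below: panelFeatureMid1 has type RDD[(String, (scala.collection.mutable.HashMap[String,Double], (Option[String], Option[String])))]. For each UUID (think of it as a cookie ID), it holds a HashMap of the SPIDs seen on that UUID (think of an SPID as the ID of a monitoring point on a web page) and their frequencies. Of the two trailing Option[String] values, one carries the demographic attributes of that UUID (for example gender, age, education level, income); the other can be ignored.

We now need to train a classifier for the demographic attributes, using the SPIDs and their frequencies as features. This requires assigning an index to every SPID, which is what the following code does.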

    import scala.collection.mutable.HashSet

    // step 2: collect all unique SPIDs, broadcast them, and index them
    val spidSet = panelFeatureMid1.mapPartitions { tp =>
      // deduplicate inside each partition first, so less data reaches the shuffle
      val localSpids = tp.foldLeft(HashSet.empty[String])((s, elem) => s ++= elem._2._1.keySet)
      localSpids.map(s => (s, 1)).toIterator
    }.reduceByKey((a, b) => 1).map(tp => tp._1).collect
    println("spidSet num:" + spidSet.size)
    val indexSet = sc.broadcast(spidSet)
    val panelFeatureMid = panelFeatureMid1.mapPartitions { tp =>
      // build the SPID -> index map once per partition, not once per record
      val indexMap = indexSet.value.zipWithIndex.toMap
      // replace each SPID key with its index; keep the frequency and the attributes
      tp.map(s => (s._1, (s._2._2, s._2._1.map(spid => (indexMap(spid._1), spid._2)))))
    }.cache
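The indexing logic itself can be tried without a cluster. The following is a minimal sketch on plain Scala collections, not the author's code: the object name `SpidIndexSketch`, the helpers `buildIndex` and `reindex`, and the sample UUIDs and SPIDs are all made up for illustration, and the SPIDs are sorted only so that the result is reproducible.

```scala
import scala.collection.mutable

object SpidIndexSketch {
  // Same shape as the SPID -> frequency part of panelFeatureMid1 (hypothetical alias).
  type Features = mutable.HashMap[String, Double]

  // Give every distinct SPID a dense index 0..n-1.
  def buildIndex(rows: Seq[(String, Features)]): Map[String, Int] =
    rows.flatMap(_._2.keys).distinct.sorted.zipWithIndex.toMap

  // Re-key each feature map from SPID to its integer index.
  def reindex(rows: Seq[(String, Features)],
              index: Map[String, Int]): Seq[(String, Map[Int, Double])] =
    rows.map { case (uuid, feats) =>
      (uuid, feats.map { case (spid, freq) => (index(spid), freq) }.toMap)
    }

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      "uuid-1" -> mutable.HashMap("spidB" -> 2.0, "spidA" -> 1.0),
      "uuid-2" -> mutable.HashMap("spidC" -> 5.0, "spidA" -> 3.0)
    )
    val index = buildIndex(rows)
    println(index)
    println(reindex(rows, index))
  }
}
```

In the Spark version above, the distinct SPIDs play the role of `buildIndex` and the second mapPartitions plays the role of `reindex`.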

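Personal experience: a. An RDD that will be used by several later tasks should be kept in memory by calling cache.

b. A dataset that is modest in size but needed globally can be sent out as a broadcast variable.

c. When a map operation needs global data, prefer mapPartitions.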
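The three points can be combined in one small program. This is a hedged sketch rather than code from the project: it assumes Spark is on the classpath and runs with a local master, and the object name `BroadcastLookupSketch`, the lookup table, and the sample records are invented for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastLookupSketch {
  // Replace each SPID with its index using a broadcast lookup table.
  def run(sc: SparkContext): Array[(String, Int)] = {
    // (b) small dataset needed by every task: ship it as a broadcast variable
    val spidIndex = Map("spidA" -> 0, "spidB" -> 1, "spidC" -> 2)
    val lookup = sc.broadcast(spidIndex)

    val events = sc.parallelize(Seq("u1" -> "spidB", "u2" -> "spidA", "u3" -> "spidC"), 2)

    // (c) mapPartitions: fetch the broadcast value once per partition
    val indexed = events.mapPartitions { part =>
      val table = lookup.value
      part.map { case (uuid, spid) => (uuid, table(spid)) }
    }.cache() // (a) reused by the two actions below

    println("rows: " + indexed.count()) // first action computes and caches the partitions
    indexed.collect().sortBy(_._1)      // second action reads the cached partitions
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("broadcast-lookup-sketch")
    val sc = new SparkContext(conf)
    try run(sc).foreach(println) finally sc.stop()
  }
}
```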

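2. In a Spark program, pay special attention to operations that trigger a shuffle: a shuffle moves RDD data across the network and is very time-consuming. Joins in particular (left, right, and full) easily trap beginners in expensive shuffles they cannot get out of, to the point where they become afraid of using joins at all.

The background of the code below: panel and cookieMapping are both of type RDD[(String, String)]. The first String is the UUID (as above); the second String is the demographic attribute and the cookie mapping attribute, respectively. The demographic attribute and the cookie mapping attribute of the same UUID need to be brought together, that is, the two RDDs need a full join.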

    // step 1: load both inputs and give them the same number of partitions
    val panel = getPanel(sc, panelDir + "/l" + month).repartition(numPartitions)
    val cookieMapping = getCookieMapping(sc, s"${cookieMappingDir}/${month}/*")
      .repartition(numPartitions)
    // step 2: full outer join of the cookie mapping and panel data, then cache
    val total = cookieMapping.fullOuterJoin(panel).cache
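What actually lets Spark skip the shuffle for one side of a join is that the side already carries the partitioner the join uses. The sketch below illustrates this with an explicit `HashPartitioner`. It is an illustration under assumptions, not the author's code: Spark is assumed to be on the classpath with a local master, and the object name `CoPartitionedJoinSketch` and the sample records are made up.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object CoPartitionedJoinSketch {
  // Returns whether the join result kept the shared partitioner, plus the joined rows.
  def run(sc: SparkContext): (Boolean, Map[String, (Option[String], Option[String])]) = {
    val partitioner = new HashPartitioner(4)

    // Both sides are hash-partitioned by UUID with the same partitioner and cached,
    // so later joins can reuse this layout instead of shuffling again.
    val panel = sc.parallelize(Seq("u1" -> "female", "u2" -> "male"))
      .partitionBy(partitioner).cache()
    val cookieMapping = sc.parallelize(Seq("u2" -> "cm-2", "u3" -> "cm-3"))
      .partitionBy(partitioner).cache()

    // Passing the same partitioner makes the join operate partition by partition.
    val total = cookieMapping.fullOuterJoin(panel, partitioner)
    (total.partitioner == Some(partitioner), total.collect().toMap)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("co-partitioned-join-sketch")
    val sc = new SparkContext(conf)
    try println(run(sc)) finally sc.stop()
  }
}
```

For a single join, partitionBy on both sides costs the same shuffles the join would have done by itself; the layout pays off when a partitioned and cached RDD takes part in several joins or key-based aggregations.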

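Personal experience: a. Before a join, try to make sure the two RDDs taking part in it have the same number of partitions, which avoids needless shuffles. (Strictly speaking, Spark skips the shuffle for one side of a join only when that side already carries the partitioner the join uses; repartition alone does not set a partitioner.) Operations such as groupByKey and reduceByKey also accept a partition-count argument; setting the partition count there saves an extra repartition call, as in the following example: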

    import org.apache.hadoop.io.{LongWritable, Text}
    import scala.collection.mutable.HashMap
    // MzSequenceFileInputFormat is the project's own input format

    val uuids = sc.broadcast(total.map(tp => tp._1).collect.toSet)
    println("uuids number = " + uuids.value.size)
    println("driver after collect freeMem = " + Runtime.getRuntime().freeMemory() + " totalMem = " + Runtime.getRuntime().totalMemory())
    val uuidSpid = sc.newAPIHadoopFile(s"${monitorEtlDir}/${month}*/campaign*",
        classOf[MzSequenceFileInputFormat], classOf[LongWritable], classOf[Text])
      .mapPartitions { tp =>
        val uuidSet = uuids.value
        val freeMem = Runtime.getRuntime().freeMemory()
        val totalMem = Runtime.getRuntime().totalMemory()
        println("map freeMem = " + freeMem + " totalMem = " + totalMem)
        tp.flatMap { ts =>
          // each record is a list of key=value fields separated by '^'
          val items = ts._2.toString.split("\\^").map { s =>
            val kv = s.split("=")
            (kv(0), kv(1))
          }.toMap
          val uuid = items("uuid")
          // keep only records whose UUID appears in the joined data
          if (uuidSet.contains(uuid)) Iterator((uuid, items.get("p"))) else Iterator.empty
        }
      }
      .groupByKey(numPartitions) // set the partition count here instead of calling repartition
      .mapValues { ps =>
        // count occurrences per SPID; mapValues (unlike map) keeps the partitioner set by groupByKey
        ps.foldLeft(HashMap.empty[String, Double]) { (pv, p) =>
          // p is an Option: skip records without a "p" field instead of calling get on None
          p.foreach(spid => pv.put(spid, pv.getOrElse(spid, 0.0) + 1))
          pv
        }
      }

    val panelFeatureMid1 = uuidSpid.join(total)

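Operations such as map and mapPartitions do not change the number of partitions, so the two RDDs are guaranteed to have the same number of partitions when the join runs. Note, however, that map discards the partitioner set by groupByKey while mapValues keeps it, and Spark avoids re-shuffling a side of the join only when that side still has a matching partitioner.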
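The difference between map and mapValues after a groupByKey can be checked directly through the `partitioner` field of the resulting RDDs. The following is a small sketch under assumptions (Spark on the classpath, local master); the object name `PreservePartitionerSketch` and the sample data are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PreservePartitionerSketch {
  // Returns (partitioner kept after map, partitioner kept after mapValues).
  def run(sc: SparkContext): (Boolean, Boolean) = {
    val grouped = sc.parallelize(Seq("u1" -> "spidA", "u1" -> "spidA", "u2" -> "spidB"))
      .groupByKey(4) // hash-partitions by key into 4 partitions

    // map may change the key, so Spark drops the partitioner.
    val viaMap = grouped.map { case (uuid, spids) => (uuid, spids.size) }
    // mapValues cannot change the key, so the partitioner is kept.
    val viaMapValues = grouped.mapValues(_.size)

    (viaMap.partitioner.isDefined, viaMapValues.partitioner.isDefined)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("preserve-partitioner-sketch")
    val sc = new SparkContext(conf)
    try println(run(sc)) finally sc.stop()
  }
}
```

An RDD that still has its partitioner can be joined with another RDD using the same partitioner without being shuffled again.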

Reposted from: https://my.oschina.net/jhone/blog/416312

2019-09-20 00:02