Using WordCount, the simplest possible shuffle scenario, this post walks through Spark's shuffle write and read paths.
The WordCount code:
package spark.examples

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object SparkWordCount {
  def main(args: Array[String]) {
    System.setProperty("hadoop.home.dir", "E:\\devsoftware\\hadoop-2.5.2\\hadoop-2.5.2")
    val conf = new SparkConf()
    conf.setAppName("SparkWordCount")
    conf.setMaster("local")
    val sc = new SparkContext(conf)
    val rdd = sc.textFile("file:///D:/word.in")
    println(rdd.toDebugString)
    val rdd1 = rdd.flatMap(_.split(" "))
    println("rdd1:" + rdd1.toDebugString)
    val rdd2 = rdd1.map((_, 1))
    println("rdd2:" + rdd2.toDebugString)
    val rdd3 = rdd2.reduceByKey(_ + _)
    println("rdd3:" + rdd3.toDebugString)
    rdd3.saveAsTextFile("file:///D:/wordout" + System.currentTimeMillis())
    sc.stop
  }
}
Internally, Spark builds six RDDs, two Stages, two Tasks, and one Job for this code.
The first four RDDs form one Stage and the last two RDDs form the second Stage. The first Stage runs as a ShuffleMapTask, analogous to Hadoop's Map phase: it splits each line into words and tags each word with a count of 1. The second Stage runs as a ResultTask, analogous to Hadoop's Reduce phase: it shuffles the map output to the reduce side, aggregates the counts, and writes the result to local disk (in this example).
Why is there no (narrow) dependency between the last RDD of Stage 0 (a MappedRDD) and the ShuffledRDD? For RDDs connected by a narrow dependency, if RDD B depends on RDD A then B's compute method must pull its input from A. A ShuffledRDD, however, takes its input from the output of the previous Stage (which may sit in memory or on disk). Different Stages run in different Tasks, and the Tasks of consecutive Stages run serially, so by the time the next Stage starts, the previous Stage's Task has finished and its RDD instances are no longer alive. The ShuffledRDD therefore has to fetch its data from outside the RDD chain rather than from its parent RDD's iterator, which is why there is no narrow dependency between them.
In other words, the first RDD of each Stage does not read its input from another RDD's iterator.
A ShuffledRDD is the RDD that starts after a shuffle; the data it depends on is scattered across the mapper nodes, and the ShuffledRDD has to pull it back. So the RDD that actually fetches shuffle data from the mappers is the ShuffledRDD. Put differently, a ShuffledRDD depends on data from many nodes, i.e. it sits behind a wide (shuffle) dependency.
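This can be checked from the public API. Below is a minimal runnable sketch in local mode (the object name is made up and the exact printed representations vary by Spark version): the map-side RDD reports a narrow OneToOneDependency, while the RDD returned by reduceByKey reports a ShuffleDependency, which is exactly the wide dependency discussed above.

import org.apache.spark.{SparkConf, SparkContext}

object DependencySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("DependencySketch").setMaster("local"))
    val rdd2 = sc.parallelize(Seq("spark", "shuffle", "spark")).map((_, 1))
    val rdd3 = rdd2.reduceByKey(_ + _)
    println(rdd2.dependencies)  // a narrow OneToOneDependency on its parent
    println(rdd3.dependencies)  // a ShuffleDependency: the wide dependency discussed above
    println(rdd3.toDebugString) // the stage boundary shows up at the ShuffledRDD
    sc.stop()
  }
}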
This raises a few questions: where is the output of the first Stage stored, and where does the second Stage's reduce side read it from? Do the two Stages overlap, i.e. can Stage 1 start consuming data as soon as Stage 0 has produced part of it?
DAGScheduler splits the RDD graph into two Stages: Stage 0 runs as ShuffleMapTasks (responsible for the map-side shuffle write), and Stage 1 runs as ResultTasks (which first read the shuffle data, then aggregate it and write the final output).
Main flow of ShuffleMapTask.runTask
1. Deserialize, via the statement below, the last RDD of Stage 0 (here a MappedRDD) together with its ShuffleDependency instance. The serialized data is the task's binary representation, taskBinary.value, i.e. the task submitted from the SparkContext(?).
Note that taskBinary is a Broadcast variable; in ShuffleMapTask it is declared as taskBinary: Broadcast[Array[Byte]].
a. The rdd has a member f, the function this RDD carries. For WordCount, the deserialized MappedRDD carries the (_, 1) function from val rdd2 = rdd1.map((_, 1)); the _ + _ from val rdd3 = rdd2.reduceByKey(_ + _) lives in the ShuffleDependency's aggregator instead.
val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
  ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
b. The rdd's dependencies_ member holds the RDD's dependency graph, and it is serialized along with the RDD.
c. dep is a ShuffleDependency instance (its class declaration is not reproduced here).
Question: the ShuffleMapTask already holds a ShuffleDependency; does the ResultTask also hold this ShuffleDependency? In other words, which Stage is the "shuffle Stage"?
2. SparkEnv.get.shuffleManager returns the SortShuffleManager instance. SortShuffleManager holds an IndexShuffleBlockManager instance, which in turn holds an org.apache.spark.storage.BlockManager instance.
Both the IndexShuffleBlockManager instance and the org.apache.spark.storage.BlockManager instance are obtained from SparkEnv, so they can be treated as unique within an executor.
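Which ShuffleManager SparkEnv creates is controlled by the spark.shuffle.manager setting, which defaults to the sort-based shuffle in this Spark version; that is why a SortShuffleManager shows up here. A small illustrative snippet (the object name is made up):

import org.apache.spark.SparkConf

object ShuffleManagerConfSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    // "sort" selects SortShuffleManager (the default here); "hash" selects the older hash-based shuffle.
    println(conf.get("spark.shuffle.manager", "sort"))
  }
}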
3. Obtain a SortShuffleWriter instance from the SortShuffleManager. The SortShuffleWriter holds
- the IndexShuffleBlockManager instance (the same one held by the SortShuffleManager),
- the org.apache.spark.storage.BlockManager instance,
- a MapStatus instance, which is what the task eventually returns(?).
The writer is obtained with the statement below, so a shuffleHandle has to be passed in. dep.shuffleHandle is itself a call: it registers the shuffle with the shuffleManager and returns the resulting handle.
Obtaining the writer instance:
writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context) // partitionId identifies one partition of the RDD; the write operates partition by partition
The dep.shuffleHandle call:
val shuffleHandle: ShuffleHandle = _rdd.context.env.shuffleManager.registerShuffle(shuffleId, _rdd.partitions.size, this)
The shuffleManager.registerShuffle method:
/**
* Register a shuffle with the manager and obtain a handle for it to pass to tasks.
*/
override def registerShuffle[K, V, C](
shuffleId: Int,
numMaps: Int,
dependency: ShuffleDependency[K, V, C]): ShuffleHandle = {
new BaseShuffleHandle(shuffleId, numMaps, dependency)
}
4. Call SortShuffleWriter's write method to produce the shuffle output. What is the input of write? It is obtained by calling iterator on the last RDD of Stage 0, which yields that RDD's data as an iterator, so execution turns to
the iterator method of the RDD (the last RDD of Stage 0 is a MappedRDD). This iterator call ends up invoking the iterator method of every RDD in the Stage. The scheme is:
Suppose the chain is A <- B <- C <- D, i.e. each RDD depends on the one to its left.
D, C and B each call their parent's iterator method to obtain data, apply the function they carry, and hand the result to the next RDD.
Taking C as an example:
D calls C's iterator method to get C's data; to produce it, C calls B's iterator method, applies its own function to the data it receives, and passes the transformed data on to D. This is exactly the RDD pipelining idea.
A has no parent RDD: A is a HadoopRDD, so its input is not another RDD but an InputSplit from the Hadoop file system abstraction (in Spark, even a local file is wrapped as a HadoopRDD).
For example, here is FlatMappedRDD's compute method:
override def compute(split: Partition, context: TaskContext) =
firstParent[T].iterator(split, context).flatMap(f)
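The same pipelining can be sketched with plain Scala iterators (an illustrative stand-in, not Spark code): each step wraps its parent's iterator and applies its own function lazily, so one pass over the input drives the whole chain without materializing intermediate collections.

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val lines: Iterator[String] = Iterator("spark shuffle", "spark sort") // stands in for the HadoopRDD
    val words = lines.flatMap(_.split(" "))                               // like rdd1 (FlatMappedRDD)
    val pairs = words.map((_, 1))                                         // like rdd2 (MappedRDD)
    pairs.foreach(println) // pulling from `pairs` pulls records through the whole chain, one at a time
  }
}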
5. The code of SortShuffleWriter's write method:
  override def write(records: Iterator[_ <: Product2[K, V]]): Unit = {
    if (dep.mapSideCombine) { // where is mapSideCombine defined?
      if (!dep.aggregator.isDefined) {
        throw new IllegalStateException("Aggregator is empty for map-side combine")
      }
      // ExternalSorter is the key class. Constructing it requires an aggregator and a partitioner;
      // what are keyOrdering and serializer used for, and what exactly does the aggregator do?
      sorter = new ExternalSorter[K, V, C](
        dep.aggregator, Some(dep.partitioner), dep.keyOrdering, dep.serializer)
      // After reading the records produced by the dependent RDDs, call sorter.insertAll.
      // What does insertAll do? Note the constructor parameters above; the blockManager comes straight from SparkEnv.
      sorter.insertAll(records)
    } else {
      // In this case we pass neither an aggregator nor an ordering to the sorter, because we don't
      // care whether the keys get sorted in each partition; that will be done on the reduce side
      // if the operation being run is sortByKey.
      sorter = new ExternalSorter[K, V, V](
        None, Some(dep.partitioner), None, dep.serializer)
      sorter.insertAll(records)
    }
    val outputFile = shuffleBlockManager.getDataFile(dep.shuffleId, mapId)
    val blockId = shuffleBlockManager.consolidateId(dep.shuffleId, mapId)
    val partitionLengths = sorter.writePartitionedFile(blockId, context, outputFile)
    // Write the index file for (shuffleId, mapId) from the per-partition lengths
    shuffleBlockManager.writeIndexFile(dep.shuffleId, mapId, partitionLengths)
    // Record which BlockManager holds this shuffle output; the ResultTask relies on this to locate the blocks
    mapStatus = MapStatus(blockManager.shuffleServerId, partitionLengths)
  }
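To make the getDataFile / writeIndexFile calls concrete: under the sort-based shuffle, each map task ends up with one data file holding all reduce partitions back to back, plus an index file of cumulative offsets so each reducer can seek directly to its slice. A small illustrative sketch of that offset arithmetic (the partition lengths below are hypothetical):

object IndexFileSketch {
  def main(args: Array[String]): Unit = {
    val partitionLengths = Array(120L, 0L, 340L, 75L) // hypothetical bytes written per reduce partition
    // The index stores cumulative offsets; partition i occupies offsets(i) until offsets(i + 1) in the data file.
    val offsets = partitionLengths.scanLeft(0L)(_ + _)
    println(offsets.mkString(", ")) // 0, 120, 120, 460, 535
  }
}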
6. The ExternalSorter class
private[spark] class ExternalSorter[K, V, C](
    aggregator: Option[Aggregator[K, V, C]] = None, // what is the aggregator used for?
    partitioner: Option[Partitioner] = None,
    ordering: Option[Ordering[K]] = None,
    serializer: Option[Serializer] = None)
  extends Logging with Spillable[SizeTrackingPairCollection[(Int, K), C]] {

  private val numPartitions = partitioner.map(_.numPartitions).getOrElse(1) // number of partitions, taken from the partitioner
  private val shouldPartition = numPartitions > 1 // only partition when there is more than one partition
  private val blockManager = SparkEnv.get.blockManager // blockManager comes from SparkEnv
  private val diskBlockManager = blockManager.diskBlockManager // diskBlockManager comes from the blockManager
  private val ser = Serializer.getSerializer(serializer)
  private val serInstance = ser.newInstance()
  private val conf = SparkEnv.get.conf
  private val spillingEnabled = conf.getBoolean("spark.shuffle.spill", true) // whether spilling to disk is enabled
  private val fileBufferSize = conf.getInt("spark.shuffle.file.buffer.kb", 32) * 1024 // per-file buffer size, 32 KB by default
  private val transferToEnabled = conf.getBoolean("spark.file.transferTo", true) // whether spark.file.transferTo is enabled; what does this parameter control?

  // Size of object batches when reading/writing from serializers.
  //
  // Objects are written in batches, with each batch using its own serialization stream. This
  // cuts down on the size of reference-tracking maps constructed when deserializing a stream.
  //
  // NOTE: Setting this too low can cause excessive copying when serializing, since some serializers
  // grow internal data structures by growing + copying every time the number of objects doubles.
  private val serializerBatchSize = conf.getLong("spark.shuffle.spill.batchSize", 10000)
ExternalSorter is a very important class. For reduceByKey, each map-side node combines locally before writing to disk: records with the same key on that node are merged, and the combine operation is _ + _, i.e. the values of identical keys are added up.
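That combine can be sketched with the two functions reduceByKey wires into its aggregator: createCombiner is the identity (v: V) => v and mergeValue is the user-supplied _ + _. A minimal stand-alone illustration of the logic (plain Scala, not the actual ExternalSorter code; the object name is made up):

object MapSideCombineSketch {
  def main(args: Array[String]): Unit = {
    val createCombiner: Int => Int = v => v   // (v: V) => v, from combineByKey
    val mergeValue: (Int, Int) => Int = _ + _ // the function the user passed to reduceByKey
    val records = Seq(("spark", 1), ("shuffle", 1), ("spark", 1))
    val combined = records.foldLeft(Map.empty[String, Int]) { case (acc, (k, v)) =>
      acc.updated(k, acc.get(k).map(old => mergeValue(old, v)).getOrElse(createCombiner(v)))
    }
    println(combined) // Map(spark -> 2, shuffle -> 1)
  }
}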
7. Stage 0's disk write happens in SortShuffleWriter's write method, already shown in step 5. The key statement is sorter.insertAll(records): records is the iterator produced by the last RDD of Stage 0 after its chain of functions has been applied to the RDDs it depends on, and insertAll is where the map-side combine (and any spilling) takes place.
8. The ExternalSorter.insertAll method
  def insertAll(records: Iterator[_ <: Product2[K, V]]): Unit = {
    // TODO: stop combining if we find that the reduction factor isn't high
    val shouldCombine = aggregator.isDefined // true here, since reduceByKey defines an aggregator
    if (shouldCombine) { // this branch is taken
      // Combine values in-memory first using our AppendOnlyMap
      // mergeValue is a function: for val rdd3 = rdd2.reduceByKey(_ + _) it is the _ + _ operation
      val mergeValue = aggregator.get.mergeValue
      // createCombiner is the (v: V) => v function from PairRDDFunctions:
      //   def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = {
      //     combineByKey[V]((v: V) => v, func, func, partitioner)
      //   }
      val createCombiner = aggregator.get.createCombiner
      var kv: Product2[K, V] = null // kv is the (K, V) pair of the record currently being read
      // update is the key function. It takes two arguments: whether the map already holds a value for
      // this key (hadValue), and the old value (null if absent; in Scala null is also an object).
      val update = (hadValue: Boolean, oldValue: C) => {
        // If the key already exists, merge by key (merge = call mergeValue);
        // otherwise call createCombiner, which for reduceByKey just returns the value unchanged.
        if (hadValue) mergeValue(oldValue, kv._2) else createCombiner(kv._2)
      }
      while (records.hasNext) {
        addElementsRead() // count of elements read, incremented once per record
        kv = records.next() // read the current record
        map.changeValue((getPartition(kv._1), kv._1), update) // the key statement
        maybeSpillCollection(usingMap = true) // decide whether to spill to disk, depending on the data volume
      }
    } else if (bypassMergeSort) {
      // SPARK-4479: Also bypass buffering if merge sort is bypassed to avoid defensive copies
      if (records.hasNext) {
        spillToPartitionFiles(records.map { kv =>
          ((getPartition(kv._1), kv._1), kv._2.asInstanceOf[C])
        })
      }
    } else {
      // Stick values into our buffer
      while (records.hasNext) {
        addElementsRead()
        val kv = records.next()
        buffer.insert((getPartition(kv._1), kv._1), kv._2.asInstanceOf[C])
        maybeSpillCollection(usingMap = false)
      }
    }
  }
9. ExternalSorter.insertAll calls map.changeValue((getPartition(kv._1), kv._1), update)
Here:
- map is a SizeTrackingAppendOnlyMap; its documentation reads:
// Data structures to store in-memory objects before we spill. Depending on whether we have an
// Aggregator set, we either put objects into an AppendOnlyMap where we combine them, or we
// store them in an array buffer.
- kv is the current record
- the update function was created in ExternalSorter.insertAll (see above)
- getPartition(kv._1) maps the key to its partition; its implementation is
  private def getPartition(key: K): Int = {
    // partitioner is a HashPartitioner; when there is only one partition, no partitioning is done and 0 is returned
    if (shouldPartition) partitioner.get.getPartition(key) else 0
  }
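For the HashPartitioner used here, getPartition boils down to a non-negative modulo of the key's hashCode over the number of partitions. A stand-alone sketch that mirrors this behaviour (not the Spark source itself; the object name is made up):

object HashPartitionSketch {
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw // Java's % can be negative, so shift negative results back into range
  }
  def main(args: Array[String]): Unit = {
    val numPartitions = 4
    Seq("spark", "shuffle", "word", "count").foreach { key =>
      println(s"$key -> partition ${nonNegativeMod(key.hashCode, numPartitions)}")
    }
  }
}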
10. The implementation of SizeTrackingAppendOnlyMap.changeValue
override def changeValue(key: K, updateFunc: (Boolean, V) => V): V = {
val newValue = super.changeValue(key, updateFunc)
super.afterUpdate()
newValue
}
SizeTrackingAppendOnlyMap.changeValue above delegates to its parent class AppendOnlyMap.changeValue:
  def changeValue(key: K, updateFunc: (Boolean, V) => V): V = {
    assert(!destroyed, destructionMessage)
    val k = key.asInstanceOf[AnyRef]
    if (k.eq(null)) { // null keys are stored separately from the hash table
      if (!haveNullValue) {
        incrementSize()
      }
      nullValue = updateFunc(haveNullValue, nullValue) // nullValue is a field defined in AppendOnlyMap
      haveNullValue = true
      return nullValue
    }
    var pos = rehash(k.hashCode) & mask // rehash the key to locate its slot in the underlying array
    var i = 1
    while (true) {
      // data is the array backing AppendOnlyMap; it uses twice as many slots as entries
      // because data(2 * pos) holds the key and data(2 * pos + 1) holds its value
      val curKey = data(2 * pos)
      if (k.eq(curKey) || k.equals(curKey)) { // the key already exists in the map, so combine
        // apply updateFunc (the update function built in ExternalSorter.insertAll, i.e. the _ + _ merge)
        // to the value cached in the map
        val newValue = updateFunc(true, data(2 * pos + 1).asInstanceOf[V])
        data(2 * pos + 1) = newValue.asInstanceOf[AnyRef] // write the new value back to data(2 * pos + 1)
        return newValue
      } else if (curKey.eq(null)) { // the slot data(2 * pos) is empty: this key is new
        val newValue = updateFunc(false, null.asInstanceOf[V]) // calls createCombiner(kv._2), i.e. just the value itself
        data(2 * pos) = k // key
        data(2 * pos + 1) = newValue.asInstanceOf[AnyRef] // value
        incrementSize() // a new K/V was added, so check whether the table needs to grow
        return newValue
      } else { // the slot is occupied by a different key (hash collision): probe the next position
        val delta = i // quadratic probing: the step grows by 1 after every collision
        pos = (pos + delta) & mask
        i += 1
      }
    }
    null.asInstanceOf[V] // Never reached but needed to keep compiler happy
  }
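To make the probing in the final else branch concrete: delta grows by one on each collision, so the offsets from the initial slot are 1, 3, 6, 10, ... (triangular numbers). A tiny illustrative sketch of that sequence (the capacity and starting slot are made up):

object ProbeSequenceSketch {
  def main(args: Array[String]): Unit = {
    val capacity = 64        // hypothetical table capacity (always a power of two)
    val mask = capacity - 1
    var pos = 37             // hypothetical starting slot, i.e. rehash(k.hashCode) & mask
    var i = 1
    val probes = (1 to 5).map { _ =>
      val current = pos
      pos = (pos + i) & mask // same step as the else branch above
      i += 1
      current
    }
    println(probes.mkString(" -> ")) // 37 -> 38 -> 40 -> 43 -> 47
  }
}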
11. ExternalSorter.insertAll calls maybeSpillCollection(usingMap = true)
According to the method's documentation, it spills the current in-memory collection to disk if needed.
The usingMap parameter selects which in-memory collection is used: with usingMap = true the map (a SizeTrackingAppendOnlyMap) is used, otherwise the buffer (a SizeTrackingPairBuffer). Both map and buffer are in-memory collections defined in ExternalSorter.
  /**
   * Spill the current in-memory collection to disk if needed.
   *
   * @param usingMap whether we're using a map or buffer as our current in-memory collection
   */
  private def maybeSpillCollection(usingMap: Boolean): Unit = {
    if (!spillingEnabled) { // enabled by default; disabling it risks OOM
      return
    }
    if (usingMap) {
      // If the map was spilled, a fresh map is created. Spilling dumps the map's data to disk;
      // does a disk-level re-combine happen later when the spill files are merged?
      if (maybeSpill(map, map.estimateSize())) {
        map = new SizeTrackingAppendOnlyMap[(Int, K), C]
      }
    } else {
      if (maybeSpill(buffer, buffer.estimateSize())) {
        buffer = new SizeTrackingPairBuffer[(Int, K), C]
      }
    }
  }
11.1 The maybeSpill method
maybeSpill is called on the ExternalSorter, but it is defined in ExternalSorter's parent trait Spillable, so the call lands in Spillable.maybeSpill.
The method's documentation says it attempts to acquire more memory before spilling; in other words, if enough extra memory can be obtained, the spill to disk may be avoided.
  /**
   * Spills the current in-memory collection to disk if needed. Attempts to acquire more
   * memory before spilling.
   *
   * @param collection collection to spill to disk
   * @param currentMemory estimated size of the collection in bytes
   * @return true if `collection` was spilled to disk; false otherwise
   */
  protected def maybeSpill(collection: C, currentMemory: Long): Boolean = {
    // Decision logic:
    // 1. If the outer if is false, no spill happens.
    // 2. If the outer if is true, spill only when the memory threshold is still below the memory currently in use.
    // elementsRead counts the elements read so far; a spill is only considered every 32 elements (elementsRead % 32 == 0).
    // trackMemoryThreshold is predefined as 1000:
    //   // Threshold for `elementsRead` before we start tracking this collection's memory usage
    //   // private[this] val trackMemoryThreshold = 1000
    // myMemoryThreshold defaults to 5 MB:
    //   // Initial threshold for the size of a collection before we start tracking its memory usage
    //   // Exposed for testing
    //   // private[this] val initialMemoryThreshold: Long =
    //   //   SparkEnv.get.conf.getLong("spark.shuffle.spill.initialMemoryThreshold", 5 * 1024 * 1024)
    // currentMemory is an estimate of the memory currently used, computed by map.estimateSize().
    // amountToRequest: we ask for twice the current usage minus the current threshold.
    if (elementsRead > trackMemoryThreshold && elementsRead % 32 == 0 &&
        currentMemory >= myMemoryThreshold) {
      // Claim up to double our current memory from the shuffle memory pool
      val amountToRequest = 2 * currentMemory - myMemoryThreshold
      val granted = shuffleMemoryManager.tryToAcquire(amountToRequest)
      // Add whatever was granted to the memory threshold we are willing to tolerate
      myMemoryThreshold += granted
      // If the threshold is still not above the current usage, we must spill
      if (myMemoryThreshold <= currentMemory) {
        // We were granted too little memory to grow further (either tryToAcquire returned 0,
        // or we already had more memory than myMemoryThreshold); spill the current collection
        _spillCount += 1
        logSpillage(currentMemory)
        // Perform the spill
        spill(collection)
        _elementsRead = 0 // reset the read counter
        // Keep track of spills, and release memory
        // (_memoryBytesSpilled is the total number of bytes spilled so far: private[this] var _memoryBytesSpilled = 0L)
        _memoryBytesSpilled += currentMemory
        releaseMemoryForThisThread() // the data is now on disk, so release the memory and reset the threshold to its initial value
        return true
      }
    }
    false // none of the conditions were met, so no spill is needed
  }
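A worked example of the threshold arithmetic above, with made-up sizes, shows why a small grant still forces a spill:

object SpillArithmeticSketch {
  def main(args: Array[String]): Unit = {
    var myMemoryThreshold = 5L * 1024 * 1024    // initial threshold: 5 MB
    val currentMemory = 6L * 1024 * 1024        // estimated collection size: 6 MB
    val amountToRequest = 2 * currentMemory - myMemoryThreshold // = 7 MB
    val granted = 1L * 1024 * 1024              // suppose the shuffle memory pool only grants 1 MB
    myMemoryThreshold += granted                // threshold grows to 6 MB
    val mustSpill = myMemoryThreshold <= currentMemory // 6 MB <= 6 MB, so spill
    println(s"requested=$amountToRequest granted=$granted newThreshold=$myMemoryThreshold spill=$mustSpill")
  }
}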
11.2 Logging the spill: logSpillage
/**
* Prints a standard log message detailing spillage.
*
* @param size number of bytes spilled
*/
@inline private def logSpillage(size: Long) {
val threadId = Thread.currentThread().getId
logInfo("Thread %d spilling in-memory map of %s to disk (%d time%s so far)"
.format(threadId, org.apache.spark.util.Utils.bytesToString(size),
_spillCount, if (_spillCount > 1) "s" else ""))
}
11.3 spill(Collection)
  /**
   * Spill the current in-memory collection to disk, adding a new file to spills, and clear it.
   */
  override protected[this] def spill(collection: SizeTrackingPairCollection[(Int, K), C]): Unit = {
    if (bypassMergeSort) { // bypassMergeSort decides which spill path is taken
      spillToPartitionFiles(collection)
    } else {
      spillToMergeableFile(collection)
    }
  }
spillToPartitionFiles:
private def spillToPartitionFiles(iterator: Iterator[((Int, K), C)]): Unit = {
assert(bypassMergeSort)
// Create our file writers if we haven't done so yet
if (partitionWriters == null) {
curWriteMetrics = new ShuffleWriteMetrics()
partitionWriters = Array.fill(numPartitions) {
// Because these files may be read during shuffle, their compression must be controlled by
// spark.shuffle.compress instead of spark.shuffle.spill.compress, so we need to use
// createTempShuffleBlock here; see SPARK-3426 for more context.
val (blockId, file) = diskBlockManager.createTempShuffleBlock()
blockManager.getDiskWriter(blockId, file, ser, fileBufferSize, curWriteMetrics).open()
}
}
// No need to sort stuff, just write each element out
while (iterator.hasNext) {
val elem = iterator.next()
val partitionId = elem._1._1
val key = elem._1._2
val value = elem._2
partitionWriters(partitionId).write((key, value))
}
}
spillToMergeableFile
/**
* Spill our in-memory collection to a sorted file that we can merge later (normal code path).
* We add this file into spilledFiles to find it later.
*
* Alternatively, if bypassMergeSort is true, we spill to separate files for each partition.
* See spillToPartitionedFiles() for that code path.
*
* @param collection whichever collection we're using (map or buffer)
*/
private def spillToMergeableFile(collection: SizeTrackingPairCollection[(Int, K), C]): Unit = {
assert(!bypassMergeSort)
// Because these files may be read during shuffle, their compression must be controlled by
// spark.shuffle.compress instead of spark.shuffle.spill.compress, so we need to use
// createTempShuffleBlock here; see SPARK-3426 for more context.
val (blockId, file) = diskBlockManager.createTempShuffleBlock()
curWriteMetrics = new ShuffleWriteMetrics()
var writer = blockManager.getDiskWriter(blockId, file, ser, fileBufferSize, curWriteMetrics)
var objectsWritten = 0 // Objects written since the last flush
// List of batch sizes (bytes) in the order they are written to disk
val batchSizes = new ArrayBuffer[Long]
// How many elements we have in each partition
val elementsPerPartition = new Array[Long](numPartitions)
// Flush the disk writer's contents to disk, and update relevant variables.
// The writer is closed at the end of this process, and cannot be reused.
def flush() = {
val w = writer
writer = null
w.commitAndClose()
_diskBytesSpilled += curWriteMetrics.shuffleBytesWritten
batchSizes.append(curWriteMetrics.shuffleBytesWritten)
objectsWritten = 0
}
var success = false
try {
val it = collection.destructiveSortedIterator(partitionKeyComparator)
while (it.hasNext) {
val elem = it.next()
val partitionId = elem._1._1
val key = elem._1._2
val value = elem._2
writer.write(key)
writer.write(value)
elementsPerPartition(partitionId) += 1
objectsWritten += 1
if (objectsWritten == serializerBatchSize) {
flush()
curWriteMetrics = new ShuffleWriteMetrics()
writer = blockManager.getDiskWriter(blockId, file, ser, fileBufferSize, curWriteMetrics)
}
}
if (objectsWritten > 0) {
flush()
} else if (writer != null) {
val w = writer
writer = null
w.revertPartialWritesAndClose()
}
success = true
} finally {
if (!success) {
// This code path only happens if an exception was thrown above before we set success;
// close our stuff and let the exception be thrown further
if (writer != null) {
writer.revertPartialWritesAndClose()
}
if (file.exists()) {
file.delete()
}
}
}
spills.append(SpilledFile(file, blockId, batchSizes.toArray, elementsPerPartition))
}
11.4 Releasing memory
  /**
   * Release our memory back to the shuffle pool so that other threads can grab it.
   */
  private def releaseMemoryForThisThread(): Unit = {
    // The amount we requested does not include the initial memory tracking threshold,
    // so only the memory acquired beyond initialMemoryThreshold is returned to the pool
    shuffleMemoryManager.release(myMemoryThreshold - initialMemoryThreshold)
    myMemoryThreshold = initialMemoryThreshold
  }
How are the map output and the reduce input tied together? That is, how does the reduce side know where to fetch the map output from, regardless of whether that output sits on disk or in memory?
Through the MapOutputTracker? Is the MapOutputTracker a global map? Server statuses are looked up by shuffleId and reduceId; where are these two values kept?
val statuses = SparkEnv.get.mapOutputTracker.getServerStatuses(shuffleId, reduceId)
The MapOutputTracker class documentation:
/**
* Class that keeps track of the location of the map output of
* a stage. This is abstract because different versions of MapOutputTracker
* (driver and worker) use different HashMap to store its metadata.
*/
The documentation of MapOutputTracker.getServerStatuses:
/**
* Called from executors to get the server URIs and output sizes of the map outputs of
* a given shuffle.
*/
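On the reduce side, the statuses returned by getServerStatuses are turned into concrete block fetch requests, one per map output, following the usual shuffle block naming scheme shuffle_<shuffleId>_<mapId>_<reduceId>. A stand-alone sketch of that translation (the executor addresses and sizes below are hypothetical):

object ReduceFetchSketch {
  def main(args: Array[String]): Unit = {
    val shuffleId = 0
    val reduceId = 0
    // Hypothetical answer from getServerStatuses: for each map output, where it lives and how big this reducer's slice is.
    val statuses = Seq(("executor-1:50321", 535L), ("executor-2:50322", 410L))
    statuses.zipWithIndex.foreach { case ((location, size), mapId) =>
      println(s"fetch shuffle_${shuffleId}_${mapId}_${reduceId} ($size bytes) from $location")
    }
  }
}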
For reference, here is ShuffleMapTask.runTask in full, with inline notes:
  override def runTask(context: TaskContext): MapStatus = {
    // Deserialize the RDD using the broadcast variable.
    val ser = SparkEnv.get.closureSerializer.newInstance()
    // rdd is the last MappedRDD of Stage 0; the function it carries is the (_, 1) from
    // val rdd2 = rdd1.map((_, 1)), i.e. turning each word into a (word, 1) pair.
    // dep is the ShuffleDependency.
    // Deserializing taskBinary.value yields the (rdd, dep) tuple.
    val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
      ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
    metrics = Some(context.taskMetrics)
    var writer: ShuffleWriter[Any, Any] = null
    try {
      // The default SortShuffleManager
      val manager = SparkEnv.get.shuffleManager
      // Obtain a SortShuffleWriter; the writer holds the IndexShuffleBlockManager instance
      writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
      // partition is a HadoopPartition object holding the InputSplit and the partition index (which split it is).
      // rdd.iterator calls the RDD's compute method to produce the data by iterating,
      // which in turn drives the parent RDDs of the pipeline.
      writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
      return writer.stop(success = true).get
    } catch {
      case e: Exception =>
        try {
          if (writer != null) {
            writer.stop(success = false)
          }
        } catch {
          case e: Exception =>
            log.debug("Could not stop writer", e)
        }
        throw e
    }
  }
(At this point the original post shows the values of rdd and dep, and the string corresponding to the taskBinary.value bytes.)
The SortShuffleManager.getWriter method:
  /** Get a writer for a given partition. Called on executors by map tasks. */
  override def getWriter[K, V](handle: ShuffleHandle, mapId: Int, context: TaskContext)
      : ShuffleWriter[K, V] = {
    val baseShuffleHandle = handle.asInstanceOf[BaseShuffleHandle[K, V, _]]
    shuffleMapNumber.putIfAbsent(baseShuffleHandle.shuffleId, baseShuffleHandle.numMaps)
    new SortShuffleWriter(
      // shuffleBlockManager is a method call that returns the IndexShuffleBlockManager
      shuffleBlockManager, baseShuffleHandle, mapId, context)
  }
The rdd.iterator method:
/**
* Internal method to this RDD; will read from cache if applicable, or otherwise compute it.
* This should ''not'' be called by users directly, but is available for implementors of custom
* subclasses of RDD.
*/
final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
if (storageLevel != StorageLevel.NONE) {
SparkEnv.get.cacheManager.getOrCompute(this, split, context, storageLevel)
} else {
computeOrReadCheckpoint(split, context)
}
}