一、上下文
《Spark-Task启动流程》中讲到如果一个Task是一个ShuffleMapTask,那么最后在调用ShuffleWriter写入磁盘后还会判断是否可以启用push-based shuffle机制,下面我们就来继续看看push-based shuffle机制背后都做了什么
二、push-based shuffle机制开启条件
1、spark.shuffle.push.enabled 设置为true 默认为 false (设置为true可在客户端启用基于推送的shuffle,这与服务器端标志spark.shuffle.push.server.mergedShuffleFileManagerImpl协同工作,该标志需要使用相应的org.apache.spark.network.shuffle进行设置。启用基于推送的shuffle的MergedShuffleFileManager实现)
2、提交应用程序以在YARN模式下运行
3、已启用外部洗牌服务
4、IO加密已禁用
5、序列化器(如KryoSerialer)支持重新定位序列化对象
6、RDD不能是Barrier的
说明:调用barrier()可以返回一个RDDBarrier,且会将该RDD所处的Stage也标记为barrier,在该Stage,Spark必须同时启动所有任务。如果任务失败,Spark将中止整个阶段并重新启动此阶段的所有任务,而不是只重新启动失败的任务。
class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag](...){private def canShuffleMergeBeEnabled(): Boolean = {val isPushShuffleEnabled = Utils.isPushBasedShuffleEnabled(rdd.sparkContext.getConf,// invoked at driverisDriver = true)if (isPushShuffleEnabled && rdd.isBarrier()) {logWarning("Push-based shuffle is currently not supported for barrier stages")}isPushShuffleEnabled &&// TODO: SPARK-35547: Push based shuffle is currently unsupported for Barrier stages!rdd.isBarrier()}}
三、推送ShuffleWriter的结果
1、ShuffleWriteProcessor
def write(rdd: RDD[_],dep: ShuffleDependency[_, _, _],mapId: Long,context: TaskContext,partition: Partition): MapStatus = {var writer: ShuffleWriter[Any, Any] = nulltry {//SparkEnv从获取ShuffleManagerval manager = SparkEnv.get.shuffleManager//从ShuffleManager获取ShuffleWriterwriter = manager.getWriter[Any, Any](dep.shuffleHandle,mapId,context,createMetricsReporter(context))//用ShuffleWriter将该Stage结果写入磁盘writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])val mapStatus = writer.stop(success = true)if (mapStatus.isDefined) {if (dep.shuffleMergeEnabled && dep.getMergerLocs.nonEmpty && !dep.shuffleMergeFinalized) {manager.shuffleBlockResolver match {case resolver: IndexShuffleBlockResolver =>//获取该Stage结果信息val dataFile = resolver.getDataFile(dep.shuffleId, mapId)//推送new ShuffleBlockPusher(SparkEnv.get.conf).initiateBlockPush(dataFile, writer.getPartitionLengths(), dep, partition.index)case _ =>}}}mapStatus.get} catch {......}}
2、ShuffleBlockPusher
用于在启用push-based shuffle时将混洗块推送到远程混洗服务。它是在ShuffleWriter完成结果文件的写入后创建。
private[spark] class ShuffleBlockPusher(conf: SparkConf) extends Logging {//spark.shuffle.push.maxBlockSizeToPush 默认 1M//推送到远程外部shuffle服务的单个块的最大大小。大于此阈值的块不会被推送到远程合并。这些shuffle块将由executors以原始方式获取。private[this] val maxBlockSizeToPush = conf.get(SHUFFLE_MAX_BLOCK_SIZE_TO_PUSH)//spark.shuffle.push.maxBlockBatchSize 默认 3M//要分组到单个推送请求中的一批shuffle块的最大大小//默认值为3m,因为它大于2m(TransportConf#memoryMapBytes的默认值)。如果这也默认为2m,则很可能每批块都将通过内存映射加载到内存中,这对于MB大小的小数据块具有更高的开销。private[this] val maxBlockBatchSize = conf.get(SHUFFLE_MAX_BLOCK_BATCH_SIZE_FOR_PUSH)//每个reduce端任务同时获取的map端输出的最大大小。由于每个输出都需要我们创建一个缓冲区来接收它,这表示每个reduce端任务的固定内存开销,因此除非您有大量内存,否则请保持较小的内存开销private[this] val maxBytesInFlight =conf.getSizeAsMb("spark.reducer.maxSizeInFlight", "48m") * 1024 * 1024//此配置限制了在任何给定点获取块的远程请求数量。当集群中的主机数量增加时,可能会导致与一个或多个节点的大量入站连接,导致工作进程在负载下失败。通过允许它限制获取请求的数量,可以缓解这种情况//默认为 Int.MaxValue 即 2147483647private[this] val maxReqsInFlight = conf.getInt("spark.reducer.maxReqsInFlight", Int.MaxValue)//spark.reducer.maxBlocksInFlightPerAddress 默认为 Int.MaxValue 即 2147483647//此配置限制了每个reduce任务从给定主机端口获取的远程块的数量。当在单次或同时从给定地址请求大量块时,这可能会使 executor 或 Node Manager 崩溃。这对于在启用外部shuffle时减少节点管理器的负载特别有用。您可以通过将其设置为较低的值来缓解这个问题。private[this] val maxBlocksInFlightPerAddress = conf.get(REDUCER_MAX_BLOCKS_IN_FLIGHT_PER_ADDRESS)private[this] var bytesInFlight = 0Lprivate[this] var reqsInFlight = 0private[this] val numBlocksInFlightPerAddress = new HashMap[BlockManagerId, Int]()private[this] val deferredPushRequests = new HashMap[BlockManagerId, Queue[PushRequest]]()//推送请求队列private[this] val pushRequests = new Queue[PushRequest]private[this] val errorHandler = createErrorHandler()// VisibleForTestingprivate[shuffle] val unreachableBlockMgrs = new HashSet[BlockManagerId]()//......//初始化块推送private[shuffle] def initiateBlockPush(dataFile: File,//map端生成的Shuffle数据文件partitionLengths: Array[Long], //Shuffle块大小数组,这样我们就可以分辨Shuffle块了dep: ShuffleDependency[_, _, _], //用于获取shuffle ID和远程shuffle服务的位置,以推送本地shuffle块mapIndex: Int): Unit = { //shuffle map 任务的索引val numPartitions = dep.partitioner.numPartitionsval transportConf = SparkTransportConf.fromSparkConf(conf, "shuffle")val requests = prepareBlockPushRequests(numPartitions, mapIndex, dep.shuffleId,dep.shuffleMergeId, dataFile, partitionLengths, dep.getMergerLocs, transportConf)// 随机化PushRequest的顺序//如果map端有排序,那么每个分区的相同key所在的大致范围是一样的,就会造成同意时间都向下游同一个分区或者同一台节点推送数据,因此需要乱序下,这样更有利于并行推送pushRequests ++= Utils.randomize(requests)submitTask(() => {tryPushUpToMax()})}private[shuffle] def tryPushUpToMax(): Unit = {try {pushUpToMax()} catch {......}//由于多个块推送线程可能会为同一个映射器调用pushUpToMax,//因此我们同步对此方法的访问,以便只有一个线程可以为给定的映射器推送块。//这有助于简化对共享状态的访问。这样做的缺点是,如果所有线程都被来自同一映射器的块推送占用,我们可能会不必要地阻止其他映射器的区块推送。private def pushUpToMax(): Unit = synchronized {// 如果可能的话,处理任何未完成的延迟推送请求if (deferredPushRequests.nonEmpty) {for ((remoteAddress, defReqQueue) <- deferredPushRequests) {while (isRemoteBlockPushable(defReqQueue) &&!isRemoteAddressMaxedOut(remoteAddress, defReqQueue.front)) {val request = defReqQueue.dequeue()logDebug(s"Processing deferred push request for $remoteAddress with "+ s"${request.blocks.length} blocks")sendRequest(request)if (defReqQueue.isEmpty) {deferredPushRequests -= remoteAddress}}}}//如果可能的话,处理任何常规推送请求。while (isRemoteBlockPushable(pushRequests)) {//从队列中取出一个请求val request = pushRequests.dequeue()val remoteAddress = request.address//reduce 端也有接收块的大小限制,如果超过了就不用给对方发送了 默认为 Int.MaxValue 即 2147483647if (isRemoteAddressMaxedOut(remoteAddress, request)) {logDebug(s"Deferring push request for $remoteAddress with ${request.blocks.size} blocks")deferredPushRequests.getOrElseUpdate(remoteAddress, new Queue[PushRequest]()).enqueue(request)} else {sendRequest(request)}}def isRemoteBlockPushable(pushReqQueue: Queue[PushRequest]): Boolean = {pushReqQueue.nonEmpty &&(bytesInFlight == 0 ||(reqsInFlight + 1 <= maxReqsInFlight &&bytesInFlight + pushReqQueue.front.size <= maxBytesInFlight))}// 检查发送新的推送请求是否会超过推送到给定远程地址的最大块数。def isRemoteAddressMaxedOut(remoteAddress: BlockManagerId, request: PushRequest): Boolean = {(numBlocksInFlightPerAddress.getOrElse(remoteAddress, 0)+ request.blocks.size) > maxBlocksInFlightPerAddress}}//将块推送到远程shuffle服务器。一旦当前批中的某个块传输完成,回调监听器将再次调用#pushUpToMax来触发推送下一批块。这样,我们将映射任务与块推送过程解耦,因为它是负责大部分块推送的网状客户端线程,而不是任务执行线程。private def sendRequest(request: PushRequest): Unit = {bytesInFlight += request.sizereqsInFlight += 1numBlocksInFlightPerAddress(request.address) = numBlocksInFlightPerAddress.getOrElseUpdate(request.address, 0) + request.blocks.lengthval sizeMap = request.blocks.map { case (blockId, size) => (blockId.toString, size) }.toMapval address = request.addressval blockIds = request.blocks.map(_._1.toString)val remainingBlocks = new HashSet[String]() ++= blockIds//块推送监听器val blockPushListener = new BlockPushingListener {//启动连接并将块推送到远程shuffle服务始终由块推送线程处理。//我们不应该在netty事件循环调用的blockPushListener回调中启动连接创建,//因为:1。TransportClient.createConnection(…)块用于建立连接,建议避免在事件循环中进行任何阻塞操作;// 2.实际的连接创建是一个添加到另一个事件循环的任务队列中的任务,该事件循环最终可能会相互阻塞。一旦blockPushListener收到块推送成功或失败的通知,我们只需将其委托给块推送线程。def handleResult(result: PushResult): Unit = {submitTask(() => {if (updateStateAndCheckIfPushMore(sizeMap(result.blockId), address, remainingBlocks, result)) {//再次进行推送tryPushUpToMax()}})}override def onBlockPushSuccess(blockId: String, data: ManagedBuffer): Unit = {logTrace(s"Push for block $blockId to $address successful.")handleResult(PushResult(blockId, null))}override def onBlockPushFailure(blockId: String, exception: Throwable): Unit = {// check the message or it's cause to see it needs to be logged.if (!errorHandler.shouldLogError(exception)) {logTrace(s"Pushing block $blockId to $address failed.", exception)} else {logWarning(s"Pushing block $blockId to $address failed.", exception)}handleResult(PushResult(blockId, exception))}}//除了随机化推送请求的顺序外,还进一步随机化推送申请中块的顺序,以进一步降低推送块在服务器端发生洗牌冲突的可能性。这不会增加在执行器端读取未合并的shuffle文件的成本,因为我们仍然在读取MB大小的块,并且只在读取后对内存中的切片缓冲区进行随机化。//一个请求包括多个块,请求随机化,块也随机化 然后发送请求val (blockPushIds, blockPushBuffers) = Utils.randomize(blockIds.zip(sliceReqBufferIntoBlockBuffers(request.reqBuffer, request.blocks.map(_._2)))).unzip//从SparkEnv获取blockManager,blockManager调用blockStoreClient来进行传输//下面我们就详细看下它是如何把块推送到reduce端的SparkEnv.get.blockManager.blockStoreClient.pushBlocks(address.host, address.port, blockPushIds.toArray,blockPushBuffers.toArray, blockPushListener)}//触发推送protected def submitTask(task: Runnable): Unit = {if (BLOCK_PUSHER_POOL != null) {BLOCK_PUSHER_POOL.execute(task)}}private val BLOCK_PUSHER_POOL: ExecutorService = {val conf = SparkEnv.get.confif (Utils.isPushBasedShuffleEnabled(conf,isDriver = SparkContext.DRIVER_IDENTIFIER == SparkEnv.get.executorId)) {//spark.shuffle.push.numPushThreads 默认值 spark-submit中给executor分批的内核数量//指定推块池中的线程数。这些线程有助于创建连接并将块推送到远程外部shuffle服务。默认情况下,线程池大小等于Spark executor 内核的数量。val numThreads = conf.get(SHUFFLE_NUM_PUSH_THREADS).getOrElse(conf.getInt(SparkLauncher.EXECUTOR_CORES, 1))ThreadUtils.newDaemonFixedThreadPool(numThreads, "shuffle-block-push-thread")} else {null}}//将当前map端的shuffle数据文件转换为PushRequest列表。//基本上,shuffle 文件中的连续块被分组到单个请求中,以允许更有效地读取块数据。//给定shuffle的每个map端将收到与目标位置相同的BlockManagerId列表,以将块推送到目标位置。//同一shuffle中的所有map端将以一致的方式将shuffle分区范围映射到各个目标位置,以确保每个目标位置接收属于同一组分区范围的shuffle块。//0长度的块和足够大的块将被跳过。private[shuffle] def prepareBlockPushRequests(numPartitions: Int,partitionId: Int,shuffleId: Int,shuffleMergeId: Int,dataFile: File,partitionLengths: Array[Long],mergerLocs: Seq[BlockManagerId],transportConf: TransportConf): Seq[PushRequest] = {var offset = 0Lvar currentReqSize = 0var currentReqOffset = 0Lvar currentMergerId = 0val numMergers = mergerLocs.length//推送请求数组val requests = new ArrayBuffer[PushRequest]var blocks = new ArrayBuffer[(BlockId, Int)]for (reduceId <- 0 until numPartitions) {val blockSize = partitionLengths(reduceId)logDebug(s"Block ${ShufflePushBlockId(shuffleId, shuffleMergeId, partitionId,reduceId)} is of size $blockSize")//跳过0长度的块和足够大的块if (blockSize > 0) {val mergerId = math.min(math.floor(reduceId * 1.0 / numPartitions * numMergers),numMergers - 1).asInstanceOf[Int]//如果当前请求超出最大批处理大小,// 或者当前请求中的块数超出每个目标的限制,// 或者下一个块推送位置用于不同的洗牌服务,// 或者下个块超过推送的最大块大小限制,//则启动新的PushRequest。这保证了每个PushRequest代表洗牌文件中要推送到同一洗牌服务的连续块,并且不会超出现有的限制。if (currentReqSize + blockSize <= maxBlockBatchSize&& blocks.size < maxBlocksInFlightPerAddress&& mergerId == currentMergerId && blockSize <= maxBlockSizeToPush) {// 将当前块添加到当前批次currentReqSize += blockSize.toInt} else {if (blocks.nonEmpty) {// 将上一批转换为PushRequestrequests += PushRequest(mergerLocs(currentMergerId), blocks.toSeq,createRequestBuffer(transportConf, dataFile, currentReqOffset, currentReqSize))blocks = new ArrayBuffer[(BlockId, Int)]}//开始一个新批次currentReqSize = 0// 将currentReqOffset设置为-1,以便我们能够区分currentReqOffset的初始值和何时开始新批处理currentReqOffset = -1currentMergerId = mergerId}// 仅在大小符合的情况下进行推送//如果一个分区的大小超过 1M 那就不能推送了,只能用传统的Shuffle方式拉取//其实这个blockSize 应该不会太大,除非有数据倾斜 ,因为这是一个分区向下游某一个分区推送的数据大小if (blockSize <= maxBlockSizeToPush) {val blockSizeInt = blockSize.toIntblocks += ((ShufflePushBlockId(shuffleId, shuffleMergeId, partitionId,reduceId), blockSizeInt))// 仅当当前块是请求中的第一个块时才更新currentReqOffsetif (currentReqOffset == -1) {currentReqOffset = offset}if (currentReqSize == 0) {currentReqSize += blockSizeInt}}}offset += blockSize}// 添加最终请求if (blocks.nonEmpty) {requests += PushRequest(mergerLocs(currentMergerId), blocks.toSeq,createRequestBuffer(transportConf, dataFile, currentReqOffset, currentReqSize))}requests.toSeq}}
3、ExternalBlockStoreClient
客户端,用于读取指向外部(executor外部)服务器的RDD块和shuffle块。
以尽最大努力的方式异步地将一系列Shuffle块推送到远程节点。这些Shuffle块以及其他客户端推送的块将被合并到目标节点上的每个Shuffle分区合并的Shuffle文件中。
public class ExternalBlockStoreClient extends BlockStoreClient {public void pushBlocks(String host,int port,String[] blockIds,ManagedBuffer[] buffers,BlockPushingListener listener) {checkInit();//如果块大小和buffer大小不匹配就停止assert blockIds.length == buffers.length : "Number of block ids and buffers do not match.";//将块和buffer进行对应 放入map中Map<String, ManagedBuffer> buffersWithId = new HashMap<>();for (int i = 0; i < blockIds.length; i++) {buffersWithId.put(blockIds[i], buffers[i]);}//日志打印:推送多大的 shuffle 块 都某一台节点logger.debug("Push {} shuffle blocks to {}:{}", blockIds.length, host, port);try {RetryingBlockTransferor.BlockTransferStarter blockPushStarter =(inputBlockId, inputListener) -> {if (clientFactory != null) {assert inputListener instanceof BlockPushingListener :"Expecting a BlockPushingListener, but got " + inputListener.getClass();//创建一个连接到给定远程主机/端口的 TransportClient//我们维护一个客户端数组(大小由spark.shuffle.io.numConnectionsPerPeer决定),并随机选择一个使用。如果在随机选择的位置之前没有创建客户端,则此函数会创建一个新客户端并将其放置在那里。//因为推送shuffle会对多个节点发送多次请求,因此将创建好的TransportClient 放入数组,如果下次遇到同一个远程目标主机,就不用再创建了,//fastFail 默认值为 false 如果fastFail参数为真,则在快速失败时间窗口内(io等待重试超时的95%)对同一地址的最后一次尝试失败时立即失败。假设调用者将处理重试。//在创建新的TransportClient之前,我们将执行在此工厂注册的所有TransportClientBootstrap 这会一直阻止,直到成功建立并完全引导连接。//TransportClient 其实是一个netty 客户端,且会再pipeline中设置一个TransportChannelHandlerTransportClient client = clientFactory.createClient(host, port);//构建OneForOneBlockPusher 进行推送new OneForOneBlockPusher(client, appId, comparableAppAttemptId, inputBlockId,(BlockPushingListener) inputListener, buffersWithId).start();} else {logger.info("This clientFactory was closed. Skipping further block push retries.");}};int maxRetries = transportConf.maxIORetries();if (maxRetries > 0) {new RetryingBlockTransferor(transportConf, blockPushStarter, blockIds, listener, PUSH_ERROR_HANDLER).start();} else {blockPushStarter.createAndStart(blockIds, listener);}} catch (Exception e) {logger.error("Exception while beginning pushBlocks", e);for (String blockId : blockIds) {listener.onBlockPushFailure(blockId, e);}}}}
4、OneForOneBlockPusher
用于将块推送到要合并的远程shuffle服务,与之对应的类是OneForOneBlockFetcher:用于从远程shuffles服务中拉取块
public class OneForOneBlockPusher {//开始推块过程,每次推块都调用监听器public void start() {logger.debug("Start pushing {} blocks", blockIds.length);for (int i = 0; i < blockIds.length; i++) {assert buffers.containsKey(blockIds[i]) : "Could not find the block buffer for block "+ blockIds[i];String[] blockIdParts = blockIds[i].split("_");if (blockIdParts.length != 5 || !blockIdParts[0].equals(SHUFFLE_PUSH_BLOCK_PREFIX)) {throw new IllegalArgumentException("Unexpected shuffle push block id format: " + blockIds[i]);}//构建消息头:appId 重试id 块信息等ByteBuffer header =new PushBlockStream(appId, appAttemptId, Integer.parseInt(blockIdParts[1]),Integer.parseInt(blockIdParts[2]), Integer.parseInt(blockIdParts[3]),Integer.parseInt(blockIdParts[4]), i).toByteBuffer();//调用netty的客户端开始传输数据client.uploadStream(new NioManagedBuffer(header), buffers.get(blockIds[i]),new BlockPushCallback(i, blockIds[i]));}}}
5、TransportClient
客户端,用于获取预先协商的流的连续块。此API旨在实现大量数据的高效传输,这些数据被分解为大小从数百KB到几MB不等的块。
请注意,虽然此客户端处理从流(即数据平面)中提取块,但流的实际设置是在传输层范围之外完成的。提供方便的方法“sendRPC”来实现客户端和服务器之间的控制平面通信,以执行此设置。
例如,一个典型的工作流程可能是:
client.sendRPC(新的OpenFile(“/foo”))--> 返回StreamId=100
client.fetchChunk(streamId = 100, chunkIndex = 0, callback)
client.fetchChunk(streamId = 100, chunkIndex = 1, callback)
......
client.sendRPC(new CloseStream(100))
使用TransportClientFactory构造TransportClient的实例。单个TransportClient可用于多个流,但任何给定的流都必须限制在单个客户端,以避免乱序响应。
注意:此类用于向服务器发出请求,而TransportResponseHandler负责处理来自服务器的响应。
并发:线程安全,可以从多个线程调用。
public class TransportClient implements Closeable {//以流的形式将数据发送到远程端public long uploadStream(ManagedBuffer meta,ManagedBuffer data,RpcResponseCallback callback) {if (logger.isTraceEnabled()) {logger.trace("Sending RPC to {}", getRemoteAddress(channel));}long requestId = requestId();handler.addRpcRequest(requestId, callback);RpcChannelListener listener = new RpcChannelListener(requestId, callback);//UploadStream是一种一种RPC,其数据在帧外发送,因此可以作为流读取。//利用netty传输数据channel.writeAndFlush(new UploadStream(requestId, meta, data)).addListener(listener);return requestId;}}
四、总结
1、ShuffleMapTask中的ShuffleWriter将结果写入磁盘完毕
2、判断当前环境是否支持push-based shuffle(假定支持)
3、获取该Task中的Shuffle结果文件
4、构建并初始化ShuffleBlockPusher(单块最大限制、单次推送请求数据大小限制、对端一次性可接收的数据大小限制等等)
5、按照分区将数据块组装成PushRequest放入队列中,并将其随机打散(如果有的分区过大会造成不会推送的情况,此时就需要下一个Stage计算时过来拉取)
6、准备推送
7、推送前检查对端是否达到接收限制,并将这次PushRequest中的块进行打散
8、从SparkEnv获取BlockManager,BlockManager调用BlockStoreClient来进行传输
9、为该Task结果数据维护一个Map(ConcurrentHashMap<SocketAddress, ClientPool> connectionPool)如果没有远端的Socket对应的Netty客户端就新建,如果有就直接获取
10、构建一个OneForOneBlockPusher开始推送数据流
11、最终调用Netty客户端的channel.writeAndFlush()将数据流推送到目标主机
12、如果监听器收到推送成功的消息将再次调用pushUpToMax来触发推送下一批块