too many blocks in cooperative launch at cudaLaunchCooperativeKernel

2024/9/19 4:51:03 / Tags: Artificial Intelligence

Calling cudaLaunchCooperativeKernel produces the following error:

cudaErrorCooperativeLaunchTooLarge (error 82) due to “too many blocks in cooperative launch” on CUDA API call to cudaLaunchCooperativeKernel.

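Question:

When using cudaLaunchCooperativeKernel, what limits the maximum grid_dim and block_dim?

Key parameters of the A100:
[Image: table of A100 key parameters]
As the table above shows, three factors determine the maximum grid dim and block dim of a cooperative launch: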

maximum number of resident blocks per SM
maximum number of resident warps per SM
maximum number of resident threads per SM

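For the A100, a cooperative launch is in theory subject to the following limits (ignoring registers and other resources for now):

blocks cannot exceed 108 * 32 = 3456
warps cannot exceed 108 * 64 = 6912
threads cannot exceed 108 * 2048 = 221184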
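The three products above can be reproduced with a few lines of host code. This is only a sketch of the arithmetic: the struct and function names are hypothetical, and the per-SM values are the A100 figures quoted in the text, not values read from a device.

```cpp
// Per-SM residency limits, using the A100 values quoted above.
struct SmLimits {
    int smCount;          // number of SMs on the device
    int maxBlocksPerSm;   // maximum resident blocks per SM
    int maxWarpsPerSm;    // maximum resident warps per SM
    int maxThreadsPerSm;  // maximum resident threads per SM
};

// Device-wide ceilings for a launch whose blocks must all be resident at once.
struct DeviceLimits {
    long blocks;
    long warps;
    long threads;
};

// Hypothetical helper: multiply each per-SM limit by the SM count.
inline DeviceLimits deviceWideLimits(const SmLimits& sm) {
    return DeviceLimits{
        static_cast<long>(sm.smCount) * sm.maxBlocksPerSm,
        static_cast<long>(sm.smCount) * sm.maxWarpsPerSm,
        static_cast<long>(sm.smCount) * sm.maxThreadsPerSm,
    };
}

// A100 figures from the text: 108 SMs, 32 blocks, 64 warps, 2048 threads per SM.
const SmLimits kA100 = {108, 32, 64, 2048};
```

Calling `deviceWideLimits(kA100)` gives the three ceilings listed above. These are upper bounds only; a given block size may fit fewer blocks per SM than the block limit alone suggests.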
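These conditions give the table below; in theory, every configuration in it should launch cooperatively without error.
[Image: table of theoretical and measured maximum block dim for each grid dim]
(From grid dim 256 onward, red marks measured values.)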

The problem: when an empty kernel is launched cooperatively with grid dim 256 or larger, using the theoretical maximum block dim (green) fails with:

“too many blocks in cooperative launch” on CUDA API call to cudaLaunchCooperativeKernel.

The question is as follows:

I understand I’m using too many ‘active blocks’ and have no argument with that.

What I don’t understand is how to do the math to know how many blocks and threads I can call beforehand.

To find the answer, look at how the kernel's blocks are distributed across the SMs:

Each value is the number of resident threads on an SM. In every row, sm0 to sm79 alternate between 3 blocks (even SMs) and 2 blocks (odd SMs), and sm80 to sm107 hold 2 blocks each.

| grid_dim | block_dim | sm0, sm2, ..., sm78 | sm1, sm3, ..., sm79 | sm80 ... sm107 |
|---|---|---|---|---|
| 256 | 1 | 3 | 2 | 2 |
| 256 | 64 | 192 | 128 | 128 |
| 256 | 96 | 288 | 192 | 192 |
| 256 | 128 | 384 | 256 | 256 |
| 256 | 160 | 480 | 320 | 320 |
| 256 | 192 | 576 | 384 | 384 |
| 256 | 224 | 672 | 448 | 448 |
| 256 | 256 | 768 | 512 | 512 |
| 256 | 288 | 864 | 576 | 576 |
| 256 | 320 | 960 | 640 | 640 |
| 256 | 352 | 1056 | 704 | 704 |
| 256 | 384 | 1152 | 768 | 768 |
| 256 | 416 | 1248 | 832 | 832 |
| 256 | 448 | 1344 | 896 | 896 |
| 256 | 480 | 1440 | 960 | 960 |
| 256 | 512 | 1536 | 1024 | 1024 |
| 256 | 544 | 1632 | 1088 | 1088 |
| 256 | 576 | 1728 | 1152 | 1152 |
| 256 | 608 | 1824 | 1216 | 1216 |
| 256 | 640 | 1920 | 1280 | 1280 |
| 256 | 672 | 2016 | 1344 | 1344 |

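Look at the last row: block dim = 672, i.e. 672 threads per block. The threads of one block cannot span SMs, so each SM can hold 3 blocks, or 672 * 3 = 2016 threads, which is within the 2048-thread limit. All 256 blocks (fewer than 108 * 3 = 324) can therefore be resident.

The launch fails at 256 * 704:
[Image]
Why does it fail here?

With grid dim fixed at 256 and block dim = 704, each SM can hold at most 2 blocks, so at most 108 * 2 = 216 blocks can be launched, and the remaining 40 blocks cannot be made resident. Why?

With block dim = 704, two blocks already put 704 * 2 = 1408 threads on each SM, leaving room for only 2048 - 1408 = 640 more. Each block needs 704 threads, so a third block does not fit and the launch is rejected with the "too many blocks in cooperative launch" error.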
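The reasoning for the 672 and 704 cases can be written as a small host-side calculation. This is a sketch under the same assumptions as the text (A100 per-SM limits, registers and shared memory ignored); the function names are hypothetical, and a real kernel may fit fewer blocks than this estimate.

```cpp
#include <algorithm>

// A100 figures used in the text.
const int kSmCount = 108;
const int kMaxBlocksPerSm = 32;
const int kMaxWarpsPerSm = 64;
const int kMaxThreadsPerSm = 2048;
const int kWarpSize = 32;

// Blocks of `threadsPerBlock` threads that fit on one SM,
// considering only the block, warp, and thread residency limits.
inline int blocksPerSm(int threadsPerBlock) {
    int warpsPerBlock = (threadsPerBlock + kWarpSize - 1) / kWarpSize;
    int byWarps = kMaxWarpsPerSm / warpsPerBlock;
    int byThreads = kMaxThreadsPerSm / threadsPerBlock;
    return std::min(kMaxBlocksPerSm, std::min(byWarps, byThreads));
}

// Largest grid that can be fully resident, which is what a cooperative launch requires.
inline int maxCooperativeBlocks(int threadsPerBlock) {
    return blocksPerSm(threadsPerBlock) * kSmCount;
}

// True if every block of the grid can be resident at the same time.
inline bool fitsCooperativeLaunch(int gridDim, int threadsPerBlock) {
    return gridDim <= maxCooperativeBlocks(threadsPerBlock);
}
```

With these limits, `maxCooperativeBlocks(672)` is 324 and `maxCooperativeBlocks(704)` is 216, so a grid of 256 blocks fits in the first case and not in the second.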

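According to the CUDA documentation, the number of blocks that fit on each SM can be queried with the function cudaOccupancyMaxActiveBlocksPerMultiprocessor():
[Image: excerpt from the CUDA documentation]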
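A minimal sketch of that query is shown here, assuming an empty kernel with no parameters and no dynamic shared memory. The kernel name `emptyKernel` and the chosen sizes are illustrative, not taken from the original test program. The occupancy result accounts for registers and shared memory as well, so it can be lower than the thread-count estimate.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical empty kernel standing in for the kernel under test.
__global__ void emptyKernel() {}

int main() {
    const int device = 0;
    cudaDeviceProp prop{};
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return 1;
    if (!prop.cooperativeLaunch) {
        std::printf("device does not support cooperative launch\n");
        return 1;
    }

    const int threadsPerBlock = 704;  // block dim from the failing case
    const int requestedBlocks = 256;  // grid dim from the failing case
    const size_t dynamicSmem = 0;     // no dynamic shared memory

    // Ask the runtime how many blocks of this kernel fit on one SM.
    int blocksPerSm = 0;
    cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSm, emptyKernel, threadsPerBlock, dynamicSmem);
    if (err != cudaSuccess) {
        std::printf("occupancy query failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // A cooperative grid must be fully resident, so this is its upper bound.
    const int maxBlocks = blocksPerSm * prop.multiProcessorCount;
    std::printf("blocks per SM: %d, max cooperative grid: %d\n",
                blocksPerSm, maxBlocks);
    if (maxBlocks == 0) return 1;

    // Clamp the requested grid so the launch is not rejected.
    const int gridBlocks = requestedBlocks < maxBlocks ? requestedBlocks : maxBlocks;
    err = cudaLaunchCooperativeKernel((void*)emptyKernel, dim3(gridBlocks),
                                      dim3(threadsPerBlock),
                                      nullptr,  // the kernel takes no parameters
                                      dynamicSmem, 0);
    if (err != cudaSuccess) {
        std::printf("cooperative launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return cudaDeviceSynchronize() == cudaSuccess ? 0 : 1;
}
```

Clamping the grid is only one option; a kernel written for cooperative launch usually needs a grid-stride loop so that a smaller grid still covers all the work.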


http://www.ppmy.cn/news/1520835.html
