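Paper: KernelBench: Can LLMs Write Efficient GPU Kernels?

I saw online that LLMs can now be used to write CUDA kernels. Impressive!

As a compiler engineer, I really want to know how this is done: how the prompt is written and what the workflow looks like.

So I downloaded the paper to study it, and when I get the chance I will try it myself to see how well it works.

Judging from the paper, though, the results are actually not very good. The model architectures tested are also fairly simple overall, so it feels a bit like a toy.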

The paper

KernelBench: Can LLMs Write Efficient GPU Kernels?

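Efficient GPU kernels are crucial for building performant machine learning architectures, but writing them is a time-consuming challenge that requires significant expertise; therefore, we explore using language models (LMs) to automate kernel generation. We introduce KernelBench, an open-source framework for evaluating LMs' ability to write fast and correct kernels on a suite of 250 carefully selected PyTorch ML workloads. KernelBench represents a real-world engineering environment, and progress on the benchmark translates directly into faster practical kernels. We introduce a new evaluation metric, fast_p, which measures the fraction of generated kernels that are functionally correct and offer a speedup greater than a threshold p over the baseline. Our experiments across various state-of-the-art models and test-time methods show that frontier reasoning models perform best but still fall short overall, matching the PyTorch baseline in fewer than 20% of cases. While we show that results can improve by using execution and profiling feedback during iterative refinement, KernelBench remains a challenging benchmark, and it becomes harder as the speedup threshold p is raised.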
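To make the fast_p metric from the abstract concrete, here is a minimal sketch in plain Python. It follows the definition given above: the fraction of tasks whose generated kernel is both functionally correct and faster than the baseline by more than a factor p, that is, fast_p = (1/N) Σ 1(correct_i and speedup_i > p). The `KernelResult` record and its field names are my own assumptions for illustration, not part of the KernelBench code.

```python
from dataclasses import dataclass


@dataclass
class KernelResult:
    """Outcome of evaluating one generated kernel (hypothetical record type)."""
    correct: bool         # did the output match the PyTorch reference?
    baseline_time: float  # reference runtime, in any consistent unit
    kernel_time: float    # generated kernel runtime, same unit, assumed > 0


def fast_p(results: list, p: float) -> float:
    """Fraction of tasks whose kernel is correct AND has speedup > p."""
    if not results:
        return 0.0
    hits = sum(
        1
        for r in results
        if r.correct and (r.baseline_time / r.kernel_time) > p
    )
    return hits / len(results)


results = [
    KernelResult(correct=True, baseline_time=1.0, kernel_time=0.5),   # 2.0x speedup
    KernelResult(correct=True, baseline_time=1.0, kernel_time=1.25),  # 0.8x, slower
    KernelResult(correct=False, baseline_time=1.0, kernel_time=0.1),  # fast but wrong
    KernelResult(correct=True, baseline_time=1.0, kernel_time=0.8),   # 1.25x speedup
]

print(fast_p(results, p=0.0))  # 0.75: plain correctness rate
print(fast_p(results, p=1.0))  # 0.5: correct and faster than the baseline
print(fast_p(results, p=1.5))  # 0.25: correct and more than 1.5x faster
```

With p = 0 the metric reduces to the correctness rate, and raising p only ever lowers the score, which is why the benchmark gets harder as p grows.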

The paper's authors also wrote a blog post.

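Implementation steps

Task input: for a given AI workload, the input to the task is a reference implementation written in PyTorch. Mimicking an AI researcher's workflow, the PyTorch code contains a class named Model that inherits from torch.nn.Module, and its standard __init__ and forward() functions (plus any helper functions) are filled with the PyTorch operations of that workload.
Task output: given the input, the LLM must output a new class named ModelNew that also inherits from torch.nn.Module and contains custom optimizations. For example, the LLM can embed kernel calls in the forward() function using PyTorch's CUDA-C extension.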

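Judging from the paper, the overall implementation is fairly simple:

Give the AI a natural-language prompt asking it to translate the following PyTorch code into more efficient CUDA code, supply the code, and let the AI generate the result.
Evaluate the AI-generated result, testing both correctness and performance.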
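The evaluation step can be sketched as two checks: compare the candidate against the reference on several random inputs, then time both. The sketch below is an assumption-laden illustration in plain Python, using lists of floats and CPU wall-clock time so that it runs anywhere; it is not KernelBench's harness. All names (`check_correctness`, `mean_runtime`, `reference_add`, `candidate_add`) are hypothetical. For real GPU kernels, timing additionally requires synchronizing the device before reading the clock, because kernel launches are asynchronous.

```python
import math
import operator
import random
import time


def check_correctness(reference, candidate, make_inputs, trials=5, tol=1e-4):
    """Compare candidate against reference on several random inputs."""
    for _ in range(trials):
        inputs = make_inputs()
        expected = reference(*inputs)
        actual = candidate(*inputs)
        if len(expected) != len(actual):
            return False
        if not all(
            math.isclose(e, a, rel_tol=tol, abs_tol=tol)
            for e, a in zip(expected, actual)
        ):
            return False
    return True


def mean_runtime(fn, inputs, warmup=3, runs=20):
    """Average wall-clock seconds per call, after a few warm-up calls."""
    for _ in range(warmup):
        fn(*inputs)
    start = time.perf_counter()
    for _ in range(runs):
        fn(*inputs)
    return (time.perf_counter() - start) / runs


def reference_add(a, b):
    """Stand-in for the reference Model: element-wise addition."""
    return [x + y for x, y in zip(a, b)]


def candidate_add(a, b):
    """Stand-in for the generated ModelNew."""
    return list(map(operator.add, a, b))


def make_inputs():
    """Two random vectors of length 128, like get_inputs() in the benchmark."""
    return [[random.random() for _ in range(128)] for _ in range(2)]


is_correct = check_correctness(reference_add, candidate_add, make_inputs)
inputs = make_inputs()
speedup = mean_runtime(reference_add, inputs) / max(
    mean_runtime(candidate_add, inputs), 1e-12
)
print(f"correct={is_correct}, speedup={speedup:.2f}x")
```

A candidate only counts toward the speedup numbers if the correctness check passes first; a fast but wrong kernel is simply a failure.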

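What the LLM has to decide:
1. Which ops to optimize
2. How to optimize them
  1. Tiling
  2. Fusion
  3. Special tensor optimizations (for example, tensor core instructions)
  4. Which library to use
    1. CUDA
    2. CUTLASS
    3. Triton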
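To summarize the workflow in the paper:

Task input preparation: for a given AI workload, provide a reference implementation written in PyTorch as the task input. The code contains a class named Model that inherits from torch.nn.Module, whose __init__ and forward() functions (plus any helper functions) are filled with the PyTorch operations of the workload.
LLM processing: after receiving this input, the LLM must output a new class, ModelNew, that also inherits from torch.nn.Module and contains custom optimizations. In doing so, the LLM has to decide which operations in the Model class are most worth optimizing and how to optimize them. It may use hardware-efficiency techniques such as fusion and tiling, specialized instructions such as tensor cores, and any programming library such as PTX, CUDA, CUTLASS, Triton, or ThunderKittens.
Output evaluation: the ModelNew class produced by the LLM is evaluated for both performance and functional correctness, ideally programmatically. Profiling and execution information for the generated kernels is collected at the same time.
Feedback and iteration: the execution results and profiler feedback are given back to the LLM so that it can refine the kernel over several in-context iterations and improve its quality.
Task levels and extension: the 250 tasks in KernelBench are split into three levels according to the number of primitive operations or PyTorch library functions they contain: single operations, sequences of operations, and end-to-end architectures. The framework is also easy to extend to new workloads.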
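The feedback-and-iteration step in the workflow above can be sketched as a small loop: generate a candidate, evaluate it, then put the previous attempt and its feedback back into the prompt for the next turn. This is a minimal sketch under my own assumptions, not the authors' implementation; `EvalResult`, `refine`, and the two stub functions are hypothetical, and the stubs replace a real LLM call and a real compile-and-run step.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class EvalResult:
    """Hypothetical evaluation record for one generated candidate."""
    compiled: bool
    correct: bool
    speedup: float
    feedback: str  # compiler error, mismatch report, or profiler summary


def refine(
    generate: Callable[[str], str],
    evaluate: Callable[[str], EvalResult],
    base_prompt: str,
    max_turns: int = 3,
):
    """Return the best correct candidate found within max_turns."""
    best_src: Optional[str] = None
    best_res: Optional[EvalResult] = None
    prompt = base_prompt
    for _ in range(max_turns):
        src = generate(prompt)
        res = evaluate(src)
        # Keep a candidate only if it is correct and beats the current best.
        if res.correct and (best_res is None or res.speedup > best_res.speedup):
            best_src, best_res = src, res
        # Put the previous attempt and its feedback back in context.
        prompt = (
            f"{base_prompt}\n\nPrevious attempt:\n{src}\n\n"
            f"Evaluation feedback:\n{res.feedback}\n"
        )
    return best_src, best_res


# Stubs standing in for the LLM and for the compile/run/profile step.
_attempts = iter(["kernel_v1", "kernel_v2", "kernel_v3"])


def fake_generate(prompt: str) -> str:
    return next(_attempts)


def fake_evaluate(src: str) -> EvalResult:
    table = {
        "kernel_v1": EvalResult(False, False, 0.0, "compile error: missing semicolon"),
        "kernel_v2": EvalResult(True, True, 0.9, "correct, but slower than baseline"),
        "kernel_v3": EvalResult(True, True, 1.3, "correct"),
    }
    return table[src]


best_src, best_res = refine(fake_generate, fake_evaluate, "Optimize Model")
print(best_src, best_res.speedup)  # kernel_v3 1.3
```

The loop keeps the best correct candidate rather than the last one, since a later turn can regress.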

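The authors have open-sourced the code: KernelBench

The prompt

Tracing the code logic

The entry point is: generate_and_eval_single_sample_modal

The prompt is produced by the call custom_cuda_prompt = prompt_generate_custom_cuda_from_prompt_template(ref_arch_src).

It uses a prompt example (element-wise addition) for the prompt template.
In its most basic form, the example simply shows the LLM the task and the expected output format.

The function below appears to be the main body of the prompt construction. From it we can see how the prompt is put together.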

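PROBLEM_STATEMENT = """You write custom CUDA kernels to replace the pytorch operators in the given architecture to get speedups. \n
You have complete freedom to choose the set of operators you want to replace. You may make the decision to replace some operators with custom CUDA kernels and leave others unchanged. You may replace multiple operators with custom implementations, consider operator fusion opportunities (combining multiple operators into a single kernel, for example, combining matmul+relu), or algorithmic changes (such as online softmax). You are only limited by your imagination.\n
"""  # not shown in the original excerpt; text taken from the prompt template quoted in this post, exact whitespace is an assumption

PROBLEM_INSTRUCTION = """
Optimize the architecture named Model with custom CUDA operators! Name your optimized output architecture ModelNew. Output the new code in codeblocks. Please generate real code, NOT pseudocode, make sure the code compiles and is fully functional. Just output the new model code, no other text, and NO testing code! \n
"""


def prompt_generate_custom_cuda(
    arc_src: str, example_arch_src: str, example_new_arch_src: str
) -> str:
    # Start from the fixed task description.
    prompt = PROBLEM_STATEMENT

    # Optionally add one worked example: a reference model and its CUDA rewrite.
    if example_arch_src != "" and example_new_arch_src != "":
        prompt += f"""Here's an example to show you the syntax of inline embedding custom CUDA operators in torch: The example given architecture is: \n```\n{example_arch_src}```\nThe example new arch with custom CUDA kernels looks like this: ```{example_new_arch_src}```\n"""

    # Append the architecture to optimize, then the output-format instruction.
    prompt += f"""You are given the following architecture: \n```{arc_src}```"""
    prompt += PROBLEM_INSTRUCTION
    return prompt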

prompt

Reading the prompt out of the code is still rather abstract, so here is a summary of the prompt the authors use.

Prompt template

You write custom CUDA kernels to replace the pytorch operators in the given architecture to get speedups.

You have complete freedom to choose the set of operators you want to replace. You may make the decision to replace some operators with custom CUDA kernels and leave others unchanged. You may replace multiple operators with custom implementations, consider operator fusion opportunities (combining multiple operators into a single kernel, for example, combining matmul+relu), or algorithmic changes (such as online softmax). You are only limited by your imagination.

Here’s an example to show you the syntax of inline embedding custom CUDA operators in torch: The example given architecture is:

xxx (/workspaces/KernelBench/src/prompts/model_ex_1.py): the example source code goes here

The example new arch with custom CUDA kernels looks like this:

xxx (/workspaces/KernelBench/src/prompts/model_new_ex_1.py): the CUDA-optimized version of the example code goes here

You are given the following architecture: xxx (the real PyTorch code): the source code that the LLM should optimize goes here

Optimize the architecture named Model with custom CUDA operators! Name your optimized output architecture ModelNew. Output the new code in codeblocks. Please generate real code, NOT pseudocode, make sure the code compiles and is fully functional. Just output the new model code, no other text, and NO testing code!
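Because the final instruction asks the model to put its answer in code blocks, a harness has to pull the code out of the reply before it can compile anything. Below is a minimal sketch of that step, assuming the reply contains at least one fenced block and that the first one holds the ModelNew code. `extract_first_code_block` is a hypothetical helper written for this post, not a function from the KernelBench repository.

```python
import re
from typing import Optional

# Built from single backticks so this example contains no literal fence.
FENCE = "`" * 3

# Opening fence, optional language tag, newline, lazy body, closing fence.
_CODE_BLOCK = re.compile(
    FENCE + r"[ \t]*([A-Za-z0-9_+-]*)[ \t]*\n(.*?)" + FENCE,
    re.DOTALL,
)


def extract_first_code_block(response: str) -> Optional[str]:
    """Return the body of the first fenced block, or None if there is none."""
    match = _CODE_BLOCK.search(response)
    if match is None:
        return None
    return match.group(2).strip()


reply = (
    "Here is the optimized model:\n"
    + FENCE + "python\n"
    + "class ModelNew:\n    pass\n"
    + FENCE + "\nDone."
)

print(extract_first_code_block(reply))
print(extract_first_code_block("no code here"))  # None
```

Returning None instead of raising lets the caller treat a reply without code as a failed sample and move on.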

example

You write custom CUDA kernels to replace the pytorch operators in the given architecture to get speedups. You have complete freedom to choose the set of operators you want to replace. You may make the decision to replace some operators with custom CUDA kernels and leave others unchanged. You may replace multiple operators with custom implementations, consider operator fusion opportunities (combining multiple operators into a single kernel, for example, combining matmul+relu), or algorithmic changes (such as online softmax). You are only limited by your imagination.

Here's an example to show you the syntax of inline embedding custom CUDA operators in torch: The example given architecture is:

```
import torch
import torch.nn as nn
import torch.nn.functional as F


class Model(nn.Module):
    def __init__(self) -> None:
        super().__init__()

    def forward(self, a, b):
        return a + b


def get_inputs():
    # randomly generate input tensors based on the model architecture
    a = torch.randn(1, 128).cuda()
    b = torch.randn(1, 128).cuda()
    return [a, b]


def get_init_inputs():
    # randomly generate tensors required for initialization based on the model architecture
    return []
```

The example new arch with custom CUDA kernels looks like this:

```
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.cpp_extension import load_inline

# Define the custom CUDA kernel for element-wise addition
elementwise_add_source = """
#include <torch/extension.h>
#include <cuda_runtime.h>

__global__ void elementwise_add_kernel(const float* a, const float* b, float* out, int size) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size) {
        out[idx] = a[idx] + b[idx];
    }
}

torch::Tensor elementwise_add_cuda(torch::Tensor a, torch::Tensor b) {
    auto size = a.numel();
    auto out = torch::zeros_like(a);

    const int block_size = 256;
    const int num_blocks = (size + block_size - 1) / block_size;

    elementwise_add_kernel<<<num_blocks, block_size>>>(a.data_ptr<float>(), b.data_ptr<float>(), out.data_ptr<float>(), size);

    return out;
}
"""

elementwise_add_cpp_source = (
    "torch::Tensor elementwise_add_cuda(torch::Tensor a, torch::Tensor b);"
)

# Compile the inline CUDA code for element-wise addition
elementwise_add = load_inline(
    name="elementwise_add",
    cpp_sources=elementwise_add_cpp_source,
    cuda_sources=elementwise_add_source,
    functions=["elementwise_add_cuda"],
    verbose=True,
    extra_cflags=[""],
    extra_ldflags=[""],
)


class ModelNew(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.elementwise_add = elementwise_add

    def forward(self, a, b):
        return self.elementwise_add.elementwise_add_cuda(a, b)
```

You are given the following architecture:

```
import torch
import torch.nn as nn


class Model(nn.Module):
    """
    A model that performs a matrix multiplication, scaling, and residual addition.

    Args:
        in_features (int): Number of input features.
        out_features (int): Number of output features.
        scaling_factor (float): Scaling factor to apply after matrix multiplication.
    """

    def __init__(self, in_features, out_features, scaling_factor):
        super(Model, self).__init__()
        self.matmul = nn.Linear(in_features, out_features)
        self.scaling_factor = scaling_factor

    def forward(self, x):
        """
        Forward pass of the model.

        Args:
            x (torch.Tensor): Input tensor of shape (batch_size, in_features).

        Returns:
            torch.Tensor: Output tensor of shape (batch_size, out_features).
        """
        x = self.matmul(x)
        original_x = x.clone().detach()
        x = x * self.scaling_factor
        x = x + original_x
        return x


batch_size = 128
in_features = 64
out_features = 128
scaling_factor = 0.5


def get_inputs():
    return [torch.randn(batch_size, in_features)]


def get_init_inputs():
    return [in_features, out_features, scaling_factor]
```

Optimize the architecture named Model with custom CUDA operators! Name your optimized output architecture ModelNew. Output the new code in codeblocks. Please generate real code, NOT pseudocode, make sure the code compiles and is fully functional. Just output the new model code, no other text, and NO testing code!
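The target architecture in this prompt is a handy illustration of what "operator fusion" could mean here. After the linear layer, forward() clones the result, scales it, and adds the clone back, which is three element-wise passes over the same tensor. For the forward values, y * s + y equals y * (1 + s), so the three passes can collapse into one. The sketch below shows this in plain Python on a list of floats; it is my own illustration of the opportunity, not output from any model in the paper, and both function names are hypothetical. In a real CUDA kernel the same multiply could be applied as the matmul result is written out.

```python
import math


def scale_then_residual(y, s):
    """Unfused: mirrors the reference forward(), three passes over y."""
    original = list(y)                                # x.clone()
    scaled = [v * s for v in y]                       # x * scaling_factor
    return [a + b for a, b in zip(scaled, original)]  # x + original_x


def fused_scale_residual(y, s):
    """Fused: a single pass, using y * s + y == y * (1 + s)."""
    k = 1.0 + s
    return [v * k for v in y]


y = [0.25, -1.5, 3.0, 8.0]  # stands in for the output of self.matmul(x)
s = 0.5                     # scaling_factor from the prompt

unfused = scale_then_residual(y, s)
fused = fused_scale_residual(y, s)
print(unfused)
print(fused)
print(all(math.isclose(a, b) for a, b in zip(unfused, fused)))
```

The fused version reads and writes each element once instead of three times, which matters because element-wise operations on a GPU are usually limited by memory traffic rather than arithmetic.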

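Closing remarks

At first I was simply curious how they get an AI to write CUDA kernels. Having read through it, the overall approach turns out to be fairly simple.

They set up a prompt, provide one worked example, then supply the source code they want the AI to optimize and let the AI write the kernel itself.

Once it is written, they verify it.

Overall it feels rather crude, and the test cases used are all simple, so it comes across as a bit of a toy. This is purely my personal opinion, and I may be overstating things, so please don't take it too seriously.

What are the benefits of having AI write CUDA kernels?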

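The current development cycle (from efficient implementation, to generalization, to compiler integration) is very long. Efficient compilers typically lag new GPU architectures by more than two years: CUDA experts need roughly a year to develop optimized implementations, and generalizing those optimizations into compilers takes another year. Traditional compilers excel at generating provably correct, robust, and general solutions, which makes them indispensable for a wide range of applications. However, developing a compiler remains a labor-intensive and time-consuming process.

Many design patterns and optimizations can be reused across GPU kernels; the basic principles include overlapping, fusion, efficient memory access, and maximizing occupancy. Our approach aims to complement traditional compilers by focusing on a different goal. Rather than pursuing general, provably correct compiler solutions, we aim to distill human intuition directly into empirically tested, specialized, high-performance code. This makes it possible to generate code that is highly optimized for specific input shapes and computation patterns, a level of specialization that would require extensive pattern-matching rules and manual engineering in a traditional compiler.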