NVIDIA Thrust Tutorial

The API reference guide for Thrust, the CUDA C++ template library.

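1. Introduction

Thrust is a C++ template library for CUDA based on the Standard Template Library (STL). Thrust allows you to implement high performance parallel applications with minimal programming effort through a high-level interface that is fully interoperable with CUDA C.

Thrust provides a rich collection of data parallel primitives such as scan, sort, and reduce, which can be composed together to implement complex algorithms with concise, readable source code. By describing your computation in terms of these high-level abstractions you give Thrust the freedom to select the most efficient implementation automatically. As a result, Thrust can be used for rapid prototyping of CUDA applications, where programmer productivity matters most, as well as in production, where robustness and absolute performance are crucial.

This document describes how to develop CUDA applications with Thrust. The tutorial is intended to be accessible, even if you have limited C++ or CUDA experience.

1.1. Installation and Versioning

Installing the CUDA Toolkit copies the Thrust header files to the standard CUDA include directory for your system. Since Thrust is a header-only template library, no further installation is necessary to start using it.

In addition, new versions of Thrust continue to be available online through the Thrust project page on GitHub.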

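2. Vectors

Thrust provides two vector containers, host_vector and device_vector. As the names suggest, host_vector is stored in host memory while device_vector lives in GPU device memory. Thrust's vector containers are just like std::vector in the C++ STL. Like std::vector, host_vector and device_vector are generic containers (able to store any data type) that can be resized dynamically. The following source code illustrates the use of Thrust's vector containers.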

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>

#include <iostream>

int main(void)
{
    // H has storage for 4 integers
    thrust::host_vector<int> H(4);

    // initialize individual elements
    H[0] = 14;
    H[1] = 20;
    H[2] = 38;
    H[3] = 46;

    // H.size() returns the size of vector H
    std::cout << "H has size " << H.size() << std::endl;

    // print contents of H
    for(int i = 0; i < H.size(); i++)
        std::cout << "H[" << i << "] = " << H[i] << std::endl;

    // resize H
    H.resize(2);

    std::cout << "H now has size " << H.size() << std::endl;

    // Copy host_vector H to device_vector D
    thrust::device_vector<int> D = H;

    // elements of D can be modified
    D[0] = 99;
    D[1] = 88;

    // print contents of D
    for(int i = 0; i < D.size(); i++)
        std::cout << "D[" << i << "] = " << D[i] << std::endl;

    // H and D are automatically deleted when the function returns
    return 0;
}

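As this example shows, the = operator can be used to copy a host_vector to a device_vector (or vice versa). The = operator can also be used to copy a host_vector to a host_vector or a device_vector to a device_vector. Also note that individual elements of a device_vector can be accessed using the standard bracket notation. However, because each of these accesses requires a call to cudaMemcpy, they should be used sparingly. We'll look at some more efficient techniques later.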

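It's often useful to initialize all the elements of a vector to a specific value, or to copy only a certain set of values from one vector to another. Thrust provides a few ways to do these kinds of operations.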

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>

#include <thrust/copy.h>
#include <thrust/fill.h>
#include <thrust/sequence.h>

#include <iostream>

int main(void)
{
    // initialize all ten integers of a device_vector to 1
    thrust::device_vector<int> D(10, 1);

    // set the first seven elements of a vector to 9
    thrust::fill(D.begin(), D.begin() + 7, 9);

    // initialize a host_vector with the first five elements of D
    thrust::host_vector<int> H(D.begin(), D.begin() + 5);

    // set the elements of H to 0, 1, 2, 3, ...
    thrust::sequence(H.begin(), H.end());

    // copy all of H back to the beginning of D
    thrust::copy(H.begin(), H.end(), D.begin());

    // print D
    for(int i = 0; i < D.size(); i++)
        std::cout << "D[" << i << "] = " << D[i] << std::endl;

    return 0;
}

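Here we've illustrated the use of the fill, copy, and sequence functions. The copy function can be used to copy a range of host or device elements to another host or device vector. Like the corresponding STL function, thrust::fill simply sets a range of elements to a specific value. Thrust's sequence function can be used to create a sequence of equally spaced values.

2.1. Thrust Namespace

You'll notice that we use things like thrust::host_vector or thrust::copy in our examples. The thrust:: part tells the C++ compiler that we want to look inside the thrust namespace for a specific function or class. Namespaces are a nice way to avoid name collisions. For instance, thrust::copy is different from std::copy provided in the STL. C++ namespaces allow us to distinguish between these two copy functions.

2.2. Iterators and Static Dispatching

In this section we used expressions like H.begin() and H.end() or offsets like D.begin() + 7. The result of begin() and end() is called an iterator in C++. In the case of vector containers, which are really just arrays, iterators can be thought of as pointers to array elements. Therefore, H.begin() is an iterator that points to the first element of the array stored inside the H vector. Similarly, H.end() points to the position one past the last element of the H vector.

Although vector iterators are similar to pointers, they carry more information with them. Notice that we did not have to tell thrust::fill that it was operating on a device_vector iterator. This information is captured in the type of the iterator returned by D.begin(), which is different from the type returned by H.begin(). When a Thrust function is called, it inspects the type of the iterator to determine whether to use a host or a device implementation. This process is known as static dispatching since the host/device dispatch is resolved at compile time. Note that this implies there is no runtime overhead to the dispatch process.

You may wonder what happens when a "raw" pointer is used as an argument to a Thrust function. Like the STL, Thrust permits this usage and it will dispatch the host path of the algorithm. If the pointer in question is in fact a pointer to device memory, then you'll need to wrap it with thrust::device_ptr before calling the function. For example: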

size_t N = 10;

// raw pointer to device memory
int * raw_ptr;
cudaMalloc((void **) &raw_ptr, N * sizeof(int));

// wrap raw pointer with a device_ptr
thrust::device_ptr<int> dev_ptr(raw_ptr);

// use device_ptr in thrust algorithms
thrust::fill(dev_ptr, dev_ptr + N, (int) 0);

To extract a raw pointer from a device_ptr, apply raw_pointer_cast as follows:

size_t N = 10;

// create a device_ptr
thrust::device_ptr<int> dev_ptr = thrust::device_malloc<int>(N);

// extract raw pointer from device_ptr
int * raw_ptr = thrust::raw_pointer_cast(dev_ptr);

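Another reason to distinguish between iterators and pointers is that iterators can be used to traverse many kinds of data structures. For example, the STL provides a linked list container (std::list) that provides bidirectional (but not random access) iterators. Although Thrust does not provide device implementations of such containers, it is compatible with them.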

#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <list>
#include <vector>

int main(void)
{
    // create an STL list with 4 values
    std::list<int> stl_list;

    stl_list.push_back(10);
    stl_list.push_back(20);
    stl_list.push_back(30);
    stl_list.push_back(40);

    // initialize a device_vector with the list
    thrust::device_vector<int> D(stl_list.begin(), stl_list.end());

    // copy a device_vector into an STL vector
    std::vector<int> stl_vector(D.size());
    thrust::copy(D.begin(), D.end(), stl_vector.begin());

    return 0;
}

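For future reference: the iterators we've covered so far are useful, but fairly basic. In addition to these normal iterators, Thrust also provides a collection of fancy iterators with names like counting_iterator and zip_iterator. While they look and feel like normal iterators, fancy iterators are capable of more exciting things. We'll revisit this topic later in the tutorial.

3. Algorithms

Thrust provides a large number of common parallel algorithms. Many of these algorithms have direct analogs in the STL, and when an equivalent STL function exists, we use the same name (e.g. thrust::sort and std::sort).

All algorithms in Thrust have implementations for both host and device. Specifically, when a Thrust algorithm is invoked with a host iterator, the host path is dispatched. Similarly, a device implementation is called when a device iterator is used to define a range.

With the exception of thrust::copy, which can copy data between host and device, all iterator arguments to a Thrust algorithm should live in the same place: either all on the host or all on the device. When this requirement is violated, the compiler will produce an error message.

3.1. Transformations

Transformations are algorithms that apply an operation to each element in a set of (zero or more) input ranges and then store the result in a destination range. One example we have already seen is thrust::fill, which sets all elements of a range to a specified value. Other transformations include thrust::sequence, thrust::replace, and of course thrust::transform. Refer to the documentation for a complete listing.

The following source code demonstrates several of the transformation algorithms. Note that thrust::negate and thrust::modulus are known as functors in C++ terminology. Thrust provides these and other common functors, such as plus and multiplies, in the file thrust/functional.h.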

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/sequence.h>
#include <thrust/copy.h>
#include <thrust/fill.h>
#include <thrust/replace.h>
#include <thrust/functional.h>
#include <iostream>
#include <iterator>   // std::ostream_iterator

int main(void)
{
    // allocate three device_vectors with 10 elements
    thrust::device_vector<int> X(10);
    thrust::device_vector<int> Y(10);
    thrust::device_vector<int> Z(10);

    // initialize X to 0,1,2,3, ....
    thrust::sequence(X.begin(), X.end());

    // compute Y = -X
    thrust::transform(X.begin(), X.end(), Y.begin(), thrust::negate<int>());

    // fill Z with twos
    thrust::fill(Z.begin(), Z.end(), 2);

    // compute Y = X mod 2
    thrust::transform(X.begin(), X.end(), Z.begin(), Y.begin(), thrust::modulus<int>());

    // replace all the ones in Y with tens
    thrust::replace(Y.begin(), Y.end(), 1, 10);

    // print Y
    thrust::copy(Y.begin(), Y.end(), std::ostream_iterator<int>(std::cout, "\n"));

    return 0;
}

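While the functors in thrust/functional.h cover most of the built-in arithmetic and comparison operations, we often want to do something different. For example, consider the vector operation y <- a * x + y, where x and y are vectors and a is a scalar constant. This is the well-known SAXPY operation provided by any BLAS library.

If we want to implement SAXPY with Thrust we have a few options. The first is to use two transformations (one addition and one multiplication) and a temporary vector filled with the value a. A better choice is to use a single transformation with a user-defined functor that does exactly what we want. We illustrate both approaches in the source code below.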

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/fill.h>
#include <thrust/functional.h>

struct saxpy_functor
{
    const float a;

    saxpy_functor(float _a) : a(_a) {}

    __host__ __device__
    float operator()(const float& x, const float& y) const {
        return a * x + y;
    }
};

void saxpy_fast(float A, thrust::device_vector<float>& X, thrust::device_vector<float>& Y)
{
    // Y <- A * X + Y
    thrust::transform(X.begin(), X.end(), Y.begin(), Y.begin(), saxpy_functor(A));
}

void saxpy_slow(float A, thrust::device_vector<float>& X, thrust::device_vector<float>& Y)
{
    thrust::device_vector<float> temp(X.size());

    // temp <- A
    thrust::fill(temp.begin(), temp.end(), A);

    // temp <- A * X
    thrust::transform(X.begin(), X.end(), temp.begin(), temp.begin(), thrust::multiplies<float>());

    // Y <- A * X + Y
    thrust::transform(temp.begin(), temp.end(), Y.begin(), Y.begin(), thrust::plus<float>());
}

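Both saxpy_fast and saxpy_slow are valid SAXPY implementations, but saxpy_fast will be significantly faster than saxpy_slow. Ignoring the cost of allocating the temp vector and the arithmetic operations, we have the following costs:

saxpy_fast: performs 2N reads and N writes

saxpy_slow: performs 4N reads and 3N writes

Since SAXPY is memory bound (its performance is limited by memory bandwidth, not floating point performance), the larger number of reads and writes makes saxpy_slow much more expensive. In contrast, saxpy_fast will perform about as fast as SAXPY in an optimized BLAS implementation. In memory bound algorithms like SAXPY it is generally worthwhile to apply kernel fusion (combining multiple operations into a single kernel) to minimize the number of memory transactions.

thrust::transform only supports transformations with one or two input arguments (e.g. $f(x)\rightarrow y$ and $f(x_1,x_2)\rightarrow y$). When a transformation uses more than two input arguments, a different approach is necessary. The arbitrary_transformation example demonstrates a solution that uses thrust::zip_iterator and thrust::for_each.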

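3.2. Reductions

A reduction algorithm uses a binary operation to reduce an input sequence to a single value. For example, the sum of an array of numbers is obtained by reducing the array with a plus operation. Similarly, the maximum of an array is obtained by reducing with an operator that takes two inputs and returns the maximum. The sum of an array is implemented with thrust::reduce as follows: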

int sum = thrust::reduce(D.begin(), D.end(), (int) 0, thrust::plus<int>());

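The first two arguments to reduce define the range of values, while the third and fourth provide the initial value and the reduction operator respectively. This kind of reduction is so common that it is the default choice when no initial value or operator is provided. The following three lines are therefore equivalent: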

int sum = thrust::reduce(D.begin(), D.end(), (int) 0, thrust::plus<int>());
int sum = thrust::reduce(D.begin(), D.end(), (int) 0);
int sum = thrust::reduce(D.begin(), D.end());

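Although thrust::reduce is sufficient to implement a wide variety of reductions, Thrust provides a few additional functions for convenience (like the STL). For example, thrust::count returns the number of instances of a specific value in a given sequence: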

#include <thrust/count.h>
#include <thrust/device_vector.h>
...
// put three 1s in a device_vector
thrust::device_vector<int> vec(5,0);
vec[1] = 1;
vec[3] = 1;
vec[4] = 1;

// count the 1s
int result = thrust::count(vec.begin(), vec.end(), 1);
// result is three

Other reduction operations include thrust::count_if, thrust::min_element, thrust::max_element, thrust::is_sorted, thrust::inner_product, and several others. Refer to the documentation for a complete listing.

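The SAXPY example in the Transformations section showed how kernel fusion can be used to reduce the number of memory transfers used by a transformation kernel. With thrust::transform_reduce we can also apply kernel fusion to reduction kernels. Consider the following example, which computes the norm of a vector.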

#include <thrust/transform_reduce.h>
#include <thrust/functional.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <cmath>
#include <iostream>

// square<T> computes the square of a number f(x) -> x*x
template <typename T>
struct square
{
    __host__ __device__
    T operator()(const T& x) const {
        return x * x;
    }
};

int main(void)
{
    // initialize host array
    float x[4] = {1.0, 2.0, 3.0, 4.0};

    // transfer to device
    thrust::device_vector<float> d_x(x, x + 4);

    // setup arguments
    square<float>        unary_op;
    thrust::plus<float> binary_op;
    float init = 0;

    // compute norm
    float norm = std::sqrt( thrust::transform_reduce(d_x.begin(), d_x.end(), unary_op, init, binary_op) );

    std::cout << norm << std::endl;

    return 0;
}

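Here we have a unary operator called square that squares each element of the input sequence. The sum of squares is then computed using a standard plus reduction. Like the slower version of the SAXPY transformation, we could implement the norm with multiple passes: first a transform using square or perhaps just multiplies, and then a plus reduction over a temporary array. However, this would be unnecessarily wasteful and considerably slower. By fusing the square operation with the reduction kernel we again have a highly optimized implementation that offers the same performance as hand-written kernels.

3.3. Prefix-Sums

Parallel prefix-sums, or scan operations, are important building blocks in many parallel algorithms such as stream compaction and radix sort. Consider the following source code, which illustrates an inclusive scan operation using the default plus operator: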

#include <thrust/scan.h>

int data[6] = {1, 0, 2, 2, 1, 3};

thrust::inclusive_scan(data, data + 6, data); // in-place scan

// data is now {1, 1, 3, 5, 6, 9}

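In an inclusive scan each element of the output is the corresponding partial sum of the input range. For example, data[2] = data[0] + data[1] + data[2]. An exclusive scan is similar, but shifted by one place to the right: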

#include <thrust/scan.h>

int data[6] = {1, 0, 2, 2, 1, 3};

thrust::exclusive_scan(data, data + 6, data); // in-place scan

// data is now {0, 1, 1, 3, 5, 6}

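So now data[2] = data[0] + data[1]. As these examples show, inclusive_scan and exclusive_scan are permitted to be performed in-place. Thrust also provides the functions transform_inclusive_scan and transform_exclusive_scan, which apply a unary function to the input sequence before performing the scan. Refer to the documentation for a complete list of scan variants.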

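3.4. Reordering

Thrust provides support for partitioning and stream compaction through the following algorithms:

copy_if: copies elements that pass a predicate test
partition: reorders elements according to a predicate (true values precede false values)
remove and remove_if: remove elements equal to a given value, or elements for which a predicate returns true
unique: removes consecutive duplicates within a sequence

Refer to the documentation for a complete list of reordering functions and examples of their usage.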

3.5. Sorting

Thrust offers several functions to sort data or rearrange data according to a given criterion. The thrust::sort and thrust::stable_sort functions are direct analogs of sort and stable_sort in the STL.

#include <thrust/sort.h>

...
const int N = 6;
int A[N] = {1, 4, 2, 8, 5, 7};

thrust::sort(A, A + N);

// A is now {1, 2, 4, 5, 7, 8}

In addition, Thrust provides thrust::sort_by_key and thrust::stable_sort_by_key, which sort key-value pairs stored in separate places.

#include <thrust/sort.h>

...
const int N = 6;
int    keys[N] = {  1,   4,   2,   8,   5,   7};
char values[N] = {'a', 'b', 'c', 'd', 'e', 'f'};

thrust::sort_by_key(keys, keys + N, values);

// keys is now   {  1,   2,   4,   5,   7,   8}
// values is now {'a', 'c', 'b', 'e', 'f', 'd'}

Like their STL counterparts, the sorting functions also accept user-defined comparison operators:

#include <thrust/sort.h>
#include <thrust/functional.h>

...
const int N = 6;
int A[N] = {1, 4, 2, 8, 5, 7};

thrust::stable_sort(A, A + N, thrust::greater<int>());

// A is now {8, 7, 5, 4, 2, 1}

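4. Fancy Iterators

Fancy iterators serve a variety of valuable purposes. In this section we'll show how fancy iterators allow us to attack a broader class of problems with the standard Thrust algorithms. For those familiar with the Boost C++ Library, note that our fancy iterators were inspired by (and generally derived from) those in the Boost Iterator Library.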

4.1. constant_iterator

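Arguably the simplest of the bunch, constant_iterator is simply an iterator that returns the same value whenever we access it. In the following example we initialize a constant iterator with the value 10.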

#include <thrust/iterator/constant_iterator.h>
...
// create iterators
thrust::constant_iterator<int> first(10);
thrust::constant_iterator<int> last = first + 3;

first[0]   // returns 10
first[1]   // returns 10
first[100] // returns 10

// sum of [first, last)
thrust::reduce(first, last);   // returns 30 (i.e. 3 * 10)

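Whenever an input sequence of constant values is needed, constant_iterator is a convenient and efficient solution.

4.2. counting_iterator

If a sequence of increasing values is required, then counting_iterator is the appropriate choice. Here we initialize a counting_iterator with the value 10 and access it like an array.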

#include <thrust/iterator/counting_iterator.h>
...
// create iterators
thrust::counting_iterator<int> first(10);
thrust::counting_iterator<int> last = first + 3;

first[0]   // returns 10
first[1]   // returns 11
first[100] // returns 110

// sum of [first, last)
thrust::reduce(first, last);   // returns 33 (i.e. 10 + 11 + 12)

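While constant_iterator and counting_iterator act as arrays, they don't actually require any memory storage. Whenever we dereference one of these iterators, it generates the appropriate value on the fly and returns it to the calling function.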

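4.3. transform_iterator

In the Algorithms section we spoke about kernel fusion, i.e. combining separate algorithms like transform and reduce into a single transform_reduce operation. The transform_iterator allows us to apply the same technique even when we don't have a special transform_xxx version of the algorithm. This example shows another way to fuse a transformation with a reduction, this time with just plain reduce applied to a transform_iterator.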

#include <thrust/iterator/transform_iterator.h>
#include <thrust/functional.h>
// initialize vector
thrust::device_vector<int> vec(3);
vec[0] = 10; vec[1] = 20; vec[2] = 30;

// create iterator (type omitted)
... first = thrust::make_transform_iterator(vec.begin(), thrust::negate<int>());
... last  = thrust::make_transform_iterator(vec.end(),   thrust::negate<int>());

first[0]   // returns -10
first[1]   // returns -20
first[2]   // returns -30

// sum of [first, last)
thrust::reduce(first, last);   // returns -60 (i.e. -10 + -20 + -30)

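Note that we have omitted the types of the iterators first and last for simplicity. One downside of transform_iterator is that it can be cumbersome to specify the full type of the iterator, which can be quite lengthy. For this reason, it is common practice to simply put the call to make_transform_iterator in the arguments of the algorithm being invoked. For example,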

// sum of [first, last)
thrust::reduce(thrust::make_transform_iterator(vec.begin(), thrust::negate<int>()),
               thrust::make_transform_iterator(vec.end(),   thrust::negate<int>()));

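allows us to avoid creating variables to store first and last.

4.4. permutation_iterator

In the previous section we showed how transform_iterator is used to fuse a transformation with another algorithm to avoid unnecessary memory operations. The permutation_iterator is similar: it allows us to fuse gather and scatter operations with Thrust algorithms, or even with other fancy iterators. The following example shows how to fuse a gather operation with a reduction.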

#include <thrust/iterator/permutation_iterator.h>

...

// gather locations
thrust::device_vector<int> map(4);
map[0] = 3;
map[1] = 1;
map[2] = 0;
map[3] = 5;

// array to gather from
thrust::device_vector<int> source(6);
source[0] = 10;
source[1] = 20;
source[2] = 30;
source[3] = 40;
source[4] = 50;
source[5] = 60;

// fuse gather with reduction:
//   sum = source[map[0]] + source[map[1]] + ...
int sum = thrust::reduce(thrust::make_permutation_iterator(source.begin(), map.begin()),
                         thrust::make_permutation_iterator(source.begin(), map.end()));

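Here we have used the make_permutation_iterator function to simplify the construction of the permutation_iterators. The first argument to make_permutation_iterator is the source array of the gather operation and the second is the list of map indices. Note that we pass in source.begin() for the first argument in both cases, but vary the second argument to define the beginning and end of the sequence.

When a permutation_iterator is used as the output sequence of a function, it is equivalent to fusing a scatter operation into the algorithm. In general, permutation_iterator allows you to operate on a specific set of values in a sequence instead of the entire sequence.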

4.5. zip_iterator

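Keep reading, we've saved the best iterator for last! The zip_iterator is an extremely useful gadget: it takes multiple input sequences and yields a sequence of tuples. In this example we "zip" together a sequence of int and a sequence of char into a sequence of tuple<int,char> and compute the tuple with the maximum value.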

#include <thrust/iterator/zip_iterator.h>
...
// initialize vectors
thrust::device_vector<int>  A(3);
thrust::device_vector<char> B(3);
A[0] = 10;  A[1] = 20;  A[2] = 30;
B[0] = 'x'; B[1] = 'y'; B[2] = 'z';

// create iterator (type omitted)
first = thrust::make_zip_iterator(thrust::make_tuple(A.begin(), B.begin()));
last  = thrust::make_zip_iterator(thrust::make_tuple(A.end(),   B.end()));

first[0]   // returns tuple(10, 'x')
first[1]   // returns tuple(20, 'y')
first[2]   // returns tuple(30, 'z')

// maximum of [first, last)
thrust::maximum< thrust::tuple<int,char> > binary_op;
thrust::tuple<int,char> init = first[0];
thrust::reduce(first, last, init, binary_op); // returns tuple(30, 'z')

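What makes zip_iterator so useful is that most algorithms accept either one or occasionally two input sequences. The zip_iterator allows us to combine many independent sequences into a single sequence of tuples, which can be processed by a broad set of algorithms.

Refer to the arbitrary_transformation example to see how to implement a ternary transformation with zip_iterator and for_each. A simple extension of this example would allow you to compute transformations with multiple output sequences.

In addition to convenience, zip_iterator allows us to implement programs more efficiently. For example, storing 3d points as an array of float3 in CUDA is generally a bad idea, since array accesses are not properly coalesced. With zip_iterator we can store the three coordinates in three separate arrays, which does permit coalesced memory access. In this case, we use zip_iterator to create a virtual array of 3d vectors that we can feed into Thrust algorithms. Refer to the dot_products_with_zip example for additional details.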