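Recently, while writing an extension for an open-source project in our lab, I used the nanobind library to bind a C++ class template with multiple template parameters, and I ran into a baffling problem.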
Problem description
```
#1 0x00007ffff73b881c in _Py_XINCREF (op=0x7ffff74187830100) at /usr/include/python3.8/object.h:532
#2 0x00007ffff73ba94e in nanobind::detail::nb_func_new (in_=0x7fffffffbf70) at /home/test/.local/lib/python3.8/site-packages/nanobind/src/nb_func.cpp:425
```
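The crash is a reference-count operation that goes wrong while nanobind is binding a C++ function. _Py_INCREF and _Py_XINCREF are CPython's internal inline helpers that increment an object's reference count. _Py_XINCREF only skips a NULL pointer. When it is handed a garbage address such as op=0x7ffff74187830100 above, the increment dereferences invalid memory and the process dies with a segmentation fault.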
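The model below makes that failure mode concrete. It is a simplified stand-in, not CPython's API: FakePyObject, fake_incref and fake_xincref are hypothetical names. It mirrors the logic of the two helpers in CPython 3.8's object.h, where the X variant guards only against NULL and dereferences any other pointer, valid or not.

```cpp
#include <cstddef>

// Simplified stand-in for a PyObject header; the real struct also has ob_type.
struct FakePyObject {
    std::ptrdiff_t ob_refcnt = 1;
};

// Mirrors Py_INCREF: unconditionally dereferences op and bumps the count.
inline void fake_incref(FakePyObject* op) {
    op->ob_refcnt++;
}

// Mirrors Py_XINCREF: only a null pointer is skipped. A non-null garbage
// pointer (such as 0x100) would still be dereferenced and would fault.
inline void fake_xincref(FakePyObject* op) {
    if (op != nullptr) {
        fake_incref(op);
    }
}
```

So the null check does not help here. The pointer reaching _Py_XINCREF is non-null but bogus, which points at corrupted data upstream rather than at a refcount bug.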
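The strange part is that the template class allocates nothing on the heap: every member is held by value. The class looks roughly like this: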
```cpp
#include <cstdint>
#include <utility>
#include <vector>

template<class T, class W1, class W2, class F, class S>
class xxxxxx {
private:
    uint32_t totalcount_;
    uint32_t number_of_insert_sketch_;
    uint32_t filter_size_;
    F p_filter_;   // filter, held by value
    S p_sketch_;   // sketch, held by value

public:
    xxxxxx(int filter_size, int sketch_depth, int sketch_width, int sketch_seed);
    void print_filter();
    void print_sketch();
    void reset();
    bool update_asketch(T new_id, uint32_t new_count);
    int get_filter_size();
    uint32_t get_totalcount();
    void get_hhs(std::vector<std::pair<T, W1>>& vec);
    bool checkmembership_in_filter(const T& new_id);
    void get_element_in_filter_with_index(int index, T& id, W1& value);
    uint32_t get_number_of_insert_sketch_() { return number_of_insert_sketch_; }
    ~xxxxxx();
    double getRelativeError() { return p_sketch_.getRelativeError(); }
    double getConfidence() { return p_sketch_.getConfidence(); }
    W2 estimate(const W2 item) { return p_sketch_.estimate(item); }
};
```
Troubleshooting
Checking nanobind's return value policies
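nanobind provides several return value policies. They specify how ownership and lifetime of a C++ object are handled when it is returned to Python. The default is rv_policy::automatic. It falls back to take_ownership for pointers, move for rvalue references and copy for lvalue references.
rv_policy::take_ownership: wraps the existing C++ instance in a Python object without copying and transfers ownership to Python. When the Python object is destroyed, nanobind calls the C++ delete operator to free it. Suited to newly created objects whose lifetime Python should manage.
rv_policy::copy: copy-constructs a new instance that Python owns, while C++ keeps ownership of the original. A conservative choice when returning a reference to a C++ object, since it avoids problems caused by unclear ownership.
rv_policy::move: move-constructs a new instance that Python owns. C++ keeps the original, but its contents were likely invalidated by the move. More efficient than copy, and suited to returning values rather than references.
rv_policy::reference: creates a thin Python wrapper without copying or transferring ownership, and nanobind never calls delete, even when the wrapper expires. Suited to global or static C++ objects that Python should only refer to.
Note: if the C++ object is destroyed while Python is still using the wrapper, the behavior is undefined, so use this policy with care.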
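The four policies differ only in who owns the returned object and whether a new one is constructed. Below is a plain C++ analogy, not nanobind code. Holder stands in for the Python wrapper, and the hypothetical policy_* functions mimic what each policy does with the returned object.

```cpp
#include <memory>
#include <string>
#include <utility>

// Stand-in for a bound C++ class.
struct Payload {
    std::string data;
};

// Stand-in for the Python wrapper object.
struct Holder {
    std::unique_ptr<Payload> owned;  // non-null when the wrapper owns (and will delete) the object
    Payload* view = nullptr;         // the object the wrapper exposes
};

// take_ownership: wrap the existing heap object; the wrapper deletes it later.
Holder policy_take_ownership(Payload* p) {
    Holder h;
    h.owned.reset(p);
    h.view = p;
    return h;
}

// copy: copy-construct a new object owned by the wrapper; C++ keeps the original.
Holder policy_copy(const Payload& p) {
    Holder h;
    h.owned = std::make_unique<Payload>(p);
    h.view = h.owned.get();
    return h;
}

// move: move-construct a new object owned by the wrapper; the original stays
// with C++ but is left in a valid, unspecified state.
Holder policy_move(Payload& p) {
    Holder h;
    h.owned = std::make_unique<Payload>(std::move(p));
    h.view = h.owned.get();
    return h;
}

// reference: no copy and no ownership; the wrapper must not outlive p.
Holder policy_reference(Payload& p) {
    Holder h;
    h.view = &p;
    return h;
}
```

In a real binding the policy is chosen per function, for example by passing nb::rv_policy::reference as an extra argument to .def().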
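In principle, the default policy should be fine for a template class that holds all its members by value. To be thorough, I tested every policy anyway, and none of them helped. In hindsight this is expected. The crash happens inside nb_func_new while the function is being registered at import time, before any value has been returned, so the return value policy cannot be involved.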
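Incorrect binding types
The bindings produced through nanobind::cpp_function_def and nanobind::class_ might not handle the function's type correctly. This is especially common with overloads or mismatched pointer types in class templates. nanobind is strict about template types, and my guess was that a wrong type deduction could confuse its reference-count management.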
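The template was instantiated with uint32_t throughout, while the Python side sees int. I suspected that the type difference made the pointers incompatible, so I changed everything to int, but binding still failed with the same error.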
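A Python int and a C++ uint32_t are converted by value when a call is made, not reinterpreted through a pointer. That conversion also happens at call time, whereas the crash occurs during registration, so the type difference alone should not corrupt anything. The hypothetical helper below sketches the kind of range check a binding layer performs, with the Python int modelled as a long long for simplicity. An out-of-range value is rejected instead of being reinterpreted.

```cpp
#include <cstdint>
#include <limits>
#include <optional>

// Hypothetical helper: convert an integer coming from Python (modelled as
// long long) into uint32_t, refusing values that do not fit instead of
// silently truncating them.
std::optional<std::uint32_t> to_uint32(long long v) {
    if (v < 0) {
        return std::nullopt;  // negative values cannot be represented
    }
    if (static_cast<unsigned long long>(v) > std::numeric_limits<std::uint32_t>::max()) {
        return std::nullopt;  // too large for 32 bits
    }
    return static_cast<std::uint32_t>(v);
}
```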
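Inheritance
For the instantiated class, even binding the constructor with its arguments crashed. So I wondered whether inheritance was unsupported, because some classes passed as template arguments derive from a base class. To rule this out, I bound a minimal base/derived pair: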
```cpp
#include <string>
#include <nanobind/nanobind.h>
#include <nanobind/stl/string.h>

namespace nb = nanobind;

struct Pet {
    std::string name;
    Pet(const std::string& name) : name(name) {}
};

struct Dog : public Pet {
    Dog(const std::string& name) : Pet(name) {}
    std::string bark() const { return name + ": woof!"; }
};

NB_MODULE(my_ext, m) {
    // nanobind's class_<T, Base> takes no holder type parameter (unlike pybind11);
    // std::shared_ptr support comes from <nanobind/stl/shared_ptr.h> instead.
    nb::class_<Pet>(m, "Pet")
        .def(nb::init<const std::string &>())
        .def_rw("name", &Pet::name);

    nb::class_<Dog, Pet>(m, "Dog")
        .def(nb::init<const std::string &>())
        .def("bark", &Dog::bark);
}
```
The Python test code:
```python
import my_ext

d = my_ext.Dog("Molly")  # d is owned by Python
print(d.name)            # prints: Molly
print(d.bark())          # prints: Molly: woof!

# Share the object with other Python code
other_ref = d
del d                    # does not destroy the object; other_ref still holds a reference
print(other_ref.bark())  # still works: Molly: woof!
```
The test ran without any problem, so inheritance is not what breaks the binding.
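Template issue
Next I checked whether functions of a multi-parameter template simply cannot be bound. I wrote a non-template test class, a very simple min-heap filter: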
```cpp
#include <cstdint>
#include <utility>
#include <vector>

class MinHeapFilter64
{
private:
    int filter_size_;
    std::vector<uint64_t> key_heap_;
    std::vector<uint64_t> value_heap_;
    void swap(int i, int j);
    int fixdown(int index);

public:
    MinHeapFilter64(int filter_size);
    void reset();
    void print_filter();
    bool isFull();
    int get_filter_size() { return filter_size_; }
    int check_membership(uint64_t new_id);
    void replace_min_element(uint64_t new_id, uint64_t new_count);
    std::pair<uint64_t, uint64_t> get_element_with_index(int index) { return std::make_pair(key_heap_[index], value_heap_[index]); }
    std::pair<uint64_t, uint64_t> get_top_element() { return std::make_pair(key_heap_[0], value_heap_[0]); }
    uint64_t get_key_with_index(int index) { return key_heap_[index]; }
    uint64_t get_value_with_index(int index) { return value_heap_[index]; }
    void set_value_with_index(int index, uint64_t new_size) { value_heap_[index] = new_size; fixdown(index); }
    bool insert_in_filter(uint64_t new_id, uint64_t new_count = 1);
    uint64_t insert_in_filter_with_sketch(bool* is_insert, bool* is_heavy, uint64_t* min_element_id, uint64_t* min_element_value, uint64_t new_id, uint64_t new_count = 1);
};
```
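The post does not show the method bodies of MinHeapFilter64. For readers unfamiliar with the structure, here is a hypothetical minimal version of its heap logic; the class name MinHeapSketch and all method bodies are assumptions. Keys and counts sit in parallel arrays kept as a min-heap on the count, so the root always holds the element with the smallest count. replace_min_element overwrites that root and sifts it down.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

class MinHeapSketch {
public:
    explicit MinHeapSketch(int filter_size) : filter_size_(filter_size) {}

    bool isFull() const { return static_cast<int>(key_heap_.size()) >= filter_size_; }

    // Heap index of new_id, or -1 if it is not stored (linear scan).
    int check_membership(uint64_t new_id) const {
        for (std::size_t i = 0; i < key_heap_.size(); ++i)
            if (key_heap_[i] == new_id) return static_cast<int>(i);
        return -1;
    }

    // Append while there is room, then sift the new element up.
    bool insert_in_filter(uint64_t new_id, uint64_t new_count = 1) {
        if (isFull()) return false;
        key_heap_.push_back(new_id);
        value_heap_.push_back(new_count);
        int i = static_cast<int>(key_heap_.size()) - 1;
        while (i > 0 && value_heap_[(i - 1) / 2] > value_heap_[i]) {
            swap(i, (i - 1) / 2);
            i = (i - 1) / 2;
        }
        return true;
    }

    // Overwrite the root (smallest count) and restore the heap.
    void replace_min_element(uint64_t new_id, uint64_t new_count) {
        assert(!key_heap_.empty());
        key_heap_[0] = new_id;
        value_heap_[0] = new_count;
        fixdown(0);
    }

    std::pair<uint64_t, uint64_t> get_top_element() const {
        assert(!key_heap_.empty());
        return {key_heap_[0], value_heap_[0]};
    }

private:
    void swap(int i, int j) {
        std::swap(key_heap_[i], key_heap_[j]);
        std::swap(value_heap_[i], value_heap_[j]);
    }

    // Sift the element at index down until no child is smaller; returns its final index.
    int fixdown(int index) {
        const int n = static_cast<int>(value_heap_.size());
        while (true) {
            int smallest = index;
            const int l = 2 * index + 1, r = 2 * index + 2;
            if (l < n && value_heap_[l] < value_heap_[smallest]) smallest = l;
            if (r < n && value_heap_[r] < value_heap_[smallest]) smallest = r;
            if (smallest == index) return index;
            swap(index, smallest);
            index = smallest;
        }
    }

    int filter_size_;
    std::vector<uint64_t> key_heap_;
    std::vector<uint64_t> value_heap_;
};
```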
The nanobind binding code:
```cpp
#include <cstdint>
#include <nanobind/nanobind.h>
#include <nanobind/stl/pair.h>   // needed to return std::pair to Python

namespace nb = nanobind;

NB_MODULE(my_module, m) {
    nb::class_<MinHeapFilter64>(m, "MinHeapFilter64")
        .def(nb::init<int>(), nb::arg("filter_size"), "Initialize MinHeapFilter64 with a given filter size")
        // Methods
        .def("reset", &MinHeapFilter64::reset, "Reset the filter")
        .def("print_filter", &MinHeapFilter64::print_filter, "Prints the filter contents")
        .def("is_full", &MinHeapFilter64::isFull, "Checks if the filter is full")
        .def("get_filter_size", &MinHeapFilter64::get_filter_size, "Returns the filter size")
        .def("check_membership", &MinHeapFilter64::check_membership, nb::arg("new_id"),
             "Check if a specific ID is in the filter and return its index, or -1 if not found")
        .def("replace_min_element", &MinHeapFilter64::replace_min_element, nb::arg("new_id"), nb::arg("new_count"),
             "Replace the minimum element with a new ID and count")
        .def("insert_in_filter", &MinHeapFilter64::insert_in_filter, nb::arg("new_id"), nb::arg("new_count") = 1,
             "Inserts a new element with ID and count into the filter")
        // Methods that return std::pair
        .def("get_element_with_index", &MinHeapFilter64::get_element_with_index, nb::arg("index"),
             "Get the element at a specific index as a pair of (key, value)")
        .def("get_top_element", &MinHeapFilter64::get_top_element,
             "Get the top element (min element) as a pair of (key, value)")
        .def("get_key_with_index", &MinHeapFilter64::get_key_with_index, nb::arg("index"),
             "Get the key at a specific index")
        .def("get_value_with_index", &MinHeapFilter64::get_value_with_index, nb::arg("index"),
             "Get the value at a specific index")
        .def("set_value_with_index", &MinHeapFilter64::set_value_with_index, nb::arg("index"), nb::arg("new_size"),
             "Set the value at a specific index and fix down the heap")
        .def("insert_in_filter_with_sketch", &MinHeapFilter64::insert_in_filter_with_sketch,
             nb::arg("is_insert"), nb::arg("is_heavy"), nb::arg("min_element_id"), nb::arg("min_element_value"),
             nb::arg("new_id"), nb::arg("new_count") = 1,
             "Inserts an element with sketch logic, updating relevant flags");
}
```
It crashed again. The gdb backtrace:
```
Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
__strlen_avx2 () at ../sysdeps/x86_64/multiarch/strlen-avx2.S:65
65 ../sysdeps/x86_64/multiarch/strlen-avx2.S: No such file or directory.
(gdb) bt
#0 __strlen_avx2 () at ../sysdeps/x86_64/multiarch/strlen-avx2.S:65
#1 0x00007ffff7e5c383 in __GI___strdup (s=0x2000200000400 <error: Cannot access memory at address 0x2000200000400>) at strdup.c:41
#2 0x00007ffff73b86e9 in nanobind::detail::strdup_check (s=0x2000200000400 <error: Cannot access memory at address 0x2000200000400>) at /home/test/.local/lib/python3.8/site-packages/nanobind/src/nb_func.cpp:183
#3 0x00007ffff73b95f3 in nanobind::detail::nb_func_new (in_=0x7fffffffbe90) at /home/test/.local/lib/python3.8/site-packages/nanobind/src/nb_func.cpp:424
#4 0x00007ffff73ac437 in nanobind::detail::func_create<false, true, nanobind::cpp_function_def<datasketches::MinHeapFilter64, void, datasketches::MinHeapFilter64, unsigned long, unsigned long, nanobind::scope, nanobind::name, nanobind::is_method, nanobind::arg, nanobind::arg, char [52]>(void (datasketches::MinHeapFilter64::*)(unsigned long, unsigned long), nanobind::scope const&, nanobind::name const&, nanobind::is_method const&, nanobind::arg const&, nanobind::arg const&, char const (&) [52])::{lambda(datasketches::MinHeapFilter64*, unsigned long, unsigned long)#1}, void, datasketches::MinHeapFilter64*, unsigned long, unsigned long, 0ul, 1ul, 2ul, nanobind::scope, nanobind::name, nanobind::is_method, nanobind::arg, nanobind::arg, char [52]>(nanobind::cpp_function_def<datasketches::MinHeapFilter64, void, datasketches::MinHeapFilter64, unsigned long, unsigned long, nanobind::scope, nanobind::name, nanobind::is_method, nanobind::arg, nanobind::arg, char [52]>(void (datasketches::MinHeapFilter64::*)(unsigned long, unsigned long), nanobind::scope const&, nanobind::name const&, nanobind::is_method const&, nanobind::arg const&, nanobind::arg const&, char const (&) [52])::{lambda(datasketches::MinHeapFilter64*, unsigned long, unsigned long)#1}&&, void (*)(datasketches::MinHeapFilter64*, unsigned long, unsigned long), std::integer_sequence<unsigned long, 0ul, 1ul, 2ul>, nanobind::scope const&, nanobind::name const&, nanobind::is_method const&, nanobind::arg const&, nanobind::arg const&, char const (&) [52]) (is=..., func=...) at /home/test/.local/lib/python3.8/site-packages/nanobind/include/nanobind/nb_func.h:310
#5 nanobind::cpp_function_def<datasketches::MinHeapFilter64, void, datasketches::MinHeapFilter64, unsigned long, unsigned long, nanobind::scope, nanobind::name, nanobind::is_method, nanobind::arg, nanobind::arg, char [52]> (f=(void (datasketches::MinHeapFilter64::*)(datasketches::MinHeapFilter64 * const, unsigned long, unsigned long)) 0x7ffff73aa49c <datasketches::MinHeapFilter64::replace_min_element(unsigned long, unsigned long)>) at /home/test/.local/lib/python3.8/site-packages/nanobind/include/nanobind/nb_func.h:370
#6 nanobind::class_<datasketches::MinHeapFilter64>::def<void (datasketches::MinHeapFilter64::*)(unsigned long, unsigned long), nanobind::arg, nanobind::arg, char [52]> (f=<optimized out>, name_=0x7ffff741787b "replace_min_element", this=0x7fffffffba98) at /home/test/.local/lib/python3.8/site-packages/nanobind/include/nanobind/nb_class.h:567
```
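According to the backtrace, the crash occurs while binding replace_min_element. In frame #2, nb_func_new passes the pointer 0x2000200000400 to strdup_check, which copies it as a C string, but that address is not readable memory. The method name itself (0x7ffff741787b in frame #6) is valid, so the bad pointer is presumably one of the other strings nanobind duplicates while building the function's metadata. My first guess was that the name or types supplied in the binding did not match, causing nanobind to read invalid memory.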
If I skip this function, the next one crashes the same way:
```
Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
0x00007ffff73b7f17 in _Py_INCREF (op=0x100) at /usr/include/python3.8/object.h:459
459 op->ob_refcnt++;
(gdb) bt
#0 0x00007ffff73b7f17 in _Py_INCREF (op=0x100) at /usr/include/python3.8/object.h:459
#1 0x00007ffff73b7f8c in _Py_XINCREF (op=0x100) at /usr/include/python3.8/object.h:532
#2 0x00007ffff73ba0be in nanobind::detail::nb_func_new (in_=0x7fffffffbf10) at /home/test/.local/lib/python3.8/site-packages/nanobind/src/nb_func.cpp:425
#3 0x00007ffff73b152d in nanobind::detail::func_create<false, true, nanobind::cpp_function_def<datasketches::MinHeapFilter64, bool, datasketches::MinHeapFilter64, unsigned long, unsigned long, nanobind::scope, nanobind::name, nanobind::is_method, nanobind::arg, nanobind::arg_v, char [56]>(bool (datasketches::MinHeapFilter64::*)(unsigned long, unsigned long), nanobind::scope const&, nanobind::name const&, nanobind::is_method const&, nanobind::arg const&, nanobind::arg_v const&, char const (&) [56])::{lambda(datasketches::MinHeapFilter64*, unsigned long, unsigned long)#1}, bool, datasketches::MinHeapFilter64*, unsigned long, unsigned long, 0ul, 1ul, 2ul, nanobind::scope, nanobind::name, nanobind::is_method, nanobind::arg, nanobind::arg_v, char [56]>(nanobind::cpp_function_def<datasketches::MinHeapFilter64, bool, datasketches::MinHeapFilter64, unsigned long, unsigned long, nanobind::scope, nanobind::name, nanobind::is_method, nanobind::arg, nanobind::arg_v, char [56]>(bool (datasketches::MinHeapFilter64::*)(unsigned long, unsigned long), nanobind::scope const&, nanobind::name const&, nanobind::is_method const&, nanobind::arg const&, nanobind::arg_v const&, char const (&) [56])::{lambda(datasketches::MinHeapFilter64*, unsigned long, unsigned long)#1}&&, bool (*)(datasketches::MinHeapFilter64*, unsigned long, unsigned long), std::integer_sequence<unsigned long, 0ul, 1ul, 2ul>, nanobind::scope const&, nanobind::name const&, nanobind::is_method const&, nanobind::arg const&, nanobind::arg_v const&, char const (&) [56]) (is=..., func=...) at /home/test/.local/lib/python3.8/site-packages/nanobind/include/nanobind/nb_func.h:310
#4 nanobind::cpp_function_def<datasketches::MinHeapFilter64, bool, datasketches::MinHeapFilter64, unsigned long, unsigned long, nanobind::scope, nanobind::name, nanobind::is_method, nanobind::arg, nanobind::arg_v, char [56]> (f=(bool (datasketches::MinHeapFilter64::*)(datasketches::MinHeapFilter64 * const, unsigned long, unsigned long)) 0x7ffff73af4c6 <datasketches::MinHeapFilter64::insert_in_filter(unsigned long, unsigned long)>) at /home/test/.local/lib/python3.8/site-packages/nanobind/include/nanobind/nb_func.h:370
#5 nanobind::class_<datasketches::MinHeapFilter64>::def<bool (datasketches::MinHeapFilter64::*)(unsigned long, unsigned long), nanobind::arg, nanobind::arg_v, char [56]> (f=<optimized out>, name_=0x7ffff741887b "insert_in_filter", this=0x7fffffffbd48) at /home/test/.local/lib/python3.8/site-packages/nanobind/include/nanobind/nb_class.h:567
#6 bind_min_heap_filter_int32 (m=...) at /home/test/Desktop/apache-datasketches-python/src/augmented_sketch_wrapper.cpp:88
#7 0x00007ffff73b15b3 in init_augmented_sketch (m=...) at /home/test/Desktop/apache-datasketches-python/src/augmented_sketch_wrapper.cpp:116
#8 0x00007ffff7105c2d in nanobind_init__datasketches (m=...) at /home/test/Desktop/apache-datasketches-python/src/datasketches.cpp:78
```
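Yet these functions are trivial and allocate no memory. Their bodies never even run, because the crash happens while they are being registered. If I comment out the offending function and everything after it, the functions before it bind normally.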
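The final problem: multi-argument bindings are incompatible
So I bound the class's methods one at a time and found the pattern. Every method that bound successfully took at most one argument, and every method with more than one argument crashed. One possible explanation for a two-argument binding segfaulting is a memory-alignment or compiler ABI incompatibility.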
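Different compilers (e.g. GCC and Clang) can produce inconsistent ABIs. So can the same compiler built with different options. The ABI covers low-level details such as calling conventions and the memory layout and alignment of parameters and data structures. For STL types (e.g. std::pair, std::tuple) or custom composite types, two builds may not agree on these details.

Note that the crash happens at import time, while the methods are being registered, not when Python calls them. A plausible reading is that two pieces of code disagree on the layout of the records they exchange: the templated nanobind code compiled into the extension from the headers, and the separately compiled nanobind library (nb_func.cpp). In that case nb_func_new would pick up garbage where it expects string and object pointers. That would match the unreadable addresses handed to strdup_check and _Py_XINCREF above.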
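The sketch below shows how such a layout disagreement turns into a bad pointer. It is hypothetical: the struct and function names are invented, and it does not reproduce nanobind's internals. The writer stores a flags word first, while the reader expects the name pointer first, so what the reader takes as "name" is really the flags value. The demo value resembles the address passed to strdup in the replace_min_element backtrace. This illustrates the mechanism but does not prove it is what happened here.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical record layout the extension is assumed to write...
struct ArgRecordWriter {
    std::uint64_t flags;
    const char* name;
};

// ...and the layout a separately compiled library is assumed to read.
struct ArgRecordReader {
    const char* name;
    std::uint64_t flags;
};

static_assert(sizeof(void*) == 8, "sketch assumes a 64-bit target");
static_assert(sizeof(ArgRecordWriter) == sizeof(ArgRecordReader), "same size, different field order");

// Returns the bits the reader would treat as its name pointer. The bytes are
// copied out with memcpy and never dereferenced, so inspecting them is safe.
std::uintptr_t name_bits_seen_by_reader(const ArgRecordWriter& rec) {
    std::uintptr_t bits = 0;
    const unsigned char* raw = reinterpret_cast<const unsigned char*>(&rec);
    std::memcpy(&bits, raw + offsetof(ArgRecordReader, name), sizeof bits);
    return bits;
}
```

In the real crash, the reader would then hand such bits to strdup or Py_XINCREF and fault.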
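Workarounds
Change the function signature
One option is to make a multi-argument function take a single std::tuple and unpack it in the function body. Alternatively, add a lambda layer in the binding code that takes one std::tuple argument and unpacks it before calling the real function.
For example: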
```cpp
// AdvancedCounter.hpp
#ifndef ADVANCED_COUNTER_HPP
#define ADVANCED_COUNTER_HPP

class AdvancedCounter {
public:
    AdvancedCounter() : count(0) {}
    void increment(int value, int step) { count += value * step; }
    void add_range(int start, int end) {
        for (int i = start; i <= end; ++i) {
            count += i;
        }
    }
    void reset() { count = 0; }
    int get_count() const { return count; }

private:
    int count;
};

#endif // ADVANCED_COUNTER_HPP
```
The binding code:
```cpp
#include <tuple>
#include <nanobind/nanobind.h>
#include <nanobind/stl/tuple.h>   // converts a Python tuple to std::tuple
#include "AdvancedCounter.hpp"

namespace nb = nanobind;

NB_MODULE(my_advanced_counter, m) {
    nb::class_<AdvancedCounter>(m, "AdvancedCounter")
        .def(nb::init<>(), "Initializes an AdvancedCounter")
        // Wrap increment as a function that takes one std::tuple
        .def("increment", [](AdvancedCounter &counter, std::tuple<int, int> args) {
            int value, step;
            std::tie(value, step) = args;
            counter.increment(value, step);
        }, nb::arg("args"), "Increments the counter by value * step, using a tuple (value, step)")
        // Wrap add_range as a function that takes one std::tuple
        .def("add_range", [](AdvancedCounter &counter, std::tuple<int, int> range) {
            int start, end;
            std::tie(start, end) = range;
            counter.add_range(start, end);
        }, nb::arg("range"), "Adds a range of integers from start to end, using a tuple (start, end)")
        .def("reset", &AdvancedCounter::reset, "Resets the counter to 0")
        .def("get_count", &AdvancedCounter::get_count, "Gets the current count");
}
```
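The lambdas in the binding unpack the tuple with std::tie into local variables. C++17's std::apply does the same unpacking in one call. The sketch below uses hypothetical helpers increment_from_tuple and add_range_from_tuple, plus a trimmed copy of AdvancedCounter so it compiles on its own.

```cpp
#include <tuple>

// Trimmed copy of AdvancedCounter from the header, repeated so this compiles alone.
class AdvancedCounter {
public:
    AdvancedCounter() : count(0) {}
    void increment(int value, int step) { count += value * step; }
    void add_range(int start, int end) {
        for (int i = start; i <= end; ++i) {
            count += i;
        }
    }
    int get_count() const { return count; }

private:
    int count;
};

// Unpack (value, step) directly into the two-argument member call.
void increment_from_tuple(AdvancedCounter& counter, const std::tuple<int, int>& args) {
    std::apply([&counter](int value, int step) { counter.increment(value, step); }, args);
}

// Unpack (start, end) directly into the two-argument member call.
void add_range_from_tuple(AdvancedCounter& counter, const std::tuple<int, int>& range) {
    std::apply([&counter](int start, int end) { counter.add_range(start, end); }, range);
}
```

Inside the binding, the lambda body for increment could then be a single call such as increment_from_tuple(counter, args). From Python the method is still called with one tuple, e.g. counter.increment((3, 4)).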
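Writing a C interface
Provide a C-style interface that wraps each multi-argument call in a plain function, which is then easy to call from Python.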
```cpp
// advanced_counter_wrapper.cpp (must be compiled as C++ because it includes a C++ class)
#include "AdvancedCounter.hpp"

extern "C" {

typedef struct AdvancedCounterWrapper {
    AdvancedCounter* counter;
} AdvancedCounterWrapper;

// Create an AdvancedCounter instance
AdvancedCounterWrapper* create_advanced_counter() {
    return new AdvancedCounterWrapper{ new AdvancedCounter() };
}

// Destroy an AdvancedCounter instance
void destroy_advanced_counter(AdvancedCounterWrapper* wrapper) {
    delete wrapper->counter;
    delete wrapper;
}

// Increment the counter using two arguments
void increment_counter(AdvancedCounterWrapper* wrapper, int value, int step) {
    wrapper->counter->increment(value, step);
}

// Add a range of integers to the counter
void add_range_to_counter(AdvancedCounterWrapper* wrapper, int start, int end) {
    wrapper->counter->add_range(start, end);
}

// Reset the counter
void reset_counter(AdvancedCounterWrapper* wrapper) {
    wrapper->counter->reset();
}

// Get the current counter value
int get_count(AdvancedCounterWrapper* wrapper) {
    return wrapper->counter->get_count();
}

} // extern "C"
```
```cpp
#include <new>
#include <nanobind/nanobind.h>

namespace nb = nanobind;

NB_MODULE(my_advanced_counter_c, m) {
    nb::class_<AdvancedCounterWrapper>(m, "AdvancedCounterWrapper")
        // nanobind has no nb::init(factory); a custom __init__ constructs the
        // wrapper in place, in storage nanobind has already allocated.
        .def("__init__", [](AdvancedCounterWrapper *self) {
            new (self) AdvancedCounterWrapper{ new AdvancedCounter() };
        }, "Initializes an AdvancedCounterWrapper")
        // Free only the inner counter: the wrapper's own storage belongs to nanobind,
        // so calling destroy_advanced_counter() here would free it a second time.
        .def("__del__", [](AdvancedCounterWrapper *wrapper) {
            delete wrapper->counter;
            wrapper->counter = nullptr;
        }, "Releases the wrapped AdvancedCounter")
        // Increment the counter using two arguments
        .def("increment", [](AdvancedCounterWrapper *wrapper, int value, int step) {
            increment_counter(wrapper, value, step);
        }, nb::arg("value"), nb::arg("step"), "Increment the counter by value multiplied by step")
        // Add a range of integers to the counter
        .def("add_range", [](AdvancedCounterWrapper *wrapper, int start, int end) {
            add_range_to_counter(wrapper, start, end);
        }, nb::arg("start"), nb::arg("end"), "Add range to the counter from start to end")
        // Reset the counter
        .def("reset", [](AdvancedCounterWrapper *wrapper) {
            reset_counter(wrapper);
        }, "Reset the counter to 0")
        // Get the current counter value
        .def("get_count", [](AdvancedCounterWrapper *wrapper) {
            return get_count(wrapper);
        }, "Get the current count value");
}
```