Hello ROS community,
As you may have heard, NVIDIA has been working on proposing and prototyping a mechanism to add support for native buffer types to ROS 2, allowing it to natively and efficiently support APIs for accelerated buffers like CUDA or Torch tensors. We briefly touched on this in a previous Discourse post. Since then, a lot of design discussion in the SIG PAI, as well as prototyping on our side, has turned that outline into a full-fledged proposal and prototype.
Below is a rundown of our current status, as well as an outlook on where the work is heading. We look forward to discussion and feedback on the proposal.
Native Buffers in ROS 2
Problem statement
Modern robots use advanced, high-resolution sensors to perceive their environment. Whether it’s cameras, LIDARs, time-of-flight sensors or tactile sensor arrays, the data rates to be processed are ever-increasing.
Processing of those data streams has for the most part moved onto accelerated hardware that can exploit the parallel nature of the data. Whether that is GPUs, DSPs, NPUs/TPUs, ASICs or other approaches, those hardware engines share some common properties:
- They are inherently parallel, and as such well suited to processing many small samples at the same time
- They are dedicated hardware with dedicated interfaces and often dedicated memory
The second property, dedicated memory regions, is problematic in ROS 2, as the framework currently has no way to handle non-CPU memory.
Consider for example the sensor_msgs/PointCloud2 message, which stores data like this:
uint8[] data # Actual point data, size is (row_step*height)
A similar approach is used by sensor_msgs/Image. In rclcpp, this will map to a member like
std::vector<uint8_t> data;
This is problematic for large pieces of data that are never going to be touched by the CPU. It forces the data to be present in CPU memory whenever the framework handles it, in particular for message transport, and every time it crosses a node boundary.
For truly efficient, fully accelerated pipelines, this is undesirable. In cases where there are one or more hardware engines handling the data, it is preferable for the data to stay resident in the accelerator, and never be copied into CPU memory unless a node specifically requests to do so.
We are therefore proposing to add the notion of pluggable memory backends to ROS 2 by introducing a concept of buffers that share a common API, but are implemented with vendor-specific plugins to allow efficient storage and transport using vendor-native, optimized facilities.
Specifically, we are proposing to map uint8[] in rosidl to a custom buffer type in rclcpp that behaves like a std::vector<uint8_t> when used from CPU code, but otherwise keeps the data resident in the vendor’s accelerator memory. This buffer type is also integrated with rmw so the backend can move the buffer between nodes using vendor-specific side channels, allowing transparent zero-copy transport of the data if implemented by the vendor.
Architecture overview
Message encoding
The following diagram gives an overview of a message containing a uint8[] array, how it is mapped to C++, and how it is then serialized:
It shows the following parts, which we will discuss in more detail later:
- Declaration of a buffer using uint8[] in a message definition, as before
- Mapping onto a custom buffer type in rclcpp, called Buffer<T> here
- The internals of the Buffer<T> type, in particular its std::vector<T>-compatible interface, as well as a pointer to a vendor-specific implementation
- A vendor-specific backend providing serialization, as well as custom APIs
- The message being encoded into a vendor-specific buffer descriptor message, which is serialized in place of the raw byte array in the message
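To make the internals above concrete, here is a minimal, hypothetical sketch of a buffer type with a pluggable implementation pointer. All names (BufferImpl, CpuBufferImpl) are invented for illustration and are not the actual prototype; a real accelerator backend would stage device memory to the host lazily rather than store a plain vector.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Opaque vendor-specific implementation. A real backend would hold
// device pointers, IPC handles, etc. instead of host memory.
template <typename T>
struct BufferImpl {
  virtual ~BufferImpl() = default;
  // Ensure the data is present in CPU memory and return it.
  virtual std::vector<T> & cpu_data() = 0;
};

// Default CPU backend: the data simply lives in a std::vector.
template <typename T>
struct CpuBufferImpl : BufferImpl<T> {
  std::vector<T> storage;
  std::vector<T> & cpu_data() override { return storage; }
};

template <typename T>
class Buffer {
public:
  Buffer() : impl_(std::make_shared<CpuBufferImpl<T>>()) {}

  // std::vector<T>-compatible surface; in a non-CPU backend each of
  // these calls may trigger a device-to-host copy first.
  std::size_t size() const { return impl_->cpu_data().size(); }
  T & operator[](std::size_t pos) { return impl_->cpu_data()[pos]; }
  void push_back(const T & v) { impl_->cpu_data().push_back(v); }

  // Vendor backends register their own impl here at allocation time.
  std::shared_ptr<BufferImpl<T>> impl_;
};
```

The key design point is that the vector-compatible interface is a facade over the impl pointer, so swapping the backend never changes calling code.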
Choice of uint8[] as trigger
It is worth noting the choice to utilize uint8[] as a trigger to generate Buffer<T> instances. An alternative approach would have been to add a new Buffer type to the IDL, and to translate that into Buffer<T>. However, this would not only introduce a break in compatibility of the IDL, but also force the introduction of a sensor_msgs/PointCloud3 and similar data types, fracturing the message ecosystem further.
We believe the cost of maintaining a std::vector compatible interface and the slight loss of semantics is outweighed by the benefit of being drop-in compatible with both existing messages and existing code bases.
Integration with rclcpp (and rclpy and rclrs)
rclcpp exposes all uint8[] fields as rosidl_runtime_cpp::Buffer<T> members in their respective generated C++ structs.
rosidl_runtime_cpp::Buffer<T> has a fully compatible interface to std::vector<T>, with size(), operator[](size_type pos) etc. If any of the std::vector<T> APIs is used, the data is copied to the CPU as necessary, and all members work as expected. This maintains full compatibility with existing code: any code that expects a std::vector<T> in the message can use the corresponding fields as such without any code changes.
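As a rough, self-contained illustration of this drop-in compatibility (using a CPU-only stand-in for the buffer type; nothing here is the actual prototype API), container-generic code written against std::vector-style members compiles unchanged against the buffer:

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// CPU-backed stand-in; a real instance could hold accelerator memory
// and copy it to the host lazily on first std::vector-style access.
class Buffer {
public:
  std::size_t size() const { return data_.size(); }
  uint8_t operator[](std::size_t pos) const { return data_[pos]; }
  uint8_t & operator[](std::size_t pos) { return data_[pos]; }
  auto begin() const { return data_.begin(); }
  auto end() const { return data_.end(); }
  void push_back(uint8_t v) { data_.push_back(v); }

private:
  std::vector<uint8_t> data_;  // device-to-host staging in a real backend
};

// "Legacy" code written against the std::vector-style interface;
// it works identically with std::vector<uint8_t> and with Buffer.
template <typename Container>
unsigned sum_bytes(const Container & c) {
  return std::accumulate(c.begin(), c.end(), 0u);
}
```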
In order to access the underlying hardware buffers, vendor-specific APIs are used. Suppose a vendor backend named vendor_buffer_backend exists; the backend would then typically provide a static method to convert a buffer to its native type. Our hypothetical vendor backend may then be used as follows:
void topic_callback(const msg::MessageWithTensor & input_msg) {
  vendor_native_handle input_h =
    vendor_buffer_backend::from_buffer(input_msg.data);
  msg::MessageWithTensor output_msg =
    vendor_buffer_backend::allocate<msg::MessageWithTensor>();
  vendor_native_handle output_h =
    vendor_buffer_backend::from_buffer(output_msg.data);
  output_h = input_h.some_operation();
  publisher_.publish(output_msg);
}
This code snippet does the following:
First, it extracts the native buffer handle from the message using a static method provided by the vendor backend. Vendors are free to shape this conversion API however they choose, but are encouraged to provide a static method interface for ease of use.
It then allocates the output message to be published using another vendor-specific interface. Note that this allocation creates an empty buffer: it only sets up the relationship between output_msg.data and the vendor_buffer_backend by creating an instance of the backend buffer and registering it in the impl field of the rosidl_runtime_cpp::Buffer<T> class.
The native handle from the output message is also extracted, so it can be used with the native interfaces provided.
Afterwards, it performs a native operation on the input data and assigns the result to the output data. Note that this happens on the vendor-native data types, but since the handles are linked to the buffers, the results show up in the output message without additional code.
Finally, the output message is published the same as any other ROS 2 message. rmw then takes care of vendor-specific serialization; see the following sections for details of that process.
This design keeps any vendor-specific code completely out of rclcpp. All that rclcpp sees and links against is the generic rosidl_runtime_cpp::Buffer<T> class, which has no direct ties to any specific vendor. Hence there is no need for rclcpp to even know about all vendor backends that exist.
It also allows vendors to provide specific interfaces for their respective platforms, allowing them to implement allocation and handling schemes particular to their underlying systems.
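To illustrate what such a vendor-provided interface might look like, here is a hypothetical sketch of the pieces a backend could implement: a per-buffer implementation object plus static convenience entry points. All names here are invented; in this trivial CPU reference variant, the "descriptor" is simply the raw payload itself.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Per-buffer object implemented by the vendor plugin.
class BackendBuffer {
public:
  virtual ~BackendBuffer() = default;
  virtual std::size_t size() const = 0;
  // Serialize into a backend-specific descriptor for transport.
  virtual std::vector<uint8_t> serialize_descriptor() const = 0;
};

// Trivial CPU reference backend: descriptor == raw payload.
class CpuBackendBuffer : public BackendBuffer {
public:
  explicit CpuBackendBuffer(std::vector<uint8_t> bytes)
  : bytes_(std::move(bytes)) {}
  std::size_t size() const override { return bytes_.size(); }
  std::vector<uint8_t> serialize_descriptor() const override { return bytes_; }

private:
  std::vector<uint8_t> bytes_;
};

// Static-method entry points, mirroring the
// vendor_buffer_backend::from_buffer / ::allocate style shown above.
struct cpu_buffer_backend {
  static std::shared_ptr<BackendBuffer> allocate(std::vector<uint8_t> bytes) {
    return std::make_shared<CpuBackendBuffer>(std::move(bytes));
  }
};
```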
A similar type would exist for rclpy and rclrs. We anticipate both of those to be easier to implement, thanks to the duck typing facilities in rclpy and the trait-based object system in rclrs, respectively, both of which make drop-in compatible types much easier to build.
Backends as plugins
Backends are implemented as plugins using ROS’s pluginlib. On startup, each rmw instance scans the system for available backend-compatible plugins and registers them through pluginlib.
A standard backend implementation using CPU memory to offer std::vector<T> compatibility is provided by default through the ROS 2 distribution, to ensure that a CPU implementation is always available.
Additional vendor-specific plugins are implemented by the respective hardware vendors. For example, NVIDIA would implement and provide a CUDA backend, while AMD might implement and provide a ROCm backend.
Backends can either be distributed as individual packages, or be pre-installed on the target hardware. As an example, the NVIDIA Jetson systems would likely have a CUDA backend pre-installed as part of their system image.
Instances of rosidl_runtime_cpp::Buffer<T> are tied to a particular backend at allocation time, as illustrated in the section above.
Integration with rmw
rmw implementations can choose to integrate with vendor backends to provide accelerated transports through them. rmw implementations that do not integrate with backends, including all existing legacy implementations, automatically fall back to converting all data to CPU memory and continue working without any changes.
An rmw implementation that chooses to integrate with vendor backends does the following. At graph startup, when publishers and subscribers are created, each endpoint shares its list of installed backends, alongside vendor-specific data to establish any required side channels. Dedicated channels for passing backend-enabled messages are then established based on four data points:
- The message type for determining if it contains any buffer-typed fields
- The list of backends supported by the current endpoint
- The list of backends supported by the associated endpoint on the other side
- The distance between the two endpoints (same process, different process, across a network etc.)
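A decision based on these four data points could be sketched as follows. This is purely illustrative: the real logic is internal to each rmw implementation, and the assumption that non-CPU backends only support on-machine transport is ours, not part of the proposal.

```cpp
#include <algorithm>
#include <optional>
#include <string>
#include <vector>

enum class Distance { SameProcess, SameMachine, Network };

struct Endpoint {
  std::vector<std::string> backends;  // backend names supported by this endpoint
};

// Returns the backend to use for buffer-typed fields, or std::nullopt
// to fall back to plain bytewise CPU serialization.
std::optional<std::string> pick_backend(
    bool message_has_buffer_fields,
    const Endpoint & pub, const Endpoint & sub, Distance distance) {
  if (!message_has_buffer_fields) {
    return std::nullopt;  // nothing to accelerate
  }
  for (const auto & name : pub.backends) {
    bool shared = std::find(sub.backends.begin(), sub.backends.end(), name)
                  != sub.backends.end();
    // Illustrative assumption: only the CPU backend can cross a network.
    bool distance_ok = (distance != Distance::Network) || (name == "cpu");
    if (shared && distance_ok) {
      return name;
    }
  }
  return std::nullopt;  // no common backend: plain CPU serialization
}
```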
rmw can choose any mechanism it wants to perform this task, since this step happens entirely within the currently loaded rmw implementation. Side channel creation is entirely hidden inside the vendor plugins, and not visible to rmw.
When publishing a message type that contains buffer-typed fields, if the publisher and the subscriber(s) share the same supported backend list, and the backend implements a serialization method matching the distance to the subscriber(s), then the backend can use a custom serialization method instead of serializing the payload of the buffer bytewise.
The backend is then free to serialize into a ROS message type of its choice. This backend-custom message type is called a descriptor. It should contain all information the backend needs to deserialize the message at the subscriber side, and reconstruct the buffer. This descriptor message may contain pointer values, virtual memory handles, IPC handles or even the raw payload if the backend chooses to send that data through rmw.
The descriptor message can be inspected as usual if desired, since it is just a normal ROS 2 message, but deserializing it requires the matching backend. However, since the publisher knows the backends available to the subscriber(s), a subscriber is guaranteed to receive a descriptor message only if it is able to deserialize it.
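For illustration, a descriptor for a shared-memory style backend might look like the following message definition. Every field name here is invented, as each vendor defines its own descriptor type:

```
# Hypothetical descriptor message for a shared-memory backend.
uint64 ipc_handle      # exported device-memory or IPC handle
uint64 offset_bytes    # offset of the payload within the shared allocation
uint64 size_bytes      # payload size, needed to reconstruct the buffer
uint8[] inline_payload # optional raw bytes, if the backend falls back to sending data through rmw
```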
Integration with rosidl
While the above sections show the implications visible in rclcpp, the bulk of the changes necessary to make this happen go into rosidl: it is rosidl that generates the C++ message structures, and hence rosidl that maps uint8[] to the Buffer type instead of std::vector.
Layering semantics on top
Having only a buffer is not very useful, as most robotics data has higher-level semantics, like images, tensors, or point clouds.
However, all of those data types ultimately map to one or more large, contiguous regions of memory, in CPU or accelerator memory.
We also observe that a healthy ecosystem of higher-level abstractions already exists: PCL for point clouds, Torch for tensor handling, and so on. Hence, we propose not to replicate those ecosystems in ROS, but instead to let those ecosystems bridge into ROS and use the buffer abstraction as their backend for storage and transport.
As a demonstration of this, we are providing a Torch backend that allows linking (Py)Torch tensors to the ROS buffers. This allows users to use the rich ecosystem of Torch to perform tensor operations, while relying on the ROS buffers to provide accelerator-native storage and zero-copy transport between nodes, even across processes and chips if supported by the backend.
The Torch backend does not provide a raw buffer type itself, but relies on vendors implementing backends for their platforms (CUDA, ROCm, TPUs etc.). The Torch backend then depends on the vendor-specific backends and provides the binding of the low-level buffers to the Torch tensors. The coupling between the Torch backend and the hardware vendor buffer types is loose: it is not visible from the node’s code, but is established after the fact.
From a developer’s perspective, all of this is hidden. All a developer writing a node does is interact with a Torch tensor, and it maps to the correct backend available on the current hardware automatically. An example of such code could look like this:
void topic_callback(const msg::MessageWithTensor & input_msg) {
  // extract tensor from input message
  torch::Tensor input_tensor =
    torch_backend::from_buffer(input_msg.tensor);
  // allocate output message
  msg::MessageWithTensor output_msg =
    torch_backend::allocate<msg::MessageWithTensor>();
  // get handle to allocated output tensor
  torch::Tensor & output_tensor =
    torch_backend::from_buffer(output_msg.tensor);
  // perform some torch operations
  output_tensor = torch::abs(input_tensor);
  // publish message as usual
  publisher_.publish(output_msg);
}
Note how this code segment uses Torch-native data types (torch::Tensor) and performs Torch-native operations on the tensors (in this case, torch::abs). There is no mention of any hardware backend in the code.
By keeping the coupling loose, this node can run unmodified on NVIDIA, AMD, TPU or even CPU hardware, with the framework, in this case Torch, being mapped to the correct hardware and receiving locally available acceleration for free.
Prior work
NITROS
https://docs.nvidia.com/learning/physical-ai/getting-started-with-isaac-ros/latest/an-introduction-to-ai-based-robot-development-with-isaac-ros/05-what-is-nitros.html
NITROS is NVIDIA’s implementation of a similar design based on type negotiation. It is specific to NVIDIA and not broadly compatible, nor is it currently possible to layer hardware-agnostic frameworks like Torch on top of it.
AgnoCast
https://github.com/tier4/agnocast
AgnoCast creates a zero-copy regime for CPU data. However, it is limited to CPU data, and does not have a plugin architecture for accelerator memory regions. It also requires kernel modifications, which some may find intrusive.
Future work
NVIDIA has been working on this proposal, alongside a prototype implementation of the mechanisms described above. We are working on CPU, CUDA and Torch backends, as well as integration with the Zenoh rmw implementation.
The prototype will move into a branch on the respective ROS repositories in the next two weeks, and development toward a full-fledged implementation will continue in public.
In parallel, a dedicated working group tasked with formalizing this effort is being formed, with the goal of reaching consensus on the design and getting the required changes into ROS 2 Lyrical.