Dual-AEB: Synergizing Rule-Based and Multimodal Large Language Models for Effective Emergency Braking (2024-10)


Abstract

Automatic Emergency Braking (AEB) systems are a crucial component in ensuring the safety of passengers in autonomous vehicles. Conventional AEB systems primarily rely on closed-set perception modules to recognize traffic conditions and assess collision risks. To enhance the adaptability of AEB systems in open scenarios, we propose Dual-AEB, a system that combines an advanced multimodal large language model (MLLM) for comprehensive scene understanding with a conventional rule-based rapid AEB to ensure quick response times. To the best of our knowledge, Dual-AEB is the first method to incorporate MLLMs within AEB systems. Through extensive experimentation, we have validated the effectiveness of our method. Codes will be publicly available at https://github.com/ChipsICU/Dual-AEB.

I. INTRODUCTION

The Autonomous Emergency Braking (AEB) system is a critical safety feature in autonomous vehicles, designed to mitigate or prevent collisions by automatically activating the brakes when a potential collision is detected [1]. Numerous studies [1]–[5] have demonstrated the effectiveness of AEB systems, with reductions in rear-end collisions ranging from 25% to 50%.
Conventionally, AEB systems can be roughly categorized into two types: decision-making-only methods [6]–[15] and end-to-end methods [16], [17]. Decision-making-only methods use perception results of predefined perception categories (e.g., people, cars, bicycles) and apply rule-based techniques [10]–[12], [18] or deep reinforcement learning [13], [14] for braking decisions. End-to-end methods [16], [17], meanwhile, process raw sensory data directly to inform AEB decisions, allowing the system to benefit from comprehensive sensory inputs. These methods generally ensure safety in most driving scenarios.

Two related papers worth studying:
"Emergency-braking distance prediction using deep learning", 2021
"Fully convolutional neural network for vehicle speed and emergency-brake prediction", 2024

However, their ability to handle complex driving situations is limited due to a lack of comprehensive scene understanding. For example, in Fig. 1 (a), the scene describes a pedestrian positioned in the ego vehicle’s blind spot, intending to cross at a green-light intersection. Typically, decision-making-only methods would not activate braking in this scenario due to the absence of pedestrian perception information, making it impossible to predict an impending collision. Similarly, while end-to-end methods process raw sensory data, they often lack the reasoning capacity to interpret indirect cues—such as the illuminated brake lights on the vehicle to the left of the ego vehicle—that may indicate a potential hazard ahead. In Fig. 1 (b), a truck with a facial advertisement is driving on the ego vehicle’s left. Both decision-making-only and end-to-end methods may misinterpret the advertisement as a pedestrian, potentially triggering the AEB system and causing unnecessary braking. A truly effective AEB system should incorporate comprehensive scene understanding, enabling it to differentiate between real hazards and non-threatening elements, thereby ensuring appropriate braking responses.
Fig. 1. Conventional AEB systems often fail (a) when a pedestrian must be detected early so the vehicle can brake in advance and avoid danger, and (b) when erroneous perception triggers the AEB unnecessarily. Such scenarios challenge conventional AEB methods.
To address these challenges, we propose the Dual-AEB system, which offers the following main advantages: (1) Comprehensive Scene Understanding: The Dual-AEB system integrates advanced Multimodal Large Language Models (MLLMs) to achieve a deep understanding of the driving environment. By processing comprehensive data—including environmental conditions, critical perception information, and ego-vehicle states—MLLMs enhance overall situational awareness while reducing the risk of false positives and missed detections. (2) Optimized Response Time: The Dual-AEB system leverages the strengths of both the conventional AEB module and MLLM components. The conventional AEB ensures a quick initial response to imminent threats, while the MLLM component provides detailed analyses in complex scenarios. This synergistic approach minimizes response time and maximizes accuracy during critical moments. (3) Flexible Modular Design: The Dual-AEB system’s modular architecture facilitates seamless upgrades and component replacements as technology advances. This feature guarantees long-term efficacy and continuous improvement, preparing the system to meet future challenges.

To summarize, our contributions are as follows:
• We present Dual-AEB, the first work that integrates MLLMs to enhance conventional AEB systems by leveraging their comprehensive scene understanding to improve braking decisions.
• Our method is validated through extensive experiments on both open-loop and closed-loop benchmarks, demonstrating its effectiveness.
• Qualitative analysis on our in-house real-world scenario dataset further confirms the practicality of deploying this system.

II. RELATED WORK

A. Autonomous Emergency Braking (AEB)

AEB systems are essential for vehicle safety, as they autonomously detect risks and activate the brakes to mitigate or avoid collisions, significantly reducing traffic accident rates [1]–[5], [19]–[21]. Over time, AEB systems have evolved to utilize either decision-making-only or end-to-end methods, ensuring safety in general scenarios.
Decision-Making-Only Methods. These methods typically rely on a limited set of closed-set perception results, such as detecting pedestrians, vehicles, and bicycles, to determine the necessity of braking actions. These decisions are often based on metrics like Time To Collision (TTC) [10]–[12], [18] or are designed using control algorithms [22]–[25], and sometimes involve learning-based approaches [13]–[15]. While these methods are straightforward and computationally efficient, they suffer from significant limitations in complex, dynamic environments [6]–[9], [26], [27]. Relying on a predefined closed-set of objects can result in the omission of vital environmental information, potentially leading to the failure of the AEB in critical situations.
End-to-End Methods. End-to-end methods bypass traditional decision-making pipelines by directly using raw perception data for AEB decisions [14], [16], [17], [28]–[30]. They offer flexibility and can continuously improve with more data [31], enabling the detection and response to hazards that rule-based systems might miss. However, these approaches face challenges, including the need for large labeled datasets, inconsistent performance on unseen scenarios [32], [33], susceptibility to overfitting, and opaque decision-making processes [34], which hinder their reliability in critical and complex situations.
Overall, both decision-making-only and end-to-end methods face challenges in handling complex driving scenarios. Decision-making-only methods rely on predefined perception categories, limiting their ability to respond to unexpected elements. End-to-end AEB systems, while processing raw sensory data, often struggle with reasoning through complex relationships in the scene. To provide effective braking decisions in complex driving scenarios, a comprehensive understanding of the scene is required for AEB systems.

B. Multimodal Large Language Models (MLLMs)

The advent of large models such as ChatGPT [35], [36] and Gemini has brought us closer to achieving trustworthy autonomous driving [37]–[42] and robotics [43]–[50]. Works like GPT-Driver [37]–[41], [51] have demonstrated superior performance over previous methods through prompt-tuning and fine-tuning techniques in autonomous driving benchmarks. DriveVLM [42] introduces a dual system design, akin to the human brain’s slow and fast thinking processes, which efficiently adapts to varying complexities in driving scenarios. This innovative approach helps end-to-end autonomous driving models address corner cases and enhances the overall system’s performance ceiling, closely aligning with the focus of our work. Inspired by these works, we integrate MLLMs into AEB systems, aiming to enhance their ability for comprehensive scene understanding and, in turn, provide more effective braking decisions.

III. DUAL-AEB

A. Overview

The entire workflow of the Dual-AEB system is illustrated in Fig. 2. This system consists of two main components: the quick module, called the rule-based AEB, and the slow module, named the MLLM-powered AEB. The rule-based AEB, using conventional rule-based methods, is responsible for the initial decision. When triggered, the quick module packages this initial decision into text, named the AEB-Prompt, and sends it to the slow module, the MLLM-powered AEB. The slow module, utilizing the MLLM to analyze the received information, makes the final decision, either confirming or adjusting the quick module's initial decision. This framework can be seamlessly integrated with other autonomous driving algorithms, making it a flexible framework that can be easily expanded or updated.
Fig. 2. Overview of our method. Dual-AEB (a) consists of a quick (rule-based AEB) module and a slow (MLLM-powered AEB) module. After receiving information from the autonomous driving models (AD models), the braking signal can either be output directly by (c), as shown by the dashed line, or sent to (b), as shown by the solid line, where the MLLM-powered AEB evaluates it and decides whether to confirm or adjust it.

B. Rule-Based AEB Module

The rule-based AEB module receives perception and planning results from the autonomous driving modules (AD models), including the bounding boxes of agents (B), the ego vehicle's planned trajectory (P), and the drivable area (D). This information is used to evaluate potential collisions within the trajectory horizon (H) by utilizing the kinematic bicycle model (K) [52], with a step size (Δt) for temporal progression. This module then uses Time To Collision (TTC) and Collision (C) [53] as evaluation metrics to assess the need for emergency braking [54]. The decision to initiate braking is then made by comparing the calculated trigger time (t_trigger) with a predefined threshold (t_threshold). The detailed steps of this process are outlined in Algorithm 1.
[Algorithm 1: rule-based AEB braking decision]
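The trigger logic outlined above can be illustrated roughly as follows. This is a minimal sketch, not the paper's exact Algorithm 1: the rollout assumes constant speed and heading, agents are reduced to circular footprints of an assumed `radius`, and the default `dt`, `horizon`, and `t_threshold` values only echo the 0.2 s step and 3 s horizon from the implementation details.

```python
import math

def rollout_ego(state, dt, horizon):
    """Propagate the ego state (x, y, yaw, v) with a simplified
    kinematic bicycle model, assuming constant speed and steering."""
    x, y, yaw, v = state
    traj = []
    for _ in range(int(round(horizon / dt))):
        x += v * math.cos(yaw) * dt
        y += v * math.sin(yaw) * dt
        traj.append((x, y))
    return traj

def ttc_trigger(state, agents, dt=0.2, horizon=3.0, t_threshold=1.5, radius=2.0):
    """Return (brake, t_trigger): the earliest predicted collision time
    against agent centers (treated as circles), compared with a threshold."""
    for i, (x, y) in enumerate(rollout_ego(state, dt, horizon)):
        t = (i + 1) * dt
        for ax, ay in agents:
            if math.hypot(x - ax, y - ay) < radius:
                return t <= t_threshold, t  # collision predicted at time t
    return False, None                      # no collision within the horizon
```

For an ego vehicle at the origin heading along +x at 10 m/s and an agent 10 m ahead, the sketch predicts a collision at t = 1.0 s and triggers braking.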

C. MLLM-Powered AEB Module

The MLLM-Powered AEB module processes a series of sequential front-view driving images accompanied by predefined task prompts. It consists of a trained MLLM backbone and a projection network. The MLLM backbone analyzes these inputs and generates textual responses, termed Text Generation (TG), which include an <AEB> token to indicate when emergency braking is needed. The projection network then utilizes the hidden state associated with the <AEB> token, applying a linear layer followed by a sigmoid activation to output a braking signal—referred to as Braking Signal Generation (BSG)—in the range of 0 to 1.
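The BSG head is just a linear layer plus a sigmoid over the `<AEB>` token's hidden state. A minimal sketch, where `hidden_state`, `weights`, and `bias` are illustrative placeholders rather than the paper's trained parameters:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def braking_signal(hidden_state, weights, bias):
    """Map the <AEB> token's hidden state to a braking signal in (0, 1)
    via a single linear layer followed by a sigmoid (the BSG head)."""
    logit = sum(h * w for h, w in zip(hidden_state, weights)) + bias
    return sigmoid(logit)
```

With zero weights the output sits at 0.5; a strongly positive logit pushes the signal toward 1 (brake).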

D. Training Data Construction

The data required for Dual-AEB training includes front-view video inputs, question-answer pairs, and braking signals. We generated this data for the MLLM-powered AEB by utilizing the MM-AU [55] dataset, which captures real-world accident scenarios, and the Bench2Drive [56] dataset, featuring complex simulation scenarios from CARLA [57]. Videos are directly available from both datasets, and Bench2Drive also provides braking signals. For MM-AU, braking signals are derived by analyzing the chronological sequence of traffic accidents. For instance, in video sequences where a collision is imminent, emergency braking is required, and in such cases, the ground truth for the braking signal is set to 1.
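One plausible way to turn an annotated collision time into per-frame braking labels, in the spirit of the MM-AU derivation above; the 2-second window and the frame timestamps are illustrative assumptions, not the dataset's actual annotation protocol:

```python
def label_braking(frame_times, collision_time, window=2.0):
    """Label each frame 1 if it falls within `window` seconds before the
    annotated collision (emergency braking required), else 0."""
    return [1 if collision_time - window <= t <= collision_time else 0
            for t in frame_times]
```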


For the question-answer pairs, we designed three sub-tasks focused on driving scenarios, critical objects, and decision-making processes, structuring the reasoning process in a step-by-step manner [58], [59]. Using GPT-4, we generated diverse question-answer templates and filled them with ground truth. Details are outlined below.
Scenarios Description. The driving environment significantly influences the complexity of driving tasks [42]. For instance, rural roads are more likely to encounter sudden appearances of animals. Annotations include details on weather variations, visibility, road traction, time of day, and road types. An example of an annotated scenario description question-answer pair is, Q: “What are the environmental details captured in this driving video?” A: “The ego vehicle navigates an arterial roadway under clear, sunny conditions in an urban environment during daylight.”
Critical Objects Description. Annotations for critical objects include details such as 2D bounding boxes in (x min, y min, x max, y max) format, distance to the ego vehicle, traffic signals, and the intentions of surrounding agents. For example, “A black vehicle with a blinking left turn signal, located at [(197, 474), (295, 667)], is 14.26 meters away from the ego vehicle, suggesting it is preparing to make a left turn.” These details provide critical information for the system to determine the precise position of the object within the scene. With this information, Dual-AEB can forecast the object’s future movements, resulting in more accurate and informed decision-making in various scenarios.
Decision Making. In the final sub-task, questions are supplemented with the AEB-Prompt, which represents the initial decision generated by the rule-based AEB module. For example, an AEB-Prompt might be, “Initial decision: A collision with the black vehicle on the left is expected in 1.2 seconds, and I decide to brake.” To prevent the MLLM-powered AEB from becoming overly reliant on the AEB-Prompt, 50% of the training data includes incorrect initial decisions. We categorize AEB actions into three meta-actions: Normal, Early Warning, and Emergency Braking [1]. A possible answer might be, “Early Warning. The presence of a truck and another black car ahead requires heightened awareness. <AEB>”
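The AEB-Prompt packaging and meta-action parsing can be sketched as follows; the exact prompt wording and answer format are assumptions extrapolated from the examples above:

```python
def make_aeb_prompt(ttc, obj):
    """Package the rule-based module's initial decision as text
    (the AEB-Prompt sent to the MLLM); wording is illustrative."""
    return (f"Initial decision: A collision with {obj} is expected "
            f"in {ttc:.1f} seconds, and I decide to brake.")

META_ACTIONS = ("Normal", "Early Warning", "Emergency Braking")

def parse_meta_action(answer):
    """Extract which of the three meta-actions an MLLM answer begins with."""
    for action in META_ACTIONS:
        if answer.startswith(action):
            return action
    return None
```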

E. Training and Inference

During training, AD models are trained on Bench2Drive [56], and the MLLM-powered AEB employs a composite loss function that integrates both TG and BSG to ensure cohesive semantic learning. The overall loss L is defined as:
$$
L = -\sum_{n=1}^{N}\sum_{i=1}^{V} y_{n,i}\log\hat{y}_{n,i} \;-\; \Bigl[\, z\log\hat{z} + (1-z)\log\bigl(1-\hat{z}\bigr) \,\Bigr]
$$
where $\hat{y}_{n,i}$ and $y_{n,i}$ represent the predicted and ground-truth probabilities of word $i$ at position $n$ in the text sequence, with $V$ being the vocabulary size. The variables $z$ and $\hat{z}$ denote the ground-truth and predicted braking signals, respectively.
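Under these definitions, the composite loss is a token-level cross-entropy for Text Generation plus a binary cross-entropy on the braking signal. A small numerical sketch, with `pred_probs` holding the predicted distribution over the vocabulary at each text position:

```python
import math

def dual_aeb_loss(pred_probs, target_ids, z_hat, z):
    """Composite loss: cross-entropy over the text sequence (TG term)
    plus binary cross-entropy on the braking signal (BSG term)."""
    # TG term: -sum over positions of log p(ground-truth token)
    tg = -sum(math.log(probs[t]) for probs, t in zip(pred_probs, target_ids))
    # BSG term: binary cross-entropy between z (label) and z_hat (prediction)
    bsg = -(z * math.log(z_hat) + (1 - z) * math.log(1 - z_hat))
    return tg + bsg
```

For example, a perfectly predicted token with an uncertain braking signal (z_hat = 0.5 against z = 1) leaves only the BSG term, log 2.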
During inference, the rule-based AEB receives perception and planning results from the pre-trained AD model. It can either directly output its braking signal or initiate an interaction with the MLLM-powered AEB via the AEB-Prompt. If interaction is triggered, the MLLM-powered AEB engages in a multi-round Q&A process to address the three sub-tasks. Ultimately, during the decision-making task, it determines whether to confirm or adjust the initial decision made by the rule-based AEB. The final braking signal is then output after passing through the projection network.
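The fast/slow interplay at inference can be sketched as a single control step. Here `query_mllm` stands in for the multi-round Q&A plus projection network, the 2.5 s trigger interval matches the closed-loop setting described in the experiments, and the 0/1 signal encoding is an illustrative assumption:

```python
def dual_aeb_step(t, rule_brake, ttc, query_mllm, last_mllm_t, interval=2.5):
    """One control step: the rule-based fast path always produces a signal;
    every `interval` seconds the MLLM slow path is invoked to confirm or
    adjust it, and its output becomes the final braking signal."""
    signal = 1.0 if rule_brake else 0.0       # fast path: rule-based decision
    if t - last_mllm_t >= interval:
        signal = query_mllm(rule_brake, ttc)  # slow path overrides
        last_mllm_t = t
    return signal, last_mllm_t
```

Between MLLM invocations the rule-based signal passes through unchanged, which is what preserves quick reaction times in imminent-threat frames.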

IV. EXPERIMENTS

A. Datasets

To provide a more comprehensive evaluation of our method, we assess it on two datasets. We perform open-loop evaluations using both MM-AU [55] and Bench2Drive [56], and conduct closed-loop evaluations using the Bench2Drive [56] benchmark. We carefully construct 120,113 samples from MM-AU and 132,922 samples from Bench2Drive, with the proportions of Normal, Early Warning, and Emergency Braking approximately at 1:1:1. Using a 9:1 ratio, we split this data into training and test sets.

B. Metrics

In the open-loop evaluation, we focus on the accuracy of the model’s predicted braking signals and the quality of the generated text. For the Braking Signal Generation (BSG), we utilize standard AEB task metrics [17], namely Precision and Recall. Positive cases refer to situations where AEB activation is necessary, and negative cases to those where it is not needed. True Positives (TP) are instances of correctly triggered Emergency Braking, while True Negatives (TN) are correct non-activations (Early Warning/Normal). Conversely, False Positives (FP) represent erroneous activations, and False Negatives (FN) denote instances where necessary AEB actions are missed. The formulas for Precision and Recall are provided as follows:
Precision = TP / (TP + FP),
Recall = TP / (TP + FN).
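These metrics follow directly from binary brake decisions; a small sketch:

```python
def precision_recall(preds, labels):
    """Compute AEB Precision and Recall from binary brake decisions,
    where 1 means emergency braking is required/triggered."""
    tp = sum(1 for p, l in zip(preds, labels) if p and l)
    fp = sum(1 for p, l in zip(preds, labels) if p and not l)
    fn = sum(1 for p, l in zip(preds, labels) if not p and l)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```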
For the Text Generation (TG) aspect, we evaluate the quality of the generated text using BLEU4 [60], METEOR [61], and ROUGE-L [62] metrics.
In the closed-loop evaluation, our primary focus is on the model’s overall driving performance. We utilize the Driving Score and Success Rate metrics from Bench2Drive [56]. The Driving Score aggregates multiple driving metrics from Bench2Drive into a weighted sum, while the Success Rate measures the proportion of successfully completed scenarios out of the total number of scenarios. Additionally, we introduced the Collision Rate, defined as the average number of collisions per scenario, to better assess the model’s ability to avoid collisions.

C. Implementation Details

For the rule-based AEB module, we instantiate two widely used models, UniAD [63] and VAD [64], to generate perception and planning results, thereby validating the flexibility of our framework. We adopt a step size of 0.2 seconds and a 3-second horizon, aligning with the 3-second future trajectory provided by the Bench2Drive benchmark planner. For the MLLM-powered AEB module, we employ the state-of-the-art LLaVA-OneVision [65] model as the backbone and perform full fine-tuning of all its components. Following the recommendations in [65], a learning rate of $2\times10^{-6}$ is applied to the vision encoder, while a learning rate of $1\times10^{-5}$ is used for the other components.


For closed-loop evaluation, the rule-based AEB module is provided with information from VAD or UniAD and interacts with the MLLM-powered AEB module every 2.5 seconds, and the MLLM-powered AEB is trained on the Bench2Drive dataset before being evaluated in CARLA.
All experiments are conducted on a server equipped with 8 NVIDIA A100 80G GPUs. To ensure the deployability of the framework in real-world scenarios, inference time consumption tests are performed on consumer-grade devices; here, we use an NVIDIA Jetson Orin.

D. Closed-Loop Experimental Results

We leverage Qwen-0.5B as the foundational language model for LLaVA-OneVision to ensure faster response times within the Bench2Drive closed-loop simulation benchmark. Detailed results are provided in Table I. Specifically, with Dual-AEB, VAD's Driving Score increased by 14.74%, while the Success Rate remained unchanged. For UniAD-Base, the Driving Score improved by 6.84%, and the Success Rate increased from 9.54 to 10.00. Among all evaluated models [63], [64], [66]–[69], VAD with Dual-AEB achieved the highest performance in terms of Driving Score.

Dual-AEB helps improve the closed-loop performance of these models by providing effective braking decisions. However, since it does not alter the autonomous driving model’s output trajectory, the improvement on Success Rate is limited. Despite this, the overall driving performance of these end-to-end models is enhanced, demonstrating the potential for Dual-AEB to be integrated into other autonomous driving systems.
Table I: Closed-loop results in Bench2Drive. * denotes expert feature distillation. Metrics include Driving Score (D), Success Rate (S), and the Success Rate improvement (ΔS).

E. Open-Loop Experimental Results

In this part, we aim to evaluate the comprehensive scene understanding capabilities of the MLLM-powered AEB in real-world scenarios. We compare the Precision, Recall, and text generation quality of the trained LLaVA-OneVision model based on Qwen-0.5B and Qwen-7B, taking YOLO-AEB [30] as the baseline. Since the YOLO model processes single images, we take the last frame from each video split as input. We present the quantification results in Table II, and provide qualitative examples in Fig. 3.

See also YOLO-AEB: "Real-time vehicle and distance detection based on improved YOLO v5 network", 2021.

Table II: Open-loop evaluation comparing LLaVA-OneVision models trained on Qwen-0.5B (Q0.5B) and Qwen-7B (Q7B) against the YOLO-AEB [30] baseline, in terms of Precision, Recall, and text-generation quality (B4: BLEU-4, M: METEOR, R: ROUGE-L), on two datasets: MM-AU and B2D (Bench2Drive).
Fig. 3. Qualitative analysis: the MLLM-powered AEB provides reasonable descriptions for different meta-actions. The left side shows ground-truth results and the right side shows predictions.
Compared to YOLO-AEB, MLLM-powered AEB achieves higher Precision and Recall, thanks to the powerful and comprehensive scene understanding capabilities of MLLMs, which enable more effective braking decisions. Additionally, the increase in model scale further contributes to the observed performance improvements.
On the Bench2Drive benchmark, both YOLO and the MLLM-powered AEB demonstrate improved performance compared to their results on MM-AU. This enhancement is likely due to the simpler and more uniform objects in the simulated scenarios, which facilitate the models in learning effective braking behaviors. However, these improvements do not extend to Precision. Specifically, the MLLM’s performance is somewhat hindered because Bench2Drive’s simulation data is generated by an expert model [70] that consistently avoids danger by braking. As a result, some scenarios appear inherently safe, reducing the model’s tendency to initiate braking and leading to lower Precision.

F. Ablation Study

Impact of Different AEB Modules. To assess the effectiveness of the various Dual-AEB modules in providing accurate braking decisions, we conduct an evaluation. The specific results are presented in Table III.
Table III: Impact of different AEB modules in the closed-loop evaluation (R: rule-based AEB, M: MLLM-powered AEB). Metrics include Driving Score (D), Success Rate (S), and Collision Rate (C).

After integrating the rule-based AEB module into VAD and UniAD, both models exhibit a significant reduction in Collision Rate, resulting in an improved Driving Score. Replacing the rule-based AEB with an MLLM-powered AEB, which possesses advanced scene understanding capabilities, further decreases the Collision Rate, highlighting the MLLM’s stronger ability to perceive potential dangers. Finally, with the implementation of the full Dual-AEB, the Driving Score improves once more. This demonstrates that the rule-based AEB can provide rapid responses in dangerous scenarios when the MLLM is not invoked—since it operates intermittently—thereby compensating for the limitations of the MLLM.
Impact of Different Task Modules. In previous experiments, we design three sequential tasks to guide the model’s TG process. This section compares the performance of different task combinations on the MM-AU and Bench2Drive open-loop datasets (Table IV). The results indicate that incorporating additional tasks enhances the inference performance of the MLLM-powered AEB module, with the critical objects description task providing the most substantial improvement. This task allows the model to better understand the intentions of other objects, resulting in more effective braking.
[Table IV]
Impact of Different Trigger Intervals of the MLLM-powered AEB. In previous experiments, we test the MLLM-powered AEB module in Bench2Drive with a 2.5-second trigger interval. Here, we evaluate the impact of different intervals: 0.5, 2.5, and 5 seconds. As shown in Table V, shorter intervals improve Driving Score, but the 0.5-second interval provides only marginal gains and significantly increases inference time. Thus, a 2.5-second interval offers a better balance between performance and efficiency.

Note: the MLLM's inference latency is still quite long.

[Table V]
Comparison of Inference Time Consumption on Different Devices. After conducting frequency analysis, we perform average inference time tests on the NVIDIA Jetson Orin and optimize the models using TensorRT. The results are presented in Table VI. To achieve faster response times, we develop a version of Dual-AEB that executes only the decision-making task (Dual-AEB-S), based on the original Dual-AEB that executes all tasks (Dual-AEB-F). By combining trigger time with accelerated inference, we further reduce the average inference time.

G. Case Study

MM-AU comes from network dashcam recordings, and Bench2Drive originates from simulations. There are still some differences between them and real-world driving data. To further explore the capabilities of our model in real-world driving scenarios, we conduct tests on our in-house dataset, which is collected from a vehicle equipped with a complete autonomous driving hardware and software system. Fig. 4 shows several case results, which are similar to the scenarios mentioned in Fig. 1.
Fig. 4. Qualitative analysis: Dual-AEB provides reasonable descriptions on our in-house dataset. Orange marks the scenario description task, blue the object description task, and green the decision-making task.

V. CONCLUSION

In this study, we present Dual-AEB, an innovative method that integrates conventional rule-based modules with advanced Multimodal Large Language Models (MLLMs) to enhance autonomous emergency braking (AEB). The conventional component ensures rapid initial responses, while the MLLM component enhances decision accuracy by processing complex environmental data and ego-vehicle states. Evaluations on open-loop and closed-loop benchmarks demonstrate the effectiveness of this system in providing robust braking strategies. Further testing on real-world scenarios confirms the practical applicability and strong performance of the Dual-AEB system. Enhancing precision in real-world applications is the focus of our future work.
