From Words💬 to Wheels🚗: Automated Style-Customized Policy Generation for Autonomous Driving

1Data Science and Analytics Thrust, The Hong Kong University of Science and Technology (Guangzhou), China
2Intelligent Transportation Thrust, The Hong Kong University of Science and Technology (Guangzhou), China
3Department of Automation, Tianjin University, China
4Shanghai Artificial Intelligence Laboratory, China


The Words2Wheels framework automatically customizes driving policies based on user commands.
It employs a Style-Customized Reward Function (Style Reward) to generate a Style-Customized Driving Policy (Style Policy).

Abstract

Autonomous driving technology has witnessed rapid advancements, with foundation models improving interactivity and user experiences. However, current autonomous vehicles (AVs) face significant limitations in delivering command-based driving styles. Most existing methods either rely on predefined driving styles that require expert input or use data-driven techniques like Inverse Reinforcement Learning to extract styles from driving data. These approaches, though effective in some cases, face challenges: difficulty obtaining specific driving data for style matching (e.g., in Robotaxis), inability to align driving style metrics with user preferences, and limitations to pre-existing styles, restricting customization and generalization to new commands. This paper introduces Words2Wheels, a framework that automatically generates customized driving policies based on natural language user commands. Words2Wheels employs a Style-Customized Reward Function to generate a Style-Customized Driving Policy without relying on prior driving data. By leveraging large language models and a Driving Style Database, the framework efficiently retrieves, adapts, and generalizes driving styles. A Statistical Evaluation module ensures alignment with user preferences. Experimental results demonstrate that Words2Wheels outperforms existing methods in accuracy, generalization, and adaptability, offering a novel solution for customized AV driving behavior.

Words2Wheels Framework


  1. Workflow of Words2Wheels: When a natural language command is received, the system matches it with a style from the database that is ready for immediate online use. Style Reward generation and policy training run simultaneously in the backend, producing a new Style Policy that may outperform, and then replace, the existing one;
  2. Driving Style Database: This conceptual database stores Style Rewards (initially from both data-driven and human-designed methods), Style Policies, and their statistics. It manages the growing variety of driving styles and supports automated policy customization;
  3. Statistical Evaluation Module: This module ensures that the generated driving styles closely align with user commands by evaluating them against natural driving behaviors.
As Words2Wheels operates, new Style Policies generated by the LLM expand the database, creating a broader range of driving styles.

Automated Policy Generation


Words2Wheels generates a driving policy from user commands by applying Retrieval-Augmented Generation (RAG) principles to select relevant Style Rewards from the Driving Style Database, which keeps the LLM focused and improves task performance. The system first checks the command against past commands handled by the Statistical Evaluation module; if no suitable match is found, the LLM generates a new reward function. Candidate policies are trained and evaluated for effectiveness, and a new policy is produced via Reinforcement Learning only when necessary. Automating this selection process minimizes subjective bias and improves efficiency.
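The loop below is a minimal Python sketch of this retrieve-or-generate process, assuming a simple in-memory database and injected helper functions (embed, generate_style_reward, train_policy); these names are illustrative assumptions, not the released code.

from dataclasses import dataclass

@dataclass
class StyleEntry:
    command: str       # the user command this style was created for
    reward_code: str   # Style Reward, stored as source code
    policy_path: str   # path to the pre-trained Style Policy
    stats: dict        # driving statistics used by Statistical Evaluation

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return num / den if den else 0.0

def customize_policy(command, database, embed, generate_style_reward, train_policy,
                     match_threshold=0.85):
    """Return a Style Policy for `command`, reusing the database when possible."""
    query = embed(command)
    # Retrieval-augmented lookup: rank stored styles by similarity to the command.
    ranked = sorted(database, key=lambda e: cosine(query, embed(e.command)), reverse=True)
    if ranked and cosine(query, embed(ranked[0].command)) >= match_threshold:
        return ranked[0].policy_path   # a stored Style Policy can be served immediately
    # No close match: ask the LLM for a new Style Reward, using top entries as templates.
    reward_code = generate_style_reward(command, templates=[e.reward_code for e in ranked[:3]])
    policy_path = train_policy(reward_code)   # RL training runs in the backend
    database.append(StyleEntry(command, reward_code, policy_path, stats={}))
    return policy_path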




Driving Style Database


The Driving Style Database in Words2Wheels stores Style Rewards, Style Policies, and analytical data. Style Rewards are stored as program code, and Style Policies as pre-trained neural networks. Analytical data is saved in JSON format and can be embedded as high-dimensional vectors for efficient retrieval. The LLM can select existing Style Policies or generate new ones using the stored reward functions as templates; these pre-existing reward functions, together with prior research on driving reward design, improve the quality of the LLM's output. As Words2Wheels operates, newly generated Style Policies expand the database, improving efficiency and reducing reliance on RL training. User commands are also stored to support the fuzzy memory functionality.
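As an illustration, one database entry might be laid out as below; the field names and values are assumptions consistent with the description above (Style Reward as code, Style Policy as a checkpoint, analytical data in JSON), not the authors' actual schema.

import json

entry = {
    "style_name": "conservative_v1",
    "source": "human-designed",                     # or "data-driven"
    "reward_file": "rewards/conservative_v1.py",    # Style Reward stored as code
    "policy_file": "policies/conservative_v1.pt",   # pre-trained Style Policy
    "stats": {                                      # analytical data from Statistical Evaluation
        "mean_speed_mps": 9.8,
        "mean_gap_m": 32.5,
        "max_abs_accel_mps2": 1.6
    },
    "past_commands": ["Safety first. I have plenty of time"]   # supports fuzzy memory
}

# The JSON record (or an embedding of it) can then be indexed for retrieval.
print(json.dumps(entry, indent=2))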




Statistical Evaluation


The Statistical Evaluation module produces data on driving behavior so that the LLM can assess how well a Style Policy aligns with a user command. The module simulates driving on a reserved test dataset and collects metrics such as speed, acceleration, and spacing, chosen with reference to prior research. Using a Chain-of-Thought approach, the LLM selects the relevant metrics and compares each policy against natural driving data. Customizing the test set enables extended functionality and more precise analyses, such as fine-tuning based on a specific Style Policy or spatio-temporal filtering.
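The snippet below sketches the kind of statistics such a module could compute from a simulated rollout and how they might be normalized against natural driving behavior; the specific metric set and the z-score normalization are assumptions, not the paper's exact procedure.

import numpy as np

def rollout_statistics(speeds, gaps, dt=0.1):
    """Summarize one simulated trajectory: speeds in m/s and spacing gaps in m, sampled every dt seconds."""
    speeds = np.asarray(speeds, dtype=float)
    gaps = np.asarray(gaps, dtype=float)
    accel = np.diff(speeds) / dt
    return {
        "mean_speed_mps": float(speeds.mean()),
        "mean_gap_m": float(gaps.mean()),
        "max_abs_accel_mps2": float(np.abs(accel).max()),
    }

def normalize_against_natural(stats, natural_mean, natural_std):
    """Express each metric as a z-score relative to natural driving behavior,
    giving the LLM a consistent scale for judging 'more aggressive' or 'more conservative'."""
    return {k: (v - natural_mean[k]) / natural_std[k] for k, v in stats.items()}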

Results

Customizing Driving Style


We evaluated whether the driving styles produced by Words2Wheels align with user commands, starting from an initial database of 8 styles. The styles were derived from data-driven and human-designed reward functions and trained with the Proximal Policy Optimization (PPO) algorithm. We tested commands for aggressive, normal, and conservative driving, running each 5 times and averaging the results, with style metrics normalized against natural driving behavior data. Aggressive styles drove at higher speeds, while conservative styles maintained larger gaps, showing that Words2Wheels effectively adapts to user-specified driving styles.
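For intuition, the sketch below shows how a Style Reward could encode the aggressive-versus-conservative trade-off as weighted speed, gap, and comfort terms; the weights and terms are illustrative assumptions, not the 8 reward functions actually used in this experiment.

# Illustrative Style Reward: larger w_speed favors aggressive driving,
# larger w_gap and min_gap favor conservative driving.
def style_reward(speed, gap, accel, target_speed=15.0, min_gap=2.0,
                 w_speed=1.0, w_gap=1.0, w_comfort=0.5):
    speed_term = -abs(speed - target_speed) / target_speed   # track the desired speed
    gap_term = min(gap - min_gap, 0.0)                       # penalize closing below the minimum gap
    comfort_term = -abs(accel)                                # penalize harsh acceleration or braking
    return w_speed * speed_term + w_gap * gap_term + w_comfort * comfort_term

# An aggressive style might use w_speed=2.0, w_gap=0.5 with a higher target_speed,
# while a conservative style might use w_speed=0.5, w_gap=2.0 with a larger min_gap.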




Generation Capability


In this experiment, Words2Wheels was tested on its ability to generate driving styles from user commands even though the database contained only mismatched styles. Commands such as "I'm going to be late for the train" led to more aggressive driving, while "Safety first. I have plenty of time" resulted in more conservative driving. This demonstrates Words2Wheels' capacity to create styles that align with user commands, which is crucial for adapting to new scenarios.




Human-in-the-Loop Comparisons


Research suggests that human feedback is more consistent than direct quantification, which is crucial for a fair evaluation. We built a visualization tool for comparing driving-style judgments and benchmarked Words2Wheels against the Intelligent Driver Model (IDM), with 10 volunteers assessing 1,000 events. Words2Wheels aligned better with the user command in 72.0% of cases, tied with IDM in 18.8%, and was surpassed by IDM in 9.2%, underscoring its adaptability to user preferences.




Generalization Capability


To evaluate how the phrasing of user commands affects driving style customization, we categorized commands into three levels of directness and tested 40 specific commands, assessing the LLM's interpretation, the logical soundness of its reasoning, and the reasonableness of the generated reward functions. Words2Wheels performed well overall, especially on Level I commands. See the Supplementary for more details.




Fuzzy Memory


Fuzzy memory is a practical feature of Words2Wheels. We tested it by storing predefined commands in the Driving Style Database and checking recall with similar inputs. The system successfully matched most inputs, demonstrating strong adaptability in recognizing and responding to varied natural language instructions, which improves the user experience and overall system efficiency.
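A minimal sketch of such a fuzzy recall step is shown below, matching a new command to stored ones by embedding similarity; the embedding function and the 0.8 threshold are assumptions, not the system's actual configuration.

def fuzzy_recall(new_command, stored_commands, embed, threshold=0.8):
    """Return the stored command most similar to `new_command`, or None if nothing is close."""
    def cos(a, b):
        num = sum(x * y for x, y in zip(a, b))
        den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
        return num / den if den else 0.0

    q = embed(new_command)
    scored = [(cos(q, embed(c)), c) for c in stored_commands]
    best_score, best_cmd = max(scored, default=(0.0, None))
    return best_cmd if best_score >= threshold else None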

BibTeX

@misc{han2024words,
  title={From Words to Wheels: Automated Style-Customized Policy Generation for Autonomous Driving},
  author={Xu Han and Xianda Chen and Zhenghan Cai and Pinlong Cai and Meixin Zhu and Xiaowen Chu},
  year={2024},
  eprint={2409.11694},
  archivePrefix={arXiv},
  primaryClass={cs.RO}
}