CXRLMaskablePPONode: RL Model for Multi-Robot Applications
The CXRLMaskablePPONode class provides a ROS 2 node integrated with a
maskable, multi-robot Proximal Policy Optimization (PPO) agent
(MultiRobotMaskablePPO). This allows training and executing reinforcement
learning policies in environments with multiple robots while automatically
handling invalid action masking.
It extends the CXRLBaseNode and integrates a custom maskable PPO
implementation compatible with Stable-Baselines3 and SB3-Contrib.
Key features:
Multi-robot support with configurable number of parallel agents.
Maskable PPO allowing invalid actions to be ignored during training.
Time-based or step-based rollout collection.
Integration with ROS 2 for lifecycle, parameter management, and execution.
Automatic logging, checkpointing, and training termination.
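As a rough illustration of what invalid action masking means, the following sketch sets the logits of invalid actions to negative infinity before applying a softmax, so those actions receive zero probability. It is a pure-Python toy and not the sb3_contrib implementation; the function name masked_probabilities is hypothetical.

```python
import math

def masked_probabilities(logits, action_mask):
    """Softmax over logits where invalid actions (mask False) get probability 0."""
    if not any(action_mask):
        raise ValueError("at least one action must be valid")
    # Invalid actions are pushed to -inf so exp(...) evaluates to 0.0.
    masked = [l if valid else -math.inf for l, valid in zip(logits, action_mask)]
    top = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(m - top) for m in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Action 1 has the highest logit but is masked out as invalid.
probs = masked_probabilities([1.0, 2.0, 0.5], [True, False, True])
```

Because the masked action has probability zero, it can never be sampled, and the remaining probabilities still sum to one.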
MultiRobotMaskablePPO
The MultiRobotMaskablePPO class is a custom extension of
MaskableActorCriticPolicy from sb3_contrib. It supports multi-robot environments and
invalid action masking.
Key points:
Rollout collection uses multithreading to simulate multiple robots in parallel. Each thread executes actions for a specific robot and steps the environment concurrently, enabling faster collection of experiences while respecting per-robot action masks.
Supports time_based or step_based rollout collection.
Integrates with MultiRobotMaskableRolloutBuffer and MultiRobotMaskableDictRolloutBuffer for action masking.
Computes returns and advantages, accounting for terminal states even if episodes are truncated.
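The threaded rollout collection described above can be sketched roughly as follows. This is a toy under stated assumptions, not the actual MultiRobotMaskablePPO code: the environment step and the policy are replaced by stand-ins, and the names collect_for_robot, toy_action_mask, and buffer are hypothetical.

```python
import random
import threading

N_ROBOTS = 3   # mirrors the documented default of model.n_robots
N_STEPS = 5    # steps collected per robot in this toy example
N_ACTIONS = 4  # size of the toy discrete action space

buffer = []                     # shared storage, stand-in for a rollout buffer
buffer_lock = threading.Lock()  # protects concurrent appends

def toy_action_mask(robot_id):
    """Hypothetical mask: the action whose index equals robot_id is invalid."""
    return [a != robot_id for a in range(N_ACTIONS)]

def collect_for_robot(robot_id):
    rng = random.Random(robot_id)  # per-thread RNG
    for step in range(N_STEPS):
        mask = toy_action_mask(robot_id)
        valid_actions = [a for a, ok in enumerate(mask) if ok]
        action = rng.choice(valid_actions)  # stand-in for the policy
        reward = 1.0                        # stand-in for stepping the environment
        with buffer_lock:
            buffer.append((robot_id, step, action, mask, reward))

threads = [threading.Thread(target=collect_for_robot, args=(r,))
           for r in range(N_ROBOTS)]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait for every robot before returns would be computed
```

Joining all threads before continuing corresponds to the idea of waiting for every robot to finish, and the lock keeps the shared buffer consistent while threads append concurrently.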
Configuration
The CXRLMaskablePPONode exposes the following ROS 2 parameters for configuring the RL agent and training:
Base PPO Parameters (forwarded from MaskablePPO)
These parameters are standard PPO hyperparameters, forwarded directly to the underlying MultiRobotMaskablePPO implementation:
| Parameter | Type | Description |
|---|---|---|
| model.learning_rate | double | Learning rate of the PPO agent |
| model.gamma | double | Discount factor for future rewards |
| model.gae_lambda | double | Factor for Generalized Advantage Estimation |
| model.ent_coef | double | Entropy coefficient for the PPO loss |
| model.vf_coef | double | Value function coefficient for the PPO loss |
| model.max_grad_norm | double | Maximum gradient norm for gradient clipping |
| model.batch_size | integer | Minibatch size for PPO updates |
| model.n_steps | integer | Number of steps per environment for rollout |
| model.seed | integer | Random seed for reproducibility |
| model.verbose | integer | Verbosity level for logging (0=none, 1=info, 2=debug) |
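To make the roles of model.gamma and model.gae_lambda more concrete, here is a minimal sketch of Generalized Advantage Estimation for a single trajectory. It follows the textbook recursion and is not taken from the MultiRobotMaskablePPO source; the function name compute_gae, its argument layout, and the default values shown are assumptions for illustration only.

```python
def compute_gae(rewards, values, dones, last_value, gamma=0.99, gae_lambda=0.95):
    """Return (advantages, returns) for one trajectory.

    dones[t] is True if the episode terminated after step t, in which case
    no value is bootstrapped from the following state.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value  # value estimate of the state after the last step
    for t in reversed(range(len(rewards))):
        non_terminal = 0.0 if dones[t] else 1.0
        # One-step TD error.
        delta = rewards[t] + gamma * next_value * non_terminal - values[t]
        # Exponentially weighted sum of TD errors.
        gae = delta + gamma * gae_lambda * non_terminal * gae
        advantages[t] = gae
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns

adv, ret = compute_gae(
    rewards=[1.0, 1.0, 1.0],
    values=[0.0, 0.0, 0.0],
    dones=[False, False, True],
    last_value=10.0,
    gamma=1.0,
    gae_lambda=1.0,
)
```

With gamma and gae_lambda both set to 1 and a terminal final step, the advantages reduce to the remaining undiscounted reward, and last_value is ignored because nothing is bootstrapped past a terminal state.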
Attention
Not all parameters of the underlying PPO/MPPO model are exposed as ROS parameters.
To access advanced settings (such as custom network architectures, policy_kwargs, or other
Stable-Baselines3 options), create a class that derives from CXRLMaskablePPONode and override its create_new_model() method (see the Interfaces section for more information).
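The subclassing approach might look roughly like the sketch below. To keep it runnable without ROS 2 or the real package, both CXRLMaskablePPONode and MultiRobotMaskablePPO are replaced by simplified stand-ins, so the constructor signatures shown here are assumptions and may differ from the actual classes. The subclass name CustomArchPPONode is hypothetical.

```python
# Stand-ins so the sketch runs on its own. In a real project, import the actual
# CXRLMaskablePPONode and MultiRobotMaskablePPO instead; their real constructor
# signatures are not shown in this document and may differ.
class MultiRobotMaskablePPO:
    def __init__(self, policy, env, **kwargs):
        self.policy = policy
        self.env = env
        self.kwargs = kwargs

class CXRLMaskablePPONode:
    def __init__(self, env=None):
        self.env = env
        self.model = None

    def create_new_model(self):
        return MultiRobotMaskablePPO("MultiInputPolicy", self.env)


class CustomArchPPONode(CXRLMaskablePPONode):
    """Hypothetical subclass passing settings not exposed as ROS parameters."""

    def create_new_model(self):
        # policy_kwargs with net_arch is a Stable-Baselines3 option for the
        # network layout; here, two hidden layers of 128 units each.
        policy_kwargs = {"net_arch": [128, 128]}
        return MultiRobotMaskablePPO(
            "MultiInputPolicy",
            self.env,
            policy_kwargs=policy_kwargs,
        )

node = CustomArchPPONode(env="dummy-env")
model = node.create_new_model()
```

Overriding only create_new_model() leaves the rest of the node's behavior (loading, training, shutdown) untouched.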
Multi-Robot / Custom Parameters
In addition to the PPO hyperparameters, the CXRLMaskablePPONode uses the following general training parameters and parameters specific to the multi-robot maskable variant:
| Parameter | Type | Default | Description |
|---|---|---|---|
| model_name | string | "default_agent" | Name of the RL model for saving or loading. |
| training.retraining | boolean | False | Whether to retrain an existing model. |
| training.max_episodes | integer | 1000 | Maximum number of training episodes. |
| training.timesteps | integer | 100000 | Total number of timesteps for the learning procedure. |
| model.n_robots | integer | 3 | Number of robots to simulate in parallel during rollout collection. |
| model.wait_for_all_robots | boolean | True | Whether to wait for all robot threads to finish before computing returns. Ensures rollouts are fully collected for all robots. |
| model.time_based | boolean | False | If True, rollouts are collected for a fixed duration instead of a fixed number of steps. |
| model.n_time | integer | 450 | Maximum duration (in seconds) of a time-based rollout when time_based=True. |
| model.deadzone | integer | 10 | Time buffer (in seconds) to prevent starting new threads near the end of a time-based rollout. |
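One plausible reading of how model.n_time and model.deadzone interact is sketched below: new robot threads are only started while more than deadzone seconds remain in the rollout. This is an interpretation of the parameter descriptions, not the actual implementation, and the helper names may_start_new_thread and rollout_finished are hypothetical.

```python
def may_start_new_thread(elapsed_s, n_time=450, deadzone=10):
    """True while more than `deadzone` seconds remain in the time-based rollout."""
    return elapsed_s < n_time - deadzone

def rollout_finished(elapsed_s, n_time=450):
    """True once the maximum rollout duration has been reached."""
    return elapsed_s >= n_time

# With the documented defaults, threads may start until just before 440 s.
starts_at_439 = may_start_new_thread(439)
starts_at_440 = may_start_new_thread(440)
```

Under this reading, the last 10 seconds of a 450-second rollout are reserved for threads that are already running to finish their current work.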
Interfaces
The CXRLMaskablePPONode provides the following methods:
set_model() → MultiRobotMaskablePPO
Populates self.model with an RL agent depending on the current rl_mode and links it with the CXRLGym environment.
Behavior:
TRAINING: creates a new agent, or loads an existing one if training.retraining=True.
EXECUTION: loads a previously trained agent.
Returns the loaded or newly created MultiRobotMaskablePPO instance.
create_new_model() → MultiRobotMaskablePPO
Helper method to set_model. Creates a new MPPO agent from scratch with parameters read from ROS parameters.
load_model() → MultiRobotMaskablePPO
Helper method to set_model. Loads a previously trained MPPO agent from disk. Updates self.model and links it with self.env. Logs a message once loading is complete.
run_training()
Starts the training loop of the RL agent according to the parameters training.max_episodes and training.timesteps. Handles callbacks for stopping and checkpointing. Saves the trained agent to disk and calls self.env.env.on_training_end(). Performs node shutdown after training completes.
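The decision that set_model() is described as making can be summarized in a small sketch. It only mirrors the documented behavior (create or load, depending on rl_mode and training.retraining) and is not the node's actual code; the RLMode enum and the function select_model_source are hypothetical stand-ins.

```python
from enum import Enum

class RLMode(Enum):
    """Hypothetical stand-in for the values that rl_mode can take."""
    TRAINING = "TRAINING"
    EXECUTION = "EXECUTION"

def select_model_source(rl_mode, retraining):
    """Return 'create' or 'load', following the behavior described for set_model()."""
    if rl_mode is RLMode.EXECUTION:
        return "load"  # execution always uses a previously trained agent
    if rl_mode is RLMode.TRAINING:
        # Retraining continues from an existing agent; otherwise start fresh.
        return "load" if retraining else "create"
    raise ValueError(f"unsupported rl_mode: {rl_mode!r}")

source = select_model_source(RLMode.TRAINING, retraining=False)
```

In the node itself, the two outcomes would correspond to calling create_new_model() or load_model().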