CXRLMaskablePPONode: RL Model Multi-Robot Applications

The CXRLMaskablePPONode class provides a ROS 2 node integrated with a maskable, multi-robot Proximal Policy Optimization (PPO) agent (MultiRobotMaskablePPO). This allows training and executing reinforcement learning policies in environments with multiple robots while automatically handling invalid action masking.

It extends the CXRLBaseNode and integrates a custom maskable PPO implementation compatible with Stable-Baselines3 and SB3-Contrib.

Key features:

Multi-robot support with configurable number of parallel agents.
Maskable PPO allowing invalid actions to be ignored during training.
Time-based or step-based rollout collection.
Integration with ROS 2 for lifecycle, parameter management, and execution.
Automatic logging, checkpointing, and training termination.

MultiRobotMaskablePPO

The MultiRobotMaskablePPO class is a custom extension of MaskableActorCriticPolicy from sb3_contrib. It supports multi-robot environments and invalid action masking.

Key points:

Rollout collection uses multithreading to simulate multiple robots in parallel. Each thread executes actions for a specific robot and steps the environment concurrently, enabling faster collection of experiences while respecting per-robot action masks.
Supports time_based or step_based rollout collection.
Integrates with MultiRobotMaskableRolloutBuffer and MultiRobotMaskableDictRolloutBuffer for action masking.
Computes returns and advantages, accounting for terminal states even if episodes are truncated.

Configuration

The CXRLMaskablePPONode exposes the following ROS 2 parameters for configuring the RL agent and training:

Base PPO Parameters (forwarded from MaskablePPO)

These parameters are standard PPO hyperparameters, forwarded directly to the underlying MultiRobotMaskablePPO implementation:

Parameter	Type	Description
model.learning_rate	double	Learning rate of the PPO agent
model.gamma	double	Discount factor for future rewards
model.gae_lambda	double	Factor for Generalized Advantage Estimation
model.ent_coef	double	Entropy coefficient for the PPO loss
model.vf_coef	double	Value function coefficient for the PPO loss
model.max_grad_norm	double	Maximum gradient norm for gradient clipping
model.batch_size	integer	Minibatch size for PPO updates
model.n_steps	integer	Number of steps per environment for rollout
model.seed	integer	Random seed for reproducibility
model.verbose	integer	Verbosity level for logging (0=none, 1=info, 2=debug)

Attention

Not all parameters of the underlying PPO/MPPO model are exposed as ROS parameters. To access advanced settings (such as custom network architectures, policy_kwargs, or other Stable Baselines3 options), create a class that derives from CXRLMaskablePPONode override the create_new_model() method (see the Interfaces section for more information).

Multi-Robot / Custom Parameters

These parameters are specific to the multi-robot maskable variant:

The CXRLMaskablePPONode utilizes the following parameters:

model_name:

Type	Default
string	“default_agent”

Description: Name of the RL model for saving or loading.

training.retraining:

Type	Default
boolean	False

Description: Whether to retrain an existing model.

training.max_episodes:

Type	Default
integer	1000

Description: Maximum number of training episodes.

training.timesteps:

Type	Default
integer	100000

Description: Total number of timesteps for the learning procedure.

model.n_robots:

Type	Default
integer	3

Description: Number of robots to simulate in parallel during rollout collection.

model.wait_for_all_robots:

Type	Default
boolean	True

Description: Whether to wait for all robot threads to finish before computing returns. Ensures rollouts are fully collected for all robots.

model.time_based:

Type	Default
boolean	False

Description: If True, rollouts are collected for a fixed duration instead of a fixed number of steps.

model.n_time:

Type	Default
integer	450

Description: Maximum duration (in seconds) of a time-based rollout when time_based=True.

model.deadzone:

Type	Default
integer	10

Description: Time buffer (in seconds) to prevent starting new threads near the end of a time-based rollout.

Interfaces

The CXRLMaskablePPONode provides the following methods:

set_model() → MultiRobotMaskablePPO

Populate self.model with an RL agent depending on the current rl_mode and links it with the CXRLGym environemnt.

Behavior:

TRAINING: creates a new agent or loads an existing one if training.retraining=True.
EXECUTION: loads a previously trained agent.

Returns the loaded or newly created MultiRobotMaskablePPO instance.

create_new_model() → MultiRobotMaskablePPO

Helper method to set_model. Creates a new MPPO agent from scratch with parameters read from ROS parameters.

load_model() → MultiRobotMaskablePPO

Helper method to set_model. Loads a previously trained MPPO agent from disk. Updates self.model and links it with self.env. Logs a message once loading is complete.

run_training()

Starts the training loop of the RL agent according to the parameters training.max_episodes and training.timesteps. Handles callbacks for stopping and checkpointing. Saves the trained agent to disk and calls self.env.env.on_training_end(). Performs node shutdown after training completes.