CXRLMaskablePPONode: RL Model for Multi-Robot Applications

The CXRLMaskablePPONode class provides a ROS 2 node integrated with a maskable, multi-robot Proximal Policy Optimization (PPO) agent (MultiRobotMaskablePPO). This allows training and executing reinforcement learning policies in environments with multiple robots while automatically handling invalid action masking.

It extends the CXRLBaseNode and integrates a custom maskable PPO implementation compatible with Stable-Baselines3 and SB3-Contrib.

Key features:

  • Multi-robot support with configurable number of parallel agents.

  • Maskable PPO allowing invalid actions to be ignored during training.

  • Time-based or step-based rollout collection.

  • Integration with ROS 2 for lifecycle, parameter management, and execution.

  • Automatic logging, checkpointing, and training termination.

MultiRobotMaskablePPO

The MultiRobotMaskablePPO class is a custom extension of MaskablePPO from sb3_contrib. It adds support for multi-robot environments on top of invalid action masking.

Key points:

  • Rollout collection uses multithreading to simulate multiple robots in parallel. Each thread executes actions for a specific robot and steps the environment concurrently, enabling faster collection of experiences while respecting per-robot action masks.

  • Supports time_based or step_based rollout collection.

  • Integrates with MultiRobotMaskableRolloutBuffer and MultiRobotMaskableDictRolloutBuffer for action masking.

  • Computes returns and advantages, accounting for terminal states even if episodes are truncated.
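The threaded, mask-aware collection described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the actual MultiRobotMaskablePPO code: ToyRobotEnv, sample_masked, and collect_rollout are hypothetical names, the mask rule is made up, and actions are drawn uniformly from the valid set instead of from a policy distribution.

```python
import random
import threading

N_ACTIONS = 4  # size of the toy discrete action space (assumption)


class ToyRobotEnv:
    """Hypothetical stand-in environment shared by all robot threads."""

    def __init__(self, n_robots):
        self._lock = threading.Lock()
        self.steps = [0] * n_robots

    def action_mask(self, robot_id):
        # Toy rule: the action whose index equals robot_id % N_ACTIONS is invalid.
        return [a != robot_id % N_ACTIONS for a in range(N_ACTIONS)]

    def step(self, robot_id, action):
        # Guard shared state, since several robot threads step concurrently.
        with self._lock:
            self.steps[robot_id] += 1
        return 1.0  # constant toy reward


def sample_masked(mask, rng):
    """Pick uniformly among the actions the mask marks as valid."""
    valid = [a for a, ok in enumerate(mask) if ok]
    return rng.choice(valid)


def collect_rollout(n_robots=3, n_steps=5, seed=0):
    env = ToyRobotEnv(n_robots)
    buffer = []
    buffer_lock = threading.Lock()

    def worker(robot_id):
        rng = random.Random(seed + robot_id)  # per-thread RNG
        for _ in range(n_steps):
            mask = env.action_mask(robot_id)
            action = sample_masked(mask, rng)
            reward = env.step(robot_id, action)
            with buffer_lock:
                buffer.append((robot_id, action, reward, mask))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_robots)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # comparable in spirit to waiting for all robots to finish
    return buffer


rollout = collect_rollout()
```

Every transition stores the mask that was active when the action was chosen, which is the information a maskable rollout buffer needs in order to recompute masked action probabilities during the update.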

Configuration

The CXRLMaskablePPONode exposes the following ROS 2 parameters for configuring the RL agent and training:

Base PPO Parameters (forwarded from MaskablePPO)

These parameters are standard PPO hyperparameters, forwarded directly to the underlying MultiRobotMaskablePPO implementation:

Parameter            Type     Description
model.learning_rate  double   Learning rate of the PPO agent
model.gamma          double   Discount factor for future rewards
model.gae_lambda     double   Factor for Generalized Advantage Estimation
model.ent_coef       double   Entropy coefficient for the PPO loss
model.vf_coef        double   Value function coefficient for the PPO loss
model.max_grad_norm  double   Maximum gradient norm for gradient clipping
model.batch_size     integer  Minibatch size for PPO updates
model.n_steps        integer  Number of steps per environment for rollout
model.seed           integer  Random seed for reproducibility
model.verbose        integer  Verbosity level for logging (0=none, 1=info, 2=debug)
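To make the roles of model.gamma and model.gae_lambda concrete, the following sketch computes Generalized Advantage Estimation over a single rollout in plain Python. It follows the standard GAE recursion and is not taken from the MultiRobotMaskablePPO source; the function name compute_gae and the default values are illustrative assumptions.

```python
def compute_gae(rewards, values, dones, last_value, gamma=0.99, gae_lambda=0.95):
    """Return (advantages, returns) for one rollout.

    dones[t] is True when the episode ended after step t, so no value is
    bootstrapped across that boundary. last_value is the value estimate of
    the state that follows the final step of the rollout.
    """
    n = len(rewards)
    advantages = [0.0] * n
    gae = 0.0
    for t in reversed(range(n)):
        next_value = last_value if t == n - 1 else values[t + 1]
        non_terminal = 0.0 if dones[t] else 1.0
        # One-step TD error, cut off at episode ends.
        delta = rewards[t] + gamma * next_value * non_terminal - values[t]
        # Discounted sum of TD errors, reset at episode ends.
        gae = delta + gamma * gae_lambda * non_terminal * gae
        advantages[t] = gae
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


advantages, returns = compute_gae(
    rewards=[1.0, 1.0, 1.0],
    values=[0.0, 0.0, 0.0],
    dones=[False, False, True],
    last_value=0.0,
    gamma=1.0,
    gae_lambda=1.0,
)
```

With gamma and gae_lambda both set to 1 and zero value estimates, the advantages reduce to the undiscounted reward-to-go. Lower values of gae_lambda shorten the horizon over which TD errors are accumulated.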

Attention

Not all parameters of the underlying PPO/MPPO model are exposed as ROS parameters. To access advanced settings (such as custom network architectures, policy_kwargs, or other Stable-Baselines3 options), create a class that derives from CXRLMaskablePPONode and override its create_new_model() method (see the Interfaces section for more information).
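The override pattern might look like the sketch below. To keep it runnable without ROS 2, it uses a stand-in stub for CXRLMaskablePPONode and a hypothetical build_model() factory in place of the MultiRobotMaskablePPO constructor, whose exact signature is not documented here. Only the shape of the policy_kwargs dictionary (a net_arch entry with separate pi and vf layer lists) follows the Stable-Baselines3 convention.

```python
class CXRLMaskablePPONode:
    """Stand-in stub (assumption): replace with the real node class."""

    def __init__(self, env=None):
        self.env = env

    def create_new_model(self):
        raise NotImplementedError


def build_model(policy, env, **kwargs):
    """Hypothetical factory standing in for the MultiRobotMaskablePPO constructor."""
    return {"policy": policy, "env": env, **kwargs}


class CustomArchPPONode(CXRLMaskablePPONode):
    """Derived node that injects settings not exposed as ROS parameters."""

    def create_new_model(self):
        # Separate two-layer networks for the policy (pi) and value function (vf).
        policy_kwargs = {"net_arch": {"pi": [128, 128], "vf": [128, 128]}}
        return build_model(
            "MlpPolicy",
            self.env,
            policy_kwargs=policy_kwargs,
            learning_rate=3e-4,  # illustrative value, not a documented default
        )


model = CustomArchPPONode().create_new_model()
```

In a real subclass, the body of create_new_model() would construct and return a MultiRobotMaskablePPO instance instead of the dictionary used here.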

Multi-Robot / Custom Parameters

In addition to the base PPO hyperparameters, the CXRLMaskablePPONode uses the following parameters, which cover model naming, training control, and the multi-robot rollout settings:

model_name:
  Type: string
  Default: "default_agent"
  Description: Name of the RL model for saving or loading.

training.retraining:
  Type: boolean
  Default: False
  Description: Whether to retrain an existing model.

training.max_episodes:
  Type: integer
  Default: 1000
  Description: Maximum number of training episodes.

training.timesteps:
  Type: integer
  Default: 100000
  Description: Total number of timesteps for the learning procedure.

model.n_robots:
  Type: integer
  Default: 3
  Description: Number of robots to simulate in parallel during rollout collection.

model.wait_for_all_robots:
  Type: boolean
  Default: True
  Description: Whether to wait for all robot threads to finish before computing returns. Ensures rollouts are fully collected for all robots.

model.time_based:
  Type: boolean
  Default: False
  Description: If True, rollouts are collected for a fixed duration instead of a fixed number of steps.

model.n_time:
  Type: integer
  Default: 450
  Description: Maximum duration (in seconds) of a time-based rollout when time_based=True.

model.deadzone:
  Type: integer
  Default: 10
  Description: Time buffer (in seconds) to prevent starting new threads near the end of a time-based rollout.
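One plausible reading of how model.n_time and model.deadzone interact is sketched below: a new robot thread is only started while the elapsed rollout time is still more than deadzone seconds away from the n_time limit. The helper name may_start_new_thread and the exact comparison are assumptions made for illustration, not the node's actual implementation.

```python
def may_start_new_thread(elapsed_s, n_time=450, deadzone=10):
    """Return True if a new robot thread may still be started.

    elapsed_s is the time in seconds since the time-based rollout began.
    The defaults mirror the documented parameter defaults.
    """
    return elapsed_s < n_time - deadzone


# With the defaults, new threads are allowed during the first 440 seconds only.
early = may_start_new_thread(120)
late = may_start_new_thread(445)
```

Under this reading, the last 10 seconds of a 450-second rollout are reserved for already running threads to finish their current step.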

Interfaces

The CXRLMaskablePPONode provides the following methods:

set_model() -> MultiRobotMaskablePPO

Populates self.model with an RL agent depending on the current rl_mode and links it with the CXRLGym environment.

Behavior:

  • TRAINING: creates a new agent or loads an existing one if training.retraining=True.

  • EXECUTION: loads a previously trained agent.

Returns the loaded or newly created MultiRobotMaskablePPO instance.
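The decision described above can be summarized in a few lines. The sketch below only models the choice between creating and loading an agent. The RLMode enum and the select_model_action helper are hypothetical names introduced for illustration, and the way the node actually represents rl_mode may differ.

```python
from enum import Enum


class RLMode(Enum):
    """Hypothetical representation of the node's rl_mode."""

    TRAINING = "TRAINING"
    EXECUTION = "EXECUTION"


def select_model_action(rl_mode, retraining):
    """Return "create" or "load" following the documented set_model() behavior."""
    if rl_mode is RLMode.TRAINING:
        # Retraining continues from an existing agent; otherwise start fresh.
        return "load" if retraining else "create"
    if rl_mode is RLMode.EXECUTION:
        # Execution always relies on a previously trained agent.
        return "load"
    raise ValueError(f"Unsupported rl_mode: {rl_mode!r}")


fresh_training = select_model_action(RLMode.TRAINING, retraining=False)
```

In the node itself, "create" corresponds to create_new_model() and "load" to load_model().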

create_new_model() -> MultiRobotMaskablePPO

Helper method for set_model(). Creates a new MPPO agent from scratch, configured from the node's ROS parameters.

load_model() -> MultiRobotMaskablePPO

Helper method for set_model(). Loads a previously trained MPPO agent from disk, updates self.model, and links it with self.env. Logs a message once loading is complete.

run_training()

Starts the training loop of the RL agent according to the parameters training.max_episodes and training.timesteps. Handles callbacks for stopping and checkpointing. Saves the trained agent to disk and calls self.env.env.on_training_end(). Performs node shutdown after training completes.
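As a rough illustration of how the two budgets could interact, the toy loop below stops as soon as either training.max_episodes or training.timesteps is reached and records periodic checkpoints. This assumes a "whichever comes first" rule, which the description above does not spell out. The function run_training_loop and its checkpoint_every argument are hypothetical and do not reflect the node's real callbacks.

```python
def run_training_loop(episode_lengths, max_episodes=1000,
                      total_timesteps=100000, checkpoint_every=None):
    """Consume episodes until the episode or timestep budget is exhausted.

    episode_lengths is an iterable of per-episode step counts (toy input).
    Returns (episodes_run, timesteps_run, checkpoint_episodes).
    """
    episodes = 0
    timesteps = 0
    checkpoints = []
    for length in episode_lengths:
        # Budgets are checked between episodes, so the timestep budget
        # can be overshot by at most one episode.
        if episodes >= max_episodes or timesteps >= total_timesteps:
            break
        timesteps += length
        episodes += 1
        if checkpoint_every and episodes % checkpoint_every == 0:
            checkpoints.append(episodes)  # a real loop would save the agent here
    return episodes, timesteps, checkpoints


result = run_training_loop([10] * 5, max_episodes=3)
```

After the loop ends, the documented run_training() saves the trained agent, calls self.env.env.on_training_end(), and shuts the node down.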