cx_rl_multi_robot_mppo.MultiRobotMaskablePPO module
- class cx_rl_multi_robot_mppo.MultiRobotMaskablePPO.MultiRobotMaskablePPO(*args: Any, **kwargs: Any)
Bases: MaskablePPO
Proximal Policy Optimization (PPO) with Invalid Action Masking for Multi-Robot Scenarios.
Based on the original PPO implementation in Stable Baselines3 and the MaskablePPO implementation in Stable Baselines3 Contrib.
Introduction to PPO: https://spinningup.openai.com/en/latest/algorithms/ppo.html
Background on Invalid Action Masking: https://arxiv.org/abs/2006.14171
A minimal usage sketch follows the parameter list below.
- Parameters:
policy – The policy model to use (MlpPolicy, CnnPolicy, …)
env – The environment to learn from (if registered in Gym, can be str)
learning_rate – The learning rate; it can be a function of the current progress remaining (from 1 to 0)
n_steps – Number of steps to run for each environment per update (batch size is n_steps * n_env, where n_env is the number of environment copies running in parallel)
batch_size – Minibatch size
n_epochs – Number of epochs when optimizing the surrogate loss
gamma – Discount factor
gae_lambda – Factor for trade-off of bias vs variance for Generalized Advantage Estimator
clip_range – Clipping parameter; can be a function of the current progress remaining (from 1 to 0)
clip_range_vf – Clipping parameter for the value function; can be a function of the current progress remaining (from 1 to 0). If None (default), no clipping is done. Note: clipping depends on reward scaling.
normalize_advantage – Whether to normalize the advantage or not
ent_coef – Entropy coefficient for the loss calculation
vf_coef – Value function coefficient for the loss calculation
max_grad_norm – Maximum value for gradient clipping
target_kl – Limit the KL divergence between updates. The clipping alone may not prevent large updates. See issue #213: https://github.com/hill-a/stable-baselines/issues/213. Default: no limit.
stats_window_size – Window size for rollout logging, specifying the number of episodes to average reported success rate, mean episode length, and mean reward over
tensorboard_log – Log location for tensorboard (if None, no logging)
policy_kwargs – Additional arguments passed to the policy on creation
verbose – Verbosity level: 0 no output, 1 info, 2 debug
seed – Seed for the pseudo-random generators
device – Device (cpu, cuda, …) to run the code on. ‘auto’ uses GPU if available
_init_setup_model – Whether to build the network at instance creation
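A minimal usage sketch (not part of the documented API): it assumes a Gymnasium environment wrapped with SB3-Contrib's ActionMasker stands in for a real multi-robot environment, and all hyperparameter values are illustrative placeholders rather than recommended settings.

```python
import gymnasium as gym
import numpy as np
from sb3_contrib.common.wrappers import ActionMasker

from cx_rl_multi_robot_mppo.MultiRobotMaskablePPO import MultiRobotMaskablePPO


def mask_fn(env: gym.Env) -> np.ndarray:
    # Placeholder masking logic: every action is marked valid. A real
    # multi-robot environment would return False for actions that are
    # currently infeasible for a robot.
    return np.ones(env.action_space.n, dtype=bool)


# CartPole stands in for the multi-robot environment purely for illustration.
env = ActionMasker(gym.make("CartPole-v1"), mask_fn)

model = MultiRobotMaskablePPO(
    "MlpPolicy",              # policy alias, see policy_aliases below
    env,
    learning_rate=3e-4,       # may also be a function of remaining progress
    n_steps=2048,             # rollout length per environment copy
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    ent_coef=0.0,
    vf_coef=0.5,
    max_grad_norm=0.5,
    verbose=1,
    seed=0,
    device="auto",
)
model.learn(total_timesteps=10_000)  # learn() is inherited from the SB3 base class
```

Any environment whose actions can become invalid is handled the same way: the mask function returns a boolean array with one entry per discrete action.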
- collect_rollouts(env: stable_baselines3.common.vec_env.VecEnv, callback: stable_baselines3.common.callbacks.BaseCallback, rollout_buffer: stable_baselines3.common.buffers.RolloutBuffer, n_rollout_steps: int, use_masking: bool = True) → bool
Collect experiences using the current policy and fill a RolloutBuffer. The term rollout here refers to the model-free notion and should not be confused with the concept of rollout used in model-based RL or planning.
This method is largely identical to the implementation found in the parent class.
- Parameters:
env – The training environment
callback – Callback that will be called at each step (and at the beginning and end of the rollout)
rollout_buffer – Buffer to fill with rollouts
n_rollout_steps – Number of experiences to collect per environment
use_masking – Whether to use invalid action masks during training
- Returns:
True if the function returned with at least n_rollout_steps collected, False if the callback terminated the rollout prematurely.
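collect_rollouts is normally driven by learn() rather than called directly. The sketch below reuses model and env from the earlier example and assumes the SB3-Contrib masking convention; it only illustrates what happens each collection step when use_masking=True, not the exact internals of this class.

```python
from sb3_contrib.common.maskable.utils import get_action_masks

# `model` and `env` are the objects constructed in the earlier sketch.
vec_env = model.get_env()          # the VecEnv the algorithm trains on
obs = vec_env.reset()

# With use_masking=True, each collection step first queries the environment
# for the current masks, so the policy assigns zero probability to invalid
# actions before sampling.
masks = get_action_masks(vec_env)  # boolean array of shape (n_envs, n_actions)
actions, _ = model.predict(obs, action_masks=masks)
obs, rewards, dones, infos = vec_env.step(actions)
```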
- policy_aliases: ClassVar[Dict[str, Type[stable_baselines3.common.policies.BasePolicy]]] = {'CnnPolicy': sb3_contrib.ppo_mask.policies.CnnPolicy, 'MlpPolicy': sb3_contrib.ppo_mask.policies.MlpPolicy, 'MultiInputPolicy': sb3_contrib.ppo_mask.policies.MultiInputPolicy}
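These aliases mean the policy argument may be given either as a string or as the corresponding SB3-Contrib maskable policy class. A small sketch, reusing the imports and env from the first example:

```python
from sb3_contrib.ppo_mask.policies import MlpPolicy

# Equivalent constructions: the string alias resolves to the class above
# via policy_aliases (env as in the first sketch).
model_a = MultiRobotMaskablePPO("MlpPolicy", env)
model_b = MultiRobotMaskablePPO(MlpPolicy, env)
```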