cx_rl_multi_robot_mppo.MultiRobotMaskablePPO module

class cx_rl_multi_robot_mppo.MultiRobotMaskablePPO.MultiRobotMaskablePPO(*args: Any, **kwargs: Any)

Bases: MaskablePPO

Proximal Policy Optimization (PPO) with Invalid Action Masking for Multi-Robot Scenarios.

Based on the original Stable Baselines3 PPO implementation and the MaskablePPO implementation in Stable Baselines3 Contrib (sb3-contrib).

Introduction to PPO: https://spinningup.openai.com/en/latest/algorithms/ppo.html

Background on Invalid Action Masking: https://arxiv.org/abs/2006.14171

Parameters:
  • policy – The policy model to use (MlpPolicy, CnnPolicy, …)

  • env – The environment to learn from (if registered in Gym, can be str)

  • learning_rate – The learning rate; it can be a function of the current progress remaining (from 1 to 0)

  • n_steps – Number of steps to run for each environment per update (the rollout buffer size is n_steps * n_envs, where n_envs is the number of environment copies running in parallel)

  • batch_size – Minibatch size

  • n_epochs – Number of epochs when optimizing the surrogate loss

  • gamma – Discount factor

  • gae_lambda – Factor for trade-off of bias vs variance for Generalized Advantage Estimator

  • clip_range – Clipping parameter; can be a function of the current progress remaining (from 1 to 0)

  • clip_range_vf – Clipping parameter for the value function; can be a function of the current progress remaining (from 1 to 0). If None (default), no clipping is done. Note: clipping depends on reward scaling.

  • normalize_advantage – Whether to normalize the advantage or not

  • ent_coef – Entropy coefficient for the loss calculation

  • vf_coef – Value function coefficient for the loss calculation

  • max_grad_norm – Maximum value for gradient clipping

  • target_kl – Limit the KL divergence between updates. The clipping alone may not prevent large updates. See issue #213: https://github.com/hill-a/stable-baselines/issues/213. Default: no limit.

  • stats_window_size – Window size for rollout logging, specifying the number of episodes to average reported success rate, mean episode length, and mean reward over

  • tensorboard_log – Log location for tensorboard (if None, no logging)

  • policy_kwargs – Additional arguments passed to the policy on creation

  • verbose – Verbosity level: 0 no output, 1 info, 2 debug

  • seed – Seed for the pseudo-random generators

  • device – Device (cpu, cuda, …) to run the code on. ‘auto’ uses GPU if available

  • _init_setup_model – Whether to build the network at instance creation
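A minimal construction sketch follows. It assumes a hypothetical MultiRobotEnv whose valid actions are exposed through sb3_contrib's ActionMasker wrapper; the environment class, the mask function, and the hyperparameter values shown are illustrative and not part of this module.

    from sb3_contrib.common.wrappers import ActionMasker

    from cx_rl_multi_robot_mppo.MultiRobotMaskablePPO import MultiRobotMaskablePPO

    def mask_fn(env):
        # Boolean array marking which actions are currently valid
        # (action_masks() is a hypothetical method of the placeholder env).
        return env.action_masks()

    env = ActionMasker(MultiRobotEnv(), mask_fn)  # MultiRobotEnv is a placeholder

    model = MultiRobotMaskablePPO(
        "MlpPolicy",
        env,
        learning_rate=3e-4,   # illustrative values, not module defaults
        n_steps=2048,
        batch_size=64,
        n_epochs=10,
        gamma=0.99,
        gae_lambda=0.95,
        clip_range=0.2,
        verbose=1,
        seed=0,
    )
    model.learn(total_timesteps=100_000)

Because the class inherits from MaskablePPO, training, saving, and prediction otherwise follow the parent API.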

collect_rollouts(env: stable_baselines3.common.vec_env.VecEnv, callback: stable_baselines3.common.callbacks.BaseCallback, rollout_buffer: stable_baselines3.common.buffers.RolloutBuffer, n_rollout_steps: int, use_masking: bool = True) → bool

Collect experiences using the current policy and fill a RolloutBuffer.

The term rollout here refers to the model-free notion and should not be confused with the concept of rollout used in model-based RL or planning.

This method is largely identical to the implementation found in the parent class.

Parameters:
  • env – The training environment

  • callback – Callback that will be called at each step (and at the beginning and end of the rollout)

  • rollout_buffer – Buffer to fill with rollouts

  • n_rollout_steps – Number of experiences to collect per environment

  • use_masking – Whether or not to use invalid action masks during training

Returns:

True if function returned with at least n_rollout_steps collected, False if callback terminated rollout prematurely.
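Although collect_rollouts is driven internally by learn(), the masking behaviour it relies on can be illustrated in isolation. The sketch below reuses model from the construction example above and shows how, with use_masking=True, the current action masks are fetched from the environment and passed to the policy; it is a simplified picture of one step of the rollout loop, not the actual implementation.

    from sb3_contrib.common.maskable.utils import get_action_masks

    vec_env = model.get_env()   # internal VecEnv wrapper around the training env
    obs = vec_env.reset()

    # One step of the rollout loop: query the masks, sample masked actions, step the env.
    action_masks = get_action_masks(vec_env)   # boolean array, shape (n_envs, n_actions)
    actions, _ = model.predict(obs, action_masks=action_masks)
    obs, rewards, dones, infos = vec_env.step(actions)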

policy_aliases: ClassVar[Dict[str, Type[stable_baselines3.common.policies.BasePolicy]]] = {'CnnPolicy': sb3_contrib.ppo_mask.policies.CnnPolicy, 'MlpPolicy': sb3_contrib.ppo_mask.policies.MlpPolicy, 'MultiInputPolicy': sb3_contrib.ppo_mask.policies.MultiInputPolicy}
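The policy argument of the constructor is resolved through this mapping, so a string alias and the policy class it maps to are interchangeable. A short sketch, reusing env from the construction example above:

    from sb3_contrib.ppo_mask.policies import MlpPolicy

    # Both constructions select the same maskable MLP policy.
    model_from_str = MultiRobotMaskablePPO("MlpPolicy", env)
    model_from_cls = MultiRobotMaskablePPO(MlpPolicy, env)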