Proximal Policy Optimization (PPO)#

Tags: on-policy, policy optimization, clipping

Paper: Proximal Policy Optimization Algorithms

Pseudocode#

Configuration#

PPO-specific configuration (defined in config/model_configs/):

from dataclasses import dataclass, field


@dataclass
class PPOActorConfig:
    """
    Configuration for the PPO actor network.

    Attributes:
        arch (type): The architecture class to be used for the actor network.
        actor_type (type): The class implementing the PPO actor logic.
        has_target (bool): Whether to maintain a target network for the actor.
        width (int): Width of hidden layers in the actor network.
        max_grad_norm (float): Maximum norm for gradient clipping in the actor.
    """

    arch: type = PPOActorNetProbabilistic
    actor_type: type = PPOActor
    has_target: bool = False
    width: int = 64
    max_grad_norm: float = 0.5


@dataclass
class PPOCriticConfig:
    """
    Configuration for the PPO critic network.

    Attributes:
        arch (type): The architecture class to be used for the critic network.
        critic_type (type): The class implementing the PPO critic logic.
        has_target (bool): Whether to maintain a target network for the critic.
        n_members (int): Number of ensemble members in the critic network.
        width (int): Width of hidden layers in the critic network.
        max_grad_norm (float): Maximum norm for gradient clipping in the critic.
    """

    arch: type = ValueNet
    critic_type: type = PPOCritic
    has_target: bool = False
    n_members: int = 1
    width: int = 64
    max_grad_norm: float = 0.5


@dataclass
class PPOConfig:
    """
    Full configuration for a PPO agent, including actor, critic, and optimization hyperparameters.

    Attributes:
        name (str): Identifier name for the PPO configuration.
        loss (str): Name of the loss function to use (e.g., 'MSELoss').
        tau (float): Polyak averaging coefficient for target network updates.
        policy_delay (int): Delay interval between policy (actor) updates.
        max_grad_norm (float): Maximum norm for gradient clipping globally.
        clip_rate (float): Clipping factor for the PPO objective.
        GAE_lambda (float): Lambda parameter for Generalized Advantage Estimation.
        normalize_advantages (bool): Whether to normalize advantages during training.
        entropy_coef (float): Coefficient for entropy regularization.
        actor (PPOActorConfig): Configuration object for the PPO actor.
        critic (PPOCriticConfig): Configuration object for the PPO critic.
    """

    name: str = "ppo"
    loss: str = "MSELoss"

    tau: float = 0.0
    policy_delay: int = 1
    max_grad_norm: float = 0.5
    clip_rate: float = 0.2
    GAE_lambda: float = 0.95
    normalize_advantages: bool = True
    entropy_coef: float = 0.0

    actor: PPOActorConfig = field(default_factory=PPOActorConfig)
    critic: PPOCriticConfig = field(default_factory=PPOCriticConfig)
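
As a rough illustration of how nested dataclass configurations like the ones above can be overridden, the sketch below uses simplified stand-in classes. The `*Sketch` names are hypothetical and carry only a subset of the fields; they are not part of the library. It relies only on the standard `dataclasses` module.

```python
from dataclasses import dataclass, field, replace


@dataclass
class ActorConfigSketch:  # hypothetical stand-in for PPOActorConfig
    width: int = 64
    max_grad_norm: float = 0.5


@dataclass
class CriticConfigSketch:  # hypothetical stand-in for PPOCriticConfig
    n_members: int = 1
    width: int = 64
    max_grad_norm: float = 0.5


@dataclass
class PPOConfigSketch:  # hypothetical stand-in for PPOConfig
    name: str = "ppo"
    clip_rate: float = 0.2
    GAE_lambda: float = 0.95
    entropy_coef: float = 0.0
    # default_factory builds a fresh nested config for every instance,
    # so two PPOConfigSketch objects never share one actor config by default.
    actor: ActorConfigSketch = field(default_factory=ActorConfigSketch)
    critic: CriticConfigSketch = field(default_factory=CriticConfigSketch)


default_cfg = PPOConfigSketch()

# replace() returns a new object; fields that are not named are copied over
# from default_cfg (nested objects such as `critic` are copied by reference).
custom_cfg = replace(
    default_cfg,
    clip_rate=0.1,
    entropy_coef=0.01,
    actor=ActorConfigSketch(width=128),
)
```

Using `default_factory` for the nested `actor` and `critic` fields mirrors the listing above: a plain mutable default would be rejected by `dataclasses`, and the factory keeps each configuration instance independent.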

UML Diagram#

The UML diagram illustrates the relationships between the classes in our PPO implementation.

The diagram shows how the PPOActor and PPOCritic classes inherit from the base classes Actor and CriticEnsemble, respectively. The ProximalPolicyOptimization class inherits from ActorCritic, which in turn inherits from Agent.

The diagram also lists the attributes and methods of each class that matter most for PPO. Specifically:

The PPOActorNetProbabilistic class implements a probabilistic actor network with Gaussian policies for continuous actions.

The loss() and update() methods of the PPOActor class implement the PPO clipped surrogate objective and the actor updates.

The update() method of the PPOCritic class updates the critic using Bellman targets computed externally.

The ProximalPolicyOptimization class handles the overall training loop, including generalized advantage estimation (GAE) and batched updates.
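
The quantities named above (GAE, advantage normalization, and the clipped surrogate objective) can be sketched in plain Python as follows. This is a minimal illustration of the standard formulas, not the library's implementation. The function names, the list-based rollout format, and the discount factor `gamma` are assumptions made for the example; `gae_lambda` and `clip_rate` correspond to the `GAE_lambda` and `clip_rate` configuration fields.

```python
import math


def compute_gae(rewards, values, last_value, dones, gamma=0.99, gae_lambda=0.95):
    """Return (advantages, returns) for one rollout using GAE.

    `gamma` is an assumed discount factor; it is not part of the PPOConfig shown above.
    `last_value` is the value estimate for the state after the final step.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0  # stop bootstrapping at episode ends
        delta = rewards[t] + gamma * next_value * not_done - values[t]  # TD error
        gae = delta + gamma * gae_lambda * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    # Value targets for the critic: advantage plus the baseline value.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


def normalize(xs, eps=1e-8):
    """Shift and scale a batch to zero mean and (approximately) unit variance."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / (math.sqrt(var) + eps) for x in xs]


def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages, clip_rate=0.2):
    """Negative PPO clipped surrogate objective, averaged over the batch."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(new_lp - old_lp)  # pi_new(a|s) / pi_old(a|s)
        clipped = min(max(ratio, 1.0 - clip_rate), 1.0 + clip_rate)
        # Pessimistic bound: take the smaller of the two surrogate terms.
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)
```

In a training loop, the advantages from `compute_gae` would typically be normalized per batch (when `normalize_advantages` is enabled) before being passed to the actor loss, while `returns` would serve as regression targets for the critic.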

Classes#

class objectrl.models.ppo.PPOActorNetProbabilistic(dim_state: int, dim_act: int, n_heads: int = 1, depth: int = 3, width: int = 256, act: Literal['crelu', 'relu'] = 'relu', has_norm: bool = False, upper_clamp: float = -1.0)[source]#

Bases: Module

Probabilistic actor network for PPO using a Gaussian policy.
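
To make the notion of a Gaussian policy concrete, the sketch below samples an action from a diagonal Gaussian and evaluates its log-probability, which is the quantity the PPO probability ratio is built from. It is a generic illustration in plain Python. The function names and the mean/log-standard-deviation parameterization are assumptions and do not describe how PPOActorNetProbabilistic is implemented.

```python
import math
import random


def gaussian_log_prob(action, mean, log_std):
    """Log-density of a diagonal Gaussian, summed over action dimensions."""
    total = 0.0
    for a, mu, ls in zip(action, mean, log_std):
        z = (a - mu) / math.exp(ls)  # standardized action
        total += -0.5 * z * z - ls - 0.5 * math.log(2.0 * math.pi)
    return total


def sample_action(mean, log_std, rng=random):
    """Draw one action with independent Gaussian noise per dimension."""
    return [rng.gauss(mu, math.exp(ls)) for mu, ls in zip(mean, log_std)]


# Hypothetical network outputs for a 2-dimensional action space.
mean = [0.0, 1.0]
log_std = [0.0, -1.0]
action = sample_action(mean, log_std)
log_prob = gaussian_log_prob(action, mean, log_std)
```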
