Deep Deterministic Policy Gradient (DDPG)#

off-policy deterministic

Paper: Continuous Control with Deep Reinforcement Learning

Pseudocode#

Configuration#

Specific configuration for the DDPG algorithm (in config/model_configs/).#
@dataclass
class ActorNoiseConfig:
    """
    Configuration for the Ornstein-Uhlenbeck process used to add noise to actions during training.

    Attributes:
        mu (float): Long-running mean of the noise process.
        theta (float): Speed of mean reversion.
        sigma (float): Volatility (standard deviation of noise).
        dt (float): Time step size.
        x0 (float): Initial state of the noise process (optional).
    """

    mu: float = 0
    theta: float = 0.15
    sigma: float = 0.2
    dt: float = 1e-2
    x0: float | None = None


@dataclass
class DDPGActorConfig:
    """
    Configuration for the DDPG actor network.

    Attributes:
        arch (type): Class for the actor network architecture.
        actor_type (type): Actor class to be used.
        has_target (bool): Whether the actor maintains a target network.
    """

    arch: type = ActorNet
    actor_type: type = DDPGActor
    has_target: bool = True


@dataclass
class DDPGCriticConfig:
    """
    Configuration for the DDPG critic network.

    Attributes:
        arch (type): Class for the critic network architecture.
        critic_type (type): Critic class to be used.
        n_members (int): Number of critic networks to use in the ensemble.
        loss (str): Name of the loss function to use ( "MSELoss").
        policy_delay (int): Number of critic updates per actor update.
        tau (float): Soft update coefficient for Polyak averaging.
    """

    arch: type = CriticNet
    critic_type: type = DDPGCritic
    n_members: int = 1


@dataclass
class DDPGConfig:
    """
    Top-level configuration for the Deep Deterministic Policy Gradient (DDPG) agent.

    Attributes:
        name (str): Name of the algorithm.
        noise (ActorNoiseConfig): Noise configuration for exploration.
        loss (str): Loss function for critic training.
        policy_delay (int): How often to update the actor policy.
        tau (float): Soft update coefficient for target networks.
        actor (DDPGActorConfig): Configuration for the actor.
        critic (DDPGCriticConfig): Configuration for the critic.
    """

    name: str = "ddpg"
    noise: ActorNoiseConfig = field(default_factory=ActorNoiseConfig)
    loss: str = "MSELoss"
    policy_delay: int = 1
    tau: float = 0.005
    actor: DDPGActorConfig = field(default_factory=DDPGActorConfig)
    critic: DDPGCriticConfig = field(default_factory=DDPGCriticConfig)

    def __post_init__(self) -> None:
        """
        Converts `noise` from a dictionary to an ActorNoiseConfig if needed.
        Useful when loading from a JSON or dict-based config file.

        Args:
            None
        Returns:
            None
        """
        if isinstance(self.noise, dict):
            self.noise = ActorNoiseConfig(**self.noise)


UML Diagram#

UML diagram for the DDPG algorithm.

UML diagram for the DDPG algorithm.#

We use the UML diagram to illustrate the relationships between the classes in our DDPG implementation.

The diagram shows how the DDPGActor and DDPGCritic classes inherit from the base classes Actor and CriticEnsemble, respectively. DeepDeterministicPolicyGradient class also inherits from ActorCritic class which inherits from Agent.

We illustrate each class's crucial attributes and methods for DDPG. Specifically:

get_bellman_target() method in DDPGCritic class is implemented to compute the Bellman target for the critic in DDPG style.

act() and loss() methods in DDPGActor class are implemented to act in DDPG style and update the actor's policy.

Exploration Noise#

We use the OrnsteinUhlenbeckNoise class to inject noise into the actor's actions to encourage exploration. This noise process is commonly used in DDPG for continuous action spaces due to its temporal correlation.

class objectrl.models.ddpg.OrnsteinUhlenbeckNoise(dim_act: int, mu: float = 0, theta: float = 0.15, sigma: float = 0.2, dt: float = 0.01, x0: float | None = None)[source]#

Bases: object

Implements Ornstein-Uhlenbeck process to generate temporally correlated noise. Commonly used in DDPG to add exploration noise to continuous actions.

Parameters:
  • dim_act (int) – Shape of the action space.

  • mu (float) – Long-running mean.

  • theta (float) – Speed of mean reversion.

  • sigma (float) – Volatility (standard deviation).

  • dt (float) – Time step size.

  • x0 (float, optional) – Initial value of the process.

__init__(dim_act: int, mu: float = 0, theta: float = 0.15, sigma: float = 0.2, dt: float = 0.01, x0: float | None = None) None[source]#
reset() None[source]#

Resets internal state to initial value or mean (mu).

evolve_state() Tensor[source]#

Applies the OU process to evolve the noise state.

sample() Tensor[source]#

Returns a sample from the current noise state.

Parameters:

None

Returns:

A sample tensor generated from the noise process.

Return type:

torch.Tensor

Classes#

class objectrl.models.ddpg.DDPGActor(config: MainConfig, dim_state: int, dim_act: int)[source]#

Bases: Actor

DDPG actor module that produces actions and applies OU noise during training.

interaction_iter#

Counter for interaction steps.

Type:

int

sampling_rate#

Frequency of head sampling for posterior sampling.

Type:

int

idx_active_critic#

Index of the currently active critic head.

Type:

int

is_episode_end#

Flag indicating if the current episode has ended.

Type:

bool

__init__(config: MainConfig, dim_state: int, dim_act: int) None[source]#

Initializes the Actor.

Parameters:
  • config (MainConfig) – Configuration dataclass instance.

  • dim_state (int) – Dimension of observation space.

  • dim_act (int) – Dimension of action space.

Returns:

None

act(state: Tensor, is_training: bool = True) dict[source]#

Produces an action given a state. Adds noise during training.

Parameters:
  • state (torch.Tensor) – The input state.

  • is_training (bool) – Whether to add exploration noise.

Returns:

Dictionary containing ‘action’ and ‘action_wo_noise’.

Return type:

dict

loss(state: Tensor, critics: CriticEnsemble) Tensor[source]#

Computes the loss for the actor by maximizing the critic’s Q-value.

Parameters:
  • state (torch.Tensor) – Batch of input states.

  • critics (CriticEnsemble) – Critic network(s).

Returns:

Actor loss (negative Q-value).

Return type:

torch.Tensor

class objectrl.models.ddpg.DDPGCritic(config: MainConfig, dim_state: int, dim_act: int)[source]#

Bases: CriticEnsemble

DDPG critic module implementing Q-value estimation and Bellman target computation.

loss#

Loss function for training.

Type:

PACBayesLoss

gamma#

Discount factor for future rewards.

Type:

float

__init__(config: MainConfig, dim_state: int, dim_act: int) None[source]#

Initialize the critic ensemble.

Parameters:
  • config (MainConfig) – Configuration object with model parameters.

  • dim_state (int) – Dimension of the state space.

  • dim_act (int) – Dimension of the action space.

Returns:

None

get_bellman_target(reward: Tensor, next_state: Tensor, done: Tensor, actor: DDPGActor) Tensor[source]#

Computes the Bellman target for critic training.

Parameters:
  • reward (torch.Tensor) – Reward signal.

  • next_state (torch.Tensor) – Next state.

  • done (torch.Tensor) – Done flag (1 if terminal, else 0).

  • actor (DDPGActor) – Actor network (used for target action).

Returns:

Bellman target (y).

Return type:

torch.Tensor

class objectrl.models.ddpg.DeepDeterministicPolicyGradient(config: MainConfig, critic_type: type = <class 'objectrl.models.ddpg.DDPGCritic'>, actor_type: type = <class 'objectrl.models.ddpg.DDPGActor'>)[source]#

Bases: ActorCritic

Full DDPG agent, combining actor and critic with target networks and experience replay. Lillicrap et al. (2015): Continuous Control with Deep Reinforcement Learning

_agent_name = 'DDPG'#
__init__(config: MainConfig, critic_type: type = <class 'objectrl.models.ddpg.DDPGCritic'>, actor_type: type = <class 'objectrl.models.ddpg.DDPGActor'>) None[source]#

Initializes DDPG agent.

Parameters:
  • config (MainConfig) – Configuration dataclass instance.

  • critic_type (type) – Critic class type.

  • actor_type (type) – Actor class type.

Returns:

None

store_transition(transition: dict) None[source]#

Stores a transition in the replay buffer. Resets noise if episode is done.

Parameters:

transition (dict) – A dictionary containing keys like ‘state’, ‘action’, ‘reward’, ‘done’.