Soft Actor-Critic (SAC)

off-policy stochastic

Paper: Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

Pseudocode

Configuration

Specific configuration for the SAC algorithm (in config/model_configs/).

from dataclasses import dataclass, field


@dataclass
class SACActorConfig:
    """
    Configuration for the SAC actor network.

    Attributes:
        arch (type): Neural network architecture class for the actor.
        actor_type (type): Actor class type.
    """

    arch: type = ActorNetProbabilistic
    actor_type: type = SACActor


@dataclass
class SACCriticConfig:
    """
    Configuration for the SAC critic network ensemble.

    Attributes:
        arch (type): Neural network architecture class for the critic.
        critic_type (type): Critic class type.
    """

    arch: type = CriticNet
    critic_type: type = SACCritic


@dataclass
class SACConfig:
    """
    Main SAC algorithm configuration class.

    Attributes:
        name (str): Algorithm identifier.
        loss (str): Loss function used for critic training.
        policy_delay (int): Number of critic updates per actor update.
        tau (float): Polyak averaging coefficient for target network updates.
        target_entropy (float | None): Target entropy for automatic temperature tuning.
        alpha (float): Initial temperature parameter.
        actor (SACActorConfig): Actor configuration.
        critic (SACCriticConfig): Critic configuration.
    """

    name: str = "sac"
    loss: str = "MSELoss"
    policy_delay: int = 1
    tau: float = 0.005
    target_entropy: float | None = None
    alpha: float = 1.0

    actor: SACActorConfig = field(default_factory=SACActorConfig)
    critic: SACCriticConfig = field(default_factory=SACCriticConfig)
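
The tau field above controls Polyak averaging of the target networks. As a minimal sketch of what such an update usually looks like, the snippet below applies it to plain lists of floats. The function name polyak_update is hypothetical, and the convention that tau weights the online parameters is an assumption based on the small default of 0.005; the library's own implementation operates on network parameters and may differ in detail.

```python
def polyak_update(target: list[float], online: list[float], tau: float = 0.005) -> list[float]:
    """Return tau * online + (1 - tau) * target, element by element.

    Hypothetical helper for illustration; not part of the library API.
    """
    return [tau * w + (1.0 - tau) * w_targ for w, w_targ in zip(online, target)]


# Toy "parameters": the target slowly tracks the online values.
target = [0.0, 1.0]
online = [1.0, 1.0]
for _ in range(3):
    target = polyak_update(target, online, tau=0.005)
print(target)  # first entry has moved slightly toward 1.0
```

With a small tau the target changes slowly, which is the usual motivation for keeping a separate target network for the Bellman backup.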


UML Diagram

UML diagram for the SAC algorithm.

We use the UML diagram to illustrate the relationships between the classes in our SAC implementation.

The diagram shows that SACActor and SACCritic inherit from the base classes Actor and CriticEnsemble, respectively. SoftActorCritic inherits from ActorCritic, which in turn inherits from Agent.

The diagram also lists the attributes and methods of each class that matter most for SAC. Specifically:

  • SACCritic implements get_bellman_target() to compute the entropy-regularized Bellman target used to train the critics.

  • SACActor implements update_alpha(), loss(), and update() to update the policy and the temperature parameter.
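
The class relationships described above can be sketched as a bare skeleton. The base classes here are empty stand-ins written for this illustration, and the method bodies are deliberately left unimplemented; only the class names, method names, and inheritance structure are taken from this page.

```python
# Hypothetical stand-ins for the library's base classes (bodies omitted).
class Agent: ...
class ActorCritic(Agent): ...
class Actor: ...
class CriticEnsemble: ...


class SACActor(Actor):
    """SAC-specific actor: policy loss, policy step, temperature step."""

    def update_alpha(self, act_dict):
        raise NotImplementedError

    def loss(self, state, critics):
        raise NotImplementedError

    def update(self, state, critics):
        raise NotImplementedError


class SACCritic(CriticEnsemble):
    """SAC-specific critic ensemble: entropy-regularized Bellman target."""

    def get_bellman_target(self, reward, next_state, done, actor):
        raise NotImplementedError


class SoftActorCritic(ActorCritic):
    _agent_name = "SAC"


print([cls.__name__ for cls in SoftActorCritic.__mro__])
# ['SoftActorCritic', 'ActorCritic', 'Agent', 'object']
```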

Classes

class objectrl.models.sac.SACActor(config: MainConfig, dim_state: int, dim_act: int)

Bases: Actor

Soft Actor network with automatic temperature tuning.

Parameters:
  • config (MainConfig) – Configuration object with hyperparameters.

  • dim_state (int) – Observation space dimensions.

  • dim_act (int) – Action space dimensions.

target_entropy

Target entropy for temperature tuning.

Type:

float

log_alpha

Learnable log temperature parameter.

Type:

Tensor

optim_alpha

Optimizer for temperature parameter.

Type:

Optimizer

__init__(config: MainConfig, dim_state: int, dim_act: int) → None

Initializes the Actor.

Parameters:
  • config (MainConfig) – Configuration dataclass instance.

  • dim_state (int) – Dimension of observation space.

  • dim_act (int) – Dimension of action space.

Returns:

None

update_alpha(act_dict: dict) → None

Updates the temperature parameter alpha based on current policy entropy.

Parameters:

act_dict (dict) – Dictionary with the key 'action_logprob', which holds the action log probabilities.

Returns:

None
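
As a rough illustration of automatic temperature tuning, the sketch below uses a commonly used form of the temperature objective, J = -log_alpha * mean(logprob + target_entropy), and takes one plain gradient-descent step on log_alpha. The function names, the learning rate, and the use of plain SGD instead of the optim_alpha optimizer are assumptions made for this example; the exact loss in the library is not shown on this page.

```python
def alpha_loss_and_grad(log_alpha: float, action_logprob: list[float],
                        target_entropy: float) -> tuple[float, float]:
    """Temperature loss and its derivative with respect to log_alpha.

    Assumed form: J = -log_alpha * mean(logprob + target_entropy).
    The log probabilities are treated as constants (no gradient through them).
    """
    mean_gap = sum(lp + target_entropy for lp in action_logprob) / len(action_logprob)
    return -log_alpha * mean_gap, -mean_gap


def update_alpha_sketch(log_alpha: float, act_dict: dict,
                        target_entropy: float, lr: float = 3e-4) -> float:
    """One SGD step on log_alpha (hypothetical stand-in for update_alpha)."""
    _, grad = alpha_loss_and_grad(log_alpha, act_dict["action_logprob"], target_entropy)
    return log_alpha - lr * grad


# alpha = 1.0 corresponds to log_alpha = 0.0.
# Here the policy entropy estimate (-mean logprob = 1.0) is above the target (-2.0),
# so the step lowers log_alpha and therefore the temperature.
new_log_alpha = update_alpha_sketch(0.0, {"action_logprob": [-0.5, -1.5]},
                                    target_entropy=-2.0, lr=0.1)
print(new_log_alpha < 0.0)  # True
```

Optimizing log_alpha rather than alpha keeps the temperature positive after exponentiation, which is one common reason for storing a log_alpha attribute.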

loss(state: Tensor, critics: CriticEnsemble) → tuple[Tensor, dict]

Computes the SAC actor loss.

Parameters:
  • state (Tensor) – Batch of states.

  • critics (CriticEnsemble) – Critic networks for Q-value estimation.

Returns:

Actor loss and action dictionary containing action and log probability.

Return type:

tuple
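
For orientation, the standard SAC policy objective minimizes mean(alpha * logprob - Q(s, a)) over actions sampled from the current policy. The sketch below computes that quantity on plain Python lists. Reducing the ensemble with an element-wise minimum is an assumption here (it is the usual clipped double-Q choice), and sac_actor_loss is a hypothetical name, not the library method.

```python
def sac_actor_loss(action_logprob: list[float], q_values: list[list[float]],
                   alpha: float) -> float:
    """Mean of alpha * logprob - min_i Q_i over the batch.

    q_values holds one list of Q estimates per critic, each of batch length.
    """
    # Element-wise minimum across critics (assumed ensemble reduction).
    q_min = [min(qs) for qs in zip(*q_values)]
    per_sample = [alpha * lp - q for lp, q in zip(action_logprob, q_min)]
    return sum(per_sample) / len(per_sample)


loss = sac_actor_loss(
    action_logprob=[-1.0, -2.0],
    q_values=[[1.0, 3.0], [2.0, 2.5]],  # two critics, batch of two
    alpha=0.5,
)
print(loss)  # -2.5
```

Lower loss means higher Q-values and higher entropy (more negative log probabilities), with alpha setting the trade-off between the two.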

update(state: Tensor, critics: CriticEnsemble) → None

Performs a gradient step on the actor network and updates alpha.

Parameters:
  • state (Tensor) – Batch of states.

  • critics (CriticEnsemble) – Critic ensemble for Q-value estimates.

Returns:

None

class objectrl.models.sac.SACCritic(config: MainConfig, dim_state: int, dim_act: int)

Bases: CriticEnsemble

SAC critic ensemble handling Bellman target computation and updates.

Parameters:
  • config (MainConfig) – Configuration object.

  • dim_state (int) – State space dimensions.

  • dim_act (int) – Action space dimensions.

_gamma

Discount factor for future rewards.

Type:

float

__init__(config: MainConfig, dim_state: int, dim_act: int) → None

Initialize the critic ensemble.

Parameters:
  • config (MainConfig) – Configuration object with model parameters.

  • dim_state (int) – Dimension of the state space.

  • dim_act (int) – Dimension of the action space.

Returns:

None

get_bellman_target(reward: Tensor, next_state: Tensor, done: Tensor, actor: SACActor) → Tensor

Computes target Q-values using entropy-regularized Bellman backup.

Parameters:
  • reward (Tensor) – Reward batch.

  • next_state (Tensor) – Next state batch.

  • done (Tensor) – Done flags batch.

  • actor (SACActor) – Actor network for next action sampling.

Returns:

Target Q-values for critic training.

Return type:

Tensor
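
The entropy-regularized backup is usually written as y = r + gamma * (1 - done) * (min_i Q_i(s', a') - alpha * log pi(a' | s')), with a' sampled from the current policy. The sketch below evaluates that formula on plain lists, assuming the next-state Q-values and log probabilities have already been computed. The function name, the argument layout, and the minimum over the target ensemble are assumptions for illustration rather than the library's exact code.

```python
def bellman_target_sketch(reward: list[float], done: list[float],
                          next_q_targets: list[list[float]],
                          next_logprob: list[float],
                          alpha: float, gamma: float = 0.99) -> list[float]:
    """Soft Bellman target for each transition in the batch.

    next_q_targets holds one list of target-critic values per critic.
    """
    # Element-wise minimum across target critics (assumed reduction).
    q_min = [min(qs) for qs in zip(*next_q_targets)]
    return [
        r + gamma * (1.0 - d) * (q - alpha * lp)
        for r, d, q, lp in zip(reward, done, q_min, next_logprob)
    ]


y = bellman_target_sketch(
    reward=[1.0, 1.0],
    done=[0.0, 1.0],                         # second transition is terminal
    next_q_targets=[[2.0, 5.0], [3.0, 4.0]], # two target critics
    next_logprob=[-1.0, -1.0],
    alpha=0.5,
    gamma=0.5,
)
print(y)  # [2.25, 1.0]
```

For the terminal transition the bootstrap term is masked out, so the target reduces to the reward alone.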

class objectrl.models.sac.SoftActorCritic(config: MainConfig, critic_type: type = <class 'objectrl.models.sac.SACCritic'>, actor_type: type = <class 'objectrl.models.sac.SACActor'>)

Bases: ActorCritic

Soft Actor-Critic agent combining SACActor and SACCritic.

Reference: Haarnoja et al. (2018), "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor".

_agent_name = 'SAC'

__init__(config: MainConfig, critic_type: type = <class 'objectrl.models.sac.SACCritic'>, actor_type: type = <class 'objectrl.models.sac.SACActor'>) → None

Initializes SAC agent.

Parameters:
  • config (MainConfig) – Configuration dataclass instance.

  • critic_type (type) – Critic class type.

  • actor_type (type) – Actor class type.

Returns:

None