Soft Actor-Critic (SAC)
Tags: off-policy, stochastic
Paper: Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
Pseudocode
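For reference, SAC alternates between the objectives below. They follow Haarnoja et al. (2018), using the clipped double-Q minimum and the learned temperature from the authors' follow-up version. That ObjectRL reduces its critic ensemble with a minimum is an assumption. Here $\bar{Q}_j$ are the target critics, $\mathcal{D}$ is the replay buffer, $\bar{\mathcal{H}}$ is the target entropy, $\gamma$ is the discount factor, and $\tau$ is the Polyak coefficient from the configuration.

$$
\begin{aligned}
y &= r + \gamma (1 - d)\Big(\min_{j} \bar{Q}_j(s', a') - \alpha \log \pi_\phi(a' \mid s')\Big), \qquad a' \sim \pi_\phi(\cdot \mid s') \\
\mathcal{L}_Q(\theta_i) &= \mathbb{E}_{(s, a, r, s', d) \sim \mathcal{D}}\big[(Q_{\theta_i}(s, a) - y)^2\big] \\
J_\pi(\phi) &= \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi_\phi(\cdot \mid s)}\big[\alpha \log \pi_\phi(a \mid s) - \min_j Q_{\theta_j}(s, a)\big] \\
J(\alpha) &= \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi_\phi(\cdot \mid s)}\big[-\alpha \log \pi_\phi(a \mid s) - \alpha \bar{\mathcal{H}}\big] \\
\bar{\theta}_i &\leftarrow \tau \theta_i + (1 - \tau)\, \bar{\theta}_i
\end{aligned}
$$

The squared error in $\mathcal{L}_Q$ corresponds to the default `loss = "MSELoss"`.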
Configuration
```python
from dataclasses import dataclass, field

# ActorNetProbabilistic, CriticNet, SACActor, and SACCritic are ObjectRL classes
# (SACActor and SACCritic live in objectrl.models.sac) and must be imported first.
# The `float | None` annotation below requires Python 3.10 or newer.


@dataclass
class SACActorConfig:
    """
    Configuration for the SAC actor network.

    Attributes:
        arch (type): Neural network architecture class for the actor.
        actor_type (type): Actor class type.
    """

    arch: type = ActorNetProbabilistic
    actor_type: type = SACActor


@dataclass
class SACCriticConfig:
    """
    Configuration for the SAC critic network ensemble.

    Attributes:
        arch (type): Neural network architecture class for the critic.
        critic_type (type): Critic class type.
    """

    arch: type = CriticNet
    critic_type: type = SACCritic


@dataclass
class SACConfig:
    """
    Main SAC algorithm configuration class.

    Attributes:
        name (str): Algorithm identifier.
        loss (str): Loss function used for critic training.
        policy_delay (int): Number of critic updates per actor update.
        tau (float): Polyak averaging coefficient for target network updates.
        target_entropy (float | None): Target entropy for automatic temperature tuning.
        alpha (float): Initial temperature parameter.
        actor (SACActorConfig): Actor configuration.
        critic (SACCriticConfig): Critic configuration.
    """

    name: str = "sac"
    loss: str = "MSELoss"
    policy_delay: int = 1
    tau: float = 0.005
    target_entropy: float | None = None
    alpha: float = 1.0
    # Nested dataclasses need default_factory so each SACConfig gets its own instance.
    actor: SACActorConfig = field(default_factory=SACActorConfig)
    critic: SACCriticConfig = field(default_factory=SACCriticConfig)
```
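The `tau` and `policy_delay` fields control how quickly target networks track the online critics and how often the actor is updated. The following is a minimal, framework-free sketch of both mechanisms. It assumes parameters are flat lists of floats and that the actor runs on every `policy_delay`-th critic step counted from zero. The helper names `polyak_update` and `actor_update_steps` are hypothetical and are not ObjectRL functions.

```python
def polyak_update(online: list[float], target: list[float], tau: float) -> list[float]:
    """Soft target update: target <- tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * w_targ for w, w_targ in zip(online, target)]


def actor_update_steps(n_critic_steps: int, policy_delay: int) -> list[int]:
    """Critic steps (0-indexed) after which the actor is also updated.

    Assumption: the actor runs whenever the step index is a multiple of policy_delay.
    """
    return [step for step in range(n_critic_steps) if step % policy_delay == 0]


# With the default tau=0.005, each update moves the target 0.5% toward the online weights.
print(polyak_update([1.0, 2.0], [0.0, 0.0], tau=0.005))
print(actor_update_steps(4, policy_delay=1))  # [0, 1, 2, 3]: SAC default, actor every step
print(actor_update_steps(6, policy_delay=2))  # [0, 2, 4]
```

A small `tau` makes the targets change slowly, which stabilizes the bootstrapped critic targets. With `policy_delay = 1`, the actor and the critics are updated at the same rate.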
UML Diagram
UML diagram for the SAC algorithm.
We use the UML diagram to illustrate the relationships between the classes in our SAC implementation.
The diagram shows how the SACActor and SACCritic classes inherit from the base classes Actor and CriticEnsemble, respectively, and how SoftActorCritic inherits from ActorCritic, which in turn inherits from Agent.
For each class, the diagram highlights the attributes and methods that are specific to SAC:
- SACCritic.get_bellman_target() computes the entropy-regularized Bellman target used to train the critics.
- SACActor.update_alpha(), SACActor.loss(), and SACActor.update() update the temperature parameter and the actor's policy.
Classes
- class objectrl.models.sac.SACActor(config: MainConfig, dim_state: int, dim_act: int)
Bases: Actor
Soft Actor network with automatic temperature tuning.
- Parameters:
config (MainConfig) – Configuration object with hyperparameters.
dim_state (int) – Observation space dimensions.
dim_act (int) – Action space dimensions.
- target_entropy
Target entropy for temperature tuning.
- Type:
float
- log_alpha
Learnable log temperature parameter.
- Type:
Tensor
- optim_alpha
Optimizer for temperature parameter.
- Type:
Optimizer
- __init__(config: MainConfig, dim_state: int, dim_act: int) -> None
Initializes the Actor.
- Parameters:
config (MainConfig) – Configuration dataclass instance.
dim_state (int) – Dimension of observation space.
dim_act (int) – Dimension of action space.
- Returns:
None
- update_alpha(act_dict: dict) -> None
Updates the temperature parameter alpha based on current policy entropy.
- Parameters:
act_dict (dict) – Dictionary with key ‘action_logprob’ containing the log probabilities of the sampled actions.
- Returns:
None
- loss(state: Tensor, critics: CriticEnsemble) -> tuple[Tensor, dict]
Computes the SAC actor loss.
- Parameters:
state (Tensor) – Batch of states.
critics (CriticEnsemble) – Critic networks for Q-value estimation.
- Returns:
The actor loss, and an action dictionary containing the sampled action and its log probability.
- Return type:
tuple
- update(state: Tensor, critics: CriticEnsemble) -> None
Performs a gradient step on the actor network and updates alpha.
- Parameters:
state (Tensor) – Batch of states.
critics (CriticEnsemble) – Critic ensemble for Q-value estimates.
- Returns:
None
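The actor loss and the temperature update can be sketched on plain Python lists to make the arithmetic concrete. This sketch assumes that the critic ensemble is reduced with a minimum, as in the original SAC. It uses the "log-alpha" temperature loss found in several open-source SAC implementations and a `-dim_act` target entropy heuristic. None of these choices is confirmed for ObjectRL. The names `sac_actor_loss` and `alpha_step` are hypothetical.

```python
import math


def sac_actor_loss(q_values: list[list[float]], action_logprobs: list[float],
                   alpha: float) -> float:
    """Mean over the batch of alpha * log pi(a|s) - min_j Q_j(s, a).

    q_values[j][i] is critic j's estimate for batch element i at a freshly
    sampled action a_i ~ pi(.|s_i); action_logprobs[i] is log pi(a_i|s_i).
    """
    batch_size = len(action_logprobs)
    # Pessimistic reduction over the critic ensemble (assumption: minimum, as in SAC).
    q_min = [min(q[i] for q in q_values) for i in range(batch_size)]
    return sum(alpha * lp - q for lp, q in zip(action_logprobs, q_min)) / batch_size


def alpha_step(log_alpha: float, action_logprobs: list[float],
               target_entropy: float, lr: float = 3e-4) -> float:
    """One plain gradient-descent step on the temperature loss.

    Loss (log-alpha variant): J = -mean(log_alpha * (log_prob + target_entropy)),
    with log_prob held constant, so dJ/dlog_alpha = -mean(log_prob + target_entropy).
    """
    grad = -sum(lp + target_entropy for lp in action_logprobs) / len(action_logprobs)
    return log_alpha - lr * grad


# Hypothetical batch: two states, two critics, and a 2-D action space.
logprobs = [3.0, 3.0]   # entropy estimate -mean(log_prob) = -3
target_entropy = -2.0   # -dim_act heuristic (assumed default when target_entropy is None)
log_alpha = 0.0         # alpha = exp(0) = 1.0, the SACConfig.alpha default

loss = sac_actor_loss([[1.0, 2.0], [0.5, 3.0]], logprobs, alpha=math.exp(log_alpha))
# The entropy (-3) is below the target (-2), so this step increases alpha.
log_alpha = alpha_step(log_alpha, logprobs, target_entropy, lr=0.1)
print(loss, math.exp(log_alpha))
```

In a PyTorch implementation, the action is sampled with the reparameterization trick, so gradients of the actor loss flow through both the log probability and the Q-values. SACActor keeps a dedicated optimizer (`optim_alpha`) for the temperature. The sketch substitutes plain gradient descent for readability.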
- class objectrl.models.sac.SACCritic(config: MainConfig, dim_state: int, dim_act: int)
Bases: CriticEnsemble
SAC critic ensemble handling Bellman target computation and updates.
- Parameters:
config (MainConfig) – Configuration object.
dim_state (int) – State space dimensions.
dim_act (int) – Action space dimensions.
- _gamma
Discount factor for future rewards.
- Type:
float
- __init__(config: MainConfig, dim_state: int, dim_act: int) -> None
Initializes the critic ensemble.
- Parameters:
config (MainConfig) – Configuration object with model parameters.
dim_state (int) – Dimension of the state space.
dim_act (int) – Dimension of the action space.
- Returns:
None
- get_bellman_target(reward: Tensor, next_state: Tensor, done: Tensor, actor: SACActor) -> Tensor
Computes target Q-values using the entropy-regularized Bellman backup.
- Parameters:
reward (Tensor) – Reward batch.
next_state (Tensor) – Next state batch.
done (Tensor) – Done flags batch.
actor (SACActor) – Actor network for next action sampling.
- Returns:
Target Q-values for critic training.
- Return type:
Tensor
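The entropy-regularized backup can be illustrated with a small list-based sketch. It assumes the target critics are reduced with a minimum and that `done` is 1.0 for terminal transitions. The discount `gamma = 0.99` is only an illustrative default; SACCritic reads its discount into `_gamma` from the configuration. The function name `sac_bellman_target` is hypothetical.

```python
def sac_bellman_target(reward: list[float], done: list[float],
                       next_q_values: list[list[float]], next_logprobs: list[float],
                       alpha: float, gamma: float = 0.99) -> list[float]:
    """y_i = r_i + gamma * (1 - d_i) * (min_j Qbar_j(s'_i, a'_i) - alpha * log pi(a'_i|s'_i)).

    next_q_values[j][i] is target critic j's value at (s'_i, a'_i), with a'_i ~ pi(.|s'_i).
    """
    targets = []
    for i, (r, d, lp) in enumerate(zip(reward, done, next_logprobs)):
        q_min = min(q[i] for q in next_q_values)  # pessimistic minimum over the ensemble
        soft_value = q_min - alpha * lp           # entropy bonus: -alpha * log pi
        targets.append(r + gamma * (1.0 - d) * soft_value)  # no bootstrap when done
    return targets


y = sac_bellman_target(reward=[1.0, 0.0], done=[0.0, 1.0],
                       next_q_values=[[2.0, 5.0], [3.0, 4.0]],
                       next_logprobs=[-1.0, -1.0], alpha=0.5)
print(y)
```

Each critic is then regressed toward `y` with the configured loss (`MSELoss` by default). In a PyTorch version, the target is computed without tracking gradients, so only the online critics are updated.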
- class objectrl.models.sac.SoftActorCritic(config: MainConfig, critic_type: type = SACCritic, actor_type: type = SACActor)
Bases: ActorCritic
Soft Actor-Critic agent combining SACActor and SACCritic, following Haarnoja et al. (2018), “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor”.
- _agent_name = 'SAC'
- __init__(config: MainConfig, critic_type: type = SACCritic, actor_type: type = SACActor) -> None
Initializes SAC agent.
- Parameters:
config (MainConfig) – Configuration dataclass instance.
critic_type (type) – Critic class type.
actor_type (type) – Actor class type.
- Returns:
None