
Optimization Algorithms (old)

JakeMikouchi edited this page Jul 4, 2025 · 1 revision

Genetic Algorithm (GA)

GA is based on the process of natural selection and the survival of the fittest. Initially, a population of $N$ random solutions $x^0_1 \dots x^0_N$ is generated. These solutions constitute Generation 0, and each can be seen as a chromosome with $d$ genes, i.e., $x^0_i \in \mathcal{X}^d$. The GA optimization then iterates between Evaluation, Selection, and Reproduction. In the Evaluation step, every solution is evaluated to compute its fitness. In the Selection step, some solutions are selected, based on their fitness, to become parents for the next generation; a tournament selection approach is used for this step. In the Reproduction step, the selected parent solutions mate using a cross-over operator to produce a child. Mutation of one of the child's genes can occur with some probability to add randomness to the process. A total of $N$ children are obtained that constitute the next generation $x^1_1 \dots x^1_N$. This procedure is repeated until a predefined number of generations $G$ is reached. The total number of code evaluations, excluding the initial population, is $N \times G$.
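The loop above can be sketched in Python. This is a minimal illustration with binary genes and a hypothetical fitness function, not MOF's implementation:

```python
import random

def tournament_select(pop, fitness, k=2):
    """Return the fittest of k randomly chosen individuals."""
    contenders = random.sample(range(len(pop)), k)
    return pop[max(contenders, key=lambda i: fitness[i])]

def crossover(parent_a, parent_b):
    """Single-point cross-over of two chromosomes."""
    point = random.randrange(1, len(parent_a))
    return parent_a[:point] + parent_b[point:]

def mutate(child, gene_pool, rate=0.1):
    """With probability `rate`, replace one gene with a random value."""
    child = list(child)
    if random.random() < rate:
        child[random.randrange(len(child))] = random.choice(gene_pool)
    return child

def ga(fitness_fn, gene_pool, d=8, N=20, G=50):
    """Generation 0 is random; each later generation holds N children."""
    pop = [[random.choice(gene_pool) for _ in range(d)] for _ in range(N)]
    for _ in range(G):
        fit = [fitness_fn(x) for x in pop]
        pop = [mutate(crossover(tournament_select(pop, fit),
                                tournament_select(pop, fit)), gene_pool)
               for _ in range(N)]
    return max(pop, key=fitness_fn)

random.seed(0)
best = ga(lambda x: sum(x), gene_pool=[0, 1])  # toy objective: maximize the number of 1-genes
```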

Simulated Annealing (SA)

SA is based on the slow cooling of a system from a high to a low temperature to reach a state of minimum energy. SA first generates an initial random solution to the optimization problem, $x^0$, and then generates new solutions through local random perturbations of the current best solution. This is very similar to the mutation process in GA.

The new solution $x^1$ is accepted based on the acceptance probability function: $P_a = 1$ if $f(x^1) > f(x^0)$ and $P_a = \exp \left( -\frac{f(x^0)-f(x^1)}{T} \right)$ otherwise. Here, $T$ is the SA temperature, which determines the amount of randomness in accepting worse solutions. If the new solution yields a higher objective function value (lower cost), it is always accepted. If the new solution yields a smaller objective function value (higher cost), then $0 < P_a < 1$ and the new solution is accepted only with some probability. This probability depends on the temperature, which varies with each iteration according to a decreasing cooling schedule. The initial temperature is a user-defined large value that translates to a large probability of accepting worse solutions at the beginning of the optimization, allowing more of the solution space to be explored. As the cooling schedule progresses, the temperature decreases, leading to lower probabilities of accepting worse solutions, so the algorithm mainly explores the region of the solution space close to the current best solution.
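The acceptance rule can be sketched as follows (a minimal illustration for a maximization problem):

```python
import math, random

def accept(f_new, f_cur, T):
    """SA acceptance test for a maximization problem: always take
    improvements; take worse solutions with probability exp(-(f_cur - f_new)/T)."""
    if f_new > f_cur:
        return True
    return random.random() < math.exp(-(f_cur - f_new) / T)
```

At high $T$ the exponential is close to 1 and worse solutions are often accepted; as $T \to 0$ the probability vanishes and the search becomes greedy.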

The performance of SA depends strongly on the cooling schedule. Currently, only the exponentially decreasing cooling schedule is available in MOF:

$$ T^{i+1} = \alpha T^i $$

where $T^i$ is the current temperature, $T^{i+1}$ is the new temperature, $i$ is the number of cooling steps, and $\alpha$ is the rate at which the temperature decreases.

The SA process described above is repeated until the desired number of SA iterations $N$ is reached. The total number of code evaluations, excluding the initial solution, is $N$.
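Putting the pieces together, a minimal SA sketch with the exponential cooling schedule above (the objective and perturbation here are hypothetical, purely for illustration):

```python
import math, random

def simulated_annealing(f, perturb, x0, T0=10.0, alpha=0.95, N=500):
    """Maximize f, cooling the temperature as T_{i+1} = alpha * T_i."""
    x, best, T = x0, x0, T0
    for _ in range(N):
        x_new = perturb(x)
        # accept improvements always, worse moves with probability exp(-df/T)
        if f(x_new) > f(x) or random.random() < math.exp(-(f(x) - f(x_new)) / T):
            x = x_new
        if f(x) > f(best):
            best = x
        T *= alpha  # exponential cooling step
    return best

random.seed(42)
# hypothetical objective: a smooth peak at x = 3
best = simulated_annealing(lambda x: -(x - 3) ** 2,
                           lambda x: x + random.uniform(-0.5, 0.5), x0=0.0)
```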

Parallel Simulated Annealing (PSA)

PSA is a form of SA created to address one of SA's weaknesses: computational efficiency. It uses several processor cores to run SA in parallel, drastically reducing the computational time. While PSA is still a form of SA, the algorithm itself must be modified to support the use of multiple CPUs.

(Figure: PSA framework workflow)

As implemented in MOF, PSA follows the workflow above. During initialization, the algorithm creates [A] random solutions, stores them as the initial buffer, and uses them to compute an initial temperature. The algorithm then executes [Np] SA runs in parallel, each at a constant temperature. The initial solution for each SA run is selected from the buffer with a probability, and each run iterates for [L] code evaluations. After the SA runs finish, the solutions found by each run are evaluated and the buffer is updated: the best solutions found by each SA run are added, and poor solutions that have been outperformed are removed, while maintaining a size of [A]. At this point the temperature value, [Ct], is also updated. The algorithm then starts another set of [Np] parallel SA runs with the updated temperature and buffer. This cycle repeats [T] times. The total number of code evaluations is $T \times N_p \times L$. Once the algorithm completes, the most optimal solution can be found in the MOF optimization track file.
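The workflow can be sketched schematically in Python. This is a serial stand-in for the [Np] parallel runs, minimizing a hypothetical cost; a fixed geometric factor stands in for the LAM temperature update:

```python
import math, random, statistics

def psa(cost, perturb, sample, Np=4, L=25, A=8, T_cycles=10, alpha=2.0):
    """Schematic PSA outer loop (serial stand-in for the Np parallel runs)."""
    buffer = [sample() for _ in range(A)]            # [A] random solutions
    costs = [cost(x) for x in buffer]
    Ct = alpha * statistics.pstdev(costs)            # initial temperature
    for _ in range(T_cycles):                        # [T] temperature cycles
        results = []
        for _ in range(Np):                          # [Np] SA runs at constant Ct
            x = random.choices(buffer, weights=[math.exp(-c / Ct) for c in costs])[0]
            for _ in range(L):                       # [L] code evaluations per run
                x_new = perturb(x)
                dc = cost(x_new) - cost(x)
                if dc < 0 or random.random() < math.exp(-dc / Ct):
                    x = x_new
            results.append(x)
        # buffer update: keep the A lowest-cost solutions seen so far
        buffer = sorted(buffer + results, key=cost)[:A]
        costs = [cost(x) for x in buffer]
        Ct *= 0.9  # stand-in for the LAM temperature update
    return buffer[0]

random.seed(1)
best = psa(lambda x: x * x,                          # hypothetical cost function
           lambda x: x + random.uniform(-1, 1),
           lambda: random.uniform(-10, 10))
```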

Theory

The following equations are used in a PSA run:

  • Cost / Fitness function - used to evaluate the effectiveness of a given solution. For the sample problems, a fitness function that is determined by the cycle length is used.

$$ c(x) = CL - W_B( \Delta B) -W_{F_{\Delta H}}(\Delta F_{\Delta H})-W_{F_q}(\Delta F_q) $$

  • Probability function - within each SA run, a candidate solution that worsens the cost may still be accepted with a probability. The function that defines this probability is the probability function.

$$ p = \exp\Biggl(\frac{-\Delta c(x)}{Ct}\Biggr) $$

  • SA initial solution - When starting SA runs in parallel, each run is able to select an initial solution from the buffer using a probability. The probability is determined using the following function.

$$ p_a = \frac{\exp\Big(\frac{-C_a}{Ct}\Big)}{\sum\limits_{a=1}^A \exp\Big(\frac{-C_a}{Ct}\Big)}$$

  • Initial Temperature - At the start of a PSA run, the initial temperature is determined using the following function.

$$ Ct_0 = \alpha \sigma $$

$\alpha$ = initialization parameter

$\sigma$ = standard deviation of $c(x)$ during initialization

  • LAM temperature update - After a cycle of parallel SA runs, the temperature is updated using the LAM cooling schedule which uses the following functions.

$$ s_{i+1} = s_i + \lambda\biggl(\frac{1}{\sigma(s_i)}\biggr)\biggl(\frac{1}{s_i^2\sigma^2(s_i)}\biggr)G(\rho(s_i)) $$

$$ G(\rho) = \frac{4\rho(1-\rho)^2}{(2-\rho)^2} $$

"LAM cooling schedule" [1].

Hyperparameters

PSA, like any meta-heuristic optimization algorithm, has various hyperparameters that must be selected by the user. In the MOF implementation of PSA, the main ones are:

  • Number of single-processor SA iterations (L) - Determines the number of SA iterations each CPU runs in a single generation. This value can be changed in the YAML file for PSA problems through the variable "optimization → population_size".

  • Number of temperature iterations (T) - Determines the number of SA cycles that will be initialized. This value can be changed in the YAML file for PSA problems through the variable "optimization → number_of_generations".

  • Buffer size (A) - Determines how many solutions are stored in the solution buffer. This value can be changed in the YAML file for PSA problems through the variable "optimization → buffer_length".

  • Initialization parameter ($\alpha$) - A parameter used to calculate the starting temperature when starting a PSA run. This value can be changed in the InitialTemp function in the simulatedAnnealing.py file.

  • Quality factor (qualityfactor) - A parameter used to adjust the cooling rate of the temperature, where decreasing this value slows the cooling rate, and vice versa. This value can be changed in the LAM function in the simulatedAnnealing.py file.

Reinforcement Learning (RL)

RL techniques are framed within an environment that follows a Markov Decision Process (MDP), where an agent learns how to interact with the environment through trial and error across a notion of time hereafter called steps. At step $t$ the agent is able to observe the environment state $s_t$ with $s_t \in \mathcal{S}$ the set of all possible states. The agent then selects an action $a_t$ with $a_t \in \mathcal{A}(s_t)$ the set of all possible actions at the state $s_t$. The selection of the action is performed using the policy $\pi(a | s)$, which can be seen as the probability of taking action $a_t=a$ at the state $s_t=s$. After the action, the agent receives a reward from the environment $R_{t+1} \in \mathcal{R}$ and the new state $s_{t+1}$ is observed.
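The agent-environment interaction can be sketched generically (the environment here is a toy chain, not an MOF problem):

```python
def run_episode(env_step, env_reset, policy, max_steps=100):
    """Generic MDP interaction loop: observe s_t, pick a_t ~ pi(a|s_t),
    receive the reward R_{t+1} and the next state s_{t+1}."""
    s = env_reset()
    total_reward = 0.0
    for _ in range(max_steps):
        a = policy(s)
        s, r, done = env_step(s, a)
        total_reward += r
        if done:
            break
    return total_reward

# toy chain environment: walk right from state 0, reward 1.0 on reaching state 3
total = run_episode(lambda s, a: (s + a, 1.0 if s + a == 3 else 0.0, s + a == 3),
                    lambda: 0,
                    lambda s: 1)
```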

Proximal Policy Optimization (PPO)

PPO is an on-policy technique that directly optimizes the policy function $\pi$. The policy is approximated by a DNN with parameters $\theta$ as $\pi_{\theta}(a|s)$. The goal of PPO is essentially to tune the parameters $\theta$ to produce a policy that generates high rewards. This is achieved by sampling different trajectories using some fixed parameters and then updating the parameters through gradient ascent in the direction that maximizes the rewards. One challenge for policy gradient methods is that there is no information about how large the update should be in the identified direction. Using a predefined learning rate is inefficient and can lead to large steps that inhibit learning. PPO addresses this by restricting the updates to a clipped region around the current policy. PPO can be used for both discrete and continuous actions. In this work, the PPO implementation in stable-baselines3 with discrete actions is used and its main hyperparameters are: the DNN architecture for the policy ($Net_{\pi}$) and value ($Net_{v}$) networks; the learning rate ($\eta$); the discount factor ($\gamma$); the minibatch size ($B$); the number of steps before each model update ($N_{train}$); and the number of epochs used in the model updates ($N_{e}$).
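The clipped update can be illustrated for a single sample. This is a sketch of the standard PPO-Clip surrogate objective, not of the stable-baselines3 internals:

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantage, eps=0.2):
    """Clipped surrogate for one (state, action) sample:
    L = min(r * A, clip(r, 1-eps, 1+eps) * A), with r = pi_new / pi_old.
    Clipping removes the incentive to move the policy far from the old one."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```

With a positive advantage, doubling the action probability yields no extra objective beyond the $1+\epsilon$ clip, which is what keeps the update inside the trust region.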

Soft Actor Critic (SAC)

SAC is an off-policy algorithm that tries to combine the benefits of DQN and PPO: the former has better sample efficiency since it can use past experiences, while the latter is more stable and has better convergence. SAC modifies the typical RL objective of maximizing future rewards to also maximize the entropy of the policy. The entropy can be seen as a measure of randomness in the policy, allowing exploration to be embedded in the RL objective. SAC can be used only for continuous actions. In this work, the SAC implementation in stable-baselines3 is used, with main hyperparameters being: the DNN architecture for the actor ($Net_{a}$) and critic ($Net_{c}$) networks; the learning rate ($\eta$); the discount factor ($\gamma$); the minibatch size ($B$); the number of steps collected before learning starts ($N_{start}$); and the number of steps before each model update ($F_{train}$).
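The entropy-regularized objective can be illustrated with the soft TD target used in standard SAC formulations. This is a sketch, not the stable-baselines3 internals; the clipped double-Q trick and the entropy temperature `alpha` are assumptions from the standard formulation, not from this document:

```python
def sac_target(reward, q1_next, q2_next, logp_next, gamma=0.99, alpha=0.2, done=False):
    """Soft TD target: y = r + gamma * (min(Q1, Q2) - alpha * log pi(a'|s')).
    The -alpha * log pi term is the entropy bonus that embeds exploration
    directly in the objective; alpha is the entropy temperature."""
    if done:
        return reward
    return reward + gamma * (min(q1_next, q2_next) - alpha * logp_next)
```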

Deep Q Networks (DQN)

DQN is an off-policy technique that combines Q-Learning with Deep Neural Networks (DNN). In Q-learning, the goal is not to learn the policy but the Q function instead. For every policy $\pi$, the Q function is defined as $Q^{\pi}(a, s)$ and gives a value to every action $a$ at a specific state $s$. This value represents the discounted rewards obtained starting from state $s$, selecting the action $a$, and following the policy $\pi$ thereafter. In DQN, the Q function is approximated by a DNN that is updated along the optimization at specific intervals. Additional techniques such as Experience Replay, which re-uses some of the previous interactions, can further improve the DNN learning. DQN can be used for discrete actions only. In this work, the vanilla DQN in stable-baselines3 is used with main hyperparameters being: the DNN architecture ($Net_q$); the learning rate ($\eta$); the discount factor ($\gamma$); the minibatch size ($B$); the number of steps that are collected before the learning starts ($N_{start}$); the number of steps before each model update ($F_{train}$); and the number of steps before the target network is updated ($N_{target}$).
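The target the Q-network is trained toward can be sketched as follows (a minimal illustration of the Q-learning TD target, not the stable-baselines3 internals):

```python
def dqn_target(reward, next_q_values, gamma=0.99, done=False):
    """TD target the Q-network is regressed toward:
    y = r + gamma * max_a Q_target(s', a), or just r at episode end."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)
```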

References

[1] David J. Kropaczek (2008). Concept for Multi-cycle Nuclear Fuel Optimization Based on Parallel Simulated Annealing with Mixing of States.
