References
| | |
| --- | --- |
| Module | LearningTraining |
| Header | /Engine/Plugins/Experimental/LearningAgents/Source/LearningTraining/Public/LearningPPOTrainer.h |
| Include | #include "LearningPPOTrainer.h" |
Syntax
struct FPPOTrainerTrainingSettings
Remarks
Settings used for training with PPO.
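For orientation, these settings feed the standard PPO clipped-surrogate objective with Generalized Advantage Estimation (Schulman et al.). The formulation below is the usual one from the literature, and the mapping of symbols to settings is an interpretation of the variable descriptions rather than text taken from the header.

$$
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right],
\qquad
\hat{A}_t = \sum_{l \ge 0} (\gamma\lambda)^l\,\delta_{t+l},
\quad
\delta_t = r_t + \gamma\,V(s_{t+1}) - V(s_t)
$$

Here $r_t(\theta)$ is the ratio between the new and old policy probabilities, $\epsilon$ corresponds to EpsilonClip, $\gamma$ to DiscountFactor, and $\lambda$ to GaeLambda; ActionSurrogateWeight scales the clipped surrogate term, while ActionEntropyWeight scales an additional entropy bonus.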
Variables
| Type | Name | Description |
| --- | --- | --- |
| float | ActionEntropyWeight | Weighting used for the entropy bonus. |
| float | ActionRegularizationWeight | Weight used to regularize actions. Larger values will encourage exploration and smaller actions, but too large will cause noisy actions centered around zero. |
| float | ActionSurrogateWeight | Weight for the loss used to train the policy via the PPO surrogate objective. |
| float | AdvantageMax | The maximum advantage to allow. |
| float | AdvantageMin | The minimum advantage to allow. |
| bool | bAdvantageNormalization | When true, advantages are normalized. |
| bool | bSaveSnapshots | Whether to save snapshots of the trained networks every 1000 iterations. |
| bool | bUseGradNormMaxClipping | If true, gradient norm max clipping is applied to updates. Set this to true if training is unstable; otherwise leave it false. |
| bool | bUseTensorboard | Whether to use TensorBoard for logging and tracking the training progress. |
| uint32 | CriticBatchSize | Batch size to use for training the critic. Large batch sizes are much more computationally efficient when training on the GPU. |
| uint32 | CriticWarmupIterations | Number of iterations of training to perform to warm up the critic. |
| ETrainerDevice | Device | Which device to use for training. |
| float | DiscountFactor | The discount factor causes future rewards to be scaled down so that the policy will favor near-term rewards over potentially uncertain long-term rewards. |
| float | EpsilonClip | Clipping ratio to apply to policy updates. |
| float | GaeLambda | Lambda used in Generalized Advantage Estimation; larger values will tend to assign more credit to recent actions. |
| float | GradNormMax | The maximum gradient norm to clip updates to. |
| uint32 | IterationNum | Number of iterations to train the network for. |
| uint32 | IterationsPerGather | Number of training iterations to perform per buffer of experience gathered. |
| float | LearningRateCritic | Learning rate of the critic network. |
| float | LearningRateDecay | Amount by which to multiply the learning rate every 1000 iterations. |
| float | LearningRatePolicy | Learning rate of the policy network. Typical values are between 0.001f and 0.0001f. |
| uint32 | PolicyBatchSize | Batch size to use for training the policy. Large batch sizes are much more computationally efficient when training on the GPU. |
| uint32 | PolicyWindow | The number of consecutive steps of observations and actions over which to train the policy. |
| float | ReturnRegularizationWeight | Weight used to regularize predicted returns. Encourages the critic not to over- or under-estimate returns. |
| uint32 | Seed | Random seed to use for training. |
| int32 | TrimEpisodeEndStepNum | Number of steps to trim from the end of each episode during training. |
| int32 | TrimEpisodeStartStepNum | Number of steps to trim from the start of each episode during training. |
| float | WeightDecay | Amount of weight decay to apply to the network. |
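A minimal configuration sketch follows, assuming the field names and types documented in the table above. The numeric values are illustrative hyper-parameters rather than engine defaults, and the namespace and ETrainerDevice enumerators may differ between engine versions.

```cpp
#include "LearningPPOTrainer.h"

// Illustrative only: values are example hyper-parameters, not engine defaults.
// Depending on engine version, the struct may live in the UE::Learning namespace.
FPPOTrainerTrainingSettings MakeExamplePPOSettings()
{
	FPPOTrainerTrainingSettings Settings;

	// Optimization
	Settings.IterationNum = 100000;         // total training iterations
	Settings.LearningRatePolicy = 0.0001f;  // within the 0.001f - 0.0001f range noted above
	Settings.LearningRateCritic = 0.001f;
	Settings.LearningRateDecay = 0.99f;     // applied every 1000 iterations
	Settings.WeightDecay = 0.001f;

	// PPO terms
	Settings.EpsilonClip = 0.2f;            // clipping ratio for policy updates
	Settings.DiscountFactor = 0.99f;        // favor near-term rewards
	Settings.GaeLambda = 0.95f;             // Generalized Advantage Estimation lambda
	Settings.ActionEntropyWeight = 0.01f;   // entropy bonus weighting

	// Batching
	Settings.PolicyBatchSize = 1024;        // large batches are efficient on the GPU
	Settings.CriticBatchSize = 4096;

	// Logging and device
	Settings.bUseTensorboard = false;
	Settings.Device = ETrainerDevice::GPU;  // assumes the enum exposes a GPU value

	return Settings;
}
```

Since the struct presumably default-initializes its members, only the fields you want to override need to be assigned; the populated settings would then be passed to whichever PPO trainer entry point the surrounding module exposes.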