Here is an introduction to the experiments we conducted on SafetyGym:
First, since no open-source offline dataset is available for SafetyGym, we constructed our own offline safety dataset based on the PointGoal environment. To build it, we trained the PPO-Lagrangian (PPO-lag) algorithm with different constraint thresholds and generated trajectories using the trained models. Specifically, we used constraint thresholds of 25, 40, and 80; the final cost and reward performance of these models is shown in the following figure.
We then saved these three models and generated 333/334 trajectories from each, for a total of 1,000 offline trajectories that compose our PointGoal dataset. The distribution of this dataset is shown in the figure below and demonstrates that the dataset retains a certain degree of distinguishability.
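As a rough illustration, the dataset assembly step can be sketched as below. The rollout function here is a placeholder (a real rollout would query the trained PPO-lag policies and the PointGoal environment); only the 333/334/333 split into 1,000 trajectories mirrors the actual procedure.

```python
import random

def rollout(policy_threshold, horizon=100, seed=None):
    """Hypothetical rollout: returns one trajectory as per-step lists of
    (obs, action, reward, cost). Values are placeholders; in practice the
    trained PPO-lag policy acts in the PointGoal environment."""
    rng = random.Random(seed)
    traj = {"obs": [], "act": [], "rew": [], "cost": []}
    for _ in range(horizon):
        traj["obs"].append([rng.random()] * 4)   # placeholder observation
        traj["act"].append([rng.random()] * 2)   # placeholder action
        traj["rew"].append(rng.random())
        # placeholder cost rate that loosely tracks the training threshold
        traj["cost"].append(1.0 if rng.random() < policy_threshold / 1000 else 0.0)
    return traj

# Three policies trained with cost limits 25, 40, 80; 333/334/333
# trajectories each give the 1,000-trajectory PointGoal dataset.
dataset = []
for threshold, n_traj in [(25, 333), (40, 334), (80, 333)]:
    for i in range(n_traj):
        dataset.append(rollout(threshold, seed=hash((threshold, i)) % 2**31))

print(len(dataset))  # 1000
```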
After generating the dataset, we evaluated our model, using a DT model as the baseline and setting the constraint threshold to d=30. The experimental results, shown in the figure below, demonstrate that our model still meets the constraint requirements in the SafetyGym environment, showcasing its potential for safety-critical applications.
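The evaluation criterion, checking the mean episode cost against the threshold d, can be written as a small helper. The function name and the sample costs below are illustrative, not from our codebase:

```python
def constraint_satisfied(episode_costs, d=30):
    """Check whether the mean episode cost stays within the constraint
    threshold d (d=30 in the SafetyGym experiment above)."""
    mean_cost = sum(episode_costs) / len(episode_costs)
    return mean_cost <= d, mean_cost

# Toy example with four hypothetical evaluation episodes.
ok, mean_cost = constraint_satisfied([28.0, 31.5, 26.0, 29.5])
print(ok, mean_cost)  # True 28.75
```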
| Dataset | Metric | Saformer | CoptiDICE |
| --- | --- | --- | --- |
| halfcheetah_medium | Reward | 4661.20 ± 52.46 | 4847 ± 50 |
| halfcheetah_medium | Cost | 4487.2 ± 6.05 | 4335 ± 23 |
| halfcheetah_medium | Limit | 4490 | 4490 |
| halfcheetah_medium_replay | Reward | 3019.17 ± 234.09 | 3314 ± 188 |
| halfcheetah_medium_replay | Cost | 4289.54 ± 6.31 | 4412 ± 64 |
| halfcheetah_medium_replay | Limit | 4300 | 4300 |
| halfcheetah_medium_expert | Reward | 11000.37 ± 141.93 | 482 ± 149 |
| halfcheetah_medium_expert | Cost | 4296.87 ± 9.92 | 3141 ± 125 |
| halfcheetah_medium_expert | Limit | 4300 | 4300 |
Here, we compare our method with CoptiDICE. Specifically, we ran CoptiDICE on the three halfcheetah datasets; the results are shown in the table above. They demonstrate that our method outperforms CoptiDICE on the halfcheetah_medium_replay and halfcheetah_medium_expert tasks, which supports the effectiveness of our model.
In particular, we note that halfcheetah_medium_expert is a more challenging dataset, and learning the expert policy from it is a relatively difficult task. On this task, Saformer demonstrated outstanding performance, while CoptiDICE failed. We believe that the reason for Saformer's weaker performance on the halfcheetah_medium task is that the distribution of
Note: We currently support the following algorithms: [Saformer, BCQ-L, CPQ, DT, CoptiDICE, CRR]. If you are interested, we can also conduct additional supplementary experiments.
1. Subsequence length
| Model | Metric | K=2 | K=5 | K=10 | K=20 | K=40 |
| --- | --- | --- | --- | --- | --- | --- |
| Saformer | Reward | 4555 ± 117 | 4601 ± 120 | 4632 ± 94 | 4713 ± 26 | 4728 ± 75 |
| Saformer | Cost | 4474 ± 17 | 4476 ± 23 | 4479 ± 20 | 4498 ± 2 | 4506 ± 12 |
| Saformer | Limit | 4503 | 4503 | 4503 | 4503 | 4503 |
In this section, we conducted hyperparameter experiments on the Transformer context length K, using the 20th percentile of the halfcheetah_medium dataset. We chose this dataset because the trajectories in halfcheetah have a fixed length, unlike the variable-length trajectories of hopper and walker2d, which makes it better suited for experiments on K. We evaluated K values of [2, 5, 10, 20, 40], as shown in the table above. For every value of K except 40, Saformer satisfies our constraint settings, which demonstrates the robustness of Saformer to the hyperparameter K.
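For clarity, a minimal sketch of how the context length K enters: a decision-transformer-style model conditions only on the most recent K (return-to-go, state, action) triples. The helper below is illustrative; a real implementation would also pad short windows and embed each modality.

```python
def make_context(states, actions, returns_to_go, K):
    """Slice out the length-K context window the model conditions on:
    the most recent K (rtg, state, action) triples. Sequences shorter
    than K are returned whole (padding omitted for simplicity)."""
    start = max(0, len(states) - K)
    return returns_to_go[start:], states[start:], actions[start:]

# Toy 30-step trajectory with placeholder per-step values.
states = list(range(30))
actions = list(range(30))
rtg = list(range(30, 0, -1))  # placeholder returns-to-go

r, s, a = make_context(states, actions, rtg, K=20)
print(len(s))  # 20
```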
2. Penalty factor
| Model | Metric | λ=0 | λ=0.25 | λ=0.5 | λ=1.0 | λ=2.0 |
| --- | --- | --- | --- | --- | --- | --- |
| Saformer | Reward | 4578 ± 100 | 4683 ± 67 | 4574 ± 178 | 4620 ± 79 | 4599 ± 37 |
| Saformer | Cost | 4459 ± 23 | 4491 ± 14 | 4471 ± 40 | 4487 ± 15 | 4478 ± 47 |
| Saformer | Limit | 4503 | 4503 | 4503 | 4503 | 4503 |
In this section, we conducted hyperparameter experiments on the regularization (penalty) parameter λ. As the table above shows, Saformer satisfies the cost limit for every tested value of λ, which suggests that our method is also robust to this hyperparameter.
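The role of λ can be sketched as a weighted combination of the action loss and a penalty term. The exact form of the penalty in Saformer may differ, so treat this as an assumed shape rather than our actual objective:

```python
def combined_loss(action_loss, cost_penalty, lam):
    """Assumed combined objective: supervised action loss plus a
    lambda-weighted penalty (e.g. a cost-prediction or constraint
    regularizer). lam=0 disables the penalty entirely."""
    return action_loss + lam * cost_penalty

# Sweep the same lambda values as in the table above (toy loss values).
for lam in [0.0, 0.25, 0.5, 1.0, 2.0]:
    print(lam, combined_loss(1.0, 0.4, lam))
```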
In this section, we will verify the effectiveness of the posterior safety verification module in Saformer.
First, let us describe the two figures below. The blue line is the actual cost-to-go at each time step, obtained by backtracking over the completed trajectory; the yellow line is the CTG prompt value we design for the agent; and the green line is the CTG value predicted by the Saformer critic module. With the meaning of each line established, we state the following criteria:
(1) If the blue line is lower than the yellow line at time step 0, it means that the total cost of the entire trajectory is lower than the constraint threshold we set, that is, the entire trajectory meets the constraint condition.
(2) We expect the blue line to be slightly lower than the yellow line, but not too much lower. If the blue line is below the yellow line, the trajectory meets the safety constraint, as stated in (1); however, we do not want it to be far below. Our work assumes a weak linear correlation between the agent's rewards and its costs, so a very low blue line leads to worse reward performance.
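The blue line described in (1) and (2) can be computed with a single backtracking pass over the finished trajectory:

```python
def cost_to_go(step_costs):
    """Blue line: at each time step t, the sum of the actual costs from
    t to the end of the trajectory, computed by backtracking once the
    trajectory is complete."""
    ctg, running = [], 0.0
    for c in reversed(step_costs):
        running += c
        ctg.append(running)
    return ctg[::-1]

# Toy 4-step trajectory; criterion (1) checks ctg[0] against the prompt.
ctg = cost_to_go([1.0, 0.0, 2.0, 1.0])
print(ctg)          # [4.0, 3.0, 3.0, 1.0]
print(ctg[0] <= 5)  # total trajectory cost within a prompt of 5
```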
With these points in mind, we can analyze the figures as follows:
Left: We train the Saformer Critic but do not use it to modify the actions of the Saformer Actor. Here the average value of the blue line is higher than that of the yellow line, indicating that the policy cannot satisfy the safety constraint. The reason is that although the Saformer Critic can predict that an action is dangerous at a given time step (when the green line rises above the yellow line, i.e., the predicted CTG exceeds the CTG prompt), without the modification step there is no mechanism to correct that action, so the unsafe action is still executed.
Right: We train the Critic and use it to modify the actions. Now the blue line stays very close to the yellow line, indicating that Saformer with the Critic modification finds the most appropriate actions to satisfy the safety constraint.
Through the above analysis and experiments, we conclude that our Critic module is reasonable and effective.
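A minimal sketch of the posterior safety verification step: the critic's predicted CTG for each candidate action is compared with the CTG prompt, and, per criterion (2), we keep the safe action whose predicted cost is closest to the budget (slightly below the prompt, not far below). The toy critic and candidate set are purely illustrative.

```python
def verify_action(candidates, predict_ctg, ctg_prompt):
    """Posterior safety verification, sketched: filter candidate actions
    by whether the critic-predicted cost-to-go stays within the CTG
    prompt. `predict_ctg` stands in for the Saformer Critic."""
    safe = [a for a in candidates if predict_ctg(a) <= ctg_prompt]
    if safe:
        # prefer the safe action closest to the budget: slightly below
        # the prompt, not far below (criterion (2))
        return max(safe, key=predict_ctg)
    # no candidate predicted safe: fall back to the least costly one
    return min(candidates, key=predict_ctg)

# Toy critic: action magnitude drives the predicted cost-to-go.
critic = lambda a: 10 * abs(a)
action = verify_action([0.1, 0.4, 0.9], critic, ctg_prompt=5.0)
print(action)  # 0.4  (largest action still predicted safe: 10*0.4 <= 5)
```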