ZQ2413262560/TMP

1. Additional experiments

1.1 Experiments on SafetyGym

Here is an introduction to the experiments we conducted on SafetyGym:

Firstly, as there is no open-source offline dataset available for SafetyGym, we created our own offline safety dataset based on the PointGoal environment. To construct this dataset, we trained the PPO-Lagrangian (PPO-lag) algorithm with different constraint thresholds and generated trajectories using the trained models. Specifically, we used constraint thresholds of 25, 40, and 80, and the final cost and reward performance of these models is shown in the figure below.

We then saved these three models and generated 333 or 334 trajectories from each, resulting in a total of 1000 offline trajectories that make up our PointGoal dataset. The distribution of this dataset is shown in the figure below, which demonstrates that the dataset still retains a certain degree of distinguishability.
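For concreteness, below is a minimal sketch of this rollout-and-collect step. It assumes a Gymnasium-style PointGoal environment that reports the per-step cost in `info["cost"]` (as SafetyGym-like environments typically do) and takes the three trained PPO-lag policies as plain callables; the function name and data layout are ours, not this repository's exact code.

```python
import numpy as np

def collect_offline_dataset(env, policies, episodes_per_policy=(333, 333, 334)):
    """Roll out trained policies to build an offline safety dataset (sketch).

    `policies` is a list of callables mapping an observation to an action,
    e.g. the three PPO-lag policies trained with cost thresholds 25/40/80.
    Assumes a Gymnasium-style environment that reports the per-step cost
    in info["cost"].
    """
    trajectories = []
    for policy, n_episodes in zip(policies, episodes_per_policy):
        for _ in range(n_episodes):
            obs, _ = env.reset()
            traj = {"obs": [], "act": [], "rew": [], "cost": []}
            done = False
            while not done:
                act = policy(obs)
                next_obs, rew, terminated, truncated, info = env.step(act)
                traj["obs"].append(obs)
                traj["act"].append(act)
                traj["rew"].append(rew)
                traj["cost"].append(info.get("cost", 0.0))
                obs = next_obs
                done = terminated or truncated
            trajectories.append({k: np.asarray(v) for k, v in traj.items()})
    return trajectories
```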

After generating the dataset, we evaluated our model against a DT baseline, setting the constraint threshold to d = 30. The experimental results, shown in the figure below, demonstrate that our model can still meet the constraint requirements in the SafetyGym environment. This experiment showcases the potential of our model for safety-critical applications on SafetyGym.

1.2 Comparison with CoptiDICE

| Dataset | Metric | Saformer | CoptiDICE |
|---|---|---|---|
| halfcheetah_medium | Reward | 4661.20 ± 52.46 | 4847 ± 50 |
| | Cost | 4487.2 ± 6.05 | 4335 ± 23 |
| | Limit | 4490 | 4490 |
| halfcheetah_medium_replay | Reward | 3019.17 ± 234.09 | 3314 ± 188 |
| | Cost | 4289.54 ± 6.31 | 4412 ± 64 |
| | Limit | 4300 | 4300 |
| halfcheetah_medium_expert | Reward | 11000.37 ± 141.93 | 482 ± 149 |
| | Cost | 4296.87 ± 9.92 | 3141 ± 125 |
| | Limit | 4300 | 4300 |

Here, we compared our method with CoptiDICE. Specifically, we ran CoptiDICE on the three halfcheetah datasets, and the results are shown in the table above. They demonstrate that our method outperforms CoptiDICE on the halfcheetah_medium_replay and halfcheetah_medium_expert tasks, which supports the effectiveness of our model.

In particular, we note that halfcheetah_medium_expert is a more challenging dataset, and learning the expert policy from it is relatively difficult. On this task, Saformer demonstrated outstanding performance, while CoptiDICE failed. We believe the reason for Saformer's weaker performance on the halfcheetah_medium task is that the distribution of $\hat C$ and $\hat R$ in this dataset is too narrow (see Appendix A of the paper), which prevented our model from achieving its best performance.

Note: We currently support the following algorithms: [Saformer, BCQ-L, CPQ, DT, CoptiDICE, CRR]. If you are interested, we can also conduct additional supplementary experiments.

1.3 Hyperparameter Analysis

1. subsequence length $K$

| Model | Metric | K=2 | K=5 | K=10 | K=20 | K=40 |
|---|---|---|---|---|---|---|
| Saformer | Reward | 4555 ± 117 | 4601 ± 120 | 4632 ± 94 | 4713 ± 26 | 4728 ± 75 |
| | Cost | 4474 ± 17 | 4476 ± 23 | 4479 ± 20 | 4498 ± 2 | 4506 ± 12 |
| | Limit | 4503 | 4503 | 4503 | 4503 | 4503 |

In this section, we conducted hyperparameter experiments on the context length $K$ of the Transformer, using the 20th percentile of the Halfcheetah_medium dataset. We chose this dataset because the trajectories in Halfcheetah have a fixed length, unlike Hopper and Walker2d, whose trajectories vary in length, which makes it better suited to experiments on the hyperparameter $K$. We selected $K$ values of [2, 5, 10, 20, 40] and conducted experiments, as shown in the table above. We found that for all values of $K$ except 40, Saformer can satisfy our constraint settings, which demonstrates the robustness of Saformer to the hyperparameter $K$.
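To make concrete what the context length $K$ controls, here is a minimal sketch of how a trajectory might be cut into length-$K$ subsequences before being fed to the Transformer; the field layout and function name are assumptions on our part, not the repository's exact data pipeline.

```python
import numpy as np

def sample_subsequence(traj, K, rng=None):
    """Sample one length-K context window from a trajectory (illustrative).

    `traj` is assumed to be a dict of aligned per-timestep arrays, e.g.
    states, actions, rewards, costs, return-to-go and cost-to-go targets.
    If the trajectory is shorter than K, the window is simply shorter and
    would be padded before being fed to the Transformer.
    """
    rng = np.random.default_rng() if rng is None else rng
    T = len(next(iter(traj.values())))
    start = int(rng.integers(0, max(T - K, 0) + 1))
    return {key: arr[start:start + K] for key, arr in traj.items()}
```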

2. penalty factor $\lambda$.

| Model | Metric | λ=0 | λ=0.25 | λ=0.5 | λ=1.0 | λ=2.0 |
|---|---|---|---|---|---|---|
| Saformer | Reward | 4578 ± 100 | 4683 ± 67 | 4574 ± 178 | 4620 ± 79 | 4599 ± 37 |
| | Cost | 4459 ± 23 | 4491 ± 14 | 4471 ± 40 | 4487 ± 15 | 4478 ± 47 |
| | Limit | 4503 | 4503 | 4503 | 4503 | 4503 |

In this section, we conducted hyperparameter experiments on the regularization parameter $\lambda$ of the Critic, which controls the weight of the non-decreasing loss on the CTG predicted by the Critic. To ensure consistency, we also used the 20th percentile of the Halfcheetah_medium dataset. We set $\lambda$ to [0, 0.25, 0.5, 1.0, 2.0] and conducted experiments. The results show that changing $\lambda$ has some impact on whether the constraints are satisfied over the entire trajectory, but overall performance does not differ much when $\lambda$ is between 0 and 1, demonstrating the robustness of Saformer to the regularization parameter $\lambda$.
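To illustrate the role of $\lambda$, below is a minimal sketch of a critic objective with an auxiliary monotonicity term, assuming the critic regresses per-timestep CTG targets; the variable names and the exact form of the regularizer are assumptions on our part.

```python
import torch
import torch.nn.functional as F

def critic_loss(pred_ctg, target_ctg, lam):
    """Illustrative critic objective: CTG regression plus a lambda-weighted
    monotonicity penalty.

    pred_ctg, target_ctg: tensors of shape (batch, K) with the predicted
    and ground-truth cost-to-go at each timestep of a length-K subsequence.
    The penalty charges timesteps where the predicted CTG increases, i.e.
    it pushes the prediction to be non-increasing over time. This is our
    reading of the "non-decreasing loss" described above; the regularizer
    in the paper may be defined differently.
    """
    regression = F.mse_loss(pred_ctg, target_ctg)
    monotonicity = F.relu(pred_ctg[:, 1:] - pred_ctg[:, :-1]).mean()
    return regression + lam * monotonicity
```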

2. Effectiveness of the Posterior Safety Verification

In this section, we will verify the effectiveness of the posterior safety verification module in Saformer.

First, let us describe the two figures below. In these figures, the blue line represents the actual cost remaining at each time step, obtained by backtracking after the agent has completed the entire trajectory; the yellow line represents the CTG prompt value we designed for the agent; and the green line represents the CTG value predicted by the Saformer critic module. Having explained the meaning of each line, we can state the following assumptions:

(1) If the blue line is lower than the yellow line at time step 0, it means that the total cost of the entire trajectory is lower than the constraint threshold we set, that is, the entire trajectory meets the constraint condition.

(2) We expect the blue line to be slightly lower than the yellow line, but not too much lower. If the blue line is below the yellow line, the trajectory meets the safety constraint, as stated in (1); however, we do not want the blue line to be too low. This is because our work assumes a weak linear correlation between the agent's rewards and costs, so a very low blue line tends to come with worse reward.

With the above assumptions in mind, we can analyze the images as follows:

Left: We only train the Saformer Critic without using it to modify the actions of the Saformer Actor. Here, we can observe that the average value of the blue line is higher than that of the yellow line, indicating that the policy cannot satisfy the safety constraint. The reason is that although the Saformer Critic can predict that an action is dangerous at a certain time step (when the green line is above the yellow line, i.e., $\hat C > CTG$), it cannot modify that action, resulting in a trajectory that violates the safety constraint.

Right: We train the Critic and use it to modify the actions. Here, we can see that the blue line stays very close to the yellow line, indicating that Saformer with the Critic modification can find the most appropriate action to satisfy the safety constraint.
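For reference, here is a minimal sketch of what such a verification-and-modification step can look like, assuming the actor proposes several candidate actions and the critic scores each by its predicted cost-to-go $\hat C$; the candidate-ranking rule and all names below are illustrative, not the repository's exact logic.

```python
import numpy as np

def select_safe_action(candidates, pred_ctg, ctg_prompt):
    """Pick the candidate whose predicted cost-to-go respects the CTG prompt.

    candidates:  sequence of candidate actions proposed by the actor.
    pred_ctg:    critic-predicted CTG (C-hat) for each candidate, same length.
    ctg_prompt:  remaining cost budget at the current timestep.

    Among candidates with C-hat <= CTG we keep the one closest to the budget
    from below, so the trajectory stays safe without becoming needlessly
    conservative (cf. assumption (2) above); if no candidate satisfies the
    budget, we fall back to the one with the smallest predicted CTG.
    """
    pred_ctg = np.asarray(pred_ctg, dtype=float)
    safe = np.flatnonzero(pred_ctg <= ctg_prompt)
    if safe.size > 0:
        best = safe[np.argmax(pred_ctg[safe])]
    else:
        best = int(np.argmin(pred_ctg))
    return candidates[best]
```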

Through the above analysis and experiments, we conclude that our Critic module is reasonable and effective.
