Here is an introduction to the experiments we conducted on SafetyGym:
First, since no open-source offline dataset is available for SafetyGym, we constructed our own offline safety dataset based on the PointGoal environment. To build it, we trained the PPO-Lagrangian (PPO-lag) algorithm with different constraint thresholds and generated trajectories using the trained models. Specifically, we used constraint thresholds of 25, 40, and 80; the final cost and reward performance of these models is shown in the following figure.
We then saved these three models and generated 333/334 trajectories from each, for a total of 1,000 offline trajectories that compose our PointGoal dataset. The distribution of this dataset is shown in the figure below and demonstrates that the dataset retains a certain degree of distinguishability.
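As a rough illustration, the dataset assembly step can be sketched as below. The rollout function here is a placeholder (a real rollout would query the trained PPO-lag policies and the PointGoal environment); only the 333/334/333 split into 1,000 trajectories mirrors the actual procedure.

```python
import random

def rollout(policy_threshold, horizon=100, seed=None):
    """Hypothetical rollout: returns one trajectory as per-step lists of
    (obs, action, reward, cost). Values are placeholders; in practice the
    trained PPO-lag policy acts in the PointGoal environment."""
    rng = random.Random(seed)
    traj = {"obs": [], "act": [], "rew": [], "cost": []}
    for _ in range(horizon):
        traj["obs"].append([rng.random()] * 4)   # placeholder observation
        traj["act"].append([rng.random()] * 2)   # placeholder action
        traj["rew"].append(rng.random())
        # placeholder cost rate that loosely tracks the training threshold
        traj["cost"].append(1.0 if rng.random() < policy_threshold / 1000 else 0.0)
    return traj

# Three policies trained with cost limits 25, 40, 80; 333/334/333
# trajectories each give the 1,000-trajectory PointGoal dataset.
dataset = []
for threshold, n_traj in [(25, 333), (40, 334), (80, 333)]:
    for i in range(n_traj):
        dataset.append(rollout(threshold, seed=hash((threshold, i)) % 2**31))

print(len(dataset))  # 1000
```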
After generating the dataset, we evaluated our model, using a DT model as the baseline and setting the constraint threshold to d=30. The experimental results, shown in the figure below, demonstrate that our model still meets the constraint requirements in the SafetyGym environment, showcasing its potential for safety-critical applications.
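The evaluation criterion, checking the mean episode cost against the threshold d, can be written as a small helper. The function name and the sample costs below are illustrative, not from our codebase:

```python
def constraint_satisfied(episode_costs, d=30):
    """Check whether the mean episode cost stays within the constraint
    threshold d (d=30 in the SafetyGym experiment above)."""
    mean_cost = sum(episode_costs) / len(episode_costs)
    return mean_cost <= d, mean_cost

# Toy example with four hypothetical evaluation episodes.
ok, mean_cost = constraint_satisfied([28.0, 31.5, 26.0, 29.5])
print(ok, mean_cost)  # True 28.75
```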
| Dataset | Metric | Saformer | CoptiDICE |
| --- | --- | --- | --- |
| halfcheetah_medium | Reward | 4661.20 ± 52.46 | 4847 ± 50 |
| halfcheetah_medium | Cost | 4487.2 ± 6.05 | 4335 ± 23 |
| halfcheetah_medium | Limit | 4490 | 4490 |
| halfcheetah_medium_replay | Reward | 3019.17 ± 234.09 | 3314 ± 188 |
| halfcheetah_medium_replay | Cost | 4289.54 ± 6.31 | 4412 ± 64 |
| halfcheetah_medium_replay | Limit | 4300 | 4300 |
| halfcheetah_medium_expert | Reward | 11000.37 ± 141.93 | 482 ± 149 |
| halfcheetah_medium_expert | Cost | 4296.87 ± 9.92 | 3141 ± 125 |
| halfcheetah_medium_expert | Limit | 4300 | 4300 |
Here, we compare our method with CoptiDICE. Specifically, we ran CoptiDICE on the three halfcheetah datasets; the results are shown in the table above. They demonstrate that our method outperforms CoptiDICE on the halfcheetah_medium_replay and halfcheetah_medium_expert tasks, which supports the effectiveness of our model.
In particular, we note that halfcheetah_medium_expert is a more challenging dataset, and learning the expert policy from it is a relatively difficult task. On this task, Saformer demonstrated outstanding performance, while CoptiDICE failed. We believe that the reason for Saformer's weaker performance on the halfcheetah_medium task is that the distribution of
Note: We currently support the following algorithms: [Saformer, BCQ-L, CPQ, DT, CoptiDICE, CRR]. If you are interested, we can also conduct additional supplementary experiments.
1. Subsequence length
| Model | Metric | K=2 | K=5 | K=10 | K=20 | K=40 |
| --- | --- | --- | --- | --- | --- | --- |
| Saformer | Reward | 4555 ± 117 | 4601 ± 120 | 4632 ± 94 | 4713 ± 26 | 4728 ± 75 |
| Saformer | Cost | 4474 ± 17 | 4476 ± 23 | 4479 ± 20 | 4498 ± 2 | 4506 ± 12 |
| Saformer | Limit | 4503 | 4503 | 4503 | 4503 | 4503 |
In this section, we conducted hyperparameter experiments on the Transformer context length K, using the 20th percentile of the halfcheetah_medium dataset. We chose this dataset because the trajectories in halfcheetah have a fixed length, unlike the variable-length trajectories of hopper and walker2d, which makes it better suited for experiments on K. We evaluated K values of [2, 5, 10, 20, 40], as shown in the table above. For every value of K except 40, Saformer satisfies our constraint settings, which demonstrates the robustness of Saformer to the hyperparameter K.
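For clarity, a minimal sketch of how the context length K enters: a decision-transformer-style model conditions only on the most recent K (return-to-go, state, action) triples. The helper below is illustrative; a real implementation would also pad short windows and embed each modality.

```python
def make_context(states, actions, returns_to_go, K):
    """Slice out the length-K context window the model conditions on:
    the most recent K (rtg, state, action) triples. Sequences shorter
    than K are returned whole (padding omitted for simplicity)."""
    start = max(0, len(states) - K)
    return returns_to_go[start:], states[start:], actions[start:]

# Toy 30-step trajectory with placeholder per-step values.
states = list(range(30))
actions = list(range(30))
rtg = list(range(30, 0, -1))  # placeholder returns-to-go

r, s, a = make_context(states, actions, rtg, K=20)
print(len(s))  # 20
```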
2. Penalty factor
| Model | Metric | λ=0 | λ=0.25 | λ=0.5 | λ=1.0 | λ=2.0 |
| --- | --- | --- | --- | --- | --- | --- |
| Saformer | Reward | 4578 ± 100 | 4683 ± 67 | 4574 ± 178 | 4620 ± 79 | 4599 ± 37 |
| Saformer | Cost | 4459 ± 23 | 4491 ± 14 | 4471 ± 40 | 4487 ± 15 | 4478 ± 47 |
| Saformer | Limit | 4503 | 4503 | 4503 | 4503 | 4503 |
In this section, we conducted hyperparameter experiments on the regularization (penalty) parameter λ. As the table above shows, Saformer satisfies the cost limit for every tested value of λ, which suggests that our method is also robust to this hyperparameter.
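The role of λ can be sketched as a weighted combination of the action loss and a penalty term. The exact form of the penalty in Saformer may differ, so treat this as an assumed shape rather than our actual objective:

```python
def combined_loss(action_loss, cost_penalty, lam):
    """Assumed combined objective: supervised action loss plus a
    lambda-weighted penalty (e.g. a cost-prediction or constraint
    regularizer). lam=0 disables the penalty entirely."""
    return action_loss + lam * cost_penalty

# Sweep the same lambda values as in the table above (toy loss values).
for lam in [0.0, 0.25, 0.5, 1.0, 2.0]:
    print(lam, combined_loss(1.0, 0.4, lam))
```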
In this section, we will verify the effectiveness of the posterior safety verification module in Saformer.
First, let us describe the two figures below. The blue line is the actual cost-to-go at each time step, obtained by backtracking over the completed trajectory; the yellow line is the CTG prompt value we design for the agent; and the green line is the CTG value predicted by the Saformer critic module. With the meaning of each line established, we state the following criteria:
(1) If the blue line is lower than the yellow line at time step 0, it means that the total cost of the entire trajectory is lower than the constraint threshold we set, that is, the entire trajectory meets the constraint condition.
(2) We expect the blue line to be slightly lower than the yellow line, but not too much lower. If the blue line is below the yellow line, the trajectory meets the safety constraint, as stated in (1); however, we do not want it to be far below. Our work assumes a weak linear correlation between the agent's rewards and its costs, so a very low blue line leads to worse reward performance.
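The blue line described in (1) and (2) can be computed with a single backtracking pass over the finished trajectory:

```python
def cost_to_go(step_costs):
    """Blue line: at each time step t, the sum of the actual costs from
    t to the end of the trajectory, computed by backtracking once the
    trajectory is complete."""
    ctg, running = [], 0.0
    for c in reversed(step_costs):
        running += c
        ctg.append(running)
    return ctg[::-1]

# Toy 4-step trajectory; criterion (1) checks ctg[0] against the prompt.
ctg = cost_to_go([1.0, 0.0, 2.0, 1.0])
print(ctg)          # [4.0, 3.0, 3.0, 1.0]
print(ctg[0] <= 5)  # total trajectory cost within a prompt of 5
```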
With these points in mind, we can analyze the figures as follows:
Left: We train the Saformer Critic but do not use it to modify the actions of the Saformer Actor. Here the average value of the blue line is higher than that of the yellow line, indicating that the policy cannot satisfy the safety constraint. The reason is that although the Saformer Critic can predict that an action is dangerous at a given time step (when the green line rises above the yellow line, i.e., the predicted CTG exceeds the CTG prompt), without the modification step there is no mechanism to correct that action, so the unsafe action is still executed.
Right: We train the Critic and use it to modify the actions. Now the blue line stays very close to the yellow line, indicating that Saformer with the Critic modification finds the most appropriate actions to satisfy the safety constraint.
Through the above analysis and experiments, we conclude that our Critic module is reasonable and effective.
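A minimal sketch of the posterior safety verification step: the critic's predicted CTG for each candidate action is compared with the CTG prompt, and, per criterion (2), we keep the safe action whose predicted cost is closest to the budget (slightly below the prompt, not far below). The toy critic and candidate set are purely illustrative.

```python
def verify_action(candidates, predict_ctg, ctg_prompt):
    """Posterior safety verification, sketched: filter candidate actions
    by whether the critic-predicted cost-to-go stays within the CTG
    prompt. `predict_ctg` stands in for the Saformer Critic."""
    safe = [a for a in candidates if predict_ctg(a) <= ctg_prompt]
    if safe:
        # prefer the safe action closest to the budget: slightly below
        # the prompt, not far below (criterion (2))
        return max(safe, key=predict_ctg)
    # no candidate predicted safe: fall back to the least costly one
    return min(candidates, key=predict_ctg)

# Toy critic: action magnitude drives the predicted cost-to-go.
critic = lambda a: 10 * abs(a)
action = verify_action([0.1, 0.4, 0.9], critic, ctg_prompt=5.0)
print(action)  # 0.4  (largest action still predicted safe: 10*0.4 <= 5)
```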