Commit 5181870

docs: update appendix (#350)
1 parent 080e6b8 commit 5181870

20 files changed: +4842 -22 lines changed

20 files changed

+4842
-22
lines changed

.github/workflows/lint.yml

Lines changed: 0 additions & 4 deletions
@@ -45,10 +45,6 @@ jobs:
         run: |
           make pre-commit
 
-      - name: ruff
-        run: |
-          make ruff
-
       - name: flake8
         run: |
           make flake8
Lines changed: 54 additions & 0 deletions
# Case Study

One important motivation for SafeRL is to enable agents to explore and learn safely. Therefore, evaluating algorithm performance with respect to *procedural constraint violations* is also important. We have selected representative experimental results, reported in <a href="#analys">Figure 1</a> and <a href="#analys_ppo">Figure 2</a>:

#### Radical vs. Conservative

*Radical* policies often explore higher rewards but violate more safety constraints, whereas *conservative* policies do the opposite. <a href="#analys">Figure 1</a> illustrates this: during training, CPO and PPOLag consistently pursue the highest rewards among all algorithms, as depicted in the first row. However, as shown in the second row, they experience significant fluctuations in constraint violations, especially PPOLag. They are therefore relatively radical, *i.e.,* higher rewards but higher costs. In comparison, while P3O achieves slightly lower rewards than PPOLag, it exhibits fewer oscillations in constraint violations, making it safer in adhering to safety constraints, as is evident from the smaller proportion of its distribution crossing the black dashed line. A similar pattern is observed when comparing PCPO with CPO. Therefore, P3O and PCPO are relatively conservative, *i.e.,* lower costs but lower rewards.

<img style="border-radius: 0.3125em; box-shadow: 0 2px 4px 0 rgba(34,36,38,.12),0 2px 10px 0 rgba(34,36,38,.08);" src="https://github.com/Gaiejj/omnisafe_benchmarks_cruve/blob/main/on-policy/benchmarks/analys.png?raw=true" id="analys">
<br>

**Figure 1:** PPOLag, P3O, CPO, and PCPO trained on four tasks for 1e7 steps, showing the distribution of all episodic rewards and costs. All data covers 5 random seeds and filters out data points beyond 3 standard deviations. The black dashed line in the graph represents the preset `cost_limit`.
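
Both figure captions describe the same preprocessing step: points more than 3 standard deviations from the mean are dropped before the distributions are plotted. Below is a minimal sketch of that filter; the function name and array layout are illustrative, not taken from the OmniSafe codebase:

```python
import numpy as np

def filter_outliers(values: np.ndarray, num_std: float = 3.0) -> np.ndarray:
    """Keep only the points within `num_std` standard deviations of the mean."""
    mean, std = values.mean(), values.std()
    return values[np.abs(values - mean) <= num_std * std]

# e.g., clean one run's episodic costs before plotting their distribution
episodic_costs = np.random.default_rng(0).normal(25.0, 5.0, size=1000)
print(filter_outliers(episodic_costs).shape)
```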

#### Oscillation vs. Stability

The oscillations in the degree of constraint violations during training can indicate the performance of SafeRL algorithms. These oscillations are quantified by *Extremes*, *i.e.,* the maximum constraint violation, and *Distributions*, *i.e.,* the frequency of violations remaining below a predefined `cost_limit`. As shown in <a href="#analys_ppo">Figure 2</a>, PPOLag, a popular baseline in SafeRL, utilizes the Lagrangian multiplier for constraint handling. Despite its simplicity and ease of implementation, PPOLag often suffers from significant oscillations due to the difficulty of setting appropriate initial values and learning rates. It consistently seeks higher rewards but often produces larger extremes and unsafe distributions. Conversely, CPPOPID, which employs a PID controller for updating the Lagrangian multiplier, markedly reduces these extremes. CUP implements a two-stage projection method that constrains the distribution of violations below the `cost_limit`. Lastly, PPOSaute integrates state observations with constraints, resulting in smaller extremes and safer distributions of violations.
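
To make the CPPOPID mechanism concrete, here is a minimal sketch of a PID-controlled Lagrange multiplier update; the gains, attribute names, and non-negativity clamps are illustrative assumptions, not OmniSafe's exact implementation:

```python
class PIDLagrangianSketch:
    """Update a Lagrange multiplier with a PID controller on the cost error."""

    def __init__(self, cost_limit: float, kp: float = 0.1, ki: float = 0.01, kd: float = 0.01):
        self.cost_limit = cost_limit
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0   # accumulated constraint violation
        self.prev_cost = 0.0  # episodic cost from the previous update

    def update(self, ep_cost: float) -> float:
        error = ep_cost - self.cost_limit                # proportional term: current violation
        self.integral = max(0.0, self.integral + error)  # integral term, kept non-negative
        derivative = max(0.0, ep_cost - self.prev_cost)  # derivative term damps rising cost
        self.prev_cost = ep_cost
        # the multiplier itself must stay non-negative
        return max(0.0, self.kp * error + self.ki * self.integral + self.kd * derivative)
```

Relative to plain gradient ascent on the multiplier (PPOLag's update, which amounts to the integral term alone), the proportional and derivative terms damp the overshoot that produces the oscillations visible in <a href="#analys_ppo">Figure 2</a>.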

<img style="border-radius: 0.3125em; box-shadow: 0 2px 4px 0 rgba(34,36,38,.12),0 2px 10px 0 rgba(34,36,38,.08);" src="https://github.com/Gaiejj/omnisafe_benchmarks_cruve/blob/main/on-policy/benchmarks/analys_ppo.png?raw=true" id="analys_ppo">
<br>

**Figure 2:** PPOLag, CPPOPID, CUP, and PPOSaute trained on four tasks for 1e7 steps, showing the distribution of all episodic rewards and costs. All data covers 5 random seeds and filters out data points beyond 3 standard deviations. The black dashed line in the graph represents the preset `cost_limit`.
Lines changed: 223 additions & 0 deletions
# Model-based Algorithms

The OmniSafe Navigation Benchmark for model-based algorithms evaluates the effectiveness of OmniSafe's model-based algorithms across two different environments from the [Safety-Gymnasium](https://github.com/PKU-Alignment/safety-gymnasium) task suite. For each supported algorithm and environment, we offer the following:

- Default hyperparameters used for the benchmark and scripts that enable result replication.
- Graphs and raw data that can be utilized for research purposes.
- Detailed logs obtained during training.

Supported algorithms are listed below:

- **[NeurIPS 2018]** [Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models (PETS)](https://arxiv.org/abs/1805.12114)
- **[CoRL 2021]** [Learning Off-Policy with Online Planning (LOOP and SafeLOOP)](https://arxiv.org/abs/2008.10066)
- **[AAAI 2022]** [Conservative and Adaptive Penalty for Model-Based Safe Reinforcement Learning (CAP)](https://arxiv.org/abs/2112.07701)
- **[ICML 2022 Workshop]** [Constrained Model-based Reinforcement Learning with Robust Cross-Entropy Method (RCE)](https://arxiv.org/abs/2010.07968)
- **[NeurIPS 2018]** [Constrained Cross-Entropy Method for Safe Reinforcement Learning (CCE)](https://proceedings.neurips.cc/paper/2018/hash/34ffeb359a192eb8174b6854643cc046-Abstract.html)

## Safety-Gymnasium

We highly recommend using **Safety-Gymnasium** to run the following experiments. To install it on a Linux machine, run:

```bash
pip install safety_gymnasium
```
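
As a quick sanity check of the installation, the snippet below builds one of the tasks used in this benchmark and takes a single random step. A minimal sketch: Safety-Gymnasium extends the Gymnasium `step` signature with a cost signal, and the print format here is our own:

```python
import safety_gymnasium

# Build one of the benchmark tasks and reset it with a fixed seed.
env = safety_gymnasium.make('SafetyPointGoal1-v0')
obs, info = env.reset(seed=0)

# Safety-Gymnasium returns a cost alongside the usual Gymnasium step outputs.
action = env.action_space.sample()
obs, reward, cost, terminated, truncated, info = env.step(action)
print(f'reward={reward:.3f} cost={cost:.3f}')
```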

## Run the Benchmark

You can set the main function of ``examples/benchmarks/experiment_grid.py`` as:

```python
if __name__ == '__main__':
    eg = ExperimentGrid(exp_name='Model-Based-Benchmarks')

    # Set up the algorithms.
    model_based_base_policy = ['LOOP', 'PETS']
    model_based_safe_policy = ['SafeLOOP', 'CCEPETS', 'CAPPETS', 'RCEPETS']
    eg.add('algo', model_based_base_policy + model_based_safe_policy)

    # You can use wandb to monitor the experiment.
    eg.add('logger_cfgs:use_wandb', [False])
    # You can use tensorboard to monitor the experiment.
    eg.add('logger_cfgs:use_tensorboard', [True])
    eg.add('train_cfgs:total_steps', [1000000])

    # Set up the environments.
    eg.add('env_id', [
        'SafetyPointGoal1-v0-modelbased',
        'SafetyCarGoal1-v0-modelbased',
    ])
    eg.add('seed', [0, 5, 10, 15, 20])

    # The total number of experiments must be divisible by num_pool;
    # choose num_pool according to your machine's resources.
    eg.run(train, num_pool=5)
```
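
This grid expands to 6 algorithms × 2 environments × 5 seeds = 60 experiments, which `num_pool=5` divides evenly, running five of them in parallel at a time.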

After that, you can run the following command to run the benchmark:

```bash
cd examples/benchmarks
python run_experiment_grid.py
```

You can set the path in ``examples/benchmarks/experiment_grid.py`` to the directory where the results are saved, for example:

```python
path = 'omnisafe/examples/benchmarks/exp-x/Model-Based-Benchmarks'
```

You can also plot the results by running the following command:

```bash
cd examples
python analyze_experiment_results.py
```

**For detailed usage of the OmniSafe statistics tool, please refer to [this tutorial](https://omnisafe.readthedocs.io/en/latest/common/stastics_tool.html).**

## OmniSafe Benchmark

To demonstrate the high reliability of the algorithms implemented, OmniSafe offers performance insights within the Safety-Gymnasium environment. It should be noted that all data is procured under the constraint of `cost_limit=1.00`. The results are presented in <a href="#performance_model_based">Table 1</a> and <a href="#curve_model_based">Figure 1</a>.

### Performance Table

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<style>
  .scrollable-container {
    overflow-x: auto;
    white-space: nowrap;
    width: 100%;
  }
  table {
    border-collapse: collapse;
    width: auto;
    font-size: 12px;
  }
  th, td {
    padding: 8px;
    text-align: center;
    border: 1px solid #ddd;
  }
  th {
    font-weight: bold;
  }
  caption {
    font-size: 12px;
    font-family: 'Times New Roman', Times, serif;
  }
</style>
</head>
<body>

<div class="scrollable-container">
<table id="performance_model_based">
  <thead>
    <tr class="header">
      <th style="text-align: left;"></th>
      <th colspan="2" style="text-align: center;"><strong>PETS</strong></th>
      <th colspan="2" style="text-align: center;"><strong>LOOP</strong></th>
      <th colspan="2" style="text-align: center;"><strong>SafeLOOP</strong></th>
    </tr>
  </thead>
  <tbody>
    <tr class="odd">
      <td style="text-align: left;"><strong>Environment</strong></td>
      <td style="text-align: center;"><strong>Reward</strong></td>
      <td style="text-align: center;"><strong>Cost</strong></td>
      <td style="text-align: center;"><strong>Reward</strong></td>
      <td style="text-align: center;"><strong>Cost</strong></td>
      <td style="text-align: center;"><strong>Reward</strong></td>
      <td style="text-align: center;"><strong>Cost</strong></td>
    </tr>
    <tr class="even">
      <td style="text-align: left;"><span class="smallcaps">SafetyCarGoal1-v0</span></td>
      <td style="text-align: center;">33.07 <span class="math inline">±</span>1.33</td>
      <td style="text-align: center;">61.20 <span class="math inline">±</span>7.23</td>
      <td style="text-align: center;">25.41 <span class="math inline">±</span>1.23</td>
      <td style="text-align: center;">62.64 <span class="math inline">±</span>8.34</td>
      <td style="text-align: center;">22.09 <span class="math inline">±</span>0.30</td>
      <td style="text-align: center;">0.16 <span class="math inline">±</span>0.15</td>
    </tr>
    <tr class="odd">
      <td style="text-align: left;"><span class="smallcaps">SafetyPointGoal1-v0</span></td>
      <td style="text-align: center;">27.66 <span class="math inline">±</span>0.07</td>
      <td style="text-align: center;">49.16 <span class="math inline">±</span>2.69</td>
      <td style="text-align: center;">25.08 <span class="math inline">±</span>1.47</td>
      <td style="text-align: center;">55.23 <span class="math inline">±</span>2.64</td>
      <td style="text-align: center;">22.94 <span class="math inline">±</span>0.72</td>
      <td style="text-align: center;">0.04 <span class="math inline">±</span>0.07</td>
    </tr>
    <thead>
      <tr class="header">
        <th style="text-align: left;"></th>
        <th colspan="2" style="text-align: center;"><strong>CCEPETS</strong></th>
        <th colspan="2" style="text-align: center;"><strong>RCEPETS</strong></th>
        <th colspan="2" style="text-align: center;"><strong>CAPPETS</strong></th>
      </tr>
    </thead>
    <tr class="odd">
      <td style="text-align: left;"><strong>Environment</strong></td>
      <td style="text-align: center;"><strong>Reward</strong></td>
      <td style="text-align: center;"><strong>Cost</strong></td>
      <td style="text-align: center;"><strong>Reward</strong></td>
      <td style="text-align: center;"><strong>Cost</strong></td>
      <td style="text-align: center;"><strong>Reward</strong></td>
      <td style="text-align: center;"><strong>Cost</strong></td>
    </tr>
    <tr class="even">
      <td style="text-align: left;"><span class="smallcaps">SafetyCarGoal1-v0</span></td>
      <td style="text-align: center;">27.60 <span class="math inline">±</span>1.21</td>
      <td style="text-align: center;">1.03 <span class="math inline">±</span>0.29</td>
      <td style="text-align: center;">29.08 <span class="math inline">±</span>1.63</td>
      <td style="text-align: center;">1.02 <span class="math inline">±</span>0.88</td>
      <td style="text-align: center;">23.33 <span class="math inline">±</span>6.34</td>
      <td style="text-align: center;">0.48 <span class="math inline">±</span>0.17</td>
    </tr>
    <tr class="odd">
      <td style="text-align: left;"><span class="smallcaps">SafetyPointGoal1-v0</span></td>
      <td style="text-align: center;">24.98 <span class="math inline">±</span>0.05</td>
      <td style="text-align: center;">1.87 <span class="math inline">±</span>1.27</td>
      <td style="text-align: center;">25.39 <span class="math inline">±</span>0.28</td>
      <td style="text-align: center;">2.46 <span class="math inline">±</span>0.58</td>
      <td style="text-align: center;">9.45 <span class="math inline">±</span>8.62</td>
      <td style="text-align: center;">0.64 <span class="math inline">±</span>0.77</td>
    </tr>
  </tbody>
</table>
</div>

<caption><p><b>Table 1:</b> The performance of OmniSafe model-based algorithms, encompassing both reward and cost, was assessed within the Safety-Gymnasium environments. It is crucial to highlight that all model-based algorithms underwent evaluation following 1e6 training steps.</p></caption>
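
Each table cell reports a mean and standard deviation over the evaluation runs; a minimal sketch of that aggregation (the numbers below are made up for illustration, not benchmark data):

```python
import numpy as np

# hypothetical per-seed evaluation returns for one algorithm/environment pair
returns = np.array([33.1, 32.4, 34.8, 31.9, 33.2])
print(f'{returns.mean():.2f} ± {returns.std():.2f}')  # reported as "mean ± std"
```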

### Performance Curves

<table id="curve_model_based">
  <tr>
    <td style="text-align:center">
      <img style="border-radius: 0.3125em; box-shadow: 0 2px 4px 0 rgba(34,36,38,.12),0 2px 10px 0 rgba(34,36,38,.08);" src="https://github.com/Gaiejj/omnisafe_benchmarks_cruve/blob/main/model-based/benchmarks/SafetyCarGoal1-v0-modelbased.png?raw=True">
      <br>
      <div>SafetyCarGoal1-v0</div>
    </td>
  </tr>
  <tr>
    <td style="text-align:center">
      <img style="border-radius: 0.3125em; box-shadow: 0 2px 4px 0 rgba(34,36,38,.12),0 2px 10px 0 rgba(34,36,38,.08);" src="https://github.com/Gaiejj/omnisafe_benchmarks_cruve/blob/main/model-based/benchmarks/SafetyPointGoal1-v0-modelbased.png?raw=True">
      <br>
      <div>SafetyPointGoal1-v0</div>
    </td>
  </tr>
</table>

<caption><p><b>Figure 1:</b> Training curves in Safety-Gymnasium environments, covering classical reinforcement learning algorithms and safe learning algorithms mentioned in <a href="#performance_model_based">Table 1</a>.</p></caption>
