Commit

deploy: d0fe975

lbluque committed Apr 23, 2024
1 parent b90126e commit bda4922
Showing 31 changed files with 2,459 additions and 1,515 deletions.
16 changes: 8 additions & 8 deletions _downloads/5fdddbed2260616231dbf7b0d94bb665/train.txt
@@ -1,16 +1,16 @@
-2024-04-23 22:36:00 (INFO): Project root: /home/runner/work/ocp/ocp
+2024-04-23 22:52:53 (INFO): Project root: /home/runner/work/ocp/ocp
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/torch/cuda/amp/grad_scaler.py:126: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available. Disabling.
warnings.warn(
-2024-04-23 22:36:01 (INFO): amp: true
+2024-04-23 22:52:55 (INFO): amp: true
cmd:
-checkpoint_dir: fine-tuning/checkpoints/2024-04-23-22-36-48-ft-oxides
-commit: 51c8869
+checkpoint_dir: fine-tuning/checkpoints/2024-04-23-22-53-52-ft-oxides
+commit: d0fe975
identifier: ft-oxides
-logs_dir: fine-tuning/logs/tensorboard/2024-04-23-22-36-48-ft-oxides
+logs_dir: fine-tuning/logs/tensorboard/2024-04-23-22-53-52-ft-oxides
print_every: 10
-results_dir: fine-tuning/results/2024-04-23-22-36-48-ft-oxides
+results_dir: fine-tuning/results/2024-04-23-22-53-52-ft-oxides
seed: 0
-timestamp_id: 2024-04-23-22-36-48-ft-oxides
+timestamp_id: 2024-04-23-22-53-52-ft-oxides
dataset:
a2g_args:
r_energy: true
@@ -138,7 +138,7 @@ val_dataset:
r_forces: true
src: val.db

-2024-04-23 22:36:01 (INFO): Loading dataset: lmdb
+2024-04-23 22:52:55 (INFO): Loading dataset: lmdb
Traceback (most recent call last):
File "/home/runner/work/ocp/ocp/main.py", line 89, in <module>
Runner()(config)
16 changes: 8 additions & 8 deletions _downloads/819e10305ddd6839cd7da05935b17060/mass-inference.txt
@@ -1,16 +1,16 @@
-2024-04-23 22:36:52 (INFO): Project root: /home/runner/work/ocp/ocp
+2024-04-23 22:54:40 (INFO): Project root: /home/runner/work/ocp/ocp
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/torch/cuda/amp/grad_scaler.py:126: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available. Disabling.
warnings.warn(
-2024-04-23 22:36:54 (INFO): amp: true
+2024-04-23 22:54:42 (INFO): amp: true
cmd:
-checkpoint_dir: ./checkpoints/2024-04-23-22-36-48
-commit: 51c8869
+checkpoint_dir: ./checkpoints/2024-04-23-22-53-52
+commit: d0fe975
identifier: ''
-logs_dir: ./logs/tensorboard/2024-04-23-22-36-48
+logs_dir: ./logs/tensorboard/2024-04-23-22-53-52
print_every: 10
-results_dir: ./results/2024-04-23-22-36-48
+results_dir: ./results/2024-04-23-22-53-52
seed: 0
-timestamp_id: 2024-04-23-22-36-48
+timestamp_id: 2024-04-23-22-53-52
dataset:
a2g_args:
r_energy: false
@@ -117,7 +117,7 @@ test_dataset:
trainer: ocp
val_dataset: null

-2024-04-23 22:36:54 (INFO): Loading dataset: lmdb
+2024-04-23 22:54:42 (INFO): Loading dataset: lmdb
Traceback (most recent call last):
File "/home/runner/work/ocp/ocp/main.py", line 89, in <module>
Runner()(config)
20 changes: 10 additions & 10 deletions _sources/core/inference.md
@@ -14,7 +14,7 @@ kernelspec:
Fast batched inference
------------------

The ASE calculator is not necessarily the most efficient way to run a lot of computations. It is better to do a "mass inference" using a command line utility. We illustrate how to do that here.

In this paper we computed about 10K different gold structures:

@@ -23,12 +23,12 @@ Boes, J. R., Groenenboom, M. C., Keith, J. A., & Kitchin, J. R. (2016). Neural n
You can retrieve the dataset below. In this notebook we learn how to do "mass inference" without an ASE calculator. You do this by creating a config.yml file, and running the `main.py` command line utility.

```{code-cell} ipython3
! wget https://figshare.com/ndownloader/files/11948267 -O data.db
```
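If `wget` is not available in your environment, the same file can be fetched directly from Python; this is just a fallback sketch using the figshare URL above.

```{code-cell} ipython3
# Fallback download without wget, using the same figshare URL as above.
import urllib.request

urllib.request.urlretrieve('https://figshare.com/ndownloader/files/11948267', 'data.db')
```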



Inference on this file will be fast if we have a GPU, but without one it could take a while. To keep things fast for the automated builds, we select only the first 10 structures so it is still approachable with just a CPU.
Comment or skip this block to use the whole dataset!

```{code-cell} ipython3
@@ -46,15 +46,15 @@ with ase.db.connect('full_data.db') as full_db:
if 'tag' in atoms.info['key_value_pairs']:
atoms.info['key_value_pairs']['tag'] = int(atoms.info['key_value_pairs']['tag'])
subset_db.write(atoms, **atoms.info['key_value_pairs'])
```
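For reference, a self-contained version of that subsetting step might look like the sketch below. It assumes the downloaded file has been renamed to `full_data.db` and that only the first 10 rows are copied; the (truncated) cell above remains the authoritative version.

```{code-cell} ipython3
# Sketch: copy the first 10 rows of the full database into a small data.db.
import ase.db

with ase.db.connect('full_data.db') as full_db:
    with ase.db.connect('data.db', append=False) as subset_db:
        for i, row in enumerate(full_db.select()):
            if i >= 10:
                break
            atoms = row.toatoms(add_additional_information=True)
            kvp = atoms.info['key_value_pairs']
            # Tags must be integers downstream, so coerce them here.
            if 'tag' in kvp:
                kvp['tag'] = int(kvp['tag'])
            subset_db.write(atoms, **kvp)
```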

```{code-cell} ipython3
! ase db data.db
```

You have to choose a checkpoint to start with. The newer checkpoints may require too much memory for this environment.

```{code-cell} ipython3
from ocpmodels.models.model_registry import available_pretrained_models
@@ -69,7 +69,7 @@ checkpoint_path
```

We have to update our configuration yml file with the dataset. It is necessary to specify the train and test set for some reason.

```{code-cell} ipython3
from ocpmodels.common.tutorial_utils import generate_yml_config
@@ -110,7 +110,7 @@ print(f'Elapsed time = {time.time() - t0:1.1f} seconds')

```{code-cell} ipython3
with open('mass-inference.txt', 'wb') as f:
    f.write(inference.stdout.encode('utf-8'))
```

```{code-cell} ipython3
@@ -148,7 +148,7 @@ energies = np.array([row.energy for row in db.select('natoms>5,xc=PBE')])
natoms = np.array([row.natoms for row in db.select('natoms>5,xc=PBE')])
```

-Now, we can see the predictions. The are only ok here; that is not surprising, the data set has lots of Au configurations that have never been seen by this model. Fine-tuning would certainly help improve this.
+Now, we can see the predictions. They are only ok here; that is not surprising, the data set has lots of Au configurations that have never been seen by this model. Fine-tuning would certainly help improve this.

```{code-cell} ipython3
import matplotlib.pyplot as plt
@@ -193,11 +193,11 @@ plt.ylabel('OCP (eV/atom)');

# Comparing ASE calculator and main.py

The results should be the same.

It is worth noting that the default precision of predictions is float16 with main.py, but with the ASE calculator the default precision is float32. Supposedly you can specify `--task.prediction_dtype=float32` at the command line, or specify it in the config.yml as we do above, but as of this tutorial that does not resolve the issue.

As noted above (see also [Issue 542](https://github.com/Open-Catalyst-Project/ocp/issues/542)), the ASE calculator and main.py use different precisions by default, which can lead to small differences.
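Since the two code paths default to different precisions, it can be worth checking the stored dtypes before reading too much into small discrepancies. This is only a quick sketch; `results` and `OCP` are the arrays computed earlier in this notebook.

```{code-cell} ipython3
# Inspect the dtypes of the two sets of predictions (sketch).
import numpy as np

print(results['energy'].dtype)  # main.py predictions, expected to be float16
print(np.asarray(OCP).dtype)    # ASE-calculator predictions, expected to be float32 or float64
```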

```{code-cell} ipython3
np.mean(np.abs(results['energy'][sind] - OCP * natoms)) # MAE
10 changes: 5 additions & 5 deletions _sources/core/lmdb_dataset_creation.md
@@ -24,7 +24,7 @@ about these steps as they've been automated as part of this

```{code-cell} ipython3
from ocpmodels.preprocessing import AtomsToGraphs
-from ocpmodels.datasets import SinglePointLmdbDataset, TrajectoryLmdbDataset
+from ocpmodels.datasets import LmdbDataset
import ase.io
from ase.build import bulk
from ase.build import fcc100, add_adsorbate, molecule
@@ -149,7 +149,7 @@ db.close()
```

```{code-cell} ipython3
-dataset = SinglePointLmdbDataset({"src": "sample_CuCO.lmdb"})
+dataset = LmdbDataset({"src": "sample_CuCO.lmdb"})
len(dataset)
```
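Because each item is a `torch_geometric` `Data` object, the dataset can also be wrapped in a standard PyTorch Geometric `DataLoader` for batched iteration; a small sketch (with an arbitrary batch size) follows.

```{code-cell} ipython3
# Sketch: batched iteration over the LMDB-backed dataset.
from torch_geometric.loader import DataLoader

loader = DataLoader(dataset, batch_size=4, shuffle=False)
batch = next(iter(loader))
batch
```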

@@ -217,7 +217,7 @@ db.close()
```

```{code-cell} ipython3
-dataset = TrajectoryLmdbDataset({"src": "s2ef/"})
+dataset = LmdbDataset({"src": "s2ef/"})
len(dataset)
```

@@ -227,7 +227,7 @@ dataset[0]

### Advanced usage

-TrajectoryLmdbDataset supports multiple LMDB files because the need to highly parallelize the dataset construction process. With OCP's largest split containing 135M+ frames, the need to parallelize the LMDB generation process for these was necessary. If you find yourself needing to deal with very large datasets we recommend parallelizing this process.
+LmdbDataset supports multiple LMDB files because dataset construction is often heavily parallelized. With OCP's largest split containing 135M+ frames, parallelizing the LMDB generation process was a necessity. If you find yourself needing to deal with very large datasets, we recommend parallelizing this process.
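In practice that parallelization usually means sharding frames across worker processes, with each worker writing its own `data.*.lmdb` file into the directory that `LmdbDataset` later reads. The sketch below shows one such worker; it assumes the frames have already been converted to `Data` objects with the same `AtomsToGraphs` pattern used earlier, and it omits the bookkeeping a production script would add.

```{code-cell} ipython3
import lmdb
import pickle

def write_shard(shard_id, data_objects, outdir="s2ef"):
    """Write one chunk of converted Data objects into its own LMDB shard (sketch)."""
    db = lmdb.open(
        f"{outdir}/data.{shard_id:04d}.lmdb",
        map_size=1099511627776 * 2,
        subdir=False,
        meminit=False,
        map_async=True,
    )
    for i, data in enumerate(data_objects):
        txn = db.begin(write=True)
        txn.put(f"{i}".encode("ascii"), pickle.dumps(data, protocol=-1))
        txn.commit()
    db.sync()
    db.close()

# Each worker process (e.g. one per multiprocessing.Pool worker) calls write_shard
# on its own slice of the frames; LmdbDataset({"src": "s2ef/"}) then reads all shards.
```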

+++

@@ -236,7 +236,7 @@ TrajectoryLmdbDataset supports multiple LMDB files because the need to highly pa
Below we demonstrate how to interact with an LMDB to extract particular information.

```{code-cell} ipython3
-dataset = TrajectoryLmdbDataset({"src": "s2ef/"})
+dataset = LmdbDataset({"src": "s2ef/"})
```

```{code-cell} ipython3
36 changes: 18 additions & 18 deletions _sources/core/model_training.md
@@ -25,10 +25,10 @@ python main.py --mode train --config-yml configs/TASK/SIZE/MODEL/MODEL.yml
If you have multiple
GPUs, you can use distributed data parallel training by running:
```
-python -u -m torch.distributed.launch --nproc_per_node=8 main.py --distributed --num-gpus 8 [...]
+torchrun --standalone --nproc_per_node=8 main.py --distributed --num-gpus 8 [...]
```
-`torch.distributed.launch` launches multiple processes for distributed training. For more details, refer to
-https://pytorch.org/docs/stable/distributed.html#launch-utility
+`torchrun` launches multiple processes for distributed training. For more details, refer to the
+[official documentation](https://pytorch.org/docs/stable/elastic/run.html)

If training with multiple GPUs, GPU load balancing may be used to evenly distribute a batch of variable system sizes across GPUs. Load balancing may either balance by number of atoms or number of neighbors. A `metadata.npz` file must be available in the dataset directory to take advantage of this feature. The following command will generate a `metadata.npz` file and place it in the corresponding directory.
```
@@ -39,7 +39,7 @@ Load balancing is activated by default (in atoms mode). To change modes you can
optim:
load_balancing: neighbors
```
-For more details, refer to https://github.com/Open-Catalyst-Project/ocp/pull/267.
+For more details, refer to [PR 267](https://github.com/Open-Catalyst-Project/ocp/pull/267).
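As a mental model (a toy illustration, not the library's actual implementation), balancing by atoms amounts to greedily handing the largest remaining systems to whichever GPU currently has the smallest running total:

```python
# Toy sketch of atom-count load balancing across GPUs.
def balance_by_atoms(natoms_per_system, n_gpus):
    loads = [0] * n_gpus
    assignment = [[] for _ in range(n_gpus)]
    # Largest systems first, each assigned to the least-loaded GPU so far.
    for i in sorted(range(len(natoms_per_system)),
                    key=lambda k: natoms_per_system[k], reverse=True):
        g = loads.index(min(loads))
        assignment[g].append(i)
        loads[g] += natoms_per_system[i]
    return assignment

print(balance_by_atoms([120, 30, 45, 200, 64, 18], n_gpus=2))
```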

If you have access to a slurm cluster, we use the [submitit](https://github.com/facebookincubator/submitit) package to simplify multi-node distributed training:
```
@@ -53,11 +53,10 @@ In the rest of this tutorial, we explain how to train models for each task.
## Initial Structure to Relaxed Energy prediction (IS2RE)

In the IS2RE tasks, the model takes the initial structure as an input and predicts the structure’s adsorption energy
-in the relaxed state. To train a model for the IS2RE task, you can use the `EnergyTrainer`
-Trainer and `SinglePointLmdb` dataset by specifying the following in your configuration file:
+in the relaxed state. To train a model for the IS2RE task, you can use the following in your configuration file:

```yaml
-trainer: energy # Use the EnergyTrainer
+trainer: ocp

dataset:
# Train data
@@ -130,11 +129,11 @@ Alternatively, the IS2RE task may be approached by 2 methods as described in our
## Structure to Energy and Forces (S2EF)

In the S2EF task, the model takes the positions of the atoms as input and predicts the adsorption energy and per-atom
-forces as calculated by DFT. To train a model for the S2EF task, you can use the `ForcesTrainer` Trainer
+forces as calculated by DFT. To train a model for the S2EF task, you can use the `OCPTrainer`
and `TrajectoryLmdb` dataset by specifying the following in your configuration file:

```yaml
-trainer: forces # Use the ForcesTrainer
+trainer: ocp
dataset:
# Training data
@@ -159,7 +158,7 @@ You can find examples configuration files in [`configs/s2ef`](https://github.com
To train a SchNet model for the S2EF task on the 2M split using 2 GPUs, run:

```bash
-python -u -m torch.distributed.launch --nproc_per_node=2 main.py \
+torchrun --standalone --nproc_per_node=2 main.py \
--mode train --config-yml configs/s2ef/2M/schnet/schnet.yml --num-gpus 2 --distributed
```
Similar to the IS2RE task, tensorboard logs are stored in `logs/tensorboard/[TIMESTAMP]` and the
@@ -175,7 +174,7 @@ The predictions are stored in `[RESULTS_DIR]/ocp_predictions.npz` and later used

## Training OC20 models with total energies (IS2RE/S2EF)

-To train and validate an OC20 IS2RE/S2EF model on total energies instead of adsorption energies there are a number of required changes to the config. They include setting: `dataset: oc22_lmdb`, `prediction_dtype: float32`, `train_on_oc20_total_energies: True`, and `oc20_ref: path/to/oc20_ref.pkl` (see example below). Also, please note that our evaluation server does not currently support OC20 total energy models.
+To train and validate an OC20 IS2RE/S2EF model on total energies instead of adsorption energies there are a number of
+required changes to the config. They include setting: `dataset: oc22_lmdb`, `prediction_dtype: float32`,
+`train_on_oc20_total_energies: True`, and `oc20_ref: path/to/oc20_ref.pkl` (see example below).
+Also, please note that our evaluation server does not currently support OC20 total energy models.

```yaml
task:
@@ -278,11 +280,10 @@ EvalAI expects results to be structured in a specific format for a submission to

## Initial Structure to Total Relaxed Energy (IS2RE-Total)

-For the IS2RE-Total task, the model takes the initial structure as input and predicts the total DFT energy of the relaxed structure. This task is more general and more challenging than the original OC20 IS2RE task that predicts adsorption energy. To train an OC22 IS2RE-Total model use the `EnergyTrainer` with the `OC22LmdbDataset` by including these lines in your configuration file:
+For the IS2RE-Total task, the model takes the initial structure as input and predicts the total DFT energy of the relaxed structure. This task is more general and more challenging than the original OC20 IS2RE task that predicts adsorption energy.
+To train an OC22 IS2RE-Total model use the `OC22LmdbDataset` by including these lines in your configuration file:

```yaml
-trainer: energy # Use the EnergyTrainer
dataset:
format: oc22_lmdb # Use the OC22LmdbDataset
...
@@ -291,11 +292,11 @@ You can find examples configuration files in [`configs/oc22/is2re`](https://gith

## Structure to Total Energy and Forces (S2EF-Total)

-The S2EF-Total task takes a structure and predicts the total DFT energy and per-atom forces. This differs from the original OC20 S2EF task because it predicts total energy instead of adsorption energy. To train an OC22 S2EF-Total model use the ForcesTrainer with the OC22LmdbDataset by including these lines in your configuration file:
+The S2EF-Total task takes a structure and predicts the total DFT energy and per-atom forces. This differs from the
+original OC20 S2EF task because it predicts total energy instead of adsorption energy.
+To train an OC22 S2EF-Total model, use the OC22LmdbDataset by including these lines in your configuration file:

```yaml
-trainer: forces # Use the ForcesTrainer
dataset:
format: oc22_lmdb # Use the OC22LmdbDataset
...
@@ -338,4 +339,3 @@ EvalAI expects results to be structured in a specific format for a submission to
```
Where `file.npz` corresponds to the respective `[s2ef/is2re]_predictions.npz` files generated for the corresponding task. The final submission file will be written to `submission_file.npz` (rename accordingly). The `dataset` argument specifies which dataset is being considered — this only needs to be set for OC22 predictions because OC20 is the default.
3. Upload `submission_file.npz` to EvalAI.

8 changes: 4 additions & 4 deletions core/fine-tuning/fine-tuning-oxides.html
@@ -773,7 +773,7 @@ <h1>Fine tuning a model<a class="headerlink" href="#fine-tuning-a-model" title="
warnings.warn(
</pre></div>
</div>
<div class="output stream highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span>Elapsed time 67.3 seconds.
<div class="output stream highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span>Elapsed time 68.1 seconds.
</pre></div>
</div>
<img alt="../../_images/92bd7f94dd548c8cfc2744eb5890cd23fada1ff98e8dc907657e2eb109af0402.png" src="../../_images/92bd7f94dd548c8cfc2744eb5890cd23fada1ff98e8dc907657e2eb109af0402.png" />
@@ -1138,7 +1138,7 @@ <h2>Running the training job<a class="headerlink" href="#running-the-training-jo
<span class="expanded">Hide code cell output</span>
</summary>
<div class="cell_output docutils container">
<div class="output stream highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span>Elapsed time = 3.9 seconds
<div class="output stream highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span>Elapsed time = 4.1 seconds
</pre></div>
</div>
</div>
@@ -1154,7 +1154,7 @@ <h2>Running the training job<a class="headerlink" href="#running-the-training-jo
</div>
</div>
<div class="cell_output docutils container">
<div class="output text_plain highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span>&#39;fine-tuning/checkpoints/2024-04-23-22-36-48-ft-oxides&#39;
<div class="output text_plain highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span>&#39;fine-tuning/checkpoints/2024-04-23-22-53-52-ft-oxides&#39;
</pre></div>
</div>
</div>
@@ -1204,7 +1204,7 @@ <h2>Running the training job<a class="headerlink" href="#running-the-training-jo
<span class="g g-Whitespace"> </span><span class="mi">425</span> <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="n">mode</span><span class="p">):</span>
<span class="ne">--&gt; </span><span class="mi">426</span> <span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="fm">__init__</span><span class="p">(</span><span class="nb">open</span><span class="p">(</span><span class="n">name</span><span class="p">,</span> <span class="n">mode</span><span class="p">))</span>

<span class="ne">FileNotFoundError</span>: [Errno 2] No such file or directory: &#39;fine-tuning/checkpoints/2024-04-23-22-36-48-ft-oxides/checkpoint.pt&#39;
<span class="ne">FileNotFoundError</span>: [Errno 2] No such file or directory: &#39;fine-tuning/checkpoints/2024-04-23-22-53-52-ft-oxides/checkpoint.pt&#39;
</pre></div>
</div>
</div>