Problem in reading the LMDB dataset #979

Open
vishank-u opened this issue Jan 22, 2025 · 6 comments
Labels
question Further information is requested

Comments

@vishank-u

vishank-u commented Jan 22, 2025

What would you like to report?

Hi,

I would like to read the LMDB dataset (IS2RE) of ODAC23, but I cannot find the legacy tutorial that shows the schema or how to query this data for one of the materials presented in the original paper. Could you point me to the correct dataloader or to some scripts for this purpose?

Thanks

@lbluque
Collaborator

lbluque commented Jan 22, 2025

Hi @vishank-u 👋

You should be able to read the ODAC23 datasets using LmdbDataset in fairchem.core.datasets.lmdb_dataset:

from fairchem.core.datasets import LmdbDataset

dataset = LmdbDataset(config=dict(src="path_to_ODAC23", r_energy=True, r_forces=True))

# print a datapoint, you can also use a torch DataLoader to loop through batches
print(dataset[0])

Let me know if you run into more issues.

@lbluque added the "question" (Further information is requested) label Jan 22, 2025
@vishank-u
Author

Hi @lbluque ,

Thanks for sharing the information. It worked, and I can view the dataset entries. I have two more related questions:

  1. If I would like to analyze a key property, e.g. energy, I am currently collecting it with torch.tensor:
import torch
from fairchem.core.datasets.lmdb_dataset import LmdbDataset

file_path = "../is2res_train_val_test_lmdbs/data/is2re/all"

dataset = LmdbDataset({"src": file_path + "/train"})
energies = torch.tensor([data.y_relaxed for data in dataset])

but it takes a long time to process, understandably, as the dataset is quite big. Is this the correct way to analyze the data, or is there a better way? Also, would it be better to store this energies tensor in the same location to avoid repeating this step every time?

  2. Is there a mapping key to query certain MOF's either by mp-id or by MOF name/id like POLDUQ?

Thanks in advance for answering my questions.

Regards,
Vishank

@lbluque
Collaborator

lbluque commented Jan 23, 2025

Looping over the whole dataset in a list comprehension will take a long time. There isn't a straightforward way to query our datasets. An alternative is to loop through batches of data using a torch DataLoader.
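On the caching part of the question, here is a minimal sketch of one way to avoid repeating the slow pass. It assumes each datapoint exposes `y_relaxed` as a Python float or a single-element tensor, as in the code above. The helper name `load_or_compute_energies` and the JSON cache file are hypothetical and not part of fairchem.

```python
import json
from pathlib import Path


def load_or_compute_energies(dataset, cache_path, field="y_relaxed"):
    """Return one float per datapoint, reusing cache_path when it exists.

    Hypothetical helper, not part of fairchem. `dataset` only needs to
    support len() and integer indexing.
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Cache hit: skip the slow pass over the LMDB entirely.
        return json.loads(cache_path.read_text())

    # Cache miss: one full pass over the dataset (slow for large datasets).
    # float() accepts a Python float or a single-element tensor.
    energies = [float(getattr(dataset[i], field)) for i in range(len(dataset))]
    cache_path.write_text(json.dumps(energies))
    return energies


# Hypothetical usage with the dataset from this thread:
# dataset = LmdbDataset({"src": file_path + "/train"})
# energies = torch.tensor(load_or_compute_energies(dataset, "train_y_relaxed.json"))
```

The first call still pays the full cost of reading every entry, and later calls only read the small cache file. If a tensor on disk is preferred, `torch.save` and `torch.load` could replace the JSON round trip.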

@anuroopsriram can you comment on this question:

  2. Is there a mapping key to query certain MOF's either by mp-id or by MOF name/id like POLDUQ?

@anuroopsriram
Collaborator

Is there a mapping key to query certain MOF's either by mp-id or by MOF name/id like POLDUQ?

You can loop through the dataset and look at the name field. Unfortunately we don't have a simple way to query other than looping through the whole dataset.
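A minimal sketch of that loop, assuming the field of interest (here `name`) is stored as a hashable value such as a string or an int. The helper `build_index` is hypothetical and not part of fairchem. It also counts the entries that lack the field, since not every datapoint is guaranteed to carry it.

```python
from collections import defaultdict


def build_index(dataset, field="name"):
    """Map each value of `field` to the dataset indices that carry it.

    Hypothetical helper, not part of fairchem. Returns (index, missing),
    where `missing` counts the datapoints that do not have `field`.
    """
    index = defaultdict(list)
    missing = 0
    for i in range(len(dataset)):
        # getattr with a default avoids an AttributeError on absent fields.
        value = getattr(dataset[i], field, None)
        if value is None:
            missing += 1
            continue
        index[value].append(i)
    return dict(index), missing


# Hypothetical usage:
# index, missing = build_index(dataset, field="name")
# hits = [dataset[i] for i in index.get("POLDUQ", [])]
```

This is still a single full pass over the dataset, so it is only worth doing once. The resulting dictionary can then be saved and reused for later lookups.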

@vishank-u
Author

Hi @anuroopsriram ,

Thanks for the suggestion. I tried to look at the name field, but the datapoints do not have it. Here is the print of one datapoint:

Data(edge_index=[2, 2964], pos=[86, 3], cell=[1, 3, 3], atomic_numbers=[86], natoms=86, cell_offsets=[2964, 3], force=[86, 3], distances=[2964], fixed=[86], sid=2472718, tags=[86], y_init=6.282500615000004, y_relaxed=-0.025550085000020317, pos_relaxed=[86, 3], id='0_0')

Let me know if I am missing something.

@vishank-u
Author

Hi @anuroopsriram,

Even when looping through the dataset, is there a mapping from "sid" (or something else) to the corresponding MOF structure, or does one need to create an ase object from each entry to filter for the correct candidate?
