Downloading pretraining dataset from huggingface #43

Open
rubenweitzman opened this issue Jul 3, 2024 · 3 comments

Comments

rubenweitzman commented Jul 3, 2024

Hi,
Thanks for providing the pre-training database with Foldseek tokens! I'm having difficulty downloading the dataset and using it with Hugging Face functions. I tried:

from datasets import load_dataset

# Load the dataset
dataset = load_dataset("westlake-repl/AF2_UniRef50")

# Load the train split of the dataset
train_dataset = dataset["train"]

but I get this error:

if not module_name:
    raise DataFilesNotFoundError("No (supported) data files found" + (f" in {path}" if path else ""))
    return module_name, default_builder_kwargs

DataFilesNotFoundError: No (supported) data files found in westlake-repl/AF2_UniRef50

What is the proper way to load the dataset from Hugging Face?

Contributor

LTEnjoy commented Jul 4, 2024

Hi,

AF2_UniRef50 is stored in LMDB format, which load_dataset does not recognize. To load it, you first have to download it and then open it with the lmdb package.
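For the download step, one option is the `huggingface_hub` package. The following is a minimal sketch, not a method confirmed by the maintainers: it assumes `huggingface_hub` is installed and that the files can be fetched as a plain dataset snapshot. The helper names `download_dataset` and `find_lmdb_dirs` are hypothetical, and the layout of the downloaded files is not stated in this thread, so the second helper simply looks for directories containing `data.mdb`, the file LMDB writes by default.

```python
import os


def download_dataset(local_dir: str) -> str:
    """Download the dataset repository into local_dir and return its path.

    Assumption: the files are usable as downloaded. If the repository
    ships archives instead, they would need to be extracted first.
    """
    # Imported lazily so the rest of this sketch works without the package.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="westlake-repl/AF2_UniRef50",
        repo_type="dataset",
        local_dir=local_dir,
    )


def find_lmdb_dirs(root: str) -> list:
    """Return every directory under root that contains an LMDB data file."""
    return sorted(
        dirpath for dirpath, _, files in os.walk(root) if "data.mdb" in files
    )
```

After `download_dataset("/your/path/to/AF2_UniRef50")` finishes, `find_lmdb_dirs` on the same path lists the directories that can be passed to `lmdb.open`.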

Here is an example of how to read samples:

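import json
import lmdb

lmdb_dir = "/your/path/to/AF2_UniRef50/train"
env = lmdb.open(lmdb_dir, readonly=True)
with env.begin() as txn:
    # The number of samples is stored under the "length" key
    length = int(txn.get('length'.encode()).decode())
    for i in range(length):
        # Keys are bytes, so convert the integer index to a string first
        data_str = txn.get(str(i).encode()).decode()
        data = json.loads(data_str)
        print(data)
        break  # stop after printing the first sample
env.close()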

Hope this resolves your problem :)
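To read more than the first sample, the same loop can be wrapped in a generator. This is a sketch under the layout described above (a "length" key plus JSON values stored under the keys "0", "1", and so on); the helper names `iter_samples` and `read_first_samples` are hypothetical and not part of the repository.

```python
import json
from typing import Any, Iterator, Optional


def iter_samples(txn: Any, limit: Optional[int] = None) -> Iterator[dict]:
    """Yield decoded samples from an open LMDB transaction.

    Assumes a "length" key holding the entry count and JSON-encoded
    samples stored under the string form of their index.
    """
    raw_length = txn.get(b"length")
    if raw_length is None:
        raise KeyError("no 'length' key found; check the LMDB directory")
    length = int(raw_length.decode())
    if limit is not None:
        length = min(length, limit)
    for i in range(length):
        raw = txn.get(str(i).encode())
        if raw is None:
            continue  # skip a missing index instead of failing
        yield json.loads(raw.decode())


def read_first_samples(lmdb_dir: str, n: int = 3) -> list:
    """Open lmdb_dir read-only and return up to n decoded samples."""
    import lmdb  # third-party package: pip install lmdb

    env = lmdb.open(lmdb_dir, readonly=True, lock=False)
    try:
        with env.begin() as txn:
            return list(iter_samples(txn, limit=n))
    finally:
        env.close()
```

Calling `read_first_samples("/your/path/to/AF2_UniRef50/train")` returns a short list of dictionaries, which is enough to inspect the fields of each record before writing a full data loader.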

heya5 commented Nov 13, 2024

@LTEnjoy Hi, can I download the original structure data for the sequences?

Contributor

LTEnjoy commented Nov 13, 2024

> @LTEnjoy Hi, can I download the original structure data for the sequences?

Hi,

I'm sorry, but the original structure data is too large to upload, so we are unable to share it. You can download all AF2 structures from the official website: https://alphafold.ebi.ac.uk/.
