Hi,
Thanks for providing the pre-training database with Foldseek tokens! I'm having difficulty downloading the dataset and loading it with the Hugging Face functions. I'm trying:
from datasets import load_dataset
# Load the dataset
dataset = load_dataset("westlake-repl/AF2_UniRef50")
# Load the train split of the dataset
train_dataset = dataset["train"]
but I get this error:
if not module_name:
raise DataFilesNotFoundError("No (supported) data files found" + (f" in {path}" if path else ""))
return module_name, default_builder_kwargs
DataFilesNotFoundError: No (supported) data files found in westlake-repl/AF2_UniRef50
What, then, is the proper way to load the dataset from Hugging Face?
AF2_UniRef50 is stored in LMDB format. To load it, you first have to download it and then open it with the lmdb package.
Here is an example of how to read samples:
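import json
import lmdb

lmdb_dir = "/your/path/to/AF2_UniRef50/train"
with lmdb.open(lmdb_dir, readonly=True).begin() as txn:
    # The total number of samples is stored under the key "length"
    length = int(txn.get('length'.encode()).decode())
    for i in range(length):
        # Keys are bytes, so convert the integer index to a string first
        data_str = txn.get(str(i).encode()).decode()
        data = json.loads(data_str)
        print(data)
        break  # only print the first sample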
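For the "first download it" step, here is a rough sketch of one way to do it, not an official recipe. It assumes the repository can be fetched as a dataset repo with `snapshot_download` from the `huggingface_hub` package, and that the downloaded copy has the same `train` directory layout as the path above (if the files are shipped as archives, they would need unpacking first). The helper names `index_key`, `decode_sample`, `download_dataset` and `iter_samples` are made up for illustration.

```python
import json
from typing import Iterator, Optional


def index_key(i: int) -> bytes:
    """LMDB keys are bytes; samples are assumed to be keyed by their decimal index."""
    return str(i).encode()


def decode_sample(raw: bytes) -> dict:
    """Each value is assumed to be a UTF-8 encoded JSON string, as in the example above."""
    return json.loads(raw.decode())


def download_dataset(local_dir: str) -> str:
    """Fetch the dataset repository into local_dir (needs network access)."""
    from huggingface_hub import snapshot_download  # pip install huggingface_hub

    return snapshot_download(
        repo_id="westlake-repl/AF2_UniRef50",
        repo_type="dataset",
        local_dir=local_dir,
    )


def iter_samples(lmdb_dir: str, limit: Optional[int] = None) -> Iterator[dict]:
    """Yield decoded samples from an LMDB directory, at most `limit` of them."""
    import lmdb  # pip install lmdb

    env = lmdb.open(lmdb_dir, readonly=True, lock=False)
    try:
        with env.begin() as txn:
            length = int(txn.get(b"length").decode())
            count = length if limit is None else min(limit, length)
            for i in range(count):
                yield decode_sample(txn.get(index_key(i)))
    finally:
        env.close()
```

Usage would look something like `root = download_dataset("/your/path/to/AF2_UniRef50")` followed by `for sample in iter_samples(root + "/train", limit=1): print(sample)`. The third-party imports sit inside the functions so that the two small helpers can be used without `lmdb` or `huggingface_hub` installed.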
@LTEnjoy Hi, can I download the original structure data for the sequences?
Hi,
I'm sorry, but the original structure data is too large to upload, so we are unable to share it. You can download all AF2 structures from the official website: https://alphafold.ebi.ac.uk/.