This repo helps Google Cloud users run Scenic, an awesome JAX-based computer vision research repo.
A few small modifications make Scenic friendlier to Google Cloud users. Scenic loads datasets from the default TFDS dataset directory, but for Google Cloud users (or other GPU/TPU users) the dataset location may differ; for instance, Google Cloud users usually store their datasets in a Google Cloud Storage bucket. Therefore, we should edit the dataset loading functions in Scenic. We use MNIST and ImageNet as examples here.
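For context, TFDS looks up and prepares datasets under a data_dir; when it is not set, TFDS falls back to its default directory (typically ~/tensorflow_datasets), which is exactly what we want to avoid on Cloud. Below is a minimal sketch (not Scenic code) of the end goal, using a hypothetical bucket path gs://YOUR_BUCKET_NAME/tensorflow_datasets:
import tensorflow_datasets as tfds

# Hypothetical bucket path -- replace with your own GCS location.
data_dir = 'gs://YOUR_BUCKET_NAME/tensorflow_datasets'
builder = tfds.builder('mnist', data_dir=data_dir)
builder.download_and_prepare()  # Prepares (or finds) the data under data_dir.
train_ds = builder.as_dataset(split='train')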
Before anything else, please fork the Scenic repo. Then you can edit your forked Scenic repo.
First, edit the get_dataset() function in dataset_lib/mnist_dataset.py:
- remove this line of code (this step is not always needed, since only some of Scenic's XXX_dataset.py files contain it):
del dataset_configs
- pass dataset_configs.data_dir to dataset_utils.load_split_from_tfds:
train_ds, train_ds_info = dataset_utils.load_split_from_tfds(
    'mnist',
    batch_size,
    split='train',
    data_dir=dataset_configs.data_dir,  # Added by us.
    preprocess_example=preprocess_ex,
    shuffle_seed=shuffle_seed)
- do the same for eval_ds:
eval_ds, _ = dataset_utils.load_split_from_tfds(
    'mnist',
    eval_batch_size,
    split='test',
    data_dir=dataset_configs.data_dir,  # Added by us.
    preprocess_example=preprocess_ex)
- add data_dir in the config file; we assume we will use scenic/projects/baselines/configs/mnist/mnist_config.py for training later:
# Dataset.
config.dataset_name = 'mnist'
config.dataset_configs = ml_collections.ConfigDict()
config.dataset_configs.data_dir = 'YOUR_DATA_DIR' # Added by us.
config.data_dtype_str = 'float32'
The code for MNIST is ready now.
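Optionally, if you want the edited loader to keep working with configs that do not define data_dir, you can read the field defensively instead of accessing it directly. A small sketch, assuming dataset_configs is an ml_collections.ConfigDict as in Scenic:
# Fall back to the TFDS default directory when the config has no data_dir.
data_dir = dataset_configs.get('data_dir', None) if dataset_configs else None
train_ds, train_ds_info = dataset_utils.load_split_from_tfds(
    'mnist',
    batch_size,
    split='train',
    data_dir=data_dir,
    preprocess_example=preprocess_ex,
    shuffle_seed=shuffle_seed)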
Similarly, edit the get_dataset() function in dataset_lib/imagenet_dataset.py to pass data_dir into the dataset builder.
- pass dataset_configs.data_dir to imagenet_load_split for train_ds:
train_ds = imagenet_load_split(
    batch_size,
    train=True,
    onehot_labels=onehot_labels,
    dtype=dtype,
    shuffle_seed=shuffle_seed,
    data_augmentations=data_augmentations,
    data_dir=dataset_configs.data_dir)  # Added by us.
- pass dataset_configs.data_dir to imagenet_load_split for eval_ds:
eval_ds = imagenet_load_split(
    eval_batch_size,
    train=False,
    onehot_labels=onehot_labels,
    dtype=dtype,
    data_dir=dataset_configs.data_dir)  # Added by us.
- edit the imagenet_load_split function in dataset_lib/imagenet_dataset.py to accept a data_dir argument:
def imagenet_load_split(batch_size,
                        train,
                        onehot_labels,
                        dtype=tf.float32,
                        image_size=IMAGE_SIZE,
                        prefetch_buffer_size=10,
                        shuffle_seed=None,
                        data_augmentations=None,
                        data_dir=None):  # Added by us.
- edit the code inside imagenet_load_split:
# Replace dataset_builder = tfds.builder('imagenet2012:5.*.*') with:
dataset_builder = tfds.builder('imagenet2012:5.*.*', data_dir=data_dir)
- edit the config file, i.e. projects/baselines/configs/imagenet/imagenet_vit_config.py
config.dataset_configs = ml_collections.ConfigDict()
config.dataset_configs.data_dir = 'gs://YOUR_BUCKET_NAME/imagenet'  # Added by us.
Okay, the code is ready for both MNIST and ImageNet now. If you want to try other datasets, just follow these two examples; the core idea is to pass data_dir through to tfds.builder manually, as sketched below.
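As a generic illustration (not code taken from Scenic), the pattern for any other TFDS dataset looks roughly like this; the dataset name cifar10 and the helper build_dataset are placeholders:
import tensorflow_datasets as tfds

def build_dataset(dataset_configs, split='train'):
  # data_dir comes from config.dataset_configs.data_dir in your config file.
  data_dir = dataset_configs.get('data_dir', None)
  builder = tfds.builder('cifar10', data_dir=data_dir)  # Placeholder dataset name.
  builder.download_and_prepare()
  return builder.as_dataset(split=split)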
Note that all the commands in this document should be run in the command line of the TPU VM instance unless otherwise stated.
Please make sure you set your gcloud configs first:
- Create a GCP project.
- Install gcloud.
- Associate your Google Account (Gmail account) with your GCP project by running:
export GCP_PROJECT=<GCP PROJECT ID>
gcloud auth login
gcloud auth application-default login
gcloud config set project $GCP_PROJECT
- Create a staging bucket if you do not already have one. Bucket locations are regions, so we use europe-west4 for the bucket here and europe-west4-a as the TPU zone later:
export GOOGLE_CLOUD_BUCKET_NAME=<GOOGLE_CLOUD_BUCKET_NAME>
export ZONE=europe-west4-a
gsutil mb -l europe-west4 gs://$GOOGLE_CLOUD_BUCKET_NAME
Then, set up the TPU VM:
- Create a Cloud TPU VM instance following this instruction. We recommend that you develop your workflow on a single v3-8 TPU (i.e., --accelerator-type=v3-8) and scale up to pod slices once the pipeline is ready. In this README, we focus on using a single v3-8 TPU. See here to learn more about TPU architectures.
- With Cloud TPU VMs, you ssh directly into the host machine of the TPU VM. You can install packages, run your code, etc. on the host machine. Once the TPU instance is created, ssh into it with
export TPU_NAME=v3-8
gcloud alpha compute tpus tpu-vm ssh ${TPU_NAME} --zone=${ZONE}
where TPU_NAME and ZONE are the name and the zone used above.
- Install the dependencies. One convenient way is to install t5x first; that covers most of the environment Scenic needs.
git clone --branch=main https://github.com/google-research/t5x
cd t5x
python3 -m pip install -e '.[tpu]' -f \
  https://storage.googleapis.com/jax-releases/libtpu_releases.html
- Install Scenic and run:
git clone --branch=main https://github.com/YOUR_GITHUB_NAME/scenic.git
cd scenic
pip install .
export WORK_DIR=gs://${GOOGLE_CLOUD_BUCKET_NAME}/scenic/mnist
python3 scenic/main.py \
  --config=scenic/projects/baselines/configs/mnist/mnist_config.py \
  --workdir=$WORK_DIR
Sometimes, we may conduct larger-scale experiments with more TPU chips (e.g. v3-128) for larger datasets like ImageNet. In this case, we can run the code this way:
- Combine the commands from the "Run your code on the TPU VM" steps above into a single file, run_scenic.sh. We provide an example in this repo.
- Run
export TPU_NAME=v3-128
gcloud compute tpus tpu-vm scp run_scenic.sh $TPU_NAME: --worker=all --zone=$ZONE
gcloud alpha compute tpus tpu-vm ssh $TPU_NAME --zone=$ZONE --worker=all --command "bash run_scenic.sh"
Thank you for your interest in this repo. If you run into any issues when running Scenic on your own cluster, especially on Google Cloud Platform, please feel free to open an issue on this repo (preferred) or ping Fuzhao via email. I would be happy to help.