My implementation of the paper "Language-driven Semantic Segmentation" (LSeg) by Boyi Li et al.
A dense prediction transformer (DPT) with a modified head produces pixel-level embeddings, while the CLIP text encoder embeds a set of label words. Both embeddings are combined in a multimodal latent space (the orange tensor in the figure), which is then compared against the ground-truth labels of an annotated image.
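As a rough sketch of this fusion step (shapes and the temperature value are illustrative assumptions, not taken from the paper's code):

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: batch, embedding dim, spatial size, number of label words.
B, C, H, W, N = 2, 512, 60, 80, 5
pixel_emb = torch.randn(B, C, H, W)   # stand-in for DPT per-pixel embeddings
text_emb = torch.randn(N, C)          # stand-in for CLIP text embeddings

# Normalize so the dot product is a cosine similarity.
pixel_emb = F.normalize(pixel_emb, dim=1)
text_emb = F.normalize(text_emb, dim=1)

# Every pixel embedding dotted with every word embedding gives the
# multimodal tensor of shape (B, N, H, W) -- the orange tensor in the figure.
logits = torch.einsum("bchw,nc->bnhw", pixel_emb, text_emb)

# Training compares it against the annotated labels, e.g. with cross-entropy.
labels = torch.randint(0, N, (B, H, W))
loss = F.cross_entropy(logits / 0.07, labels)  # 0.07: CLIP-style temperature (assumed)
```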
We train our model only on the ADE20K and COCO Panoptic datasets, which we download and relabel with MSeg-API. I recommend following its instructions step by step, but with a few modifications:
- `mseg-api` should be cloned into the repo's main directory.
- In the scripts from the `mseg-api/download_scripts` folder, you need to comment out the parts regarding other datasets.
- Place the dataset downloads into `data/` (a directory in the repo's root). This is done when you define `MSEG_DST_DIR` (see the layout sketch after this list).
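Based only on the paths referenced in this README (the exact subfolders produced by the MSeg scripts may differ), `data/` should end up looking roughly like:

```
data/
├── mseg_dataset/
│   └── ADE20K/
│       └── ADEChallengeData2016/
└── COCOPanoptic/
    └── images/
        ├── train2017/
        └── val2017/
```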
Once everything is downloaded, we use the `mseg-semantic` utils to interact with the data and create the dataloader.
- I needed to change `ade20k_images_dir = "data/mseg_dataset/ADE20K/"` to `ade20k_images_dir = "data/mseg_dataset/ADE20K/ADEChallengeData2016/"` in `Lseg/utils/util.py`; otherwise an error shows up.
- Use `test_data_utils.ipynb` to check that we are fetching the images correctly. In my case, for the COCO dataset, neither `train2017` nor `val2017` was inside `data/COCOPanoptic/images/`, so I had to create that folder myself and put both splits inside (see the sanity check after this list).
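A minimal check for the COCO folder issue above (a hypothetical helper, not part of the repo; it only mirrors what `test_data_utils.ipynb` verifies):

```python
import os

# Both COCO image splits must live under data/COCOPanoptic/images/; if one is
# missing, create the folder and move train2017/ and val2017/ into it.
coco_images = "data/COCOPanoptic/images"
for split in ("train2017", "val2017"):
    path = os.path.join(coco_images, split)
    status = "ok" if os.path.isdir(path) else "MISSING"
    print(f"{path}: {status}")
```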
- Thanks to Richard Zhao for answering all my annoying questions.
- Useful repositories: