
docs: update docs
dgcnz committed Oct 31, 2024
1 parent 2d311bf commit 62de7db
Showing 9 changed files with 79 additions and 20 deletions.
4 changes: 4 additions & 0 deletions docs/src/_config.yml
@@ -31,3 +31,7 @@ html:
use_issues_button: false
use_repository_button: true
favicon : "favicon.ico"

sphinx:
config:
html_show_copyright: false
17 changes: 15 additions & 2 deletions docs/src/part1/getting_started.md
@@ -15,8 +15,9 @@ The project is structured as follows:
├── detrex # fork of detrex
├── docs # documentation
├── logs
├── notebooks # jupyter notebooks
├── notebooks # experimental jupyter notebooks
├── output # [Training] `scripts.train_net` outputs (tensorboard logs, weights, etc)
├── outputs # [Compilation] `scripts.export_tensorrt` outputs (exported model, logs, etc)
├── projects # configurations and model definitions
├── scripts # utility scripts
├── src # python source code
@@ -96,4 +97,16 @@ To point the `detectron2` library to the dataset directory, we need to set the `DETECTRON2_DATASETS` environment variable:
```bash
conda env config vars set DETECTRON2_DATASETS=~/datasets
conda activate cu124
```
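
Under the hood, `detectron2` resolves its dataset root from this variable. A minimal sketch of the lookup (mirroring the builtin dataset registration; the `"datasets"` fallback is an assumption worth verifying against your detectron2 version):

```python
import os

# detectron2 resolves its dataset root from DETECTRON2_DATASETS,
# falling back to "./datasets" relative to the working directory.
root = os.path.expanduser(os.getenv("DETECTRON2_DATASETS", "datasets"))
print(root)  # e.g. /home/<user>/datasets after the conda config above
```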


(part1:downloadmodel)=
## Downloading trained model (for compilation and evaluation)

To download the final trained model, fetch the trained weights from HuggingFace and place them at `artifacts/model_final.pth` with the following command:

```bash
wget https://huggingface.co/dgcnz/dinov2_vitdet_DINO_12ep/resolve/main/model_final.pth -O artifacts/model_final.pth
```

This is a necessary step for compilation and running benchmarks later on.
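
If you prefer a Python entry point, the same file can be fetched with `huggingface_hub` (a sketch assuming the package is installed; the `wget` command above remains the canonical route):

```python
from huggingface_hub import hf_hub_download

# Fetch model_final.pth from the same HuggingFace repository as above
# and place it under artifacts/.
path = hf_hub_download(
    repo_id="dgcnz/dinov2_vitdet_DINO_12ep",
    filename="model_final.pth",
    local_dir="artifacts",
)
print(path)  # artifacts/model_final.pth
```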
Binary file added docs/src/part1/knowledgetransfer.png
Binary file added docs/src/part1/mim.png
38 changes: 25 additions & 13 deletions docs/src/part1/problem.md
@@ -1,25 +1,37 @@
# Problem Definition

TODO
- [ ] Rewrite slides in text format
```{contents}
```

---
## Motivation

Training a model from scratch requires a labeled dataset that covers the target domain (🌧️, 🌓, 👁️, …).
Currently, the most popular approach for deploying an object detection model to production is YOLO, because it is fast and easy to use. However, to make it work for your specific task, you need to collect a domain-specific dataset and fine-tune the model or train it from scratch. This is a time-consuming and expensive process, because the dataset needs to be comprehensive enough to cover most of the scenarios the model will encounter in the real world (weather conditions, different object scales and textures, etc.).

Curating such a dataset is hard and expensive.
Recently, a paradigm shift has emerged in the field, as described by {cite}`bommasani2022`: instead of training a model from scratch for a specific task, you can take a model pre-trained on a generic task with massive data and compute, use it as your model's backbone, and fine-tune only a decoder/head for your specific task (see {numref}`Figure {number} <knowledgetransfer>`). These pre-trained models are called Foundation Models, and they are great at capturing features that are useful for a wide range of tasks.

However, what if we could rely on a model pre-trained on 141 million images?
For example, we can train a vision model to predict missing patches in an image (see {numref}`Figure {number} <mim>`) and then fine-tune it for pedestrian detection.

---
:::::{grid} 2
:::{grid-item-card}
:::{figure-md} knowledgetransfer
<img src="knowledgetransfer.png" alt="knowledgetransfer">

Recent paradigm shift, from deep learning to foundation models (Bommasani, 2022):
DL: Train a model for a specific task and dataset (e.g. object detection of blood cells)
Transfer Learning is the process of reusing the knowledge acquired by a network trained on a source task for a target task. {cite}`hatakeyama2023`
:::
:::
:::{grid-item-card}
:::{figure-md} mim
<img src="mim.png" alt="mim">

FOMO: Pre-train a large model with a huge unlabeled dataset on a generic task (e.g. predicting missing patches on an image) and then adapt it for downstream tasks
Adaptations include: fine-tuning, decoder training, distillation, quantization, sparsification, etc.
Masked Image Modelling is a self-supervised objective that consists of predicting missing patches from an image. {cite}`mae2021`
:::
:::
:::::
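
In code, this recipe amounts to freezing the pre-trained backbone and training only a small task-specific head. A minimal, hypothetical sketch in PyTorch (the DINOv2 hub model and the 2-class head are illustrative assumptions, not this project's actual training setup):

```python
import torch
import torch.nn as nn

# Load a pre-trained foundation-model backbone and freeze it.
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
for p in backbone.parameters():
    p.requires_grad = False  # only the task head will be trained

# Small task-specific head (ViT-S/14 CLS embeddings have dim 384).
head = nn.Linear(384, 2)  # e.g. pedestrian vs. background

x = torch.randn(1, 3, 224, 224)  # dummy image batch (multiple of patch size 14)
with torch.no_grad():
    feats = backbone(x)          # (1, 384) CLS-token features
logits = head(feats)             # gradients flow into `head` only
```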

---
The drawback of using these foundation models is that they are large and computationally expensive, which makes them unsuitable for deployment in production environments, especially on edge devices. To address this issue, we need to optimize these models.


## Objectives

According to {cite}`mcip`, practitioners at Apple follow this workflow when asked to deploy a model to an edge device:
Find a feasibility model A.
@@ -30,5 +42,5 @@ Compress model B to reach production-ready model C.
:::{figure-md} apple_practice
<img src="apple_practice.png" alt="">

Hi
Caption
:::
3 changes: 3 additions & 0 deletions docs/src/part2/choosing.md
@@ -1,5 +1,8 @@
# Choosing a candidate model

```{contents}
```

Our task in this chapter is to choose a candidate architecture that lets us use a pre-trained vision foundation model as its backbone's feature extractor.


6 changes: 1 addition & 5 deletions docs/src/part3/compilation.ipynb
@@ -725,11 +725,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Before compilation, download the trained model weights from HuggingFace and place them on `artifacts/model_final.pth` or configure the path in the config file. To download the weights, run the following command:\n",
"\n",
"```sh\n",
"!wget https://huggingface.co/dgcnz/dinov2_vitdet_DINO_12ep/resolve/main/model_final.pth -O artifacts/model_final.pth ⁠\n",
"```\n",
"Before compilation, make sure you have followed the instructions at {ref}`part1:downloadmodel`.\n",
"\n",
"The main script to compile our model with the TensorRT backend is `scripts.export_tensorrt`.\n",
"\n",
3 changes: 3 additions & 0 deletions docs/src/part3/results.md
@@ -1,5 +1,8 @@
# Benchmarks and Results

```{contents}
```

## Running the benchmarks

Download the model:
28 changes: 28 additions & 0 deletions docs/src/references.bib
@@ -123,4 +123,32 @@ @misc{pytorchTorchexportSpecification
howpublished = {\url{https://pytorch.org/docs/main/export.ir_spec.html}},
year = {},
note = {[Accessed 25-10-2024]},
}

@misc{bommasani2022,
title={On the Opportunities and Risks of Foundation Models},
author={Rishi Bommasani and Drew A. Hudson and Ehsan Adeli and Russ Altman and Simran Arora and Sydney von Arx and Michael S. Bernstein and Jeannette Bohg and Antoine Bosselut and Emma Brunskill and Erik Brynjolfsson and Shyamal Buch and Dallas Card and Rodrigo Castellon and Niladri Chatterji and Annie Chen and Kathleen Creel and Jared Quincy Davis and Dora Demszky and Chris Donahue and Moussa Doumbouya and Esin Durmus and Stefano Ermon and John Etchemendy and Kawin Ethayarajh and Li Fei-Fei and Chelsea Finn and Trevor Gale and Lauren Gillespie and Karan Goel and Noah Goodman and Shelby Grossman and Neel Guha and Tatsunori Hashimoto and Peter Henderson and John Hewitt and Daniel E. Ho and Jenny Hong and Kyle Hsu and Jing Huang and Thomas Icard and Saahil Jain and Dan Jurafsky and Pratyusha Kalluri and Siddharth Karamcheti and Geoff Keeling and Fereshte Khani and Omar Khattab and Pang Wei Koh and Mark Krass and Ranjay Krishna and Rohith Kuditipudi and Ananya Kumar and Faisal Ladhak and Mina Lee and Tony Lee and Jure Leskovec and Isabelle Levent and Xiang Lisa Li and Xuechen Li and Tengyu Ma and Ali Malik and Christopher D. Manning and Suvir Mirchandani and Eric Mitchell and Zanele Munyikwa and Suraj Nair and Avanika Narayan and Deepak Narayanan and Ben Newman and Allen Nie and Juan Carlos Niebles and Hamed Nilforoshan and Julian Nyarko and Giray Ogut and Laurel Orr and Isabel Papadimitriou and Joon Sung Park and Chris Piech and Eva Portelance and Christopher Potts and Aditi Raghunathan and Rob Reich and Hongyu Ren and Frieda Rong and Yusuf Roohani and Camilo Ruiz and Jack Ryan and Christopher Ré and Dorsa Sadigh and Shiori Sagawa and Keshav Santhanam and Andy Shih and Krishnan Srinivasan and Alex Tamkin and Rohan Taori and Armin W. Thomas and Florian Tramèr and Rose E. Wang and William Wang and Bohan Wu and Jiajun Wu and Yuhuai Wu and Sang Michael Xie and Michihiro Yasunaga and Jiaxuan You and Matei Zaharia and Michael Zhang and Tianyi Zhang and Xikun Zhang and Yuhui Zhang and Lucia Zheng and Kaitlyn Zhou and Percy Liang},
year={2022},
eprint={2108.07258},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2108.07258},
}

@article{hatakeyama2023,
author = {Hatakeyama, Tomoyuki and Wang, Xueting and Yamasaki, Toshihiko},
year = {2023},
month = {08},
pages = {1-15},
title = {Transferability prediction among classification and regression tasks using optimal transport},
volume = {83},
journal = {Multimedia Tools and Applications},
doi = {10.1007/s11042-023-15852-6}
}

@article{mae2021,
author = {Kaiming He and Xinlei Chen and Saining Xie and Yanghao Li and Piotr Doll{\'a}r and Ross Girshick},
journal = {arXiv:2111.06377},
title = {Masked Autoencoders Are Scalable Vision Learners},
year = {2021},
}
