diff --git a/docs/src/_toc.yml b/docs/src/_toc.yml index 6673253..437cff8 100644 --- a/docs/src/_toc.yml +++ b/docs/src/_toc.yml @@ -18,6 +18,7 @@ parts: chapters: - file: part3/compilation - file: part3/results - - caption: Other + - caption: Wrapping Up chapters: - - file: bibliography + - file: part4/conclusions + - file: part4/bibliography diff --git a/docs/src/part1/problem.md b/docs/src/part1/problem.md index 7245d9e..88d7f60 100644 --- a/docs/src/part1/problem.md +++ b/docs/src/part1/problem.md @@ -31,6 +31,7 @@ Masked Image Modelling is a self-supervised objective that consists in predictin The drawback of using these foundation models is that they are large and computationally expensive, which makes them unsuitable for deployment in production environments, especially on edge devices. To address this issue, we need to optimize these models. +(part1:objectives)= ## Objectives To ground our problem, we can use the framework described by {cite}`mcip` that Apple engineers use to deploy machine learning models on their devices. As we can see on {numref}`Figure {number} `, this consists on three steps: diff --git a/docs/src/part2/adapting.ipynb b/docs/src/part2/adapting.ipynb index 98bbacb..178ba99 100644 --- a/docs/src/part2/adapting.ipynb +++ b/docs/src/part2/adapting.ipynb @@ -342,15 +342,6 @@ "source": [ "Nice, that worked." ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "TODO:\n", - "- [ ] Mention `cls_token` case\n", - "- [ ] Mention image interpolation case" - ] } ], "metadata": { diff --git a/docs/src/part2/choosing.md b/docs/src/part2/choosing.md index a30e4c6..90ab9e5 100644 --- a/docs/src/part2/choosing.md +++ b/docs/src/part2/choosing.md @@ -6,6 +6,7 @@ Our task in this chapter is to choose a candidate architecture that allows us to use pre-trained vision foundation models as their backbone's feature extractor. +(part2:sota)= ## State of the Art A brief glimpse into the literature gives us some promising picks but also some fundamental questions. The first thing we find is that there is no clear winner between CNN-based and ViT-based models, especially when we factor latency/efficiency into the equation. Furthermore, neither CNN-based and ViT-based models have a clear best architectural variant (e.g vanilla ViT vs EVA's TrV's, ResNet vs ResNext) and sometimes the backbone's architecture itself is modified to better suit the task at hand (e.g Swin is a ViT with hierarchical features, useful for dense prediction tasks). Furthermore, some backbones are finetuned in task-specific datasets, which improves task-specific performance at expense of generality. diff --git a/docs/src/part2/training.md b/docs/src/part2/training.md index 5d17925..026fbfa 100644 --- a/docs/src/part2/training.md +++ b/docs/src/part2/training.md @@ -101,11 +101,7 @@ You can activate automatic mixed precision training by setting `train.amp.enable ## Training Results -In figures {numref}`boxap` and {numref}`loss` we can see the validation BoxAP and training loss over 12 epochs, respectively. - -TODO: -- Mention the little bump at the end from the learning rate scheduler (2eps) -- Mention that the model is not saturated +In figures {numref}`boxap` and {numref}`loss` we can see the validation BoxAP and training loss over 12 epochs, respectively. We can observe the little bump in accuracy at the last epoch of training, which is due to the lower learning rate. The model is not saturated, as we can see that the loss is still decreasing. 
::::{grid} 2 :::{grid-item-card} @@ -127,26 +123,22 @@ Training loss over 12 epochs ## Predicting performance at 50 epochs -TODO -- Mention that model is trained for 12eps and 50eps, but the 50ep is the one that is used in evaluations -- Let's fit some curves and forecast performance at 50eps -- Mention the little accuracy increase at the last 10eps of the training -- Mention that the normal vit config can be used as reference: detrex/projects/dino/configs/dino-vitdet/dino_vitdet_base_4scale_50ep.py -- lr scheduler information can be found at :detrex/detrex/config/configs/common/coco_schedule.py +The original model was trained for 50 epochs, so a comparison at this stage is unfair. However, we can fit some curves and forecast the performance at 50 epochs. As we can see in {numref}`scaling`, our model's validation BoxAP is well predicted by a power law. If we extrapolate this curve (see {numref}`scaling_prediction`), we can expect a performance of 54.54 at 50 epochs. However, this doesn't account for the bump in accuracy caused by the learning rate decay, so we can expect a slightly higher performance (~56 AP). Even 55 box AP would be +4.8 points over the original model {cite}`vitdet`, which is already a significant improvement. + ::::{grid} 2 :::{grid-item-card} :::{figure-md} scaling -Caption +Trying different curve fits (logarithmic, log-linear, power law). ::: ::: :::{grid-item-card} :::{figure-md} scaling_prediction -Caption +Predicting performance at 50 epochs with power law. ::: ::: :::: \ No newline at end of file diff --git a/docs/src/part3/compilation.ipynb b/docs/src/part3/compilation.ipynb index c8c4d64..b62bd31 100644 --- a/docs/src/part3/compilation.ipynb +++ b/docs/src/part3/compilation.ipynb @@ -111,14 +111,6 @@ "Notice that properties 1 and 2 are in conflict with each other. The more operators we have, the more expressive the IR is, but the harder it is to implement all of them. This is a trade-off that the PyTorch team has to balance. " ] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "TODO:\n", - "- [ ] Introduce ATEN (dialects), fx.Graph and link to Export IR, functionalization" - ] - }, { "cell_type": "markdown", "metadata": {}, @@ -705,15 +697,6 @@ "torch.export.save(ep, \"simple_net.pt2\")" ] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## TensorRT\n", - "\n", - "- [ ] Introduction to TensorRT" - ] - }, { "cell_type": "markdown", "metadata": {}, @@ -811,13 +794,13 @@ "\n", "At the end of the compilation process, you should see a message indicating the output directory:\n", "\n", - "```txt\n", + "```\n", "OUTPUT DIR: outputs/2024-10-31/10-43-31\n", "```\n", "\n", "This output directory will contain the following files:\n", "\n", - "```txt\n", + "```\n", "├── export_tensorrt.log # log file (useful for debugging process)\n", "├── .hydra\n", "│ ├── config.yaml # config file (useful for remembering the parameters used)\n", @@ -838,7 +821,7 @@ "1. DinoV2 + ViTDet + DINO: Successful compilation, minimal final rewrites.\n", "2. ViT + ViTDet + Cascade Mask RCNN: Almost successful, many final rewrites.\n", "\n", - "To follow the thought process in a single notebook, I've added flags throughout the model's source code to activate or deactivate the most important fixes. To see *all* the changes, you can check all the differences between my forks of `detectron2`, `detrex` and the original repositories."
+ "To follow the thought process in a single notebook, we've added flags throughout the model's source code to activate or deactivate the most important fixes. To see *all* the changes, you can check all the differences between my forks of `detectron2`, `detrex` and the original repositories." ] }, { @@ -1087,7 +1070,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "#### [❌] Increasing constant tensor limit" + "#### [x] Increasing constant tensor limit" ] }, { @@ -1140,7 +1123,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "#### [✅] Rewriting code for non-tensor constants" + "#### [✓] Rewriting code for non-tensor constants" ] }, { @@ -1150,7 +1133,7 @@ "The second solution is to rewrite the code to keep `spatial_shapes` as a list of tuples. This works because PyTorch automatically considers lists and integers as constants. \n", "\n", "The disadvantages of this approach are:\n", - "- It's a bit more intrusive and error-prone.\n", + "- It's a bit more intrusive and error-prone because you need to rewrite all torch operations that use `spatial_shapes` with standard python list operations.\n", "- We will have to disable the deformable attention cuda kernel because it expects a tensor `spatial_shapes`. Maybe the kernel could be rewritten, but TensorRT is already good enough at optimizing the python implementation.\n", "\n", "The advantages are:\n", @@ -1203,7 +1186,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "#### [❌] Handling model I/O with PyTree Node Registrations" + "#### [x] Handling model I/O with PyTree Node Registrations" ] }, { @@ -1298,14 +1281,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "#### [✅] Handling model I/O with `TracingAdapter`" + "#### [✓] Handling model I/O with `TracingAdapter`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "I've added PT2E support to `detectron2.export.flatten.TracingAdapter` which does all the flattening for you and also optionally folds the non-tensor inputs as model constants, which applies to our case (height, width are constants)." + "We've added PT2E support to `detectron2.export.flatten.TracingAdapter` which does all the flattening for you and also optionally folds the non-tensor inputs as model constants, which applies to our case (height, width are constants)." ] }, { @@ -1598,7 +1581,7 @@ "\n", "There are two other places with similar data-dependent expressions, so we'll skip them too.\n", "\n", - "For this, I've added the following flags:\n", + "For this, we've added the following flags:\n", "- `detectron2.modeling.proposal_generator.proposal_utils.SKIP_NMS`.\n", "- `detectron2.modeling.roi_heads.fast_rcnn.SKIP_FILTER_CONFIDENCE` \n", "- `detectron2.modeling.roi_heads.fast_rcnn.SKIP_NMS`.\n" @@ -1803,7 +1786,7 @@ "source": [ "No luck.\n", "\n", - "This is where we stop. This framework-specific bugs are hard to debug and fix as they often are bugs in the compiler itself. In my experience with the previous case study, these bugs fixed themselves by rewriting the model in order to avoid graph partitioning alltogether. We can obtain the unsupported nodes by feeding `debug=True` to `torch_tensorrt.dynamo.compile`.\n", + "This is where we stop. This framework-specific bugs are hard to debug and fix as they often are bugs in the compiler itself, so let's just report it ([pytorch/TensorRT/3269](https://github.com/pytorch/TensorRT/issues/3269)). 
In our experience with the previous case study, these bugs went away once we rewrote the model to avoid graph partitioning altogether. We can obtain the unsupported nodes by feeding `debug=True` to `torch_tensorrt.dynamo.compile`.\n", "\n", "For this model, the unsupported nodes after the removing the filtering steps (non-maximum-suppresion, etc) are:\n", "- `torch.ops.aten.nonzero.default`\n", @@ -1811,13 +1794,8 @@ "- `torch.ops.torchvision.roi_align.default`\n", "- `torch.ops.aten.index_put.default`\n", "\n", - "However, we've already rewritten essential parts of the model and my guess is that if we continued with more rewrites, the resulting model would not be usable. For example, the weights of window attention do not have the same the same shape as that of the global attention, so the pre-trained model likely already needs finetuning or might not even work anymore." + "Rewriting the model to avoid these nodes would let TensorRT skip graph partitioning altogether and thus reduce its dependence on shape analysis. However, we've already rewritten essential parts of the model, and our guess is that if we continued with more rewrites, the resulting model would not be usable. For example, the weights of window attention do not have the same shape as those of global attention, so the pre-trained model likely already needs finetuning or might not even work anymore." ] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [] } ], "metadata": { diff --git a/docs/src/part3/results.md b/docs/src/part3/results.md index 61c3362..b4a85d7 100644 --- a/docs/src/part3/results.md +++ b/docs/src/part3/results.md @@ -88,12 +88,20 @@ We include the previous results for completeness, in case the issue is resolved | model's precision | trt.enabled_precisions | latency | | ----------------- | ---------------------- | ------- | -| fp32 | fp32+fp16 | 13.984 | | fp32 | fp32+bf16+fp16 | 13.898 | +| fp32 | fp32+fp16 | 13.984 | | fp32 | fp32+bf16 | 17.261 | -| bf16 | fp32+bf16 | 22.913 | -| bf16 | bf16 | 22.938 | +| fp32+bf16 | fp32+bf16 | 22.913 | | fp32 | fp32 | 37.639 | ``` ::: + +## Observations + +Some observations we can gather from {numref}`Table {number} `, {numref}`Table {number} `, {numref}`Table {number} ` and {numref}`Table {number} ` are: +- Compared to the baseline (76 ms), we have achieved a 5x speedup (15 ms). +- The C++ runtime is negligibly faster than the Python runtime (<1 ms) when using TensorRT. +- Depending on the `torch_tensorrt` version, the best performance comes from either manually setting the precision to `fp16` with `torch.amp.autocast` or letting `torch_tensorrt` handle mixed precision. +- Memory usage is reduced by half when using TensorRT with mixed precision, compared to full precision in Eager Python. + diff --git a/docs/src/bibliography.md b/docs/src/part4/bibliography.md similarity index 100% rename from docs/src/bibliography.md rename to docs/src/part4/bibliography.md diff --git a/docs/src/part4/conclusions.md b/docs/src/part4/conclusions.md new file mode 100644 index 0000000..24c8358 --- /dev/null +++ b/docs/src/part4/conclusions.md @@ -0,0 +1,8 @@ +# Conclusions + +In terms of results, the main takeaway is that, if we consider the {ref}`part2:sota`, we've achieved 15 ms (or equivalently, 67 FPS), which places us near YOLOv8-X (59 FPS). Although this is not a fair comparison, as YOLOv8-X runs on a slower T4 GPU, this still leaves us in a good position for real-time applications. Furthermore, we've managed to halve the memory usage of the model (from 1GB to 500MB).
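+For reference, the sketch below illustrates the two mixed-precision paths compared in the results tables, which are the ones behind the 15 ms figure above. It is a minimal sketch on a toy stand-in model rather than the actual detector, and the exact arguments accepted depend on the installed `torch_tensorrt` version:
+
+```python
+import torch
+import torch_tensorrt
+
+# Toy stand-in for the exported detector, just to keep the sketch self-contained.
+model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU()).cuda().eval()
+example_input = torch.randn(1, 3, 1024, 1024, device="cuda")
+
+ep = torch.export.export(model, (example_input,))
+
+# Option A: let torch_tensorrt handle mixed precision via enabled_precisions.
+trt_mixed = torch_tensorrt.dynamo.compile(
+    ep,
+    inputs=[example_input],
+    enabled_precisions={torch.float32, torch.bfloat16, torch.float16},
+)
+
+# Option B: keep the engine in fp32 and set the precision manually with autocast
+# (which option is faster depends on the torch_tensorrt version, as noted in the
+# results observations).
+trt_fp32 = torch_tensorrt.dynamo.compile(ep, inputs=[example_input], enabled_precisions={torch.float32})
+with torch.inference_mode(), torch.amp.autocast("cuda", dtype=torch.float16):
+    output = trt_fp32(example_input)
+```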
+ +The bulk of this work was spent on optimization through compilation, which we've thoroughly documented in the two case studies of {ref}`part2:compilingmodel`. However, much work can still be done on optimizing the model itself (quantization, structured pruning, etc.) and on architecture search, as described in {ref}`part1:objectives`. + +Although not documented in these tutorials, we've tried quantization techniques that were unsuccessful or whose results were not significant enough to be included in this report. Scripts for these experiments are available in the repository, and future work needs to be done in this direction if we aim to deploy these models not only on small edge GPUs but also on mobile devices. +
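+For readers who want to pick up this thread, below is a minimal sketch of a PT2E post-training quantization flow, shown on a toy model rather than the detector; the capture API and the quantizer choice depend on the PyTorch version and the target backend, and the actual experiment scripts in the repository may differ:
+
+```python
+import torch
+from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
+from torch.ao.quantization.quantizer.xnnpack_quantizer import (
+    XNNPACKQuantizer,
+    get_symmetric_quantization_config,
+)
+
+# Toy stand-in for the detector; calibration data would come from the training set.
+model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU()).eval()
+example_input = torch.randn(1, 3, 224, 224)
+
+# Capture the graph (older PyTorch releases use capture_pre_autograd_graph instead).
+graph_module = torch.export.export_for_training(model, (example_input,)).module()
+
+quantizer = XNNPACKQuantizer().set_global(get_symmetric_quantization_config())
+prepared = prepare_pt2e(graph_module, quantizer)
+prepared(example_input)             # calibration pass(es) with representative data
+quantized = convert_pt2e(prepared)  # int8 model, ready for lowering and benchmarking
+```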