diff --git a/docs/AI/CS231n/CS231n_notes.md b/docs/AI/CS231n/CS231n_notes.md new file mode 100644 index 00000000..e773b78a --- /dev/null +++ b/docs/AI/CS231n/CS231n_notes.md @@ -0,0 +1,1952 @@ +# Computer Vision + +This note is based on [GitHub - DaizeDong/Stanford-CS231n-2021-and-2022: Notes and slides for Stanford CS231n 2021 & 2022 in English. I merged the contents together to get a better version. Assignments are not included. 斯坦福cs231n的课程笔记(英文版本,不含实验代码),将2021与2022两年的课程进行了合并,分享以供交流。](https://github.com/DaizeDong/Stanford-CS231n-2021-and-2022/) +And I will add some blogs, articles and other understanding. + +| Topic | Chapter | +| ---------------------------------------------------- | ------- | +| Deep Learning Basics | 2 - 4 | +| Perceiving and Understanding the Visual World | 5 - 12 | +| Reconstructing and Interacting with the Visual World | 13 - 16 | +| Human-Centered Applications and Implications | 17 - 18 | + +## 1 - Introduction + +A brief history of computer vision & deep learning... + +## 2 - Image Classification + +**Image Classification:** A core task in Computer Vision. The main drive to the progress of CV. + +**Challenges:** Viewpoint variation, background clutter, illumination, occlusion, deformation, intra-class variation... + +### K Nearest Neighbor + +**Hyperparameters:** Distance metric ($p$ norm), $k$ number. + +Choose hyperparameters using validation set. + +Never use k-Nearest Neighbor with pixel distance. + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/2-cross_validation.png) + +### Linear Classifier + +Pass... + +## 3 - Loss Functions and Optimization + +### Loss Functions + +| Dataset | $\big\{(x_i,y_i)\big\}_{i=1}^N\\$ | +| --------------------------------- | ------------------------------------------------------------ | +| Loss Function | $L=\frac{1}{N}\sum_{i=1}^NL_i\big(f(x_i,W),y_i\big)\\$ | +| Loss Function with Regularization | $L=\frac{1}{N}\sum_{i=1}^NL_i\big(f(x_i,W),y_i\big)+\lambda R(W)\\$ | + +**Motivation:** Want to interpret raw classifier scores as probabilities. + +| Softmax Classifier | $p_i=Softmax(y_i)=\frac{\exp(y_i)}{\sum_{j=1}^N\exp(y_j)}\\$ | +| -------------------------------------- | ------------------------------------------------------------ | +| Cross Entropy Loss | $L_i=-y_i\log p_i\\$ | +| Cross Entropy Loss with Regularization | $L=-\frac{1}{N}\sum_{i=1}^Ny_i\log p_i+\lambda R(W)\\$ | + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/3-loss.png) + +### Optimization + +#### SGD with Momentum + +**Problems that SGD can't handle:** + +1. Inequality of gradient in different directions. +2. Local minima and saddle point (much more common in high dimension). +3. Noise of gradient from mini-batch. + +**Momentum:** Build up “velocity” $v_t$ as a running mean of gradients. + +| SGD | SGD + Momentum | +| --------------------------------- | ------------------------------------------------------------ | +| $x_{t+1}=x_t-\alpha\nabla f(x_t)$ | $\begin{align}&v_{t+1}=\rho v_t+\nabla f(x_t)\\&x_{t+1}=x_t-\alpha v_{t+1}\end{align}$ | +| Naive gradient descent. | $\rho$ gives "friction", typically $\rho=0.9,0.99,0.999,...$ | + +**Nesterov Momentum:** Use the derivative on point $x_t+\rho v_t$ as gradient instead point $x_t$. 
+ +| Momentum | Nesterov Momentum | +| ------------------------------------------------------------ | ------------------------------------------------------------ | +| $\begin{align}&v_{t+1}=\rho v_t+\nabla f(x_t)\\&x_{t+1}=x_t-\alpha v_{t+1}\end{align}$ | $\begin{align}&v_{t+1}=\rho v_t+\nabla f(x_t+\rho v_t)\\&x_{t+1}=x_t-\alpha v_{t+1}\end{align}$ | +| Use gradient at current point. | Look ahead for the gradient in velocity direction. | + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/3-momentum.png) + +#### AdaGrad and RMSProp + +**AdaGrad:** Accumulate squared gradient, and gradually decrease the step size. + +**RMSProp:** Accumulate squared gradient while decaying former ones, and gradually decrease the step size. ("Leaky AdaGrad") + +| AdaGrad | RMSProp | +| ------------------------------------------------------------ | ------------------------------------------------------------ | +| $\begin{align}\text{Initialize:}&\\&r:=0\\\text{Update:}&\\&r:=r+\Big[\nabla f(x_t)\Big]^2\\&x_{t+1}=x_t-\alpha\frac{\nabla f(x_t)}{\sqrt{r}}\end{align}$ | $\begin{align}\text{Initialize:}&\\&r:=0\\\text{Update:}&\\&r:=\rho r+(1-\rho)\Big[\nabla f(x_t)\Big]^2\\&x_{t+1}=x_t-\alpha\frac{\nabla f(x_t)}{\sqrt{r}}\end{align}$ | +| Continually accumulate squared gradients. | $\rho$ gives "decay rate", typically $\rho=0.9,0.99,0.999,...$ | + +#### Adam + +Sort of like "RMSProp + Momentum". + +| Adam (simple version) | Adam (full version) | +| ------------------------------------------------------------ | ------------------------------------------------------------ | +| $\begin{align}\text{Initialize:}&\\&r_1:=0\\&r_2:=0\\\text{Update:}&\\&r_1:=\beta_1r_1+(1-\beta_1)\nabla f(x_t)\\&r_2:=\beta_2r_2+(1-\beta_2)\Big[\nabla f(x_t)\Big]^2\\&x_{t+1}=x_t-\alpha\frac{r_1}{\sqrt{r_2}}\end{align}$ | $\begin{align}\text{Initialize:}\\&r_1:=0\\&r_2:=0\\\text{For }i\text{:}\\&r_1:=\beta_1r_1+(1-\beta_1)\nabla f(x_t)\\&r_2:=\beta_2r_2+(1-\beta_2)\Big[\nabla f(x_t)\Big]^2\\&r_1'=\frac{r_1}{1-\beta_1^i}\\&r_2'=\frac{r_2}{1-\beta_2^i}\\&x_{t+1}=x_t-\alpha\frac{r_1'}{\sqrt{r_2'}}\end{align}$ | +| Build up “velocity” for both gradient and squared gradient. | Correct the "bias" that $r_1=r_2=0$ for the first few iterations. | + +#### Overview + +| ![](https://raw.githubusercontent.com/WncFht/picture/main/picture/3-optimization_overview.gif) | ![](https://raw.githubusercontent.com/WncFht/picture/main/picture/3-optimization_overview2.gif) | +| :----------------------------------------------------------: | :----------------------------------------------------------: | + +#### Learning Rate Decay + +Reduce learning rate at a few fixed points to get a better convergence over time. + +$\alpha_0$ : Initial learning rate. + +$\alpha_t$ : Learning rate in epoch $t$. + +$T$ : Total number of epochs. + +| Method | Equation | Picture | +| ------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | +| Step | Reduce $\alpha_t$ constantly in a fixed step. 
| ![](https://raw.githubusercontent.com/WncFht/picture/main/picture/3-learning_rate_step.png) | +| Cosine | $\begin{align}\alpha_t=\frac{1}{2}\alpha_0\Bigg[1+\cos(\frac{t\pi}{T})\Bigg]\end{align}$ | ![](https://raw.githubusercontent.com/WncFht/picture/main/picture/3-learning_rate_cosine.png) | +| Linear | $\begin{align}\alpha_t=\alpha_0\Big(1-\frac{t}{T}\Big)\end{align}$ | ![](https://raw.githubusercontent.com/WncFht/picture/main/picture/3-learning_rate_linear.png) | +| Inverse Sqrt | $\begin{align}\alpha_t=\frac{\alpha_0}{\sqrt{t}}\end{align}$ | ![](https://raw.githubusercontent.com/WncFht/picture/main/picture/3-learning_rate_sqrt.png) | + +High initial learning rates can make loss explode, linearly increasing learning rate in the first few iterations can prevent this. + +**Learning rate warm up:** + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/3-learning_rate_increase.png) + +**Empirical rule of thumb:** If you increase the batch size by $N$, also scale the initial learning rate by $N$ . + +#### Second-Order Optimization + +| | Picture | Time Complexity | Space Complexity | +| ------------ | ------------------------------------------------------- | ----------------------------------- | ----------------------------------- | +| First Order | ![](https://raw.githubusercontent.com/WncFht/picture/main/picture/3-first_order.png) | $O(n)$ | $O(n)$ | +| Second Order | ![](https://raw.githubusercontent.com/WncFht/picture/main/picture/3-second_order.png) | $O(n^2)$ with **BGFS** optimization | $O(n)$ with **L-BGFS** optimization | + +**L-BGFS :** Limited memory BGFS. + +1. Works very well in full batch, deterministic $f(x)$. +2. Does not transfer very well to mini-batch setting. + +#### Summary + +| Method | Performance | +| -------------- | ------------------------------------------------------------ | +| Adam | Often chosen as default method.
Works OK even with a constant learning rate. | +| SGD + Momentum | Can outperform Adam.
Require more tuning of learning rate and schedule. | +| L-BGFS | If can afford to do full batch updates then try out. | + +- An article about gradient descent: [Anoverview of gradient descent optimization algorithms](https://arxiv.org/pdf/1609.04747) +- A blog: [An updated overview of recent gradient descent algorithms – John Chen – ML at Rice University](https://johnchenresearch.github.io/demon/) + +## 4 - Neural Networks and Backpropagation + +### Neural Networks + +**Motivation:** Inducted bias can appear to be high when using human-designed features. + +**Activation:** Sigmoid, tanh, ReLU, LeakyReLU... + +**Architecture:** Input layer, hidden layer, output layer. + +**Do not use the size of a neural network as the regularizer. Use regularization instead!** + +**Gradient Calculation:** Computational Graph + Backpropagation. + +### Backpropagation + +Using Jacobian matrix to calculate the gradient of each node in a computation graph. + +Suppose that we have a computation flow like this: + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/4-graph.png) + +| Input X | Input W | Output Y | +| ----------------------------------------------------- | ------------------------------------------------------------ | ----------------------------------------------------- | +| $X=\begin{bmatrix}x_1\\x_2\\\vdots\\x_n\end{bmatrix}$ | $W=\begin{bmatrix}w_{11}&w_{12}&\cdots&w_{1n}\\w_{21}&w_{22}&\cdots&w_{2n}\\\vdots&\vdots&\ddots&\vdots\\w_{m1}&w_{m2}&\cdots&w_{mn}\end{bmatrix}$ | $Y=\begin{bmatrix}y_1\\y_2\\\vdots\\y_m\end{bmatrix}$ | +| $n\times 1$ | $m\times n$ | $m\times 1$ | + +After applying feed forward, we can calculate gradients like this: + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/4-graph2.png) + +| Derivative Matrix of X | Jacobian Matrix of X | Derivative Matrix of Y | +| ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | +| $D_X=\begin{bmatrix}\frac{\partial L}{\partial x_1}\\\frac{\partial L}{\partial x_2}\\\vdots\\\frac{\partial L}{\partial x_n}\end{bmatrix}$ | $J_X=\begin{bmatrix}\frac{\partial y_1}{\partial x_1}&\frac{\partial y_1}{\partial x_2}&\cdots&\frac{\partial y_1}{\partial x_n}\\\frac{\partial y_2}{\partial x_1}&\frac{\partial y_2}{\partial x_2}&\cdots&\frac{\partial y_2}{\partial x_n}\\\vdots&\vdots&\ddots&\vdots\\\frac{\partial y_m}{\partial x_1}&\frac{\partial y_m}{\partial x_2}&\cdots&\frac{\partial y_m}{\partial x_n}\end{bmatrix}$ | $D_Y=\begin{bmatrix}\frac{\partial L}{\partial y_1}\\\frac{\partial L}{\partial y_2}\\\vdots\\\frac{\partial L}{\partial y_m}\end{bmatrix}$ | +| $n\times 1$ | $m\times n$ | $m\times 1$ | + +| Derivative Matrix of W | Jacobian Matrix of W | Derivative Matrix of Y | +| ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | +| $W=\begin{bmatrix}\frac{\partial L}{\partial w_{11}}&\frac{\partial L}{\partial w_{12}}&\cdots&\frac{\partial L}{\partial w_{1n}}\\\frac{\partial L}{\partial w_{21}}&\frac{\partial L}{\partial w_{22}}&\cdots&\frac{\partial L}{\partial w_{2n}}\\\vdots&\vdots&\ddots&\vdots\\\frac{\partial L}{\partial w_{m1}}&\frac{\partial L}{\partial w_{m2}}&\cdots&\frac{\partial L}{\partial w_{mn}}\end{bmatrix}$ | $J_W^{(k)}=\begin{bmatrix}\frac{\partial y_k}{\partial w_{11}}&\frac{\partial y_k}{\partial w_{12}}&\cdots&\frac{\partial 
y_k}{\partial w_{1n}}\\\frac{\partial y_k}{\partial w_{21}}&\frac{\partial y_k}{\partial w_{22}}&\cdots&\frac{\partial y_k}{\partial w_{2n}}\\\vdots&\vdots&\ddots&\vdots\\\frac{\partial y_k}{\partial w_{m1}}&\frac{\partial y_k}{\partial w_{m2}}&\cdots&\frac{\partial y_k}{\partial w_{mn}}\end{bmatrix}$
$J_W=\begin{bmatrix}J_W^{(1)}&J_W^{(2)}&\cdots&J_W^{(m)}\end{bmatrix}$ | $D_Y=\begin{bmatrix}\frac{\partial L}{\partial y_1}\\\frac{\partial L}{\partial y_2}\\\vdots\\\frac{\partial L}{\partial y_m}\end{bmatrix}$ | +| $m\times n$ | $m\times m\times n$ | $ m\times 1$ | + +For each element in $D_X$ , we have: + +$D_{Xi}=\frac{\partial L}{\partial x_i}=\sum_{j=1}^m\frac{\partial L}{\partial y_j}\frac{\partial y_j}{\partial x_i}\\$ + +## 5 - Convolutional Neural Networks + +### Convolution Layer + +#### Introduction + +**Convolve a filter with an image:** Slide the filter spatially within the image, computing dot products in each region. + +Giving a $32\times32\times3$ image and a $5\times5\times3$ filter, a convolution looks like: + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/5-convolution.png) + +Convolve six $5\times5\times3$ filters to a $32\times32\times3$ image with step size $1$, we can get a $28\times28\times6$ feature: + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/5-convolution_six_filters.png) + +With an activation function after each convolution layer, we can build the ConvNet with a sequence of convolution layers: + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/5-convolution_net.png) + +By **changing the step size** between each move for filters, or **adding zero-padding** around the image, we can modify the size of the output: + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/5-convolution_padding.png) + +#### $1\times1$ Convolution Layer + +This kind of layer makes perfect sense. It is usually used to change the dimension (channel) of features. + +A $1\times1$ convolution layer can also be treated as a full-connected linear layer. + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/5-convolution_1times1.png) + +#### Summary + +| **Input** | | +| ------------------------- | ---------------------------- | +| image size | $W_1\times H_1\times C$ | +| filter size | $F\times F\times C$ | +| filter number | $K$ | +| stride | $S$ | +| zero padding | $P$ | +| **Output** | | +| output size | $W_2\times H_2\times K$ | +| output width | $W_2=\frac{W_1-F+2P}{S}+1\\$ | +| output height | $H_2=\frac{H_1-F+2P}{S}+1\\$ | +| **Parameters** | | +| parameter number (weight) | $F^2CK$ | +| parameter number (bias) | $K$ | + +### Pooling layer + +Make the representations smaller and more manageable. + +**An example of max pooling:** + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/5-pooling.png) + +| **Input** | | +| -------------- | ------------------------- | +| image size | $W_1\times H_1\times C$ | +| spatial extent | $F\times F$ | +| stride | $S$ | +| **Output** | | +| output size | $W_2\times H_2\times C$ | +| output width | $W_2=\frac{W_1-F}{S}+1\\$ | +| output height | $H_2=\frac{H_1-F}{S}+1\\$ | + +### Convolutional Neural Networks (CNN) + +CNN stack CONV, POOL, FC layers. + +**CNN Trends:** + +1. Smaller filters and deeper architectures. +2. Getting rid of POOL/FC layers (just CONV). + +**Historically architectures of CNN looked like:** + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/5-model_history.png) + +where usually $m$ is large, $0\le n\le5$, $0\le k\le2$. + +Recent advances such as **ResNet** / **GoogLeNet** have challenged this paradigm. + +## 6 - CNN Architectures + +Best model in ImageNet competition: + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/6-image_net.png) + +### AlexNet + +8 layers. 
+ +First use of ConvNet in image classification problem. + +Filter size decreases in deeper layer. + +Channel number increases in deeper layer. + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/6-alexnet.png) + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/6-alexnet_p.png) + +### VGG + +19 layers. (also provide 16 layers edition) + +Static filter size ($3\times3$) in all layers: + +1. The effective receptive field expands with the layer gets deeper. +2. Deeper architecture gets more non-linearities and few parameters. + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/6-vgg_field.png) + +Most memory is in early convolution layers. + +Most parameter is in late FC layers. + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/6-vgg.png) + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/6-vgg_p.png) + +### GoogLeNet + +22 layers. + +No FC layers, only 5M parameters. ( $8.3\%$ of AlexNet, $3.7\%$ of VGG ) + +Devise efficient "inception module". + +#### Inception Module + +Design a good local network topology (network within a network) and then stack these modules on top of each other. + +**Naive Inception Module:** + +1. Apply parallel filter operations on the input from previous layer. +2. Concatenate all filter outputs together channel-wise. +3. **Problem:** The depth (channel number) increases too fast, costing expensive computation. + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/6-googlenet_inception.png) + +**Inception Module with Dimension Reduction:** + +1. Add "bottle neck" layers to reduce the dimension. +2. Also get fewer computation cost. + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/6-googlenet_inception_revised.png) + +#### Architecture + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/6-googlenet_p.png) + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/6-googlenet_p2.png) + +### ResNet + +152 layers for ImageNet. + +Devise "residual connections". + +Use BN in place of dropout. + +#### Residual Connections + +**Hypothesis:** Deeper models have more representation power than shallow ones. But they are harder to optimize. + +**Solution:** Use network layers to fit a residual mapping instead of directly trying to fit a desired underlying mapping. + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/6-resnet_residual.png) + +It is necessary to use ReLU as activation function, in order to apply identity mapping when $F(x)=0$ . + +#### Architecture + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/6-resnet_train.png) + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/6-resnet_p.png) + +### SENet + +Using ResNeXt-152 as a base architecture. + +Add a “feature recalibration” module. **(adjust weights of each channel)** + +Using the **global avg-pooling layer** + **FC layers** to determine feature map weights. + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/6-senet_p.png) + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/6-senet_p2.png) + +### Improvements of ResNet + +Wide Residual Networks, ResNeXt, DenseNet, MobileNets... + +### Other Interesting Networks + +**NASNet:** Neural Architecture Search with Reinforcement Learning. + +**EfficientNet:** Smart Compound Scaling. 
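
As a recap of the residual connection described in the ResNet section above, here is a minimal sketch of a basic (same-channel, stride-1) residual block. It assumes PyTorch is available; the class and variable names are illustrative, not the reference implementation.

```python
# Minimal residual block sketch: y = ReLU(F(x) + x), with F = conv-BN-ReLU-conv-BN.
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = x                               # identity shortcut
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))            # F(x)
        return torch.relu(out + residual)          # add the shortcut, then ReLU

x = torch.randn(1, 64, 32, 32)
print(BasicResidualBlock(64)(x).shape)             # torch.Size([1, 64, 32, 32])
```

When $F(x)=0$ the block passes its input through essentially unchanged, which matches the identity-mapping intuition above and is what makes very deep stacks of such blocks easier to optimize.
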
+ +## 7 - Training Neural Networks + +### Activation Functions + +| Activation | Usage | +| ----------------------------- | ------------------------------------------------ | +| Sigmoid, tanh | Do not use. | +| ReLU | Use as default. | +| Leaky ReLU, Maxout, ELU, SELU | Replace ReLU to squeeze out some marginal gains. | +| Swish | No clear usage. | + +### Data Processing + +Apply centralization and normalization before training. + +In practice for pictures, usually we apply channel-wise centralization only. + +### Weight Initialization + +Assume that we have 6 layers in a network. + +$D_i$ : input size of layer $i$ + +$W_i$ : weights in layer $i$ + +$X_i$ : output after activation of layer $i$, we have $X_i=g(Z_i)=g(W_iX_{i-1}+B_i)$ + +**We initialize each parameter in $W_i$ randomly in $[-k_i,k_i]$ .** + +| Tanh Activation | Output Distribution | +| :----------------------------------------------------: | :------------------------------------------------------: | +| $k_i=0.01$ | ![](https://raw.githubusercontent.com/WncFht/picture/main/picture/7-sigmoid_0.01.png) | +| $k_i=0.05$ | ![](https://raw.githubusercontent.com/WncFht/picture/main/picture/7-sigmoid_0.05.png) | +| **Xavier Initialization** $k_i=\frac{1}{\sqrt{D_i}\\}$ | ![](https://raw.githubusercontent.com/WncFht/picture/main/picture/7-sigmoid_xavier.png) | + +When $k_i=0.01$, the variance keeps decreasing as the layer gets deeper. As a result, the output of each neuron in deep layer will all be 0. The partial derivative $\frac{\partial Z_i}{\partial W_i}=X_{i-1}=0\\$. (no gradient) + +When $k_i=0.05$, most neurons is saturated. The partial derivative $\frac{\partial X_i}{\partial Z_i}=g'(Z_i)=0\\$. (no gradient) + +**To solve this problem, We need to keep the variance same in each layer.** + +Assuming that $Var\big(X_{i-1}^{(1)}\big)=Var\big(X_{i-1}^{(2)}\big)=\dots=Var\big(X_{i-1}^{(D_i)}\big)$ + +We have $Z_i=X_{i-1}^{(1)}W_i^{(:,1)}+X_{i-1}^{(2)}W_i^{(:,2)}+\dots+X_{i-1}^{(D_i)}W_i^{(:,D_i)}=\sum_{n=1}^{D_i}X_{i-1}^{(n)}W_i^{(:,n)}\\$ + +We want $Var\big(Z_i\big)=Var\big(X_{i-1}^{(n)}\big)$ + +**Let's do some conduction:** + +$\begin{aligned}Var\big(Z_i\big)&=Var\Bigg(\sum_{n=1}^{D_i}X_{i-1}^{(n)}W_i^{(:,n)}\Bigg)\\&=D_i\ Var\Big(X_{i-1}^{(n)}W_i^{(:,n)}\Big)\\&=D_i\ Var\Big(X_{i-1}^{(n)}\Big)\ Var\Big(W_i^{(:,n)}\Big)\end{aligned}$ + +So $Var\big(Z_i\big)=Var\big(X_{i-1}^{(n)}\big)$ only when $Var\Big(W_i^{(:,n)}\Big)=\frac{1}{D_i}\\$, that is to say $k_i=\frac{1}{\sqrt{D_i}}\\$ + +| ReLU Activation | Output Distribution | +| :----------------------------------------------------: | :----------------------------------------------------: | +| **Xavier Initialization** $k_i=\frac{1}{\sqrt{D_i}\\}$ | ![](https://raw.githubusercontent.com/WncFht/picture/main/picture/7-relu_xavier.png) | +| **Kaiming Initialization** $k_i=\sqrt{2D_i}$ | ![](https://raw.githubusercontent.com/WncFht/picture/main/picture/7-relu_kaiming.png) | + +For ReLU activation, when using xavier initialization, there still exist "variance decreasing" problem. + +We can use kaiming initialization instead to fix this. + +### Batch Normalization + +Force the inputs to be "nicely scaled" at each layer. + +$N$ : batch size + +$D$ : feature size + +$x$ : input with shape $N\times D$ + +$\gamma$ : learnable scale and shift parameter with shape $D$ + +$\beta$ : learnable scale and shift parameter with shape $D$ + +**The procedure of batch normalization:** + +1. Calculate channel-wise mean $\mu_j=\frac{1}{N}\sum_{i=1}^Nx_{i,j}\\$ . The result $\mu$ with shape $D$ . +2. 
Calculate channel-wise variance $\sigma_j^2=\frac{1}{N}\sum_{i=1}^N(x_{i,j}-\mu_j)^2\\$ . The result $\sigma^2$ with shape $D$ . +3. Calculate normalized $\hat{x}_{i,j}=\frac{x_{i,j}-\mu_j}{\sqrt{\sigma_j^2+\epsilon}}\\$ . The result $\hat{x}$ with shape $N\times D$ . +4. Scale normalized input to get output $y_{i,j}=\gamma_j\hat{x}_{i,j}+\beta_j$ . The result $y$ with shape $N\times D$ . + + **Why scale:** The constraint "zero-mean, unit variance" may be too hard. + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/7-batch_norm.png) + +**Pros:** + +1. Makes deep networks much easier to train! +2. Improves gradient flow. +3. Allows higher learning rates, faster convergence. +4. Networks become more robust to initialization. +5. Acts as regularization during training. +6. Zero overhead at test-time: can be fused with conv! + +**Cons:** + + Behaves differently during training and testing: this is a very common source of bugs! + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/7-all_norm.png) + +### Transfer Learning + +Train on a pre-trained model with other datasets. + +**An empirical suggestion:** + +| | **very similar dataset** | **very different dataset** | +| ----------------------- | ----------------------------------- | ------------------------------------------------------------ | +| **very little data** | Use Linear Classifier on top layer. | You’re in trouble… Try linear classifier from different stages. | +| **quite a lot of data** | Finetune a few layers. | Finetune a larger number of layers. | + +### Regularization + +#### Common Pattern of Regularization + +Training: Add some kind of randomness. $y=f(x,z)$ + +Testing: Average out randomness (sometimes approximate). $y=f(x)=E_z\big[f(x,z)\big]=\int p(z)f(x,z)dz\\$ + +#### Regularization Term + +L2 regularization: $R(W)=\sum_k\sum_lW_{k,l}^2$ (weight decay) + +L1 regularization: $R(W)=\sum_k\sum_l|W_{k,l}|$ + +Elastic net : $R(W)=\sum_k\sum_l\big(\beta W_{k,l}^2+|W_{k,l}|\big)$ (L1+L2) + +#### Dropout + +Training: Randomly set some neurons to 0 with a probability $p$ . + +Testing: Each neuron multiplies by dropout probability $p$ . (scale the output back) + +**More common:** Scale the output with $\frac{1}{p}$ when training, keep the original output when testing. + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/7-dropout_p.png) + +**Why dropout works:** + +1. Forces the network to have a redundant representation. Prevents co-adaptation of features. +2. **Another interpretation:** Dropout is training a large ensemble of models (that share parameters). + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/7-dropout.png) + +#### Batch Normalization + +See above. + +#### Data Augmentation + +1. Horizontal Flips +2. Random Crops and Scales +3. Color Jitter +4. Rotation +5. Stretching +6. Shearing +7. Lens Distortions +8. ... + +There also exists automatic data augmentation method using neural networks. + +#### Other Methods and Summary + +**DropConnect**: Drop connections between neurons. + +**Fractional Max Pooling:** Use randomized pooling regions. + +**Stochastic Depth**: Skip some layers in the network. + +**Cutout:** Set random image regions to zero. + +**Mixup:** Train on random blends of images. + +| Regularization Method | Usage | +| --------------------------------------- | ---------------------------------- | +| Dropout | For large fully-connected layers. | +| Batch Normalization & Data Augmentation | Almost always a good idea. 
| +| Cutout & Mixup | For small classification datasets. | + +### Hyperparameter Tuning + +| Most Common Hyperparameters | Less Sensitive Hyperparameters | +| ------------------------------------------------------------ | ------------------------------ | +| learning rate
learning rate decay schedule
weight decay | setting of momentum
... | + +**Tips on hyperparameter tuning:** + +1. Prefer one validation fold to cross-validation. +2. Search for hyperparameters on log scale. (e.g. multiply the hyperparameter by a fixed number $k$ at each search) +3. Prefer **random search** to grid search. +4. Careful with best values on border. +5. Stage your search from coarse to fine. + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/7-random_search.png) + +#### Implementation + +Have a **worker** that continuously samples random hyperparameters and performs the optimization. During the training, the worker will keep track of the validation performance after every epoch, and writes a model checkpoint to a file. + +Have a **master** that launches or kills workers across a computing cluster, and may additionally inspect the checkpoints written by workers and plot their training statistics. + +#### Common Procedures + +1. **Check initial loss.** + + Turn off weight decay, sanity check loss at initialization $\log(C)$ for softmax with $C$ classes. + +2. **Overfit a small sample. (important)** + + Try to train to 100% training accuracy on a small sample of training data. + + Fiddle with architecture, learning rate, weight initialization. + +3. **Find learning rate that makes loss go down.** + + Use the architecture from the previous step, use all training data, turn on small weight decay, find a learning rate that makes the loss drop significantly within 100 iterations. + + Good learning rates to try: $0.1,0.01,0.001,0.0001,\dots$ + +4. **Coarse grid, train for 1-5 epochs.** + + Choose a few values of learning rate and weight decay around what worked from Step 3, train a few models for 1-5 epochs.\ + + Good weight decay to try: $0.0001,0.00001,0$ + +5. **Refine grid, train longer.** + + Pick best models from Step 4, train them for longer (10-20 epochs) without learning rate decay. + +6. **Look at loss and accuracy curves.** +7. **GOTO step 5.** + +### Gradient Checks + +[CS231n Convolutional Neural Networks for Visual Recognition](https://cs231n.github.io/neural-networks-3/#gradcheck) + +Compute analytical gradient manually using $f_a'=\frac{\partial f(x)}{\partial x}=\frac{f(x-h)-f(x+h)}{2h}\\$ + +Get relative error between numerical gradient $f_n'$ and analytical gradient $f_a'$ using $E=\frac{|f_n'-f_a'|}{\max{|f_n'|,|f_a'|}}\\$ + +| Relative Error | Result | +| ------------------- | ------------------------------------------------------------ | +| $E>10^{-2}$ | Probably $f_n'$ is wrong. | +| $10^{-2}>E>10^{-4}$ | Not good, should check the gradient. | +| $10^{-4}>E>10^{-6}$ | Okay for objectives with kinks. (e.g. ReLU)
Not good for objectives with no kink. (e.g. softmax, tanh) | +| $10^{-7}>E$ | Good. | + +**Tips on gradient checks:** + +1. Use double precision. +2. Use only few data points. +3. Careful about kinks in the objective. (e.g. $x=0$ for ReLU activation) +4. Careful with the step size $h$. +5. Use gradient check after the loss starts to go down. +6. Remember to turn off anything that may affect the gradient. (e.g. **regularization / dropout / augmentations**) +7. Check only few dimensions for **every parameter**. (reduce time cost) + +## 8 - Visualizing and Understanding + +### Feature Visualization and Inversion + +#### Visualizing what models have learned + +| Visualize Areas | | +| -------------------- | ------------------------------------------------------------ | +| Filters | Visualize the raw weights of each convolution kernel. (better in the first layer) | +| Final Layer Features | Run dimensionality reduction for features in the last FC layer. (PCA, t-SNE...) | +| Activations | Visualize activated areas. ([Understanding Neural Networks Through Deep Visualization](https://arxiv.org/abs/1506.06579)) | + +#### Understanding input pixels + +##### Maximally Activating Patches + +1. Pick a layer and a channel. +2. Run many images through the network, record values of the chosen channel. +3. Visualize image patches that correspond to maximal activation features. + +For example, we have a layer with shape $128\times13\times13$. We pick the 17th channel from all 128 channels. Then we run many pictures through the network. During each run we can find a maximal activation feature among all the $13\times13$ features in channel 17. We then record the corresponding picture patch for each maximal activation feature. At last, we visualize all picture patches for each feature. + +This will help us find the relationship between each maximal activation feature and its corresponding picture patches. + +(each row of the following picture represents a feature) + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/8-activating_patches.png) + +##### Saliency via Occlusion + +Mask part of the image before feeding to CNN, check how much predicted probabilities change. + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/8-saliency_via_occlusion.png) + +##### Saliency via Backprop + +1. Compute gradient of (unnormalized) class score with respect to image pixels. +2. Take absolute value and max over RGB channels to get **saliency maps**. + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/8-saliency_via_backprop.png) + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/8-saliency_via_backprop_p.png) + +##### Intermediate Features via Guided Backprop + +1. Pick a single intermediate neuron. (e.g. one feature in a $128\times13\times13$ feature map) +2. Compute gradient of neuron value with respect to image pixels. + +[Striving for Simplicity: The All Convolutional Net](https://arxiv.org/abs/1412.6806) + +Just like "Maximally Activating Patches", this could find the part of an image that a neuron responds to. + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/8-guided_backprop.png) + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/8-guided_backprop_p.png) + +##### Gradient Ascent + +Generate a synthetic image that maximally activates a neuron. + +1. Initialize image $I$ to zeros. +2. Forward image to compute current scores $S_c(I)$ (for class $c$ before softmax). +3. 
Backprop to get gradient of neuron value with respect to image pixels. +4. Make a small update to the image. + +Objective: $\max S_c(I)-\lambda\lVert I\lVert^2$ + +[Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps](https://arxiv.org/abs/1312.6034) + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/8-gradient_ascent.png) + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/8-gradient_ascent_p.png) + +### Adversarial Examples + +Find an fooling image that can make the network misclassify correctly-classified images when it is added to the image. + +1. Start from an arbitrary image. +2. Pick an arbitrary class. +3. Modify the image to maximize the class. +4. Repeat until network is fooled. + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/8-adversarial_examples.png) + +### DeepDream and Style Transfer + +#### Feature Inversion + +Given a CNN feature vector $\Phi_0$ for an image, find a new image $x$ that: + +1. Features of new image $\Phi(x)$ matches the given feature vector $\Phi_0$. +2. "looks natural”. (image prior regularization) + +Objective: $\min \lVert\Phi(x)-\Phi_0\lVert+\lambda R(x)$ + +[Understanding Deep Image Representations by Inverting Them](https://arxiv.org/abs/1412.0035) + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/8-feature_inversion.png) + +#### DeepDream: Amplify Existing Features + +Given an image, amplify the neuron activations at a layer to generate a new one. + +1. Forward: compute activations at chosen layer. +2. Set gradient of chosen layer equal to its activation. +3. Backward: Compute gradient on image. +4. Update image. + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/8-deepdream.png) + +#### Texture Synthesis + +##### Nearest Neighbor + +1. Generate pixels one at a time in scanline order +2. Form neighborhood of already generated pixels, copy the nearest neighbor from input. + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/8-texture_synthesis_nn.png) + +##### Neural Texture Synthesis + +Gram Matrix: [格拉姆矩阵(Gram matrix)详细解读](https://zhuanlan.zhihu.com/p/187345192) + +1. Pretrain a CNN on ImageNet. +2. Run input texture forward through CNN, record activations on every layer. + + Layer $i$ gives feature map of shape $C_i\times H_i\times W_i$. + +3. At each layer compute the **Gram matrix** $G_i$ giving outer product of features. + + - Reshape feature map at layer $i$ to $C_i\times H_iW_i$. + - Compute the **Gram matrix** $G_i$ with shape $C_i\times C_i$. + +4. Initialize generated image from random noise. +5. Pass generated image through CNN, compute **Gram matrix** $\hat{G}_l$ on each layer. +6. Compute loss: Weighted sum of L2 distance between **Gram matrices**. + + - $E_l=\frac{1}{aN_l^2M_l^2}\sum_{i,j}\Big(G_i^{(i,j)}-\hat{G}_i^{(i,j)}\Big)^2\\$ + - $\mathcal{L}(\vec{x},\hat{\vec{x}})=\sum_{l=0}^L\omega_lE_l\\$ + +7. Backprop to get gradient on image. +8. Make gradient step on image. +9. GOTO 5. 
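
A minimal sketch of the Gram-matrix computation used in steps 3 and 5 above, assuming PyTorch; the function and variable names are illustrative (some implementations also divide by a normalization constant, which here is folded into the loss weight $E_l$).

```python
# Gram matrix of one layer's feature map: reshape to C x (H*W), then take the outer product.
import torch

def gram_matrix(feature_map: torch.Tensor) -> torch.Tensor:
    """feature_map: (C, H, W) activations of one layer -> (C, C) Gram matrix."""
    C, H, W = feature_map.shape
    F = feature_map.reshape(C, H * W)   # C x (H*W)
    return F @ F.t()                    # C x C, sums feature products over all positions

features = torch.randn(128, 13, 13)     # e.g. one conv feature map
G = gram_matrix(features)
print(G.shape)                           # torch.Size([128, 128])
```
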
+ +[Texture Synthesis Using Convolutional Neural Networks](https://arxiv.org/abs/1505.07376) + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/8-texture_synthesis_neural.png) + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/8-texture_synthesis_neural_p.png) + +#### Style Transfer + +##### Feature + Gram Reconstruction + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/8-style_transfer.png) + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/8-style_transfer_p.png) + +**Problem:** Style transfer requires many forward / backward passes. Very slow! + +##### Fast Style Transfer + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/8-style_transfer_fast.png) + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/8-style_transfer_fast_p.png) + +## 9 - Object Detection and Image Segmentation + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/9-tasks.png) + +### Semantic Segmentation + +**Paired Training Data:** For each training image, each pixel is labeled with a semantic category. + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/9-sematic_segmetation.png) + +**Fully Convolutional Network:** Design a network with only convolutional layers without downsampling operators to make predictions for pixels all at once! + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/9-sematic_segmetation_full_conv.png) + +**Problem:** Convolutions at original image resolution will be very expensive... + +**Solution:** Design fully convolutional network with **downsampling** and **upsampling** inside it! + +- **Downsampling:** Pooling, strided convolution. +- **Upsampling:** Unpooling, transposed convolution. + +**Unpooling:** + +| Nearest Neighbor | "Bed of Nails" | "Position Memory" | +| :----------------------------------------------------: | :----------------------------------------------------------: | :--------------------------------------------------------: | +| ![](https://raw.githubusercontent.com/WncFht/picture/main/picture/9-unpooling_nn.png) | ![](https://raw.githubusercontent.com/WncFht/picture/main/picture/9-unpooling_bed_of_nails.png) | ![](https://raw.githubusercontent.com/WncFht/picture/main/picture/9-unpooling_memory.png) | + +**Transposed Convolution:** (example size $3\times3$, stride $2$, pad $1$) + +| Normal Convolution | Transposed Convolution | +| :----------------------------------------------------------: | :----------------------------------------------------------: | +| ![](https://raw.githubusercontent.com/WncFht/picture/main/picture/9-transposed_convolution_normal.png) | ![](https://raw.githubusercontent.com/WncFht/picture/main/picture/9-transposed_convolution.png) | +| ![](https://raw.githubusercontent.com/WncFht/picture/main/picture/9-transposed_convolution_normal_m.png) | ![](https://raw.githubusercontent.com/WncFht/picture/main/picture/9-transposed_convolution_m.png) | + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/9-sematic_segmetation_full_conv_down.png) + +### Object Detection + +#### Single Object + +Classification + Localization. (classification + regression problem) + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/9-object_detection_single.png) + +#### Multiple Object + +##### R-CNN + +Using selective search to find “blobby” image regions that are likely to contain objects. + +1. Find regions of interest (RoI) using selective search. (region proposal) +2. 
Forward each region through ConvNet. +3. Classify features with SVMs. + +**Problem:** Very slow. Need to do 2000 independent forward passes for each image! + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/9-rcnn.png) + +##### Fast R-CNN + +Pass the image through ConvNet before cropping. Crop the conv feature instead. + +1. Run whole image through ConvNet. +2. Find regions of interest (RoI) from conv features using selective search. (**region proposal**) +3. Classify RoIs using CNN. + +**Problem:** Runtime is dominated by region proposals. (about $90\%$ time cost) + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/9-fast_rcnn.png) + +##### Faster R-CNN + +Insert Region Proposal Network (**RPN**) to predict proposals from features. + +Otherwise same as Fast R-CNN: Crop features for each proposal, classify each one. + +**Region Proposal Network (RPN) :** Slide many fixed windows over ConvNet features. + +1. Treat each point in the feature map as the **anchor**. + + We have $k$ fixed windows (**anchor boxes**) of different size/scale centered with each anchor. + +2. For each anchor box, predict whether it contains an object. + + For positive boxes, also predict a corrections to the ground-truth box. + +3. Slide anchor over the feature map, get the **“objectness” score** for each box at each point. +4. Sort the “objectness” score, take top $300$ as the proposals. + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/9-faster_rcnn_rpn.png) + +**Faster R-CNN is a Two-stage object detector:** + +1. First stage: Run once per image + + Backbone network + + Region proposal network + +2. Second stage: Run once per region + + Crop features: RoI pool / align + + Predict object class + + Prediction bbox offset + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/9-faster_rcnn.png) + +##### Single-Stage Object Detectors: YOLO + +[You Only Look Once: Unified, Real-Time Object Detection](https://arxiv.org/abs/1506.02640) + +1. Divide image into grids. (example image grids shape $7\times7$) +2. Set anchors in the middle of each grid. +3. For each grid: + - Using $B$ anchor boxes to regress $5$ numbers: $\text{dx, dy, dh, dw, confidence}$. + - Predict scores for each of $C$ classes. +4. Finally the output is $7\times7\times(5B+C)$. + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/9-yolo.png) + +### Instance Segmentation + +**Mask R-CNN:** Add a small mask network that operates on each RoI and predicts a $28\times28$ binary mask. + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/9-mask_rcnn.png) + +Mask R-CNN performs very good results! + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/9-mask_rcnn_p.png) + +## 10 - Recurrent Neural Networks + +Supplement content added according to [Deep Learning Book - RNN](https://www.deeplearningbook.org/contents/rnn.html). 
+ +### Recurrent Neural Network (RNN) + +#### Motivation: Sequence Processing + +| One to One | One to Many | Many to One | Many to Many | Many to Many | +| :--------------------------------------------------------: | :--------------------------------------------------------: | :--------------------------------------------------------: | :--------------------------------------------------------: | :----------------------------------------------------------: | +| ![](https://raw.githubusercontent.com/WncFht/picture/main/picture/10-rnn_seqnence_11.png) | ![](https://raw.githubusercontent.com/WncFht/picture/main/picture/10-rnn_seqnence_1m.png) | ![](https://raw.githubusercontent.com/WncFht/picture/main/picture/10-rnn_seqnence_m1.png) | ![](https://raw.githubusercontent.com/WncFht/picture/main/picture/10-rnn_seqnence_mm.png) | ![](https://raw.githubusercontent.com/WncFht/picture/main/picture/10-rnn_seqnence_mm_2.png) | +| Vanilla Neural Networks | Image Captioning | Action Prediction | Video Captioning | Video Classification on Frame Level | + +#### Vanilla RNN + +$x^{(t)}$ : Input at time $t$. + +$h^{(t)}$ : State at time $t$. + +$o^{(t)}$ : Output at time $t$​​. + +$y^{(t)}$ : Expected output at time $t$. + +##### Many to One + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/10-rnn_structure_vanilla_m1.png) + +| Calculation | | +| ------------------ | ---------------------------------------------------- | +| State Transition | $h^{(t)}=\tanh(Wh^{(t-1)}+Ux^{(t)}+b)$ | +| Output Calculation | $o^{(\tau)}=\text{sigmoid}\ \big(Vh^{(\tau)}+c\big)$ | + +##### Many to Many (type 2) + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/10-rnn_structure_vanilla_mm.png) + +| Calculation | | +| ------------------ | ---------------------------------------------- | +| State Transition | $h^{(t)}=\tanh(Wh^{(t-1)}+Ux^{(t)}+b)$ | +| Output Calculation | $o^{(t)}=\text{sigmoid}\ \big(Vh^{(t)}+c\big)$ | + +#### RNN with Teacher Forcing + +Update current state according to last-time **output** instead of last-time **state**. + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/10-rnn_structure_tf.png) + +| Calculation | | +| ------------------ | ---------------------------------------------- | +| State Transition | $h^{(t)}=\tanh(Wo^{(t-1)}+Ux^{(t)}+b)$ | +| Output Calculation | $o^{(t)}=\text{sigmoid}\ \big(Vh^{(t)}+c\big)$ | + +#### RNN with "Output Forwarding" + +We can also combine last-state **output** with this-state **input** together. + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/10-rnn_structure_output.png) + +| Calculation | | +| --------------------------- | ------------------------------------------------- | +| State Transition (training) | $h^{(t)}=\tanh(Wh^{(t-1)}+Ux^{(t)}+Ry^{(t-1)}+b)$ | +| State Transition (testing) | $h^{(t)}=\tanh(Wh^{(t-1)}+Ux^{(t)}+Ro^{(t-1)}+b)$ | +| Output Calculation | $o^{(t)}=\text{sigmoid}\ \big(Vh^{(t)}+c\big)$ | + +Usually we use $o^{(t-1)}$ in place of $y^{(t-1)}$ at testing time. + +#### Bidirectional RNN + +When dealing with **a whole input sequence**, we can process features from two directions. 
+ +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/10-rnn_structure_bidirectional.png) + +| Calculation | | +| --------------------------- | ------------------------------------------------------- | +| State Transition (forward) | $h^{(t)}=\tanh(W_1h^{(t-1)}+U_1x^{(t)}+b_1)$ | +| State Transition (backward) | $g^{(t)}=\tanh(W_2g^{(t+1)}+U_2x^{(t)}+b_2)$ | +| Output Calculation | $o^{(t)}=\text{sigmoid}\ \big(Vh^{(t)}+Wg^{(t)}+c\big)$ | + +#### Encoder-Decoder Sequence to Sequence RNN + +This is a **many-to-many structure (type 1)**. + +First we encode information according to $x$ with no output. + +Later we decode information according to $y$ with no input. + +$C$ : Context vector, often $C=h^{(T)}$ (last state of encoder). + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/10-rnn_structure_encoder.png) + +| Calculation | | +| ----------------------------------- | ----------------------------------------------- | +| State Transition (encode) | $h^{(t)}=\tanh(W_1h^{(t-1)}+U_1x^{(t)}+b_1)$ | +| State Transition (decode, training) | $s^{(t)}=\tanh(W_2s^{(t-1)}+U_2y^{(t)}+TC+b_2)$ | +| State Transition (decode, testing) | $s^{(t)}=\tanh(W_2s^{(t-1)}+U_2o^{(t)}+TC+b_2)$ | +| Output Calculation | $o^{(t)}=\text{sigmoid}\ \big(Vs^{(t)}+c\big)$ | + +#### Example: Image Captioning + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/10-rnn_example.png) + +#### Summary + +**Advantages of RNN:** + +1. Can process any length input. +2. Computation for step $t$ can (in theory) use information from many steps back. +3. Model size doesn’t increase for longer input. +4. Same weights applied on every timestep, so there is symmetry in how inputs are processed. + +**Disadvantages of RNN:** + +1. Recurrent computation is slow. +2. In practice, difficult to access information from many steps back. +3. Problems with gradient exploding and gradient vanishing. **(check [Deep Learning Book - RNN](https://www.deeplearningbook.org/contents/rnn.html) Page 396, Chap 10.7)** + +### Long Short Term Memory (LSTM) + +Add a "cell block" to store history weights. + +$c^{(t)}$ : Cell at time $t$. + +$f^{(t)}$ : **Forget gate** at time $t$. Deciding whether to erase the cell. + +$i^{(t)}$ : **Input gate** at time $t$. Deciding whether to write to the cell. + +$g^{(t)}$ : **External input gate** at time $t$. Deciding how much to write to the cell. + +$o^{(t)}$ : **Output gate** at time $t$. Deciding how much to reveal the cell. + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/10-lstm.png) + +| Calculation (Gate) | | +| ------------------- | ------------------------------------------------------------ | +| Forget Gate | $f^{(t)}=\text{sigmoid}\ \big(W_fh^{(t-1)}+U_fx^{(t)}+b_f\big)$ | +| Input Gate | $i^{(t)}=\text{sigmoid}\ \big(W_ih^{(t-1)}+U_ix^{(t)}+b_i\big)$ | +| External Input Gate | $g^{(t)}=\tanh(W_gh^{(t-1)}+U_gx^{(t)}+b_g)$ | +| Output Gate | $o^{(t)}=\text{sigmoid}\ \big(W_oh^{(t-1)}+U_ox^{(t)}+b_o\big)$ | + +| Calculation (Main) | | +| ------------------ | ----------------------------------------------------- | +| Cell Transition | $c^{(t)}=f^{(t)}\odot c^{(t-1)}+i^{(t)}\odot g^{(t)}$ | +| State Transition | $h^{(t)}=o^{(t)}\odot\tanh(c^{(t)})$ | +| Output Calculation | $O^{(t)}=\text{sigmoid}\ \big(Vh^{(t)}+c\big)$ | + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/10-lstm_gradient.png) + +### Other RNN Variants + +GRU... 
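
To make the LSTM equations above concrete, here is a minimal NumPy sketch of a single LSTM step; the parameter names follow the gate table in the LSTM section and are illustrative only.

```python
# One LSTM step: four gates, then the cell and state transitions.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """x: (D,), h_prev / c_prev: (H,). params holds W_*, U_*, b_* for the four gates."""
    f = sigmoid(params["W_f"] @ h_prev + params["U_f"] @ x + params["b_f"])  # forget gate
    i = sigmoid(params["W_i"] @ h_prev + params["U_i"] @ x + params["b_i"])  # input gate
    g = np.tanh(params["W_g"] @ h_prev + params["U_g"] @ x + params["b_g"])  # external input gate
    o = sigmoid(params["W_o"] @ h_prev + params["U_o"] @ x + params["b_o"])  # output gate
    c = f * c_prev + i * g          # cell transition
    h = o * np.tanh(c)              # state transition
    return h, c

D, H = 8, 4
rng = np.random.default_rng(0)
params = {f"{m}_{g}": rng.standard_normal((H, H if m == "W" else D))
          for m in ("W", "U") for g in "figo"}
params.update({f"b_{g}": np.zeros(H) for g in "figo"})
h, c = lstm_step(rng.standard_normal(D), np.zeros(H), np.zeros(H), params)
print(h.shape, c.shape)  # (4,) (4,)
```
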
+ +## 11 - Attention and Transformers + +### RNN with Attention + +**Encoder-Decoder Sequence to Sequence RNN Problem:** + +Input sequence bottlenecked through a fixed-sized context vector $C$. (e.g. $T=1000$) + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/11-rnn_sequence.png) + +**Intuitive Solution:** + +Generate new context vector $C_t$ at each step $t$ ! + +$e_{t,i}$ : Alignment score for input $i$ at state $t$. **(scalar)** + +$a_{t,i}$ : Attention weight for input $i$ at state $t$. + +$C_t$ : Context vector at state $t$. + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/11-rnn_attention_1.png) + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/11-rnn_attention_2.png) + +| Calculation | | +| ------------------------ | ------------------------------------------------------------ | +| Alignment Score | $e_i^{(t)}=f(s^{(t-1)},h^{(i)})$.
Where $f$ is an MLP. | +| Attention Weight | $a_i^{(t)}=\text{softmax}\ (e_i^{(t)})$.
Softmax includes all $e_i$ at state $t$. | +| Context Vector | $C^{(t)}=\sum_i a_i^{(t)}h^{(i)}$ | +| Decoder State Transition | $s^{(t)}=\tanh(Ws^{(t-1)}+Uy^{(t)}+TC^{(t)}+b)$ | + +**Example on Image Captioning:** + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/11-rnn_attention_example.png) + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/11-rnn_attention_example_2.png) + +### General Attention Layer + +Add linear transformations to the input vector before attention. + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/11-general_attention.png) + +**Notice:** + +1. Number of queries $q$ is variant. (can be **different** from the number of keys $k$) +2. Number of outputs $y$ is equal to the number of queries $q$. + + Each $y$ is a linear weighting of values $v$. + +3. Alignment $e$ is divided by $\sqrt{D}$ to avoid "explosion of softmax", where $D$ is the dimension of input feature. + +### Self-attention Layer + +The query vectors $q$ are also generated from the inputs. + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/11-self_attention.png) + +In this way, the shape of $y$ is equal to the shape of $x$. + +**Example with CNN:** + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/11-self_attention_example.png) + +### Positional Encoding + +Self-attention layer doesn’t care about the orders of the inputs! + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/11-self_attention_problem.png) + +To encode ordered sequences like language or spatially ordered image features, we can add positional encoding to the inputs. + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/11-self_attention_positional_encoding.png) + +We use a function $P:R\rightarrow R^d$ to process the **position** $i$ into a **d-dimensional vector** $p_i=P(i)$. + +| Constraint Condition of $P$ | | +| --------------------------- | ---------------------------------------------------------- | +| Uniqueness | $P(i)\ne P(j)$ | +| Equidistance | $\lVert P(i+k)-P(i)\rVert^2=\lVert P(j+k)-P(j)\rVert^2$ | +| Boundness | $P(i)\in[a,b]$ | +| Determinacy | $P(i)$ is always a static value. (function is not dynamic) | + +We can either train a encoder model, or design a fixed function. + +**A Practical Positional Encoding Method:** Using $\sin$ and $\cos$ with different frequency $\omega$ at different dimension. + +$P(t)=\begin{bmatrix}\sin(\omega_1,t)\\\cos(\omega_1,t)\\\\\sin(\omega_2,t)\\\cos(\omega_2,t)\\\vdots\\\sin(\omega_{\frac{d}{2}},t)\\\cos(\omega_{\frac{d}{2}},t)\end{bmatrix}$, where frequency $\omega_k=\frac{1}{10000^{\frac{2k}{d}}}\\$. (wave length $\lambda=\frac{1}{\omega}=10000^{\frac{2k}{d}}\\$) + +$P(t)=\begin{bmatrix}\sin(1/10000^{\frac{2}{d}},t)\\\cos(1/10000^{\frac{2}{d}},t)\\\\\sin(1/10000^{\frac{4}{d}},t)\\\cos(1/10000^{\frac{4}{d}},t)\\\vdots\\\sin(1/10000^1,t)\\\cos(1/10000^1,t)\end{bmatrix}$, after we substitute $\omega_k$ into the equation. + +$P(t)$ is a vector with size $d$, where $d$ is a hyperparameter to choose according to the length of input sequence. + +An intuition of this method is the binary encoding of numbers. 
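
A minimal NumPy sketch of this sinusoidal positional encoding; `positional_encoding` and its arguments are illustrative names, it assumes an even $d$, and it follows the $\omega_k=\frac{1}{10000^{2k/d}}$ convention used above.

```python
# Sinusoidal positional encoding: row t is P(t), interleaving sin/cos at frequencies omega_k.
import numpy as np

def positional_encoding(num_positions: int, d: int) -> np.ndarray:
    """Return an array of shape (num_positions, d) whose row t is P(t)."""
    P = np.zeros((num_positions, d))
    k = np.arange(1, d // 2 + 1)                 # frequency index k = 1 .. d/2
    omega = 1.0 / (10000.0 ** (2.0 * k / d))     # omega_k = 1 / 10000^(2k/d)
    for t in range(num_positions):
        P[t, 0::2] = np.sin(omega * t)           # even dimensions: sin(omega_k * t)
        P[t, 1::2] = np.cos(omega * t)           # odd dimensions:  cos(omega_k * t)
    return P

P = positional_encoding(num_positions=50, d=32)
print(P.shape)   # (50, 32)
```
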
+ +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/11-self_attention_positional_encoding_intuition.png) + +[[lecture 11d] 注意力和transformer (positional encoding 补充,代码实现,距离计算)](https://www.bilibili.com/video/BV1E3411B7Bz) + +**It is easy to prove that $P(t)$ satisfies "Equidistance":** (set $d=2$ for example) + +$\begin{aligned}\lVert P(i+k)-P(i)\rVert^2&=\big[\sin(\omega_1,i+k)-\sin(\omega_1,i)\big]^2+\big[\cos(\omega_1,i+k)-\cos(\omega_1,i)\big]^2\\&=2-2\sin(\omega_1,i+k)\sin(\omega_1,i)-2\cos(\omega_1,i+k)\cos(\omega_1,i)\\&=2-2\cos(\omega_1,k)\end{aligned}$ + +So the distance is not associated with $i$, we have $\lVert P(i+k)-P(i)\rVert^2=\lVert P(j+k)-P(j)\rVert^2$. + +**Visualization of $P(t)$ features:** (set $d=32$, $x$ axis represents the position of sequence) + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/11-self_attention_positional_encoding_p.png) + +### Masked Self-attention Layer + +To prevent vectors from looking at future vectors, we manually set alignment scores to $-\infty$. + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/11-masked_self_attention.png) + +### Multi-head Self-attention Layer + +Multiple self-attention heads in parallel. + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/11-multihead_self_attention.png) + +### Transformer + +[Attention Is All You Need](https://arxiv.org/abs/1706.03762) + +#### Encoder Block + +**Inputs:** Set of vectors $z$. (in which $z_i$ can be a **word** in a sentence, or a **pixel** in a picture...) + +**Output:** Set of context vectors $c$. (encoded **features** of $z$) + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/11-transformer_encoder.png) + +The number of blocks $N=6$ in original paper. + +**Notice:** + +1. Self-attention is the only interaction **between vectors** $x_0,x_1,\dots,x_n$. +2. Layer norm and MLP operate independently **per vector**. +3. Highly scalable, highly parallelizable, but high memory usage. + +#### Decoder Block + +**Inputs:** Set of vectors $y$. ($y_i$ can be a **word** in a sentence, or a **pixel** in a picture...) + +**Inputs:** Set of context vectors $c$. + +**Output:** Set of vectors $y'$. (decoded result, $y'_i=y_{i+1}$ for the first $n-1$ number of $y'$) + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/11-transformer_decoder.png) + +The number of blocks $N=6$ in original paper. + +**Notice:** + +1. Masked self-attention only interacts with **past inputs**. +2. Multi-head attention block is **NOT** self-attention. It attends over encoder outputs. +3. Highly scalable, highly parallelizable, but high memory usage. (same as encoder) + +**Why we need mask in decoder:** + +1. Needs for the special formation of output $y'_i=y_{i+1}$. +2. Needs for parallel computation. + +[举个例子讲下transformer的输入输出细节及其他](https://zhuanlan.zhihu.com/p/166608727) + +[在测试或者预测时,Transformer里decoder为什么还需要seq mask?](https://blog.csdn.net/season77us/article/details/104144613) + +#### Example on Image Captioning (Only with Transformers) + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/11-transformer_example.png) + +### Comparing RNNs to Transformer + +| | RNNs | Transformer | +| -------- | ------------------------------------------------------------ | ------------------------------------------------------------ | +| **Pros** | LSTMs work reasonably well for **long sequences**. | 1. Good at **long sequences**. Each attention calculation looks at all inputs.
2. Can operate over unordered sets or **ordered sequences** with positional encodings.
3. **Parallel computation:** All alignment and attention scores for all inputs can be computed in parallel. | +| **Cons** | 1. Expects an **ordered sequence** of inputs.
2. **Sequential computation:** Subsequent hidden states can only be computed after the previous ones are done. | **Requires a lot of memory:** $N\times M$ alignment and attention scalers need to be calculated and stored for a single self-attention head. | + +### Comparing ConvNets to Transformer + +ConvNets strike back! + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/11-transformer_compare.png) + +## 12 - Video Understanding + +### Video Classification + +Take video classification task for example. + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/12-video_classification.png) + +Input size: $C\times T\times H\times W$. + +The problem is, videos are quite big. We can't afford to train on raw videos, instead we train on video clips. + +| Raw Videos | Video Clips | +| ------------------------------- | -------------------------------------- | +| $1920\times1080,\ 30\text{fps}$ | $112\times112,\ 5\text{f}/3.2\text{s}$ | +| $10\text{GB}/\text{min}$ | $588\text{KB}/\text{min}$ | + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/12-video_classification_clips.png) + +### Plain CNN Structure + +#### Single Frame 2D-CNN + +Train a normal 2D-CNN model. + +Classify each frame independently. + +Average the result of each frame as the final result. + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/12-single_frame_cnn.png) + +#### Late Fusion + +Get high-level appearance of each frame, and combine them. + +Run 2D-CNN on each frame, pool features and feed to Linear Layers. + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/12-late_fusion.png) + +**Problem:** Hard to compare low-level motion between frames. + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/12-late_fusion_problem.png) + +#### Early Fusion + +Compare frames with very first Conv Layer, after that normal 2D-CNN. + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/12-early_fusion.png) + +**Problem:** One layer of temporal processing may not be enough! + +#### 3D-CNN + +**Convolve on 3 dimensions:** Height, Width, Time. + +**Input size:** $C_{in}\times T\times H\times W$. + +**Kernel size:** $C_{in}\times C_{out}\times 3\times 3\times 3$. + +**Output size:** $C_{out}\times T\times H\times W$. (with zero paddling) + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/12-3d_cnn.png) + +#### C3D (VGG of 3D-CNNs) + +The cost is quite expensive... + +| Network | Calculation | +| ------- | -------------- | +| AlexNet | 0.7 GFLOP | +| VGG-16 | 13.6 GFLOP | +| C3D | **39.5** GFLOP | + +#### Two-Stream Networks + +Separate motion and appearance. + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/12-two_stream_flow.png) + +#### I3D (Inflating 2D Networks to 3D) + +Take a 2D-CNN architecture. + +Replace each 2D conv/pool layer with a 3D version. + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/12-i3d.png) + +### Modeling Long-term Temporal Structure + +#### Recurrent Convolutional Network + +Similar to multi-layer RNN, we replace the **dot-product** operation with **convolution**. + +![](https://raw.githubusercontent.com/WncFht/picture/main/picture/12-rcn.png) + +Feature size in layer $L$, time $t-1$: $W_h\times H\times W$. + +Feature size in layer $L-1$, time $t$: $W_x\times H\times W$. + +Feature size in layer $L$, time $t$: $(W_h+W_x)\times H\times W$. 
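
One possible reading of this update, sketched below under stated assumptions: the previous hidden feature ($W_h$ channels) is concatenated with the incoming lower-layer feature ($W_x$ channels), giving the $(W_h+W_x)\times H\times W$ tensor mentioned above, and a convolution maps it back to $W_h$ channels. It assumes PyTorch; the class and variable names are illustrative.

```python
# Recurrent convolutional cell sketch: concatenate h_{t-1} and x_t along channels, then convolve.
import torch
import torch.nn as nn

class ConvRecurrentCell(nn.Module):
    def __init__(self, w_x: int, w_h: int):
        super().__init__()
        self.conv = nn.Conv2d(w_x + w_h, w_h, kernel_size=3, padding=1)

    def forward(self, x_t: torch.Tensor, h_prev: torch.Tensor) -> torch.Tensor:
        stacked = torch.cat([h_prev, x_t], dim=1)   # (N, W_h + W_x, H, W)
        return torch.tanh(self.conv(stacked))       # (N, W_h, H, W)

cell = ConvRecurrentCell(w_x=16, w_h=32)
h = torch.zeros(1, 32, 28, 28)
for t in range(5):                                   # unroll over a short clip
    h = cell(torch.randn(1, 16, 28, 28), h)
print(h.shape)                                       # torch.Size([1, 32, 28, 28])
```
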
+
+![](https://raw.githubusercontent.com/WncFht/picture/main/picture/12-rcn_inside.png)
+
+**Problem:** RNNs are slow for long sequences. (can’t be parallelized)
+
+#### Spatio-temporal Self-attention
+
+Introduce self-attention into video classification problems.
+
+![](https://raw.githubusercontent.com/WncFht/picture/main/picture/12-self_attention.png)
+
+![](https://raw.githubusercontent.com/WncFht/picture/main/picture/12-self_attention_net.png)
+
+#### Vision Transformers for Video
+
+Factorized attention: Attend over space / time.
+
+So many papers...
+
+![](https://raw.githubusercontent.com/WncFht/picture/main/picture/12-vision_transformer.png)
+
+### Visualizing Video Models
+
+![](https://raw.githubusercontent.com/WncFht/picture/main/picture/12-video_visualizing.png)
+
+![](https://raw.githubusercontent.com/WncFht/picture/main/picture/12-video_visualizing_2.png)
+
+### Multimodal Video Understanding
+
+#### Temporal Action Localization
+
+Given a long untrimmed video sequence, identify frames corresponding to different actions.
+
+![](https://raw.githubusercontent.com/WncFht/picture/main/picture/12-multimodal_temporal_localization.png)
+
+#### Spatio-Temporal Detection
+
+Given a long untrimmed video, detect all the people in both space and time and classify the activities they are performing.
+
+![](https://raw.githubusercontent.com/WncFht/picture/main/picture/12-multimodal_s_t_detection.png)
+
+#### Visually-guided Audio Source Separation
+
+![](https://raw.githubusercontent.com/WncFht/picture/main/picture/12-multimodal_voice_separation.png)
+
+And so on...
+
+## 13 - Generative Models
+
+![](https://raw.githubusercontent.com/WncFht/picture/main/picture/13-generative_model.png)
+
+### PixelRNN and PixelCNN
+
+#### Fully Visible Belief Network (FVBN)
+
+$p(x)$ : Likelihood of image $x$.
+
+$p(x_1,x_2,\dots,x_n)$ : Joint likelihood of all $n$ pixels in image $x$.
+
+$p(x_i|x_1,x_2,\dots,x_{i-1})$ : Probability of pixel $i$ value given all previous pixels.
+
+For explicit density models, we have $p(x)=p(x_1,x_2,\dots,x_n)=\prod_{i=1}^np(x_i|x_1,x_2,\dots,x_{i-1})\\$.
+
+**Objective:** Maximize the likelihood of training data.
+
+![](https://raw.githubusercontent.com/WncFht/picture/main/picture/13-likelihood.png)
+
+#### PixelRNN
+
+Generate image pixels starting from a corner.
+
+Dependency on previous pixels is modeled using an RNN (LSTM).
+
+**Drawback:** Sequential generation is slow in both training and inference!
+
+![](https://raw.githubusercontent.com/WncFht/picture/main/picture/13-pixel_rnn.png)
+
+#### PixelCNN
+
+Still generate image pixels starting from a corner.
+
+Dependency on previous pixels is modeled using a CNN over the context region (masked convolution).
+
+**Drawback:** Though its training is faster, its generation is still slow. **(pixel by pixel)**
+
+![](https://raw.githubusercontent.com/WncFht/picture/main/picture/13-pixel_cnn.png)
+
+### Variational Autoencoder
+
+Supplementary content added based on [Tutorial on Variational Autoencoders](https://arxiv.org/abs/1606.05908). (**paper with notes:** [VAE Tutorial.pdf](..\Variational Autoencoder\papes\VAE Tutorial.pdf))
+
+[Variational Autoencoder (VAE): So That's What It Is | with open-source code](https://zhuanlan.zhihu.com/p/34998569)
+
+#### Autoencoder
+
+Learn a lower-dimensional feature representation with unsupervised approaches.
+
+$x\rightarrow z$ : Dimension reduction for input features.
+
+$z\rightarrow \hat{x}$ : Reconstruct input features. 
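A minimal PyTorch-style sketch of this $x\rightarrow z\rightarrow\hat{x}$ pipeline (the fully-connected layers, the layer sizes, and the L2 reconstruction loss are illustrative assumptions):

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # x -> z : dimension reduction
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU(),
                                     nn.Linear(256, latent_dim))
        # z -> x_hat : reconstruction
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, input_dim))

    def forward(self, x):
        z = self.encoder(x)
        x_hat = self.decoder(z)
        return x_hat, z

# Training minimizes the reconstruction error, e.g. an L2 loss:
x = torch.randn(16, 784)          # a dummy batch of flattened images
x_hat, z = Autoencoder()(x)
loss = ((x_hat - x) ** 2).mean()
```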
+
+![](https://raw.githubusercontent.com/WncFht/picture/main/picture/13-autoencoder.png)
+
+After training, we throw the decoder away and use the encoder for transfer to downstream tasks.
+
+![](https://raw.githubusercontent.com/WncFht/picture/main/picture/13-autoencoder_transfer.png)
+
+**For generative models, there is a problem:**
+
+We can’t generate new images from an autoencoder because we don’t know the space of $z$.
+
+#### Variational Autoencoder
+
+##### Notation
+
+$X$ : Images. **(random variable)**
+
+$Z$ : Latent representations. **(random variable)**
+
+$P(X)$ : True distribution of all training images $X$.
+
+$P(Z)$ : True **prior** distribution of all latent representations $Z$.
+
+$P(X|Z)$ : True **conditional (likelihood)** distribution of images $X$ given $Z$.
+
+$P(Z|X)$ : True **posterior** distribution of latent representations $Z$ given $X$.
+
+$Q(Z|X)$ : Approximate **posterior** distribution of latent representations $Z$ given $X$.
+
+$x$ : A specific image.
+
+$z$ : A specific latent representation.
+
+$\theta$: Learned parameters in the decoder network.
+
+$\phi$: Learned parameters in the encoder network.
+
+$p_\theta(x)$ : Probability that $x\sim P(X)$.
+
+$p_\theta(z)$ : Probability that $z\sim P(Z)$.
+
+$p_\theta(x|z)$ : Probability that $x\sim P(X|Z)$.
+
+$p_\theta(z|x)$ : Probability that $z\sim P(Z|X)$.
+
+$q_\phi(z|x)$ : Probability that $z\sim Q(Z|X)$.
+
+##### Decoder
+
+**Objective:**
+
+Generate new images from $z$.
+
+1. Generate a value $z^{(i)}$ from the prior distribution $P(Z)$.
+2. Generate a value $x^{(i)}$ from the conditional distribution $P(X|Z)$.
+
+**Lemma:**
+
+Any distribution in $d$ dimensions can be generated by taking a set of $d$ variables that are **normally distributed** and mapping them through a sufficiently complicated function. (source: [Tutorial on Variational Autoencoders](https://arxiv.org/abs/1606.05908), Page 6)
+
+**Solutions:**
+
+1. Choose the prior distribution $P(Z)$ to be a simple distribution, for example $P(Z)=N(0,1)$.
+2. Learn the conditional distribution $P(X|Z)$ through a neural network (decoder) with parameters $\theta$.
+
+![](https://raw.githubusercontent.com/WncFht/picture/main/picture/13-var_autoencoder_decoder.png)
+
+##### Encoder
+
+**Objective:**
+
+Learn $z$ from the training images.
+
+**Given:** (From the decoder, we can deduce the following probabilities.)
+
+1. *data likelihood:* $p_\theta(x)=\int p_\theta(x|z)p_\theta(z)dz$.
+2. *posterior density:* $p_\theta(z|x)=\frac{p_\theta(x|z)p_\theta(z)}{p_\theta(x)}=\frac{p_\theta(x|z)p_\theta(z)}{\int p_\theta(x|z)p_\theta(z)dz}$.
+
+**Problem:**
+
+Both $p_\theta(x)$ and $p_\theta(z|x)$ are intractable. (they can't be optimized directly because they contain an *integral* over $z$)
+
+**Solution:**
+
+Learn $Q(Z|X)$ to approximate the true posterior $P(Z|X)$.
+
+Use $q_\phi(z|x)$ in place of $p_\theta(z|x)$.
+
+![](https://raw.githubusercontent.com/WncFht/picture/main/picture/13-var_autoencoder_encoder.png)
+
+##### Variational Autoencoder (Combination of Encoder and Decoder)
+
+**Objective:**
+
+Maximize $p_\theta(x)$ for all $x^{(i)}$ in the training set. 
+
+$$
+\begin{aligned}
+\log p_\theta\big(x^{(i)}\big)&=\mathbb{E}_{z\sim q_\phi\big(z|x^{(i)}\big)}\Big[\log p_\theta\big(x^{(i)}\big)\Big]\\
+
+&=\mathbb{E}_z\Bigg[\log\frac{p_\theta\big(x^{(i)}|z\big)p_\theta\big(z\big)}{p_\theta\big(z|x^{(i)}\big)}\Bigg]\quad\text{(Bayes' Rule)}\\
+
+&=\mathbb{E}_z\Bigg[\log\frac{p_\theta\big(x^{(i)}|z\big)p_\theta\big(z\big)}{p_\theta\big(z|x^{(i)}\big)}\frac{q_\phi\big(z|x^{(i)}\big)}{q_\phi\big(z|x^{(i)}\big)}\Bigg]\quad\text{(Multiply by Constant)}\\
+
+&=\mathbb{E}_z\Big[\log p_\theta\big(x^{(i)}|z\big)\Big]-\mathbb{E}_z\Bigg[\log\frac{q_\phi\big(z|x^{(i)}\big)}{p_\theta\big(z\big)}\Bigg]+\mathbb{E}_z\Bigg[\log\frac{p_\theta\big(z|x^{(i)}\big)}{q_\phi\big(z|x^{(i)}\big)}\Bigg]\quad\text{(Logarithm)}\\
+
+&=\mathbb{E}_z\Big[\log p_\theta\big(x^{(i)}|z\big)\Big]-D_{\text{KL}}\Big[q_\phi\big(z|x^{(i)}\big)||p_\theta\big(z\big)\Big]+D_{\text{KL}}\Big[p_\theta\big(z|x^{(i)}\big)||q_\phi\big(z|x^{(i)}\big)\Big]\quad\text{(KL Divergence)}
+\end{aligned}
+$$
+
+**Analyze the Formula by Term:**
+
+$\mathbb{E}_z\Big[\log p_\theta\big(x^{(i)}|z\big)\Big]$: The decoder network gives $p_\theta\big(x^{(i)}|z\big)$; we can compute an estimate of this term through sampling.
+
+$D_{\text{KL}}\Big[q_\phi\big(z|x^{(i)}\big)||p_\theta\big(z\big)\Big]$: This KL term (between the Gaussians for the encoder and the $z$ prior) has a nice closed-form solution!
+
+$D_{\text{KL}}\Big[p_\theta\big(z|x^{(i)}\big)||q_\phi\big(z|x^{(i)}\big)\Big]$: The part $p_\theta\big(z|x^{(i)}\big)$ is intractable. **However, we know KL divergence is always $\ge0$.**
+
+**Tractable Lower Bound:**
+
+We can instead maximize a tractable lower bound of this formula.
+
+As $D_{\text{KL}}\Big[p_\theta\big(z|x^{(i)}\big)||q_\phi\big(z|x^{(i)}\big)\Big]\ge0$ , we can deduce that:
+
+$$
+\begin{aligned}
+\log p_\theta\big(x^{(i)}\big)&=\mathbb{E}_z\Big[\log p_\theta\big(x^{(i)}|z\big)\Big]-D_{\text{KL}}\Big[q_\phi\big(z|x^{(i)}\big)||p_\theta\big(z\big)\Big]+D_{\text{KL}}\Big[p_\theta\big(z|x^{(i)}\big)||q_\phi\big(z|x^{(i)}\big)\Big]\\
+
+&\ge\mathbb{E}_z\Big[\log p_\theta\big(x^{(i)}|z\big)\Big]-D_{\text{KL}}\Big[q_\phi\big(z|x^{(i)}\big)||p_\theta\big(z\big)\Big]
+\end{aligned}
+$$
+
+So the loss function is $\mathcal{L}\big(x^{(i)},\theta,\phi\big)=-\mathbb{E}_z\Big[\log p_\theta\big(x^{(i)}|z\big)\Big]+D_{\text{KL}}\Big[q_\phi\big(z|x^{(i)}\big)||p_\theta\big(z\big)\Big]$.
+
+$\mathbb{E}_z\Big[\log p_\theta\big(x^{(i)}|z\big)\Big]$: ***Decoder***, reconstruct the input data.
+
+$D_{\text{KL}}\Big[q_\phi\big(z|x^{(i)}\big)||p_\theta\big(z\big)\Big]$: ***Encoder***, make the approximate posterior distribution close to the prior.
+
+![](https://raw.githubusercontent.com/WncFht/picture/main/picture/13-var_autoencoder_combination.png)
+
+### Generative Adversarial Networks (GANs)
+
+#### Motivation & Modeling
+
+**Objective:** Not modeling any explicit density function.
+
+**Problem:** Want to sample from a complex, high-dimensional training distribution. **No direct way to do this!**
+
+**Solution:** Sample from a simple distribution, e.g. **random noise**. Learn the transformation to the training distribution.
+
+![](https://raw.githubusercontent.com/WncFht/picture/main/picture/13-gan_stage1.png)
+
+**Problem:** We have no direct supervision for the **mapping** between a sample $z$ and the training images.
+
+**Solution:** Use a **discriminator network** to tell whether the generated image is within the data distribution or not. 
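A minimal sketch of this two-network setup (the fully-connected layers, the $100$-dimensional noise, and the flattened $784$-pixel image size are illustrative assumptions; the objectives that couple the two networks are spelled out below):

```python
import torch
import torch.nn as nn

# Generator: maps random noise z to a fake image y = G(z).
G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(),
                  nn.Linear(256, 784), nn.Tanh())

# Discriminator: maps an image to the probability that it is real, D(x) in [0, 1].
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1), nn.Sigmoid())

z = torch.randn(64, 100)   # sample from a simple distribution (random noise)
y = G(z)                   # generated (fake) images
score = D(y)               # discriminator's estimate that y comes from the data distribution
```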
+
+![](https://raw.githubusercontent.com/WncFht/picture/main/picture/13-gan_stage2.png)
+
+**Discriminator network:** Try to distinguish between real and fake images.
+
+**Generator network:** Try to fool the discriminator by generating real-looking images.
+
+![](https://raw.githubusercontent.com/WncFht/picture/main/picture/13-gan_stage3.png)
+
+$x$ : Real data.
+
+$y$ : Fake data, which is generated by the generator network. $y=G_{\theta_g}(z)$.
+
+$D_{\theta_d}(x)$ : Discriminator score, which is the likelihood that the image is real. $D_{\theta_d}(x)\in[0,1]$.
+
+**Objective of discriminator network:**
+
+$\max_{\theta_d}\bigg[\mathbb{E}_x\Big(\log D_{\theta_d}(x)\Big)+\mathbb{E}_{z\sim p(z)}\Big(\log\big(1-D_{\theta_d}(y)\big)\Big)\bigg]$
+
+**Objective of generator network (minimax game):**
+
+$\min_{\theta_g}\max_{\theta_d}\bigg[\mathbb{E}_x\Big(\log D_{\theta_d}(x)\Big)+\mathbb{E}_{z\sim p(z)}\Big(\log\big(1-D_{\theta_d}(y)\big)\Big)\bigg]$
+
+#### Training Strategy
+
+To combine these two networks, we train them alternately:
+
+1. Gradient **ascent** on the discriminator.
+
+   $\max_{\theta_d}\bigg[\mathbb{E}_x\Big(\log D_{\theta_d}(x)\Big)+\mathbb{E}_{z\sim p(z)}\Big(\log\big(1-D_{\theta_d}(y)\big)\Big)\bigg]$
+
+2. Gradient **descent** on the generator.
+
+   $\min_{\theta_g}\bigg[\mathbb{E}_{z\sim p(z)}\Big(\log\big(1-D_{\theta_d}(y)\big)\Big)\bigg]$
+
+However, the gradient of $\log\big(1-D_{\theta_d}(y)\big)$ is small when $D_{\theta_d}(y)$ is close to $0$, i.e. exactly when the generated samples are still easily recognized as fake, so the generator is **hard to optimize** early in training.
+
+![](https://raw.githubusercontent.com/WncFht/picture/main/picture/13-gan_gradient.png)
+
+So we replace $\log\big(1-D_{\theta_d}(y)\big)$ with $-\log D_{\theta_d}(y)$, and use gradient ascent instead.
+
+1. Gradient **ascent** on the discriminator.
+
+   $\max_{\theta_d}\bigg[\mathbb{E}_x\Big(\log D_{\theta_d}(x)\Big)+\mathbb{E}_{z\sim p(z)}\Big(\log\big(1-D_{\theta_d}(y)\big)\Big)\bigg]$
+
+2. Gradient **ascent** on the generator.
+
+   $\max_{\theta_g}\bigg[\mathbb{E}_{z\sim p(z)}\Big(\log D_{\theta_d}(y)\Big)\bigg]$
+
+#### Summary
+
+**Pros:** Beautiful, state-of-the-art samples!
+
+**Cons:**
+
+1. Trickier / more unstable to train.
+2. Can’t solve inference queries such as $p(x), p(z|x)$.
+
+## 14 - Self-supervised Learning
+
+**Aim:** Solve “pretext” tasks that produce good features for downstream tasks.
+
+**Application:**
+
+1. Learn a feature extractor from pretext tasks. **(self-supervised)**
+2. Attach a shallow network on top of the feature extractor.
+3. Train the shallow network on the target task with a small amount of labeled data. **(supervised)**
+
+![](https://raw.githubusercontent.com/WncFht/picture/main/picture/14-self_supervised_learning.png)
+
+### Pretext Tasks
+
+Labels are generated automatically.
+
+#### Rotation
+
+Randomly rotate images and train a classifier to predict the applied rotation.
+
+![](https://raw.githubusercontent.com/WncFht/picture/main/picture/14-pretext_rotation.png)
+
+#### Rearrangement
+
+Train a classifier on randomly shuffled image pieces.
+
+Predict the location of image pieces.
+
+![](https://raw.githubusercontent.com/WncFht/picture/main/picture/14-pretext_rearrangement.png)
+
+#### Inpainting
+
+Mask part of the image and train a network to predict the masked area.
+
+Method from [Context Encoders: Feature Learning by Inpainting](https://arxiv.org/pdf/1604.07379.pdf).
+
+Combine two types of loss together to get better performance (a sketch follows the list):
+
+1. **Reconstruction loss (L2 loss):** Used for reconstructing global features.
+2. **Adversarial loss:** Used for generating texture features. 
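A minimal sketch of combining these two losses (the function name, the argument names, and the weight `lambda_adv` are illustrative assumptions rather than the exact formulation in the paper):

```python
import torch
import torch.nn.functional as F

def inpainting_loss(pred_patch, true_patch, d_score_fake, lambda_adv=0.001):
    """Hypothetical combination of the two losses above.

    pred_patch / true_patch: predicted and ground-truth pixels of the masked region.
    d_score_fake: discriminator output in (0, 1) for the predicted (fake) patch.
    lambda_adv: illustrative weight balancing the two terms.
    """
    rec_loss = F.mse_loss(pred_patch, true_patch)        # L2 reconstruction (global structure)
    adv_loss = -torch.log(d_score_fake + 1e-8).mean()    # adversarial term (sharp textures)
    return rec_loss + lambda_adv * adv_loss
```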
+
+![](https://raw.githubusercontent.com/WncFht/picture/main/picture/14-pretext_inpainting.png)
+
+#### Coloring
+
+Transfer between greyscale and colored images.
+
+**Cross-channel predictions for images:** [Split-Brain Autoencoders](https://openaccess.thecvf.com/content_cvpr_2017/papers/Zhang_Split-Brain_Autoencoders_Unsupervised_CVPR_2017_paper.pdf).
+
+![](https://raw.githubusercontent.com/WncFht/picture/main/picture/14-pretext_coloring_sb_ae.png)
+
+**Video coloring:** Establish mappings between reference and target frames in a learned feature space. [Tracking Emerges by Colorizing Videos](https://arxiv.org/abs/1806.09594).
+
+![](https://raw.githubusercontent.com/WncFht/picture/main/picture/14-pretext_coloring_video.png)
+
+#### Summary for Pretext Tasks
+
+1. Pretext tasks focus on **“visual common sense”**.
+2. The models are forced to learn good features about natural images.
+3. We **don’t** care about the performance of these **pretext tasks**.
+
+   What we care about is the performance on **downstream tasks**.
+
+#### Problems of Specific Pretext Tasks
+
+1. Coming up with **individual** pretext tasks is tedious.
+2. The learned representations may **not be general**.
+
+**Intuitive Solution:** Contrastive Learning.
+
+### Contrastive Representation Learning
+
+**Local additional references:** [Contrastive Learning.md](..\..\DL\Contrastive Learning\Contrastive Learning.md).
+
+![](https://raw.githubusercontent.com/WncFht/picture/main/picture/14-contrastive.png)
+
+**Objective:**
+
+Given a chosen score function $s$, we aim to learn an encoder function $f$ that yields:
+
+1. For each sample $x$, increase the similarity $s\big(f(x),f(x^+)\big)$ with positive samples $x^+$ and decrease the similarity $s\big(f(x),f(x^-)\big)$ with negative samples $x^-$.
+2. Finally we want $s\big(f(x),f(x^+)\big)\gg s\big(f(x),f(x^-)\big)$.
+
+**Loss Function:**
+
+Given $1$ positive sample and $N-1$ negative samples:
+
+| InfoNCE Loss | Cross Entropy Loss |
+| ------------------------------------------------------------ | ------------------------------------------------------------ |
+| $\begin{aligned}\mathcal{L}=-\mathbb{E}_X\Bigg[\log\frac{\exp{s\big(f(x),f(x^+)\big)}}{\exp{s\big(f(x),f(x^+)\big)}+\sum_{j=1}^{N-1}\exp{s\big(f(x),f(x^-_j)\big)}}\Bigg]\\\end{aligned}$ | $\begin{aligned}\mathcal{L}&=-\sum_{i=1}^Np(x_i)\log q(x_i)\\&=-\mathbb{E}_X\big[\log q(x)\big]\\&=-\mathbb{E}_X\Bigg[\log\frac{\exp(x)}{\sum_{j=1}^N\exp(x_j)}\Bigg]\end{aligned}$ |
+
+The *InfoNCE Loss* is a lower bound on the *mutual information* between $f(x)$ and $f(x^+)$:
+
+$\text{MI}\big[f(x),f(x^+)\big]\ge\log(N)-\mathcal{L}$
+
+The *larger* the negative sample size $N$, the *tighter* the bound.
+
+So we want to use as many negative samples as possible.
+
+#### Instance Contrastive Learning
+
+##### [SimCLR](https://arxiv.org/pdf/2002.05709.pdf)
+
+Use a projection function $g(\cdot)$ to project features to a space where contrastive learning is applied.
+
+The extra projection contributes a lot to the final performance.
+
+![](https://raw.githubusercontent.com/WncFht/picture/main/picture/14-contrastive_simclr_frame.jpg)
+
+**Score Function:** Cosine similarity $s(u,v)=\frac{u^Tv}{||u||||v||}\\$.
+
+**Positive Pair:** Pair of augmented data.
+
+![](https://raw.githubusercontent.com/WncFht/picture/main/picture/14-contrastive_simclr_algo.png)
+
+##### [Momentum Contrastive Learning (MoCo)](https://arxiv.org/pdf/1911.05722.pdf)
+
+There are mainly $3$ training strategies in contrastive learning:
+
+1. *end-to-end:* Keys are updated together with queries, e.g. ***SimCLR***.
+
+   **(batch size limited by GPU memory)**
+
+2. 
*memory bank:* Store keys from previous iterations in a memory bank for sampling.
+
+   **(inconsistency between $q$ and $k$)**
+
+3. ***MoCo**:* Use momentum methods to encode keys.
+
+   **(combination of *end-to-end* & *memory bank*)**
+
+![](https://raw.githubusercontent.com/WncFht/picture/main/picture/14-contrastive_moco_cate.png)
+
+**Key differences from SimCLR:**
+
+1. Keep a running **queue** of keys (negative samples).
+2. Compute gradients and update the encoder **only through the queries**.
+3. Decouple the mini-batch size from the number of keys: can support **a large number of negative samples**.
+4. The key encoder **progresses slowly** through the momentum update rule:
+
+   $\theta_k\leftarrow m\theta_k+(1-m)\theta_q$
+
+![](https://raw.githubusercontent.com/WncFht/picture/main/picture/14-contrastive_moco_algo.png)
+
+#### Sequence Contrastive Learning
+
+##### Contrastive Predictive Coding (CPC)
+
+**Contrastive:** Contrast between “right” and “wrong” sequences using contrastive learning.
+
+**Predictive:** The model has to *predict* future patterns given the current context.
+
+**Coding:** The model learns useful *feature vectors*, or “code”, for downstream tasks, similar to other self-supervised methods.
+
+![](https://raw.githubusercontent.com/WncFht/picture/main/picture/14-contrastive_cpc.png)
+
+#### Other Examples (Frontier)
+
+##### Contrastive Language Image Pre-training (CLIP)
+
+Contrastive learning between images and natural language sentences.
+
+![](https://raw.githubusercontent.com/WncFht/picture/main/picture/14-contrastive_clip.png)
+
+## 15 - Low-Level Vision
+
+Pass...
+
+## 16 - 3D Vision
+
+### Representation
+
+#### Explicit vs Implicit
+
+**Explicit:** Easy to sample points on the surface, hard to do an inside/outside check.
+
+**Implicit:** Hard to sample points on the surface, easy to do an inside/outside check.
+
+| | Non-parametric | Parametric |
+| ------------ | ---------------------- | --------------------------------------------------- |
+| **Explicit** | Points.
Meshes. | Splines.
Subdivision Surfaces. | +| **Implicit** | Level Sets.
Voxels. | Algebraic Surfaces.
Constructive Solid Geometry. |
+
+#### Point Clouds
+
+The simplest representation.
+
+Collection of $(x,y,z)$ coordinates.
+
+**Cons:**
+
+1. Difficult to draw in under-sampled regions.
+2. No simplification or subdivision.
+3. No direct smooth rendering.
+4. No topological information.
+
+![](https://raw.githubusercontent.com/WncFht/picture/main/picture/16-representation_point_clouds.png)
+
+#### Polygonal Meshes
+
+Collection of vertices $v$ and edges $e$.
+
+**Pros:**
+
+1. Can apply downsampling or upsampling on meshes.
+2. Error decreases by $O(n^2)$ while the number of mesh elements increases by $O(n)$.
+3. Can approximate arbitrary topology.
+4. Efficient rendering.
+
+![](https://raw.githubusercontent.com/WncFht/picture/main/picture/16-representation_poly.png)
+
+#### Splines
+
+Use specific functions to approximate the surface. (e.g. Bézier Curves)
+
+![](https://raw.githubusercontent.com/WncFht/picture/main/picture/16-representation_bezier.png)
+
+#### Algebraic Surfaces
+
+Use specific functions to represent the surface.
+
+![](https://raw.githubusercontent.com/WncFht/picture/main/picture/16-representation_algebra.png)
+
+#### Constructive Solid Geometry
+
+Combine implicit geometry with Boolean operations.
+
+![](https://raw.githubusercontent.com/WncFht/picture/main/picture/16-representation_boolean.png)
+
+#### Level Sets
+
+Store a grid of values to approximate the function.
+
+The surface is found where the interpolated value equals $0$.
+
+![](https://raw.githubusercontent.com/WncFht/picture/main/picture/16-representation_level_set.png)
+
+#### Voxels
+
+Apply binary thresholding to a volumetric grid.
+
+![](https://raw.githubusercontent.com/WncFht/picture/main/picture/16-representation_binary.png)
+
+### AI + 3D
+
+Pass...