Deep Learning Basics [Quick reads]

Prerequisites

  • Linear Algebra and Calculus
  • Intro to Probability
  • Entropy and information
  • Loss Functions
  • Maximum Likelihood Estimate

Gradient Descent and Optimization Link1 Link2

  • Taylor Series Approximation Link
  • Momentum Link

Backpropagation and Initializations

  • Xavier/Glorot Link
  • He init Link
  • Softmax CrossEntropy Backprop Link
  • RNN Backprop
  • Attention Backprop
  • Activations and their derivatives

Unsupervised Pre-training, Fine-tuning Link

Types of Networks:

  • Convolutional Neural Networks Link
  • LSTMs, GRUs and RNNs
  • Autoencoder, PCA
  • GANs Link Link
  • Transformer and attention Link
  • Why a deeper network? Link Link2

Regularizations

  • Dropouts Link

    • The core concept of Srivastava et al. (2014) is that “each hidden unit in a neural network trained with dropout must learn to work with a randomly chosen sample of other units. This should make each hidden unit more robust and drive it towards creating useful features on its own without relying on other hidden units to correct its mistakes.” “In a standard neural network, the derivative received by each parameter tells it how it should change so the final loss function is reduced, given what all other units are doing. Therefore, units may change in a way that they fix up the mistakes of the other units. This may lead to complex co-adaptations. This in turn leads to overfitting because these co-adaptations do not generalize to unseen data.” Srivastava et al. (2014) hypothesize that by making the presence of other hidden units unreliable, dropout prevents co-adaptation of hidden units (a minimal sketch of inverted dropout follows this list).
  • L1

  • L2
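
A minimal NumPy sketch of inverted dropout, to make the co-adaptation argument above concrete. The function name `dropout_forward` and the rescale-at-train-time convention are illustrative choices, not taken from the linked references:

```python
import numpy as np

def dropout_forward(x, p_drop=0.5, train=True, rng=None):
    """Inverted dropout: zero out units at random during training and
    rescale the survivors by 1/(1 - p_drop) so the expected activation
    is unchanged and no rescaling is needed at test time."""
    if not train or p_drop == 0.0:
        return x
    rng = np.random.default_rng() if rng is None else rng
    mask = (rng.random(x.shape) >= p_drop) / (1.0 - p_drop)
    return x * mask

# Each forward pass silences a different random subset of hidden units,
# so no unit can rely on a fixed set of co-adapted partners.
h = np.ones((4, 8))
print(dropout_forward(h, p_drop=0.5))
```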

Common Problems and Intuitions

  • Best practice Andrej
  • Vanishing Gradients Link Link2
  • Exploding Gradients Link
  • Batch Normalization Link
    • Covariate shift in inputs Link
    • Faster convergence: when input features (or intermediate activations) sit on very different scales, the loss surface is elongated, so in order to “move the needle” for the loss the network has to make a much larger update to one weight than to another. Gradient descent then tends to oscillate back and forth along one dimension and takes more steps to reach the minimum; normalizing keeps the scales comparable (see the sketch after this list).
    • Vanishing/exploding gradients
  • Skip Connections, ResNets Link ResNet
    • Skip connections allow the gradient to reach the early-layer weights with greater magnitude by skipping some of the layers in between (see the residual-block sketch after this list).
  • Mode Collapse in GANs Link
  • Data Augment Stanford
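
A minimal NumPy sketch of the batch-normalization forward pass described above. The function name `batchnorm_forward` and the toy data are illustrative assumptions; a real layer would also track running statistics for inference:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch to zero mean and unit variance,
    then let the learnable gamma/beta restore whatever scale and shift help.
    Comparable feature scales make the loss surface less elongated, so
    gradient descent oscillates less."""
    mu = x.mean(axis=0)                      # per-feature batch mean
    var = x.var(axis=0)                      # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalized activations
    return gamma * x_hat + beta

# Two features on wildly different scales end up comparable after the layer.
x = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 2000.0]])
out = batchnorm_forward(x, gamma=np.ones(2), beta=np.zeros(2))
print(out.mean(axis=0), out.std(axis=0))     # roughly 0 mean, ~1 std per feature
```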
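
And a small sketch of a residual (skip) connection; `residual_block` and the random weights are hypothetical, chosen only so that x + F(x) has matching shapes:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    """Compute y = relu(x + F(x)). The identity path lets gradients flow
    straight back to earlier layers, because dy/dx contains an identity
    term in addition to the derivative of the residual branch F."""
    f = relu(x @ W1) @ W2     # residual branch F(x)
    return relu(x + f)        # skip connection adds the input back

rng = np.random.default_rng(0)
d = 16
x = rng.standard_normal((4, d))
W1 = rng.standard_normal((d, d)) * 0.1
W2 = rng.standard_normal((d, d)) * 0.1
print(residual_block(x, W1, W2).shape)   # (4, 16)
```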

Deep Learning General

Practical Deep Learning Link

Deep Learning Optimization Link

Structuring Your Tensorflow Models Link

Autodiff Link JAX

Self-supervised Learning Link