This document records the main takeaways from my learning of Deep Learning.
First read the FAQ!
- Tensorflow defines a graph (a collection of operations stacked in order) first, and then executes the operations on real data at run time. So when we perform operations in TF, we are designing the architecture without running any calculation. The calculation happens inside a Session. So there are two stages in TF code: the Graph level and the Session (evaluation) level.
- Real data are not needed at the definition of the graph. Instead, inputs are represented by a datatype named `placeholder`. Only at run time is real data fed into the `placeholder`s, through a mapping dictionary `feed_dict`.
- `Variable`: a class. The `Variable()` constructor requires an initial value for the variable, which can be a Tensor of any type and shape. The initial value defines the type and shape of the variable. Only variables keep their data between multiple evaluations (across calls to `run()`). All other tensors are temporary, which means they will be destroyed and inaccessible in your training for-loop without a proper `feed_dict`. Variables have to be explicitly initialized before you can run Ops that use their value.
- Nodes in Tensorflow's graph represent operations (ops), and the edges represent the data (tensors) that flows between them.
- All data in Tensorflow are represented by the data type Tensor, with a static type, a rank, and a shape. However, its values have to be evaluated through `tf.Session().run()`.
- Once a graph is defined, it has to be deployed within a session to get the output. A session is an environment that supports the execution of the operations. If two or more tensors need to be evaluated, put them in a list and pass it to `run()` (see the sketch after this list).
- The role of Python code is therefore to build this external computation graph, and to dictate which parts of the computation graph should be run.
- In the TensorFlow system, tensors are described by a unit of dimensionality known as rank. Tensor rank is NOT the same as matrix rank.
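A minimal sketch of the two stages (the constant values here are arbitrary):

import tensorflow as tf
# Graph level: these lines only build the graph; nothing is computed yet.
a = tf.constant(3.0)
b = tf.constant(4.0)
c = a * b
# Session level: the graph is actually executed here.
with tf.Session() as sess:
    print(sess.run([c, a]))  # pass a list to evaluate several tensors: [12.0, 3.0]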
- Define a loss. A loss node has to be defined if we want to train the model. It is very common to use an op like `tf.reduce_sum()` to sum across a certain axis.

pred = tf.nn.softmax(logits)  # prediction model, from some logits
label = tf.placeholder(tf.float32, [n_batch, n_class])
loss = -tf.reduce_sum(label * tf.log(pred), axis=1)  # cross entropy
- Compute gradients.

optimizer = tf.train.GradientDescentOptimizer(lr)
train_step = optimizer.minimize(loss)
- Train the model.

batch_x, batch_label = data.next_batch()
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_step, feed_dict={x: batch_x, label: batch_label})
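Putting the three steps together, a minimal end-to-end sketch, assuming a one-layer softmax model with hypothetical MNIST-like sizes; `data.next_batch()` is the same hypothetical data helper as above:

import tensorflow as tf

n_feature, n_class, lr = 784, 10, 0.5  # hypothetical sizes
x = tf.placeholder(tf.float32, [None, n_feature])
label = tf.placeholder(tf.float32, [None, n_class])
W = tf.Variable(tf.zeros([n_feature, n_class]))
b = tf.Variable(tf.zeros([n_class]))
pred = tf.nn.softmax(tf.matmul(x, W) + b)                            # prediction model
loss = tf.reduce_mean(-tf.reduce_sum(label * tf.log(pred), axis=1))  # cross entropy
train_step = tf.train.GradientDescentOptimizer(lr).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(1000):
        batch_x, batch_label = data.next_batch()  # hypothetical data helper
        sess.run(train_step, feed_dict={x: batch_x, label: batch_label})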
Tensorflow has been evolving fast. When running code snippets found online, especially those written in 2016, before many changes were implemented in API r1.0 in early 2017, we may encounter various syntax errors. See the most up-to-date list of API changes here. Among them, the most common ones are:
- `tf.multiply()` replaced `tf.mul()`
- `tf.subtract()` replaced `tf.sub()`
- `tf.negative()` replaced `tf.neg()`
- `tf.split` now takes arguments in the order `tf.split(value, num_or_size_splits, axis)`, reversing the previous order of `axis` and `value`.
- `tf.global_variables_initializer` replaced `tf.initialize_all_variables`
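For instance, a quick check of the new `tf.split` argument order (the shapes here are arbitrary):

import tensorflow as tf
x = tf.ones([4, 6])
# value first, then num_or_size_splits, then axis
parts = tf.split(x, num_or_size_splits=3, axis=1)
print([p.get_shape() for p in parts])  # three tensors, each of shape (4, 2)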
Tensor Shapes link
A tensor has a dynamic shape and a static (or inferred) shape, accessible by `tf.shape()` and `tf.Tensor.get_shape()` respectively.
The static shape is returned as a `TensorShape`, which behaves like a tuple or a list. The static shape is very useful for debugging your code with print, so you can check that your tensors have the right shapes. The dynamic shape is itself a tensor, describing the shape of the original tensor.
By default, a `placeholder` has a completely unconstrained shape, but you can constrain it by passing the optional shape argument.
w = tf.placeholder(tf.float32) # Unconstrained shape
x = tf.placeholder(tf.float32, shape=[None, None]) # Matrix of unconstrained size
y = tf.placeholder(tf.float32, shape=[None, 32]) # Matrix with 32 columns
z = tf.placeholder(tf.float32, shape=[128, 32]) # 128x32-element matrix
In general, `shape=[None, 32]` is the most common pattern: it constrains the feature dimension while still accommodating different batch sizes.
In contrast, a learnable `Variable` generally has a known static shape.
- Difference between `tf.shape()` and `Tensor.get_shape()`:
a = tf.random_normal([2, 5, 4])
print(a.get_shape()) # ==> (2, 5, 4)
print(sess.run(tf.shape(a))) # ==> [2 5 4]
- Static shape `Tensor.get_shape()` is evaluated at graph construction time, while dynamic shape `tf.shape(Tensor)` is evaluated at runtime.
- N.B. The static (inferred) shape may be incomplete; evaluate the dynamic shape in a session.
In TensorFlow, a tensor has both a static (inferred) shape and a dynamic (true) shape. The static shape can be read using the tf.Tensor.get_shape method: this shape is inferred from the operations that were used to create the tensor, and may be partially complete. If the static shape is not fully defined, the dynamic shape of a Tensor t can be determined by evaluating tf.shape(t).
`set_shape` specifies undefined shape information in place, with no copy involved; `tf.reshape` does a shallow copy (not expensive either). link
a = tf.placeholder(tf.float32, (None, 10))
print('{} initial'.format(a.get_shape()))
a.set_shape((5, 10))
print('{} after set shape'.format(a.get_shape()))
try:
    a.set_shape((1, 5, 10))
except ValueError:
    print('cannot set_shape to (1, 5, 10) shape incompatible!')
a = tf.reshape(a, [1, 5, -1])
print('{} reshaped (copied)'.format(a.get_shape()))
# ==>
"""
(?, 10) initial
(5, 10) after set shape
cannot set_shape to (1, 5, 10) shape incompatible!
(1, 5, 10) reshaped (copied)
"""
On the NHWC vs NCHW data formats, the current recommendation is that users support both formats in their models. In the long term, we plan to rewrite graphs to make switching between the formats transparent. link
Use the following code to list all available devices:
import tensorflow as tf
from tensorflow.python.client import device_lib
device_lib.list_local_devices()
# Creates a graph, pinning the ops to a specific device.
with tf.device('/cpu:1'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True, log_device_placement=True))
# Runs the op.
print(sess.run(c))
Note that the device-placement log may appear on the console from which you launched the Jupyter Notebook, rather than in the notebook itself.
Tensorflow's broadcasting rules are designed to follow numpy's.
When operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing dimensions, and works its way forward. Two dimensions are compatible when
- they are equal, or
- one of them is 1
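A minimal sketch of these rules in Tensorflow, using the classic shape example from the numpy docs:

import tensorflow as tf
a = tf.ones([8, 1, 6, 1])
b = tf.ones([7, 1, 5])
# Compare trailing dims right-to-left: 1 vs 5 -> 5, 6 vs 1 -> 6, 1 vs 7 -> 7;
# b's missing leading dim is treated as 1, so it broadcasts against 8.
c = a + b
print(c.get_shape())  # ==> (8, 7, 6, 5)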
- An example taken from O'Reilly's Hands-on machine learning with scikit-learn and tensorflow: input 150x100 RGB image, one conv layer with 200 5x5 filters, 1x1 stride and `SAME` padding. The output would be 200 feature maps of size 150x100, with a total of (5x5x3+1)x200 = 15200 parameters.
- Computation: each of the 200 feature maps contains 150x100 neurons, and each neuron needs to compute a weighted sum of 5x5x3 inputs; that is (5x5x3)x150x100x200 = 225 million multiplications. Including the same number of additions, this requires 450 million flops.
- Storage: if each value is stored as a 32-bit float (single precision), then the output features take 200x150x100x32/8 ≈ 11.4 MB of RAM per instance. If a training batch contains 100 instances, then this layer alone would take up over 1 GB of RAM!
- During training, every layer's output computed during the forward pass needs to be preserved for back-propagation, so the RAM needed is at least the total amount required by all layers.
- During inference, the RAM occupied by one layer can be released as soon as the next layer has been computed, so only as much RAM as required by two consecutive layers is needed.
- Pooling reduces the input image size and also makes the NN tolerate a bit more image shift (location invariance).
- Pooling works on every input channel independently. Generally you can pool over the height and width in each channel, or pool over the channels; you cannot currently do both at once in Tensorflow (see the sketch after this list).
- Typical CNN architectures stack a few convolutional layers (each one followed by a ReLU layer) and a pooling layer. The image gets smaller and smaller but gets deeper and deeper as well.
- A common mistake is to make kernels too large. We can get the same receptive field as a 5x5 kernel by stacking two 3x3 kernels, with fewer parameters.
- Cross entropy cost function is preferred as it penalizes bad predictions much more, producing larger gradients and thus converging faster.
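A minimal sketch (assuming NHWC layout) that reproduces the output shapes from the book example above and shows per-channel height/width pooling:

import tensorflow as tf

# One 150x100 RGB input (NHWC layout), as in the book example above.
x = tf.placeholder(tf.float32, shape=[None, 150, 100, 3])
# 200 filters of size 5x5 on 3 input channels, stride 1, SAME padding.
W = tf.Variable(tf.truncated_normal([5, 5, 3, 200], stddev=0.1))
conv = tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')
print(conv.get_shape())  # ==> (?, 150, 100, 200): 200 feature maps of 150x100
# 2x2 max pooling over height and width; each channel is pooled independently.
pool = tf.nn.max_pool(conv, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='VALID')
print(pool.get_shape())  # ==> (?, 75, 50, 200)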
`tf.set_random_seed(seed)`'s interaction with operation-level seeds is as follows: docstring
- If neither the graph-level nor the operation seed is set: A random seed is used for this op.
- If the graph-level seed is set, but the operation seed is not: The system deterministically picks an operation seed in conjunction with the graph-level seed so that it gets a unique random sequence.
- If the graph-level seed is not set, but the operation seed is set: A default graph-level seed and the specified operation seed are used to determine the random sequence.
- If both the graph-level and the operation seed are set: Both seeds are used in conjunction to determine the random sequence.
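A minimal sketch of these cases (the seed values are arbitrary):

import tensorflow as tf

tf.set_random_seed(1234)             # graph-level seed
a = tf.random_uniform([1])           # op seed derived deterministically from the graph seed
b = tf.random_uniform([1], seed=42)  # graph-level and op-level seeds used together
with tf.Session() as sess:
    print(sess.run([a, b]))          # the same values on every run of this script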
- Youtube Video Link, Detailed documentation
- Variables have to be explicitly initialized.
a = tf.Variable(tf.random_uniform([1], 0, 1))
sess = tf.Session()
sess.run(a.initializer)
print(sess.run(a))
b = tf.random_normal([1], 0, 1)
print(sess.run(b))
- There is no `tf.sum()` method. Instead, `tf.reduce_sum(keep_dims=False)` is very similar to `np.sum(keepdims=False)`. Pay special attention to the broadcasting rule.
- `np.random.normal()` does not take any `dtype` argument. The dtype has to be set explicitly, for example, `np.random.normal().astype(np.float32)`.
- Best practice of `import`:

from package.subpackage1.subpackage2 import subpackage3
subpackage3.name
- Keep track of shapes through the network: placeholder (NxD), hidden layer 1 weight (DxH1), hidden layer 2 weight (H1xH2); each layer computes y = X * W + b. (See the sketch after this list.)
- `sparse_softmax_cross_entropy_with_logits` and `softmax_cross_entropy_with_logits` link: the sparse version takes integer class indices as labels, while the other expects one-hot label vectors.
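A minimal sketch tying the shape bookkeeping above to the two cross-entropy ops (all sizes are hypothetical):

import tensorflow as tf

D, H1, H2, C = 784, 256, 128, 10  # hypothetical sizes
X = tf.placeholder(tf.float32, [None, D])                     # (N x D)
W1 = tf.Variable(tf.truncated_normal([D, H1], stddev=0.1))    # (D x H1)
b1 = tf.Variable(tf.zeros([H1]))
h1 = tf.nn.relu(tf.matmul(X, W1) + b1)                        # y = X * W + b
W2 = tf.Variable(tf.truncated_normal([H1, H2], stddev=0.1))   # (H1 x H2)
b2 = tf.Variable(tf.zeros([H2]))
h2 = tf.nn.relu(tf.matmul(h1, W2) + b2)
W3 = tf.Variable(tf.truncated_normal([H2, C], stddev=0.1))
b3 = tf.Variable(tf.zeros([C]))
logits = tf.matmul(h2, W3) + b3

# softmax_cross_entropy_with_logits expects one-hot label vectors...
onehot = tf.placeholder(tf.float32, [None, C])
loss1 = tf.nn.softmax_cross_entropy_with_logits(labels=onehot, logits=logits)
# ...while the sparse version takes integer class indices directly.
idx = tf.placeholder(tf.int64, [None])
loss2 = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=idx, logits=logits)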
An epoch usually means one iteration over all of the training data. For instance, if you have 20,000 images and a batch size of 100, then each epoch contains 20,000 / 100 = 200 steps. link
- blog link
- A typical scenario has three steps:
  - Create a Saver and tell it which variables you want to save,
  - Save the variables to a file,
  - Restore the variables from a file when they are needed.
- `tf.train.import_meta_graph` link
- The checkpoint file only saves the weights; the graph itself can be recovered from the meta file. (link) There are two parts to the model: the model definition, saved by Supervisor as graph.pbtxt in the model directory, and the numerical values of tensors, saved into checkpoint files like model.ckpt-1003418. (link2)
- How to save a model link
  Note that in this example, while the saver actually saves both the current values of the variables as a checkpoint and the structure of the graph (*.meta), no specific care was taken w.r.t. how to retrieve e.g. the placeholders x and y once the model was restored. E.g. if the restoring is done anywhere else than this training script, it can be cumbersome to retrieve x and y from the restored graph (especially in more complicated models). To avoid that, always give names to your variables / placeholders / ops, or think about using tf.collections as shown in one of the remarks.
Three ways to load data link
- Feeding: using `feed_dict` when running each step. A placeholder exists solely to serve as the target of feeds. It is not initialized and contains no data. A placeholder generates an error if it is executed without a feed, so you won't forget to feed it.
  This is the easiest way, but parsing could be a bottleneck; in that case, build input pipelines.
  Except for special circumstances or example code, DO NOT feed data into the session from Python variables, e.g. a dictionary. (link)

# This will result in poor performance.
sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
- Reading from files: an input pipeline reads from files at the beginning of the TF graph.
- Preloaded data: a constant or Variable in the graph holds all the data (for small datasets).
import tensorflow as tf
tf.reset_default_graph()
w1 = tf.Variable(tf.truncated_normal([10]), name='w1')
tf.add_to_collection('weights', w1)
saver = tf.train.Saver()
# save graph
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.save(sess, r'/tmp/mnist/mymodel', global_step=10)
From a different session (e.g., on a different workstation):
import tensorflow as tf
# load graph (even from a different station)
with tf.Session() as sess:
    new_saver = tf.train.import_meta_graph(r'/tmp/mnist/mymodel-10.meta')
    new_saver.restore(sess, r'/tmp/mnist/mymodel-10')
    new_weights = tf.get_collection('weights')[0]
    print(sess.run(new_weights))
Another example
import tensorflow as tf

def save(checkpoint_file='/tmp/mnist/hello.chk'):
    with tf.Session() as session:
        x = tf.Variable([42.0, 42.1, 42.3], name='x')
        y = tf.Variable([[1.0, 2.0], [3.0, 4.0]], name='y')
        not_saved = tf.Variable([-1, -2], name='not_saved')
        session.run(tf.global_variables_initializer())
        print(session.run(tf.global_variables()))
        # saver = tf.train.Saver([x, y])  # would save only x and y
        saver = tf.train.Saver()          # saves all variables
        saver.save(session, checkpoint_file)

def restore(checkpoint_file='/tmp/mnist/hello.chk'):
    x = tf.Variable(-1.0, validate_shape=False, name='x')
    y = tf.Variable(-1.0, validate_shape=False, name='y')
    with tf.Session() as session:
        saver = tf.train.Saver()
        saver.restore(session, checkpoint_file)
        print(session.run(tf.global_variables()))
def restore2(checkpoint_file='/tmp/mnist/hello.chk'):
    with tf.Session() as session:
        saver = tf.train.import_meta_graph(checkpoint_file + ".meta")
        session.run(tf.global_variables_initializer())  # initialize first...
        saver.restore(session, checkpoint_file)         # ...then restore overwrites with saved values
        print(session.run(tf.global_variables()))
def reset():
    tf.reset_default_graph()  # destroys the graph

save()      # saves [x, y, not_saved]
reset()
restore()   # loads [x, y]
reset()
restore2()  # loads [x, y, not_saved]
`tf.trainable_variables()` returns a list of all trainable variables. Variables created with `tf.Variable(trainable=False)` will not be added to this list.
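A minimal sketch of this behavior (variable names are arbitrary):

import tensorflow as tf
v1 = tf.Variable(0.0, name='v1')                   # trainable by default
v2 = tf.Variable(0.0, name='v2', trainable=False)  # excluded from the list
print([v.name for v in tf.trainable_variables()])  # ==> ['v1:0']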
- TF how-to
- iPython Notebook
`v = tf.get_variable(name, shape, dtype, initializer)` retrieves an existing variable, or creates one if it does not exist yet.
- Case 1: the scope is set for creating new variables, as evidenced by tf.get_variable_scope().reuse == False.
- Case 2: the scope is set for reusing variables, as evidenced by tf.get_variable_scope().reuse == True.
`tf.variable_scope()` carries a prefix name and a reuse flag. The `reuse` parameter is inherited by all sub-scopes.
- name scope vs variable scope: name scope is ignored by `tf.get_variable()`.
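A minimal sketch of sharing a variable across two calls (the scope and variable names are arbitrary):

import tensorflow as tf

def linear(x):
    # get_variable creates 'layer/w' on the first call and retrieves it afterwards
    w = tf.get_variable('w', shape=[10, 10],
                        initializer=tf.truncated_normal_initializer(stddev=0.1))
    return tf.matmul(x, w)

x = tf.placeholder(tf.float32, [None, 10])
with tf.variable_scope('layer'):              # reuse == False: creates the variable
    y1 = linear(x)
with tf.variable_scope('layer', reuse=True):  # reuse == True: retrieves the same variable
    y2 = linear(x)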
- Why do we have to normalize the stddev parameter during initialization?
  One should generally initialize weights with a small amount of noise for symmetry breaking, and to prevent 0 gradients. Since we're using ReLU neurons, it is also good practice to initialize them with a slightly positive initial bias to avoid "dead neurons". link
- What do `import_meta_graph` and `restore` do? `import_meta_graph` loads the graph from the meta file, and `restore` recovers the weights of the saved variables.
- What do `add_to_collection` and `get_collection` do? They make it easier to retrieve variables from a restored graph.
- How to use an input pipeline?
- Remember to use `tf.reset_default_graph()` to clear the graph before training, especially during interactive development.
- Use fused batch norm in DNNs.
- Use a placeholder to hold the dropout probability, so it can be set differently during training and evaluation (see the sketch after this list).
- Adam uses an adaptive learning rate.
- Use xentropy during training and accuracy during evaluation
correct_prediction = tf.equal(tf.argmax(y_, 1), tf.argmax(y_conv, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
- For simple classification tasks, it generally is OK to down-sample the image first and then do the classification using CNN.
- Deep learning is usually good at bringing up sensitivity; afterwards, conventional computer vision or machine learning techniques can be used to bring up specificity (filtering out false positives).
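A minimal sketch of the dropout placeholder mentioned above (`h`, `train_step`, and `accuracy` are hypothetical stand-ins):

import tensorflow as tf
h = tf.placeholder(tf.float32, [None, 256])  # hypothetical hidden activations
keep_prob = tf.placeholder(tf.float32)       # fed differently at train vs eval time
h_drop = tf.nn.dropout(h, keep_prob)
# training:   sess.run(train_step, feed_dict={..., keep_prob: 0.5})
# evaluation: sess.run(accuracy,   feed_dict={..., keep_prob: 1.0})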
- https://www.tensorflow.org/tutorials/using_gpu
- https://www.tensorflow.org/tutorials/deep_cnn
- Using GPU guide
- Buying guide
- Guidelines:
  - FP32 runs much faster than FP64 on GPUs. Theoretically FP32 should be twice as fast as FP64, but most GPUs don't have as many FP64 units, so FP32 operations can be up to 32 times faster than FP64; the exact ratio depends on the GPU architecture. However, if the operation is memory-bound, such as matrix transposition, the ideal ratio can be achieved. (link)
- Use FP32 as the default for floating-point calculations. It has enough precision for most deep learning cases.
- Cross entropy: A great visual guide to cross entropy
- Visualizing the graph using tensorboard: tutorial
- Debugging tips in TF link
- Coursera course by Hinton link
- Kaggle Data Bowl link
- Blogs
I've read a lot of research papers (DeepMind, Google Brain, Facebook, NYU, Stanford, etc.), blogs (Nervana Systems, Indico, Colah, Otoro's Blog, etc.), lecture notes (Stanford cs231n, cs224d, cs229), and tutorials (Quoc Le's tutorial, TensorFlow, etc.), and have watched a lot of videos (Hugo Larochelle's tutorials, Stanford cs229, TedTalks, lectures by Yann LeCun <3, etc.) link
- The video is available on youtube.
- Decision making flowchart in tuning DL ![](images/Decision Making in Applying Deep Learning - Page 1.svg)
- Make sure dev and test are from the same distribution.
- Dev is the benchmark of tuning.
- Bias vs Variance tradeoff
- In the era of deep learning, bias and variance are not as closely coupled as in traditional machine learning.
- Human level performance
- It provides feasibility, data and insights.
- Why does improving become harder after beating human-level performance?
- Human level performance gives guidance on which gap to focus on before hitting human level performance (e.g., are we doing good enough on training data error?).
- Afterwards, it becomes unclear which area to focus on (e.g., hard to tell if it is a bias or a variance problem).
- What to do? Focus on sub-areas still lagging behind human level performance.
- Why is it increasingly important now?
- Because we are approaching human-level performance now, and knowing human-level performance is very useful information for driving decision making.
- How to define human level performance in order to drive algorithm development?
- Given that human-level performance is often used as a proxy for the theoretically optimal error rate (and measures the noise level of the data), it is most useful to get the best possible human-level performance.
- Medical example of diagnosing a certain disease. Among the error rates from a) average person 3%, b) average doctor 1%, c) expert doctor 0.8%, d) group of expert doctors 0.5%, d) is the most useful. However, considering the difficulty of obtaining the labels, b) is most often used.
- What can AI/DL do? How can we put AI into our product?
- Heuristic: Anything that a person can perform with < 1s of thought.
- Predicting outcome of next in sequence of events.
- How to become a good DL researcher?
- ML + DL basics
- PhD student process:
- Read a lot of (20~50 at least) papers
- Replicate the results.
- Dig into dirty work (but not only that)
- link
- Beautiful simplicity of backpropagation: every local gradient is a LOCAL worker in a GLOBAL chase for a smaller loss function.
- The gradient of the sigmoid function: $\frac{d\sigma(x)}{dx} = \sigma(x) (1-\sigma(x))$.
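A quick check of this identity, writing $\sigma(x) = \frac{1}{1+e^{-x}}$:

$$\frac{d\sigma(x)}{dx} = \frac{e^{-x}}{(1+e^{-x})^2} = \frac{1}{1+e^{-x}} \cdot \frac{e^{-x}}{1+e^{-x}} = \sigma(x)\,(1-\sigma(x)).$$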
- Precision-Recall curve plots precision (the fraction of predicted positives that are truly positive, $\frac{TP}{TP+FP}$) vs recall (the fraction of all actual positives that are predicted positive, $\frac{TP}{TP+FN}$).
- ROC (receiver operating characteristic) curve plots recall (sensitivity, true positive rate) vs specificity (1 - false positive rate, $\frac{TN}{TN+FP}$).
- When data is highly skewed (imbalanced, such as rare disease detection, $P/N \to 0$), the ROC curve is not very useful, as specificity cannot be differentiated clearly; in this regard, the PR curve works better. (See the sketch after this list.)
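A minimal sketch computing both quantities on toy data:

import numpy as np

y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])  # toy labels
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])  # toy predictions
tp = np.sum((y_pred == 1) & (y_true == 1))   # 3
fp = np.sum((y_pred == 1) & (y_true == 0))   # 1
fn = np.sum((y_pred == 0) & (y_true == 1))   # 1
precision = tp / float(tp + fp)              # TP / (TP + FP) = 0.75
recall = tp / float(tp + fn)                 # TP / (TP + FN) = 0.75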
- DeepLearningJP2016 @SlideShares
- DeepEM3D: Approaching human-level performance on 3D anisotropic EM image segmentation link
- Sensor fusion link
- 2 approaches to sensor fusion
- fuse input data from sensors before analysis
- fuse analysis output data
- Prerequisites of sensor fusion
- sensor synchronization using GPS
- Localization in 6D. GPS is not reliable or accurate in urban canyons.