
Commit a0fbfa0

Proper usage of i.e. and e.g.
Always 'i.e.,' and 'e.g.,'
1 parent 1aa128a commit a0fbfa0


13 files changed: +62 -62 lines changed


README.md

Lines changed: 1 addition & 1 deletion
@@ -24,4 +24,4 @@ If you want to test your changes locally before pushing your changes to the `mas

 - Start and end math equations with `$$` **for both inline and display equations**! To make a display equation, put one newline before the starting `$$` a newline after the ending `$$`.

-- Avoid vertical bars `|` in any inline math equations (ie. within a paragraph of text). Otherwise, the GitHub Markdown compiler interprets it as a table cell element (see GitHub Markdown spec [here](https://github.github.com/gfm/)). Instead, use one of `\mid`, `\vert`, `\lvert`, or `\rvert` instead. For double bar lines, write `\|` instead of `||`.
+- Avoid vertical bars `|` in any inline math equations (i.e., within a paragraph of text). Otherwise, the GitHub Markdown compiler interprets it as a table cell element (see GitHub Markdown spec [here](https://github.github.com/gfm/)). Instead, use one of `\mid`, `\vert`, `\lvert`, or `\rvert` instead. For double bar lines, write `\|` instead of `||`.
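
As an illustration of this rule (an editorial example, not part of the commit): an inline conditional probability should be written `$$p(x \mid z)$$` rather than `$$p(x|z)$$`, and a norm `$$\|x\|$$` rather than `$$||x||$$`, so that the pipe characters are not parsed as table delimiters.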

extras/vae/index.md

Lines changed: 4 additions & 4 deletions
@@ -22,9 +22,9 @@ $$ p(x,z) = p(x|z)p(z) $$
 with observed $$x \in \mathcal{X}$$, where $$\mathcal{X}$$ can be continuous or discrete, and latent $$z \in \Re^k$$.

 {% include marginfigure.html id="faces" url="assets/img/faces.png" description="Variational autoencoder $$p(x|z)p(z)$$ applied to a face images (modeled by $$x$$). The learned latent space $$z$$ can be used to interpolate between facial expressions." %}
-To make things concrete, you may think of $$x$$ as being an image (e.g. a human face), and $$z$$ as latent factors (not seen during training) that explain features of the face. For example, one coordinate of $$z$$ can encode whether the face is happy or sad, another one whether the face is male or female, etc.
+To make things concrete, you may think of $$x$$ as being an image (e.g., a human face), and $$z$$ as latent factors (not seen during training) that explain features of the face. For example, one coordinate of $$z$$ can encode whether the face is happy or sad, another one whether the face is male or female, etc.

-We may also be interested in models with many layers, e.g. $$p(x \mid z_1)p(z_1 \mid z_2)p(z_2 \mid z_3)\cdots p(z_{m-1}\mid z_m)p(z_m)$$. These are often called *deep generative models* and can learn hierarchies of latent representations.
+We may also be interested in models with many layers, e.g., $$p(x \mid z_1)p(z_1 \mid z_2)p(z_2 \mid z_3)\cdots p(z_{m-1}\mid z_m)p(z_m)$$. These are often called *deep generative models* and can learn hierarchies of latent representations.
 In this chapter, we will assume for simplicity that there is only one latent layer.

 ### Learning deep generative models
@@ -128,7 +128,7 @@ This reparametrization has a very interesting interpretation. First, think of $$

 The first term $$\log p(x\mid z)$$ is the log-likelihood of the observed $$x$$ given the code $$z$$ that we have sampled. This term is maximized when $$p(x\mid z)$$ assigns high probability to the original $$x$$. It is trying to reconstruct $$x$$ given the code $$z$$; for that reason we call $$p(x\mid z)$$ the *decoder* network and the term is called the *reconstruction error*.

-The second term is the divergence between $$q(z\mid x)$$ and the prior $$p(z)$$, which we will fix to be a unit Normal. It encourages the codes $$z$$ to look Gaussian. We call it the *regularization* term. It prevents $$q(z\mid x)$$ from simply encoding an identity mapping, and instead forces it to learn some more interesting representation (e.g. facial features in our first example).
+The second term is the divergence between $$q(z\mid x)$$ and the prior $$p(z)$$, which we will fix to be a unit Normal. It encourages the codes $$z$$ to look Gaussian. We call it the *regularization* term. It prevents $$q(z\mid x)$$ from simply encoding an identity mapping, and instead forces it to learn some more interesting representation (e.g., facial features in our first example).

 Thus, our optimization objective is trying to fit a $$q(z\mid x)$$ that will map $$x$$ into a useful latent space $$z$$ from which we are able to reconstruct $$x$$ via $$p(x\mid z)$$. This type of objective is reminiscent of *auto-encoder* neural networks{% include sidenote.html id="note-autoencoder" note="An auto-encoder is a pair of neural networks $$f, g$$ that are composed as $$\bar x=f(g(x))$$. They are trained to minimize the reconstruction error $$\|\bar x - x\|$$. In practice, $$g(x)$$ learns to embed $$x$$ in a latent space that often has an intuitive interpretation." %}. This is where the AEVB algorithm takes its name.
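
To make the reconstruction and regularization terms above concrete, here is a minimal editorial sketch of the objective (not part of the commit, and not the VAE paper's code). It assumes a diagonal-Gaussian encoder $$q(z\mid x)$$ and a Bernoulli decoder $$p(x\mid z)$$, with hypothetical `mu`, `logvar`, and `decoder_logits` tensors produced by encoder/decoder networks defined elsewhere:

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    # z = mu + sigma * eps with eps ~ N(0, I), so gradients flow through mu and logvar.
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps

def negative_elbo(x, decoder_logits, mu, logvar):
    # Reconstruction error: -log p(x | z) for a Bernoulli decoder over pixels.
    reconstruction = F.binary_cross_entropy_with_logits(decoder_logits, x, reduction="sum")
    # Regularization: KL( N(mu, sigma^2) || N(0, I) ), available in closed form.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return reconstruction + kl
```

Minimizing `negative_elbo` on codes sampled via `reparameterize(mu, logvar)` trains the encoder and decoder jointly by ordinary backpropagation.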

@@ -203,7 +203,7 @@ We may interpret the variational autoencoder as a directed latent-variable proba
 ### Experimental results

 {% include marginfigure.html id="mnist" url="assets/img/mnist.png" description="Interpolating over MNIST digits by interpolating over latent variables" %}
-The VAE can be applied to images $$x$$ in order to learn interesting latent representations. The VAE paper contains a few examples on the Frey face dataset and on the MNIST digits. On the face dataset, we can interpolate between facial expressions by interpolating between latent variables (e.g. we can generate smooth transitions between "angry" and "surprised"). On the MNIST dataset, we can similarly interpolate between numbers.
+The VAE can be applied to images $$x$$ in order to learn interesting latent representations. The VAE paper contains a few examples on the Frey face dataset and on the MNIST digits. On the face dataset, we can interpolate between facial expressions by interpolating between latent variables (e.g., we can generate smooth transitions between "angry" and "surprised"). On the MNIST dataset, we can similarly interpolate between numbers.

 The authors also compare their methods against three alternative approaches: the wake-sleep algorithm, Monte-Carlo EM, and hybrid Monte-Carlo. The latter two methods are sampling-based approaches; they are quite accurate, but don't scale well to large datasets. Wake-sleep is a variational inference algorithm that scales much better; however it does not use the exact gradient of the ELBO (it uses an approximation), and hence it is not as accurate as AEVB. The paper illustrates this by plotting learning curves.

inference/jt/index.md

Lines changed: 9 additions & 9 deletions
@@ -4,11 +4,11 @@ title: Junction Tree Algorithm
 ---
 We have seen how the variable elimination (VE) algorithm can answer marginal queries of the form $$P(Y \mid E = e)$$ for both directed and undirected networks.

-However, this algorithm has an important shortcoming: if we want to ask the model for another query, e.g. $$P(Y_2 \mid E_2 = e_2)$$, we need to restart the algorithm from scratch. This is very wasteful and computationally burdensome.
+However, this algorithm has an important shortcoming: if we want to ask the model for another query, e.g., $$P(Y_2 \mid E_2 = e_2)$$, we need to restart the algorithm from scratch. This is very wasteful and computationally burdensome.

 Fortunately, it turns out that this problem is also easily avoidable. When computing marginals, VE produces many intermediate factors $$\tau$$ as a side-product of the main computation; these factors turn out to be the same as the ones that we need to answer other marginal queries. By caching them after a first run of VE, we can easily answer new marginal queries at essentially no additional cost.

-The end result of this chapter will be a new technique called the Junction Tree (JT) algorithm{% include sidenote.html id="note-VEandJT" note="If you are familiar with dynamic programming (DP), you can think of VE vs. the JT algorithm as two flavors of same technique: top-down DP v.s. bottom-up table filling. Just like in computing the $$n$$-th Fibonacci number $$F_n$$, top-down DP (i.e. VE) computes *just* that number, but bottom-up (i.e. JT) will create a filled table of all $$F_i$$ for $$i \leq n$$. Moreover, the two-pass nature of JT is a result of the underlying DP on bi-directional (junction) trees, while Fibonacci numbers' relation is a uni-directional tree." %}; this algorithm will first execute two runs of the VE algorithm to initialize a particular data structure holding a set of pre-computed factors. Once the structure is initialized, it can answer marginal queries in $$O(1)$$ time.
+The end result of this chapter will be a new technique called the Junction Tree (JT) algorithm{% include sidenote.html id="note-VEandJT" note="If you are familiar with dynamic programming (DP), you can think of VE vs. the JT algorithm as two flavors of same technique: top-down DP v.s. bottom-up table filling. Just like in computing the $$n$$-th Fibonacci number $$F_n$$, top-down DP (i.e., VE) computes *just* that number, but bottom-up (i.e., JT) will create a filled table of all $$F_i$$ for $$i \leq n$$. Moreover, the two-pass nature of JT is a result of the underlying DP on bi-directional (junction) trees, while Fibonacci numbers' relation is a uni-directional tree." %}; this algorithm will first execute two runs of the VE algorithm to initialize a particular data structure holding a set of pre-computed factors. Once the structure is initialized, it can answer marginal queries in $$O(1)$$ time.

 We will introduce two variants of this algorithm: belief propagation (BP), and the full junction tree method. BP applies to tree-structured graphs, while the junction-tree method is applicable to general networks.
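
As a small editorial illustration of the sidenote's analogy (not part of the commit): top-down DP returns only the value that was asked for, while bottom-up table filling leaves behind answers to all smaller queries, which is the role played by the junction tree's pre-computed factors.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib_top_down(n):
    # Top-down DP (like a single run of VE): the caller gets just F_n.
    return n if n < 2 else fib_top_down(n - 1) + fib_top_down(n - 2)

def fib_bottom_up(n):
    # Bottom-up DP (like JT's cached factors): fills a table of all F_i for i <= n,
    # after which any F_i with i <= n is an O(1) lookup.
    table = [0, 1]
    for i in range(2, n + 1):
        table.append(table[i - 1] + table[i - 2])
    return table[: n + 1]
```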

@@ -35,8 +35,8 @@ Finally, this algorithm will be correct because our messages are defined as the

 We are now ready to formally define the belief propagation algorithm. This algorithm has two variants, each used for a different task:

-- *sum-product message passing*: used for marginal inference, i.e. computing $$p(x_i)$$
-- *max-product message passing*: used for MAP (maximum a posteriori) inference, i.e. computing $$\max_{x_1, \dotsc, x_n} p(x_1, \dotsc, x_n)$$
+- *sum-product message passing*: used for marginal inference, i.e., computing $$p(x_i)$$
+- *max-product message passing*: used for MAP (maximum a posteriori) inference, i.e., computing $$\max_{x_1, \dotsc, x_n} p(x_1, \dotsc, x_n)$$

 ### Sum-product message passing
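
As an editorial illustration of the two variants listed above (not part of the commit; the chain and its pairwise factors `psi` are made up), the following NumPy sketch computes every marginal $$p(x_i)$$ by sum-product message passing and the MAP assignment by max-product with back-pointers:

```python
import numpy as np

# Toy chain MRF x_1 - x_2 - ... - x_n with pairwise factors psi[i](x_i, x_{i+1});
# p(x) is proportional to the product of the factors, and each variable takes k values.
rng = np.random.default_rng(0)
n, k = 4, 3
psi = [rng.uniform(0.1, 1.0, size=(k, k)) for _ in range(n - 1)]

# Sum-product: forward and backward messages give every marginal p(x_i).
fwd = [np.ones(k)]
for f in psi:                               # m_{i -> i+1}(x_{i+1}) = sum_{x_i} m_{i-1 -> i}(x_i) psi(x_i, x_{i+1})
    fwd.append(fwd[-1] @ f)
bwd = [np.ones(k)]
for f in reversed(psi):                     # same messages, passed in the opposite direction
    bwd.append(f @ bwd[-1])
bwd.reverse()
marginals = [fwd[i] * bwd[i] / np.sum(fwd[i] * bwd[i]) for i in range(n)]

# Max-product (in log-space) with back-pointers: the MAP assignment argmax_x p(x).
msg = np.zeros(k)
back = []
for f in psi:
    scores = msg[:, None] + np.log(f)       # best score so far for each (x_i, x_{i+1}) pair
    back.append(np.argmax(scores, axis=0))  # best x_i for each value of x_{i+1}
    msg = np.max(scores, axis=0)
assignment = [int(np.argmax(msg))]
for b in reversed(back):                    # follow the back-pointers
    assignment.append(int(b[assignment[-1]]))
assignment.reverse()                        # MAP assignment (x_1, ..., x_n)
```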

@@ -98,13 +98,13 @@ $$

 Since both problems decompose in the same way, we may reuse all of the machinery developed for marginal inference and apply it directly to MAP inference. Note that this also applies to factor trees.

-There is a small caveat in that we often want not just maximum value of a distribution, i.e. $$\max_x p(x)$$, but also its most probable assignment, i.e. $$\arg\max_x p(x)$$. This problem can be easily solved by keeping *back-pointers* during the optimization procedure. For instance, in the above example, we would keep a backpointer to the best assignment to $$x_1$$ given each assignment to $$x_2$$, a pointer to the best assignment to $$x_2$$ given each assignment to $$x_3,$$ and so on.
+There is a small caveat in that we often want not just maximum value of a distribution, i.e., $$\max_x p(x)$$, but also its most probable assignment, i.e., $$\arg\max_x p(x)$$. This problem can be easily solved by keeping *back-pointers* during the optimization procedure. For instance, in the above example, we would keep a backpointer to the best assignment to $$x_1$$ given each assignment to $$x_2$$, a pointer to the best assignment to $$x_2$$ given each assignment to $$x_3,$$ and so on.

 ## Junction tree algorithm

 So far, our discussion assumed that the graph is a tree. What if that is not the case? Inference in that case will not be tractable; however, we may try to massage the graph to its most tree-like form, and then run message passing on this graph.

-At a high-level the junction tree algorithm partitions the graph into clusters of variables; internally, the variables within a cluster could be highly coupled; however, interactions *among* clusters will have a tree structure, i.e. a cluster will be only directly influenced by its neighbors in the tree. This leads to tractable global solutions if the local (cluster-level) problems can be solved exactly.
+At a high-level the junction tree algorithm partitions the graph into clusters of variables; internally, the variables within a cluster could be highly coupled; however, interactions *among* clusters will have a tree structure, i.e., a cluster will be only directly influenced by its neighbors in the tree. This leads to tractable global solutions if the local (cluster-level) problems can be solved exactly.

 ### An illustrative example

@@ -133,7 +133,7 @@ The running intersection property is what enables us to push sums in all the way
 The core idea of the junction tree algorithm is to turn a graph into a tree of clusters that are amenable to the variable elimination algorithm like the above MRF. Then we simply perform message-passing on this tree.

 Suppose we have an undirected graphical model $$G$$ (if the model is directed, we consider its moralized graph).
-A junction tree $$T=(C, E_T)$$ over $$G = (\Xc, E_G)$$ is a tree whose nodes $$c \in C$$ are associated with subsets $$x_c \subseteq \Xc$$ of the graph vertices (i.e. sets of variables); the junction tree must satisfy the following properties:
+A junction tree $$T=(C, E_T)$$ over $$G = (\Xc, E_G)$$ is a tree whose nodes $$c \in C$$ are associated with subsets $$x_c \subseteq \Xc$$ of the graph vertices (i.e., sets of variables); the junction tree must satisfy the following properties:

 - *Family preservation*: For each factor $$\phi$$, there is a cluster $$c$$ such that $$\text{Scope}[\phi] \subseteq x_c$$.
 - *Running intersection*: For every pair of clusters $$c^{(i)}, c^{(j)}$$, every cluster on the path between $$c^{(i)}, c^{(j)}$$ contains $$x_c^{(i)} \cap x_c^{(j)}$$.
@@ -170,7 +170,7 @@ $$
 \beta_c(x_c) = \psi_c(x_c) \prod_{\ell \in N(i)} m_{\ell \to i}(S_{\ell i}).
 $$

-These updates are often referred to as *Shafer-Shenoy*. After all the messages have been passed, beliefs will be proportional to the marginal probabilities over their scopes, i.e. $$\beta_c(x_c) \propto p(x_c)$$. We may answer queries of the form $$\tp(x)$$ for $$x \in x_c$$ by marginalizing out the variable in its belief{% include sidenote.html id="note-dp" note="Readers familiar with combinatorial optimization will recognize this as a special case of dynamic programming on a tree decomposition of a graph with bounded treewidth." %}
+These updates are often referred to as *Shafer-Shenoy*. After all the messages have been passed, beliefs will be proportional to the marginal probabilities over their scopes, i.e., $$\beta_c(x_c) \propto p(x_c)$$. We may answer queries of the form $$\tp(x)$$ for $$x \in x_c$$ by marginalizing out the variable in its belief{% include sidenote.html id="note-dp" note="Readers familiar with combinatorial optimization will recognize this as a special case of dynamic programming on a tree decomposition of a graph with bounded treewidth." %}

 $$
 \tp(x) = \sum_{x_c \backslash x} \beta_c(x_c).
@@ -197,7 +197,7 @@ Repeating this procedure eventually produces a single factor $$\beta(x_c^{(i)})$

 Formally, we may prove correctness of the JT algorithm through an induction argument on the number of factors $$\psi$$; we will leave this as an exercise to the reader. The key property that makes this argument possible is the RIP; it assures us that it's safe to eliminate a variable from a leaf cluster that is not found in that cluster's sepset; by the RIP, it cannot occur anywhere except that one cluster.

-The important thing to note is that if we now set $$c^{(k)}$$ to be the root of the tree (e.g. if we set $$(b,c,e)$$ to be the root), the message it will receive from $$c^{(j)}$$ (or from $$(b,e,f)$$ in our example) will not change. Hence, the caching approach we used for the belief propagation algorithm extends immediately to junction trees; the algorithm we formally defined above implements this caching.
+The important thing to note is that if we now set $$c^{(k)}$$ to be the root of the tree (e.g., if we set $$(b,c,e)$$ to be the root), the message it will receive from $$c^{(j)}$$ (or from $$(b,e,f)$$ in our example) will not change. Hence, the caching approach we used for the belief propagation algorithm extends immediately to junction trees; the algorithm we formally defined above implements this caching.

 ### Finding a good junction tree