Binary file added Lecture Notes/Images/Moralized.png
Binary file added Lecture Notes/Images/Screenshot_1.png
Binary file added Lecture Notes/Images/Screenshot_10.png
Binary file added Lecture Notes/Images/Screenshot_11.png
Binary file added Lecture Notes/Images/Screenshot_12.png
Binary file added Lecture Notes/Images/Screenshot_13.png
Binary file added Lecture Notes/Images/Screenshot_14.png
Binary file added Lecture Notes/Images/Screenshot_15.png
Binary file added Lecture Notes/Images/Screenshot_16.png
Binary file added Lecture Notes/Images/Screenshot_17.png
Binary file added Lecture Notes/Images/Screenshot_18.png
Binary file added Lecture Notes/Images/Screenshot_19.png
Binary file added Lecture Notes/Images/Screenshot_2.png
Binary file added Lecture Notes/Images/Screenshot_20.png
Binary file added Lecture Notes/Images/Screenshot_21.png
Binary file added Lecture Notes/Images/Screenshot_3.png
Binary file added Lecture Notes/Images/Screenshot_4.png
Binary file added Lecture Notes/Images/Screenshot_5.png
Binary file added Lecture Notes/Images/Screenshot_6.png
Binary file added Lecture Notes/Images/Screenshot_7.png
Binary file added Lecture Notes/Images/Screenshot_8.png
Binary file added Lecture Notes/Images/Screenshot_9.png
144 changes: 144 additions & 0 deletions Lecture Notes/Images/Week3.tex
@@ -0,0 +1,144 @@
\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage{hyperref}
\usepackage{graphicx}
\usepackage{amssymb}
\usepackage{amsmath}
\title{CSC412 Notes Week 3}
\author{Jerry Zheng}
\date{April 2021}
\hypersetup{
colorlinks=true,
linkcolor=blue,
filecolor=magenta,
urlcolor=blue,
pdftitle={Sharelatex Example},
bookmarks=true,
pdfpagemode=FullScreen,
}

\begin{document}

\maketitle

\section{Graphical Models}
\subsection{Chain Rule}
The joint distribution of $N$ random variables can always be factored with the chain rule\\

$$P(x_{1:N}) = P(x_1)P(x_2|x_1)P(x_3 | x_1, x_2) \ldots P(x_N | x_{1:N-1})$$\\

This holds for any joint distribution of discrete random variables, even with full dependence between all the variables.\\

More formally, in probability the chain rule for two random variables is\\
$$P(x, y) = P(x | y)P(y)$$\\

\subsection{Conditional Independence}

To represent large joint distributions we can assume conditional independence:
$$X \perp Y | Z \Leftrightarrow P(X, Y | Z) = P(X | Z)P(Y | Z) \Leftrightarrow P(X | Y, Z) = P(X | Z)$$
This is very useful, as now we can represent a large chain of $N$ variables as a product of local conditional distributions:
$$P(x_{1:N}) = P(x_1) \prod_{t=2}^N P(x_t|x_{t-1})$$

This is the (first-order) Markov assumption: ``the future is independent of the past given the present''.
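
As a small illustration (a sketch with made-up conditional tables, not from lecture), the chain factorization can be evaluated directly:

\begin{verbatim}
# A sketch of the first-order Markov factorization P(x_1) prod_t P(x_t | x_{t-1})
# for 5 binary variables with made-up (random) conditional tables.
import numpy as np

rng = np.random.default_rng(0)
n, k = 5, 2                                   # 5 variables, 2 states each
p_x1 = rng.random(k)
p_x1 /= p_x1.sum()                            # P(x_1)
trans = rng.random((n - 1, k, k))
trans /= trans.sum(axis=-1, keepdims=True)    # each trans[i] is P(next | current)

def joint(x):                                 # x is a tuple of n states
    p = p_x1[x[0]]
    for t in range(1, n):
        p *= trans[t - 1, x[t - 1], x[t]]
    return p

# The full joint table would need 2^5 - 1 = 31 free numbers;
# the chain needs only 1 + 4*2 = 9.
print(joint((0, 1, 1, 0, 1)))
\end{verbatim}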

\subsection{Probabilistic Graphical Models}
If you don't know what a graph is, see \href{https://en.wikipedia.org/wiki/Graph_(discrete_mathematics)}{Wikipedia}.\\

A probabilistic graphical model can be used to represent joint distributions when we assume conditional independence. In this model, nodes are variables and edges show conditional dependence.\\

\subsection{Directed Acyclic Graphical Models}
Using the Markov assumption we can represent complicated graphs much more simply.\\
A fully connected 4-node graph can be represented with half the number of edges!\\
\includegraphics[scale=0.7]{Screenshot_2.png}\\

can now be represented as\\
\includegraphics[scale=0.7]{Screenshot_3.png}\\

It's easy to see that $X_a$ is much easier to evaluate in the second graph than in the first.

$$P(x_a, x_b, x_c, x_d) = P(x_a| x_b, x_c, x_d) P(x_b| x_c, x_d )P( x_c | x_d) P(x_d)$$
can be simplified with the Markov assumption to
$$P(x_a, x_b, x_c, x_d) = P(x_a| x_b) P(x_b| x_c)P( x_c | x_d) P(x_d)$$
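
To make the saving concrete, suppose (just for illustration) that each variable is binary. The fully dependent factorization above needs
$$1 + 2 + 4 + 8 = 15 = 2^4 - 1$$
free parameters, while the chain-structured one needs only
$$1 + 2 + 2 + 2 = 7.$$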

\section{Conditional Independence and Directed-Separation }
Directed separation (d-separation) is a criterion for deciding whether two variables in a DAGM are conditionally independent given a third variable (or set of variables).\\
D-connection implies conditional dependence.\\
D-separation implies conditional independence.\\
\\
This also extends to groups/sets of variables $X$, $Y$, $Z$:

$$X = \{X_1, ...X_n\}$$
$$Y = \{Y_1, ...Y_n\}$$
$$Z = \{Z_1, ...Z_n\}$$
$$X \bot Z | Y$$\\
If every variable in X is d-separated from every variable in Z conditioned on all the variables in Y.\\
To determine d-separation we will use the Bayes ball algorithm\\

For Bayes Ball there are 3 structures that you must know.

\subsection{Bayes Ball - Chain}
\includegraphics[scale=0.7]{Screenshot_7.png}\\

X and Z are conditionally dependent when y is unknown and conditionally independent when y is known.\\
From the chain's graph, we can encode the structure as.
$$P(x, y, z) = P(x)P(y|x)P(z|y)$$
once we condition on y we get.
\begin{align*}
P(x, z | y) &= \frac{P(x)P(y|x)P(z|y)}{P(y)} \\
&= \frac{P(x, y)P(z|y)}{P(y)} \\
&= P(x | y) P(z | y)
\end{align*}
$$\therefore x \bot z | y$$
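
As a quick numerical sanity check (a sketch with random tables, not from lecture), we can build the joint for a small chain and verify the factorization of $P(x, z | y)$:

\begin{verbatim}
# Check numerically that x is independent of z given y in the chain x -> y -> z,
# using random conditional tables for binary variables.
import numpy as np

rng = np.random.default_rng(0)

def cpt(shape):                    # random conditional table, rows sum to 1
    t = rng.random(shape)
    return t / t.sum(axis=-1, keepdims=True)

p_x = cpt((2,))
p_y_given_x = cpt((2, 2))          # indexed [x, y]
p_z_given_y = cpt((2, 2))          # indexed [y, z]

joint = np.einsum('x,xy,yz->xyz', p_x, p_y_given_x, p_z_given_y)

for y in (0, 1):
    p_xz_given_y = joint[:, y, :] / joint[:, y, :].sum()
    p_x_given_y = p_xz_given_y.sum(axis=1)
    p_z_given_y_val = p_xz_given_y.sum(axis=0)
    # P(x, z | y) factors as P(x | y) P(z | y)
    assert np.allclose(p_xz_given_y, np.outer(p_x_given_y, p_z_given_y_val))
\end{verbatim}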

\subsection{Bayes Ball - Fork}
\includegraphics[scale=0.7]{Screenshot_8.png}\\
X and Z are conditionally dependent when y is unknown and conditionally independent when y is known.\\
from the fork's graph we get the equation\\
$$P(x, y, z) = P(y)P(x|y)P(z|y)$$
conditioning on y we get.
\begin{align*}
P(x, z | y) &= \frac{P(x, y, z)}{P(y)} \\
&= \frac{P(y)P(x|y)P(z|y)}{P(y)} \\
&= P(x | y) P(z | y)
\end{align*}
$$\therefore x \bot z | y$$

\subsection{Bayes Ball - Collider}
\includegraphics[scale=0.7]{Screenshot_9.png}\\
X and Z are conditionally independent when y is unknown and conditionally dependent when y is known.\\
From the collider's graph we get the equation\\
$$p(x, y, z) = p(x)p(z)p(y|x, z)$$
Conditioning on y we get
\begin{align*}
P(x, z | y) &= \frac{P(x, y, z)}{P(y)} \\
&= \frac{P(x)P(z)P(y|x, z)}{P(y)}
\end{align*}
which, because of the $P(y|x, z)$ term, does not in general factor into a function of $x$ times a function of $z$.
$$\therefore x \not \perp z | y$$

However, if we do not condition on y, it's easy to see that
$$ P(x, z) = P(x) P(z)$$
$$\therefore x \perp z$$
So we see that conditioning on a common child at the bottom of a collider/v-structure makes its parents become dependent.\\
This important effect is called explaining away, inter-causal reasoning, or Berkson’s paradox.\\
\\
As an example,\\
$X$ is the event of a Toronto Raptors parade, $P(X)=0.01$\\
$Z$ is the event of a car accident, $P(Z)=0.1$\\
$Y$ is the event of a traffic jam downtown\\
Let's say that these are the only 2 sources of traffic. So if we know a traffic jam has occurred, then at least one of the two events has happened. If we then also learn there was a car accident, the accident alone already explains the jam, so our belief in a parade drops back toward its small prior: the accident ``explains away'' the parade.\\
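
Here is a small numerical version of the example (a sketch: the lecture only says these are the only two sources of traffic, so we assume Y is a deterministic OR of X and Z):

\begin{verbatim}
# Explaining away with the parade/accident example.
# Assumption (not stated numerically in lecture): Y = X OR Z deterministically.
p_x = 0.01   # P(parade)
p_z = 0.10   # P(car accident)

def p_y1(x, z):                 # P(traffic jam = 1 | parade, accident)
    return 1.0 if (x or z) else 0.0

# Joint over (x, z, y) from the collider factorization p(x)p(z)p(y|x,z).
joint = {}
for x in (0, 1):
    for z in (0, 1):
        for y in (0, 1):
            px = p_x if x else 1 - p_x
            pz = p_z if z else 1 - p_z
            py = p_y1(x, z) if y else 1 - p_y1(x, z)
            joint[(x, z, y)] = px * pz * py

p_jam = sum(v for (x, z, y), v in joint.items() if y == 1)
p_parade_given_jam = sum(v for (x, z, y), v in joint.items()
                         if x == 1 and y == 1) / p_jam
p_jam_and_accident = sum(v for (x, z, y), v in joint.items()
                         if z == 1 and y == 1)
p_parade_given_jam_and_accident = joint[(1, 1, 1)] / p_jam_and_accident

print(p_parade_given_jam)               # ~0.092: a jam raises belief in a parade
print(p_parade_given_jam_and_accident)  # 0.01: the accident explains the jam away
\end{verbatim}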
\subsection{Boundary Conditions}
\includegraphics[scale=0.4]{Screenshot_11.png}\\
In example 3, if any child of Y is observed then Y is effectively observed, so the information ``bounces back''.\\
This is shown again in examples 1 and 2, where if Y is known then the information goes up the chain.\\
\pagebreak
\subsection{Putting It Together}
Now that we understand conditional independence with Bayes ball on simple graphs, we can apply it to complex graphs.\\
\includegraphics[scale=0.7]{Screenshot_10.png}\\
Say, we want to determine the conditional dependence of 2 and 6 given 5.\\
There are 3 paths from 2 to 6. \\
$2 \to 5 \to 6$ cannot be traversed: $2\perp 6 | 5$ (known chain)\\
$2 \to 4 \to 7 \to 6$ cannot be traversed: $4 \perp 6 | 7$ (unknown collider)\\
$2 \to 1 \to 3 \to 6$ cannot be traversed: $2 \perp 3 | 1$ (unknown fork)\\
so we can say $2 \perp 6 | 5$.\\
This would change if we knew 1 or 6 or didn't know 5.
\end{document}
115 changes: 115 additions & 0 deletions Lecture Notes/Images/Week4.tex
@@ -0,0 +1,115 @@
\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage{hyperref}
\usepackage{amsmath}
\usepackage{graphicx}
\usepackage{amssymb}
\title{CSC412 Notes Week 4}
\author{Jerry Zheng}
\date{April 2021}
\hypersetup{
colorlinks=true,
linkcolor=blue,
filecolor=magenta,
urlcolor=blue,
pdftitle={Sharelatex Example},
bookmarks=true,
pdfpagemode=FullScreen,
}

\begin{document}

\maketitle
\section{Exact Inference}
Let's say we have a distribution $P(x, y)$.
If we want to perform inference on it we would use $P(y|x) = \frac{P(x,y)}{\sum_{y}p(x,y)}$.
However, there may be a set of variables in our model that isn't part of $x$ or $y$.

$$x = \text{The observed evidence}$$
$$y = \text{The unobserved variable we want to infer} $$
$$r = X \setminus \{x, y\} $$

Where $r$ is the set of random variables that are neither part of the query nor the evidence.

$$p(y | x) = \frac{p(y, x)}{p(x)}$$

Each of the distributions we need can be computed by marginalizing over the other variables.

$$p(y, x) = \sum_{r}p(y, x, r)$$

However, naively marginalizing over all unobserved variables requires a number of computations exponential in the number of random variables, $N$, in our model.
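
As a tiny illustration (a sketch with a made-up joint table, not a specific model from lecture), naive exact inference on three binary variables looks like this:

\begin{verbatim}
# Naive exact inference p(y | x) by marginalizing out the nuisance variable r.
import numpy as np

rng = np.random.default_rng(0)
joint = rng.random((2, 2, 2))              # made-up table p(x, y, r)
joint /= joint.sum()

x_obs = 1
p_y_and_x = joint[x_obs].sum(axis=1)       # p(y, x=1) = sum_r p(x=1, y, r)
p_y_given_x = p_y_and_x / p_y_and_x.sum()  # divide by p(x=1)
print(p_y_given_x)
\end{verbatim}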

\section{Variable Elimination}
To compute this efficiently we will use the Variable Elimination Algorithm.\\
\\
It's an exact inference algorithm, meaning it will calculate exactly $p(y|x)$.\\
\\
It's also general, meaning it can be used on many different kinds of graphical models.\\
\\
Its complexity depends on the conditional independence structure of our model.\\
\\
Intuitively, it is dynamic programming applied to these sums.

\subsection{Chain Example}
$$ A \rightarrow B \rightarrow C \rightarrow D $$

To find P(D), we have the variables

$$y = \{D\}$$
$$x = \{\}$$
$$r = \{A, B, C\} $$

\begin{align*}
P(y) &= \sum_{r} p(y, r) \\
\Rightarrow P(D) &= \sum_{A, B, C}p(A, B, C, D) \\
& = \sum_A \sum_B \sum_C p(A)p(B | A) p(C | B) p(D | C) \\
\end{align*}

This is exponential in the number of variables: $\mathbf O(k^n)$, where $k$ is the number of states per variable. But by reordering the sums in the joint distribution

$$ P(D) = \sum_C p(D | C) \sum_B p(C | B) \sum_A p(A)p(B | A) $$

we can begin to simplify

\begin{align*}
P(D) &= \sum_C p(D | C) \sum_B p(C | B) \sum_A p(A)p(B | A) \\
&= \sum_C p(D | C) \sum_B p(C | B) \tau (B) \\
&= \sum_C p(D | C) \tau (C)
\end{align*}
where $\tau(B) = \sum_A p(A)p(B|A)$ and $\tau(C) = \sum_B p(C|B)\tau(B)$ are the intermediate factors.

So, by using dynamic programming to do the computation in reverse, we do inference over the joint distribution represented by the chain without generating it explicitly!\\
We have reduced the running time to $\mathbf{O(nk^2)}$!
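
As a concrete illustration (a sketch with random tables, not from lecture), variable elimination on this chain is just a sequence of matrix--vector products:

\begin{verbatim}
# Variable elimination on the chain A -> B -> C -> D with binary variables.
import numpy as np

k = 2
rng = np.random.default_rng(0)

def cpt(shape):                           # random conditional table
    t = rng.random(shape)
    return t / t.sum(axis=-1, keepdims=True)

p_A = cpt((k,))
p_B_given_A = cpt((k, k))                 # indexed [A, B]
p_C_given_B = cpt((k, k))
p_D_given_C = cpt((k, k))

# Naive: build the full joint, O(k^4) entries, then sum out A, B, C.
full_joint = np.einsum('a,ab,bc,cd->abcd',
                       p_A, p_B_given_A, p_C_given_B, p_D_given_C)
p_D_naive = full_joint.sum(axis=(0, 1, 2))

# Variable elimination: push the sums inside, O(n k^2) work.
tau_B = p_A @ p_B_given_A                 # tau(B) = sum_A p(A) p(B|A)
tau_C = tau_B @ p_C_given_B               # tau(C) = sum_B tau(B) p(C|B)
p_D_ve = tau_C @ p_D_given_C              # P(D)   = sum_C tau(C) p(D|C)

assert np.allclose(p_D_naive, p_D_ve)
print(p_D_ve)
\end{verbatim}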

\subsection{Bigger Example}

\includegraphics[scale=0.7]{Screenshot_12.png}\\
The joint distribution of our CS student graph is given by

$$P (C, D, I, G, S, L, J, H) = P (C)P (D|C)P (I)P (G|I, D)P (S|I)P (L|G)P (J|L, S)P (H|G, J)$$

with factors

$$\{\psi_C(C),\ \psi_D(C, D),\ \psi_I(I),\ \psi_G(G, I, D),\ \psi_S(S, I),\ \psi_L(L, G),\ \psi_J(J, L, S),\ \psi_H(H, G, J)\} $$
(The textbook uses an undirected graph; this is not too important, as variable elimination works the same either way.)\\
(An explanation of $\psi$ will be given in week 5's notes.)\\
\\
To compute p(J = 1), we could calculate all possible assignments\\
$$p(J) = \sum_{L} \sum_{S} \sum_{ G} \sum_{ H} \sum_{ I} \sum_{D} \sum_{C}p(C, D, I, G, S, L, J, H)$$
But we can do better with variable elimination, where we push sums inside products.\\

\begin{align*}
p(J) &= \sum_{L,S,G,H,I,D,C} p(C, D, I, G, S, L, J, H)\\
&= \sum_{L,S,G,H,I,D,C}\psi_C(C)\psi_D(D, C)\psi_I(I)\psi_G(G, I, D)\psi_S(S, I)\psi_L(L, G) \times \psi_J(J, L, S)\psi_H(H, G, J)\\
&= \sum_{L,S}\psi_J(J, L, S)
\sum_{G}\psi_L(L, G)
\sum_{H}\psi_H(H, G, J)
\sum_{I}\psi_S(S, I)\psi_I(I)
\times \sum_{D}\psi_G(G, I, D)
\sum_{C}\psi_C(C)\psi_D(D, C)
\end{align*}
From here we marginalize out each variable individually, producing a new factor at each step.\\
We do it in the order C, D, I, H, G, S, L to get $P(J)$; the intermediate factors are written out below and shown in the figure.\\
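
One possible bookkeeping of those intermediate factors (a reconstruction; compare with the figure):
\begin{align*}
\tau_1(D) &= \sum_{C}\psi_C(C)\psi_D(D, C) \\
\tau_2(G, I) &= \sum_{D}\psi_G(G, I, D)\,\tau_1(D) \\
\tau_3(G, S) &= \sum_{I}\psi_I(I)\psi_S(S, I)\,\tau_2(G, I) \\
\tau_4(G, J) &= \sum_{H}\psi_H(H, G, J) \\
\tau_5(J, L, S) &= \sum_{G}\psi_L(L, G)\,\tau_3(G, S)\,\tau_4(G, J) \\
\tau_6(J, L) &= \sum_{S}\psi_J(J, L, S)\,\tau_5(J, L, S) \\
p(J) &= \sum_{L}\tau_6(J, L)
\end{align*}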

\includegraphics[scale=0.6]{Screenshot_13.png}

\end{document}
119 changes: 119 additions & 0 deletions Lecture Notes/Images/Week5.tex
@@ -0,0 +1,119 @@
\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage{hyperref}
\usepackage{amsmath}
\usepackage{graphicx}
\usepackage{amssymb}
\title{CSC412 Notes Week 5}
\author{Jerry Zheng}
\date{April 2021}
\hypersetup{
    colorlinks=true,
    linkcolor=blue,
    filecolor=magenta,
    urlcolor=blue,
    pdftitle={Sharelatex Example},
    bookmarks=true,
    pdfpagemode=FullScreen,
}

\begin{document}

\maketitle
\section{Problems with Directed Graphical Models}
For some problems, the directionality of the edges in our DAGMs really hinders us.
For example, when we process an image we know that a pixel depends on its neighbours.
Let's say pixel 2 depends on pixel 1 and pixel 3 depends on pixel 2.
We can extrapolate this into a Markov mesh.
\includegraphics[scale=0.5]{Screenshot_15.png}
Of course, this model isn't very good because dependencies are directional and only go down and to the right.
Also, if we observe some pixels, then pixels nearby can be arbitrarily conditionally independent!

Alternatively, we can use a Naive Bayes model by introducing a hidden class variable $z$:
$$p(X) = \sum_z p(X, z)$$
$$p(X) = \sum_z p(z) \prod_{x_i \in X} p(x_i | z)$$
\includegraphics[scale=0.4]{Screenshot_16.png}

However, there are issues with this too.
The top-left and bottom-right pixels are dependent on each other, which might not be desired.
Also, if we know what the class is, then all the pixels are conditionally independent.

An alternative to DAGMs is undirected graphical models (UGMs).

\section{Undirected Graphical Models}
In UGMs, we have edges that capture relations between variables rather than defining them as parent and child.

\includegraphics[scale=0.4]{Screenshot_17.png}

\subsection{D-Separation in Undirected Graphical Models}
The following three properties are used to determine if nodes are conditionally independent.
\includegraphics[scale=0.2]{skggm_markov.png}

Global Markov Property:
$X_A \bot X_B | X_C$ iff $X_C$ separates $X_A$ from $X_B$.

Local Markov Property: the set of nodes that renders a node conditionally independent of all the other nodes in the graph.

$$X_j \bot X_{V \setminus \{j,\, \mathrm{neighbour}(j)\}} \,|\, X_{\mathrm{neighbour}(j)}$$

Pairwise Markov Property: the set of nodes that renders two nodes conditionally independent of each other.
$$X_j \bot X_i \,|\, X_{V \setminus \{j, i\}}$$

It's obvious that global Markov implies local Markov, which implies pairwise Markov.

\subsection{Limitations of UGMs and DAGs}
Note that though we can represent new relations between variables, we can't represent others.
\includegraphics[scale=0.4]{Screenshot_18.png}
A DAG can't represent graph 1, where X and Z are conditionally dependent, while a UGM can.
But a UGM cannot represent graph 2, where X and Z are conditionally independent without knowing Y.

\subsection{Moralization}
This was only mentioned in passing during lecture, but a DAG can be converted to a UGM using \href{https://en.wikipedia.org/wiki/Moral_graph}{moralization}.
This is done by adding edges between all pairs of non-adjacent nodes that have a common child, then making all edges in the graph undirected.
\includegraphics[scale=0.2]{moralGraph-DAG.png} becomes
\includegraphics[scale=0.2]{Moralized.png}
\section{Cliques}
A clique in an undirected graph is a subset of its vertices such that every two vertices in the subset are connected by an edge.

A maximal clique is a clique that cannot be extended by including one more adjacent vertex.
A maximum clique is a clique of the largest possible size in a graph.

\includegraphics[scale=0.8]{Screenshot_19.png}
The image shows 2 maximal cliques, with the red clique also being the maximum clique.
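
As a small illustration (a sketch; the graph here is read off the clique potentials of the Hammersley-Clifford example below, not necessarily the graph pictured), maximal cliques can be enumerated with networkx:

\begin{verbatim}
# Enumerate maximal cliques of the 5-node graph whose maximal cliques are
# {1,2,3}, {2,3,4}, {3,5} (edges assumed from the example below).
import networkx as nx

G = nx.Graph()
G.add_edges_from([(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (3, 5)])

maximal = list(nx.find_cliques(G))   # maximal cliques, e.g. [[1,2,3],[2,3,4],[3,5]]
maximum = max(maximal, key=len)      # a maximum clique (largest), here size 3

print(maximal)
print(maximum)
\end{verbatim}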

\section{Hammersley-Clifford Theorem}
Since there is no topological ordering for an undirected graph, we can't use the chain rule to represent $p(y)$. So instead, we associate factors with each maximal clique.
We will denote the potential function for clique $c$ by $\psi_c(y_c | \theta_c)$ (recall $\psi$ from last week's notes); this can be any non-negative function.
The joint distribution is then defined to be proportional to the product of clique potentials.

Hammersley-Clifford Theorem: a positive distribution $p(y) > 0$ satisfies the CI properties of an undirected graph $G$ iff $p$ can be represented as a product of factors, one per maximal clique, i.e.,

$$P(y | \theta) = \frac{1}{Z(\theta)} \prod_c \psi_c (y_c | \theta_c)$$

where $Z(\theta)$ is the normalizing constant (partition function), obtained by summing the product of potentials over all possible values of $y$:
$$Z(\theta) = \sum_y \prod_c \psi_c (y_c | \theta_c)$$

Going back to our example graph:
\includegraphics[scale=0.8]{Screenshot_20.png}
$$
p(y | \theta) = \frac{1}{Z(\theta)} \psi_{123}(y_1, y_2, y_3) \psi_{234}(y_2, y_3, y_4) \psi_{35}(y_3, y_5)
$$
$$
Z(\theta) = \sum_y \psi_{123}(y_1, y_2, y_3)\psi_{234}(y_2, y_3, y_4)\psi_{35}(y_3, y_5)
$$
This is useful because we can write the distribution in terms of cliques instead of edges, which reduces the number of factors in variable elimination.
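
For small graphs we can compute $Z(\theta)$ and $p(y | \theta)$ by brute force. A minimal sketch for the five-node example, assuming binary $y_i$ and made-up potential tables:

\begin{verbatim}
# Brute-force the partition function Z and the joint p(y) for the 5-node example,
# with assumed binary variables and random (made-up) clique potentials.
import itertools
import numpy as np

rng = np.random.default_rng(0)
psi_123 = rng.random((2, 2, 2))          # psi_123(y1, y2, y3)
psi_234 = rng.random((2, 2, 2))          # psi_234(y2, y3, y4)
psi_35  = rng.random((2, 2))             # psi_35(y3, y5)

def unnormalized(y):
    y1, y2, y3, y4, y5 = y
    return psi_123[y1, y2, y3] * psi_234[y2, y3, y4] * psi_35[y3, y5]

Z = sum(unnormalized(y) for y in itertools.product((0, 1), repeat=5))

def p(y):
    return unnormalized(y) / Z

# Sanity check: the normalized probabilities sum to one.
assert np.isclose(sum(p(y) for y in itertools.product((0, 1), repeat=5)), 1.0)
print(Z, p((0, 1, 1, 0, 1)))
\end{verbatim}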

\section{Energy Based Models}
UGMs are very useful in physics. Take for example the Gibbs distribution, used for modelling Gibbs free energy:
$$p(y | \theta) = \frac{1}{Z(\theta)} \exp\left(-\sum_c E(y_c | \theta_c)\right)$$
where $E(y_c) > 0$ is the energy associated with the variables in clique $c$. We can convert this to a
UGM by defining
$$\psi_c(y_c | \theta_c) = \exp(-E(y_c | \theta_c))$$
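
A quick sketch (toy energies, not from lecture) showing that the Gibbs form and the product-of-potentials form define the same distribution:

\begin{verbatim}
# psi_c = exp(-E_c): the Gibbs distribution equals the product of potentials.
import numpy as np

E_12 = np.array([[0.0, 1.0], [1.0, 0.5]])   # toy energy for clique {y1, y2}
E_23 = np.array([[0.3, 2.0], [0.2, 0.0]])   # toy energy for clique {y2, y3}

psi_12 = np.exp(-E_12)
psi_23 = np.exp(-E_23)

unnorm = np.einsum('ab,bc->abc', psi_12, psi_23)               # product of potentials
p_potentials = unnorm / unnorm.sum()

unnorm_gibbs = np.exp(-(E_12[:, :, None] + E_23[None, :, :]))  # exp(-sum of energies)
p_gibbs = unnorm_gibbs / unnorm_gibbs.sum()

assert np.allclose(p_potentials, p_gibbs)
\end{verbatim}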

So we can now model the energy state of, say, a protein molecule as a UGM.
\includegraphics[scale=0.6]{Screenshot_21.png}
(not an actual molecule, don't @ me)

But going back to our initial example, it's easy to see that with a UGM we can represent, say, the pixels in an image and have neighbouring pixels be related to each other.
\end{document}
Binary file added Lecture Notes/Images/moralGraph-DAG.png
Binary file added Lecture Notes/Images/skggm_markov.png
Binary file added Lecture Notes/Week_3.pdf
Binary file not shown.
Binary file added Lecture Notes/Week_4.pdf
Binary file not shown.
Binary file added Lecture Notes/Week_5.pdf
Binary file not shown.