diff --git a/Lecture Notes/Images/Moralized.png b/Lecture Notes/Images/Moralized.png new file mode 100644 index 0000000..2ec0f13 Binary files /dev/null and b/Lecture Notes/Images/Moralized.png differ diff --git a/Lecture Notes/Images/Screenshot_1.png b/Lecture Notes/Images/Screenshot_1.png new file mode 100644 index 0000000..0a1bea2 Binary files /dev/null and b/Lecture Notes/Images/Screenshot_1.png differ diff --git a/Lecture Notes/Images/Screenshot_10.png b/Lecture Notes/Images/Screenshot_10.png new file mode 100644 index 0000000..d07826d Binary files /dev/null and b/Lecture Notes/Images/Screenshot_10.png differ diff --git a/Lecture Notes/Images/Screenshot_11.png b/Lecture Notes/Images/Screenshot_11.png new file mode 100644 index 0000000..3617d2c Binary files /dev/null and b/Lecture Notes/Images/Screenshot_11.png differ diff --git a/Lecture Notes/Images/Screenshot_12.png b/Lecture Notes/Images/Screenshot_12.png new file mode 100644 index 0000000..c1e0889 Binary files /dev/null and b/Lecture Notes/Images/Screenshot_12.png differ diff --git a/Lecture Notes/Images/Screenshot_13.png b/Lecture Notes/Images/Screenshot_13.png new file mode 100644 index 0000000..15d4d8a Binary files /dev/null and b/Lecture Notes/Images/Screenshot_13.png differ diff --git a/Lecture Notes/Images/Screenshot_14.png b/Lecture Notes/Images/Screenshot_14.png new file mode 100644 index 0000000..b7d5f03 Binary files /dev/null and b/Lecture Notes/Images/Screenshot_14.png differ diff --git a/Lecture Notes/Images/Screenshot_15.png b/Lecture Notes/Images/Screenshot_15.png new file mode 100644 index 0000000..a8f9338 Binary files /dev/null and b/Lecture Notes/Images/Screenshot_15.png differ diff --git a/Lecture Notes/Images/Screenshot_16.png b/Lecture Notes/Images/Screenshot_16.png new file mode 100644 index 0000000..035b44d Binary files /dev/null and b/Lecture Notes/Images/Screenshot_16.png differ diff --git a/Lecture Notes/Images/Screenshot_17.png b/Lecture Notes/Images/Screenshot_17.png new file mode 100644 index 0000000..1ebac31 Binary files /dev/null and b/Lecture Notes/Images/Screenshot_17.png differ diff --git a/Lecture Notes/Images/Screenshot_18.png b/Lecture Notes/Images/Screenshot_18.png new file mode 100644 index 0000000..d9a3c37 Binary files /dev/null and b/Lecture Notes/Images/Screenshot_18.png differ diff --git a/Lecture Notes/Images/Screenshot_19.png b/Lecture Notes/Images/Screenshot_19.png new file mode 100644 index 0000000..cde9e8b Binary files /dev/null and b/Lecture Notes/Images/Screenshot_19.png differ diff --git a/Lecture Notes/Images/Screenshot_2.png b/Lecture Notes/Images/Screenshot_2.png new file mode 100644 index 0000000..dc303d3 Binary files /dev/null and b/Lecture Notes/Images/Screenshot_2.png differ diff --git a/Lecture Notes/Images/Screenshot_20.png b/Lecture Notes/Images/Screenshot_20.png new file mode 100644 index 0000000..b4186c6 Binary files /dev/null and b/Lecture Notes/Images/Screenshot_20.png differ diff --git a/Lecture Notes/Images/Screenshot_21.png b/Lecture Notes/Images/Screenshot_21.png new file mode 100644 index 0000000..6a81eb1 Binary files /dev/null and b/Lecture Notes/Images/Screenshot_21.png differ diff --git a/Lecture Notes/Images/Screenshot_3.png b/Lecture Notes/Images/Screenshot_3.png new file mode 100644 index 0000000..80b26dd Binary files /dev/null and b/Lecture Notes/Images/Screenshot_3.png differ diff --git a/Lecture Notes/Images/Screenshot_4.png b/Lecture Notes/Images/Screenshot_4.png new file mode 100644 index 0000000..2edafb4 Binary files /dev/null and b/Lecture 
Notes/Images/Screenshot_4.png differ diff --git a/Lecture Notes/Images/Screenshot_5.png b/Lecture Notes/Images/Screenshot_5.png new file mode 100644 index 0000000..c66d1e7 Binary files /dev/null and b/Lecture Notes/Images/Screenshot_5.png differ diff --git a/Lecture Notes/Images/Screenshot_6.png b/Lecture Notes/Images/Screenshot_6.png new file mode 100644 index 0000000..e487475 Binary files /dev/null and b/Lecture Notes/Images/Screenshot_6.png differ diff --git a/Lecture Notes/Images/Screenshot_7.png b/Lecture Notes/Images/Screenshot_7.png new file mode 100644 index 0000000..d185e9c Binary files /dev/null and b/Lecture Notes/Images/Screenshot_7.png differ diff --git a/Lecture Notes/Images/Screenshot_8.png b/Lecture Notes/Images/Screenshot_8.png new file mode 100644 index 0000000..5332a45 Binary files /dev/null and b/Lecture Notes/Images/Screenshot_8.png differ diff --git a/Lecture Notes/Images/Screenshot_9.png b/Lecture Notes/Images/Screenshot_9.png new file mode 100644 index 0000000..f3a51d8 Binary files /dev/null and b/Lecture Notes/Images/Screenshot_9.png differ diff --git a/Lecture Notes/Images/Week3.tex b/Lecture Notes/Images/Week3.tex new file mode 100644 index 0000000..1c09997 --- /dev/null +++ b/Lecture Notes/Images/Week3.tex @@ -0,0 +1,144 @@
+\documentclass{article}
+\usepackage[utf8]{inputenc}
+\usepackage{hyperref}
+\usepackage{graphicx}
+\usepackage{amssymb}
+\title{CSC412 Notes Week 3}
+\author{Jerry Zheng}
+\date{April 2021}
+\hypersetup{
+ colorlinks=true,
+ linkcolor=blue,
+ filecolor=magenta,
+ urlcolor=blue,
+ pdftitle={Sharelatex Example},
+ bookmarks=true,
+ pdfpagemode=FullScreen,
+}
+
+\begin{document}
+
+\maketitle
+
+\section{Graphical Models}
+\subsection{Chain Rule}
+The joint distribution of $N$ random variables can always be factored with the chain rule:\\
+
+$$P(x_{1:N}) = P(x_1)P(x_2|x_1)P(x_3 | x_2, x_1) \ldots P(x_N | x_{N-1}, \ldots, x_1)$$\\
+
+This holds for any joint distribution of discrete random variables, even with full dependence between the variables.\\
+
+More formally, in probability the chain rule for two random variables is\\
+$$P(x, y) = P(x | y)P(y)$$\\
+
+\subsection{Conditional Independence}
+
+To represent large joint distributions we can assume conditional independence
+$$X \perp Y | Z \Leftrightarrow P(X, Y | Z) = P(X | Z)P(Y | Z) \Leftrightarrow P(X | Y, Z) = P(X | Z)$$
+This is very useful, as now we can represent a large chain of $N$ variables as a product of simple conditional distributions.
+$$P(x_{1:N}) = P(x_1) \prod_{t=2}^N P(x_t|x_{t-1})$$
+
+This is the (first-order) Markov assumption, where ``the future is independent of the past given the present''.
+
+\subsection{Probabilistic Graphical Models}
+If you don't know what a graph is, see \href{https://en.wikipedia.org/wiki/Graph_(discrete_mathematics)}{Wikipedia}.\\
+
+A probabilistic graphical model can be used to represent a joint distribution when we assume conditional independence. In this model, nodes are variables and edges show conditional dependence.\\
+
+\subsection{Directed Acyclic Graphical Model}
+Using the Markov assumption we can greatly simplify complicated graphs.\\
+A fully connected 4-node graph can be represented with half the number of edges!\\
+\includegraphics[scale=0.7]{Screenshot_2.png}\\
+
+can now be represented as\\
+\includegraphics[scale=0.7]{Screenshot_3.png}\\
+
+It's easy to see that $X_a$ is much easier to evaluate in the second graph than the first.
+
+The full chain-rule factorization
+$$P(x_a, x_b, x_c, x_d) = P(x_a| x_b, x_c, x_d) P(x_b| x_c, x_d) P(x_c | x_d) P(x_d)$$
+can be simplified with the Markov assumption to
+$$P(x_a, x_b, x_c, x_d) = P(x_a| x_b) P(x_b| x_c) P(x_c | x_d) P(x_d)$$
+
+\section{Conditional Independence and Directed-Separation}
+Directed-separation (d-separation) is a criterion for deciding whether two variables in a DAGM are conditionally independent given a third variable (or set of variables).\\
+D-connection implies conditional dependence.\\
+D-separation implies conditional independence.\\
+\\
+This also extends to groups/sets of variables $X$, $Y$, $Z$:
+
+$$X = \{X_1, ..., X_n\}$$
+$$Y = \{Y_1, ..., Y_n\}$$
+$$Z = \{Z_1, ..., Z_n\}$$
+$$X \perp Z | Y$$\\
+if every variable in $X$ is d-separated from every variable in $Z$ conditioned on all the variables in $Y$.\\
+To determine d-separation we will use the Bayes ball algorithm.\\
+
+For Bayes ball there are 3 structures that you must know.
+
+\subsection{Bayes Ball - Chain}
+\includegraphics[scale=0.7]{Screenshot_7.png}\\
+
+$X$ and $Z$ are (in general) conditionally dependent when $y$ is unknown and conditionally independent when $y$ is known.\\
+From the chain's graph, we can encode the structure as
+$$P(x, y, z) = P(x)P(y|x)P(z|y)$$
+Once we condition on $y$ we get
+$$
+P(x, z | y) = \frac{P(x)P(y|x)P(z|y)}{P(y)} \\
+= \frac{P(x, y)P(z|y)}{P(y)} \\
+= P(x | y) P(z | y) \\
+$$
+$$\therefore x \perp z | y$$
+
+\subsection{Bayes Ball - Fork}
+\includegraphics[scale=0.7]{Screenshot_8.png}\\
+$X$ and $Z$ are (in general) conditionally dependent when $y$ is unknown and conditionally independent when $y$ is known.\\
+From the fork's graph we get the equation\\
+$$P(x, y, z) = P(y)P(x|y)P(z|y)$$
+Conditioning on $y$ we get
+$$
+P(x, z | y) = \frac{P(x, y, z)}{P(y)} \\
+= \frac{P(y)P(x|y)P(z|y)}{P(y)} \\
+= P(x | y) P(z | y) \\
+$$
+$$\therefore x \perp z | y$$
+
+\subsection{Bayes Ball - Collider}
+\includegraphics[scale=0.7]{Screenshot_9.png}\\
+$X$ and $Z$ are conditionally independent when $y$ is unknown and conditionally dependent when $y$ is known.\\
+From the collider's graph we get the equation\\
+$$P(x, y, z) = P(x)P(z)P(y|x, z)$$
+Conditioning on $y$ we get
+$$
+P(x, z | y) = \frac{P(x, y, z)}{P(y)} \\
+= \frac{P(x)P(z)P(y|x, z)}{P(y)} \\
+$$
+which does not factor into $P(x|y)P(z|y)$ in general.
+$$\therefore x \not \perp z | y$$
+
+However, if we do not condition on $y$ and instead marginalize it out, it's easy to see that
+$$ P(x, z) = \sum_y P(x)P(z)P(y | x, z) = P(x) P(z)$$
+$$\therefore x \perp z$$
+So we see that conditioning on a common child at the bottom of a collider/v-structure makes its parents become dependent.\\
+This important effect is called explaining away, inter-causal reasoning, or Berkson's paradox.\\
+\\
+As an example,\\
+$X$ is the event of a Toronto Raptors parade, $P(X)=0.01$\\
+$Z$ is the event of a car accident, $P(Z)=0.1$\\
+$Y$ is the event of a traffic jam downtown\\
+Let's say that these are the only 2 sources of traffic jams. If we know a traffic jam has occurred, then at least one of the two events has happened; if we then also learn there was no parade, the accident becomes more likely, since one cause ``explains away'' the other.\\
+\subsection{Boundary Conditions}
+\includegraphics[scale=0.4]{Screenshot_11.png}\\
+In example 3, if any child of $Y$ is observed then $Y$ is effectively observed, so the information `bounces back'.\\
+This is shown again in examples 1 and 2, where if $Y$ is known the information passes up the chain.\\
+\pagebreak
+\subsection{Putting It Together}
+Now that we understand conditional independence with Bayes ball on simple graphs, we can apply it to complex graphs.\\
+\includegraphics[scale=0.7]{Screenshot_10.png}\\
+Say we want to determine whether 2 and 6 are conditionally independent given 5.\\
+There are 3 paths from 2 to 6:\\
+2 $\rightarrow$ 5 $\rightarrow$ 6 cannot be traversed, $2\perp 6 | 5$ (known chain)\\
+2 $\rightarrow$ 4 $\rightarrow$ 7 $\rightarrow$ 6 cannot be traversed, $4 \perp 6 | 7$ (unknown collider)\\
+2 $\rightarrow$ 1 $\rightarrow$ 3 $\rightarrow$ 6 cannot be traversed, $2 \perp 3 | 1$ (unknown fork)\\
+so we can say $2 \perp 6 | 5$.\\
+This would change if we knew 1 or 6 or didn't know 5. A small programmatic check of these d-separation rules is sketched below.
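+
+As a quick sanity check (my own addition, not from lecture), the chain/fork/collider conclusions above can be verified programmatically. The sketch below assumes the \texttt{networkx} library and implements the moralization criterion for d-separation, which gives the same answers as Bayes ball; the function name \texttt{d\_separated} is just an illustrative choice.
+\begin{verbatim}
+# d-separation via the moralization criterion: X is d-separated from Z
+# given Y iff removing Y disconnects X from Z in the moralized
+# ancestral graph of X, Y, Z.
+from itertools import combinations
+import networkx as nx
+
+def d_separated(G, X, Z, Y):
+    """G: nx.DiGraph (a DAGM); X, Z, Y: sets of node names."""
+    nodes = set(X) | set(Z) | set(Y)
+    ancestral = set(nodes)
+    for n in nodes:                      # keep queried nodes + ancestors
+        ancestral |= nx.ancestors(G, n)
+    H = G.subgraph(ancestral)
+    M = H.to_undirected()                # moralize: marry co-parents,
+    for child in H.nodes:                # then drop edge directions
+        for a, b in combinations(H.predecessors(child), 2):
+            M.add_edge(a, b)
+    M.remove_nodes_from(Y)               # condition on Y
+    return not any(nx.has_path(M, x, z)
+                   for x in X for z in Z if x in M and z in M)
+
+chain    = nx.DiGraph([("x", "y"), ("y", "z")])
+fork     = nx.DiGraph([("y", "x"), ("y", "z")])
+collider = nx.DiGraph([("x", "y"), ("z", "y")])
+print(d_separated(chain,    {"x"}, {"z"}, {"y"}))   # True:  x indep z given y
+print(d_separated(fork,     {"x"}, {"z"}, {"y"}))   # True:  x indep z given y
+print(d_separated(collider, {"x"}, {"z"}, {"y"}))   # False: conditioning opens it
+print(d_separated(collider, {"x"}, {"z"}, set()))   # True:  x indep z marginally
+\end{verbatim}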
+\end{document}
diff --git a/Lecture Notes/Images/Week4.tex b/Lecture Notes/Images/Week4.tex new file mode 100644 index 0000000..0882aac --- /dev/null +++ b/Lecture Notes/Images/Week4.tex @@ -0,0 +1,115 @@
+\documentclass{article}
+\usepackage[utf8]{inputenc}
+\usepackage{hyperref}
+\usepackage{amsmath}
+\usepackage{graphicx}
+\usepackage{amssymb}
+\title{CSC412 Notes Week 4}
+\author{Jerry Zheng}
+\date{April 2021}
+\hypersetup{
+ colorlinks=true,
+ linkcolor=blue,
+ filecolor=magenta,
+ urlcolor=blue,
+ pdftitle={Sharelatex Example},
+ bookmarks=true,
+ pdfpagemode=FullScreen,
+}
+
+\begin{document}
+
+\maketitle
+\section{Exact Inference}
+Let's say we have a distribution $P(x, y)$.
+If we want to perform inference on it we would use $P(y|x) = \frac{P(x,y)}{\sum_{y}P(x,y)}$.
+However, there may be a set of variables in our model that is part of neither $x$ nor $y$:
+
+$$x = \text{the observed evidence}$$
+$$y = \text{the unobserved variable we want to infer} $$
+$$r = X \setminus \{x, y\} $$
+
+where $X$ is the set of all variables in the model and $r$ is the set of random variables that are part of neither the query nor the evidence.
+
+$$p(y | x) = \frac{p(y, x)}{p(x)}$$
+
+Each of the distributions we need here can be computed by marginalizing over the other variables, e.g.
+
+$$p(y, x) = \sum_{r}p(y, x, r)$$
+
+However, naively marginalizing over all unobserved variables requires a number of computations exponential in the number of random variables, $N$, in our model.
+
+\section{Variable Elimination}
+To compute this efficiently we will use the variable elimination algorithm.\\
+\\
+It's an exact inference algorithm, meaning it calculates $p(y|x)$ exactly.\\
+\\
+It's also general, meaning it can be used on many different kinds of graphical models.\\
+\\
+Its complexity depends on the conditional independence structure of our model.\\
+\\
+It's intuitively done with dynamic programming.
+
+\subsection{Chain Example}
+$$ A \rightarrow B \rightarrow C \rightarrow D $$
+
+To find $P(D)$, we have the variables
+
+$$y = \{D\}$$
+$$x = \{\}$$
+$$r = \{A, B, C\} $$
+
+\begin{align*}
+P(y) &= \sum_{r} p(y, r) \\
+\Rightarrow P(D) &= \sum_{A, B, C}p(A, B, C, D) \\
+& = \sum_A \sum_B \sum_C p(A)p(B | A) p(C | B) p(D | C) \\
+\end{align*}
+
+This is exponential in the number of variables, $\mathbf{O}(k^n)$ ($k$ is the number of states per variable). But, reordering the joint distribution as
+
+$$ P(D) = \sum_C p(D | C) \sum_B p(C | B) \sum_A p(A)p(B | A) $$
+
+we can begin to simplify, introducing a new intermediate factor $\tau$ at each step (e.g.\ $\tau(B) = \sum_A p(A)p(B|A)$):
+
+\begin{align*}
+P(D) &= \sum_C p(D | C) \sum_B p(C | B) \sum_A p(A)p(B | A) \\
+&= \sum_C p(D | C) \sum_B p(C | B) \tau (B) \\
+&= \sum_C p(D | C) \tau (C) \\
+\end{align*}
+
+So, by using dynamic programming to do the computation in reverse, we do inference over the joint distribution represented by the chain without generating it explicitly!\\
+We have reduced the running time to $\mathbf{O}(nk^2)$!
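+
+To make this concrete, here is a small numerical sketch (my own, not from lecture) of the chain computation in NumPy, assuming random conditional probability tables with $k$ states per variable; it checks that the reordered sum matches the naive one.
+\begin{verbatim}
+# Variable elimination on the chain A -> B -> C -> D with k states each.
+import numpy as np
+
+rng = np.random.default_rng(0)
+k = 4
+
+def random_cpt(*shape):
+    """Random conditional table, normalized over its last axis."""
+    t = rng.random(shape)
+    return t / t.sum(axis=-1, keepdims=True)
+
+pA  = random_cpt(k)        # p(A)
+pBA = random_cpt(k, k)     # p(B | A), rows indexed by A
+pCB = random_cpt(k, k)     # p(C | B)
+pDC = random_cpt(k, k)     # p(D | C)
+
+# Naive: build the full O(k^4) joint, then sum out A, B, C.
+joint = pA[:, None, None, None] * pBA[:, :, None, None] \
+        * pCB[None, :, :, None] * pDC[None, None, :, :]
+pD_naive = joint.sum(axis=(0, 1, 2))
+
+# Variable elimination: push the sums inside, O(n k^2) work.
+tau_B = pA @ pBA        # tau(B) = sum_A p(A) p(B|A)
+tau_C = tau_B @ pCB     # tau(C) = sum_B tau(B) p(C|B)
+pD_ve = tau_C @ pDC     # p(D)   = sum_C tau(C) p(D|C)
+
+print(np.allclose(pD_naive, pD_ve))  # True
+\end{verbatim}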
+
+\subsection{Bigger Example}
+
+\includegraphics[scale=0.7]{Screenshot_12.png}\\
+The joint distribution of our CS student graph is given by
+
+$$P (C, D, I, G, S, L, J, H) = P (C)P (D|C)P (I)P (G|I, D)P (S|I)P (L|G)P (J|L, S)P (H|G, J)$$
+
+with factors
+
+$$\{\psi_C(C), \psi_D(D, C), \psi_I(I), \psi_G(G, I, D), \psi_S(S, I), \psi_L(L, G), \psi_J(J, L, S), \psi_H(H, G, J)\} $$
+ (The textbook uses an undirected graph; this is not too important, as variable elimination works the same either way.)\\
+ (An explanation of $\psi$ is given in Week 5's notes.)\\
+\\
+To compute $p(J = 1)$, we could sum over all possible assignments\\
+$$p(J) = \sum_{L} \sum_{S} \sum_{ G} \sum_{ H} \sum_{ I} \sum_{D} \sum_{C}p(C, D, I, G, S, L, J, H)$$
+But we can do better with variable elimination, where we push sums inside products.\\
+
+\begin{align*}
+p(J) &= \sum_{L,S,G,H,I,D,C} p(C, D, I, G, S, L, J, H)\\
+&= \sum_{L,S,G,H,I,D,C}\psi_C(C)\psi_D(D, C)\psi_I(I)\psi_G(G, I, D)\psi_S(S, I)\psi_L(L, G) \times \psi_J(J, L, S)\psi_H(H, G, J)\\
+&= \sum_{L,S}\psi_J(J, L, S)
+\sum_{G}\psi_L(L, G)
+\sum_{H}\psi_H(H, G, J)
+\sum_{I}\psi_S(S, I)\psi_I(I)
+\times \sum_{D}\psi_G(G, I, D)
+\sum_{C}\psi_C(C)\psi_D(D, C)
+\end{align*}
+From here we marginalize out each variable individually, producing a new factor at each step.\\
+We do it in the order C, D, I, H, G, S, L to get $P(J)$.\\
+
+\includegraphics[scale=0.6]{Screenshot_13.png}
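+
+As a numerical sanity check (my own sketch, not from lecture), we can fill the $\psi$ factors with random non-negative values and verify that eliminating variables in the order C, D, I, H, G, S, L gives the same unnormalized $p(J)$ as the brute-force sum; \texttt{numpy.einsum} performs each marginalization in one call.
+\begin{verbatim}
+# "Push sums inside products" on the student network, with random factors.
+import numpy as np
+
+rng = np.random.default_rng(1)
+k = 2  # states per variable
+
+psi_C = rng.random(k)            # psi_C(C)
+psi_D = rng.random((k, k))       # psi_D(D, C)
+psi_I = rng.random(k)            # psi_I(I)
+psi_G = rng.random((k, k, k))    # psi_G(G, I, D)
+psi_S = rng.random((k, k))       # psi_S(S, I)
+psi_L = rng.random((k, k))       # psi_L(L, G)
+psi_J = rng.random((k, k, k))    # psi_J(J, L, S)
+psi_H = rng.random((k, k, k))    # psi_H(H, G, J)
+
+# Brute force: contract the full 8-dimensional product, keeping only J.
+pJ_naive = np.einsum("c,dc,i,gid,si,lg,jls,hgj->j",
+                     psi_C, psi_D, psi_I, psi_G,
+                     psi_S, psi_L, psi_J, psi_H)
+
+# Variable elimination in the order C, D, I, H, G, S, L.
+tau1 = np.einsum("c,dc->d", psi_C, psi_D)            # sum out C -> tau(D)
+tau2 = np.einsum("gid,d->gi", psi_G, tau1)           # sum out D -> tau(G, I)
+tau3 = np.einsum("i,si,gi->gs", psi_I, psi_S, tau2)  # sum out I -> tau(G, S)
+tau4 = np.einsum("hgj->gj", psi_H)                   # sum out H -> tau(G, J)
+tau5 = np.einsum("lg,gs,gj->ljs", psi_L, tau3, tau4) # sum out G -> tau(L,J,S)
+tau6 = np.einsum("jls,ljs->jl", psi_J, tau5)         # sum out S -> tau(J, L)
+pJ_ve = tau6.sum(axis=1)                             # sum out L -> p(J)
+
+print(np.allclose(pJ_naive, pJ_ve))  # True
+\end{verbatim}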
+
+\end{document}
diff --git a/Lecture Notes/Images/Week5.tex b/Lecture Notes/Images/Week5.tex new file mode 100644 index 0000000..86a0e9f --- /dev/null +++ b/Lecture Notes/Images/Week5.tex @@ -0,0 +1,119 @@
+\documentclass{article}
+\usepackage[utf8]{inputenc}
+\usepackage{hyperref}
+\usepackage{amsmath}
+\usepackage{graphicx}
+\usepackage{amssymb}
+\title{CSC412 Notes Week 5}
+\author{Jerry Zheng}
+\date{April 2021}
+\hypersetup{
+ colorlinks=true,
+ linkcolor=blue,
+ filecolor=magenta,
+ urlcolor=blue,
+ pdftitle={Sharelatex Example},
+ bookmarks=true,
+ pdfpagemode=FullScreen,
+}
+
+\begin{document}
+
+\maketitle
+\section{Problems with Directed Graphical Models}
+For some problems, the directionality of the edges in our DAGMs really hinders us.
+For example, when we process an image we know that a pixel depends on its neighbours.
+Let's say pixel 2 depends on pixel 1 and pixel 3 depends on pixel 2.
+We can extrapolate this into a Markov mesh.
+\includegraphics[scale=0.5]{Screenshot_15.png}
+Of course, this model isn't very good, because dependencies are directional and only go down and to the right.
+Also, if we observe some pixels, then pixels nearby can be arbitrarily conditionally independent!
+
+Alternatively we can have a naive Bayes model by introducing a hidden class variable $z$:
+$$p(X) = \sum_z p(X, z)$$
+$$p(X) = \sum_z p(z) \prod_{x_i \in X} p(x_i | z)$$
+\includegraphics[scale=0.4]{Screenshot_16.png}
+
+However there are issues with this too.
+The top-left and bottom-right pixels are dependent on each other, which might not be desired.
+Also, if we know what the class is, then all the pixels are conditionally independent.
+
+An alternative to DAGMs is undirected graphical models (UGMs).
+
+
+\section{Undirected Graphical Models}
+In UGMs, edges capture a relation between variables rather than defining one variable as the parent of another.
+
+\includegraphics[scale=0.4]{Screenshot_17.png}
+
+\subsection{D-Separation in Undirected Graphical Models}
+The following three properties are used to determine whether nodes are conditionally independent:
+\includegraphics[scale=0.2]{skggm_markov.png}
+
+\textbf{Global Markov Property}
+$X_A \perp X_B | X_C$ iff $X_C$ separates $X_A$ from $X_B$.
+
+\textbf{Local Markov Property} The set of nodes that renders a node conditionally independent of all the other nodes in the graph:
+
+$$X_j \perp X_{V - \{j, \text{neighbour}(j)\}} | X_{\text{neighbour}(j)}$$
+
+\textbf{Pairwise Markov Property} The set of nodes that renders two non-adjacent nodes conditionally independent of each other:
+$$X_j \perp X_i | X_{V - \{j, i\}}$$
+
+It's obvious that global Markov implies local Markov, which implies pairwise Markov.
+
+\subsection{Limitations of UGMs and DAGMs}
+Note that while we can now represent some new relations between variables, we can no longer represent others.
+\includegraphics[scale=0.4]{Screenshot_18.png}
+A DAGM can't represent graph 1, where X and Z are conditionally dependent, while a UGM can.
+But a UGM cannot represent graph 2, where X and Z are conditionally independent without knowing Y.
+
+\subsection{Moralization}
+This was only mentioned in passing during lecture, but a DAGM can be converted to a UGM using \href{https://en.wikipedia.org/wiki/Moral_graph}{moralization}.
+This is done by adding edges between all pairs of non-adjacent nodes that have a common child, then making all edges in the graph undirected.
+\includegraphics[scale=0.2]{moralGraph-DAG.png} becomes
+\includegraphics[scale=0.2]{Moralized.png}
+\section{Cliques}
+A clique in an undirected graph is a subset of its vertices such that every two vertices in the subset are connected by an edge.
+
+A maximal clique is a clique that cannot be extended by including one more adjacent vertex.
+A maximum clique is a clique of the largest possible size in the graph.
+
+\includegraphics[scale=0.8]{Screenshot_19.png}
+The image shows 2 maximal cliques, with the red clique also being the maximum clique.
+
+\section{Hammersley-Clifford Theorem}
+Since there is no topological ordering for an undirected graph, we can't use the chain rule to represent $p(y)$. So instead, we associate factors with each maximal clique.
+We will denote the potential function for clique $c$ by $\psi_c(y_c|\theta_c)$ (the $\psi$ from last week's notes); this can be any non-negative function.
+The joint distribution is then defined to be proportional to the product of clique potentials.
+
+\textbf{Hammersley-Clifford Theorem} A positive distribution $p(y) > 0$ satisfies the CI properties of an undirected graph $G$ iff $p$ can be represented as a product of factors, one per maximal clique, i.e.,
+
+$$P(y|\theta) = \frac{1}{Z(\theta)} \prod_C \psi_c (y_c|\theta_c)$$
+
+where $Z(\theta)$, the normalizing constant, is the sum of the unnormalized product over all possible assignments:
+$$Z(\theta) = \sum_y \prod_C \psi_c (y_c|\theta_c)$$
+
+Going back to our example graph,
+\includegraphics[scale=0.8]{Screenshot_20.png}
+$$
+p(y|\theta) = \frac{1}{Z(\theta)} \psi_{123}(y_1, y_2, y_3) \psi_{234}(y_2, y_3, y_4) \psi_{35}(y_3, y_5)
+$$
+$$
+Z = \sum_y \psi_{123}(y_1, y_2, y_3)\psi_{234}(y_2, y_3, y_4)\psi_{35}(y_3, y_5)
+$$
+This is useful because we can work in terms of cliques instead of individual edges, which reduces the number of terms in variable elimination.
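+
+To make the normalization concrete, here is a small sketch (my own, not from lecture) of this factorization for the example graph, assuming binary variables and random clique potentials.
+\begin{verbatim}
+# Clique-potential model for the 5-node example graph:
+# maximal cliques {1,2,3}, {2,3,4}, {3,5}, binary variables.
+import itertools
+import numpy as np
+
+rng = np.random.default_rng(2)
+psi_123 = rng.random((2, 2, 2))   # psi(y1, y2, y3)
+psi_234 = rng.random((2, 2, 2))   # psi(y2, y3, y4)
+psi_35  = rng.random((2, 2))      # psi(y3, y5)
+
+def unnormalized(y):
+    y1, y2, y3, y4, y5 = y
+    return psi_123[y1, y2, y3] * psi_234[y2, y3, y4] * psi_35[y3, y5]
+
+# Z(theta) sums the unnormalized score over every joint assignment.
+Z = sum(unnormalized(y) for y in itertools.product([0, 1], repeat=5))
+
+def p(y):
+    return unnormalized(y) / Z
+
+# The normalized scores now form a valid distribution.
+print(sum(p(y) for y in itertools.product([0, 1], repeat=5)))  # approx. 1.0
+\end{verbatim}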
+
+\section{Energy Based Models}
+UGMs are very useful in physics. Take for example the Gibbs distribution, used for modelling Gibbs free energy:
+$$p(y|\theta) = \frac{1}{Z(\theta)} \exp\left(-\sum_c E(y_c|\theta_c)\right) $$
+where $E(y_c) > 0$ is the energy associated with the variables in clique $c$. We can convert this to a
+UGM by defining
+$$\psi_c(y_c|\theta_c) = \exp(-E(y_c|\theta_c))$$
+
+So we can now model the energy state of, say, a protein molecule as a UGM.
+\includegraphics[scale=0.6]{Screenshot_21.png}
+(not a molecule, don't @ me)
+
+But going back to our initial example, it's easy to see that with a UGM we can represent, say, the pixels in an image and have neighbouring pixels be related to each other.
+\end{document}
diff --git a/Lecture Notes/Images/moralGraph-DAG.png b/Lecture Notes/Images/moralGraph-DAG.png new file mode 100644 index 0000000..9a93a55 Binary files /dev/null and b/Lecture Notes/Images/moralGraph-DAG.png differ diff --git a/Lecture Notes/Images/skggm_markov.png b/Lecture Notes/Images/skggm_markov.png new file mode 100644 index 0000000..53d6c40 Binary files /dev/null and b/Lecture Notes/Images/skggm_markov.png differ diff --git a/Lecture Notes/Week_3.pdf b/Lecture Notes/Week_3.pdf new file mode 100644 index 0000000..e42c1c7 Binary files /dev/null and b/Lecture Notes/Week_3.pdf differ diff --git a/Lecture Notes/Week_4.pdf b/Lecture Notes/Week_4.pdf new file mode 100644 index 0000000..33f12e9 Binary files /dev/null and b/Lecture Notes/Week_4.pdf differ diff --git a/Lecture Notes/Week_5.pdf b/Lecture Notes/Week_5.pdf new file mode 100644 index 0000000..e4c11ee Binary files /dev/null and b/Lecture Notes/Week_5.pdf differ