<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width">
<title>MathJax example</title>
<script type="text/javascript">
window.onload = function() {
MathJax.Hub.Config({
// inlineMath and processEscapes are tex2jax preprocessor options in MathJax v2
tex2jax: {
inlineMath: [ ["$", "$"], ["\\(", "\\)"] ],
processEscapes: true
},
showProcessingMessages: true,
jax: ["input/TeX", "output/HTML-CSS"],
displayAlign: "left"
});
}
</script>
<script type="text/javascript" async
src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/latest.js?config=TeX-MML-AM_CHTML">
</script>
<style>
.col { float:left; margin-right:25px;}
.col .head { width:100%; text-align:center; color:#4c4c4c; padding-bottom:5px; }
.col .top { }
.col .bottom { padding-top:20px; padding-right:25px; border:1px dashed #000; border-width:0 1px 0 0; }
.bottom .section { padding:10px 0 0 0; border-bottom:1px dashed #000; }
.first .top:after { content:""; clear:both; display:block; }
.first .top span { border: 1px solid #000; float:left; padding:5px; clear:left; margin: 5px;}
.top .formula { border:1px solid #000; padding:0 10px; }
.circle { display:inline-block; text-transform: uppercase; font-size: 12px; border:1px solid #000; border-radius:50%; padding:5px; }
.other-notes { margin-top:200px; }
</style>
</head>
<body>
Another attempt to get fairly complicated LaTeX into a post. Suggestions on how to get it to display sensibly in the forums would be appreciated; it displays very nicely in a notebook.
# Why is $$dz^{[L]} = a^{[L]} - y$$ ?
This formula can look confusing and mysterious, both in this simple form and in the version vectorized over the set of training examples.
That is probably because it is introduced and used after the formulae involving derivatives and the chain rule, yet it appears to ignore all of that and simply take the difference between the prediction and the correct value.
In fact, this formula is specific to the case where we have a sigmoid final activation function and a cross-entropy loss function.
The cross-entropy loss function for a single example, with \(\log\) denoting the natural logarithm, is:
$$
\mathcal{L}(y, \hat{y}) = - \left(y \log {\hat{y}} + (1-y) \log {(1 - \hat{y})} \right)
$$
In that formula, \( \hat{y} \) is just another name for \( a^{[L]} \), the activation of the final layer.
So we can write the loss function as:
$$
\mathcal{L}(y, a^{[L]}) = - \left( y \log {a^{[L]}} + (1 - y) \log {(1 - a^{[L]})} \right)
$$
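To make the loss concrete, here is a minimal Python sketch of the formula above for a single example. The function name `cross_entropy_loss` and the sample values are illustrative assumptions, not part of the original post.

```python
import math

def cross_entropy_loss(y, a):
    """Binary cross-entropy for one example, using the natural logarithm.

    y is the true label (0 or 1); a is the predicted probability,
    assumed to lie strictly between 0 and 1.
    """
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

# Illustrative values: the same label with a good and a poor prediction.
loss_good = cross_entropy_loss(1, 0.9)
loss_bad = cross_entropy_loss(1, 0.1)
print(loss_good, loss_bad)
```

A confident correct prediction gives a small loss, while a confident wrong one gives a much larger loss.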
The abbreviated notation \(d a^{[L]}\) is used for \(\frac {d \mathcal{L}} {d a^{[L]}}\).
With this notation, differentiating the loss with respect to \(a^{[L]}\) gives:
$$
da^{[L]} = - \left( \frac{y}{a^{[L]}} - \frac {1 - y} {1 - a^{[L]}} \right)
$$
Distributing the minus sign, we get:
$$
da^{[L]} = \frac {1-y} {1 - a^{[L]}} - \frac {y } { a^{[L]}}
$$
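One way to convince yourself of this derivative is to compare it with a central finite difference of the loss. The sketch below does that in plain Python; the helper names `da_formula` and `da_numeric`, the step size, and the sample points are assumptions chosen for illustration.

```python
import math

def loss(y, a):
    # Binary cross-entropy for one example (natural logarithm).
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

def da_formula(y, a):
    # The closed-form derivative of the loss with respect to a.
    return (1 - y) / (1 - a) - y / a

def da_numeric(y, a, eps=1e-6):
    # Central finite-difference approximation of the same derivative.
    return (loss(y, a + eps) - loss(y, a - eps)) / (2 * eps)

points = [(1, 0.8), (0, 0.8), (1, 0.3), (0, 0.3)]
for y, a in points:
    print(y, a, da_formula(y, a), da_numeric(y, a))
```

The two columns should agree to several decimal places at every sample point.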
Now we also have that \(a^{[L]} = \sigma\left( z^{[L]} \right)\) with \(\sigma\) being the sigmoid function, and we know the formula for the sigmoid derivative:
$$
\frac {d \sigma(z)} {d z} = \sigma(z) \left( 1 - \sigma(z) \right)
$$
This allows us to write the derivative of \(a^{[L]}\) in terms of its value:
$$
\frac {d a^{[L]} } {d z^{[L]} } = a^{[L]} \left( 1 - a^{[L]} \right)
$$
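The sigmoid derivative identity can be checked numerically in the same way. This is a small sketch under assumed names (`sigmoid`, `sigmoid_prime`, `sigmoid_prime_numeric`) and an assumed step size, not code from the course.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    # Derivative written in terms of the activation value itself.
    s = sigmoid(z)
    return s * (1 - s)

def sigmoid_prime_numeric(z, eps=1e-6):
    # Central finite-difference approximation of the derivative.
    return (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)

z_values = [-2.0, 0.0, 0.5, 3.0]
for z in z_values:
    print(z, sigmoid_prime(z), sigmoid_prime_numeric(z))
```

At `z = 0` the activation is 0.5, so the identity gives a slope of 0.25, the largest value the sigmoid derivative takes.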
We now apply the chain rule:
$$
\frac {d\mathcal{L}} {dz^{[L]}} = \frac {d\mathcal{L}} {da^{[L]}} \frac {da^{[L]}} {dz^{[L]}}
$$
Using the abbreviated notation and substituting what we know for the two factors on the right-hand side, we get:
$$
dz^{[L]} = \left( \frac {1 - y}{1 - a^{[L]}} - \frac {y} {a^{[L]}} \right) a^{[L]} \left( 1 - a^{[L]} \right)
$$
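As a quick sanity check of this expression, take the illustrative values \(y = 1\) and \(a^{[L]} = 0.8\) (chosen here only as an example):
$$
dz^{[L]} = \left( \frac {1 - 1}{1 - 0.8} - \frac {1} {0.8} \right) \cdot 0.8 \cdot \left( 1 - 0.8 \right) = (-1.25) \cdot 0.16 = -0.2
$$
This is the same value as \(0.8 - 1\), the difference between the prediction and the label.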
Multiplying each fraction by \(a^{[L]} \left( 1 - a^{[L]} \right)\) cancels its denominator:
$$
dz^{[L]} = \left( 1-y \right) a^{[L]} - y \left( 1 - a^{[L]}\right)
$$
Expanding these products gives us:
$$
dz^{[L]} = a^{[L]} - y a^{[L]} -y + y a^{[L]}
$$
The two \(y a^{[L]}\) terms cancel, which finally simplifies to:
$$
dz^{[L]} = a^{[L]} - y
$$
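The end result can also be verified directly: treat the loss as a function of \(z^{[L]}\), differentiate it numerically, and compare with the plain difference. The following is a hedged sketch in plain Python; the names `loss_from_z` and `dz_numeric` and the test points are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_from_z(y, z):
    # Cross-entropy loss expressed as a function of the pre-activation z.
    a = sigmoid(z)
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

def dz_numeric(y, z, eps=1e-6):
    # Central finite difference of the loss with respect to z.
    return (loss_from_z(y, z + eps) - loss_from_z(y, z - eps)) / (2 * eps)

cases = [(1, 2.0), (0, 2.0), (1, -1.5), (0, 0.0)]
results = []
for y, z in cases:
    a = sigmoid(z)
    results.append((a - y, dz_numeric(y, z)))
    print(y, z, a - y, dz_numeric(y, z))
```

If the derivation holds, the simple difference and the numerical derivative agree at every test point.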
So, as we see, the derivatives and the chain rule are not being ignored: they collapse into that simple, almost too simple looking, difference.
</body></html>