
<!DOCTYPE html>
<html lang="en">
<!-- Produced from a LaTeX source file.  Note that the production is done -->
<!-- by a very rough-and-ready (and buggy) script, so the HTML and other  -->
<!-- code is quite ugly!  Later versions should be better.                -->
  <head>
    <meta charset="utf-8">
    <meta name="citation_title" content="Neural Networks and Deep Learning">
    <meta name="citation_author" content="Nielsen, Michael A.">
    <meta name="citation_publication_date" content="2014">
    <meta name="citation_fulltext_html_url" content="http://neuralnetworksanddeeplearning.com">
    <meta name="citation_publisher" content="Determination Press">
    <link rel="icon" href="nnadl_favicon.ICO" />
    <title>Neural Networks and Deep Learning</title>
    <script src="assets/jquery.min.js"></script>
    <script type="text/x-mathjax-config">
      MathJax.Hub.Config({
        tex2jax: {inlineMath: [['$','$']]},
        "HTML-CSS":
          {scale: 92},
        TeX: { equationNumbers: { autoNumber: "AMS" }}});
    </script>
    <script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>


    <link href="assets/style.css" rel="stylesheet">
    <link href="assets/pygments.css" rel="stylesheet">

<style>
/* Adapted from */
/* https://groups.google.com/d/msg/mathjax-users/jqQxrmeG48o/oAaivLgLN90J, */
/* by David Cervone */

@font-face {
    font-family: 'MJX_Math';
    src: url('http://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); /* IE9 Compat Modes */
    src: url('http://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot?iefix') format('eot'),
    url('http://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff')  format('woff'),
    url('http://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf')  format('opentype'),
    url('http://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/svg/MathJax_Math-Italic.svg#MathJax_Math-Italic') format('svg');
}

@font-face {
    font-family: 'MJX_Main';
    src: url('http://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); /* IE9 Compat Modes */
    src: url('http://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot?iefix') format('eot'),
    url('http://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff')  format('woff'),
    url('http://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf')  format('opentype'),
    url('http://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/svg/MathJax_Main-Regular.svg#MathJax_Main-Regular') format('svg');
}
</style>

  </head>
  <body><div class="header"><h1 class="chapter_number">
  <a href="">CHAPTER 2</a></h1>
  <h1 class="chapter_title"><a href="">
How the backpropagation algorithm works
</a></h1></div><div class="section"><div id="toc">
<p class="toc_title"><a href="index.html">Neural Networks and Deep Learning</a></p><p class="toc_not_mainchapter"><a href="about.html">What this book is about</a></p><p class="toc_not_mainchapter"><a href="exercises_and_problems.html">On the exercises and problems</a></p><p class='toc_mainchapter'><a id="toc_using_neural_nets_to_recognize_handwritten_digits_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_using_neural_nets_to_recognize_handwritten_digits" src="images/arrow.png" width="15px"></a><a href="chap1.html">Using neural nets to recognize handwritten digits</a><div id="toc_using_neural_nets_to_recognize_handwritten_digits" style="display: none;"><p class="toc_section"><ul><a href="chap1.html#perceptrons"><li>Perceptrons</li></a><a href="chap1.html#sigmoid_neurons"><li>Sigmoid neurons</li></a><a href="chap1.html#the_architecture_of_neural_networks"><li>The architecture of neural networks</li></a><a href="chap1.html#a_simple_network_to_classify_handwritten_digits"><li>A simple network to classify handwritten digits</li></a><a href="chap1.html#learning_with_gradient_descent"><li>Learning with gradient descent</li></a><a href="chap1.html#implementing_our_network_to_classify_digits"><li>Implementing our network to classify digits</li></a><a href="chap1.html#toward_deep_learning"><li>Toward deep learning</li></a></ul></p></div>
<script>
$('#toc_using_neural_nets_to_recognize_handwritten_digits_reveal').click(function() {
   var src = $('#toc_img_using_neural_nets_to_recognize_handwritten_digits').attr('src');
   if(src == 'images/arrow.png') {
     $("#toc_img_using_neural_nets_to_recognize_handwritten_digits").attr('src', 'images/arrow_down.png');
   } else {
     $("#toc_img_using_neural_nets_to_recognize_handwritten_digits").attr('src', 'images/arrow.png');
   };
   $('#toc_using_neural_nets_to_recognize_handwritten_digits').toggle('fast', function() {});
});</script><p class='toc_mainchapter'><a id="toc_how_the_backpropagation_algorithm_works_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_how_the_backpropagation_algorithm_works" src="images/arrow.png" width="15px"></a><a href="chap2.html">How the backpropagation algorithm works</a><div id="toc_how_the_backpropagation_algorithm_works" style="display: none;"><p class="toc_section"><ul><a href="chap2.html#warm_up_a_fast_matrix-based_approach_to_computing_the_output_from_a_neural_network"><li>Warm up: a fast matrix-based approach to computing the output  from a neural network</li></a><a href="chap2.html#the_two_assumptions_we_need_about_the_cost_function"><li>The two assumptions we need about the cost function</li></a><a href="chap2.html#the_hadamard_product_$s_\odot_t$"><li>The Hadamard product, $s \odot t$</li></a><a href="chap2.html#the_four_fundamental_equations_behind_backpropagation"><li>The four fundamental equations behind backpropagation</li></a><a href="chap2.html#proof_of_the_four_fundamental_equations_(optional)"><li>Proof of the four fundamental equations (optional)</li></a><a href="chap2.html#the_backpropagation_algorithm"><li>The backpropagation algorithm</li></a><a href="chap2.html#the_code_for_backpropagation"><li>The code for backpropagation</li></a><a href="chap2.html#in_what_sense_is_backpropagation_a_fast_algorithm"><li>In what sense is backpropagation a fast algorithm?</li></a><a href="chap2.html#backpropagation_the_big_picture"><li>Backpropagation: the big picture</li></a></ul></p></div>
<script>
$('#toc_how_the_backpropagation_algorithm_works_reveal').click(function() {
   var src = $('#toc_img_how_the_backpropagation_algorithm_works').attr('src');
   if(src == 'images/arrow.png') {
     $("#toc_img_how_the_backpropagation_algorithm_works").attr('src', 'images/arrow_down.png');
   } else {
     $("#toc_img_how_the_backpropagation_algorithm_works").attr('src', 'images/arrow.png');
   };
   $('#toc_how_the_backpropagation_algorithm_works').toggle('fast', function() {});
});</script><p class='toc_mainchapter'><a id="toc_improving_the_way_neural_networks_learn_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_improving_the_way_neural_networks_learn" src="images/arrow.png" width="15px"></a><a href="chap3.html">Improving the way neural networks learn</a><div id="toc_improving_the_way_neural_networks_learn" style="display: none;"><p class="toc_section"><ul><a href="chap3.html#the_cross-entropy_cost_function"><li>The cross-entropy cost function</li></a><a href="chap3.html#overfitting_and_regularization"><li>Overfitting and regularization</li></a><a href="chap3.html#weight_initialization"><li>Weight initialization</li></a><a href="chap3.html#handwriting_recognition_revisited_the_code"><li>Handwriting recognition revisited: the code</li></a><a href="chap3.html#how_to_choose_a_neural_network's_hyper-parameters"><li>How to choose a neural network's hyper-parameters?</li></a><a href="chap3.html#other_techniques"><li>Other techniques</li></a></ul></p></div>
<script>
$('#toc_improving_the_way_neural_networks_learn_reveal').click(function() {
   var src = $('#toc_img_improving_the_way_neural_networks_learn').attr('src');
   if(src == 'images/arrow.png') {
     $("#toc_img_improving_the_way_neural_networks_learn").attr('src', 'images/arrow_down.png');
   } else {
     $("#toc_img_improving_the_way_neural_networks_learn").attr('src', 'images/arrow.png');
   };
   $('#toc_improving_the_way_neural_networks_learn').toggle('fast', function() {});
});</script><p class='toc_mainchapter'><a id="toc_a_visual_proof_that_neural_nets_can_compute_any_function_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_a_visual_proof_that_neural_nets_can_compute_any_function" src="images/arrow.png" width="15px"></a><a href="chap4.html">A visual proof that neural nets can compute any function</a><div id="toc_a_visual_proof_that_neural_nets_can_compute_any_function" style="display: none;"><p class="toc_section"><ul><a href="chap4.html#two_caveats"><li>Two caveats</li></a><a href="chap4.html#universality_with_one_input_and_one_output"><li>Universality with one input and one output</li></a><a href="chap4.html#many_input_variables"><li>Many input variables</li></a><a href="chap4.html#extension_beyond_sigmoid_neurons"><li>Extension beyond sigmoid neurons</li></a><a href="chap4.html#fixing_up_the_step_functions"><li>Fixing up the step functions</li></a><a href="chap4.html#conclusion"><li>Conclusion</li></a></ul></p></div>
<script>
$('#toc_a_visual_proof_that_neural_nets_can_compute_any_function_reveal').click(function() {
   var src = $('#toc_img_a_visual_proof_that_neural_nets_can_compute_any_function').attr('src');
   if(src == 'images/arrow.png') {
     $("#toc_img_a_visual_proof_that_neural_nets_can_compute_any_function").attr('src', 'images/arrow_down.png');
   } else {
     $("#toc_img_a_visual_proof_that_neural_nets_can_compute_any_function").attr('src', 'images/arrow.png');
   };
   $('#toc_a_visual_proof_that_neural_nets_can_compute_any_function').toggle('fast', function() {});
});</script><p class='toc_mainchapter'><a id="toc_why_are_deep_neural_networks_hard_to_train_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_why_are_deep_neural_networks_hard_to_train" src="images/arrow.png" width="15px"></a><a href="chap5.html">Why are deep neural networks hard to train?</a><div id="toc_why_are_deep_neural_networks_hard_to_train" style="display: none;"><p class="toc_section"><ul><a href="chap5.html#the_vanishing_gradient_problem"><li>The vanishing gradient problem</li></a><a href="chap5.html#what's_causing_the_vanishing_gradient_problem_unstable_gradients_in_deep_neural_nets"><li>What's causing the vanishing gradient problem?  Unstable gradients in deep neural nets</li></a><a href="chap5.html#unstable_gradients_in_more_complex_networks"><li>Unstable gradients in more complex networks</li></a><a href="chap5.html#other_obstacles_to_deep_learning"><li>Other obstacles to deep learning</li></a></ul></p></div>
<script>
$('#toc_why_are_deep_neural_networks_hard_to_train_reveal').click(function() {
   var src = $('#toc_img_why_are_deep_neural_networks_hard_to_train').attr('src');
   if(src == 'images/arrow.png') {
     $("#toc_img_why_are_deep_neural_networks_hard_to_train").attr('src', 'images/arrow_down.png');
   } else {
     $("#toc_img_why_are_deep_neural_networks_hard_to_train").attr('src', 'images/arrow.png');
   };
   $('#toc_why_are_deep_neural_networks_hard_to_train').toggle('fast', function() {});
});</script><p class='toc_mainchapter'><a id="toc_deep_learning_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_deep_learning" src="images/arrow.png" width="15px"></a>Deep learning<div id="toc_deep_learning" style="display: none;"><p class="toc_section"><ul><li>Convolutional neural networks</li><li>Pretraining</li><li>Recurrent neural networks, Boltzmann machines, and other  models</li><li>Is there a universal thinking algorithm?</li><li>On the future of neural networks</li></ul></p></div>
<script>
$('#toc_deep_learning_reveal').click(function() {
   var src = $('#toc_img_deep_learning').attr('src');
   if(src == 'images/arrow.png') {
     $("#toc_img_deep_learning").attr('src', 'images/arrow_down.png');
   } else {
     $("#toc_img_deep_learning").attr('src', 'images/arrow.png');
   };
   $('#toc_deep_learning').toggle('fast', function() {});
});</script><p class="toc_not_mainchapter"><a href="acknowledgements.html">Acknowledgements</a></p><p class="toc_not_mainchapter"><a href="faq.html">Frequently Asked Questions</a></p>
<hr>
<span class="sidebar_title">Sponsors</span>
<br/>
<a href='http://www.ersatz1.com/'><img src='assets/ersatz.png' width='140px' style="padding: 0px 0px 10px 8px; border-style: none;"></a>

<a href='http://gsquaredcapital.com/'><img src='assets/gsquared.png' width='150px' style="padding: 0px 0px 10px 10px; border-style: none;"></a>

<a href='http://www.tineye.com'><img src='assets/tineye.png' width='150px'
style="padding: 0px 0px 10px 8px; border-style: none;"></a>

<a href='http://www.visionsmarts.com'><img
src='assets/visionsmarts.png' width='160px' style="padding: 0px 0px
0px 0px; border-style: none;"></a> <br/>


<p class="sidebar">Thanks to all the <a
href="supporters.html">supporters</a> who made the book possible.
Thanks also to all the contributors to the <a
href="bugfinder.html">Bugfinder Hall of Fame</a>.
For the Japanese edition, many thanks also to the <a
href="translators.html">translators</a>.
</p>


<p class="sidebar">The book is currently a beta release, and is still
under active development.  Please send error reports to
mn@michaelnielsen.org, and questions about the Japanese edition to
muranushi@gmail.com.  For other enquiries, please see the <a
href="faq.html">FAQ</a> first.</p>


<hr>
<span class="sidebar_title">Resources</span>

<p class="sidebar">
<a href="https://github.com/mnielsen/neural-networks-and-deep-learning">Code repository</a></p>

<p class="sidebar">
<a href="http://eepurl.com/BYr9L">Mailing list for book announcements</a>
</p>

<p class="sidebar">
<a href="http://eepurl.com/0Xxjb">Michael Nielsen's project announcement mailing list</a>
</p>

<hr>
<a href="http://michaelnielsen.org"><img src="assets/Michael_Nielsen_Web_Small.jpg" width="160px" style="border-style: none;"/></a>

<p class="sidebar">
  By <a href="http://michaelnielsen.org">Michael Nielsen</a> / Sep-Dec 2014 <br >  Translated by the <a href="https://github.com/nnadl-ja/nnadl_site_ja">"Neural Networks and Deep Learning" translation project</a>
</p>
</div>
</p>
<p>
In the <a href="chap1.html">last chapter</a> we saw how neural networks can
learn their weights and biases using the gradient descent algorithm.
There was, however, a gap in our explanation: we didn't discuss how to
compute the gradient of the cost function.  That's quite a gap!  In
this chapter I'll explain a fast algorithm for computing such
gradients, an algorithm known as <em>backpropagation</em>.
</p>
<p>
<p>
The backpropagation algorithm was originally introduced in the 1970s,
but its importance wasn't fully appreciated until a
<a href="http://www.nature.com/nature/journal/v323/n6088/pdf/323533a0.pdf">famous 1986 paper</a> by
<a href="http://en.wikipedia.org/wiki/David_Rumelhart">David Rumelhart</a>,
<a href="http://www.cs.toronto.edu/&#126;hinton/">Geoffrey Hinton</a>, and
<a href="http://en.wikipedia.org/wiki/Ronald_J._Williams">Ronald Williams</a>.
That paper describes several
neural networks where backpropagation works far faster than earlier
approaches to learning, making it possible to use neural nets to solve
problems which had previously been insoluble.  Today, the
backpropagation algorithm is the workhorse of learning in neural
networks.
</p>
<p>
This chapter is more mathematically involved than the rest of the
book.  If you're not crazy about mathematics you may be tempted to
skip the chapter, and to treat backpropagation as a black box whose
details you're willing to ignore.  Why take the time to study those
details?
</p>
<p>
The reason, of course, is understanding.  At the heart of
backpropagation is an expression for the partial derivative $\partial
C / \partial w$ (or $\partial C / \partial b$) of the cost function $C$
with respect to any weight $w$ (or bias $b$) in the network.  The
expression tells us how quickly
the cost changes when we change the weights and biases.  And while the
expression is somewhat complex, it also has a beauty to it, with each
element having a natural, intuitive interpretation.  And so
backpropagation isn't just a fast algorithm for learning.  It actually
gives us detailed insights into how changing the weights and biases
changes the overall behaviour of the network.  That's well worth
studying in detail.
</p>
<p>
With that said, if you want to skim the chapter, or jump straight to
the next chapter, that's fine.  I've written the rest of the book to
be accessible even if you treat backpropagation as a black box.  There
are, of course, points later in the book where I refer back to results
from this chapter.  But at those points you should still be able to
understand the main conclusions, even if you don't follow all the
reasoning.
</p>
<p><h3><a name="warm_up_a_fast_matrix-based_approach_to_computing_the_output_from_a_neural_network"></a><a href="#warm_up_a_fast_matrix-based_approach_to_computing_the_output_from_a_neural_network">
Warm up: a fast matrix-based approach to computing the output from a neural network
</a></h3></p>
<p>
Before discussing backpropagation, let's warm up with a fast
matrix-based algorithm to compute the output from a neural network.
We actually already briefly saw this algorithm
<a href="chap1.html#implementing_our_network_to_classify_digits">near
  the end of the last chapter</a>, but I described it quickly, so it's
worth revisiting in detail.  In particular, this is a good way of
getting comfortable with the notation used in backpropagation, in a
familiar context.
</p>
<p>
Let's begin with a notation which lets us refer to weights in the
network in an unambiguous way.  We'll use $w^l_{jk}$ to denote the
weight for the connection from the $k^{\rm th}$ neuron in the
$(l-1)^{\rm th}$ layer to the $j^{\rm th}$ neuron in the $l^{\rm th}$
layer.  So, for example, the diagram below shows the weight on a
connection from the fourth neuron in the second layer to the second
neuron in the third layer of a network:
<center>
<img src="images/tikz16.png"/>
</center>
This notation is cumbersome at first, and it does take some work to
master.  But with a little effort you'll find the notation becomes
easy and natural.  One quirk of the notation is the ordering of the
$j$ and $k$ indices.  You might think that it makes more sense to use
$j$ to refer to the input neuron, and $k$ to the output neuron, not
vice versa, as is actually done.  I'll explain the reason for this
quirk below.
</p>
<p>
We use a similar notation for the network's biases and activations.
Explicitly, we use $b^l_j$ for the bias of the $j^{\rm th}$ neuron in
the $l^{\rm th}$ layer.  And we use $a^l_j$ for the activation of the
$j^{\rm th}$ neuron in the $l^{\rm th}$ layer.  The following diagram
shows examples of these notations in use:
<center>
<img src="images/tikz17.png"/>
</center>
With these notations, the activation $a^{l}_j$ of the $j^{\rm th}$
neuron in the $l^{\rm th}$ layer is related to the activations in the
$(l-1)^{\rm th}$ layer by the equation (compare Equation
<span id="margin_295319563926_reveal" class="equation_link">(4)</span>
<span id="margin_295319563926" class="marginequation" style="display: none;">
  <a href="chap1.html#eqtn4" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">
    \begin{eqnarray}
    \frac{1}{1+\exp(-\sum_j w_j x_j-b)} \nonumber
    \end{eqnarray}
  </a>
</span>
<script>$('#margin_295319563926_reveal').click(function() {$('#margin_295319563926').toggle('slow', function() {});});</script>
 and surrounding discussion in the last chapter)
<a class="displaced_anchor" name="eqtn23"></a>\begin{eqnarray}
  a^{l}_j = \sigma\left( \sum_k w^{l}_{jk} a^{l-1}_k + b^l_j \right),
\tag{23}\end{eqnarray}
where the sum is over all neurons $k$ in the $(l-1)^{\rm th}$ layer.
To rewrite this expression in a matrix form we define a <em>weight
  matrix</em> $w^l$ for each layer, $l$.  The entries of the weight matrix
$w^l$ are just the weights connecting to the $l^{\rm th}$ layer of neurons,
that is, the entry in the $j^{\rm th}$ row and $k^{\rm th}$ column is $w^l_{jk}$.
Similarly, for each layer $l$ we define a <em>bias vector</em>, $b^l$.
You can probably guess how this works - the components of the bias
vector are just the values $b^l_j$, one component for each neuron in
the $l^{\rm th}$ layer.  And finally, we define an activation vector $a^l$
whose components are the activations $a^l_j$.
</p>
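<p>
To make Equation (23) concrete, here is a minimal sketch in plain Python
that computes a layer's activations one neuron at a time.  The function
names <tt>sigmoid</tt> and <tt>layer_activations</tt> and the numerical
values are illustrative assumptions, not code from the book.  The weights
are stored as a list of rows, so that <tt>w[j][k]</tt> plays the role of
$w^l_{jk}$.
</p>

```python
import math


def sigmoid(z):
    """The sigmoid function: sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_activations(w, b, a_prev):
    """Compute the activations of layer l from those of layer l-1.

    w[j][k] is the weight from neuron k in layer l-1 to neuron j in layer l,
    b[j] is the bias of neuron j in layer l, and a_prev[k] is the activation
    of neuron k in layer l-1.
    """
    a = []
    for w_j, b_j in zip(w, b):
        # Weighted sum over all neurons k in the previous layer, plus the bias.
        total = sum(w_jk * a_k for w_jk, a_k in zip(w_j, a_prev)) + b_j
        a.append(sigmoid(total))
    return a


# Hypothetical values: 3 neurons in layer l-1 feeding 2 neurons in layer l.
w = [[0.5, -1.0, 0.25],
     [1.5, 0.0, -0.5]]
b = [0.0, 0.5]
a_prev = [1.0, 0.5, 2.0]

a = layer_activations(w, b, a_prev)
print(a)
```

<p>
Each row of <tt>w</tt> holds the weights arriving at one neuron in layer
$l$, so the weight matrix has one row per neuron in layer $l$ and one
column per neuron in layer $l-1$.
</p>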
<p>
The last ingredient we need to rewrite
<span id="margin_910880967978_reveal" class="equation_link">(23)</span>
<span id="margin_910880967978" class="marginequation" style="display: none;">
  <a href="chap2.html#eqtn23" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">
  \begin{eqnarray}
    a^{l}_j = \sigma\left( \sum_k w^{l}_{jk} a^{l-1}_k + b^l_j \right) \nonumber
  \end{eqnarray}
  </a>
</span>
<script>$('#margin_910880967978_reveal').click(function() {$('#margin_910880967978').toggle('slow', function() {});});</script>
in a matrix form is the idea of vectorizing a function such as $\sigma$.
We met vectorization briefly in the last chapter, but to recap, the
idea is that we want to apply a function such as $\sigma$ to every
element in a vector $v$.  We use the obvious notation $\sigma(v)$ to
denote this kind of elementwise application of a function.  That is,
the components of $\sigma(v)$ are just $\sigma(v)_j = \sigma(v_j)$.
As an example, if we have the function $f(x) = x^2$ then the
vectorized form of $f$ has the effect
<a class="displaced_anchor" name="eqtn24"></a>\begin{eqnarray}
  f\left(\left[ \begin{array}{c} 2 \\ 3 \end{array} \right] \right)
  = \left[ \begin{array}{c} f(2) \\ f(3) \end{array} \right]
  = \left[ \begin{array}{c} 4 \\ 9 \end{array} \right],
\tag{24}\end{eqnarray}
that is, the vectorized $f$ just squares every element of the vector.
</p>
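<p>
The example in Equation (24) can be reproduced with a few lines of plain
Python.  This is only a sketch: the helper name <tt>vectorize</tt> is an
assumption made for illustration, and vectors are represented as ordinary
lists.
</p>

```python
def vectorize(f):
    """Return a function that applies f to every element of a vector (a list)."""
    def f_vec(v):
        return [f(v_j) for v_j in v]
    return f_vec


def square(x):
    """The scalar function f(x) = x^2 from Equation (24)."""
    return x ** 2


square_vec = vectorize(square)
result = square_vec([2, 3])
print(result)  # [4, 9]
```

<p>
Numerical libraries such as NumPy typically provide this elementwise
behaviour directly for array arguments, so in practice a helper like this
is rarely written by hand.
</p>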
<p>
With these notations in mind, Equation
<span id="margin_948905994145_reveal" class="equation_link">(23)</span>
<span id="margin_948905994145" class="marginequation" style="display: none;">
  <a href="chap2.html#eqtn23" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">
  \begin{eqnarray}
  a^{l}_j = \sigma\left( \sum_k w^{l}_{jk} a^{l-1}_k + b^l_j \right) \nonumber
  \end{eqnarray}</a>
</span>
<script>$('#margin_948905994145_reveal').click(function() {$('#margin_948905994145').toggle('slow', function() {});});</script>
can be rewritten in the beautiful and compact vectorized form
<a class="displaced_anchor" name="eqtn25"></a>
\begin{eqnarray}
  a^{l} = \sigma(w^l a^{l-1}+b^l). \tag{25}
\end{eqnarray}
This expression gives us a much more global way of thinking about how
the activations in one layer relate to activations in the previous
layer: we just apply the weight matrix to the activations, then add
the bias vector, and finally apply the $\sigma$ function*.
<span class="marginnote">
*By the way, it's this expression that motivates the quirk in the
  $w^l_{jk}$ notation mentioned earlier.  If we used $j$ to index the
  input neuron, and $k$ to index the output neuron, then we'd need to
  replace the weight matrix in Equation
  <span id="margin_305931363478_reveal" class="equation_link">(25)</span>
  <span id="margin_305931363478" class="marginequation" style="display: none;">
  <a href="chap2.html#eqtn25" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">
  \begin{eqnarray}
    a^{l} = \sigma(w^l a^{l-1}+b^l) \nonumber
  \end{eqnarray}</a></span>
  <script>$('#margin_305931363478_reveal').click(function() {$('#margin_305931363478').toggle('slow', function() {});});</script>
  by the transpose of the weight matrix.  That's a small change, but
  annoying, and we'd lose the easy simplicity of saying (and thinking)
  "apply the weight matrix to the activations".
</span>
That global view
is often easier and more succinct (and involves fewer indices!) than
the neuron-by-neuron view we've taken to now.  Think of it as a way of
escaping index hell, while remaining precise about what's going on.
The expression is also useful in practice, because most matrix
libraries provide fast ways of implementing matrix multiplication,
vector addition, and vectorization.  Indeed, the
<a href="chap1.html#implementing_our_network_to_classify_digits">code</a>
in the last chapter made implicit use of this expression to compute
the behaviour of the network.
</p>
<p>
<!--When using Equation
<span id="margin_924913160001_reveal" class="equation_link">(25)</span><span id="margin_924913160001" class="marginequation" style="display: none;"><a href="chap2.html#eqtn25" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  a^{l} = \sigma(w^l a^{l-1}+b^l) \nonumber\end{eqnarray}</a></span><script>$('#margin_924913160001_reveal').click(function() {$('#margin_924913160001').toggle('slow', function() {});});</script>
 to compute $a^l$, we compute the intermediate quantity $z^l \equiv w^l a^{l-1}+b^l$
along the way.-->
When using Equation
<span id="margin_924913160001_reveal" class="equation_link">(25)</span><span id="margin_924913160001" class="marginequation" style="display: none;"><a href="chap2.html#eqtn25" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  a^{l} = \sigma(w^l a^{l-1}+b^l) \nonumber\end{eqnarray}</a></span><script>$('#margin_924913160001_reveal').click(function() {$('#margin_924913160001').toggle('slow', function() {});});</script>
to compute $a^l$, we compute the intermediate quantity $z^l \equiv w^l a^{l-1}+b^l$ along the way.
<!--This quantity turns out to be useful enough to be
worth naming: we call $z^l$ the <em>weighted input</em> to the neurons
in layer $l$.  We'll make considerable use of the weighted input $z^l$
later in the chapter. -->
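This quantity turns out to be useful enough to be worth naming:
we call $z^l$ the <em>weighted input</em> to the neurons in layer $l$.
We'll make considerable use of the weighted input $z^l$ later in the chapter.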
<!--Equation
<span id="margin_957787478923_reveal" class="equation_link">(25)</span><span id="margin_957787478923" class="marginequation" style="display: none;"><a href="chap2.html#eqtn25" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  a^{l} = \sigma(w^l a^{l-1}+b^l) \nonumber\end{eqnarray}</a></span><script>$('#margin_957787478923_reveal').click(function() {$('#margin_957787478923').toggle('slow', function() {});});</script>
is sometimes written in terms of the weighted input, as $a^l = \sigma(z^l)$. -->
Equation
<span id="margin_957787478923_reveal" class="equation_link">(25)</span><span id="margin_957787478923" class="marginequation" style="display: none;"><a href="chap2.html#eqtn25" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  a^{l} = \sigma(w^l a^{l-1}+b^l) \nonumber\end{eqnarray}</a></span><script>$('#margin_957787478923_reveal').click(function() {$('#margin_957787478923').toggle('slow', function() {});});</script>
is sometimes written in terms of the weighted input, as $a^l = \sigma(z^l)$.
<!-- It's also worth noting that $z^l$ has components $z^l_j
= \sum_k w^l_{jk} a^{l-1}_k+b^l_j$, that is, $z^l_j$ is just the
weighted input to the activation function for neuron $j$ in layer $l$.-->
It's also worth noting that $z^l$ has components $z^l_j = \sum_k w^l_{jk} a^{l-1}_k+b^l_j$.
That is, $z^l_j$ is just the weighted input to the activation function for neuron $j$ in layer $l$.
</p>
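<p>
As a rough illustration of the component formula $z^l_j = \sum_k w^l_{jk} a^{l-1}_k+b^l_j$ and of $a^l = \sigma(z^l)$, here is a minimal sketch in plain Python. It assumes the sigmoid activation from the last chapter; the helper names and the sample weights, biases and activations are made up for the example, and a real implementation would normally use a matrix library rather than nested lists.
</p>

```python
import math


def sigmoid(z):
    """Logistic function applied to a single number."""
    return 1.0 / (1.0 + math.exp(-z))


def weighted_input(w, a_prev, b):
    """Return z^l, where z^l_j = sum_k w[j][k] * a_prev[k] + b[j].

    w is stored as a list of rows: row j holds the weights into neuron j
    of layer l, one entry per neuron k of layer l-1.
    """
    return [sum(w_jk * a_k for w_jk, a_k in zip(row, a_prev)) + b_j
            for row, b_j in zip(w, b)]


def activation(z):
    """Return a^l = sigma(z^l), applying sigma to each component."""
    return [sigmoid(z_j) for z_j in z]


# Hypothetical layer: 2 neurons fed by 3 activations from the previous layer.
w = [[0.5, -1.0, 2.0],
     [1.0, 0.0, -0.5]]
b = [0.0, 1.0]
a_prev = [1.0, 2.0, 0.5]

z = weighted_input(w, a_prev, b)  # the weighted input z^l
a = activation(z)                 # the activations a^l
```

<p>
Row $j$ of <code>w</code> multiplies the incoming activations directly, which is the "apply the weight matrix to the activations" reading that the $w^l_{jk}$ index order is chosen to preserve.
</p>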
<p><h3><a name="the_two_assumptions_we_need_about_the_cost_function"></a>
<a href="#the_two_assumptions_we_need_about_the_cost_function">
<!--The two assumptions we need about the cost function-->
The two assumptions we need about the cost function
</a></h3></p>
<p>
<!--The goal of backpropagation is to compute the partial derivatives
$\partial C / \partial w$ and $\partial C / \partial b$ of the cost
function $C$ with respect to any weight $w$ or bias $b$ in the
network.  For backpropagation to work we need to make two main
assumptions about the form of the cost function.  Before stating those
assumptions, though, it's useful to have an example cost function in
mind. -->
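The goal of backpropagation is to compute the partial derivatives $\partial C / \partial w$ and $\partial C / \partial b$ of the cost function $C$ with respect to any weight $w$ or bias $b$ in the network.
For backpropagation to work we need to make two main assumptions about the form of the cost function.
Before stating those assumptions, though, it's useful to have an example cost function in mind.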
<!--We'll use the quadratic cost function from last chapter
(c.f. Equation <span id="margin_342696826100_reveal" class="equation_link">(6)</span><span id="margin_342696826100" class="marginequation" style="display: none;"><a href="chap1.html#eqtn6" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}  C(w,b) \equiv \frac{1}{2n} \sum_x \| y(x) - a\|^2 \nonumber\end{eqnarray}</a></span><script>$('#margin_342696826100_reveal').click(function() {$('#margin_342696826100').toggle('slow', function() {});});</script>).-->
We'll use the quadratic cost function from the last chapter (cf. Equation <span id="margin_342696826100_reveal" class="equation_link">(6)</span><span id="margin_342696826100" class="marginequation" style="display: none;"><a href="chap1.html#eqtn6" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}  C(w,b) \equiv \frac{1}{2n} \sum_x \| y(x) - a\|^2 \nonumber\end{eqnarray}</a></span><script>$('#margin_342696826100_reveal').click(function() {$('#margin_342696826100').toggle('slow', function() {});});</script>).
<!--In the notation of the last section, the quadratic cost has the form
<a class="displaced_anchor" name="eqtn26"></a>\begin{eqnarray}
  C = \frac{1}{2n} \sum_x \|y(x)-a^L(x)\|^2,
\tag{26}\end{eqnarray}
where: $n$ is the total number of training examples; the sum is over
individual training examples, $x$; $y = y(x)$ is the corresponding
desired output; $L$ denotes the number of layers in the network; and
$a^L = a^L(x)$ is the vector of activations output from the network
when $x$ is input.-->
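In the notation of the last section, the quadratic cost has the form
<a class="displaced_anchor" name="eqtn26"></a>\begin{eqnarray}
  C = \frac{1}{2n} \sum_x \|y(x)-a^L(x)\|^2,
\tag{26}\end{eqnarray}
where $n$ is the total number of training examples; the sum is over individual training examples, $x$; $y = y(x)$ is the corresponding desired output; $L$ denotes the number of layers in the network; and $a^L = a^L(x)$ is the vector of activations output from the network when $x$ is input.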
</p>
<p>
<!--Okay, so what assumptions do we need to make about our cost function,
$C$, in order that backpropagation can be applied?  The first
assumption we need is that the cost function can be written as an
average $C = \frac{1}{n} \sum_x C_x$ over cost functions $C_x$ for
individual training examples, $x$.  This is the case for the quadratic
cost function, where the cost for a single training example is $C_x =
\frac{1}{2} \|y-a^L \|^2$.  This assumption will also hold true for
all the other cost functions we'll meet in this book.-->
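So what assumptions do we need to make about our cost function, $C$, in order that backpropagation can be applied?
The first assumption is that the cost function can be written as an average $C = \frac{1}{n} \sum_x C_x$ over cost functions $C_x$ for individual training examples, $x$.
This is the case for the quadratic cost function,
where the cost for a single training example is $C_x = \frac{1}{2} \|y-a^L \|^2$.
This assumption will also hold true for all the other cost functions we'll meet in this book.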
</p>
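<p>
The first assumption can be checked directly for the quadratic cost. The sketch below, in plain Python, computes $C_x = \frac{1}{2} \|y-a^L\|^2$ for each training example and then averages. The function names and the two toy training examples are assumptions made for the illustration, not part of the book's code.
</p>

```python
def example_cost(y, a_out):
    """Quadratic cost for one training example: C_x = 1/2 * ||y - a^L||^2."""
    return 0.5 * sum((y_j - a_j) ** 2 for y_j, a_j in zip(y, a_out))


def total_cost(targets, outputs):
    """Average the per-example costs: C = (1/n) * sum_x C_x."""
    n = len(targets)
    return sum(example_cost(y, a) for y, a in zip(targets, outputs)) / n


# Two hypothetical training examples: desired outputs y(x) and the
# network outputs a^L(x) they are compared against.
targets = [[1.0, 0.0], [0.0, 1.0]]
outputs = [[0.5, 0.5], [0.0, 1.0]]

C = total_cost(targets, outputs)
```

<p>
Because the total cost is an average of per-example costs, any derivative of $C$ is the same average of the derivatives of the $C_x$.
</p>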
<p>
<!--The reason we need this assumption is because what backpropagation
actually lets us do is compute the partial derivatives $\partial C_x
/ \partial w$ and $\partial C_x / \partial b$ for a single training
example.  We then recover $\partial C / \partial w$ and $\partial C
/ \partial b$ by averaging over training examples.  In fact, with this
assumption in mind, we'll suppose the training example $x$ has been
fixed, and drop the $x$ subscript, writing the cost $C_x$ as $C$.
We'll eventually put the $x$ back in, but for now it's a notational
nuisance that is better left implicit.-->
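The reason we need this assumption is that what backpropagation actually lets us do is compute the partial derivatives $\partial C_x / \partial w$ and $\partial C_x / \partial b$ for a single training example.
We then recover $\partial C / \partial w$ and $\partial C / \partial b$ by averaging over training examples.
In fact, with this assumption in mind, we'll suppose the training example $x$ has been fixed, and drop the $x$ subscript, writing the cost $C_x$ as $C$. We'll eventually put the $x$ back in, but for now it's a notational nuisance that is better left implicit.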
</p>
<p>
<!--The second assumption we make about the cost is that it can be written
as a function of the outputs from the neural network:
<center>
<img src="images/tikz18.png"/>
</center>-->
The second assumption we make about the cost is that it can be written as a function of the outputs from the neural network:
<center>
<img src="images/tikz18.png"/>
</center>
<!--For example, the quadratic cost function satisfies this requirement,
since the quadatic cost for a single training example $x$ may be
written as
<a class="displaced_anchor" name="eqtn27"></a>\begin{eqnarray}
  C = \frac{1}{2} \|y-a^L\|^2 = \frac{1}{2} \sum_j (y_j-a^L_j)^2,
\tag{27}\end{eqnarray}
and thus is a function of the output activations. -->
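For example, the quadratic cost function satisfies this requirement, since the quadratic cost for a single training example $x$ may be written as
<a class="displaced_anchor" name="eqtn27"></a>\begin{eqnarray}
  C = \frac{1}{2} \|y-a^L\|^2 = \frac{1}{2} \sum_j (y_j-a^L_j)^2,
\tag{27}\end{eqnarray}
and thus is a function of the output activations.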
<!--Of course, this
cost function also depends on the desired output $y$, and you may
wonder why we're not regarding the cost also as a function of $y$.
Remember, though, that the input training example $x$ is fixed, and so
the output $y$ is also a fixed parameter.  In particular, it's not
something we can modify by changing the weights and biases in any way,
i.e., it's not something which the neural network learns.  And so it
makes sense to regard $C$ as a function of the output activations
$a^L$ alone, with $y$ merely a parameter that helps define that
function.-->
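Of course, this cost function also depends on the desired output $y$,
and you may wonder why we're not regarding the cost also as a function of $y$.
Remember, though, that the input training example $x$ is fixed, and so the desired output $y$ is also a fixed parameter.
In particular, it's not something we can modify by changing the weights and biases in any way, i.e., it's not something which the neural network learns.
And so it makes sense to regard $C$ as a function of the output activations $a^L$ alone, with $y$ merely a parameter that helps define that function.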
</p>
<p></p>
<p></p>
<p></p>
<p><h3><a name="the_hadamard_product_$s_\odot_t$"></a><a href="#the_hadamard_product_$s_\odot_t$"><!--The Hadamard product, $s \odot t$-->
The Hadamard product, $s \odot t$
</a></h3></p>
<p>
<!--The backpropagation algorithm is based on common linear algebraic
operations - things like vector addition, multiplying a vector by a
matrix, and so on.  But one of the operations is a little less
commonly used.  In particular, suppose $s$ and $t$ are two vectors of
the same dimension.  Then we use $s \odot t$ to denote the
<em>elementwise</em> product of the two vectors.  Thus the components of
$s \odot t$ are just $(s \odot t)_j = s_j t_j$. -->
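The backpropagation algorithm is based on common linear algebraic operations: things like vector addition, multiplying a vector by a matrix, and so on.
But one of the operations is a little less commonly used.
In particular, suppose $s$ and $t$ are two vectors of the same dimension. Then we use $s \odot t$ to denote the <em>elementwise</em> product of the two vectors. Thus the components of $s \odot t$ are just $(s \odot t)_j = s_j t_j$.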
<!-- As an example,
<a class="displaced_anchor" name="eqtn28"></a>\begin{eqnarray}
\left[\begin{array}{c} 1 \\ 2 \end{array}\right]
  \odot \left[\begin{array}{c} 3 \\ 4\end{array} \right]
= \left[ \begin{array}{c} 1 * 3 \\ 2 * 4 \end{array} \right]
= \left[ \begin{array}{c} 3 \\ 8 \end{array} \right].
\tag{28}\end{eqnarray}
This kind of elementwise multiplication is sometimes called the
<em>Hadamard product</em> or <em>Schur product</em>.  We'll refer to it as
the Hadamard product.  Good matrix libraries usually provide fast
implementations of the Hadamard product, and that comes in handy when
implementing backpropagation.-->
As an example,
<a class="displaced_anchor" name="eqtn28"></a>\begin{eqnarray}
\left[\begin{array}{c} 1 \\ 2 \end{array}\right]
  \odot \left[\begin{array}{c} 3 \\ 4\end{array} \right]
= \left[ \begin{array}{c} 1 * 3 \\ 2 * 4 \end{array} \right]
= \left[ \begin{array}{c} 3 \\ 8 \end{array} \right].
\tag{28}\end{eqnarray}
This kind of elementwise multiplication is sometimes called the <em>Hadamard product</em> or <em>Schur product</em>.
We'll refer to it as the Hadamard product.
Good matrix libraries usually provide fast implementations of the Hadamard product, and that comes in handy when implementing backpropagation.
</p>
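<p>
To make the definition $(s \odot t)_j = s_j t_j$ concrete, here is a small plain-Python sketch that reproduces the example above. The function name <code>hadamard</code> is chosen for this illustration only. A matrix library would do the same thing with a single vectorized multiplication.
</p>

```python
def hadamard(s, t):
    """Elementwise product of two vectors of the same dimension."""
    if len(s) != len(t):
        raise ValueError("the Hadamard product needs vectors of equal dimension")
    return [s_j * t_j for s_j, t_j in zip(s, t)]


# The worked example from the text: [1, 2] (Hadamard) [3, 4].
product = hadamard([1, 2], [3, 4])
```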
<p><h3><a name="the_four_fundamental_equations_behind_backpropagation"></a><a href="#the_four_fundamental_equations_behind_backpropagation">
<!--The four fundamental equations behind backpropagation-->
The four fundamental equations behind backpropagation
</a></h3></p>
<p>
<!--Backpropagation is about understanding how changing the weights and
biases in a network changes the cost function.  Ultimately, this means
computing the partial derivatives $\partial C / \partial w^l_{jk}$ and
$\partial C / \partial b^l_j$.  But to compute those, we first
introduce an intermediate quantity, $\delta^l_j$, which we call the
<em>error</em> in the $j^{\rm th}$ neuron in the $l^{\rm th}$ layer.
Backpropagation will give us a procedure to compute the error
$\delta^l_j$, and then will relate $\delta^l_j$ to $\partial C
/ \partial w^l_{jk}$ and $\partial C / \partial b^l_j$.-->
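Backpropagation is about understanding how changing the weights and biases in a network changes the cost function.
Ultimately, this means computing the partial derivatives $\partial C / \partial w^l_{jk}$ and $\partial C / \partial b^l_j$.
But to compute those, we first introduce an intermediate quantity, $\delta^l_j$,
which we call the <em>error</em> in the $j^{\rm th}$ neuron in the $l^{\rm th}$ layer.
Backpropagation will give us a procedure to compute the error $\delta^l_j$, and then will relate $\delta^l_j$ to $\partial C/ \partial w^l_{jk}$ and $\partial C / \partial b^l_j$.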
</p>
<p>
<!--To understand how the error is defined, imagine there is a demon in
our neural network:
<center>
<img src="images/tikz19.png"/>
</center>
The demon sits at the $j^{\rm th}$ neuron in layer $l$.  As the input to the
neuron comes in, the demon messes with the neuron's operation.  It
adds a little change $\Delta z^l_j$ to the neuron's weighted input, so
that instead of outputting $\sigma(z^l_j)$, the neuron instead outputs
$\sigma(z^l_j+\Delta z^l_j)$.  This change propagates through later
layers in the network, finally causing the overall cost to change by
an amount $\frac{\partial C}{\partial z^l_j} \Delta z^l_j$.-->
To understand how the error is defined, imagine there is a demon in our neural network:
<center>
<img src="images/tikz19.png"/>
</center>
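The demon sits at the $j^{\rm th}$ neuron in layer $l$.
As the input to the neuron comes in, the demon messes with the neuron's operation. It adds a little change $\Delta z^l_j$ to the neuron's weighted input,
so that instead of outputting $\sigma(z^l_j)$, the neuron outputs $\sigma(z^l_j+\Delta z^l_j)$.
This change propagates through later layers in the network, finally causing the overall cost to change by an amount $\frac{\partial C}{\partial z^l_j} \Delta z^l_j$.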
</p>
<p>
<!--Now, this demon is a good demon, and is trying to help you improve the cost, i.e., they're trying to find a $\Delta z^l_j$ which makes the
cost smaller.
Suppose $\frac{\partial C}{\partial z^l_j}$ has a large value (either positive or negative).
Then the demon can lower the cost quite a bit by choosing $\Delta z^l_j$ to have the opposite sign to $\frac{\partial C}{\partial z^l_j}$.  By contrast, if $\frac{\partial C}{\partial z^l_j}$ is close to zero, then the demon can't improve the cost much at all by perturbing the weighted input
$z^l_j$.-->
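Now, this demon is a good demon, and is trying to help you improve the cost, i.e., it is trying to find a $\Delta z^l_j$ which makes the cost smaller.
Suppose $\frac{\partial C}{\partial z^l_j}$ has a large value (either positive or negative).
Then the demon can lower the cost quite a bit by choosing $\Delta z^l_j$ to have the opposite sign to $\frac{\partial C}{\partial z^l_j}$.
By contrast, if $\frac{\partial C}{\partial z^l_j}$ is close to zero, then the demon can't improve the cost much at all by perturbing the weighted input $z^l_j$.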
<!--So far as the demon can tell, the neuron is already pretty near optimal*<span class="marginnote">
*This is only the case for small changes $\Delta
  z^l_j$, of course. We'll assume that the demon is constrained to
  make such small changes.</span>.  And so there's a heuristic sense in
which $\frac{\partial C}{\partial z^l_j}$ is a measure of the error in
the neuron.-->
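So far as the demon can tell, the neuron is already pretty near optimal*<span class="marginnote">
*This is only the case for small changes $\Delta z^l_j$, of course. We'll assume that the demon is constrained to make such small changes.</span>.
And so there's a heuristic sense in which $\frac{\partial C}{\partial z^l_j}$ is a measure of the error in the neuron.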
</p>
<p>
<!--Motivated by this story, we define the error $\delta^l_j$ of neuron
$j$ in layer $l$ by
<a class="displaced_anchor" name="eqtn29"></a>\begin{eqnarray}
  \delta^l_j \equiv \frac{\partial C}{\partial z^l_j}.
\tag{29}\end{eqnarray}
As per our usual conventions, we use $\delta^l$ to denote the vector
of errors associated with layer $l$.  Backpropagation will give us a
way of computing $\delta^l$ for every layer, and then relating those
errors to the quantities of real interest, $\partial C / \partial
w^l_{jk}$ and $\partial C / \partial b^l_j$.-->
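Motivated by this story, we define the error $\delta^l_j$ of neuron $j$ in layer $l$ by
<a class="displaced_anchor" name="eqtn29"></a>\begin{eqnarray}
  \delta^l_j \equiv \frac{\partial C}{\partial z^l_j}.
\tag{29}\end{eqnarray}
As per our usual conventions, we use $\delta^l$ to denote the vector of errors associated with layer $l$.
Backpropagation will give us a way of computing $\delta^l$ for every layer, and then relating those errors to the quantities of real interest, $\partial C / \partial w^l_{jk}$ and $\partial C / \partial b^l_j$.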
</p>
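<p>
The demon story can be tried out numerically. The sketch below is only an illustration under simplifying assumptions: it treats the layer in question as a sigmoid output layer with the quadratic cost, so that the cost can be evaluated directly from the weighted input, and it estimates $\delta_j = \partial C / \partial z_j$ with a central finite difference rather than with backpropagation. All names and numbers are hypothetical.
</p>

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cost_from_weighted_input(z, y):
    """Quadratic cost of a sigmoid output layer, as a function of its weighted input z."""
    a = [sigmoid(z_j) for z_j in z]
    return 0.5 * sum((y_j - a_j) ** 2 for y_j, a_j in zip(y, a))


def estimate_error(z, y, j, h=1e-6):
    """Central-difference estimate of delta_j = dC/dz_j."""
    z_plus = list(z)
    z_plus[j] += h
    z_minus = list(z)
    z_minus[j] -= h
    return (cost_from_weighted_input(z_plus, y)
            - cost_from_weighted_input(z_minus, y)) / (2.0 * h)


z = [0.3, -1.2]  # hypothetical weighted inputs
y = [1.0, 0.0]   # hypothetical desired output

delta = [estimate_error(z, y, j) for j in range(len(z))]

# The demon nudges each z_j by a small amount with the opposite sign to delta_j.
step = 0.01
z_nudged = [z_j - step * math.copysign(1.0, d_j) for z_j, d_j in zip(z, delta)]

cost_before = cost_from_weighted_input(z, y)
cost_after = cost_from_weighted_input(z_nudged, y)

# First-order prediction of the change in cost: sum_j delta_j * Delta z_j.
predicted_change = sum(d_j * (zn_j - z_j)
                       for d_j, zn_j, z_j in zip(delta, z_nudged, z))
```

<p>
For small nudges the actual change in cost, <code>cost_after - cost_before</code>, should be close to <code>predicted_change</code>, and it should be negative because every nudge opposes the sign of the corresponding $\delta_j$.
</p>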
<p>
<!--You might wonder why the demon is changing the weighted input $z^l_j$.
Surely it'd be more natural to imagine the demon changing the output
activation $a^l_j$, with the result that we'd be using $\frac{\partial
  C}{\partial a^l_j}$ as our measure of error.  In fact, if you do
this things work out quite similarly to the discussion below.  But it
turns out to make the presentation of backpropagation a little more
algebraically complicated. -->
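You might wonder why the demon is changing the weighted input $z^l_j$.
Surely it would be more natural to imagine the demon changing the output activation $a^l_j$, with the result that we'd be using $\frac{\partial C}{\partial a^l_j}$ as our measure of error.
In fact, if you do this, things work out quite similarly to the discussion below.
But it turns out to make the presentation of backpropagation a little more algebraically complicated.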
<!--So we'll stick with $\delta^l_j =
\frac{\partial C}{\partial z^l_j}$ as our measure of error*<span class="marginnote">
*In
  classification problems like MNIST the term "error" is sometimes
  used to mean the classification failure rate.  E.g., if the neural
  net correctly classifies 96.0 percent of the digits, then the error
  is 4.0 percent.  Obviously, this has quite a different meaning from
  our $\delta$ vectors.  In practice, you shouldn't have trouble
  telling which meaning is intended in any given useage.</span>.-->
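So we'll stick with $\delta^l_j = \frac{\partial C}{\partial z^l_j}$ as our measure of error*<span class="marginnote">
*In classification problems like MNIST the term "error" is sometimes used to mean the classification failure rate.
E.g., if the neural net correctly classifies 96.0 percent of the digits, then the error is 4.0 percent.
Obviously, this has quite a different meaning from our $\delta$ vectors.
In practice, you shouldn't have trouble telling which meaning is intended in any given usage.
</span>.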
</p>
<p>
<!--<strong>Plan of attack:</strong> Backpropagation is based around four
fundamental equations.
Together, those equations give us a way of computing both the error $\delta^l$ and the gradient of the cost function.
I state the four equations below.
Be warned, though: you shouldn't expect to instantaneously assimilate the equations.
Such an expectation will lead to disappointment.
In fact, the backpropagation equations are so rich that understanding them well requires considerable time and patience as you gradually delve deeper into the equations.
The good news is that such patience is repaid many times over.
And so the discussion in this section is merely a beginning, helping you on the way to a thorough understanding of the equations.-->
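<strong>Plan of attack:</strong>
Backpropagation is based around four fundamental equations.
Together, those equations give us a way of computing both the error $\delta^l$ and the gradient of the cost function.
I state the four equations below. Be warned, though: you shouldn't expect to instantaneously assimilate the equations.
Such an expectation will lead to disappointment.
In fact, the backpropagation equations are so rich that understanding them well requires considerable time and patience as you gradually delve deeper into the equations.
The good news is that such patience is repaid many times over.
And so the discussion in this section is merely a beginning, helping you on the way to a thorough understanding of the equations.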
</p>
<p>
<!--Here's a preview of the ways we'll delve more deeply into the
equations later in the chapter: I'll
<a href="chap2.html#proof_of_the_four_fundamental_equations_(optional)">give a short proof of the equations</a>
, which helps explain why they are true; we'll
<a href="chap2.html#the_backpropagation_algorithm">restate the equations</a>
in algorithmic form as pseudocode, and
<a href="chap2.html#the_code_for_backpropagation">see how</a>
the pseudocode can be implemented as real, running Python code; and, in
<a href="chap2.html#backpropagation_the_big_picture">the final section of the chapter</a>
, we'll develop an intuitive picture of what the backpropagation equations mean, and how someone might discover them from scratch.
Along the way we'll return repeatedly to the four
fundamental equations, and as you deepen your understanding those
equations will come to seem comfortable and, perhaps, even beautiful
and natural.-->
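Here's a preview of the ways we'll delve more deeply into the equations later in the chapter.
First, I'll
<a href="chap2.html#proof_of_the_four_fundamental_equations_(optional)">give a short proof of the equations</a>,
which helps explain why they are true.
We'll then <a href="chap2.html#the_backpropagation_algorithm">restate the equations</a> in algorithmic form as pseudocode,
and <a href="chap2.html#the_code_for_backpropagation">see how</a> the pseudocode can be implemented as real, running Python code.
In <a href="chap2.html#backpropagation_the_big_picture">the final section of the chapter</a>, we'll develop an intuitive picture of what the backpropagation equations mean, and how someone might discover them from scratch.
Along the way we'll return repeatedly to the four fundamental equations.
As you deepen your understanding, those equations will come to seem comfortable and, perhaps, even beautiful and natural.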
</p>
<p><strong>
<!--An equation for the error in the output layer, $\delta^L$:-->
An equation for the error in the output layer, $\delta^L$:
</strong>
<!--The components of $\delta^L$ are given by
<a class="displaced_anchor" name="eqtnBP1"></a>\begin{eqnarray}
  \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j).
\tag{BP1}\end{eqnarray}-->
The components of $\delta^L$ are given by
<a class="displaced_anchor" name="eqtnBP1"></a>\begin{eqnarray}
  \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j).
\tag{BP1}\end{eqnarray}
<!--This is a very natural expression.  The first term on the right,
$\partial C / \partial a^L_j$, just measures how fast the cost is
changing as a function of the $j^{\rm th}$ output activation.  If, for
example, $C$ doesn't depend much on a particular output neuron, $j$,
then $\delta^L_j$ will be small, which is what we'd expect.  The
second term on the right, $\sigma'(z^L_j)$, measures how fast the
activation function $\sigma$ is changing at $z^L_j$.-->
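This is a very natural expression. The first term on the right, $\partial C / \partial a^L_j$, just measures how fast the cost is changing as a function of the $j^{\rm th}$ output activation.
If, for example, $C$ doesn't depend much on a particular output neuron, $j$, then $\delta^L_j$ will be small, which is what we'd expect.
The second term on the right, $\sigma'(z^L_j)$, measures how fast the activation function $\sigma$ is changing at $z^L_j$.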
</p>
<p>
<!--Notice that everything in
<span id="margin_147792633206_reveal" class="equation_link">(BP1)</span><span id="margin_147792633206" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_147792633206_reveal').click(function() {$('#margin_147792633206').toggle('slow', function() {});});</script>
is easily computed.-->
Notice that everything in
<span id="margin_147792633206_reveal" class="equation_link">(BP1)</span><span id="margin_147792633206" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_147792633206_reveal').click(function() {$('#margin_147792633206').toggle('slow', function() {});});</script>
is easily computed.
<!--In particular, we compute $z^L_j$ while computing the behaviour of the network, and it's only a small additional overhead to compute
$\sigma'(z^L_j)$.  The exact form of $\partial C / \partial a^L_j$
will, of course, depend on the form of the cost function.  However,
provided the cost function is known there should be little trouble
computing $\partial C / \partial a^L_j$.  For example, if we're using
the quadratic cost function then $C = \frac{1}{2} \sum_j (y_j-a_j)^2$,
and so $\partial C / \partial a^L_j = (a_j-y_j)$, which obviously is
easily computable.-->
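In particular, we compute $z^L_j$ while computing the behaviour of the network, and it's only a small additional overhead to compute $\sigma'(z^L_j)$. So the second term is readily available.
<!--
2014/12/28 Kenta OONO
Added the last sentence above as a supplementary explanation.
-->
As for the first term, the exact form of $\partial C / \partial a^L_j$ will, of course, depend on the form of the cost function. However, provided the cost function is known, there should be little trouble computing $\partial C / \partial a^L_j$.
For example, if we're using the quadratic cost function then $C = \frac{1}{2} \sum_j (y_j-a^L_j)^2$, and so $\partial C / \partial a^L_j = (a^L_j-y_j)$, which is easily computable.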
</p>
<p>
<!--Equation
<span id="margin_352512149181_reveal" class="equation_link">(BP1)</span><span id="margin_352512149181" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_352512149181_reveal').click(function() {$('#margin_352512149181').toggle('slow', function() {});});</script>
is a componentwise expression for $\delta^L$.-->
Equation <span id="margin_352512149181_reveal" class="equation_link">(BP1)</span><span id="margin_352512149181" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_352512149181_reveal').click(function() {$('#margin_352512149181').toggle('slow', function() {});});</script>
is a componentwise expression for $\delta^L$.
<!--It's a perfectly good expression, but not the matrix-based form we
want for backpropagation. However, it's easy to rewrite the equation
in a matrix-based form, as
<a class="displaced_anchor" name="eqtnBP1a"></a>\begin{eqnarray}
  \delta^L = \nabla_a C \odot \sigma'(z^L).
\tag{BP1a}\end{eqnarray}
Here, $\nabla_a C$ is defined to be a vector whose components are the
partial derivatives $\partial C / \partial a^L_j$.
You can think of $\nabla_a C$ as expressing the rate of change of $C$ with respect to the output activations. -->
It's a perfectly good expression, but not the matrix-based form we want for backpropagation.
However, it's easy to rewrite the equation in a matrix-based form, as
<a class="displaced_anchor" name="eqtnBP1a"></a>\begin{eqnarray}
  \delta^L = \nabla_a C \odot \sigma'(z^L).
\tag{BP1a}\end{eqnarray}
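Here, $\nabla_a C$ is defined to be a vector whose components are the partial derivatives $\partial C / \partial a^L_j$.
You can think of $\nabla_a C$ as expressing the rate of change of $C$ with respect to the output activations.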
<!--It's easy to see that Equations
<span id="margin_744045597073_reveal" class="equation_link">(BP1a)</span><span id="margin_744045597073" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1a" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^L = \nabla_a C \odot \sigma'(z^L) \nonumber\end{eqnarray}</a></span><script>$('#margin_744045597073_reveal').click(function() {$('#margin_744045597073').toggle('slow', function() {});});</script>
and
<span id="margin_713595193686_reveal" class="equation_link">(BP1)</span><span id="margin_713595193686" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_713595193686_reveal').click(function() {$('#margin_713595193686').toggle('slow', function() {});});</script>
are equivalent, and for that reason from now on we'll use
<span id="margin_645562843841_reveal" class="equation_link">(BP1)</span><span id="margin_645562843841" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_645562843841_reveal').click(function() {$('#margin_645562843841').toggle('slow', function() {});});</script>
interchangeably to refer to both equations.-->
It's easy to see that Equations
<span id="margin_744045597073_reveal" class="equation_link">(BP1a)</span><span id="margin_744045597073" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1a" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^L = \nabla_a C \odot \sigma'(z^L) \nonumber\end{eqnarray}</a></span><script>$('#margin_744045597073_reveal').click(function() {$('#margin_744045597073').toggle('slow', function() {});});</script>
and
<span id="margin_713595193686_reveal" class="equation_link">(BP1)</span><span id="margin_713595193686" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_713595193686_reveal').click(function() {$('#margin_713595193686').toggle('slow', function() {});});</script>
are equivalent, and for that reason from now on we'll use
<span id="margin_645562843841_reveal" class="equation_link">(BP1)</span><span id="margin_645562843841" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_645562843841_reveal').click(function() {$('#margin_645562843841').toggle('slow', function() {});});</script>
interchangeably to refer to both equations.
<!--As an example, in the case of the quadratic cost we have $\nabla_a C = (a^L-y)$, and so the fully matrix-based form of
<span id="margin_864112170226_reveal" class="equation_link">(BP1)</span><span id="margin_864112170226" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_864112170226_reveal').click(function() {$('#margin_864112170226').toggle('slow', function() {});});</script>
becomes
<a class="displaced_anchor" name="eqtn30"></a>\begin{eqnarray}
  \delta^L = (a^L-y) \odot \sigma'(z^L).
\tag{30}\end{eqnarray}
As you can see, everything in this expression has a nice vector form,
and is easily computed using a library such as Numpy.-->
As an example, in the case of the quadratic cost we have $\nabla_a C = (a^L-y)$,
and so the fully matrix-based form of
<span id="margin_864112170226_reveal" class="equation_link">(BP1)</span><span id="margin_864112170226" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_864112170226_reveal').click(function() {$('#margin_864112170226').toggle('slow', function() {});});</script>
becomes
<a class="displaced_anchor" name="eqtn30"></a>\begin{eqnarray}
  \delta^L = (a^L-y) \odot \sigma'(z^L).
\tag{30}\end{eqnarray}
As you can see, everything in this expression has a nice vector form, and is easily computed using a library such as Numpy.
</p>
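<p>
The quadratic-cost form $\delta^L = (a^L-y) \odot \sigma'(z^L)$ can be sketched in a few lines. The version below uses plain Python lists so that it has no dependencies; it assumes the sigmoid activation, for which $\sigma'(z) = \sigma(z)(1-\sigma(z))$, and the function names and sample values are invented for the example.
</p>

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    """Derivative of the sigmoid: sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


def output_error(a_out, y, z_out):
    """delta^L = (a^L - y) Hadamard sigma'(z^L), for the quadratic cost."""
    return [(a_j - y_j) * sigmoid_prime(z_j)
            for a_j, y_j, z_j in zip(a_out, y, z_out)]


# Hypothetical output layer with two neurons.
z_L = [0.0, 2.0]
a_L = [sigmoid(z_j) for z_j in z_L]
y = [1.0, 0.0]

delta_L = output_error(a_L, y, z_L)
```

<p>
The first component works out to $(0.5 - 1) \times 0.25 = -0.125$: the neuron outputs less than the desired value, so increasing its weighted input would lower the cost.
</p>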
<p><strong><!--An equation for the error $\delta^l$ in terms of the error in the next layer, $\delta^{l+1}$:-->
An equation for the error $\delta^l$ in terms of the error in the next layer, $\delta^{l+1}$:
</strong>
<!--In particular
<a class="displaced_anchor" name="eqtnBP2"></a>\begin{eqnarray}
  \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l),
\tag{BP2}\end{eqnarray}
where $(w^{l+1})^T$ is the tranpose of the weight matrix $w^{l+1}$ for
the $(l+1)^{\rm th}$ layer.-->
In particular
<a class="displaced_anchor" name="eqtnBP2"></a>\begin{eqnarray}
  \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l),
\tag{BP2}\end{eqnarray}
where $(w^{l+1})^T$ is the transpose of the weight matrix $w^{l+1}$ for the $(l+1)^{\rm th}$ layer.
<!-- This equation appears complicated, but each element has a nice interpretation.
Suppose we know the error $\delta^{l+1}$ at the $l+1^{\rm th}$ layer.
When we apply the transpose weight matrix, $(w^{l+1})^T$, we can think intuitively of this as moving the error <em>backward</em> through the network, giving us some sort of measure of the error at the output of the $l^{\rm th}$ layer.
We then take the Hadamard product $\odot \sigma'(z^l)$.  This
moves the error backward through the activation function in layer $l$,
giving us the error $\delta^l$ in the weighted input to layer $l$.-->
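This equation appears complicated, but each element has a nice interpretation.
Suppose we know the error $\delta^{l+1}$ at the $(l+1)^{\rm th}$ layer.
When we apply the transpose weight matrix, $(w^{l+1})^T$, we can think intuitively of this as moving the error <em>backward</em> through the network,
giving us some sort of measure of the error at the output of the $l^{\rm th}$ layer.
We then take the Hadamard product with $\sigma'(z^l)$.
This moves the error backward through the activation function in layer $l$,
giving us the error $\delta^l$ in the weighted input to layer $l$.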
</p>
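<p>
The two steps in (BP2), applying $(w^{l+1})^T$ and then taking the Hadamard product with $\sigma'(z^l)$, can be written out as follows. This is a dependency-free sketch in plain Python that assumes the sigmoid activation; the function names, weights and errors are made up for the illustration.
</p>

```python
import math


def sigmoid_prime(z):
    """Derivative of the sigmoid: sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def backpropagate_error(w_next, delta_next, z):
    """delta^l = ((w^{l+1})^T delta^{l+1}) Hadamard sigma'(z^l).

    w_next[j][k] is the weight from neuron k in layer l to neuron j in layer l+1.
    """
    # Step 1: apply the transpose, so component k sums over the neurons j of layer l+1.
    moved_back = [sum(w_next[j][k] * delta_next[j] for j in range(len(delta_next)))
                  for k in range(len(z))]
    # Step 2: Hadamard product with sigma'(z^l).
    return [m_k * sigmoid_prime(z_k) for m_k, z_k in zip(moved_back, z)]


# Hypothetical sizes: 3 neurons in layer l, 2 neurons in layer l+1.
w_next = [[1.0, 2.0, 0.0],
          [0.0, -1.0, 4.0]]
delta_next = [0.5, -0.25]
z = [0.0, 0.0, 0.0]

delta = backpropagate_error(w_next, delta_next, z)
```

<p>
With every $z^l_k = 0$ the factor $\sigma'(z^l_k)$ is $0.25$, so the result is simply a quarter of $(w^{l+1})^T \delta^{l+1} = (0.5, 1.25, -1.0)$.
</p>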
<p>
<!--By combining
<span id="margin_333455157300_reveal" class="equation_link">(BP2)</span><span id="margin_333455157300" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP2" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \nonumber\end{eqnarray}</a></span><script>$('#margin_333455157300_reveal').click(function() {$('#margin_333455157300').toggle('slow', function() {});});</script>
with
<span id="margin_168271461280_reveal" class="equation_link">(BP1)</span><span id="margin_168271461280" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_168271461280_reveal').click(function() {$('#margin_168271461280').toggle('slow', function() {});});</script>
we can compute the error $\delta^l$ for any layer in the network.-->
<span id="margin_333455157300_reveal" class="equation_link">(BP2)</span><span id="margin_333455157300" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP2" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \nonumber\end{eqnarray}</a></span><script>$('#margin_333455157300_reveal').click(function() {$('#margin_333455157300').toggle('slow', function() {});});</script>
can be combined with
<span id="margin_168271461280_reveal" class="equation_link">(BP1)</span><span id="margin_168271461280" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_168271461280_reveal').click(function() {$('#margin_168271461280').toggle('slow', function() {});});</script>
to compute the error $\delta^l$ for any layer $l$ in the network.
<!--We start by using
<span id="margin_394578264286_reveal" class="equation_link">(BP1)</span><span id="margin_394578264286" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_394578264286_reveal').click(function() {$('#margin_394578264286').toggle('slow', function() {});});</script>n
to compute $\delta^L$,-->
We start by using Equation
<span id="margin_394578264286_reveal" class="equation_link">(BP1)</span><span id="margin_394578264286" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_394578264286_reveal').click(function() {$('#margin_394578264286').toggle('slow', function() {});});</script>
to compute $\delta^L$.
<!--then apply Equation
<span id="margin_565392375859_reveal" class="equation_link">(BP2)</span><span id="margin_565392375859" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP2" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \nonumber\end{eqnarray}</a></span><script>$('#margin_565392375859_reveal').click(function() {$('#margin_565392375859').toggle('slow', function() {});});</script>
to compute $\delta^{L-1}$,-->
Next, we apply Equation
<span id="margin_565392375859_reveal" class="equation_link">(BP2)</span><span id="margin_565392375859" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP2" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \nonumber\end{eqnarray}</a></span><script>$('#margin_565392375859_reveal').click(function() {$('#margin_565392375859').toggle('slow', function() {});});</script>
to compute $\delta^{L-1}$.
<!-- then Equation
<span id="margin_646381930468_reveal" class="equation_link">(BP2)</span><span id="margin_646381930468" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP2" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \nonumber\end{eqnarray}</a></span><script>$('#margin_646381930468_reveal').click(function() {$('#margin_646381930468').toggle('slow', function() {});});</script>
again to compute $\delta^{L-2}$, and so on, all the way back through the network.-->
After that, we apply
<span id="margin_646381930468_reveal" class="equation_link">(BP2)</span><span id="margin_646381930468" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP2" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \nonumber\end{eqnarray}</a></span><script>$('#margin_646381930468_reveal').click(function() {$('#margin_646381930468').toggle('slow', function() {});});</script>
again to compute $\delta^{L-2}$, and so on, all the way back through the network.
</p>
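<p>
The procedure just described, (BP1) once followed by repeated use of (BP2), can be sketched as a short loop. This is only an illustrative sketch under stated assumptions: a quadratic cost (so that (BP1) takes the vector form of Equation (30)), sigmoid neurons, and a hypothetical function name <code>all_errors</code> with randomly chosen weights.
</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def all_errors(weights, biases, x, y):
    """Return the errors [delta^2, ..., delta^L] for one training example.

    weights[i] and biases[i] connect layer i+1 to layer i+2 (layers are
    numbered from 1, as in the text). A quadratic cost is assumed.
    """
    # Forward pass: store every weighted input z^l.
    a = x
    zs = []
    for w, b in zip(weights, biases):
        z = w @ a + b
        zs.append(z)
        a = sigmoid(z)

    # (BP1), in the vector form of Equation (30): error in the output layer.
    delta = (a - y) * sigmoid_prime(zs[-1])
    deltas = [delta]

    # (BP2), applied repeatedly: walk backward through the earlier layers.
    for i in range(len(weights) - 1, 0, -1):
        delta = (weights[i].T @ delta) * sigmoid_prime(zs[i - 1])
        deltas.insert(0, delta)
    return deltas

# Hypothetical network with layer sizes 2, 3, 1 and random parameters.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
biases = [rng.standard_normal(3), rng.standard_normal(1)]
x = np.array([0.5, -0.3])
y = np.array([1.0])

deltas = all_errors(weights, biases, x, y)
print([d.shape for d in deltas])
```

<p>
The list returned has one error vector per non-input layer, ordered from the first hidden layer to the output layer.
</p>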
<p><strong>
<!--An equation for the rate of change of the cost with respect to any bias in the network:-->
An equation for the rate of change of the cost with respect to any bias in the network:
</strong>
<!--In particular:
<a class="displaced_anchor" name="eqtnBP3"></a>\begin{eqnarray}  \frac{\partial C}{\partial b^l_j} =
  \delta^l_j.
\tag{BP3}\end{eqnarray}
That is, the error $\delta^l_j$ is <em>exactly equal</em> to the rate of
change $\partial C / \partial b^l_j$. -->
In particular:
<a class="displaced_anchor" name="eqtnBP3"></a>\begin{eqnarray}  \frac{\partial C}{\partial b^l_j} =
  \delta^l_j.
\tag{BP3}\end{eqnarray}
That is, the error $\delta^l_j$ is <em>exactly equal</em> to the rate of change $\partial C / \partial b^l_j$.
<!--This is great news, since
<span id="margin_553087553337_reveal" class="equation_link">(BP1)</span><span id="margin_553087553337" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_553087553337_reveal').click(function() {$('#margin_553087553337').toggle('slow', function() {});});</script>
and
<span id="margin_416964584497_reveal" class="equation_link">(BP2)</span><span id="margin_416964584497" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP2" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \nonumber\end{eqnarray}</a></span><script>$('#margin_416964584497_reveal').click(function() {$('#margin_416964584497').toggle('slow', function() {});});</script>
have already told us how to compute $\delta^l_j$.-->
<span id="margin_553087553337_reveal" class="equation_link">(BP1)</span><span id="margin_553087553337" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_553087553337_reveal').click(function() {$('#margin_553087553337').toggle('slow', function() {});});</script>
and
<span id="margin_416964584497_reveal" class="equation_link">(BP2)</span><span id="margin_416964584497" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP2" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \nonumber\end{eqnarray}</a></span><script>$('#margin_416964584497_reveal').click(function() {$('#margin_416964584497').toggle('slow', function() {});});</script>
have already told us how to compute $\delta^l_j$, so this is great news.
<!--We can rewrite
<span id="margin_828081856583_reveal" class="equation_link">(BP3)</span><span id="margin_828081856583" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP3" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}  \frac{\partial C}{\partial b^l_j} =
  \delta^l_j \nonumber\end{eqnarray}</a></span><script>$('#margin_828081856583_reveal').click(function() {$('#margin_828081856583').toggle('slow', function() {});});</script>
in shorthand as
<a class="displaced_anchor" name="eqtn31"></a>\begin{eqnarray}
  \frac{\partial C}{\partial b} = \delta,
\tag{31}\end{eqnarray}
where it is understood that $\delta$ is being evaluated at the same
neuron as the bias $b$.-->
<span id="margin_828081856583_reveal" class="equation_link">(BP3)</span><span id="margin_828081856583" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP3" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}  \frac{\partial C}{\partial b^l_j} =
  \delta^l_j \nonumber\end{eqnarray}</a></span><script>$('#margin_828081856583_reveal').click(function() {$('#margin_828081856583').toggle('slow', function() {});});</script>
can be rewritten in shorthand as
<a class="displaced_anchor" name="eqtn31"></a>\begin{eqnarray}
  \frac{\partial C}{\partial b} = \delta,
\tag{31}\end{eqnarray}
where it is understood that $\delta$ is being evaluated at the same neuron as the bias $b$.
</p>
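<p>
One way to build confidence in (BP3) is to compare $\delta$ with a finite-difference estimate of $\partial C / \partial b$. The sketch below does this for a single sigmoid layer with a quadratic cost; the function <code>cost</code>, the step size <code>eps</code> and all numerical values are assumptions made for this illustration.
</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def cost(w, b, x, y):
    """Quadratic cost for one example and a single sigmoid layer."""
    a = sigmoid(w @ x + b)
    return 0.5 * np.sum((a - y) ** 2)

# Hypothetical parameters and training example.
w = np.array([[0.3, -0.8],
              [1.2, 0.4]])
b = np.array([0.1, -0.2])
x = np.array([0.5, 1.0])
y = np.array([1.0, 0.0])

# Error of the (output) layer; by (BP3) this should equal dC/db.
z = w @ x + b
delta = (sigmoid(z) - y) * sigmoid_prime(z)

# Central finite-difference estimate of dC/db_j for each bias.
eps = 1e-6
numeric = np.zeros_like(b)
for j in range(b.size):
    step = np.zeros_like(b)
    step[j] = eps
    numeric[j] = (cost(w, b + step, x, y) - cost(w, b - step, x, y)) / (2 * eps)

print(np.allclose(numeric, delta, atol=1e-7))
```

<p>
Agreement up to numerical precision is what (BP3) predicts. A check like this is a sanity test, not a proof.
</p>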
<p><strong>
<!--An equation for the rate of change of the cost with respect to any weight in the network:-->
An equation for the rate of change of the cost with respect to any weight in the network:
</strong>
<!--In particular:
<a class="displaced_anchor" name="eqtnBP4"></a>\begin{eqnarray}
  \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j.
\tag{BP4}\end{eqnarray}-->
In particular:
<a class="displaced_anchor" name="eqtnBP4"></a>\begin{eqnarray}
  \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j.
\tag{BP4}\end{eqnarray}
<!--This tells us how to compute the partial derivatives $\partial C / \partial w^l_{jk}$ in terms of the quantities $\delta^l$ and
$a^{l-1}$, which we already know how to compute.  The equation can be
rewritten in a less index-heavy notation as
<a class="displaced_anchor" name="eqtn32"></a>\begin{eqnarray}  \frac{\partial C}{\partial w} = a_{\rm in} \delta_{\rm out},
\tag{32}\end{eqnarray}
where it's understood that $a_{\rm in}$ is the activation of the
neuron input to the weight $w$, and $\delta_{\rm out}$ is the error of
the neuron output from the weight $w$. -->
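This tells us how to compute the partial derivatives $\partial C / \partial w^l_{jk}$ in terms of the quantities $\delta^l$ and $a^{l-1}$, which we already know how to compute.
The equation can be rewritten in a less index-heavy notation as
<a class="displaced_anchor" name="eqtn32"></a>\begin{eqnarray}  \frac{\partial C}{\partial w} = a_{\rm in} \delta_{\rm out},
\tag{32}\end{eqnarray}
where it's understood that $a_{\rm in}$ is the activation of the neuron input to the weight $w$, and $\delta_{\rm out}$ is the error of the neuron output from the weight $w$.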
<!--Zooming in to look at just the weight $w$, and the two neurons connected by that weight, we can depict this as:
<center>
<img src="images/tikz20.png"/>
</center>-->
Zooming in to look at just the weight $w$, and the two neurons connected by that weight, we can depict this as:
<center>
<img src="images/tikz20.png"/>
</center>
<!--A nice consequence of Equation
<span id="margin_926874251435_reveal" class="equation_link">(32)</span><span id="margin_926874251435" class="marginequation" style="display: none;"><a href="chap2.html#eqtn32" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}  \frac{\partial
    C}{\partial w} = a_{\rm in} \delta_{\rm out} \nonumber\end{eqnarray}</a></span><script>$('#margin_926874251435_reveal').click(function() {$('#margin_926874251435').toggle('slow', function() {});});</script>
is that when the activation $a_{\rm in}$ is small, $a_{\rm in} \approx
n0$, the gradient term $\partial C / \partial w$ will also tend to be
small.-->
A nice consequence of Equation
<span id="margin_926874251435_reveal" class="equation_link">(32)</span><span id="margin_926874251435" class="marginequation" style="display: none;"><a href="chap2.html#eqtn32" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}  \frac{\partial C}{\partial w} = a_{\rm in} \delta_{\rm out} \nonumber\end{eqnarray}</a></span><script>$('#margin_926874251435_reveal').click(function() {$('#margin_926874251435').toggle('slow', function() {});});</script>
is that when the activation $a_{\rm in}$ is small, $a_{\rm in} \approx 0$, the gradient term $\partial C / \partial w$ will also tend to be small.
<!--In this case, we'll say the weight <em>learns slowly</em>,
meaning that it's not changing much during gradient descent.  In other
words, one consequence of
<span id="margin_342844631908_reveal" class="equation_link">(BP4)</span><span id="margin_342844631908" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP4" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j \nonumber\end{eqnarray}</a></span><script>$('#margin_342844631908_reveal').click(function() {$('#margin_342844631908').toggle('slow', function() {});});</script>
 is that weights output from low-activation neurons learn slowly.-->
In this case, we'll say the weight <em>learns slowly</em>, meaning that it's not changing much during gradient descent.
In other words, one consequence of
<span id="margin_342844631908_reveal" class="equation_link">(BP4)</span><span id="margin_342844631908" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP4" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j \nonumber\end{eqnarray}</a></span><script>$('#margin_342844631908_reveal').click(function() {$('#margin_342844631908').toggle('slow', function() {});});</script>
is that weights output from low-activation neurons learn slowly.
</p>
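<p>
In matrix terms, (BP4) says that the gradient with respect to the whole weight matrix $w^l$ is the outer product of $\delta^l$ and $a^{l-1}$. The following sketch uses made-up values to illustrate this, including the effect of a zero input activation.
</p>

```python
import numpy as np

# Hypothetical values: layer l-1 has 3 neurons, layer l has 2 neurons.
a_prev = np.array([0.9, 0.0, 0.5])   # a^{l-1}; the second neuron has zero activation
delta = np.array([0.2, -0.4])        # delta^l

# (BP4): entry [j, k] of the gradient is a^{l-1}_k * delta^l_j,
# which is the outer product of delta^l with a^{l-1}.
grad_w = np.outer(delta, a_prev)

print(grad_w.shape)   # same shape as the weight matrix w^l: (2, 3)
print(grad_w[:, 1])   # gradients of the weights leaving the zero-activation neuron
```

<p>
The column belonging to the zero-activation neuron is exactly zero, so a gradient descent step leaves those weights unchanged. This is the slow-learning effect described above in its most extreme form.
</p>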
<p>
<!--There are other insights along these lines which can be obtained from
<span id="margin_757095973931_reveal" class="equation_link">(BP1)</span><span id="margin_757095973931" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_757095973931_reveal').click(function() {$('#margin_757095973931').toggle('slow', function() {});});</script>
-
<span id="margin_724164157530_reveal" class="equation_link">(BP4)</span><span id="margin_724164157530" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP4" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j \nonumber\end{eqnarray}</a></span><script>$('#margin_724164157530_reveal').click(function() {$('#margin_724164157530').toggle('slow', function() {});});</script>
.-->
<span id="margin_757095973931_reveal" class="equation_link">(BP1)</span><span id="margin_757095973931" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_757095973931_reveal').click(function() {$('#margin_757095973931').toggle('slow', function() {});});</script>
-
<span id="margin_724164157530_reveal" class="equation_link">(BP4)</span><span id="margin_724164157530" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP4" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j \nonumber\end{eqnarray}</a></span><script>$('#margin_724164157530_reveal').click(function() {$('#margin_724164157530').toggle('slow', function() {});});</script>
offer other insights along these lines.
<!--Let's start by looking at the output layer.
Consider the term $\sigma'(z^L_j)$ in
<span id="margin_239655849963_reveal" class="equation_link">(BP1)</span><span id="margin_239655849963" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_239655849963_reveal').click(function() {$('#margin_239655849963').toggle('slow', function() {});});</script>.-->
Let's start by looking at the output layer.
Consider the term $\sigma'(z^L_j)$ in
<span id="margin_239655849963_reveal" class="equation_link">(BP1)</span><span id="margin_239655849963" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_239655849963_reveal').click(function() {$('#margin_239655849963').toggle('slow', function() {});});</script>.
<!--Recall from the
<a href="chap1.html#sigmoid_graph">graph of the sigmoid function in the last chapter</a>
that the $\sigma$ function becomes very flat when $\sigma(z^L_j)$ is approximately $0$ or $1$.
When this occurs we will have $\sigma'(z^L_j) \approx 0$.  And so the lesson is that a weight in the final layer will learn slowly if the output
neuron is either low activation ($\approx 0$) or high activation
($\approx 1$).  In this case it's common to say the output neuron has
<em>saturated</em> and, as a result, the weight has stopped learning (or
is learning slowly).  Similar remarks hold also for the biases of
output neuron.-->
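Recall from the <a href="chap1.html#sigmoid_graph">graph of the sigmoid function in the last chapter</a> that the $\sigma$ function becomes very flat when $\sigma(z^L_j)$ is approximately $0$ or $1$.
When this occurs we will have $\sigma'(z^L_j) \approx 0$.
And so the lesson is that a weight in the final layer will learn slowly if the output neuron is either low activation ($\approx 0$) or high activation ($\approx 1$).
In this case it's common to say the output neuron has <em>saturated</em> and, as a result, the weight has stopped learning (or is learning slowly).
Similar remarks hold also for the biases of the output neuron.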
</p>
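<p>
To make saturation concrete, the short sketch below evaluates $\sigma$ and $\sigma'$ at a few weighted inputs. The sample values of $z$ are arbitrary choices for illustration.
</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Arbitrary weighted inputs, from strongly negative to strongly positive.
z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])

for zi, ai, di in zip(z, sigmoid(z), sigmoid_prime(z)):
    print(f"z = {zi:6.1f}   sigma(z) = {ai:.5f}   sigma'(z) = {di:.5f}")
```

<p>
The derivative peaks at $0.25$ for $z = 0$ and is already below $10^{-4}$ at $z = \pm 10$, where the activation is very close to $0$ or $1$. Since $\sigma'(z^L_j)$ multiplies the error in (BP1), a saturated output neuron yields a very small $\delta^L_j$.
</p>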
<p>
<!--We can obtain similar insights for earlier layers.  In particular,
note the $\sigma'(z^l)$ term in
<span id="margin_236897220270_reveal" class="equation_link">(BP2)</span><span id="margin_236897220270" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP2" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \nonumber\end{eqnarray}</a></span><script>$('#margin_236897220270_reveal').click(function() {$('#margin_236897220270').toggle('slow', function() {});});</script>
.-->
We can obtain similar insights for earlier layers. In particular, note the $\sigma'(z^l)$ term in
<span id="margin_236897220270_reveal" class="equation_link">(BP2)</span><span id="margin_236897220270" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP2" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \nonumber\end{eqnarray}</a></span><script>$('#margin_236897220270_reveal').click(function() {$('#margin_236897220270').toggle('slow', function() {});});</script>.
<!--This means that $\delta^l_j$ is likely to get small if the neuron is near saturation.
And this, in turn, means that any weights input to a saturated neuron
will learn slowly*<span class="marginnote">
*This reasoning won't hold if ${w^{l+1}}^T
  \delta^{l+1}$ has large enough entries to compensate for the
  smallness of $\sigma'(z^l_j)$.  But I'm speaking of the general
  tendency.</span>.-->
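This means that $\delta^l_j$ is likely to get small if the neuron is near saturation. And this, in turn, means that any weights input to a saturated neuron will learn slowly*<span class="marginnote">
*This reasoning won't hold if $(w^{l+1})^T \delta^{l+1}$ has large enough entries to compensate for the smallness of $\sigma'(z^l_j)$. But I'm speaking of the general tendency.</span>.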
</p>
<p>
<!--Summing up, we've learnt that a weight will learn slowly if either the input neuron is low-activation, or if the output neuron has saturated, i.e., is either high- or low-activation.  -->
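Summing up, we've learnt that a weight will learn slowly if either the input neuron is low-activation, or if the output neuron has saturated, i.e., is either high- or low-activation.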
</p>
<p>
<!--None of these observations is too greatly surprising.  Still, they
help improve our mental model of what's going on as a neural network
learns.  Furthermore, we can turn this type of reasoning around.  The
four fundamental equations turn out to hold for any activation
function, not just the standard sigmoid function (that's because, as
we'll see in a moment, the proofs don't use any special properties of
$\sigma$).  And so we can use these equations to <em>design</em>
activation functions which have particular desired learning
properties.  As an example to give you the idea, suppose we were to
choose a (non-sigmoid) activation function $\sigma$ so that $\sigma'$
is always positive, and never gets close to zero.  That would prevent
the slow-down of learning that occurs when ordinary sigmoid neurons
saturate.  Later in the book we'll see examples where this kind of
modification is made to the activation function. -->
None of these observations is too greatly surprising.
Still, they help improve our mental model of what's going on as a neural network learns.
Furthermore, we can turn this type of reasoning around.
<!--
2014/12/30 Kenta OONO
Translator's note: I could not think of a good rendering of "turn around"; for now it is translated as "in the reverse direction" (逆向き).
-->
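The four fundamental equations turn out to hold for any activation function, not just the standard sigmoid function
(that's because, as we'll see in a moment, the proofs don't use any special properties of $\sigma$).
And so we can use these equations to <em>design</em> activation functions which have particular desired learning properties.
As an example to give you the idea, suppose we were to choose a (non-sigmoid) activation function $\sigma$ so that $\sigma'$ is always positive, and never gets close to zero.
That would prevent the slow-down of learning that occurs when ordinary sigmoid neurons saturate.
Later in the book we'll see examples where this kind of modification is made to the activation function.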
<!--Keeping the four equations
<span id="margin_330673844442_reveal" class="equation_link">(BP1)</span><span id="margin_330673844442" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_330673844442_reveal').click(function() {$('#margin_330673844442').toggle('slow', function() {});});</script>
-
<span id="margin_626656835558_reveal" class="equation_link">(BP4)</span><span id="margin_626656835558" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP4" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j \nonumber\end{eqnarray}</a></span><script>$('#margin_626656835558_reveal').click(function() {$('#margin_626656835558').toggle('slow', function() {});});</script>
in mind can help explain why such modifications are tried, and what impact they can have.-->
<span id="margin_330673844442_reveal" class="equation_link">(BP1)</span><span id="margin_330673844442" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_330673844442_reveal').click(function() {$('#margin_330673844442').toggle('slow', function() {});});</script>
-
<span id="margin_626656835558_reveal" class="equation_link">(BP4)</span><span id="margin_626656835558" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP4" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j \nonumber\end{eqnarray}</a></span><script>$('#margin_626656835558_reveal').click(function() {$('#margin_626656835558').toggle('slow', function() {});});</script>
are worth keeping in mind: these four equations help explain why such modifications are tried, and what impact they can have.
</p>
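<p>
As a rough illustration of the design idea, the sketch below compares $\sigma'$ for the sigmoid with the derivative of a piecewise-linear activation whose slope is either $1$ or a small positive constant. The function <code>leaky</code> and its slope of $0.1$ are hypothetical choices made for this example; they are not necessarily the functions used later in the book.
</p>

```python
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def leaky(z, slope=0.1):
    """Hypothetical piecewise-linear activation: z for z >= 0, slope * z otherwise."""
    return np.where(z >= 0, z, slope * z)

def leaky_prime(z, slope=0.1):
    """Derivative of leaky; by convention we use the value 1 at z = 0."""
    return np.where(z >= 0, 1.0, slope)

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])

print("activation:       ", leaky(z))
print("sigmoid derivative:", sigmoid_prime(z))
print("leaky derivative:  ", leaky_prime(z))
```

<p>
The derivative of this activation never drops below the chosen slope, so the factor $\sigma'(z^l)$ in (BP1) and (BP2) cannot shrink the error toward zero in the way a saturated sigmoid does.
</p>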
<p><a id="backpropsummary"></a></p>
<p><center>
<img src="images/tikz21.png"/>
</center></p>
<p><a id="alternative_backprop"></a></p>
<p><h4><a name="problem_798513"></a><a href="#problem_798513">
<!--Problem-->
Problem
</a></h4><ul>
<li><strong>
<!--Alternate presentation of the equations of backpropagation:-->
Alternate presentation of the equations of backpropagation:
</strong>
<!-- I've stated the equations of backpropagation (notably
<span id="margin_174707495294_reveal" class="equation_link">(BP1)</span><span id="margin_174707495294" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_174707495294_reveal').click(function() {$('#margin_174707495294').toggle('slow', function() {});});</script>
and
<span id="margin_512047477177_reveal" class="equation_link">(BP2)</span><span id="margin_512047477177" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP2" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \nonumber\end{eqnarray}</a></span><script>$('#margin_512047477177_reveal').click(function() {$('#margin_512047477177').toggle('slow', function() {});});</script>
) using the Hadamard product.-->
I've stated the equations of backpropagation (notably
<span id="margin_174707495294_reveal" class="equation_link">(BP1)</span><span id="margin_174707495294" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_174707495294_reveal').click(function() {$('#margin_174707495294').toggle('slow', function() {});});</script>
and
<span id="margin_512047477177_reveal" class="equation_link">(BP2)</span><span id="margin_512047477177" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP2" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \nonumber\end{eqnarray}</a></span><script>$('#margin_512047477177_reveal').click(function() {$('#margin_512047477177').toggle('slow', function() {});});</script>
) using the Hadamard product.
<!--This presentation may
  be disconcerting if you're unused to the Hadamard product.  There's
  an alternative approach, based on conventional matrix
  multiplication, which some readers may find enlightening. -->
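This presentation may be disconcerting if you're unused to the Hadamard product.
There's an alternative approach, based on conventional matrix multiplication,
which some readers may find enlightening.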
<!--(1) Show that
<span id="margin_667190610058_reveal" class="equation_link">(BP1)</span><span id="margin_667190610058" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_667190610058_reveal').click(function() {$('#margin_667190610058').toggle('slow', function() {});});</script>
may be rewritten as
  <a class="displaced_anchor" name="eqtn33"></a>\begin{eqnarray}
    \delta^L = \Sigma'(z^L) \nabla_a C,
  \tag{33}\end{eqnarray}
  where $\Sigma'(z^L)$ is a square matrix whose diagonal entries are
  the values $\sigma'(z^L_j)$, and whose off-diagonal entries are
  zero.  Note that this matrix acts on $\nabla_a C$ by conventional
  matrix multiplication.-->
(1) Show that
<span id="margin_667190610058_reveal" class="equation_link">(BP1)</span><span id="margin_667190610058" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_667190610058_reveal').click(function() {$('#margin_667190610058').toggle('slow', function() {});});</script>
may be rewritten as
  <a class="displaced_anchor" name="eqtn33"></a>\begin{eqnarray}
    \delta^L = \Sigma'(z^L) \nabla_a C,
  \tag{33}\end{eqnarray}
where $\Sigma'(z^L)$ is a square matrix whose diagonal entries are the values $\sigma'(z^L_j)$, and whose off-diagonal entries are zero.
Note that this matrix acts on $\nabla_a C$ by conventional matrix multiplication.
<!--(2) Show that
<span id="margin_345345235028_reveal" class="equation_link">(BP2)</span><span id="margin_345345235028" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP2" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \nonumber\end{eqnarray}</a></span><script>$('#margin_345345235028_reveal').click(function() {$('#margin_345345235028').toggle('slow', function() {});});</script>
may be rewritten as
  <a class="displaced_anchor" name="eqtn34"></a>\begin{eqnarray}
    \delta^l = \Sigma'(z^l) (w^{l+1})^T \delta^{l+1}.
  \tag{34}\end{eqnarray}-->
(2) Show that
<span id="margin_345345235028_reveal" class="equation_link">(BP2)</span><span id="margin_345345235028" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP2" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \nonumber\end{eqnarray}</a></span><script>$('#margin_345345235028_reveal').click(function() {$('#margin_345345235028').toggle('slow', function() {});});</script>
may be rewritten as
  <a class="displaced_anchor" name="eqtn34"></a>\begin{eqnarray}
    \delta^l = \Sigma'(z^l) (w^{l+1})^T \delta^{l+1}.
  \tag{34}\end{eqnarray}
<!--  (3) By combining observations (1) and (2) show that
  <a class="displaced_anchor" name="eqtn35"></a>\begin{eqnarray}
    \delta^l = \Sigma'(z^l) (w^{l+1})^T \ldots \Sigma'(z^{L-1}) (w^L)^T
    \Sigma'(z^L) \nabla_a C
  \tag{35}\end{eqnarray}-->
 (3) By combining observations (1) and (2), show that
  <a class="displaced_anchor" name="eqtn35"></a>\begin{eqnarray}
    \delta^l = \Sigma'(z^l) (w^{l+1})^T \ldots \Sigma'(z^{L-1}) (w^L)^T
    \Sigma'(z^L) \nabla_a C
  \tag{35}\end{eqnarray}
<!--  For readers comfortable with matrix multiplication this equation may be easier to understand than
<span id="margin_823951338021_reveal" class="equation_link">(BP1)</span><span id="margin_823951338021" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_823951338021_reveal').click(function() {$('#margin_823951338021').toggle('slow', function() {});});</script>
and
<span id="margin_312530255170_reveal" class="equation_link">(BP2)</span><span id="margin_312530255170" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP2" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \nonumber\end{eqnarray}</a></span><script>$('#margin_312530255170_reveal').click(function() {$('#margin_312530255170').toggle('slow', function() {});});</script>. -->
For readers comfortable with matrix multiplication this equation may be easier to understand than
<span id="margin_823951338021_reveal" class="equation_link">(BP1)</span><span id="margin_823951338021" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_823951338021_reveal').click(function() {$('#margin_823951338021').toggle('slow', function() {});});</script>
and
<span id="margin_312530255170_reveal" class="equation_link">(BP2)</span><span id="margin_312530255170" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP2" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \nonumber\end{eqnarray}</a></span><script>$('#margin_312530255170_reveal').click(function() {$('#margin_312530255170').toggle('slow', function() {});});</script>.
<!--The reason I've focused on
<span id="margin_650030226543_reveal" class="equation_link">(BP1)</span><span id="margin_650030226543" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_650030226543_reveal').click(function() {$('#margin_650030226543').toggle('slow', function() {});});</script>
and
<span id="margin_299731299416_reveal" class="equation_link">(BP2)</span><span id="margin_299731299416" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP2" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \nonumber\end{eqnarray}</a></span><script>$('#margin_299731299416_reveal').click(function() {$('#margin_299731299416').toggle('slow', function() {});});</script>
is because that approach turns out to be faster to implement numerically.
</ul>
-->
The reason I've focused on
<span id="margin_650030226543_reveal" class="equation_link">(BP1)</span><span id="margin_650030226543" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_650030226543_reveal').click(function() {$('#margin_650030226543').toggle('slow', function() {});});</script>
and
<span id="margin_299731299416_reveal" class="equation_link">(BP2)</span><span id="margin_299731299416" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP2" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \nonumber\end{eqnarray}</a></span><script>$('#margin_299731299416_reveal').click(function() {$('#margin_299731299416').toggle('slow', function() {});});</script>
is that this approach turns out to be faster to implement numerically.
</p>
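<p>
The equivalence between the Hadamard form of (BP2) and the matrix form of Equation (34) can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the layer sizes, the random values and the names <tt>w_next</tt>, <tt>delta_next</tt> and <tt>Sigma_prime</tt> are illustrative choices, not taken from the book's code.
</p>

```python
import numpy as np

def sigmoid(z):
    """The sigmoid function."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z) * (1.0 - sigmoid(z))

# Hypothetical sizes: 4 neurons in layer l, 3 neurons in layer l+1.
rng = np.random.default_rng(0)
w_next = rng.standard_normal((3, 4))      # w^{l+1}
delta_next = rng.standard_normal((3, 1))  # delta^{l+1}
z = rng.standard_normal((4, 1))           # z^l

# (BP2): Hadamard product with the vector sigma'(z^l).
delta_hadamard = (w_next.T @ delta_next) * sigmoid_prime(z)

# Equation (34): multiply by the diagonal matrix Sigma'(z^l).
Sigma_prime = np.diag(sigmoid_prime(z).ravel())
delta_matrix = Sigma_prime @ (w_next.T @ delta_next)

forms_agree = bool(np.allclose(delta_hadamard, delta_matrix))
print(forms_agree)
```

<p>
The Hadamard form touches each component of the vector once, whereas the matrix form first builds an $n \times n$ matrix whose off-diagonal entries are all zero and then multiplies by it.
</p>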
<p><h3><a name="proof_of_the_four_fundamental_equations_(optional)"></a>
<a href="#proof_of_the_four_fundamental_equations_(optional)">
<!--Proof of the four fundamental equations (optional)-->
Proof of the four fundamental equations (optional)
</a>
</h3> </p>
<p>
<!--We'll now prove the four fundamental equations
<span id="margin_816792216456_reveal" class="equation_link">(BP1)</span><span id="margin_816792216456" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_816792216456_reveal').click(function() {$('#margin_816792216456').toggle('slow', function() {});});</script>-<span id="margin_517979696409_reveal" class="equation_link">(BP4)</span><span id="margin_517979696409" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP4" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}    \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j \nonumber\end{eqnarray}</a></span><script>$('#margin_517979696409_reveal').click(function() {$('#margin_517979696409').toggle('slow', function() {});});</script>.-->
We'll now prove the four fundamental equations
<span id="margin_816792216456_reveal" class="equation_link">(BP1)</span><span id="margin_816792216456" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_816792216456_reveal').click(function() {$('#margin_816792216456').toggle('slow', function() {});});</script>-<span id="margin_517979696409_reveal" class="equation_link">(BP4)</span><span id="margin_517979696409" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP4" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}    \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j \nonumber\end{eqnarray}</a></span><script>$('#margin_517979696409_reveal').click(function() {$('#margin_517979696409').toggle('slow', function() {});});</script>.
<!-- All four are consequences of the chain rule from multivariable calculus.
If you're comfortable with the chain rule, then I strongly encourage you to attempt the derivation yourself before reading on.-->
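All four are consequences of the chain rule from multivariable calculus.
If you're comfortable with the chain rule, then I strongly encourage you to attempt the derivation yourself before reading on.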
</p>
<p>
<!--Let's begin with Equation
<span id="margin_189404299769_reveal" class="equation_link">(BP1)</span><span id="margin_189404299769" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_189404299769_reveal').click(function() {$('#margin_189404299769').toggle('slow', function() {});});</script>
, which gives an expression for the output error, $\delta^L$.-->
Let's begin with Equation
<span id="margin_189404299769_reveal" class="equation_link">(BP1)</span><span id="margin_189404299769" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_189404299769_reveal').click(function() {$('#margin_189404299769').toggle('slow', function() {});});</script>
, which gives an expression for the output error, $\delta^L$.
<!--To prove this equation, recall that by definition
<a class="displaced_anchor" name="eqtn36"></a>\begin{eqnarray}
  \delta^L_j = \frac{\partial C}{\partial z^L_j}.
\tag{36}\end{eqnarray}
Applying the chain rule, we can re-express the partial derivative
above in terms of partial derivatives with respect to the output
activations,
<a class="displaced_anchor" name="eqtn37"></a>\begin{eqnarray}
  \delta^L_j = \sum_k \frac{\partial C}{\partial a^L_k} \frac{\partial a^L_k}{\partial z^L_j},
\tag{37}\end{eqnarray}
where the sum is over all neurons $k$ in the output layer.-->
To prove this equation, recall that by definition
<a class="displaced_anchor" name="eqtn36"></a>\begin{eqnarray}
  \delta^L_j = \frac{\partial C}{\partial z^L_j}.
\tag{36}\end{eqnarray}
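Applying the chain rule, we can re-express the partial derivative above in terms of partial derivatives with respect to the output activations: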
<a class="displaced_anchor" name="eqtn37"></a>\begin{eqnarray}
  \delta^L_j = \sum_k \frac{\partial C}{\partial a^L_k} \frac{\partial a^L_k}{\partial z^L_j}.
\tag{37}\end{eqnarray}
Here the sum is over all neurons $k$ in the output layer.
<!--Of course, the output activation $a^L_k$ of the $k^{\rm th}$ neuron depends only on the input weight $z^L_j$ for the $j^{\rm th}$ neuron when $k = j$.
And so $\partial a^L_k / \partial z^L_j$ vanishes when $k \neq j$.  As
a result we can simplify the previous equation to
<a class="displaced_anchor" name="eqtn38"></a>\begin{eqnarray}
  \delta^L_j = \frac{\partial C}{\partial a^L_j} \frac{\partial a^L_j}{\partial z^L_j}.
\tag{38}\end{eqnarray}-->
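Of course, the output activation $a^L_k$ of the $k^{\rm th}$ neuron depends on the weighted input $z^L_j$ for the $j^{\rm th}$ neuron only when $k = j$.
And so $\partial a^L_k / \partial z^L_j$ vanishes when $k \neq j$.
As a result we can simplify the previous equation to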
<a class="displaced_anchor" name="eqtn38"></a>\begin{eqnarray}
  \delta^L_j = \frac{\partial C}{\partial a^L_j} \frac{\partial a^L_j}{\partial z^L_j}.
\tag{38}\end{eqnarray}
<!--Recalling that $a^L_j = \sigma(z^L_j)$ the second term on the right
can be written as $\sigma'(z^L_j)$, and the equation becomes
<a class="displaced_anchor" name="eqtn39"></a>\begin{eqnarray}
  \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j),
\tag{39}\end{eqnarray}
which is just
<span id="margin_104919018101_reveal" class="equation_link">(BP1)</span><span id="margin_104919018101" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_104919018101_reveal').click(function() {$('#margin_104919018101').toggle('slow', function() {});});</script>
, in component form.-->
Recalling that $a^L_j = \sigma(z^L_j)$, the second term on the right can be written as $\sigma'(z^L_j)$, and the equation becomes
<a class="displaced_anchor" name="eqtn39"></a>\begin{eqnarray}
  \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j)
\tag{39}\end{eqnarray}
This is just
<span id="margin_104919018101_reveal" class="equation_link">(BP1)</span><span id="margin_104919018101" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_104919018101_reveal').click(function() {$('#margin_104919018101').toggle('slow', function() {});});</script>
, in component form.
</p>
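<p>
As a quick worked instance of (BP1), assume the quadratic cost $C = \frac{1}{2} \sum_j (y_j - a^L_j)^2$ and a single output neuron with the hypothetical values $z^L_j = 0$ and $y_j = 1$. These choices are only for illustration; the proof above does not depend on them. Then $a^L_j = \sigma(0) = 0.5$, $\sigma'(0) = \sigma(0)(1-\sigma(0)) = 0.25$ and $\partial C / \partial a^L_j = a^L_j - y_j = -0.5$, so
\begin{eqnarray}
  \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) = (-0.5) \times 0.25 = -0.125. \nonumber
\end{eqnarray}
The negative sign says that increasing $z^L_j$ slightly would lower the cost, as expected when the activation is below its target.
</p>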
<p>
<!--Next, we'll prove
<span id="margin_987004185602_reveal" class="equation_link">(BP2)</span><span id="margin_987004185602" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP2" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \nonumber\end{eqnarray}</a></span><script>$('#margin_987004185602_reveal').click(function() {$('#margin_987004185602').toggle('slow', function() {});});</script>
, which gives an equation for the error $\delta^l$ in terms of the error in the next layer, $\delta^{l+1}$.-->
Next, we'll prove
<span id="margin_987004185602_reveal" class="equation_link">(BP2)</span><span id="margin_987004185602" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP2" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \nonumber\end{eqnarray}</a></span><script>$('#margin_987004185602_reveal').click(function() {$('#margin_987004185602').toggle('slow', function() {});});</script>
, which gives an equation for the error $\delta^l$ in terms of the error in the next layer, $\delta^{l+1}$.
<!--To do this, we want to rewrite $\delta^l_j = \partial C / \partial z^l_j$ in terms of $\delta^{l+1}_k = \partial C / \partial z^{l+1}_k$.
We can do this using the chain rule,
<a class="displaced_anchor" name="eqtn40"></a><a class="displaced_anchor" name="eqtn41"></a><a class="displaced_anchor" name="eqtn42"></a>\begin{eqnarray}
  \delta^l_j & = & \frac{\partial C}{\partial z^l_j} \tag{40}\\
  & = & \sum_k \frac{\partial C}{\partial z^{l+1}_k} \frac{\partial z^{l+1}_k}{\partial z^l_j} \tag{41}\\
  & = & \sum_k \frac{\partial z^{l+1}_k}{\partial z^l_j} \delta^{l+1}_k,
\tag{42}\end{eqnarray}
where in the last line we have interchanged the two terms on the
right-hand side, and substituted the definition of $\delta^{l+1}_k$.-->
To do this, we want to rewrite $\delta^l_j = \partial C / \partial z^l_j$ in terms of $\delta^{l+1}_k = \partial C / \partial z^{l+1}_k$. We can do this using the chain rule:
<a class="displaced_anchor" name="eqtn40"></a><a class="displaced_anchor" name="eqtn41"></a><a class="displaced_anchor" name="eqtn42"></a>\begin{eqnarray}
  \delta^l_j & = & \frac{\partial C}{\partial z^l_j} \tag{40}\\
  & = & \sum_k \frac{\partial C}{\partial z^{l+1}_k} \frac{\partial z^{l+1}_k}{\partial z^l_j} \tag{41}\\
  & = & \sum_k \frac{\partial z^{l+1}_k}{\partial z^l_j} \delta^{l+1}_k.
\tag{42}\end{eqnarray}
In the last line we have interchanged the two terms on the right-hand side, and substituted the definition of $\delta^{l+1}_k$.
<!--To evaluate the first term on the last line, note that
<a class="displaced_anchor" name="eqtn43"></a>\begin{eqnarray}
  z^{l+1}_k = \sum_j w^{l+1}_{kj} a^l_j +b^{l+1}_k = \sum_j w^{l+1}_{kj} \sigma(z^l_j) +b^{l+1}_k.
\tag{43}\end{eqnarray}
Differentiating, we obtain
<a class="displaced_anchor" name="eqtn44"></a>\begin{eqnarray}
  \frac{\partial z^{l+1}_k}{\partial z^l_j} = w^{l+1}_{kj} \sigma'(z^l_j).
\tag{44}\end{eqnarray}-->
To evaluate the first term on the last line, note that
<a class="displaced_anchor" name="eqtn43"></a>\begin{eqnarray}
  z^{l+1}_k = \sum_j w^{l+1}_{kj} a^l_j +b^{l+1}_k = \sum_j w^{l+1}_{kj} \sigma(z^l_j) +b^{l+1}_k.
\tag{43}\end{eqnarray}
Differentiating, we obtain
<a class="displaced_anchor" name="eqtn44"></a>\begin{eqnarray}
  \frac{\partial z^{l+1}_k}{\partial z^l_j} = w^{l+1}_{kj} \sigma'(z^l_j).
\tag{44}\end{eqnarray}
<!--Substituting back into
<span id="margin_234573266012_reveal" class="equation_link">(42)</span><span id="margin_234573266012" class="marginequation" style="display: none;"><a href="chap2.html#eqtn42" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  & = & \sum_k \frac{\partial z^{l+1}_k}{\partial z^l_j} \delta^{l+1}_k \nonumber\end{eqnarray}</a></span><script>$('#margin_234573266012_reveal').click(function() {$('#margin_234573266012').toggle('slow', function() {});});</script>
 we obtain
<a class="displaced_anchor" name="eqtn45"></a>\begin{eqnarray}
  \delta^l_j = \sum_k w^{l+1}_{kj}  \delta^{l+1}_k \sigma'(z^l_j).
\tag{45}\end{eqnarray}-->
Substituting back into
<span id="margin_234573266012_reveal" class="equation_link">(42)</span><span id="margin_234573266012" class="marginequation" style="display: none;"><a href="chap2.html#eqtn42" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  & = & \sum_k \frac{\partial z^{l+1}_k}{\partial z^l_j} \delta^{l+1}_k \nonumber\end{eqnarray}</a></span><script>$('#margin_234573266012_reveal').click(function() {$('#margin_234573266012').toggle('slow', function() {});});</script>
we obtain
<a class="displaced_anchor" name="eqtn45"></a>\begin{eqnarray}
  \delta^l_j = \sum_k w^{l+1}_{kj}  \delta^{l+1}_k \sigma'(z^l_j).
\tag{45}\end{eqnarray}
<!--This is just
<span id="margin_475522318211_reveal" class="equation_link">(BP2)</span><span id="margin_475522318211" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP2" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \nonumber\end{eqnarray}</a></span><script>$('#margin_475522318211_reveal').click(function() {$('#margin_475522318211').toggle('slow', function() {});});</script> written in component form.-->
This is just
<span id="margin_475522318211_reveal" class="equation_link">(BP2)</span><span id="margin_475522318211" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP2" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \nonumber\end{eqnarray}</a></span><script>$('#margin_475522318211_reveal').click(function() {$('#margin_475522318211').toggle('slow', function() {});});</script>
written in component form.
</p>
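<p>
The derivation of (BP2) can be sanity-checked against a finite-difference estimate of $\partial C / \partial z^l_j$. The sketch below assumes NumPy, a quadratic cost, and that layer $l+1$ is the output layer; the sizes, random values and variable names are hypothetical and chosen only for illustration.
</p>

```python
import numpy as np

def sigmoid(z):
    """The sigmoid function."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z) * (1.0 - sigmoid(z))

rng = np.random.default_rng(2)
w_next = rng.standard_normal((3, 4))  # w^{l+1}
b_next = rng.standard_normal(3)       # b^{l+1}
y = rng.random(3)                     # hypothetical target output
z_l = rng.standard_normal(4)          # z^l

def cost_from_z_l(z):
    """Quadratic cost as a function of z^l (an assumed cost, for illustration)."""
    z_next = w_next @ sigmoid(z) + b_next  # Equation (43)
    return 0.5 * np.sum((sigmoid(z_next) - y) ** 2)

# delta^{l+1} from (BP1), then delta^l from (BP2).
z_next = w_next @ sigmoid(z_l) + b_next
delta_next = (sigmoid(z_next) - y) * sigmoid_prime(z_next)
delta_bp2 = (w_next.T @ delta_next) * sigmoid_prime(z_l)

# delta^l_j = dC/dz^l_j estimated by central finite differences.
eps = 1e-6
delta_numeric = np.array([
    (cost_from_z_l(z_l + eps * e) - cost_from_z_l(z_l - eps * e)) / (2 * eps)
    for e in np.eye(4)
])

bp2_matches = bool(np.allclose(delta_bp2, delta_numeric, atol=1e-7))
print(bp2_matches)
```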
<p>
<!--The final two equations we want to prove are
<span id="margin_3687779347_reveal" class="equation_link">(BP3)</span><span id="margin_3687779347" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP3" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}  \frac{\partial C}{\partial b^l_j} =
  \delta^l_j \nonumber\end{eqnarray}</a></span><script>$('#margin_3687779347_reveal').click(function() {$('#margin_3687779347').toggle('slow', function() {});});</script>
and
<span id="margin_283549566641_reveal" class="equation_link">(BP4)</span><span id="margin_283549566641" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP4" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j \nonumber\end{eqnarray}</a></span><script>$('#margin_283549566641_reveal').click(function() {$('#margin_283549566641').toggle('slow', function() {});});</script>
.
These also follow from the chain rule, in a manner similar to the proofs of the two equations above.  I leave them to you as an exercise. -->
The final two equations we want to prove are
<span id="margin_3687779347_reveal" class="equation_link">(BP3)</span><span id="margin_3687779347" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP3" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}  \frac{\partial C}{\partial b^l_j} =
  \delta^l_j \nonumber\end{eqnarray}</a></span><script>$('#margin_3687779347_reveal').click(function() {$('#margin_3687779347').toggle('slow', function() {});});</script>
and
<span id="margin_283549566641_reveal" class="equation_link">(BP4)</span><span id="margin_283549566641" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP4" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j \nonumber\end{eqnarray}</a></span><script>$('#margin_283549566641_reveal').click(function() {$('#margin_283549566641').toggle('slow', function() {});});</script>
.
These also follow from the chain rule, in a manner similar to the proofs of the two equations above.
I leave them to you as an exercise.
</p>
<p><h4><a name="exercise_117377"></a><a href="#exercise_117377">
<!--Exercise-->
Exercise
</a></h4><ul>
<li>
<!--Prove Equations
<span id="margin_230723398849_reveal" class="equation_link">(BP3)</span><span id="margin_230723398849" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP3" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}  \frac{\partial C}{\partial b^l_j} =
  \delta^l_j \nonumber\end{eqnarray}</a></span><script>$('#margin_230723398849_reveal').click(function() {$('#margin_230723398849').toggle('slow', function() {});});</script>
and
<span id="margin_300679567613_reveal" class="equation_link">(BP4)</span><span id="margin_300679567613" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP4" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j \nonumber\end{eqnarray}</a></span><script>$('#margin_300679567613_reveal').click(function() {$('#margin_300679567613').toggle('slow', function() {});});</script>
.-->
Prove Equations
<span id="margin_230723398849_reveal" class="equation_link">(BP3)</span><span id="margin_230723398849" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP3" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}  \frac{\partial C}{\partial b^l_j} =
  \delta^l_j \nonumber\end{eqnarray}</a></span><script>$('#margin_230723398849_reveal').click(function() {$('#margin_230723398849').toggle('slow', function() {});});</script>
and
<span id="margin_300679567613_reveal" class="equation_link">(BP4)</span><span id="margin_300679567613" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP4" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j \nonumber\end{eqnarray}</a></span><script>$('#margin_300679567613_reveal').click(function() {$('#margin_300679567613').toggle('slow', function() {});});</script>
.
</ul></p>
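<p>
A numerical check is no substitute for the proof, but it can confirm that a derivation of (BP3) and (BP4) is on the right track. The following sketch assumes NumPy, a quadratic cost, and a single layer treated as the output layer; all sizes, values and names are hypothetical.
</p>

```python
import numpy as np

def sigmoid(z):
    """The sigmoid function."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z) * (1.0 - sigmoid(z))

rng = np.random.default_rng(3)
w = rng.standard_normal((3, 4))  # w^l
b = rng.standard_normal(3)       # b^l
a_prev = rng.random(4)           # a^{l-1}
y = rng.random(3)                # hypothetical target output

def cost(w, b):
    """Quadratic cost for one layer with z^l = w^l a^{l-1} + b^l (assumed cost)."""
    return 0.5 * np.sum((sigmoid(w @ a_prev + b) - y) ** 2)

z = w @ a_prev + b
delta = (sigmoid(z) - y) * sigmoid_prime(z)  # delta^l via (BP1)

grad_b = delta                    # (BP3)
grad_w = np.outer(delta, a_prev)  # (BP4): entry (j, k) is a^{l-1}_k * delta^l_j

# Central finite differences for each bias and each weight.
eps = 1e-6
num_b = np.zeros_like(b)
for j in range(b.size):
    step = np.zeros_like(b)
    step[j] = eps
    num_b[j] = (cost(w, b + step) - cost(w, b - step)) / (2 * eps)

num_w = np.zeros_like(w)
for j in range(w.shape[0]):
    for k in range(w.shape[1]):
        step = np.zeros_like(w)
        step[j, k] = eps
        num_w[j, k] = (cost(w + step, b) - cost(w - step, b)) / (2 * eps)

bp3_matches = bool(np.allclose(grad_b, num_b, atol=1e-7))
bp4_matches = bool(np.allclose(grad_w, num_w, atol=1e-7))
print(bp3_matches, bp4_matches)
```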
<p>
<!--That completes the proof of the four fundamental equations of
backpropagation.  The proof may seem complicated.  But it's really
just the outcome of carefully applying the chain rule.  A little less
succinctly, we can think of backpropagation as a way of computing the
gradient of the cost function by systematically applying the chain
rule from multi-variable calculus.
That's all there really is to backpropagation - the rest is details.-->
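That completes the proof of the four fundamental equations of backpropagation.
The proof may seem complicated.
But it's really just the outcome of carefully applying the chain rule.
A little less succinctly, we can think of backpropagation as a way of computing the gradient of the cost function by systematically applying the chain rule from multi-variable calculus.
That's all there really is to backpropagation; the rest is details.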
</p>
<p><h3><a name="the_backpropagation_algorithm"></a><a href="#the_backpropagation_algorithm">
<!--The backpropagation algorithm-->
The backpropagation algorithm
</a></h3></p>
<p>
<!--The backpropagation equations provide us with a way of computing the
gradient of the cost function.  Let's explicitly write this out in the
form of an algorithm:
<ol>
<li> <strong>Input $x$:</strong> Set the corresponding activation $a^{1}$ for
  the input layer.  </p>
<p><li> <strong>Feedforward:</strong> For each $l = 2, 3, \ldots, L$ compute
  $z^{l} = w^l a^{l-1}+b^l$ and $a^{l} = \sigma(z^{l})$.</p>
<p><li> <strong>Output error $\delta^L$:</strong> Compute the vector $\delta^{L}
  = \nabla_a C \odot \sigma'(z^L)$.</p>
<p><li> <strong>Backpropagate the error:</strong> For each $l = L-1, L-2,
  \ldots, 2$ compute $\delta^{l} = ((w^{l+1})^T \delta^{l+1}) \odot
  \sigma'(z^{l})$.</p>
<p><li> <strong>Output:</strong> The gradient of the cost function is given by
  $\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$ and
  $\frac{\partial C}{\partial b^l_j} = \delta^l_j$.
</ol>-->
The backpropagation equations provide us with a way of computing the gradient of the cost function.
Let's explicitly write this out in the form of an algorithm:
<ol>
<li> <strong>Input $x$:</strong>
Set the corresponding activation $a^{1}$ for the input layer.</p>
<p><li> <strong>Feedforward:</strong>
For each $l = 2, 3, \ldots, L$ compute $z^{l} = w^l a^{l-1}+b^l$ and $a^{l} = \sigma(z^{l})$.</p>
<p><li> <strong>Output error $\delta^L$:</strong>
Compute the vector $\delta^{L} = \nabla_a C \odot \sigma'(z^L)$.</p>
<p><li> <strong>Backpropagate the error:</strong>
For each $l = L-1, L-2, \ldots, 2$ compute $\delta^{l} = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^{l})$.</p>
<p><li> <strong>Output:</strong>
The gradient of the cost function is given by
$\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$ and
$\frac{\partial C}{\partial b^l_j} = \delta^l_j$.
</ol>
</p>
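<p>
The five steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the <tt>backprop</tt> method of the book's <tt>Network</tt> class: the function name <tt>backprop_single</tt>, the list-of-arrays layout and the quadratic cost (so that $\nabla_a C = a^L - y$) are assumptions made for this example.
</p>

```python
import numpy as np

def sigmoid(z):
    """The sigmoid function."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z) * (1.0 - sigmoid(z))

def backprop_single(weights, biases, x, y):
    """Return (nabla_b, nabla_w) for one training example, assuming quadratic cost.

    weights[i] and biases[i] map layer i+1 to layer i+2 in the text's
    1-based layer numbering; x and y are column vectors.
    """
    # Steps 1 and 2: set the input activation and feed forward.
    activations = [x]
    zs = []
    for w, b in zip(weights, biases):
        zs.append(w @ activations[-1] + b)
        activations.append(sigmoid(zs[-1]))

    # Step 3: output error, (BP1) with nabla_a C = a^L - y.
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    nabla_b = [None] * len(biases)
    nabla_w = [None] * len(weights)
    nabla_b[-1] = delta                      # (BP3)
    nabla_w[-1] = delta @ activations[-2].T  # (BP4)

    # Step 4: backpropagate the error with (BP2), last hidden layer first.
    for i in range(len(weights) - 2, -1, -1):
        delta = (weights[i + 1].T @ delta) * sigmoid_prime(zs[i])
        nabla_b[i] = delta                     # Step 5: (BP3)
        nabla_w[i] = delta @ activations[i].T  # Step 5: (BP4)
    return nabla_b, nabla_w

# Usage with hypothetical layer sizes 2, 3, 1.
rng = np.random.default_rng(4)
sizes = [2, 3, 1]
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((n_out, 1)) for n_out in sizes[1:]]
x = rng.random((2, 1))
y = np.array([[1.0]])
nabla_b, nabla_w = backprop_single(weights, biases, x, y)
print([g.shape for g in nabla_w])
```

<p>
Storing every $z^l$ and $a^l$ during the forward pass is what lets the backward pass reuse them without recomputation.
</p>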
<p>
<!--Examining the algorithm you can see why it's called
<em>back</em>propagation.  We compute the error vectors $\delta^l$
backward, starting from the final layer.  It may seem peculiar that
we're going through the network backward.  But if you think about the
proof of backpropagation, the backward movement is a consequence of
the fact that the cost is a function of outputs from the network.  To
understand how the cost varies with earlier weights and biases we need
to repeatedly apply the chain rule, working backward through the
layers to obtain useable expressions.-->
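Examining the algorithm you can see why it's called <em>back</em>propagation.
We compute the error vectors $\delta^l$ backward, starting from the final layer.
It may seem peculiar that we're going through the network backward.
But if you think about the proof of backpropagation, the backward movement is a consequence of the fact that the cost is a function of outputs from the network.
To understand how the cost varies with earlier weights and biases we need to repeatedly apply the chain rule, working backward through the layers to obtain usable expressions.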
</p>
<p><h4><a name="exercises_675621"></a><a href="#exercises_675621">
<!--Exercises-->
Exercises
</a></h4><ul>
<li><strong>
<!--Backpropagation with a single modified neuron-->
Backpropagation with a single modified neuron
</strong>
<!--Suppose we modify
  a single neuron in a feedforward network so that the output from the
  neuron is given by $f(\sum_j w_j x_j + b)$, where $f$ is some
  function other than the sigmoid.  How should we modify the
  backpropagation algorithm in this case?-->
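Suppose we modify a single neuron in a feedforward network so that the output from the neuron is given by $f(\sum_j w_j x_j + b)$, where $f$ is some function other than the sigmoid.
How should we modify the backpropagation algorithm in this case?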
</p>
<p><li><strong>
<!--Backpropagation with linear neurons-->
Backpropagation with linear neurons
</strong>
<!--Suppose we replace the
  usual non-linear $\sigma$ function with $\sigma(z) = z$ throughout
  the network.  Rewrite the backpropagation algorithm for this case.-->
Suppose we replace the usual non-linear $\sigma$ function with $\sigma(z) = z$ throughout the network.
Rewrite the backpropagation algorithm for this case.
</ul></p>
<p>
<!--As I've described it above, the backpropagation algorithm computes the gradient of the cost function for a single training example, $C =
C_x$.  In practice, it's common to combine backpropagation with a
learning algorithm such as stochastic gradient descent, in which we
compute the gradient for many training examples.  In particular, given
a mini-batch of $m$ training examples, the following algorithm applies
a gradient descent learning step based on that mini-batch:-->
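As I've described it above, the backpropagation algorithm computes the gradient of the cost function for a single training example, $C = C_x$.
In practice, it's common to combine backpropagation with a learning algorithm such as stochastic gradient descent, in which we compute the gradient for many training examples.
In particular, given a mini-batch of $m$ training examples, the following algorithm applies a gradient descent learning step based on that mini-batch: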
<!--<ol>
<li> <strong>Input a set of training examples</strong></p>
<p><li> <strong>For each training example $x$:</strong> Set the corresponding
  input activation $a^{x,1}$, and perform the following steps:</p>
<p><ul>
<li> <strong>Feedforward:</strong> For each $l = 2, 3, \ldots, L$ compute
  $z^{x,l} = w^l a^{x,l-1}+b^l$ and $a^{x,l} = \sigma(z^{x,l})$.</p>
<p><li> <strong>Output error $\delta^{x,L}$:</strong> Compute the vector
  $\delta^{x,L} = \nabla_a C_x \odot \sigma'(z^{x,L})$.</p>
<p><li> <strong>Backpropagate the error:</strong> For each $l = L-1, L-2,
  \ldots, 2$ compute $\delta^{x,l} = ((w^{l+1})^T \delta^{x,l+1})
  \odot \sigma'(z^{x,l})$.
</ul>
</ul></p>
<p><li> <strong>Gradient descent:</strong> For each $l = L, L-1, \ldots, 2$
  update the weights according to the rule $w^l \rightarrow
  w^l-\frac{\eta}{m} \sum_x \delta^{x,l} (a^{x,l-1})^T$, and the
  biases according to the rule $b^l \rightarrow b^l-\frac{\eta}{m}
  \sum_x \delta^{x,l}$.-->
<ol>
<li> <strong>Input a set of training examples</strong></p>
<p><li> <strong>For each training example $x$:</strong>
Set the corresponding input activation $a^{x,1}$, and perform the following steps:</p>
<p><ul>
<li> <strong>Feedforward:</strong>
For each $l = 2, 3, \ldots, L$ compute $z^{x,l} = w^l a^{x,l-1}+b^l$ and $a^{x,l} = \sigma(z^{x,l})$.</p>
<p><li> <strong>Output error $\delta^{x,L}$:</strong>
Compute the vector $\delta^{x,L} = \nabla_a C_x \odot \sigma'(z^{x,L})$.</p>
<p><li> <strong>Backpropagate the error:</strong>
For each $l = L-1, L-2, \ldots, 2$ compute $\delta^{x,l} = ((w^{l+1})^T \delta^{x,l+1}) \odot \sigma'(z^{x,l})$.
</ul>
</p>
<p><li> <strong>Gradient descent:</strong>
For each $l = L, L-1, \ldots, 2$ update the weights according to the rule $w^l \rightarrow w^l-\frac{\eta}{m} \sum_x \delta^{x,l} (a^{x,l-1})^T$, and the biases according to the rule $b^l \rightarrow b^l-\frac{\eta}{m} \sum_x \delta^{x,l}$.</p>
<p></ol></p>
<!--Of course, to implement stochastic gradient descent in practice you
also need an outer loop generating mini-batches of training examples,
and an outer loop stepping through multiple epochs of training.  I've
omitted those for simplicity. -->
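Of course, to implement stochastic gradient descent in practice you also need an outer loop generating mini-batches of training examples, and an outer loop stepping through multiple epochs of training.
I've omitted those for simplicity.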
</p>
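<p>
The two outer loops might look roughly like the following. This is a sketch under stated assumptions, not the book's own code: the function <tt>sgd</tt>, its <tt>seed</tt> argument and the <tt>update_mini_batch</tt> callback (standing in for the gradient descent step above) are hypothetical names introduced for this example.
</p>

```python
import random

def sgd(training_data, epochs, mini_batch_size, update_mini_batch, seed=0):
    """Run the outer loops of stochastic gradient descent.

    `update_mini_batch` is a hypothetical callback that applies one
    gradient descent step to the mini-batch it receives.
    """
    rng = random.Random(seed)
    data = list(training_data)
    for _ in range(epochs):
        # Outer loop over epochs: reshuffle so each epoch sees new batches.
        rng.shuffle(data)
        # Inner loop: cut the shuffled data into consecutive mini-batches.
        for k in range(0, len(data), mini_batch_size):
            update_mini_batch(data[k:k + mini_batch_size])

# Usage: record the batches instead of training, to show the loop structure.
seen = []
sgd(range(10), epochs=2, mini_batch_size=4, update_mini_batch=seen.append)
batch_sizes = [len(batch) for batch in seen]
print(batch_sizes)  # [4, 4, 2, 4, 4, 2]
```

<p>
When the number of training examples is not a multiple of the mini-batch size, the last mini-batch of each epoch is smaller, so the factor $\frac{\eta}{m}$ should use the actual size of that mini-batch.
</p>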
<p></p>
<p>
<h3><a name="the_code_for_backpropagation"></a><a href="#the_code_for_backpropagation">
<!--The code for backpropagation-->
The code for backpropagation
</a></h3></p>
<p>
<!--Having understood backpropagation in the abstract, we can now
understand the code used in the last chapter to implement
backpropagation.  Recall from
<a href="chap1.html#implementing_our_network_to_classify_digits">that chapter</a>
that the code was contained in the <tt>update_mini_batch</tt>
and <tt>backprop</tt> methods of the <tt>Network</tt> class.  The code for
these methods is a direct translation of the algorithm described
above.  In particular, the <tt>update_mini_batch</tt> method updates the
<tt>Network</tt>'s weights and biases by computing the gradient for the
current <tt>mini_batch</tt> of training examples:-->
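Having understood backpropagation in the abstract, we can now understand the code used in the last chapter to implement backpropagation.
Recall from <a href="chap1.html#implementing_our_network_to_classify_digits">that chapter</a> that the code was contained in the <tt>update_mini_batch</tt> and <tt>backprop</tt> methods of the <tt>Network</tt> class.
The code for these methods is a direct translation of the algorithm described above.
In particular, the <tt>update_mini_batch</tt> method updates the <tt>Network</tt>'s weights and biases by computing the gradient for the current <tt>mini_batch</tt> of training examples: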
<!--<div class="highlight"><pre><span class="k">class</span> <span class="nc">Network</span><span class="p">():</span>
<span class="o">...</span>
    <span class="k">def</span> <span class="nf">update_mini_batch</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">mini_batch</span><span class="p">,</span> <span class="n">eta</span><span class="p">):</span>
        <span class="sd">&quot;&quot;&quot;Update the network&#39;s weights and biases by applying</span>
<span class="sd">        gradient descent using backpropagation to a single mini batch.</span>
<span class="sd">        The &quot;mini_batch&quot; is a list of tuples &quot;(x, y)&quot;, and &quot;eta&quot;</span>
<span class="sd">        is the learning rate.&quot;&quot;&quot;</span>
        <span class="n">nabla_b</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">b</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span> <span class="k">for</span> <span class="n">b</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">biases</span><span class="p">]</span>
        <span class="n">nabla_w</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">w</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">weights</span><span class="p">]</span>
        <span class="k">for</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="ow">in</span> <span class="n">mini_batch</span><span class="p">:</span>
            <span class="n">delta_nabla_b</span><span class="p">,</span> <span class="n">delta_nabla_w</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">backprop</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
            <span class="n">nabla_b</span> <span class="o">=</span> <span class="p">[</span><span class="n">nb</span><span class="o">+</span><span class="n">dnb</span> <span class="k">for</span> <span class="n">nb</span><span class="p">,</span> <span class="n">dnb</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">nabla_b</span><span class="p">,</span> <span class="n">delta_nabla_b</span><span class="p">)]</span>
            <span class="n">nabla_w</span> <span class="o">=</span> <span class="p">[</span><span class="n">nw</span><span class="o">+</span><span class="n">dnw</span> <span class="k">for</span> <span class="n">nw</span><span class="p">,</span> <span class="n">dnw</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">nabla_w</span><span class="p">,</span> <span class="n">delta_nabla_w</span><span class="p">)]</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">weights</span> <span class="o">=</span> <span class="p">[</span><span class="n">w</span><span class="o">-</span><span class="p">(</span><span class="n">eta</span><span class="o">/</span><span class="nb">len</span><span class="p">(</span><span class="n">mini_batch</span><span class="p">))</span><span class="o">*</span><span class="n">nw</span>
                        <span class="k">for</span> <span class="n">w</span><span class="p">,</span> <span class="n">nw</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">weights</span><span class="p">,</span> <span class="n">nabla_w</span><span class="p">)]</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">biases</span> <span class="o">=</span> <span class="p">[</span><span class="n">b</span><span class="o">-</span><span class="p">(</span><span class="n">eta</span><span class="o">/</span><span class="nb">len</span><span class="p">(</span><span class="n">mini_batch</span><span class="p">))</span><span class="o">*</span><span class="n">nb</span>
                       <span class="k">for</span> <span class="n">b</span><span class="p">,</span> <span class="n">nb</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">biases</span><span class="p">,</span> <span class="n">nabla_b</span><span class="p">)]</span>
</pre></div>
-->
<div class="highlight"><pre><span class="k">class</span> <span class="nc">Network</span><span class="p">():</span>
<span class="o">...</span>
    <span class="k">def</span> <span class="nf">update_mini_batch</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">mini_batch</span><span class="p">,</span> <span class="n">eta</span><span class="p">):</span>
<span class="sd">        &quot;&quot;&quot;Update the network&#39;s weights and biases by applying</span>
<span class="sd">        gradient descent using backpropagation to a single mini batch.</span>
<span class="sd">        The &quot;mini_batch&quot; is a list of tuples &quot;(x, y)&quot;, and &quot;eta&quot;</span>
<span class="sd">        is the learning rate.&quot;&quot;&quot;</span>
        <span class="n">nabla_b</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">b</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span> <span class="k">for</span> <span class="n">b</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">biases</span><span class="p">]</span>
        <span class="n">nabla_w</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">w</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">weights</span><span class="p">]</span>
        <span class="k">for</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="ow">in</span> <span class="n">mini_batch</span><span class="p">:</span>
            <span class="n">delta_nabla_b</span><span class="p">,</span> <span class="n">delta_nabla_w</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">backprop</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
            <span class="n">nabla_b</span> <span class="o">=</span> <span class="p">[</span><span class="n">nb</span><span class="o">+</span><span class="n">dnb</span> <span class="k">for</span> <span class="n">nb</span><span class="p">,</span> <span class="n">dnb</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">nabla_b</span><span class="p">,</span> <span class="n">delta_nabla_b</span><span class="p">)]</span>
            <span class="n">nabla_w</span> <span class="o">=</span> <span class="p">[</span><span class="n">nw</span><span class="o">+</span><span class="n">dnw</span> <span class="k">for</span> <span class="n">nw</span><span class="p">,</span> <span class="n">dnw</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">nabla_w</span><span class="p">,</span> <span class="n">delta_nabla_w</span><span class="p">)]</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">weights</span> <span class="o">=</span> <span class="p">[</span><span class="n">w</span><span class="o">-</span><span class="p">(</span><span class="n">eta</span><span class="o">/</span><span class="nb">len</span><span class="p">(</span><span class="n">mini_batch</span><span class="p">))</span><span class="o">*</span><span class="n">nw</span>
                        <span class="k">for</span> <span class="n">w</span><span class="p">,</span> <span class="n">nw</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">weights</span><span class="p">,</span> <span class="n">nabla_w</span><span class="p">)]</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">biases</span> <span class="o">=</span> <span class="p">[</span><span class="n">b</span><span class="o">-</span><span class="p">(</span><span class="n">eta</span><span class="o">/</span><span class="nb">len</span><span class="p">(</span><span class="n">mini_batch</span><span class="p">))</span><span class="o">*</span><span class="n">nb</span>
                       <span class="k">for</span> <span class="n">b</span><span class="p">,</span> <span class="n">nb</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">biases</span><span class="p">,</span> <span class="n">nabla_b</span><span class="p">)]</span>
</pre></div>
<!--Most of the work is done by the line
<tt>delta_nabla_b, delta_nabla_w = self.backprop(x, y)</tt> which uses
the <tt>backprop</tt> method to figure out the partial derivatives
$\partial C_x / \partial b^l_j$ and $\partial C_x / \partial w^l_{jk}$.-->
Most of the work is done by the line <tt>delta_nabla_b, delta_nabla_w = self.backprop(x, y)</tt>, which uses the <tt>backprop</tt> method to figure out the partial derivatives $\partial C_x / \partial b^l_j$ and $\partial C_x / \partial w^l_{jk}$.
<!--  The <tt>backprop</tt> method follows the algorithm in the
last section closely.  There is one small change - we use a slightly
different approach to indexing the layers.  This change is made to
take advantage of a feature of Python, namely the use of negative list
indices to count backward from the end of a list, so, e.g.,
<tt>l[-3]</tt> is the third last entry in a list <tt>l</tt>.-->
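The <tt>backprop</tt> method follows the algorithm in the last section closely. There is one small change: we use a slightly different approach to indexing the layers.
This change is made to take advantage of a feature of Python, namely the use of negative list indices to count backward from the end of a list, so, e.g., <tt>l[-3]</tt> is the third last entry in a list <tt>l</tt>.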
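The negative-index convention can be seen in a minimal sketch. The list <tt>layers</tt> and its contents are made up for illustration and are not part of <tt>network.py</tt>.

```python
# Illustrative list standing in for per-layer data (values are made up).
layers = ["input", "hidden1", "hidden2", "output"]

last = layers[-1]          # last entry
second_last = layers[-2]   # second-last entry
third_last = layers[-3]    # third-last entry

# Counting backward with l = 1, 2, ... walks the list from its end,
# which is how a loop variable l can address layers via [-l].
backward = [layers[-l] for l in range(1, len(layers) + 1)]

print(last, second_last, third_last)
print(backward)
```

Indexing with <tt>-l</tt> avoids having to compute positions from the length of the list by hand.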
<!--The code for <tt>backprop</tt> is below, together with a few helper functions, which are used to compute the $\sigma$ function and its vectorized form, the derivative $\sigma'$ and its vectorized form, and the derivative of the cost function.  With these inclusions you should be able to
understand the code in a self-contained way.  If something's tripping
you up, you may find it helpful to consult
<a href="chap1.html#implementing_our_network_to_classify_digits">the original description (and complete listing) of the code</a>.-->
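The code for <tt>backprop</tt> is below, together with a few helper functions, which are used to compute the $\sigma$ function and its vectorized form, the derivative $\sigma'$ and its vectorized form, and the derivative of the cost function. With these inclusions you should be able to understand the code in a self-contained way.
If something's tripping you up, you may find it helpful to consult
<a href="chap1.html#implementing_our_network_to_classify_digits">the original description (and complete listing) of the code</a>.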
<!--<div class="highlight"><pre><span class="k">class</span> <span class="nc">Network</span><span class="p">():</span>
<span class="o">...</span>
   <span class="k">def</span> <span class="nf">backprop</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
        <span class="sd">&quot;&quot;&quot;Return a tuple &quot;(nabla_b, nabla_w)&quot; representing the</span>
<span class="sd">        gradient for the cost function C_x.  &quot;nabla_b&quot; and</span>
<span class="sd">        &quot;nabla_w&quot; are layer-by-layer lists of numpy arrays, similar</span>
<span class="sd">        to &quot;self.biases&quot; and &quot;self.weights&quot;.&quot;&quot;&quot;</span>
        <span class="n">nabla_b</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">b</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span> <span class="k">for</span> <span class="n">b</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">biases</span><span class="p">]</span>
        <span class="n">nabla_w</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">w</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">weights</span><span class="p">]</span>
        <span class="c"># feedforward</span>
        <span class="n">activation</span> <span class="o">=</span> <span class="n">x</span>
        <span class="n">activations</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span><span class="p">]</span> <span class="c"># list to store all the activations, layer by layer</span>
        <span class="n">zs</span> <span class="o">=</span> <span class="p">[]</span> <span class="c"># list to store all the z vectors, layer by layer</span>
        <span class="k">for</span> <span class="n">b</span><span class="p">,</span> <span class="n">w</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">biases</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">weights</span><span class="p">):</span>
            <span class="n">z</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">w</span><span class="p">,</span> <span class="n">activation</span><span class="p">)</span><span class="o">+</span><span class="n">b</span>
            <span class="n">zs</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">z</span><span class="p">)</span>
            <span class="n">activation</span> <span class="o">=</span> <span class="n">sigmoid_vec</span><span class="p">(</span><span class="n">z</span><span class="p">)</span>
            <span class="n">activations</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">activation</span><span class="p">)</span>
        <span class="c"># backward pass</span>
        <span class="n">delta</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">cost_derivative</span><span class="p">(</span><span class="n">activations</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="n">y</span><span class="p">)</span> <span class="o">*</span> \
            <span class="n">sigmoid_prime_vec</span><span class="p">(</span><span class="n">zs</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span>
        <span class="n">nabla_b</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">delta</span>
        <span class="n">nabla_w</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">delta</span><span class="p">,</span> <span class="n">activations</span><span class="p">[</span><span class="o">-</span><span class="mi">2</span><span class="p">]</span><span class="o">.</span><span class="n">transpose</span><span class="p">())</span>
        <span class="c"># Note that the variable l in the loop below is used a little</span>
        <span class="c"># differently to the notation in Chapter 2 of the book.  Here,</span>
        <span class="c"># l = 1 means the last layer of neurons, l = 2 is the</span>
        <span class="c"># second-last layer, and so on.  It&#39;s a renumbering of the</span>
        <span class="c"># scheme in the book, used here to take advantage of the fact</span>
        <span class="c"># that Python can use negative indices in lists.</span>
        <span class="k">for</span> <span class="n">l</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">num_layers</span><span class="p">):</span>
            <span class="n">z</span> <span class="o">=</span> <span class="n">zs</span><span class="p">[</span><span class="o">-</span><span class="n">l</span><span class="p">]</span>
            <span class="n">spv</span> <span class="o">=</span> <span class="n">sigmoid_prime_vec</span><span class="p">(</span><span class="n">z</span><span class="p">)</span>
            <span class="n">delta</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">weights</span><span class="p">[</span><span class="o">-</span><span class="n">l</span><span class="o">+</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">transpose</span><span class="p">(),</span> <span class="n">delta</span><span class="p">)</span> <span class="o">*</span> <span class="n">spv</span>
            <span class="n">nabla_b</span><span class="p">[</span><span class="o">-</span><span class="n">l</span><span class="p">]</span> <span class="o">=</span> <span class="n">delta</span>
            <span class="n">nabla_w</span><span class="p">[</span><span class="o">-</span><span class="n">l</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">delta</span><span class="p">,</span> <span class="n">activations</span><span class="p">[</span><span class="o">-</span><span class="n">l</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">transpose</span><span class="p">())</span>
        <span class="k">return</span> <span class="p">(</span><span class="n">nabla_b</span><span class="p">,</span> <span class="n">nabla_w</span><span class="p">)</span>

<span class="o">...</span>

    <span class="k">def</span> <span class="nf">cost_derivative</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">output_activations</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
        <span class="sd">&quot;&quot;&quot;Return the vector of partial derivatives \partial C_x /</span>
<span class="sd">        \partial a for the output activations.&quot;&quot;&quot;</span>
        <span class="k">return</span> <span class="p">(</span><span class="n">output_activations</span><span class="o">-</span><span class="n">y</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">sigmoid</span><span class="p">(</span><span class="n">z</span><span class="p">):</span>
    <span class="sd">&quot;&quot;&quot;The sigmoid function.&quot;&quot;&quot;</span>
    <span class="k">return</span> <span class="mf">1.0</span><span class="o">/</span><span class="p">(</span><span class="mf">1.0</span><span class="o">+</span><span class="n">np</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="o">-</span><span class="n">z</span><span class="p">))</span>

<span class="n">sigmoid_vec</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">vectorize</span><span class="p">(</span><span class="n">sigmoid</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">sigmoid_prime</span><span class="p">(</span><span class="n">z</span><span class="p">):</span>
    <span class="sd">&quot;&quot;&quot;Derivative of the sigmoid function.&quot;&quot;&quot;</span>
    <span class="k">return</span> <span class="n">sigmoid</span><span class="p">(</span><span class="n">z</span><span class="p">)</span><span class="o">*</span><span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">sigmoid</span><span class="p">(</span><span class="n">z</span><span class="p">))</span>

<span class="n">sigmoid_prime_vec</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">vectorize</span><span class="p">(</span><span class="n">sigmoid_prime</span><span class="p">)</span>
</pre></div>-->
<div class="highlight"><pre><span class="k">class</span> <span class="nc">Network</span><span class="p">():</span>
<span class="o">...</span>
    <span class="k">def</span> <span class="nf">backprop</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
<span class="sd">        &quot;&quot;&quot;Return a tuple &quot;(nabla_b, nabla_w)&quot; representing the</span>
<span class="sd">        gradient for the cost function C_x.  &quot;nabla_b&quot; and</span>
<span class="sd">        &quot;nabla_w&quot; are layer-by-layer lists of numpy arrays, similar</span>
<span class="sd">        to &quot;self.biases&quot; and &quot;self.weights&quot;.&quot;&quot;&quot;</span>
        <span class="n">nabla_b</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">b</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span> <span class="k">for</span> <span class="n">b</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">biases</span><span class="p">]</span>
        <span class="n">nabla_w</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">w</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">weights</span><span class="p">]</span>
        <span class="c"># feedforward</span>
        <span class="n">activation</span> <span class="o">=</span> <span class="n">x</span>
        <span class="n">activations</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span><span class="p">]</span> <span class="c"># list to store all the activations, layer by layer</span>
        <span class="n">zs</span> <span class="o">=</span> <span class="p">[]</span> <span class="c"># list to store all the z vectors, layer by layer</span>
        <span class="k">for</span> <span class="n">b</span><span class="p">,</span> <span class="n">w</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">biases</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">weights</span><span class="p">):</span>
            <span class="n">z</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">w</span><span class="p">,</span> <span class="n">activation</span><span class="p">)</span><span class="o">+</span><span class="n">b</span>
            <span class="n">zs</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">z</span><span class="p">)</span>
            <span class="n">activation</span> <span class="o">=</span> <span class="n">sigmoid_vec</span><span class="p">(</span><span class="n">z</span><span class="p">)</span>
            <span class="n">activations</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">activation</span><span class="p">)</span>
        <span class="c"># backward pass</span>
        <span class="n">delta</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">cost_derivative</span><span class="p">(</span><span class="n">activations</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="n">y</span><span class="p">)</span> <span class="o">*</span> \
            <span class="n">sigmoid_prime_vec</span><span class="p">(</span><span class="n">zs</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span>
        <span class="n">nabla_b</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">delta</span>
        <span class="n">nabla_w</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">delta</span><span class="p">,</span> <span class="n">activations</span><span class="p">[</span><span class="o">-</span><span class="mi">2</span><span class="p">]</span><span class="o">.</span><span class="n">transpose</span><span class="p">())</span>
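        <span class="c"># Note that the variable l in the loop below is used a little</span>
        <span class="c"># differently from the notation in Chapter 2 of the book.  Here,</span>
        <span class="c"># l = 1 means the last layer of neurons, l = 2 is the</span>
        <span class="c"># second-last layer, and so on.  It&#39;s a renumbering of the</span>
        <span class="c"># scheme in the book, used here to take advantage of the fact</span>
        <span class="c"># that Python can use negative indices in lists.</span>
        <span class="k">for</span> <span class="n">l</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">num_layers</span><span class="p">):</span>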
            <span class="n">z</span> <span class="o">=</span> <span class="n">zs</span><span class="p">[</span><span class="o">-</span><span class="n">l</span><span class="p">]</span>
            <span class="n">spv</span> <span class="o">=</span> <span class="n">sigmoid_prime_vec</span><span class="p">(</span><span class="n">z</span><span class="p">)</span>
            <span class="n">delta</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">weights</span><span class="p">[</span><span class="o">-</span><span class="n">l</span><span class="o">+</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">transpose</span><span class="p">(),</span> <span class="n">delta</span><span class="p">)</span> <span class="o">*</span> <span class="n">spv</span>
            <span class="n">nabla_b</span><span class="p">[</span><span class="o">-</span><span class="n">l</span><span class="p">]</span> <span class="o">=</span> <span class="n">delta</span>
            <span class="n">nabla_w</span><span class="p">[</span><span class="o">-</span><span class="n">l</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">delta</span><span class="p">,</span> <span class="n">activations</span><span class="p">[</span><span class="o">-</span><span class="n">l</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">transpose</span><span class="p">())</span>
        <span class="k">return</span> <span class="p">(</span><span class="n">nabla_b</span><span class="p">,</span> <span class="n">nabla_w</span><span class="p">)</span>
<span class="o">...</span>
    <span class="k">def</span> <span class="nf">cost_derivative</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">output_activations</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
        <span class="sd">&quot;&quot;&quot;Return the vector of partial derivatives \partial C_x /</span>
<span class="sd">        \partial a for the output activations.&quot;&quot;&quot;</span>
        <span class="k">return</span> <span class="p">(</span><span class="n">output_activations</span><span class="o">-</span><span class="n">y</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">sigmoid</span><span class="p">(</span><span class="n">z</span><span class="p">):</span>
    <span class="sd">&quot;&quot;&quot;The sigmoid function.&quot;&quot;&quot;</span>
    <span class="k">return</span> <span class="mf">1.0</span><span class="o">/</span><span class="p">(</span><span class="mf">1.0</span><span class="o">+</span><span class="n">np</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="o">-</span><span class="n">z</span><span class="p">))</span>

<span class="n">sigmoid_vec</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">vectorize</span><span class="p">(</span><span class="n">sigmoid</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">sigmoid_prime</span><span class="p">(</span><span class="n">z</span><span class="p">):</span>
    <span class="sd">&quot;&quot;&quot;Derivative of the sigmoid function.&quot;&quot;&quot;</span>
    <span class="k">return</span> <span class="n">sigmoid</span><span class="p">(</span><span class="n">z</span><span class="p">)</span><span class="o">*</span><span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">sigmoid</span><span class="p">(</span><span class="n">z</span><span class="p">))</span>

<span class="n">sigmoid_prime_vec</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">vectorize</span><span class="p">(</span><span class="n">sigmoid_prime</span><span class="p">)</span>
</pre></div>
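A quick sanity check of the helper functions may be useful here. The sketch below assumes only NumPy; the test values in <tt>z</tt> and the step <tt>eps</tt> are illustrative choices, not taken from the text. It checks that <tt>sigmoid</tt>, as written with <tt>np.exp</tt>, already acts elementwise on arrays (so the <tt>np.vectorize</tt> wrapper gives the same result), and that <tt>sigmoid_prime</tt> agrees with a numerically estimated slope.

```python
import numpy as np

def sigmoid(z):
    """The sigmoid function."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z) * (1 - sigmoid(z))

# Illustrative column vector of weighted inputs (values are made up).
z = np.array([[-2.0], [0.0], [3.0]])

# np.exp works elementwise, so sigmoid can be applied to the array directly.
direct = sigmoid(z)
vectorized = np.vectorize(sigmoid)(z)
same = np.allclose(direct, vectorized)

# Estimate the slope of sigmoid with a symmetric difference and compare.
eps = 1e-6
numeric_slope = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
slope_ok = np.allclose(numeric_slope, sigmoid_prime(z), atol=1e-6)

print(same, slope_ok)
```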
</p>
<p><a id="backprop_over_minibatch"></a>
<h4><a name="problem_269962"></a><a href="#problem_269962">
<!--Problem-->
Problem
</a></h4><ul>
<li><strong>
<!--Fully matrix-based approach to backpropagation over a mini-batch-->
Fully matrix-based approach to backpropagation over a mini-batch
</strong>
<!--Our implementation of stochastic gradient descent loops
  over training examples in a mini-batch.  It's possible to modify the
  backpropagation algorithm so that it computes the gradients for all
  training examples in a mini-batch simultaneously.  -->
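Our implementation of stochastic gradient descent loops over training examples in a mini-batch. It's possible to modify the backpropagation algorithm so that it computes the gradients for all training examples in a mini-batch simultaneously.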
<!--The idea is that
  instead of beginning with a single input vector, $x$, we can begin
  with a matrix $X = [x_1 x_2 \ldots x_m]$ whose columns are the
  vectors in the mini-batch.  We forward-propagate by multiplying by
  the weight matrices, adding a suitable matrix for the bias terms,
  and applying the sigmoid function everywhere. We backpropagate along
  similar lines.-->
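The idea is that instead of beginning with a single input vector, $x$, we can begin with a matrix $X = [x_1 x_2 \ldots x_m]$ whose columns are the vectors in the mini-batch.
We forward-propagate by multiplying by the weight matrices, adding a suitable matrix for the bias terms, and applying the sigmoid function everywhere. We backpropagate along similar lines.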
<!--  Explicitly write out pseudocode for this approach to
  the backpropagation algorithm.  Modify <tt>network.py</tt> so that it
  uses this fully matrix-based approach.  The advantage of this
  approach is that it takes full advantage of modern libraries for
  linear algebra.  As a result it can be quite a bit faster than
  looping over the mini-batch.  (On my laptop, for example, the
  speedup is about a factor of two when run on MNIST classification
  problems like those we considered in the last chapter.)  In
  practice, all serious libraries for backpropagation use this fully
  matrix-based approach or some variant.-->
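Explicitly write out pseudocode for this approach to the backpropagation algorithm.
Modify <tt>network.py</tt> so that it uses this fully matrix-based approach.
The advantage of this approach is that it takes full advantage of modern libraries for linear algebra. As a result it can be quite a bit faster than looping over the mini-batch.
(On my laptop, for example, the speedup is about a factor of two when run on MNIST classification problems like those we considered in the last chapter.)
In practice, all serious libraries for backpropagation use this fully matrix-based approach or some variant.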
</ul></p>
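As a starting point for the problem above, the forward half of the idea can be sketched as follows. This is only a sketch under stated assumptions: the layer sizes, the random weights, and the function names <tt>feedforward_single</tt> and <tt>feedforward_batch</tt> are hypothetical and not part of <tt>network.py</tt>. It relies on NumPy broadcasting to supply the "suitable matrix for the bias terms", and leaves the backward pass to the reader.

```python
import numpy as np

def sigmoid(z):
    """The sigmoid function, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical small network: 4 inputs, 3 hidden neurons, 2 outputs.
rng = np.random.default_rng(0)
sizes = [4, 3, 2]
biases = [rng.standard_normal((n, 1)) for n in sizes[1:]]
weights = [rng.standard_normal((n, k)) for k, n in zip(sizes[:-1], sizes[1:])]

def feedforward_single(x):
    """Propagate one column vector x through the network."""
    a = x
    for b, w in zip(biases, weights):
        a = sigmoid(np.dot(w, a) + b)
    return a

def feedforward_batch(X):
    """Propagate a matrix X whose columns are the inputs of a mini-batch."""
    A = X
    for b, w in zip(biases, weights):
        # b has shape (n, 1); broadcasting adds it to every column of w.A,
        # which is the same as adding the matrix [b b ... b].
        A = sigmoid(np.dot(w, A) + b)
    return A

# Build a mini-batch of m = 5 made-up inputs and compare both routes.
m = 5
xs = [rng.standard_normal((sizes[0], 1)) for _ in range(m)]
X = np.hstack(xs)                      # shape (4, 5), one column per example
batch_out = feedforward_batch(X)
loop_out = np.hstack([feedforward_single(x) for x in xs])

print(batch_out.shape, np.allclose(batch_out, loop_out))
```

Each column of <tt>batch_out</tt> matches what the per-example loop produces, so the matrix form changes how the work is organized and leaves the result unchanged.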
<p><h3><a name="in_what_sense_is_backpropagation_a_fast_algorithm"></a>
<a href="#in_what_sense_is_backpropagation_a_fast_algorithm">
<!--In what sense is backpropagation a fast algorithm?-->
In what sense is backpropagation a fast algorithm?
</a></h3></p>
<p>
<!--In what sense is backpropagation a fast algorithm?  To answer this
question, let's consider another approach to computing the gradient.
Imagine it's the early days of neural networks research.  Maybe it's
the 1950s or 1960s, and you're the first person in the world to think
of using gradient descent to learn!  But to make the idea work you
need a way of computing the gradient of the cost function.  You think
back to your knowledge of calculus, and decide to see if you can use
the chain rule to compute the gradient.  But after playing around a
bit, the algebra looks complicated, and you get discouraged.  So you
try to find another approach. -->
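In what sense is backpropagation a fast algorithm? To answer this question, let's consider another approach to computing the gradient.
Imagine it's the early days of neural networks research.
Maybe it's the 1950s or 1960s, and you're the first person in the world to think of using gradient descent to learn!
But to make the idea work you need a way of computing the gradient of the cost function.
You think back to your knowledge of calculus, and decide to see if you can use the chain rule to compute the gradient.
But after playing around a bit, the algebra looks complicated, and you get discouraged.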
<!-- You decide to regard the cost as a function of the weights $C = C(w)$ alone (we'll get back to the biases in a moment).  You number the weights $w_1, w_2, \ldots$, and want to compute $\partial C / \partial w_j$ for some particular weight $w_j$.
An obvious way of doing that is to use the approximation
<a class="displaced_anchor" name="eqtn46"></a>\begin{eqnarray}  \frac{\partial
    C}{\partial w_{j}} \approx \frac{C(w+\epsilon
    e_j)-C(w)}{\epsilon},
\tag{46}\end{eqnarray}
where $\epsilon > 0$ is a small positive number, and $e_j$ is the unit
vector in the $j^{\rm th}$ direction.  -->
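So you try to find another approach. You decide to regard the cost as a function of the weights alone, $C = C(w)$ (we'll get back to the biases in a moment).
You number the weights $w_1, w_2, \ldots$, and want to compute $\partial C / \partial w_j$ for some particular weight $w_j$.
An obvious way of doing that is to use the approximation
<a class="displaced_anchor" name="eqtn46"></a>\begin{eqnarray}  \frac{\partial C}{\partial w_{j}} \approx \frac{C(w+\epsilon e_j)-C(w)}{\epsilon},
\tag{46}\end{eqnarray}
where $\epsilon > 0$ is a small positive number, and $e_j$ is the unit vector in the $j^{\rm th}$ direction.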
<!--In other words, we can estimate
$\partial C / \partial w_j$ by computing the cost $C$ for two slightly
different values of $w_j$, and then applying Equation
<span id="margin_881761291800_reveal" class="equation_link">(46)</span><span id="margin_881761291800" class="marginequation" style="display: none;"><a href="chap2.html#eqtn46" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}  \frac{\partial
    C}{\partial w_{j}} \approx \frac{C(w+\epsilon
    e_j)-C(w)}{\epsilon} \nonumber\end{eqnarray}</a></span><script>$('#margin_881761291800_reveal').click(function() {$('#margin_881761291800').toggle('slow', function() {});});</script>.
The same idea will let us
compute the partial derivatives $\partial C / \partial b$ with respect
to the biases.-->
In other words, we can estimate $\partial C / \partial w_j$ by computing the cost $C$ for two slightly different values of $w_j$, and then applying Equation
<span id="margin_881761291800_reveal" class="equation_link">(46)</span><span id="margin_881761291800" class="marginequation" style="display: none;"><a href="chap2.html#eqtn46" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}  \frac{\partial C}{\partial w_{j}} \approx \frac{C(w+\epsilon e_j)-C(w)}{\epsilon} \nonumber\end{eqnarray}</a></span><script>$('#margin_881761291800_reveal').click(function() {$('#margin_881761291800').toggle('slow', function() {});});</script>.
The same idea will let us compute the partial derivatives $\partial C / \partial b$ with respect to the biases.
</p>
<p>
<!--This approach looks very promising.  It's simple conceptually, and
extremely easy to implement, using just a few lines of code.
Certainly, it looks much more promising than the idea of using the
chain rule to compute the gradient!-->
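This approach looks very promising.
It's simple conceptually, and extremely easy to implement, using just a few lines of code.
Certainly, it looks much more promising than the idea of using the chain rule to compute the gradient!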
</p>
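<p>To make "a few lines of code" concrete, here is a minimal sketch of the forward-difference estimate in Equation (46). The function name <tt>numerical_gradient</tt> and the stand-in cost <tt>toy_cost</tt> are assumptions for illustration; a real network's cost would take the place of <tt>toy_cost</tt>.</p>

```python
import numpy as np

def numerical_gradient(cost, w, eps=1e-6):
    """Estimate dC/dw_j for every j with the forward difference (46)."""
    base = cost(w)                      # C(w), computed once
    grad = np.zeros_like(w)
    for j in range(w.size):
        e_j = np.zeros_like(w)
        e_j[j] = 1.0                    # unit vector in the j-th direction
        grad[j] = (cost(w + eps * e_j) - base) / eps
    return grad

def toy_cost(w):
    # Hypothetical stand-in cost whose exact gradient is simply w.
    return 0.5 * np.sum(w ** 2)

w = np.array([1.0, -2.0, 3.0])
grad = numerical_gradient(toy_cost, w)
```

<p>For this stand-in cost the estimate agrees with the exact gradient $w$ up to an error of order $\epsilon$.</p>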
<p>
<!--Unfortunately, while this approach appears promising, when you
implement the code it turns out to be extremely slow.  To understand
why, imagine we have a million weights in our network.  Then for each
distinct weight $w_j$ we need to compute $C(w+\epsilon e_j)$ in order
to compute $\partial C / \partial w_j$.  That means that to compute
the gradient we need to compute the cost function a million different
times, requiring a million forward passes through the network (per
training example).  We need to compute $C(w)$ as well, so that's a
total of a million and one passes through the network.-->
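Unfortunately, while this approach appears promising, when you implement the code it turns out to be extremely slow.
To understand why, imagine we have a million weights in our network.
Then for each distinct weight $w_j$ we need to compute $C(w+\epsilon e_j)$ in order to compute $\partial C / \partial w_j$.
That means that to compute the gradient we need to compute the cost function a million different times, requiring a million forward passes through the network (per training example).
We need to compute $C(w)$ as well, so that's a total of a million and one passes through the network.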
</p>
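<p>The count of passes can be checked directly. The sketch below wraps a stand-in cost in a hypothetical <tt>CountingCost</tt> helper (an assumption for illustration, not part of the book's code) and counts how often the cost is evaluated while estimating the gradient for 1,000 weights.</p>

```python
import numpy as np

class CountingCost:
    """Wrap a cost function and count how many times it is evaluated."""
    def __init__(self, cost):
        self.cost = cost
        self.calls = 0

    def __call__(self, w):
        self.calls += 1
        return self.cost(w)

def numerical_gradient(cost, w, eps=1e-6):
    """Forward-difference estimate of the gradient, as in Equation (46)."""
    base = cost(w)
    grad = np.zeros_like(w)
    for j in range(w.size):
        e_j = np.zeros_like(w)
        e_j[j] = 1.0
        grad[j] = (cost(w + eps * e_j) - base) / eps
    return grad

# Stand-in cost; each call plays the role of one forward pass.
counted = CountingCost(lambda w: 0.5 * np.sum(w ** 2))
w = np.ones(1000)
numerical_gradient(counted, w)
n_evaluations = counted.calls  # one per weight, plus one for C(w)
```

<p>With 1,000 weights the cost is evaluated 1,001 times, and the count grows in step with the number of weights.</p>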
<p>
<!--
What's clever about backpropagation is that it enables us to
simultaneously compute <em>all</em> the partial derivatives $\partial C
/ \partial w_j$ using just one forward pass through the network,
followed by one backward pass through the network.  Roughly speaking,
the computational cost of the backward pass is about the same as the
forward pass-->
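What's clever about backpropagation is that it enables us to simultaneously compute <em>all</em> the partial derivatives $\partial C / \partial w_j$ using just one forward pass through the network, followed by one backward pass through the network.
Roughly speaking, the computational cost of the backward pass is about the same as the forward pass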
<!--
2015/1/3 Kenta OONO
backpropagationとbackward passの訳し分けに悩む。backpropagationは逆伝播ではなく、誤差逆伝播もしくは誤差逆伝播アルゴリズムと言ったほうが適切かもしれない
-->
<!--
*<span class="marginnote">
*This should be plausible, but it requires some
  analysis to make a careful statement.  It's plausible because the
  dominant computational cost in the forward pass is multiplying by
  the weight matrices, while in the backward pass it's multiplying by
  the transposes of the weight matrices.  These operations obviously
  have similar computational cost.</span>. -->
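*<span class="marginnote">*This should be plausible, but it requires some analysis to make a careful statement. It's plausible because the dominant computational cost in the forward pass is multiplying by the weight matrices, while in the backward pass it's multiplying by the transposes of the weight matrices. These operations obviously have similar computational cost.</span>.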
<!--And so the total cost of backpropagation is roughly the same as making just two forward passes through the network.  Compare that to the million and one forward passes we needed for the approach based on
<span id="margin_570144158257_reveal" class="equation_link">(46)</span><span id="margin_570144158257" class="marginequation" style="display: none;"><a href="chap2.html#eqtn46" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}  \frac{\partial
    C}{\partial w_{j}} \approx \frac{C(w+\epsilon
    e_j)-C(w)}{\epsilon} \nonumber\end{eqnarray}</a></span><script>$('#margin_570144158257_reveal').click(function() {$('#margin_570144158257').toggle('slow', function() {});});</script>
!-->
And so the total cost of backpropagation is roughly the same as making just two forward passes through the network.
Compare that to the million and one forward passes we needed for the approach based on
<span id="margin_570144158257_reveal" class="equation_link">(46)</span><span id="margin_570144158257" class="marginequation" style="display: none;"><a href="chap2.html#eqtn46" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}  \frac{\partial
    C}{\partial w_{j}} \approx \frac{C(w+\epsilon
    e_j)-C(w)}{\epsilon} \nonumber\end{eqnarray}</a></span><script>$('#margin_570144158257_reveal').click(function() {$('#margin_570144158257').toggle('slow', function() {});});</script>!
<!--And so even though backpropagation appears superficially more complex than the approach based on
<span id="margin_453558672576_reveal" class="equation_link">(46)</span><span id="margin_453558672576" class="marginequation" style="display: none;"><a href="chap2.html#eqtn46" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}  \frac{\partial
    C}{\partial w_{j}} \approx \frac{C(w+\epsilon
    e_j)-C(w)}{\epsilon} \nonumber\end{eqnarray}</a></span><script>$('#margin_453558672576_reveal').click(function() {$('#margin_453558672576').toggle('slow', function() {});});</script>
, it's actually much, much faster.-->
And so even though backpropagation appears superficially more complex than the approach based on
<span id="margin_453558672576_reveal" class="equation_link">(46)</span><span id="margin_453558672576" class="marginequation" style="display: none;"><a href="chap2.html#eqtn46" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}  \frac{\partial
    C}{\partial w_{j}} \approx \frac{C(w+\epsilon
    e_j)-C(w)}{\epsilon} \nonumber\end{eqnarray}</a></span><script>$('#margin_453558672576_reveal').click(function() {$('#margin_453558672576').toggle('slow', function() {});});</script>
, it's actually much, much faster.
</p>
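<p>The two ways of computing the gradient can be compared on a small example. The following is a sketch under stated assumptions, not the book's <tt>network.py</tt>: it assumes sigmoid neurons, the quadratic cost on a single training example, and a hypothetical 3-4-2 network with random weights. The function names are made up for illustration.</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def cost(weights, biases, x, y):
    """Quadratic cost for one training example (x, y): one forward pass."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return 0.5 * np.sum((a - y) ** 2)

def backprop_weight_gradients(weights, biases, x, y):
    """One forward pass plus one backward pass gives dC/dW for every layer."""
    a, activations, zs = x, [x], []
    for W, b in zip(weights, biases):
        z = W @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)
    # Error in the output layer, then propagated backwards layer by layer.
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grads = [None] * len(weights)
    grads[-1] = delta @ activations[-2].T
    for l in range(2, len(weights) + 1):
        delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
        grads[-l] = delta @ activations[-l - 1].T
    return grads

def finite_difference_weight_gradients(weights, biases, x, y, eps=1e-6):
    """Equation (46): one extra forward pass for every single weight."""
    base = cost(weights, biases, x, y)
    grads = [np.zeros_like(W) for W in weights]
    for layer, W in enumerate(weights):
        for idx in np.ndindex(*W.shape):
            original = W[idx]
            W[idx] = original + eps          # perturb one weight in place
            grads[layer][idx] = (cost(weights, biases, x, y) - base) / eps
            W[idx] = original                # restore it
    return grads

rng = np.random.default_rng(0)
sizes = [3, 4, 2]
weights = [rng.standard_normal((n, m)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((n, 1)) for n in sizes[1:]]
x = rng.standard_normal((3, 1))
y = rng.standard_normal((2, 1))

fast = backprop_weight_gradients(weights, biases, x, y)
slow = finite_difference_weight_gradients(weights, biases, x, y)
max_gap = max(np.max(np.abs(f - s)) for f, s in zip(fast, slow))
```

<p>The two results agree up to the error of the finite-difference approximation. The backpropagation route makes one forward and one backward pass, while the finite-difference route makes one forward pass per weight plus one more for $C(w)$.</p>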
<p>
<!--This speedup was first fully appreciated in 1986, and it greatly
expanded the range of problems that neural networks could solve.
That, in turn, caused a rush of people using neural networks.  Of
course, backpropagation is not a panacea.  Even in the late 1980s
people ran up against limits, especially when attempting to use
backpropagation to train deep neural networks, i.e., networks with
many hidden layers.  Later in the book we'll see how modern computers
and some clever new ideas now make it possible to use backpropagation
to train such deep neural networks.-->
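This speedup was first fully appreciated in 1986, and it greatly expanded the range of problems that neural networks could solve. That, in turn, caused a rush of people using neural networks.
Of course, backpropagation is not a panacea.
Even in the late 1980s people ran up against limits, especially when attempting to use backpropagation to train deep neural networks, i.e., networks with many hidden layers. Later in the book we'll see how modern computers and some clever new ideas now make it possible to use backpropagation to train such deep neural networks.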
</p>
<p><h3><a name="backpropagation_the_big_picture"></a><a href="#backpropagation_the_big_picture">
<!--Backpropagation: the big picture-->
Backpropagation: the big picture
<!--
2015/1/5 Kenta OONO
the big pictureがどういうニュアンスなのかがきちんとわかっていない。
-->
</a></h3></p>
<p>
<!--As I've explained it, backpropagation presents two mysteries.  First,
what's the algorithm really doing?  We've developed a picture of the
error being backpropagated from the output.  But can we go any deeper,
and build up more intuition about what is going on when we do all
these matrix and vector multiplications?  The second mystery is how
someone could ever have discovered backpropagation in the first place?
It's one thing to follow the steps in an algorithm, or even to follow
the proof that the algorithm works.  But that doesn't mean you
understand the problem so well that you could have discovered the
algorithm in the first place.  Is there a plausible line of reasoning
that could have led you to discover the backpropagation algorithm?  In
this section I'll address both these mysteries.-->
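As I've explained it, backpropagation presents two mysteries.
First, what's the algorithm really doing?
We've developed a picture of the error being backpropagated from the output.
But can we go any deeper, and build up more intuition about what is going on when we do all these matrix and vector multiplications?
The second mystery is how someone could ever have discovered backpropagation in the first place.
It's one thing to follow the steps in an algorithm, or even to follow the proof that the algorithm works. But that doesn't mean you understand the problem so well that you could have discovered the algorithm in the first place.
Is there a plausible line of reasoning that could have led you to discover the backpropagation algorithm?
In this section I'll address both these mysteries.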
</p>
<p>
<!--To improve our intuition about what the algorithm is doing, let's
imagine that we've made a small change $\Delta w^l_{jk}$ to some
weight in the network, $w^l_{jk}$:
<center>
<img src="images/tikz22.png"/>
</center>
That change in weight will cause a change in the output activation
from the corresponding neuron:
<center>
<img src="images/tikz23.png"/>
</center>
That, in turn, will cause a change in <em>all</em> the activations in
the next layer:
<center>
<img src="images/tikz24.png"/>
</center>-->
To improve our intuition about what the algorithm is doing, let's imagine that we've made a small change $\Delta w^l_{jk}$ to some weight in the network, $w^l_{jk}$:
<center>
<img src="images/tikz22.png"/>
</center>
That change in weight will cause a change in the output activation from the corresponding neuron:
<center>
<img src="images/tikz23.png"/>
</center>
That, in turn, will cause a change in <em>all</em> the activations in the next layer:
<center>
<img src="images/tikz24.png"/>
</center>
<!--Those changes will in turn cause changes in the next layer, and then
the next, and so on all the way through to causing a change in the
final layer, and then in the cost function:
<center>
<img src="images/tikz25.png"/>
</center>-->
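Those changes will in turn cause changes in the next layer, and then the next, and so on all the way through to causing a change in the final layer, and then in the cost function: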
<center>
<img src="images/tikz25.png"/>
</center>
<!--The change $\Delta C$ in the cost is related to the change $\Delta
w^l_{jk}$ in the weight by the equation
<a class="displaced_anchor" name="eqtn47"></a>\begin{eqnarray}
  \Delta C \approx \frac{\partial C}{\partial w^l_{jk}} \Delta w^l_{jk}.
\tag{47}\end{eqnarray}-->
The change $\Delta C$ in the cost is related to the change $\Delta w^l_{jk}$ in the weight by the equation
<a class="displaced_anchor" name="eqtn47"></a>\begin{eqnarray}
  \Delta C \approx \frac{\partial C}{\partial w^l_{jk}} \Delta w^l_{jk}.
\tag{47}\end{eqnarray}
<!--This suggests that a possible approach to computing $\frac{\partial
  C}{\partial w^l_{jk}}$ is to carefully track how a small change in
$w^l_{jk}$ propagates to cause a small change in $C$.  If we can do
that, being careful to express everything along the way in terms of
easily computable quantities, then we should be able to compute
$\partial C / \partial w^l_{jk}$.-->
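This suggests that a possible approach to computing $\frac{\partial C}{\partial w^l_{jk}}$ is to carefully track how a small change in $w^l_{jk}$ propagates to cause a small change in $C$.
If we can do that, being careful to express everything along the way in terms of easily computable quantities, then we should be able to compute $\partial C / \partial w^l_{jk}$.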
</p>
<p>
<!--Let's try to carry this out.  The change $\Delta w^l_{jk}$ causes a
small change $\Delta a^{l}_j$ in the activation of the $j^{\rm th}$ neuron in the $l^{\rm th}$ layer.
This change is given by
<a class="displaced_anchor" name="eqtn48"></a>\begin{eqnarray}
  \Delta a^l_j \approx \frac{\partial a^l_j}{\partial w^l_{jk}} \Delta w^l_{jk}.
\tag{48}\end{eqnarray}-->

Let's try to carry this out.
The change $\Delta w^l_{jk}$ causes a small change $\Delta a^{l}_j$ in the activation of the $j^{\rm th}$ neuron in the $l^{\rm th}$ layer.
This change is given by
<a class="displaced_anchor" name="eqtn48"></a>\begin{eqnarray}
  \Delta a^l_j \approx \frac{\partial a^l_j}{\partial w^l_{jk}} \Delta w^l_{jk}.
\tag{48}\end{eqnarray}

<!--The change in activation $\Delta a^l_{j}$ will cause changes in
<em>all</em> the activations in the next layer, i.e., the $(l+1)^{\rm
  th}$ layer.  We'll concentrate on the way just a single one of those
activations is affected, say $a^{l+1}_q$,
<center>
<img src="images/tikz26.png"/>
</center>
In fact, it'll cause the following change:
<a class="displaced_anchor" name="eqtn49"></a>\begin{eqnarray}
  \Delta a^{l+1}_q \approx \frac{\partial a^{l+1}_q}{\partial a^l_j} \Delta a^l_j.
\tag{49}\end{eqnarray}-->
<!--
2015/1/5 Kenta OONO
章全体で、「出力活性」と「活性」で表現がぶれている
-->
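The change in activation $\Delta a^l_{j}$ will cause changes in <em>all</em> the activations in the next layer, i.e., the $(l+1)^{\rm th}$ layer.
We'll concentrate on the way just a single one of those activations is affected, say $a^{l+1}_q$: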
<center>
<img src="images/tikz26.png"/>
</center>
In fact, it'll cause the following change:
<a class="displaced_anchor" name="eqtn49"></a>\begin{eqnarray}
  \Delta a^{l+1}_q \approx \frac{\partial a^{l+1}_q}{\partial a^l_j} \Delta a^l_j.
\tag{49}\end{eqnarray}
<!--Substituting in the expression from Equation
<span id="margin_545387421716_reveal" class="equation_link">(48)</span><span id="margin_545387421716" class="marginequation" style="display: none;"><a href="chap2.html#eqtn48" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \Delta a^l_j \approx \frac{\partial a^l_j}{\partial w^l_{jk}} \Delta w^l_{jk} \nonumber\end{eqnarray}</a></span><script>$('#margin_545387421716_reveal').click(function() {$('#margin_545387421716').toggle('slow', function() {});});</script>
, we get:
<a class="displaced_anchor" name="eqtn50"></a>\begin{eqnarray}
  \Delta a^{l+1}_q \approx \frac{\partial a^{l+1}_q}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}} \Delta w^l_{jk}.
\tag{50}\end{eqnarray}-->
Substituting in the expression from Equation
<span id="margin_545387421716_reveal" class="equation_link">(48)</span><span id="margin_545387421716" class="marginequation" style="display: none;"><a href="chap2.html#eqtn48" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \Delta a^l_j \approx \frac{\partial a^l_j}{\partial w^l_{jk}} \Delta w^l_{jk} \nonumber\end{eqnarray}</a></span><script>$('#margin_545387421716_reveal').click(function() {$('#margin_545387421716').toggle('slow', function() {});});</script>
, we get:
<a class="displaced_anchor" name="eqtn50"></a>\begin{eqnarray}
  \Delta a^{l+1}_q \approx \frac{\partial a^{l+1}_q}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}} \Delta w^l_{jk}.
\tag{50}\end{eqnarray}
<!--Of course, the change $\Delta a^{l+1}_q$ will, in turn, cause changes
in the activations in the next layer.  In fact, we can imagine a path
all the way through the network from $w^l_{jk}$ to $C$, with each
change in activation causing a change in the next activation, and,
finally, a change in the cost at the output.-->
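Of course, the change $\Delta a^{l+1}_q$ will, in turn, cause changes in the activations in the next layer.
In fact, we can imagine a path all the way through the network from $w^l_{jk}$ to $C$, with each change in activation causing a change in the next activation, and, finally, a change in the cost at the output.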
<!--If the path goes through activations $a^l_j, a^{l+1}_q, \ldots, a^{L-1}_n, a^L_m$ then the resulting expression is
<a class="displaced_anchor" name="eqtn51"></a>\begin{eqnarray}
  \Delta C \approx \frac{\partial C}{\partial a^L_m}
  \frac{\partial a^L_m}{\partial a^{L-1}_n}
  \frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \ldots
  \frac{\partial a^{l+1}_q}{\partial a^l_j}
  \frac{\partial a^l_j}{\partial w^l_{jk}} \Delta w^l_{jk},
\tag{51}\end{eqnarray}
that is, we've picked up a $\partial a / \partial a$ type term for
each additional neuron we've passed through, as well as the $\partial
C/\partial a^L_m$ term at the end.-->
If the path goes through activations $a^l_j, a^{l+1}_q, \ldots, a^{L-1}_n, a^L_m$ then the resulting expression is
<a class="displaced_anchor" name="eqtn51"></a>\begin{eqnarray}
  \Delta C \approx \frac{\partial C}{\partial a^L_m}
  \frac{\partial a^L_m}{\partial a^{L-1}_n}
  \frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \ldots
  \frac{\partial a^{l+1}_q}{\partial a^l_j}
  \frac{\partial a^l_j}{\partial w^l_{jk}} \Delta w^l_{jk},
\tag{51}\end{eqnarray}
that is, we've picked up a $\partial a / \partial a$ type term for each additional neuron we've passed through, as well as the $\partial C/\partial a^L_m$ term at the end.
<!--This represents the change in $C$ due to changes in the activations along this particular path through the network.  Of course, there's many paths by which a change in $w^l_{jk}$ can propagate to affect the cost, and we've been considering just a single path.-->
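This represents the change in $C$ due to changes in the activations along this particular path through the network.
Of course, there are many paths by which a change in $w^l_{jk}$ can propagate to affect the cost, and we've been considering just a single path.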
<!--To compute the total change in $C$ it is plausible that we should sum over all the possible paths between the weight and the final cost, i.e.,
<a class="displaced_anchor" name="eqtn52"></a>\begin{eqnarray}
  \Delta C \approx \sum_{mnp\ldots q} \frac{\partial C}{\partial a^L_m}
  \frac{\partial a^L_m}{\partial a^{L-1}_n}
  \frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \ldots
  \frac{\partial a^{l+1}_q}{\partial a^l_j}
  \frac{\partial a^l_j}{\partial w^l_{jk}} \Delta w^l_{jk},
\tag{52}\end{eqnarray}
where we've summed over all possible choices for the intermediate neurons along the path. -->
To compute the total change in $C$ it is plausible that we should sum over all the possible paths between the weight and the final cost, i.e.,
<a class="displaced_anchor" name="eqtn52"></a>\begin{eqnarray}
  \Delta C \approx \sum_{mnp\ldots q} \frac{\partial C}{\partial a^L_m}
  \frac{\partial a^L_m}{\partial a^{L-1}_n}
  \frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \ldots
  \frac{\partial a^{l+1}_q}{\partial a^l_j}
  \frac{\partial a^l_j}{\partial w^l_{jk}} \Delta w^l_{jk},
\tag{52}\end{eqnarray}
where we've summed over all possible choices for the intermediate neurons along the path.
<!--Comparing with
<span id="margin_941820302952_reveal" class="equation_link">(47)</span><span id="margin_941820302952" class="marginequation" style="display: none;"><a href="chap2.html#eqtn47" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \Delta C \approx \frac{\partial C}{\partial w^l_{jk}} \Delta w^l_{jk} \nonumber\end{eqnarray}</a></span><script>$('#margin_941820302952_reveal').click(function() {$('#margin_941820302952').toggle('slow', function() {});});</script>
we see that
<a class="displaced_anchor" name="eqtn53"></a>\begin{eqnarray}
  \frac{\partial C}{\partial w^l_{jk}} = \sum_{mnp\ldots q} \frac{\partial C}{\partial a^L_m}
  \frac{\partial a^L_m}{\partial a^{L-1}_n}
  \frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \ldots
  \frac{\partial a^{l+1}_q}{\partial a^l_j}
  \frac{\partial a^l_j}{\partial w^l_{jk}}.
\tag{53}\end{eqnarray}-->
Comparing with
<span id="margin_941820302952_reveal" class="equation_link">(47)</span><span id="margin_941820302952" class="marginequation" style="display: none;"><a href="chap2.html#eqtn47" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \Delta C \approx \frac{\partial C}{\partial w^l_{jk}} \Delta w^l_{jk} \nonumber\end{eqnarray}</a></span><script>$('#margin_941820302952_reveal').click(function() {$('#margin_941820302952').toggle('slow', function() {});});</script>
we see that
<a class="displaced_anchor" name="eqtn53"></a>\begin{eqnarray}
  \frac{\partial C}{\partial w^l_{jk}} = \sum_{mnp\ldots q} \frac{\partial C}{\partial a^L_m}
  \frac{\partial a^L_m}{\partial a^{L-1}_n}
  \frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \ldots
  \frac{\partial a^{l+1}_q}{\partial a^l_j}
  \frac{\partial a^l_j}{\partial w^l_{jk}}.
\tag{53}\end{eqnarray}
<!--Now, Equation
<span id="margin_315845878357_reveal" class="equation_link">(53)</span><span id="margin_315845878357" class="marginequation" style="display: none;"><a href="chap2.html#eqtn53" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \frac{\partial C}{\partial w^l_{jk}} = \sum_{mnp\ldots q} \frac{\partial C}{\partial a^L_m}
  \frac{\partial a^L_m}{\partial a^{L-1}_n}
  \frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \ldots
  \frac{\partial a^{l+1}_q}{\partial a^l_j}
  \frac{\partial a^l_j}{\partial w^l_{jk}} \nonumber\end{eqnarray}</a></span><script>$('#margin_315845878357_reveal').click(function() {$('#margin_315845878357').toggle('slow', function() {});});</script>
looks complicated. -->
Now, Equation
<span id="margin_315845878357_reveal" class="equation_link">(53)</span><span id="margin_315845878357" class="marginequation" style="display: none;"><a href="chap2.html#eqtn53" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \frac{\partial C}{\partial w^l_{jk}} = \sum_{mnp\ldots q} \frac{\partial C}{\partial a^L_m}
  \frac{\partial a^L_m}{\partial a^{L-1}_n}
  \frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \ldots
  \frac{\partial a^{l+1}_q}{\partial a^l_j}
  \frac{\partial a^l_j}{\partial w^l_{jk}} \nonumber\end{eqnarray}</a></span><script>$('#margin_315845878357_reveal').click(function() {$('#margin_315845878357').toggle('slow', function() {});});</script>
looks complicated.
<!--However, it has a nice intuitive interpretation.
We're computing the rate of change of $C$ with respect to a weight in the network.
What the equation tells us is that every edge between two neurons in the
network is associated with a rate factor which is just the partial
derivative of one neuron's activation with respect to the other
neuron's activation.
The edge from the first weight to the first neuron has a rate factor $\partial a^{l}_j / \partial w^l_{jk}$. -->
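However, it has a nice intuitive interpretation.
We're computing the rate of change of $C$ with respect to a weight in the network.
What the equation tells us is that every edge between two neurons in the network is associated with a rate factor which is just the partial derivative of one neuron's activation with respect to the other neuron's activation.
The edge from the first weight to the first neuron does not start at a neuron, but it too has a rate factor, namely $\partial a^{l}_j / \partial w^l_{jk}$.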
<!--
2015/1/5 Kenta OONO
最後の文章に訳注を補足
-->
<!--
The rate factor for a path is just the product of the rate factors along
the path.
And the total rate of change $\partial C / \partial w^l_{jk}$ is just the sum of the rate factors of all paths from the initial weight to the final cost.  This procedure is illustrated here, for a single path:
<center>
<img src="images/tikz27.png"/>
</center>-->
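The rate factor for a path is just the product of the rate factors along the path.
And the total rate of change $\partial C / \partial w^l_{jk}$ is just the sum of the rate factors of all paths from the initial weight to the final cost.
This procedure is illustrated here, for a single path:
<center>
<img src="images/tikz27.png"/>
</center>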
</p>
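<p>The sum over paths in Equation (53) can be tried out numerically on a tiny network. This is a sketch under stated assumptions: sigmoid neurons, the quadratic cost on one training example, and a hypothetical 2-3-3-2 network with random weights. Indices are 0-based, and <tt>j</tt>, <tt>k</tt> pick out one weight feeding the first hidden layer. The code enumerates every path from that weight to the cost, multiplies the rate factors along each path, and adds the results.</p>

```python
import itertools
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Hypothetical network and training example, for illustration only.
rng = np.random.default_rng(1)
sizes = [2, 3, 3, 2]
weights = [rng.standard_normal((n, m)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(n) for n in sizes[1:]]
x = rng.standard_normal(sizes[0])
y = rng.standard_normal(sizes[-1])

def forward(ws):
    """Return the weighted inputs and activations of every layer."""
    a, zs, acts = x, [], [x]
    for W, b in zip(ws, biases):
        z = W @ a + b
        zs.append(z)
        a = sigmoid(z)
        acts.append(a)
    return zs, acts

def cost(ws):
    return 0.5 * np.sum((forward(ws)[1][-1] - y) ** 2)

j, k = 1, 0  # the weight from input neuron k to first-hidden-layer neuron j
zs, acts = forward(weights)

# Rate factor of the first edge: da_j / dw_jk = sigma'(z_j) * a_k.
first_edge = sigmoid_prime(zs[0][j]) * acts[0][k]

path_sum = 0.0
later_layers = [range(n) for n in sizes[2:]]
for path in itertools.product(*later_layers):
    factor = first_edge
    prev = j
    for layer, q in enumerate(path, start=1):
        # Rate factor of the edge prev -> q: w_{q,prev} * sigma'(z_q).
        factor *= weights[layer][q, prev] * sigmoid_prime(zs[layer][q])
        prev = q
    factor *= acts[-1][prev] - y[prev]  # dC/da^L_m for the quadratic cost
    path_sum += factor

# Independent check with the finite-difference estimate of Equation (46).
eps = 1e-6
bumped = [W.copy() for W in weights]
bumped[0][j, k] += eps
numeric = (cost(bumped) - cost(weights)) / eps
```

<p>Here there are only $3 \times 2 = 6$ paths, and their summed rate factors match the finite-difference estimate. The number of paths grows multiplicatively with the layer widths, which is why enumerating them one by one is only practical for small examples like this.</p>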
<p>
<!--What I've been providing up to now is a heuristic argument, a way of
thinking about what's going on when you perturb a weight in a network.
Let me sketch out a line of thinking you could use to further develop
this argument.-->
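What I've been providing up to now is a heuristic argument, a way of thinking about what's going on when you perturb a weight in a network.
Let me sketch out a line of thinking you could use to further develop this argument.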
<!--First, you could derive explicit expressions for all
the individual partial derivatives in Equation
<span id="margin_881976946368_reveal" class="equation_link">(53)</span><span id="margin_881976946368" class="marginequation" style="display: none;"><a href="chap2.html#eqtn53" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \frac{\partial C}{\partial w^l_{jk}} = \sum_{mnp\ldots q} \frac{\partial C}{\partial a^L_m}
  \frac{\partial a^L_m}{\partial a^{L-1}_n}
  \frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \ldots
  \frac{\partial a^{l+1}_q}{\partial a^l_j}
  \frac{\partial a^l_j}{\partial w^l_{jk}} \nonumber\end{eqnarray}</a></span><script>$('#margin_881976946368_reveal').click(function() {$('#margin_881976946368').toggle('slow', function() {});});</script>
.-->
First, you could derive explicit expressions for all the individual partial derivatives in Equation
<span id="margin_881976946368_reveal" class="equation_link">(53)</span><span id="margin_881976946368" class="marginequation" style="display: none;"><a href="chap2.html#eqtn53" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \frac{\partial C}{\partial w^l_{jk}} = \sum_{mnp\ldots q} \frac{\partial C}{\partial a^L_m}
  \frac{\partial a^L_m}{\partial a^{L-1}_n}
  \frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \ldots
  \frac{\partial a^{l+1}_q}{\partial a^l_j}
  \frac{\partial a^l_j}{\partial w^l_{jk}} \nonumber\end{eqnarray}</a></span><script>$('#margin_881976946368_reveal').click(function() {$('#margin_881976946368').toggle('slow', function() {});});</script>
.
<!-- That's easy to do with a bit of calculus.  Having done that, you could then try to figure out how to write all the sums over indices as matrix multiplications.  This turns out to be tedious, and requires some persistence, but not extraordinary insight.  After doing all this, and then simplifying as much as possible, what you discover is that you end up with exactly the backpropagation algorithm!  And so you can think of the
backpropagation algorithm as providing a way of computing the sum over
the rate factor for all these paths.  Or, to put it slightly
differently, the backpropagation algorithm is a clever way of keeping
track of small perturbations to the weights (and biases) as they
propagate through the network, reach the output, and then affect the
cost.-->
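That's easy to do with a bit of calculus.
Having done that, you could then try to figure out how to write all the sums over indices as matrix multiplications. This turns out to be tedious, and requires some persistence, but not extraordinary insight.
After doing all this, and then simplifying as much as possible, what you discover is that you end up with exactly the backpropagation algorithm!
And so you can think of the backpropagation algorithm as providing a way of computing the sum over the rate factor for all these paths.
Or, to put it slightly differently, the backpropagation algorithm is a clever way of keeping track of small perturbations to the weights (and biases) as they propagate through the network, reach the output, and then affect the cost.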
</p>
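<p>As a sketch of the "bit of calculus" involved, assume the definitions used earlier in this chapter, $a^l_j = \sigma(z^l_j)$ with $z^l_j = \sum_k w^l_{jk} a^{l-1}_k + b^l_j$. Then the two kinds of rate factor appearing in Equation (53) work out to
\begin{eqnarray}
  \frac{\partial a^l_j}{\partial w^l_{jk}} & = & \sigma'(z^l_j) \, a^{l-1}_k, \nonumber \\
  \frac{\partial a^{l+1}_q}{\partial a^l_j} & = & \sigma'(z^{l+1}_q) \, w^{l+1}_{qj}. \nonumber
\end{eqnarray}
The remaining factor $\partial C / \partial a^L_m$ depends on the choice of cost. If we assume the quadratic cost $C = \frac{1}{2} \sum_m (y_m - a^L_m)^2$ for a single training example, it is $a^L_m - y_m$. Notice that a sum over an intermediate index such as $\sum_q \frac{\partial a^{l+2}_p}{\partial a^{l+1}_q} \frac{\partial a^{l+1}_q}{\partial a^l_j}$ has exactly the form of an entry in a product of two matrices, which is how the matrix multiplications enter.</p>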
<p>
<!--Now, I'm not going work through all this here.  It's messy and
requires considerable care to work through all the details.  If you're
up for a challenge, you may enjoy attempting it.  And even if not, I
hope this line of thinking gives you some insight into what
backpropagation is accomplishing.-->
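Now, I'm not going to work through all this here. It's messy and requires considerable care to work through all the details.
If you're up for a challenge, you may enjoy attempting it.
And even if not, I hope this line of thinking gives you some insight into what backpropagation is accomplishing.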
</p>
<p>
<!--What about the other mystery - how backpropagation could have been
discovered in the first place?  In fact, if you follow the approach I
just sketched you will discover a proof of backpropagation.
Unfortunately, the proof is quite a bit longer and more complicated
than the one I described earlier in this chapter.  So how was that
short (but more mysterious) proof discovered?  What you find when you
write out all the details of the long proof is that, after the fact,
there are several obvious simplifications staring you in the face.
You make those simplifications, get a shorter proof, and write that
out.  And then several more obvious simplifications jump out at
you.So you repeat again. -->
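What about the other mystery, namely how backpropagation could have been discovered in the first place?
In fact, if you follow the approach I just sketched you will discover a proof of backpropagation.
Unfortunately, the proof is quite a bit longer and more complicated than the one I described earlier in this chapter.
So how was that short (but more mysterious) proof discovered?
What you find when you write out all the details of the long proof is that, after the fact, there are several obvious simplifications staring you in the face. You make those simplifications, get a shorter proof, and write that out. And then several more obvious simplifications jump out at you. So you repeat again.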
<!-- The result after a few iterations is the proof we saw earlier *<span class="marginnote">
*There is one clever step required.  In
  Equation
<span id="margin_558321628107_reveal" class="equation_link">(53)</span><span id="margin_558321628107" class="marginequation" style="display: none;"><a href="chap2.html#eqtn53" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \frac{\partial C}{\partial w^l_{jk}} = \sum_{mnp\ldots q} \frac{\partial C}{\partial a^L_m}
  \frac{\partial a^L_m}{\partial a^{L-1}_n}
  \frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \ldots
  \frac{\partial a^{l+1}_q}{\partial a^l_j}
  \frac{\partial a^l_j}{\partial w^l_{jk}} \nonumber\end{eqnarray}</a></span><script>$('#margin_558321628107_reveal').click(function() {$('#margin_558321628107').toggle('slow', function() {});});</script>
the intermediate variables are
  activations like $a_q^{l+1}$.  The clever idea is to switch to using
  weighted inputs, like $z^{l+1}_q$, as the intermediate variables. If
  you don't have this idea, and instead continue using the activations
  $a^{l+1}_q$, the proof you obtain turns out to be slightly more
  complex than the proof given earlier in the chapter.</span>
- short, but somewhat obscure, because all the signposts to its construction have been removed!-->
The result after a few iterations is the proof we saw earlier*<span class="marginnote">
*There is one clever step required.
In Equation
<span id="margin_558321628107_reveal" class="equation_link">(53)</span><span id="margin_558321628107" class="marginequation" style="display: none;"><a href="chap2.html#eqtn53" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \frac{\partial C}{\partial w^l_{jk}} = \sum_{mnp\ldots q} \frac{\partial C}{\partial a^L_m}
  \frac{\partial a^L_m}{\partial a^{L-1}_n}
  \frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \ldots
  \frac{\partial a^{l+1}_q}{\partial a^l_j}
  \frac{\partial a^l_j}{\partial w^l_{jk}} \nonumber\end{eqnarray}</a></span><script>$('#margin_558321628107_reveal').click(function() {$('#margin_558321628107').toggle('slow', function() {});});</script>
the intermediate variables are activations like $a_q^{l+1}$. The clever idea is to switch to using weighted inputs, like $z^{l+1}_q$, as the intermediate variables. If you don't have this idea, and instead continue using the activations $a^{l+1}_q$, the proof you obtain turns out to be slightly more complex than the proof given earlier in the chapter.</span>.
That proof is short, but somewhat obscure, because all the signposts to its construction have been removed!
<!--I am, of course, asking you to trust me on this, but
there really is no great mystery to the origin of the earlier proof.
It's just a lot of hard work simplifying the proof I've sketched in
this section.-->
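I am, of course, asking you to trust me on this, but there really is no great mystery to the origin of the earlier proof. It's just a lot of hard work simplifying the proof I've sketched in this section.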
</p>
<p><br/><br/><br/></p>
<p>
</div><div class="footer"> <span class="left_footer"> In academic work,
please cite this book as: Michael A. Nielsen, "Neural Networks and
Deep Learning", Determination Press, 2014

<br/>
<br/>

This work is licensed under a <a rel="license"
href="http://creativecommons.org/licenses/by-nc/3.0/deed.en_GB"
style="color: #eee;">Creative Commons Attribution-NonCommercial 3.0
Unported License</a>.  This means you're free to copy, share, and
build on this book, but not to sell it.  If you're interested in
commercial use, please <a
href="mailto:mn@michaelnielsen.org">contact me</a>.
</span>
<span class="right_footer">
Last update: Tue Sep  2 09:19:44 2014
<br/>
<br/>
<br/>
<a rel="license" href="http://creativecommons.org/licenses/by-nc/3.0/deed.en_GB"><img alt="Creative Commons Licence" style="border-width:0" src="http://i.creativecommons.org/l/by-nc/3.0/88x31.png" /></a>
</span>
</div>
<script>
  (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
  m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
  })(window,document,'script','//www.google-analytics.com/analytics.js','ga');

  ga('create', 'UA-44208967-1', 'neuralnetworksanddeeplearning.com');
  ga('send', 'pageview');

</script>
</body>
</html>