This post gives a plain, straightforward introduction to the backpropagation algorithm (the BP algorithm, The Back Propagation Algorithm): the intuition behind it, the computation it involves, and a program implementation.
Consider the simplest possible layer, one with two inputs \(x_1, x_2\) and a single output \(y_1\):
In mathematical form this is: \[ y_1 = w_{1}x_1+w_{2}x_2\]
In the expression above, the subscript of \(w\) follows \(x\). Since \(y\) can be extended to \(y_j\) (more than one output), we adjust the subscripts of \(w\) so that the index of \(y\) is included as well: \[ y_1 = w_{11}x_1+w_{12}x_2\]
Here the first subscript \(j\) of \(w_{ji}\) follows \(y\), while \(i\) follows \(x\), and the whole expression can be written in matrix form: \[Y=WX\]
where the input \(X\) has \(n\) dimensions and the output \(Y\) has \(m\) dimensions:
\[X=[x_1,\ x_2, ... , x_i,\ ...\ ,\ x_n\ ]^T\]
\[Y=[y_1,\ y_2, ... , y_j,\ ...\ ,\ y_m\ ]^T\]
\[ W= {\left[ \begin{array}{ccc} w_{11},\ w_{12}, ... ,\ w_{1n}\ \\ ... , w_{ji}, ... \\ w_{m1},\ w_{m2}, ... ,\ w_{mn}\ \\ \end{array}\right ]} = \{ w_{ji} \}_{m \times n}\]
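For example, the two-input, one-output layer at the start corresponds to \(n=2\) and \(m=1\): \(W\) is just the single row \([\,w_{11},\ w_{12}\,]\), and \(Y=WX\) reduces to \(y_1 = w_{11}x_1+w_{12}x_2\).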
Usually we also add a bias term \(B\) to the expression above:
\[Y=WX+B\]
\(B\) has the same dimension as \(Y\): \[B=[b_1,\ b_2, ... , b_j,\ ...\ ,\ b_m\ ]^T\]
Usually a differentiable activation function \(\sigma\) is also applied to introduce nonlinearity:
\[Y=\sigma(WX+B)\]
The above is the "forward" part of the computation.
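As a concrete reference, here is a minimal numpy sketch of this forward pass. The function names, the sigmoid choice of \(\sigma\), and the column-vector shapes are assumptions made for illustration only:

```python
import numpy as np

def sigmoid(z):
    # a differentiable activation function sigma (any differentiable choice works)
    return 1.0 / (1.0 + np.exp(-z))

def forward(W, B, X):
    # X: n x 1 column vector, W: m x n, B: m x 1
    Z = W @ X + B        # Z = WX + B
    Y = sigmoid(Z)       # Y = sigma(Z)
    return Z, Y          # Z is kept because the backward pass will need it
```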
Next comes the "backward" (backpropagation) part. Assume the loss function \(\mathbb{J}\) has \(d\) dimensions, indexed by \(k\):
\[\mathbb{J}=[J_1,\ J_2, ... , J_k,\ ...\ ,\ J_d\ ]^T\]
The loss gradient has to flow from the output \(Y\) back to the input \(X\), as well as to \(W\) and \(B\) so that the weights and bias can be updated. The problem can therefore be stated as:
\[ \text{Given }\ \frac{\partial J_k}{\partial y_{j}},\ \text{find }\ \frac{\partial J_k}{\partial w_{ji}},\ \frac{\partial J_k}{\partial b_j}\ \text{ and }\ \frac{\partial J_k}{\partial x_i} \]
Gradient propagation uses the chain rule from calculus. For convenience, we define one extra variable \(Z\):
\[ Z = WX+B \]
\[ Y = \sigma (Z) \]
\[Z=[z_1,\ z_2, ... , z_j,\ ...\ ,\ z_m\ ]^T\]
Note that this is no different from the original \(Y=\sigma(WX+B)\); the computation simply has an extra intermediate variable \(Z\).
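Written element-wise (this is the form the partial derivatives below are taken against):
\[ z_j = \sum_{i=1}^{n} w_{ji}\,x_i + b_j, \qquad y_j = \sigma(z_j) \]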
Define the output gradient (known): \[\nabla_{out} = \frac{\partial J_k}{\partial y_{j}}\]
Define the input gradient (to be found): \[\nabla_{in} = \frac{\partial J_k}{\partial x_{i}}\]
Now expand the quantities we want with the chain rule:
\[ \frac{\partial J_k}{\partial w_{ji}} = \frac{\partial J_k}{\partial y_{j}} \frac{\partial y_j}{\partial z_j} \frac{\partial z_j}{\partial w_{ji}} = G_{kj}\ \frac{\partial z_j}{\partial w_{ji}} \ \ ...(1)\]
\[ \frac{\partial J_k}{\partial b_{j}} = \frac{\partial J_k}{\partial y_{j}} \frac{\partial y_j}{\partial z_j} \frac{\partial z_j}{\partial b_{j}} = G_{kj}\ \frac{\partial z_j}{\partial b_{j}} \ \ ...(2)\]
\[ \frac{\partial J_k}{\partial x_{i}}\ (\nabla_{in}) = \sum_{j} \frac{\partial J_k}{\partial y_{j}} \frac{\partial y_j}{\partial z_j} \frac{\partial z_j}{\partial x_{i}} = \sum_{j} G_{kj}\ \frac{\partial z_j}{\partial x_{i}} \ \ ...(3)\]
(The sum over \(j\) appears only in (3): \(x_i\) feeds into every \(z_j\), whereas each \(w_{ji}\) and \(b_j\) affects only its own \(z_j\).)
Each of the expressions above repeats the factor \(\frac{\partial J_k}{\partial y_{j}} \frac{\partial y_j}{\partial z_j}\). For convenience, define the intermediate gradient \(\mathbb{G}\), whose elements are:
\[ G_{kj} = \frac{\partial J_k}{\partial y_{j}} \frac{\partial y_j}{\partial z_j} \]
\[ \mathbb{G}= {\left[ \begin{array}{ccc} G_{11},\ G_{12}, ... ,\ G_{1m}\ \\ ... , G_{kj}, ... \\ G_{d1},\ G_{d2}, ... ,\ G_{dm}\ \\ \end{array}\right ]} =\{ G_{kj}\}_{d \times m}\]
At this point the whole computation flow (the computation graph) is clear; our goal is to find:
\[ \text{Find }\ \mathbb{G},\ \frac{\partial z_j}{\partial w_{ji}} ,\ \frac{\partial z_j}{\partial b_{j}} ,\ \frac{\partial z_j}{\partial x_{i}} \]
Once the four quantities above are computed, everything we are after follows.
- Compute \(G_{kj}\)
\[ \frac{\partial y_j}{\partial z_j} = \sigma ' (z_j) \]
\[ \Rightarrow G_{kj} = \frac{\partial J_k}{\partial y_{j}}\ \sigma ' (z_j) = \nabla_{out}\ \sigma ' (z_j)\]
- Compute \(\frac{\partial z_j}{\partial w_{ji}} ,\ \frac{\partial z_j}{\partial b_{j}} ,\ \frac{\partial z_j}{\partial x_{i}}\)
\[ \frac{\partial z_j}{\partial w_{ji}} = x_{i} \]
\[ \frac{\partial z_j}{\partial b_{j}} = 1 \]
\[ \frac{\partial z_j}{\partial x_{i}} = w_{ji} \]
- Substitute into equations (1), (2) and (3) to obtain the result
\[ \Rightarrow \frac{\partial J_k}{\partial w_{ji}} = G_{kj} \ x_i\ ,\\ \ \frac{\partial J_k}{\partial b_{j}} = G_{kj} , \\ \ \frac{\partial J_k}{\partial x_{i}}\ (\nabla_{in}) = \sum_{j} G_{kj}\ w_{ji}\ \]
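In practice the loss is usually a scalar (\(d=1\)). Writing \(\delta_j = G_{1j} = \frac{\partial J}{\partial z_j}\) and collecting these into an \(m\)-dimensional column vector \(\delta\), the three results above collapse into the matrix form typically used in code (an added summary here, assuming a scalar loss):
\[ \frac{\partial J}{\partial W} = \delta\, X^{T}, \qquad \frac{\partial J}{\partial B} = \delta, \qquad \nabla_{in} = \frac{\partial J}{\partial X} = W^{T} \delta \]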
That is the entire single-layer computation of the BP algorithm. In Python it can be written as a `backward(self, dout)` method that receives the upstream gradient `dout`; a sketch of such a method is given below the source link.
Code source: https://github.com/purelyvivid/DeepLearning_practice/blob/master/3.%20BP%20algo.py (a BP algorithm written with numpy)
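Below is a self-contained sketch of a single layer with both passes, written directly from the formulas derived above rather than taken from the linked repository. The sigmoid activation, the column-vector shapes, the learning rate `lr`, and the in-place gradient-descent update are all assumptions of this sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

class Layer:
    def __init__(self, n_in, n_out):
        # W: m x n, B: m x 1, matching the shapes in the derivation
        self.W = np.random.randn(n_out, n_in) * 0.01
        self.B = np.zeros((n_out, 1))

    def forward(self, X):
        # Z = WX + B, Y = sigma(Z); cache X and Z for the backward pass
        self.X = X
        self.Z = self.W @ X + self.B
        return sigmoid(self.Z)

    def backward(self, dout, lr=0.1):
        # dout = dJ/dY, the known output gradient (m x 1)
        # G = dJ/dZ = dout * sigma'(Z), computed element-wise
        G = dout * sigmoid_prime(self.Z)

        dW = G @ self.X.T        # dJ/dW = G x^T   -- equation (1)
        dB = G                   # dJ/dB = G       -- equation (2)
        din = self.W.T @ G       # dJ/dX = W^T G   -- equation (3)

        # gradient-descent update of the weights and bias
        self.W -= lr * dW
        self.B -= lr * dB
        return din               # becomes the previous layer's dout
```

When several such layers are stacked, each layer's `forward` output feeds the next layer's `forward`, and each layer's returned `din` becomes the `dout` of the layer before it; this hand-off of gradients from layer to layer is exactly the "back propagation".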