This topic is too hardcore, so I'll just drop a couple of links (a placeholder to fill in later) and run.

  1. https://zhuanlan.zhihu.com/p/24709748

  2. https://github.com/microsoft/ai-edu/blob/master/A-基础教程/A2-神经网络基本原理简明教程/Appendix/01.1-基本数学导数公式.md#115-矩阵求导

Because link 2 has changed, the article's content is backed up below.

# 01.1 Basic Mathematical Derivative Formulas

# 1.1 Derivative Formulas for Basic Functions

# 1.1.1 Basic Functions and Their Derivatives

| No. | Function | Derivative | Note |
|---|---|---|---|
| 1 | $y=c$ | $y'=0$ | |
| 2 | $y=x^a$ | $y'=ax^{a-1}$ | |
| 3 | $y=\log_a x$ | $y'=\frac{1}{x}\log_a e=\frac{1}{x\ln a}$ | |
| 4 | $y=\ln x$ | $y'=\frac{1}{x}$ | |
| 5 | $y=a^x$ | $y'=a^x\ln a$ | |
| 6 | $y=e^x$ | $y'=e^x$ | |
| 7 | $y=e^{-x}$ | $y'=-e^{-x}$ | |
| 8 | $y=\sin(x)$ | $y'=\cos(x)$ | Sine |
| 9 | $y=\cos(x)$ | $y'=-\sin(x)$ | Cosine |
| 10 | $y=\tan(x)$ | $y'=\sec^2(x)=\frac{1}{\cos^2 x}$ | Tangent |
| 11 | $y=\cot(x)$ | $y'=-\csc^2(x)$ | Cotangent |
| 12 | $y=\arcsin(x)$ | $y'=\frac{1}{\sqrt{1-x^2}}$ | Inverse sine |
| 13 | $y=\arccos(x)$ | $y'=-\frac{1}{\sqrt{1-x^2}}$ | Inverse cosine |
| 14 | $y=\arctan(x)$ | $y'=\frac{1}{1+x^2}$ | Inverse tangent |
| 15 | $y=\operatorname{arccot}(x)$ | $y'=-\frac{1}{1+x^2}$ | Inverse cotangent |
| 16 | $y=\sinh(x)=(e^x-e^{-x})/2$ | $y'=\cosh(x)$ | Hyperbolic sine |
| 17 | $y=\cosh(x)=(e^x+e^{-x})/2$ | $y'=\sinh(x)$ | Hyperbolic cosine |
| 18 | $y=\tanh(x)=(e^x-e^{-x})/(e^x+e^{-x})$ | $y'=\operatorname{sech}^2(x)=1-\tanh^2(x)$ | Hyperbolic tangent |
| 19 | $y=\coth(x)=(e^x+e^{-x})/(e^x-e^{-x})$ | $y'=-\operatorname{csch}^2(x)$ | Hyperbolic cotangent |
| 20 | $y=\operatorname{sech}(x)=2/(e^x+e^{-x})$ | $y'=-\operatorname{sech}(x)\tanh(x)$ | Hyperbolic secant |
| 21 | $y=\operatorname{csch}(x)=2/(e^x-e^{-x})$ | $y'=-\operatorname{csch}(x)\coth(x)$ | Hyperbolic cosecant |
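
A quick way to gain confidence in a table like this is to compare each closed-form derivative with a finite-difference estimate. The sketch below is not part of the original article; the helper `central_diff`, the base `a = 3`, and the test point `x0 = 0.5` are arbitrary choices made for illustration, and only a subset of the entries is checked.

```python
import math

def central_diff(f, x, h=1e-6):
    """Approximate f'(x) with a symmetric (central) finite difference."""
    return (f(x + h) - f(x - h)) / (2 * h)

a = 3.0  # arbitrary constant used for the x^a, log_a x and a^x entries

# name -> (function, derivative as listed in the table)
table = {
    "x^a":      (lambda x: x ** a,           lambda x: a * x ** (a - 1)),
    "log_a x":  (lambda x: math.log(x, a),   lambda x: 1 / (x * math.log(a))),
    "a^x":      (lambda x: a ** x,           lambda x: a ** x * math.log(a)),
    "tan x":    (math.tan,                   lambda x: 1 / math.cos(x) ** 2),
    "arcsin x": (math.asin,                  lambda x: 1 / math.sqrt(1 - x * x)),
    "arccos x": (math.acos,                  lambda x: -1 / math.sqrt(1 - x * x)),
    "arctan x": (math.atan,                  lambda x: 1 / (1 + x * x)),
    "tanh x":   (math.tanh,                  lambda x: 1 - math.tanh(x) ** 2),
    "coth x":   (lambda x: 1 / math.tanh(x), lambda x: -1 / math.sinh(x) ** 2),
    "sech x":   (lambda x: 1 / math.cosh(x), lambda x: -math.tanh(x) / math.cosh(x)),
}

x0 = 0.5  # a point inside the domain of every function above
errors = {name: abs(central_diff(f, x0) - df(x0)) for name, (f, df) in table.items()}

for name, err in errors.items():
    print(f"{name:9s} |numeric - formula| = {err:.2e}")
```

Every printed gap should be tiny compared with the derivative values themselves; a large gap would point to a typo in the corresponding formula.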

# 1.1.2 Sum, Difference, Product, and Quotient Rules

$$[u(x) + v(x)]' = u'(x) + v'(x) \tag{30}$$

$$[u(x) - v(x)]' = u'(x) - v'(x) \tag{31}$$

$$[u(x) \cdot v(x)]' = u'(x) \cdot v(x) + v'(x) \cdot u(x) \tag{32}$$

$$\left[\frac{u(x)}{v(x)}\right]' = \frac{u'(x)v(x) - v'(x)u(x)}{v^2(x)} \tag{33}$$
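
As a small worked example (added for illustration, not taken from the original article), applying the quotient rule (33) with $u(x)=\sin x$ and $v(x)=\cos x$ recovers entry 10 of the table in 1.1.1, the derivative of the tangent function:

$$
(\tan x)' = \left[\frac{\sin x}{\cos x}\right]' = \frac{\cos x \cdot \cos x - (-\sin x) \cdot \sin x}{\cos^2 x} = \frac{\cos^2 x + \sin^2 x}{\cos^2 x} = \frac{1}{\cos^2 x} = \sec^2 x
$$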

# 1.1.3 Partial Derivatives

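Let $Z=f(x,y)$. The partial derivative of $Z$ with respect to $x$ can be understood as the ordinary derivative of $Z$ with respect to $x$ while $y$ is held constant:

$$Z'_x=f'_x(x,y)=\frac{\partial{Z}}{\partial{x}} \tag{40}$$

Likewise, the partial derivative of $Z$ with respect to $y$ is the ordinary derivative of $Z$ with respect to $y$ while $x$ is held constant:

$$Z'_y=f'_y(x,y)=\frac{\partial{Z}}{\partial{y}} \tag{41}$$

For a function of two variables, the geometric meaning of the partial derivative is this: for any fixed value $y=y_0$, slice the surface of the function with the plane $y=y_0$ to obtain the curve $Z = f(x, y_0)$. The first derivative of that curve is the partial derivative of $Z$ with respect to $x$. In the same way, slicing at $x=x_0$ gives the partial derivative of $Z$ with respect to $y$.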
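
The slicing picture can be tried out numerically. The following is a minimal sketch under assumed choices: the function `f`, the point `(x0, y0)`, and the helper `derivative` are all hypothetical and serve only to illustrate the idea of freezing one variable and differentiating the resulting one-variable curve.

```python
import math

def f(x, y):
    """An arbitrary smooth two-variable function chosen for illustration."""
    return x ** 2 * y + math.sin(x * y)

def derivative(g, t, h=1e-6):
    """Central-difference derivative of a one-variable function g at t."""
    return (g(t + h) - g(t - h)) / (2 * h)

x0, y0 = 1.2, 0.7

# Slice the surface at y = y0: a curve in x only.
slice_at_y0 = lambda x: f(x, y0)
# Slice the surface at x = x0: a curve in y only.
slice_at_x0 = lambda y: f(x0, y)

dZ_dx_numeric = derivative(slice_at_y0, x0)
dZ_dy_numeric = derivative(slice_at_x0, y0)

# Partial derivatives obtained by hand, treating the other variable as a constant.
dZ_dx_exact = 2 * x0 * y0 + y0 * math.cos(x0 * y0)
dZ_dy_exact = x0 ** 2 + x0 * math.cos(x0 * y0)

print(dZ_dx_numeric, dZ_dx_exact)
print(dZ_dy_numeric, dZ_dy_exact)
```

Each printed pair should agree to many decimal places, since differentiating the slice is exactly what the partial derivative means.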

# 1.1.4 Derivatives of Composite Functions (Chain Rule)

- If $y=f(u)$ and $u=g(x)$, then:

$$y'_x = f'(u) \cdot u'(x) = y'_u \cdot u'_x=\frac{dy}{du} \cdot \frac{du}{dx} \tag{50}$$

- If $y=f(u)$, $u=g(v)$, and $v=h(x)$, then:

$$\frac{dy}{dx}=f'(u) \cdot g'(v) \cdot h'(x)=\frac{dy}{du} \cdot \frac{du}{dv} \cdot \frac{dv}{dx} \tag{51}$$

- If $Z=f(U,V)$ becomes a composite function of $x$ and $y$ through the intermediate variables $U = g(x,y)$ and $V=h(x,y)$, that is, $Z=f[g(x,y),h(x,y)]$, then:

$$\frac{\partial{Z}}{\partial{x}}=\frac{\partial{Z}}{\partial{U}} \cdot \frac{\partial{U}}{\partial{x}} + \frac{\partial{Z}}{\partial{V}} \cdot \frac{\partial{V}}{\partial{x}} \tag{52}$$

$$\frac{\partial{Z}}{\partial{y}}=\frac{\partial{Z}}{\partial{U}} \cdot \frac{\partial{U}}{\partial{y}} + \frac{\partial{Z}}{\partial{V}} \cdot \frac{\partial{V}}{\partial{y}}$$
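
The two-variable chain rule (52) can be checked on a concrete case. This is only a sketch: the functions `U`, `V`, and `f` and the evaluation point are made-up examples, not something the original article specifies.

```python
import math

# Hypothetical intermediate variables and outer function.
def U(x, y):
    return x * y

def V(x, y):
    return x + y ** 2

def f(u, v):
    return u ** 2 + math.sin(v)

def Z(x, y):
    """The composite function Z = f[g(x, y), h(x, y)]."""
    return f(U(x, y), V(x, y))

x, y = 0.8, -0.3
u, v = U(x, y), V(x, y)

# Pieces of the chain rule, each differentiated by hand.
dZ_dU, dZ_dV = 2 * u, math.cos(v)
dU_dx, dU_dy = y, x
dV_dx, dV_dy = 1.0, 2 * y

chain_dx = dZ_dU * dU_dx + dZ_dV * dV_dx   # formula (52)
chain_dy = dZ_dU * dU_dy + dZ_dV * dV_dy   # the companion formula for y

# Independent check: finite differences applied directly to Z(x, y).
h = 1e-6
numeric_dx = (Z(x + h, y) - Z(x - h, y)) / (2 * h)
numeric_dy = (Z(x, y + h) - Z(x, y - h)) / (2 * h)

print(chain_dx, numeric_dx)
print(chain_dy, numeric_dy)
```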

# 1.1.5 Matrix Derivatives

Let $A$, $B$, and $X$ be matrices. Then:

$$B\frac{\partial{(AX)}}{\partial{X}} = A^TB \tag{60}$$

$$B\frac{\partial{(XA)}}{\partial{X}} = BA^T \tag{61}$$

$$\frac{\partial{(X^TA)}}{\partial{X}} = \frac{\partial{(A^TX)}}{\partial{X}}=A \tag{62}$$

$$\frac{\partial{(A^TXB)}}{\partial{X}} = AB^T \tag{63}$$

$$\frac{\partial{(A^TX^TB)}}{\partial{X}} = BA^T, \quad \frac{d(X^TAX)}{dX} = (A+A^T)X \tag{64}$$

$$\frac{dX^T}{dX} = I, \quad \frac{dX}{dX^T} = I, \quad \frac{d(X^TX)}{dX}=2X \tag{65}$$

$$\frac{du}{dX^T} = \left(\frac{du^T}{dX}\right)^T$$
$$\frac{d(u^Tv)}{dx} = \frac{du^T}{dx}v + \frac{dv^T}{dx}u, \quad \frac{d(uv^T)}{dx} = \frac{du}{dx}v^T + u\frac{dv^T}{dx} \tag{66}$$

$$\frac{d(AB)}{dX} = \frac{dA}{dX}B + A\frac{dB}{dX} \tag{67}$$

$$\frac{d(u^TXv)}{dX}=uv^T, \quad \frac{d(u^TX^TXu)}{dX}=2Xuu^T \tag{68}$$

$$\frac{d[(Xu-v)^T(Xu-v)]}{dX}=2(Xu-v)u^T \tag{69}$$
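
Formula (69) is the gradient of a least-squares style loss, so it lends itself to a numerical gradient check. The sketch below uses plain Python lists and assumes the convention that the derivative of a scalar with respect to $X$ has the same shape as $X$; the sizes and numbers, and the helper names `residual`, `loss`, `analytic_grad`, and `numeric_grad`, are illustrative assumptions.

```python
def residual(X, u, v):
    """Return the vector r = X u - v."""
    return [sum(X[i][k] * u[k] for k in range(len(u))) - v[i] for i in range(len(v))]

def loss(X, u, v):
    """The scalar f(X) = (Xu - v)^T (Xu - v)."""
    return sum(r * r for r in residual(X, u, v))

def analytic_grad(X, u, v):
    """Formula (69): 2 (Xu - v) u^T, an outer product with the shape of X."""
    r = residual(X, u, v)
    return [[2 * r[i] * u[j] for j in range(len(u))] for i in range(len(r))]

def numeric_grad(X, u, v, h=1e-6):
    """Perturb one entry x_ij at a time and take a central difference."""
    grad = [[0.0] * len(X[0]) for _ in X]
    for i in range(len(X)):
        for j in range(len(X[0])):
            original = X[i][j]
            X[i][j] = original + h
            f_plus = loss(X, u, v)
            X[i][j] = original - h
            f_minus = loss(X, u, v)
            X[i][j] = original          # restore the entry
            grad[i][j] = (f_plus - f_minus) / (2 * h)
    return grad

X = [[1.0, 2.0, 0.5], [-1.0, 0.0, 3.0]]   # 2 x 3
u = [0.5, -1.0, 2.0]                       # length 3
v = [1.0, -2.0]                            # length 2

ga = analytic_grad(X, u, v)
gn = numeric_grad(X, u, v)
max_err = max(abs(ga[i][j] - gn[i][j]) for i in range(2) for j in range(3))
print("largest entry-wise gap:", max_err)
```

The same pattern (closed form versus entry-by-entry finite differences) can be reused for the other scalar-valued formulas in this list, such as $u^TXv$ in (68).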

# 1.1.6 Definition of the Derivative of a Scalar with Respect to a Matrix

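Suppose $y$ is a scalar and $X$ is a matrix of size $N \times M$, with $y=f(X)$, where $f$ is a function. Let us see how $df$ should be computed.

First, the definition:

$$df = \sum_j^M\sum_i^N \frac{\partial{f}}{\partial{x_{ij}}}dx_{ij}$$

Next we introduce the trace of a matrix. The trace of a square matrix is the sum of its diagonal elements, that is:

$$\operatorname{tr}(X) = \sum_i x_{ii}$$

With the trace in hand, can the gradient computation above be expressed in terms of it?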

$$\frac{\partial{f}}{\partial{X}} = \begin{pmatrix} \frac{\partial{f}}{\partial{x_{11}}} & \frac{\partial{f}}{\partial{x_{12}}} & \dots & \frac{\partial{f}}{\partial{x_{1M}}} \\ \frac{\partial{f}}{\partial{x_{21}}} & \frac{\partial{f}}{\partial{x_{22}}} & \dots & \frac{\partial{f}}{\partial{x_{2M}}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial{f}}{\partial{x_{N1}}} & \frac{\partial{f}}{\partial{x_{N2}}} & \dots & \frac{\partial{f}}{\partial{x_{NM}}} \end{pmatrix} \tag{90}$$

$$dX = \begin{pmatrix} dx_{11} & dx_{12} & \dots & dx_{1M} \\ dx_{21} & dx_{22} & \dots & dx_{2M} \\ \vdots & \vdots & \ddots & \vdots \\ dx_{N1} & dx_{N2} & \dots & dx_{NM} \end{pmatrix} \tag{91}$$

Consider the diagonal elements of the product of the transpose of matrix $(90)$ and matrix $(91)$:

$$\left(\left(\frac{\partial f}{\partial X}\right)^T dX\right)_{jj}=\sum_i^N \frac{\partial f}{\partial x_{ij}} dx_{ij}$$

Therefore,

$$\operatorname{tr}\left(\left(\frac{\partial{f}}{\partial{X}}\right)^TdX\right) = \sum_j^M\sum_i^N\frac{\partial{f}}{\partial{x_{ij}}}dx_{ij} = df = \operatorname{tr}(df) \tag{92}$$

The last equality holds because $df$ is a scalar, and the trace of a scalar is the scalar itself.
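
To see identity (92) at work, one can pick a concrete scalar function and compare three quantities: the actual change in $f$ under a small perturbation, the double sum from the definition, and the trace expression. The sketch below assumes the illustrative function $f(X)=\sum_{ij} x_{ij}^3$ (so that $\partial f/\partial x_{ij} = 3x_{ij}^2$); the matrices and helper names are made up for the example.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def trace(A):
    """Sum of the diagonal elements of a square matrix."""
    return sum(A[i][i] for i in range(len(A)))

def f(X):
    """Illustrative scalar function: the sum of cubes of all entries."""
    return sum(x ** 3 for row in X for x in row)

def grad_f(X):
    """Matrix (90) for this f: entry (i, j) is 3 * x_ij^2."""
    return [[3 * x ** 2 for x in row] for row in X]

N, M = 2, 3
X = [[1.0, -2.0, 0.5], [3.0, 0.25, -1.5]]
dX = [[1e-6, -2e-6, 3e-6], [-1e-6, 2e-6, 0.5e-6]]   # a small perturbation

G = grad_f(X)

# Definition: double sum over all entries.
df_sum = sum(G[i][j] * dX[i][j] for i in range(N) for j in range(M))
# Identity (92): trace of (df/dX)^T dX, an M x M matrix.
df_trace = trace(matmul(transpose(G), dX))
# Actual change of f, which matches df up to second-order terms in dX.
X_plus = [[X[i][j] + dX[i][j] for j in range(M)] for i in range(N)]
df_actual = f(X_plus) - f(X)

print(df_sum, df_trace, df_actual)
```

The first two numbers agree up to floating-point rounding, while the third differs only by terms quadratic in the perturbation.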

# 1.1.7 Some Properties of the Matrix Trace and Derivatives

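This section lists some properties of the matrix trace and of matrix differentials, as a reference for later derivations. Impatient readers may simply take them as given results.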

$$d(X + Y) = dX + dY \tag{93}$$

$$d(XY) = (dX)Y + X(dY) \tag{94}$$

$$dX^T = {(dX)}^T \tag{95}$$

$$d(\operatorname{tr}(X)) = \operatorname{tr}(dX) \tag{96}$$

$$d(X \odot Y) = dX \odot Y + X \odot dY \tag{97}$$

$$d(f(X)) = f'(X) \odot dX \tag{98}$$

$$\operatorname{tr}(XY) = \operatorname{tr}(YX) \tag{99}$$

$$\operatorname{tr}(A^T (B \odot C)) = \operatorname{tr}((A \odot B)^T C) \tag{100}$$

The proofs of these properties are all similar, so we take equation (94) as an example. Let

$$Z = XY$$

Then an arbitrary entry of $Z$ is

$$
\begin{aligned}
z_{ij} &= \sum_k x_{ik}y_{kj} \\
dz_{ij} &= \sum_k d(x_{ik}y_{kj}) \\
&= \sum_k (dx_{ik}) y_{kj} + \sum_k x_{ik} (dy_{kj}) \\
&= \left((dX)Y\right)_{ij} + \left(X(dY)\right)_{ij}
\end{aligned}
$$

From this we see that every entry of $dZ$ equals the corresponding entry of $(dX)Y + X(dY)$. Hence equation (94) holds.
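
Several of these properties can also be spot-checked numerically. The following sketch, again in plain Python with made-up matrices and hypothetical helper names, tests (99), (100), and (94). For (94) the exact change $(X+dX)(Y+dY)-XY$ differs from $(dX)Y+X(dY)$ only by the second-order term $(dX)(dY)$, so the gap should be tiny for small perturbations.

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

def hadamard(A, B):
    """Element-wise product, written as A ⊙ B in the text."""
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def max_abs(A):
    return max(abs(a) for row in A for a in row)

X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]     # 3 x 2
Y = [[0.5, -1.0, 2.0], [1.5, 0.0, -2.5]]     # 2 x 3

# (99): tr(XY) = tr(YX), even though XY is 3 x 3 and YX is 2 x 2.
gap_99 = abs(trace(matmul(X, Y)) - trace(matmul(Y, X)))

# (100): A, B, C must share one shape.
A = X
B = [[2.0, -1.0], [0.5, 3.0], [-2.0, 1.0]]
C = [[1.0, 4.0], [-3.0, 0.5], [2.0, -1.0]]
lhs_100 = trace(matmul(transpose(A), hadamard(B, C)))
rhs_100 = trace(matmul(transpose(hadamard(A, B)), C))
gap_100 = abs(lhs_100 - rhs_100)

# (94): d(XY) = (dX)Y + X(dY), up to the second-order term (dX)(dY).
dX = [[1e-6, -2e-6], [3e-6, 1e-6], [-1e-6, 2e-6]]
dY = [[2e-6, 1e-6, -1e-6], [-3e-6, 2e-6, 1e-6]]
dZ_exact = sub(matmul(add(X, dX), add(Y, dY)), matmul(X, Y))
dZ_formula = add(matmul(dX, Y), matmul(X, dY))
gap_94 = max_abs(sub(dZ_exact, dZ_formula))

print(gap_99, gap_100, gap_94)
```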

# References

矩阵求导术 (Matrix Differentiation Techniques)