全连接层的实现-JobPlus

全连接层的推导

全连接层的每一个结点都与上一层的所有结点相连，用来把前边提取到的特征综合起来。由于其全相连的特性，一般全连接层的参数也是最多的。

全连接层的前向计算

下图中连线最密集的2个地方就是全连接层，这很明显的可以看出全连接层的参数的确很多。在前向计算过程，也就是一个线性的加权求和的过程，全连接层的每一个输出都可以看成前一层的每一个结点乘以一个权重系数W，最后加上一个偏置值b得到，即。如下图中第一个全连接层，输入有50*4*4个神经元结点，输出有500个结点，则一共需要50*4*4*500=400000个权值参数W和500个偏置参数b。

下面用一个简单的网络具体介绍一下推导过程

其中，x1、x2、x3为全连接层的输入，a1、a2、a3为输出，根据我前边在笔记1中的推导，有

可以写成如下矩阵形式：

全连接层的反向传播

以我们的第一个全连接层为例，该层有50*4*4=800个输入结点和500个输出结点。

由于需要对W和b进行更新，还要向前传递梯度，所以我们需要计算如下三个偏导数。

1、对上一层的输出（即当前层的输入）求导

若我们已知转递到该层的梯度，则我们可以通过链式法则求得loss对x的偏导数。
首先需要求得该层的输出ai对输入xj的偏导数

再通过链式法则求得loss对x的偏导数：

上边求导的结果也印证了我前边那句话：在反向传播过程中，若第x层的a节点通过权值W对x+1层的b节点有贡献，则在反向传播过程中，梯度通过权值W从b节点传播回a节点。

若我们的一次训练16张图片，即batch_size=16，则我们可以把计算转化为如下矩阵形式。

2、对权重系数W求导

我们前向计算的公式如下图，

由图可知，所以：。

当batch_size=16时，写成矩阵形式：

3、对偏置系数b求导

由上面前向推导公式可知，

即loss对偏置系数的偏导数等于对上一层输出的偏导数。

当batch_size=16时，将不同batch对应的相同b的偏导相加即可，写成矩阵形式即为乘以一个全1的矩阵：

Caffe中全连接层的实现

在caffe中，关于全连接层的配置信息如下：

[cpp]

layer {
name: "ip1"
type: "InnerProduct"
bottom: "pool2"
top: "ip1"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
inner_product_param {
num_output: 500
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}

该层类型为InnerProduct内积，也就是我们常说的全连接层，前一层（底层）为pool2一个池化层，顶层，即该层的输出ip1，即为一个全连接层。关于学习率的参数lr_mult我们后面在权值更新章节再看。其他的参数我们在之前的卷积层都遇到过，含义和卷积层也一样，这里就不再多说。

Caffe中全连接层相关的GPU文件有1个，为\src\caffe\layersi\nner_product_layer.cu 。

前向计算

前向过程代码如下，具体解释见注释部分：

[cpp]

template <typename Dtype>
void InnerProductLayer<Dtype>::Forward_gpu(const vector<Blob<Dtype>*>& bottom,
const vector<Blob<Dtype>*>& top) {
const Dtype* bottom_data = bottom[0]->gpu_data();
Dtype* top_data = top[0]->mutable_gpu_data();
const Dtype* weight = this->blobs_[0]->gpu_data();
//M_为batch_size
if (M_ == 1) {
//top_data(M*N) = bottom_data(M*K) * weight(K*N)
//这里的计算实际调用了cublas中的矩阵计算函数，我们之前也有讲解，有兴趣可以深入看一下
caffe_gpu_gemv<Dtype>(CblasNoTrans, N_, K_, (Dtype)1.,
weight, bottom_data, (Dtype)0., top_data);
//若有偏置，加上偏置
if (bias_term_)
caffe_gpu_axpy<Dtype>(N_, bias_multiplier_.cpu_data()[0],
this->blobs_[1]->gpu_data(), top_data);
} else {
//同上面
caffe_gpu_gemm<Dtype>(CblasNoTrans,
transpose_ ? CblasNoTrans : CblasTrans,
M_, N_, K_, (Dtype)1.,
bottom_data, weight, (Dtype)0., top_data);
if (bias_term_)
caffe_gpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, M_, N_, 1, (Dtype)1.,
bias_multiplier_.gpu_data(),
this->blobs_[1]->gpu_data(), (Dtype)1., top_data);
}
}

反向传播

代码及注释如下

[cpp]

template <typename Dtype>
void InnerProductLayer<Dtype>::Backward_gpu(const vector<Blob<Dtype>*>& top,
const vector<bool>& propagate_down,
const vector<Blob<Dtype>*>& bottom) {
if (this->param_propagate_down_[0]) {
const Dtype* top_diff = top[0]->gpu_diff();
const Dtype* bottom_data = bottom[0]->gpu_data();
// Gradient with respect to weight
//对权重求导weight_diff = top_diff * bottom_data
if (transpose_) {
caffe_gpu_gemm<Dtype>(CblasTrans, CblasNoTrans,
K_, N_, M_,
(Dtype)1., bottom_data, top_diff,
(Dtype)1., this->blobs_[0]->mutable_gpu_diff());
} else {
caffe_gpu_gemm<Dtype>(CblasTrans, CblasNoTrans,
N_, K_, M_,
(Dtype)1., top_diff, bottom_data,
(Dtype)1., this->blobs_[0]->mutable_gpu_diff());
}
}
if (bias_term_ && this->param_propagate_down_[1]) {
const Dtype* top_diff = top[0]->gpu_diff();
// Gradient with respect to bias
//对偏置值b求导 bias_diff = bias * top_diff
//这个和我之前公式推导出来的不一样，不知道为什么，如果有谁知道请留言告诉我，谢谢
caffe_gpu_gemv<Dtype>(CblasTrans, M_, N_, (Dtype)1., top_diff,
bias_multiplier_.gpu_data(), (Dtype)1.,
this->blobs_[1]->mutable_gpu_diff());
}
if (propagate_down[0]) {
const Dtype* top_diff = top[0]->gpu_diff();
// Gradient with respect to bottom data
//对上一层的输出求导bottom_diff = top_diff * weight
if (transpose_) {
caffe_gpu_gemm<Dtype>(CblasNoTrans, CblasTrans,
M_, K_, N_,
(Dtype)1., top_diff, this->blobs_[0]->gpu_data(),
(Dtype)0., bottom[0]->mutable_gpu_diff());
} else {
caffe_gpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans,
M_, K_, N_,
(Dtype)1., top_diff, this->blobs_[0]->gpu_data(),
(Dtype)0., bottom[0]->mutable_gpu_diff());
}
}
}

<h2>全连接层的推导</h2>全连接层的每一个结点都与上一层的所有结点相连，用来把前边提取到的特征综合起来。由于其全相连的特性，一般全连接层的参数也是最多的。<h3>全连接层的前向计算</h3>下图中连线最密集的2个地方就是全连接层，这很明显的可以看出全连接层的参数的确很多。在前向计算过程，也就是一个线性的加权求和的过程，全连接层的每一个输出都可以看成前一层的每一个结点乘以一个权重系数W，最后加上一个偏置值b得到，即 。如下图中第一个全连接层，输入有50*4*4个神经元结点，输出有500个结点，则一共需要50*4*4*500=400000个权值参数W和500个偏置参数b。<img src="https://file.jobplus.com.cn/2018/05/09/d57ebbfd75d24f2bb7044e98e4d3e69a.png" _src="https://file.jobplus.com.cn/2018/05/09/d57ebbfd75d24f2bb7044e98e4d3e69a.png"/>下面用一个简单的网络具体介绍一下推导过程 <img src="https://file.jobplus.com.cn/2018/05/09/d33b9b0277824a889286eb5b2577fd8c.png" _src="https://file.jobplus.com.cn/2018/05/09/d33b9b0277824a889286eb5b2577fd8c.png"/> 其中，x1、x2、x3为全连接层的输入，a1、a2、a3为输出，根据我前边在笔记1中的推导，有<img src="https://file.jobplus.com.cn/2018/05/09/6dc247bff02f466394ca486aef83af05.png" _src="https://file.jobplus.com.cn/2018/05/09/6dc247bff02f466394ca486aef83af05.png"/> 可以写成如下矩阵形式：<img src="https://file.jobplus.com.cn/2018/05/09/beece3589cf64415bec0d9c5345b1ef7.png" _src="https://file.jobplus.com.cn/2018/05/09/beece3589cf64415bec0d9c5345b1ef7.png"/> <h3>全连接层的反向传播</h3> 以我们的第一个全连接层为例，该层有50*4*4=800个输入结点和500个输出结点。<img src="https://file.jobplus.com.cn/2018/05/09/a28512e73cbb4eefb002c303b9b2d0ff.png" _src="https://file.jobplus.com.cn/2018/05/09/a28512e73cbb4eefb002c303b9b2d0ff.png"/> 由于需要对W和b进行更新，还要向前传递梯度，所以我们需要计算如下三个偏导数。 1、对上一层的输出（即当前层的输入）求导若我们已知转递到该层的梯度<img src="https://file.jobplus.com.cn/2018/05/09/07e709a8269a4a0da2a1d88dc07dc20d.png" _src="https://file.jobplus.com.cn/2018/05/09/07e709a8269a4a0da2a1d88dc07dc20d.png"/>，则我们可以通过链式法则求得loss对x的偏导数。 首先需要求得该层的输出ai对输入xj的偏导数<img src="https://file.jobplus.com.cn/2018/05/09/e7dd90a892cb4952b687e9c1091a8ff9.png" _src="https://file.jobplus.com.cn/2018/05/09/e7dd90a892cb4952b687e9c1091a8ff9.png"/> 再通过链式法则求得loss对x的偏导数：<img src="https://file.jobplus.com.cn/2018/05/09/8d2ff42cff984d3bb5b8197c0299d7b2.png" _src="https://file.jobplus.com.cn/2018/05/09/8d2ff42cff984d3bb5b8197c0299d7b2.png"/>上边求导的结果也印证了我前边那句话：在反向传播过程中，若第x层的a节点通过权值W对x+1层的b节点有贡献，则在反向传播过程中，梯度通过权值W从b节点传播回a节点。若我们的一次训练16张图片，即batch_size=16，则我们可以把计算转化为如下矩阵形式。<img src="https://file.jobplus.com.cn/2018/05/09/30527142a1f349f588004e8883dd815e.png" _src="https://file.jobplus.com.cn/2018/05/09/30527142a1f349f588004e8883dd815e.png"/> 2、对权重系数W求导 我们前向计算的公式如下图，<img src="https://file.jobplus.com.cn/2018/05/09/2bf1643aea1148a89b404abf87ee612c.png" _src="https://file.jobplus.com.cn/2018/05/09/2bf1643aea1148a89b404abf87ee612c.png"/>由图可知<img src="https://file.jobplus.com.cn/2018/05/09/2a5a790645994f45848c9a9f8935af09.png" _src="https://file.jobplus.com.cn/2018/05/09/2a5a790645994f45848c9a9f8935af09.png"/>，所以：<img src="https://file.jobplus.com.cn/2018/05/09/8ebff687e71c4c01ae2ccb7b589e227c.png" _src="https://file.jobplus.com.cn/2018/05/09/8ebff687e71c4c01ae2ccb7b589e227c.png"/>。 当batch_size=16时，写成矩阵形式：<img src="https://file.jobplus.com.cn/2018/05/09/adbeee5f66a445f1b63e5a34303bd58d.png" _src="https://file.jobplus.com.cn/2018/05/09/adbeee5f66a445f1b63e5a34303bd58d.png"/> 3、对偏置系数b求导由上面前向推导公式可知<img src="https://file.jobplus.com.cn/2018/05/09/09bd6673fb9e4022b63979265b70cb61.png" _src="https://file.jobplus.com.cn/2018/05/09/09bd6673fb9e4022b63979265b70cb61.png"/>， 即loss对偏置系数的偏导数等于对上一层输出的偏导数。当batch_size=16时，将不同batch对应的相同b的偏导相加即可，写成矩阵形式即为乘以一个全1的矩阵：<img src="https://file.jobplus.com.cn/2018/05/09/fe9cd56a6d994e7892ed948532124634.png" _src="https://file.jobplus.com.cn/2018/05/09/fe9cd56a6d994e7892ed948532124634.png"/> <h2>Caffe中全连接层的实现</h2> 在caffe中，关于全连接层的配置信息如下：[cpp]  <ol><li>layer {  </li><li>  name: "ip1"  </li><li>  type: "InnerProduct"  </li><li>  bottom: "pool2"  </li><li>  top: "ip1"  </li><li>  param {  </li><li>    lr_mult: 1  </li><li>  }  </li><li>  param {  </li><li>    lr_mult: 2  </li><li>  }  </li><li>  inner_product_param {  </li><li>    num_output: 500  </li><li>    weight_filler {  </li><li>      type: "xavier"  </li><li>    }  </li><li>    bias_filler {  </li><li>      type: "constant"  </li><li>    }  </li><li>  }  </li><li>}  </li></ol>该层类型为InnerProduct内积，也就是我们常说的全连接层，前一层（底层）为pool2一个池化层，顶层，即该层的输出ip1，即为一个全连接层。关于学习率的参数lr_mult我们后面在权值更新章节再看。其他的参数我们在之前的卷积层都遇到过，含义和卷积层也一样，这里就不再多说。 Caffe中全连接层相关的GPU文件有1个，为\src\caffe\layersi\nner_product_layer.cu 。<h3>前向计算</h3>前向过程代码如下，具体解释见注释部分：[cpp] <ol><li>template <typename Dtype>  </li><li>void InnerProductLayer<Dtype>::Forward_gpu(const vector<Blob<Dtype>*>& bottom,  </li><li>    const vector<Blob<Dtype>*>& top) {  </li><li>  const Dtype* bottom_data = bottom[0]->gpu_data();  </li><li>  Dtype* top_data = top[0]->mutable_gpu_data();  </li><li>  const Dtype* weight = this->blobs_[0]->gpu_data();  </li><li>  //M_为batch_size  </li><li>  if (M_ == 1) {  </li><li>      //top_data(M*N) = bottom_data(M*K) * weight(K*N)  </li><li>      //这里的计算实际调用了cublas中的矩阵计算函数，我们之前也有讲解，有兴趣可以深入看一下  </li><li>    caffe_gpu_gemv<Dtype>(CblasNoTrans, N_, K_, (Dtype)1.,  </li><li>                         weight, bottom_data, (Dtype)0., top_data);  </li><li>    //若有偏置，加上偏置  </li><li>    if (bias_term_)  </li><li>      caffe_gpu_axpy<Dtype>(N_, bias_multiplier_.cpu_data()[0],  </li><li>                            this->blobs_[1]->gpu_data(), top_data);  </li><li>  } else {  </li><li>      //同上面  </li><li>    caffe_gpu_gemm<Dtype>(CblasNoTrans,  </li><li>                          transpose_ ? CblasNoTrans : CblasTrans,  </li><li>                          M_, N_, K_, (Dtype)1.,  </li><li>                          bottom_data, weight, (Dtype)0., top_data);  </li><li>    if (bias_term_)  </li><li>      caffe_gpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, M_, N_, 1, (Dtype)1.,  </li><li>                            bias_multiplier_.gpu_data(),  </li><li>                            this->blobs_[1]->gpu_data(), (Dtype)1., top_data);  </li><li>  }  </li><li>}  </li></ol> <h3>反向传播</h3>代码及注释如下[cpp] <ol><li>template <typename Dtype>  </li><li>void InnerProductLayer<Dtype>::Backward_gpu(const vector<Blob<Dtype>*>& top,  </li><li>    const vector<bool>& propagate_down,  </li><li>    const vector<Blob<Dtype>*>& bottom) {  </li><li>  if (this->param_propagate_down_[0]) {  </li><li>    const Dtype* top_diff = top[0]->gpu_diff();  </li><li>    const Dtype* bottom_data = bottom[0]->gpu_data();  </li><li>    // Gradient with respect to weight  </li><li>    //对权重求导weight_diff = top_diff * bottom_data  </li><li>    if (transpose_) {  </li><li>      caffe_gpu_gemm<Dtype>(CblasTrans, CblasNoTrans,  </li><li>          K_, N_, M_,  </li><li>          (Dtype)1., bottom_data, top_diff,  </li><li>          (Dtype)1., this->blobs_[0]->mutable_gpu_diff());  </li><li>    } else {  </li><li>      caffe_gpu_gemm<Dtype>(CblasTrans, CblasNoTrans,  </li><li>          N_, K_, M_,  </li><li>          (Dtype)1., top_diff, bottom_data,  </li><li>          (Dtype)1., this->blobs_[0]->mutable_gpu_diff());  </li><li>    }  </li><li>  }  </li><li>  if (bias_term_ && this->param_propagate_down_[1]) {  </li><li>    const Dtype* top_diff = top[0]->gpu_diff();  </li><li>    // Gradient with respect to bias  </li><li>    //对偏置值b求导 bias_diff = bias * top_diff  </li><li>    //这个和我之前公式推导出来的不一样，不知道为什么，如果有谁知道请留言告诉我，谢谢  </li><li>    caffe_gpu_gemv<Dtype>(CblasTrans, M_, N_, (Dtype)1., top_diff,  </li><li>        bias_multiplier_.gpu_data(), (Dtype)1.,  </li><li>        this->blobs_[1]->mutable_gpu_diff());  </li><li>  }  </li><li>  if (propagate_down[0]) {  </li><li>    const Dtype* top_diff = top[0]->gpu_diff();  </li><li>    // Gradient with respect to bottom data  </li><li>    //对上一层的输出求导bottom_diff = top_diff * weight  </li><li>    if (transpose_) {  </li><li>      caffe_gpu_gemm<Dtype>(CblasNoTrans, CblasTrans,  </li><li>          M_, K_, N_,  </li><li>          (Dtype)1., top_diff, this->blobs_[0]->gpu_data(),  </li><li>          (Dtype)0., bottom[0]->mutable_gpu_diff());  </li><li>    } else {  </li><li>      caffe_gpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans,  </li><li>          M_, K_, N_,  </li><li>         (Dtype)1., top_diff, this->blobs_[0]->gpu_data(),  </li><li>         (Dtype)0., bottom[0]->mutable_gpu_diff());  </li><li>    }  </li><li>  }  </li><li>}  </li></ol>