
速度慢的问题 (Slow inference speed) #27

Open
fujingling opened this issue May 11, 2020 · 18 comments
@fujingling

Hi, I've found that this model takes about 15 ms per image on a P40, and on CPU it is unbearably slow, 2-3 s per image. Was it fast in your tests?

@fujingling
Author

To be clear, I'm talking about PFPLD.

@hanson-young
Owner

hanson-young commented May 11, 2020

It's a MobileNet-style structure with a 112x112 input, so 2-3 s seems excessive. On a 2080 Ti it takes about 10 ms for me. The model itself can be optimized further: swap in a different backbone and retrain. I tried mobilenet0.25 and the accuracy barely dropped.

@fujingling
Author

Have you tested it on CPU?

@hanson-young
Owner

hanson-young commented May 11, 2020 via email

@budaLi

budaLi commented May 11, 2020

How did you change it to mobilenet0.25? I have no experience with this.

@hanson-young
Owner

Either cut the channel counts down directly, or grab any lightweight backbone off the shelf; it's not hard to experiment with in PyTorch.

@xuguozhi

So just multiply the input/output channel counts by 0.25?

@budaLi

budaLi commented May 12, 2020

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
######################################################
# pfld.py -
# written by zhaozhichao and Hanson
######################################################

import torch
import torch.nn as nn


def conv_bn(inp, oup, kernel, stride, padding=1):
    return nn.Sequential(
        nn.Conv2d(inp, oup, kernel, stride, padding, bias=False),
        nn.BatchNorm2d(oup),
        nn.ReLU(inplace=True))


def conv_1x1_bn(inp, oup):
    return nn.Sequential(
        nn.Conv2d(inp, oup, 1, 1, 0, bias=False),
        nn.BatchNorm2d(oup),
        nn.ReLU(inplace=True))


class InvertedResidual(nn.Module):
    def __init__(self, inp, oup, stride, use_res_connect, expand_ratio=6):
        super(InvertedResidual, self).__init__()
        self.stride = stride
        assert stride in [1, 2]

        self.use_res_connect = use_res_connect
        self.conv = nn.Sequential(
            # pointwise expansion
            nn.Conv2d(inp, inp * expand_ratio, 1, 1, 0, bias=False),
            nn.BatchNorm2d(inp * expand_ratio),
            nn.ReLU(inplace=True),
            # depthwise 3x3
            nn.Conv2d(
                inp * expand_ratio,
                inp * expand_ratio,
                3,
                stride,
                1,
                groups=inp * expand_ratio,
                bias=False),
            nn.BatchNorm2d(inp * expand_ratio),
            nn.ReLU(inplace=True),
            # pointwise projection (linear)
            nn.Conv2d(inp * expand_ratio, oup, 1, 1, 0, bias=False),
            nn.BatchNorm2d(oup),
        )

    def forward(self, x):
        if self.use_res_connect:
            return x + self.conv(x)
        else:
            return self.conv(x)


class PFLDInference(nn.Module):
    def __init__(self):
        super(PFLDInference, self).__init__()
        newchannels = int(64 * 0.25)    # 16: original 64 channels scaled by 0.25
        newchannels2 = newchannels * 2  # 32

        self.conv1 = nn.Conv2d(
            3, newchannels, kernel_size=3, stride=2, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(newchannels)
        self.relu = nn.ReLU(inplace=True)

        self.conv2 = nn.Conv2d(
            newchannels, newchannels, kernel_size=3, stride=1, padding=1,
            bias=False)
        self.bn2 = nn.BatchNorm2d(newchannels)

        self.conv3_1 = InvertedResidual(newchannels, newchannels, 2, False, 2)

        self.block3_2 = InvertedResidual(newchannels, newchannels, 1, True, 2)
        self.block3_3 = InvertedResidual(newchannels, newchannels, 1, True, 2)
        self.block3_4 = InvertedResidual(newchannels, newchannels, 1, True, 2)
        self.block3_5 = InvertedResidual(newchannels, newchannels, 1, True, 2)

        self.conv4_1 = InvertedResidual(newchannels, newchannels2, 2, False, 2)

        self.conv5_1 = InvertedResidual(newchannels2, newchannels2, 1, False, 4)
        self.block5_2 = InvertedResidual(newchannels2, newchannels2, 1, True, 4)
        self.block5_3 = InvertedResidual(newchannels2, newchannels2, 1, True, 4)
        self.block5_4 = InvertedResidual(newchannels2, newchannels2, 1, True, 4)
        self.block5_5 = InvertedResidual(newchannels2, newchannels2, 1, True, 4)
        self.block5_6 = InvertedResidual(newchannels2, newchannels2, 1, True, 4)

        self.conv6_1 = InvertedResidual(newchannels2, 16, 1, False, 2)  # [16, 14, 14]

        self.conv7 = conv_bn(16, 32, 3, 2)                 # [32, 7, 7]
        self.conv8 = nn.Conv2d(32, newchannels2, 7, 1, 0)  # [newchannels2, 1, 1]
        self.bn8 = nn.BatchNorm2d(newchannels2)

        self.avg_pool1 = nn.AvgPool2d(14)
        self.avg_pool2 = nn.AvgPool2d(7)
        self.fc = nn.Linear(16 + 32 + newchannels2, 196)
        self.fc_aux = nn.Linear(16 + 32 + newchannels2, 3)

        self.conv1_aux = conv_bn(newchannels, newchannels2, 3, 2)
        self.conv2_aux = conv_bn(newchannels2, newchannels2, 3, 1)
        self.conv3_aux = conv_bn(newchannels2, 32, 3, 2)
        self.conv4_aux = conv_bn(32, newchannels2, 7, 1)
        self.max_pool1_aux = nn.MaxPool2d(3)
        self.fc1_aux = nn.Linear(newchannels2, 32)
        self.fc2_aux = nn.Linear(32 + 16 + 32 + newchannels2, 3)

    def forward(self, x):  # x: [3, 112, 112]
        x = self.relu(self.bn1(self.conv1(x)))  # [newchannels, 56, 56]
        x = self.relu(self.bn2(self.conv2(x)))  # [newchannels, 56, 56]
        x = self.conv3_1(x)
        x = self.block3_2(x)
        x = self.block3_3(x)
        x = self.block3_4(x)
        out1 = self.block3_5(x)

        x = self.conv4_1(out1)
        x = self.conv5_1(x)
        x = self.block5_2(x)
        x = self.block5_3(x)
        x = self.block5_4(x)
        x = self.block5_5(x)
        x = self.block5_6(x)
        x = self.conv6_1(x)
        x1 = self.avg_pool1(x)
        x1 = x1.view(x1.size(0), -1)

        x = self.conv7(x)
        x2 = self.avg_pool2(x)
        x2 = x2.view(x2.size(0), -1)

        x3 = self.relu(self.conv8(x))
        x3 = x3.view(x3.size(0), -1)

        multi_scale = torch.cat([x1, x2, x3], 1)
        landmarks = self.fc(multi_scale)

        aux = self.conv1_aux(out1)
        aux = self.conv2_aux(aux)
        aux = self.conv3_aux(aux)
        aux = self.conv4_aux(aux)
        aux = self.max_pool1_aux(aux)
        aux = aux.view(aux.size(0), -1)
        aux = self.fc1_aux(aux)
        aux = torch.cat([aux, multi_scale], 1)
        pose = self.fc2_aux(aux)

        return pose, landmarks


if __name__ == '__main__':
    input = torch.randn(1, 3, 112, 112)
    plfd_backbone = PFLDInference()
    angle, landmarks = plfd_backbone(input)
    print(plfd_backbone)
    print("angle.shape:{0:}, landmarks.shape: {1:}".format(
        angle.shape, landmarks.shape))

Is this right, channel*0.25? But it errors out as soon as I start training.

@xuguozhi

Run python3 pfld.py and see where the error is raised.

@budaLi

budaLi commented May 12, 2020

Running it directly works fine; I get angle.shape:torch.Size([1, 3]), landmarks.shape: torch.Size([1, 196])

@KaiOtter

I've also found this model extremely slow on CPU, and I've tracked down the cause: it's a weight-precision problem. Many of the saved parameters fall below float precision and are effectively doubles. Python's underlying numeric type is only double, not float, but the two run at completely different speeds inside a deep-learning framework.
I ported the model to MXNet, and after noticing this I tried several things, none of which helped: after 10-odd epochs the weights shrink rapidly, by roughly 10e-10 or more per epoch, and I couldn't prevent it. At first I assumed it was my reimplementation, but yesterday I inspected the author's released model and it shows the same thing: sampling the conv1 and bn1 parameters, they are full of values on the order of 10e-40.

This is why the model runs fine on GPU but has problems on CPU.

As for what exactly makes the weights end up in this sorry state, I have no idea yet; I'd appreciate any suggestions from the author.
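For reference, the slowdown caused by subnormal ("denormal") float values can be reproduced in isolation. Below is a minimal sketch, not from this thread; whether the gap shows up depends on the CPU and on whether the convolution backend flushes denormals:

import time
import torch

torch.set_num_threads(1)  # single thread for a stable comparison

conv = torch.nn.Conv2d(64, 64, 3, padding=1)
x = torch.randn(1, 64, 56, 56)

def bench(label):
    with torch.no_grad():
        for _ in range(5):        # warm-up
            conv(x)
        t0 = time.perf_counter()
        for _ in range(50):
            conv(x)
        dt = (time.perf_counter() - t0) / 50
    print(f"{label}: {dt * 1e3:.2f} ms")

bench("normal weights")
with torch.no_grad():
    conv.weight.mul_(1e-40)       # push the weights into the subnormal range
bench("subnormal weights")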

@hanson-young
Owner

Thanks, I hadn't noticed this. But values on the order of 10e-40 are essentially zero, so in principle float precision should be enough. Also, on my PC ncnn with 1 thread takes roughly 200 ms, and ncnn is float32, so in the end it really is the network structure that needs further optimization ^_^

@KaiOtter

I haven't experimented with PyTorch specifically, but in MXNet forcing these tensors to float32 does not truncate the out-of-range values. That is why a randomly initialized model runs much, much faster than the trained one. Concretely: in your checkpoint_robust.pth, Conv1[0, :, :, :] consists entirely of values below 10e-40, and 43 of the 64 entries in bn1's running_var are below 10e-40. Given this behavior, could it be that the channel count is too large, so that once the model has converged far enough it can express the features with only a few channels and effectively abandons the rest?

bn1.weight torch.Size([64])
tensor([ 1.1142e-14, 1.9840e+00, 1.8077e-06, 2.6957e-04, -1.5956e-11,
9.9732e-01, -4.8662e-14, -6.6997e-03, -6.3511e-41, 1.8081e+00,
7.7917e-03, -8.9655e-22, -9.6205e-41, -2.3129e-40, -2.3074e-41,
2.1375e+00, 1.8774e+00, 2.4078e+00, 1.5833e+00, 4.7437e-41,
5.5131e-03, 3.5874e-03, -3.3295e-42, 1.3943e+00, 1.8719e-03,
2.2324e-06, -3.4177e-02, 2.4324e+00, 2.7387e+00, 1.3493e-41,
4.6918e-19, 2.3578e-03, 1.9607e+00, 2.2181e+00, -6.0134e-41,
1.2981e+00, -6.9195e-41, 4.2235e-04, 2.0681e+00, 7.4136e-41,
-3.3194e-41, -5.3852e-04, 1.3303e+00, -5.4791e-42, 3.9523e-10,
8.0862e-10, -2.6836e-41, -7.2862e-36, 8.1163e-42, -3.8488e-41,
-5.6975e-41, 4.5821e-14, 1.0590e-13, 2.2362e-41, -2.1330e-40,
-1.7882e-41, -5.4627e-07, 3.7301e-02, 2.2653e-41, -6.8765e-41,
-6.7247e-41, 8.8496e-04, 2.4797e-10, 1.1136e-27], device='cuda:0')

bn1.bias torch.Size([64])
tensor([-7.3735e-23, 1.0018e+00, -1.5060e-40, -8.2336e-05, -1.7604e-36,
5.0374e-01, -2.2037e-36, 3.1105e-01, -5.6910e-24, -4.0917e-01,
-4.5703e-02, -4.7472e-40, -1.1765e-18, -2.0253e-40, 3.5438e-29,
-2.7786e+00, 5.0326e-01, 1.3121e+00, 2.8182e+00, -3.4718e-04,
-6.4969e-18, -3.0385e-03, -7.1485e-29, 5.5154e-01, -3.2004e-03,
-1.1490e-09, -5.0653e-03, -6.1328e-01, -2.6092e+00, -9.7728e-13,
-5.3176e-38, -4.9417e-03, 2.5313e+00, 1.5049e+00, -2.3744e-31,
1.9715e+00, -7.9868e-21, -1.8723e-02, 1.0208e+00, -3.6904e-32,
-3.5802e-37, -2.0418e-03, 2.8885e+00, -1.0221e-37, -1.4217e-07,
-7.6839e-39, -1.3619e-38, -2.4728e-31, -3.7616e-17, -9.2246e-34,
-2.5572e-18, -2.3757e-20, -9.6840e-26, -3.7811e-41, -4.7578e-38,
-1.5225e-28, -1.4360e-26, -4.0357e-02, -4.3764e-34, -2.4557e-20,
-5.5333e-30, -4.1141e-01, -3.5403e-13, -1.3820e-36], device='cuda:0')

bn1.running_mean torch.Size([64])
tensor([-5.1580e-41, -2.2162e-01, -3.8536e-41, 1.0664e-42, 1.5112e-40,
-1.0727e+00, -1.6114e-41, -1.9895e-01, 5.2016e-41, -1.0795e+00,
1.8222e-01, -7.1178e-41, 7.8977e-42, 7.0868e-41, 5.6472e-42,
-3.1051e+00, -2.3679e-01, -2.0414e-01, -8.4614e-02, 8.4601e-41,
4.4686e-41, 2.0061e-41, 8.7164e-41, -1.6375e-01, 6.5995e-38,
-2.3002e-41, -6.7373e-26, -5.3808e-01, -4.0784e+00, -9.7357e-41,
-2.3948e-41, -8.4134e-41, 3.5338e-01, -3.4059e-02, -4.3182e-41,
5.4297e-02, -9.8287e-42, -4.9485e-20, -1.7786e-01, -3.1166e-41,
6.7524e-41, -1.2594e-11, 5.1578e+00, 1.2567e-40, -2.0809e-41,
-4.5475e-41, 1.4825e-40, 7.3806e-41, -5.3417e-41, 6.8528e-41,
9.7266e-41, -4.3735e-41, -5.3389e-43, -1.9674e-41, -8.1706e-41,
3.6630e-41, -3.2006e-41, -5.7090e-06, 6.3986e-41, 8.4730e-41,
-7.3008e-41, -9.1969e-03, 4.2166e-41, 1.1506e-40], device='cuda:0')

bn1.running_var torch.Size([64])
tensor([5.6052e-45, 2.3596e+00, 5.6052e-45, 5.6052e-45, 5.6052e-45, 2.2151e+00,
5.6052e-45, 1.4921e-02, 5.6052e-45, 7.1242e-01, 1.1669e-02, 5.6052e-45,
5.6052e-45, 5.6052e-45, 5.6052e-45, 3.2612e+00, 2.5997e+00, 2.2400e+00,
3.5325e-01, 5.6052e-45, 5.6052e-45, 5.6052e-45, 5.6052e-45, 3.2991e-01,
5.6052e-45, 5.6052e-45, 5.6052e-45, 4.0805e+00, 5.2654e+00, 5.6052e-45,
5.6052e-45, 5.6052e-45, 7.9652e-01, 2.1556e+00, 5.6052e-45, 3.4721e-01,
5.6052e-45, 1.0450e-39, 1.3032e+00, 5.6052e-45, 5.6052e-45, 7.0362e-23,
8.3884e+00, 5.6052e-45, 5.6052e-45, 5.6052e-45, 5.6052e-45, 5.6052e-45,
5.6052e-45, 5.6052e-45, 5.6052e-45, 5.6052e-45, 5.6052e-45, 5.6052e-45,
5.6052e-45, 5.6052e-45, 5.6052e-45, 1.2762e-11, 5.6052e-45, 5.6052e-45,
5.6052e-45, 2.9369e-05, 5.6052e-45, 5.6052e-45], device='cuda:0')
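An audit like the one above can be scripted. A minimal sketch, assuming checkpoint_robust.pth deserializes to a flat state_dict (if the checkpoint wraps the weights in an outer dict, unwrap it first):

import torch

FLT_MIN = 1.1754944e-38  # smallest normal float32 magnitude

state = torch.load("checkpoint_robust.pth", map_location="cpu")
for name, t in state.items():
    if not torch.is_tensor(t) or not t.is_floating_point():
        continue
    sub = ((t != 0) & (t.abs() < FLT_MIN)).sum().item()
    if sub:
        print(f"{name}: {sub}/{t.numel()} subnormal entries")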

@hanson-young
Owner

hanson-young commented May 17, 2020 via email

Yes. From a channel-pruning point of view, you really don't need a network this large for this task. See Learning Efficient Convolutional Networks through Network Slimming.

@fujingling
Author

I've already pruned the model down to something very small, 650k, and it's still fairly slow.
@fujingling fujingling reopened this May 18, 2020
@KaiOtter

Print the parameters out and check. If it is the precision problem, set all the abnormally small values to zero and run it again; it has no effect on accuracy at all. I tried it: clipping the values in the original model makes the speed take off immediately.
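A minimal sketch of that clipping step, assuming a flat state_dict as before; the 1e-37 threshold and the output filename are illustrative:

import torch

THRESH = 1e-37  # treat anything below the normal float32 range as zero

state = torch.load("checkpoint_robust.pth", map_location="cpu")
for name, t in state.items():
    if torch.is_tensor(t) and t.is_floating_point():
        t[t.abs() < THRESH] = 0.0  # zero out the subnormal tail in place
torch.save(state, "checkpoint_clipped.pth")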

@epoc88

epoc88 commented Jan 9, 2021

The speed on CPU:
inference_cost_time: 29.044181

After clipping the values, the speed shows no major improvement.

@kaikaizhu

@KaiOtter I set every value below 1e-36 to 0e-00, and the CPU speed barely changed. How exactly did you do it? Could you send me your modified weight file? [email protected]
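One caveat that may explain these mixed results: zeroing the stored weights does not stop subnormals from re-appearing in the intermediate activations at inference time. On x86 CPUs, PyTorch can instead flush denormals to zero in hardware via torch.set_flush_denormal. A minimal sketch, assuming a PFLDInference definition that matches the checkpoint being loaded:

import torch
from pfld import PFLDInference  # the model definition posted above

# Enable flush-to-zero / denormals-are-zero if the CPU supports it (SSE3+).
# Returns False when unsupported; denormal arithmetic then stays slow.
print("FTZ enabled:", torch.set_flush_denormal(True))

model = PFLDInference()
model.load_state_dict(torch.load("checkpoint_robust.pth", map_location="cpu"))
model.eval()
with torch.no_grad():
    pose, landmarks = model(torch.randn(1, 3, 112, 112))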
