
速度慢的问题 (Slow inference speed) #27

Open
fujingling opened this issue May 11, 2020 · 18 comments
@fujingling

Hi, I've found that this model takes about 15 ms per image on a P40, and on CPU it is unbearably slow, 2-3 s per image. Was it fast in your tests?

@fujingling
Author

To be clear, I'm talking about PFPLD.

@hanson-young
Owner

hanson-young commented May 11, 2020

It's a MobileNet-style structure with a 112x112 input, so 2-3 s seems excessive. On a 2080 Ti it takes about 10 ms for me. The model itself can be optimized further: swap in a different backbone and retrain. I tried mobilenet0.25 and the accuracy barely dropped.

@fujingling
Author

Have you tested it on CPU?

@hanson-young
Owner

hanson-young commented May 11, 2020 via email

@budaLi

budaLi commented May 11, 2020

How did you change it to mobilenet0.25? I have no experience with this.

@hanson-young
Owner

Either cut the channel counts down directly, or grab any lightweight backbone off the shelf; it's not hard to experiment with in PyTorch.

@xuguozhi

So just multiply the input/output channel counts by 0.25?

@budaLi

budaLi commented May 12, 2020

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
######################################################
# pfld.py -
# written by zhaozhichao and Hanson
######################################################

import torch
import torch.nn as nn


def conv_bn(inp, oup, kernel, stride, padding=1):
    return nn.Sequential(
        nn.Conv2d(inp, oup, kernel, stride, padding, bias=False),
        nn.BatchNorm2d(oup),
        nn.ReLU(inplace=True))


def conv_1x1_bn(inp, oup):
    return nn.Sequential(
        nn.Conv2d(inp, oup, 1, 1, 0, bias=False),
        nn.BatchNorm2d(oup),
        nn.ReLU(inplace=True))


class InvertedResidual(nn.Module):
    def __init__(self, inp, oup, stride, use_res_connect, expand_ratio=6):
        super(InvertedResidual, self).__init__()
        self.stride = stride
        assert stride in [1, 2]

        self.use_res_connect = use_res_connect
        self.conv = nn.Sequential(
            # pointwise expansion
            nn.Conv2d(inp, inp * expand_ratio, 1, 1, 0, bias=False),
            nn.BatchNorm2d(inp * expand_ratio),
            nn.ReLU(inplace=True),
            # depthwise 3x3
            nn.Conv2d(
                inp * expand_ratio,
                inp * expand_ratio,
                3,
                stride,
                1,
                groups=inp * expand_ratio,
                bias=False),
            nn.BatchNorm2d(inp * expand_ratio),
            nn.ReLU(inplace=True),
            # pointwise projection (linear)
            nn.Conv2d(inp * expand_ratio, oup, 1, 1, 0, bias=False),
            nn.BatchNorm2d(oup),
        )

    def forward(self, x):
        if self.use_res_connect:
            return x + self.conv(x)
        else:
            return self.conv(x)


class PFLDInference(nn.Module):
    def __init__(self):
        super(PFLDInference, self).__init__()
        newchannels = int(64 * 0.25)    # 16: original 64 channels scaled by 0.25
        newchannels2 = newchannels * 2  # 32

        self.conv1 = nn.Conv2d(
            3, newchannels, kernel_size=3, stride=2, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(newchannels)
        self.relu = nn.ReLU(inplace=True)

        self.conv2 = nn.Conv2d(
            newchannels, newchannels, kernel_size=3, stride=1, padding=1,
            bias=False)
        self.bn2 = nn.BatchNorm2d(newchannels)

        self.conv3_1 = InvertedResidual(newchannels, newchannels, 2, False, 2)

        self.block3_2 = InvertedResidual(newchannels, newchannels, 1, True, 2)
        self.block3_3 = InvertedResidual(newchannels, newchannels, 1, True, 2)
        self.block3_4 = InvertedResidual(newchannels, newchannels, 1, True, 2)
        self.block3_5 = InvertedResidual(newchannels, newchannels, 1, True, 2)

        self.conv4_1 = InvertedResidual(newchannels, newchannels2, 2, False, 2)

        self.conv5_1 = InvertedResidual(newchannels2, newchannels2, 1, False, 4)
        self.block5_2 = InvertedResidual(newchannels2, newchannels2, 1, True, 4)
        self.block5_3 = InvertedResidual(newchannels2, newchannels2, 1, True, 4)
        self.block5_4 = InvertedResidual(newchannels2, newchannels2, 1, True, 4)
        self.block5_5 = InvertedResidual(newchannels2, newchannels2, 1, True, 4)
        self.block5_6 = InvertedResidual(newchannels2, newchannels2, 1, True, 4)

        self.conv6_1 = InvertedResidual(newchannels2, 16, 1, False, 2)  # [16, 14, 14]

        self.conv7 = conv_bn(16, 32, 3, 2)                 # [32, 7, 7]
        self.conv8 = nn.Conv2d(32, newchannels2, 7, 1, 0)  # [newchannels2, 1, 1]
        self.bn8 = nn.BatchNorm2d(newchannels2)

        self.avg_pool1 = nn.AvgPool2d(14)
        self.avg_pool2 = nn.AvgPool2d(7)
        self.fc = nn.Linear(16 + 32 + newchannels2, 196)
        self.fc_aux = nn.Linear(16 + 32 + newchannels2, 3)

        self.conv1_aux = conv_bn(newchannels, newchannels2, 3, 2)
        self.conv2_aux = conv_bn(newchannels2, newchannels2, 3, 1)
        self.conv3_aux = conv_bn(newchannels2, 32, 3, 2)
        self.conv4_aux = conv_bn(32, newchannels2, 7, 1)
        self.max_pool1_aux = nn.MaxPool2d(3)
        self.fc1_aux = nn.Linear(newchannels2, 32)
        self.fc2_aux = nn.Linear(32 + 16 + 32 + newchannels2, 3)

    def forward(self, x):  # x: [3, 112, 112]
        x = self.relu(self.bn1(self.conv1(x)))  # [newchannels, 56, 56]
        x = self.relu(self.bn2(self.conv2(x)))  # [newchannels, 56, 56]
        x = self.conv3_1(x)
        x = self.block3_2(x)
        x = self.block3_3(x)
        x = self.block3_4(x)
        out1 = self.block3_5(x)

        x = self.conv4_1(out1)
        x = self.conv5_1(x)
        x = self.block5_2(x)
        x = self.block5_3(x)
        x = self.block5_4(x)
        x = self.block5_5(x)
        x = self.block5_6(x)
        x = self.conv6_1(x)
        x1 = self.avg_pool1(x)
        x1 = x1.view(x1.size(0), -1)

        x = self.conv7(x)
        x2 = self.avg_pool2(x)
        x2 = x2.view(x2.size(0), -1)

        x3 = self.relu(self.conv8(x))
        x3 = x3.view(x3.size(0), -1)

        multi_scale = torch.cat([x1, x2, x3], 1)
        landmarks = self.fc(multi_scale)

        aux = self.conv1_aux(out1)
        aux = self.conv2_aux(aux)
        aux = self.conv3_aux(aux)
        aux = self.conv4_aux(aux)
        aux = self.max_pool1_aux(aux)
        aux = aux.view(aux.size(0), -1)
        aux = self.fc1_aux(aux)
        aux = torch.cat([aux, multi_scale], 1)
        pose = self.fc2_aux(aux)

        return pose, landmarks


if __name__ == '__main__':
    input = torch.randn(1, 3, 112, 112)
    plfd_backbone = PFLDInference()
    angle, landmarks = plfd_backbone(input)
    print(plfd_backbone)
    print("angle.shape:{0:}, landmarks.shape: {1:}".format(
        angle.shape, landmarks.shape))

Is this right, channel*0.25? But it errors out as soon as I start training.

@xuguozhi

Run python3 pfld.py and see where the error is raised.

@budaLi

budaLi commented May 12, 2020

Running it directly works fine; I get angle.shape:torch.Size([1, 3]), landmarks.shape: torch.Size([1, 196])

@KaiOtter

I've also found this model extremely slow on CPU, and I've tracked down the cause: it's a weight-precision problem. Many of the saved parameters fall below float precision and are effectively doubles. Python's underlying numeric type is only double, not float, but the two run at completely different speeds inside a deep-learning framework.
I ported the model to MXNet, and after noticing this I tried several things, none of which helped: after 10-odd epochs the weights shrink rapidly, by roughly 10e-10 or more per epoch, and I couldn't prevent it. At first I assumed it was my reimplementation, but yesterday I inspected the author's released model and it shows the same thing: sampling the conv1 and bn1 parameters, they are full of values on the order of 10e-40.

This is why the model runs fine on GPU but has problems on CPU.

As for what exactly makes the weights end up in this sorry state, I have no idea yet; I'd appreciate any suggestions from the author.
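For reference, the slowdown caused by subnormal ("denormal") float values can be reproduced in isolation. Below is a minimal sketch, not from this thread; whether the gap shows up depends on the CPU and on whether the convolution backend flushes denormals:

import time
import torch

torch.set_num_threads(1)  # single thread for a stable comparison

conv = torch.nn.Conv2d(64, 64, 3, padding=1)
x = torch.randn(1, 64, 56, 56)

def bench(label):
    with torch.no_grad():
        for _ in range(5):        # warm-up
            conv(x)
        t0 = time.perf_counter()
        for _ in range(50):
            conv(x)
        dt = (time.perf_counter() - t0) / 50
    print(f"{label}: {dt * 1e3:.2f} ms")

bench("normal weights")
with torch.no_grad():
    conv.weight.mul_(1e-40)       # push the weights into the subnormal range
bench("subnormal weights")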

@hanson-young
Owner

Thanks, I hadn't noticed this. But values on the order of 10e-40 are essentially zero, so in principle float precision should be enough. Also, on my PC ncnn with 1 thread takes roughly 200 ms, and ncnn is float32, so in the end it really is the network structure that needs further optimization ^_^

@KaiOtter

I haven't experimented with PyTorch specifically, but in MXNet forcing these tensors to float32 does not truncate the out-of-range values. That is why a randomly initialized model runs much, much faster than the trained one. Concretely: in your checkpoint_robust.pth, Conv1[0, :, :, :] consists entirely of values below 10e-40, and 43 of the 64 entries in bn1's running_var are below 10e-40. Given this behavior, could it be that the channel count is too large, so that once the model has converged far enough it can express the features with only a few channels and effectively abandons the rest?

bn1.weight torch.Size([64])
tensor([ 1.1142e-14, 1.9840e+00, 1.8077e-06, 2.6957e-04, -1.5956e-11,
9.9732e-01, -4.8662e-14, -6.6997e-03, -6.3511e-41, 1.8081e+00,
7.7917e-03, -8.9655e-22, -9.6205e-41, -2.3129e-40, -2.3074e-41,
2.1375e+00, 1.8774e+00, 2.4078e+00, 1.5833e+00, 4.7437e-41,
5.5131e-03, 3.5874e-03, -3.3295e-42, 1.3943e+00, 1.8719e-03,
2.2324e-06, -3.4177e-02, 2.4324e+00, 2.7387e+00, 1.3493e-41,
4.6918e-19, 2.3578e-03, 1.9607e+00, 2.2181e+00, -6.0134e-41,
1.2981e+00, -6.9195e-41, 4.2235e-04, 2.0681e+00, 7.4136e-41,
-3.3194e-41, -5.3852e-04, 1.3303e+00, -5.4791e-42, 3.9523e-10,
8.0862e-10, -2.6836e-41, -7.2862e-36, 8.1163e-42, -3.8488e-41,
-5.6975e-41, 4.5821e-14, 1.0590e-13, 2.2362e-41, -2.1330e-40,
-1.7882e-41, -5.4627e-07, 3.7301e-02, 2.2653e-41, -6.8765e-41,
-6.7247e-41, 8.8496e-04, 2.4797e-10, 1.1136e-27], device='cuda:0')

bn1.bias torch.Size([64])
tensor([-7.3735e-23, 1.0018e+00, -1.5060e-40, -8.2336e-05, -1.7604e-36,
5.0374e-01, -2.2037e-36, 3.1105e-01, -5.6910e-24, -4.0917e-01,
-4.5703e-02, -4.7472e-40, -1.1765e-18, -2.0253e-40, 3.5438e-29,
-2.7786e+00, 5.0326e-01, 1.3121e+00, 2.8182e+00, -3.4718e-04,
-6.4969e-18, -3.0385e-03, -7.1485e-29, 5.5154e-01, -3.2004e-03,
-1.1490e-09, -5.0653e-03, -6.1328e-01, -2.6092e+00, -9.7728e-13,
-5.3176e-38, -4.9417e-03, 2.5313e+00, 1.5049e+00, -2.3744e-31,
1.9715e+00, -7.9868e-21, -1.8723e-02, 1.0208e+00, -3.6904e-32,
-3.5802e-37, -2.0418e-03, 2.8885e+00, -1.0221e-37, -1.4217e-07,
-7.6839e-39, -1.3619e-38, -2.4728e-31, -3.7616e-17, -9.2246e-34,
-2.5572e-18, -2.3757e-20, -9.6840e-26, -3.7811e-41, -4.7578e-38,
-1.5225e-28, -1.4360e-26, -4.0357e-02, -4.3764e-34, -2.4557e-20,
-5.5333e-30, -4.1141e-01, -3.5403e-13, -1.3820e-36], device='cuda:0')

bn1.running_mean torch.Size([64])
tensor([-5.1580e-41, -2.2162e-01, -3.8536e-41, 1.0664e-42, 1.5112e-40,
-1.0727e+00, -1.6114e-41, -1.9895e-01, 5.2016e-41, -1.0795e+00,
1.8222e-01, -7.1178e-41, 7.8977e-42, 7.0868e-41, 5.6472e-42,
-3.1051e+00, -2.3679e-01, -2.0414e-01, -8.4614e-02, 8.4601e-41,
4.4686e-41, 2.0061e-41, 8.7164e-41, -1.6375e-01, 6.5995e-38,
-2.3002e-41, -6.7373e-26, -5.3808e-01, -4.0784e+00, -9.7357e-41,
-2.3948e-41, -8.4134e-41, 3.5338e-01, -3.4059e-02, -4.3182e-41,
5.4297e-02, -9.8287e-42, -4.9485e-20, -1.7786e-01, -3.1166e-41,
6.7524e-41, -1.2594e-11, 5.1578e+00, 1.2567e-40, -2.0809e-41,
-4.5475e-41, 1.4825e-40, 7.3806e-41, -5.3417e-41, 6.8528e-41,
9.7266e-41, -4.3735e-41, -5.3389e-43, -1.9674e-41, -8.1706e-41,
3.6630e-41, -3.2006e-41, -5.7090e-06, 6.3986e-41, 8.4730e-41,
-7.3008e-41, -9.1969e-03, 4.2166e-41, 1.1506e-40], device='cuda:0')

bn1.running_var torch.Size([64])
tensor([5.6052e-45, 2.3596e+00, 5.6052e-45, 5.6052e-45, 5.6052e-45, 2.2151e+00,
5.6052e-45, 1.4921e-02, 5.6052e-45, 7.1242e-01, 1.1669e-02, 5.6052e-45,
5.6052e-45, 5.6052e-45, 5.6052e-45, 3.2612e+00, 2.5997e+00, 2.2400e+00,
3.5325e-01, 5.6052e-45, 5.6052e-45, 5.6052e-45, 5.6052e-45, 3.2991e-01,
5.6052e-45, 5.6052e-45, 5.6052e-45, 4.0805e+00, 5.2654e+00, 5.6052e-45,
5.6052e-45, 5.6052e-45, 7.9652e-01, 2.1556e+00, 5.6052e-45, 3.4721e-01,
5.6052e-45, 1.0450e-39, 1.3032e+00, 5.6052e-45, 5.6052e-45, 7.0362e-23,
8.3884e+00, 5.6052e-45, 5.6052e-45, 5.6052e-45, 5.6052e-45, 5.6052e-45,
5.6052e-45, 5.6052e-45, 5.6052e-45, 5.6052e-45, 5.6052e-45, 5.6052e-45,
5.6052e-45, 5.6052e-45, 5.6052e-45, 1.2762e-11, 5.6052e-45, 5.6052e-45,
5.6052e-45, 2.9369e-05, 5.6052e-45, 5.6052e-45], device='cuda:0')
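An audit like the one above can be scripted. A minimal sketch, assuming checkpoint_robust.pth deserializes to a flat state_dict (if the checkpoint wraps the weights in an outer dict, unwrap it first):

import torch

FLT_MIN = 1.1754944e-38  # smallest normal float32 magnitude

state = torch.load("checkpoint_robust.pth", map_location="cpu")
for name, t in state.items():
    if not torch.is_tensor(t) or not t.is_floating_point():
        continue
    sub = ((t != 0) & (t.abs() < FLT_MIN)).sum().item()
    if sub:
        print(f"{name}: {sub}/{t.numel()} subnormal entries")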

@hanson-young
Owner

hanson-young commented May 17, 2020 via email

Yes. From a channel-pruning point of view, you really don't need a network this large for this task. See Learning Efficient Convolutional Networks through Network Slimming.

@fujingling
Author

I've already pruned the model down to something very small, 650k, and it's still fairly slow.
@fujingling fujingling reopened this May 18, 2020
@KaiOtter

Print the parameters out and check. If it is the precision problem, set all the abnormally small values to zero and run it again; it has no effect on accuracy at all. I tried it: clipping the values in the original model makes the speed take off immediately.
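A minimal sketch of that clipping step, assuming a flat state_dict as before; the 1e-37 threshold and the output filename are illustrative:

import torch

THRESH = 1e-37  # treat anything below the normal float32 range as zero

state = torch.load("checkpoint_robust.pth", map_location="cpu")
for name, t in state.items():
    if torch.is_tensor(t) and t.is_floating_point():
        t[t.abs() < THRESH] = 0.0  # zero out the subnormal tail in place
torch.save(state, "checkpoint_clipped.pth")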

@epoc88

epoc88 commented Jan 9, 2021

The speed on CPU:
inference_cost_time: 29.044181

After clipping the values, the speed shows no major improvement.

@kaikaizhu

@KaiOtter I set every value below 1e-36 to 0e-00, and the CPU speed barely changed. How exactly did you do it? Could you send me your modified weight file? [email protected]
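One caveat that may explain these mixed results: zeroing the stored weights does not stop subnormals from re-appearing in the intermediate activations at inference time. On x86 CPUs, PyTorch can instead flush denormals to zero in hardware via torch.set_flush_denormal. A minimal sketch, assuming a PFLDInference definition that matches the checkpoint being loaded:

import torch
from pfld import PFLDInference  # the model definition posted above

# Enable flush-to-zero / denormals-are-zero if the CPU supports it (SSE3+).
# Returns False when unsupported; denormal arithmetic then stays slow.
print("FTZ enabled:", torch.set_flush_denormal(True))

model = PFLDInference()
model.load_state_dict(torch.load("checkpoint_robust.pth", map_location="cpu"))
model.eval()
with torch.no_grad():
    pose, landmarks = model(torch.randn(1, 3, 112, 112))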
