Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net.Connections("all") causes high CPU usage on server with 120K connections #784

Open
jimis opened this issue Oct 29, 2019 · 9 comments
Open

Comments

@jimis
Copy link

jimis commented Oct 29, 2019

telegraf-1.12.3 with data collection interval 10s, sending data to influxDB
OS: Linux x86_64

Opening this issue since gopsutil was found to be the culprit for influxdata/telegraf#6559

@jimis
Copy link
Author

jimis commented Oct 29, 2019

I believe that telegraf only does net.Connections("all"), as seen in telegraf's netstat plugin ps.go.

I was expecting some slow parsing of /proc/net/tcp6 (like netstat does), but it seems that gopsutil goes through /proc/*/fd/*, which would explain the slowness, as seen in gopsutil's net_linux.go.

@Lomanic
Copy link
Collaborator

Lomanic commented Nov 1, 2019

This will be fixed in #695.

@Lomanic Lomanic closed this as completed Nov 1, 2019
@jimis
Copy link
Author

jimis commented Nov 1, 2019

@Lomanic this is not related to using netlink or not. In fact cat /proc/net/tcp6 | wc -l takes only half second with 120K connections. The problem here is the complicated logic when doing net.Connections("all"), that walks through all PIDs, even though it doesn't need to.

gopsutil/net/net_linux.go

Lines 409 to 430 in 6a8ab03

func ConnectionsPidWithContext(ctx context.Context, kind string, pid int32) ([]ConnectionStat, error) {
tmap, ok := netConnectionKindMap[kind]
if !ok {
return nil, fmt.Errorf("invalid kind, %s", kind)
}
root := common.HostProc()
var err error
var inodes map[string][]inodeMap
if pid == 0 {
inodes, err = getProcInodesAll(root, 0)
} else {
inodes, err = getProcInodes(root, pid, 0)
if len(inodes) == 0 {
// no connection for the pid
return []ConnectionStat{}, nil
}
}
if err != nil {
return nil, fmt.Errorf("cound not get pid(s), %d: %s", pid, err)
}
return statsFromInodes(root, pid, tmap, inodes)
}

If pid == 0 then why do you need to getProcInodesAll() which traverses all of /proc/*/fd/*? Can't you just go through all of /proc/net/tcp* ? I believe you will be missing the pid information then, but is it really worth the overhead?

@Lomanic
Copy link
Collaborator

Lomanic commented Nov 1, 2019

So, this is similar to what's talked about in #771 and not related to #695 indeed.

@jimis
Copy link
Author

jimis commented Nov 1, 2019

#771 is talking about getting rid of getUids() which parses /proc/*/status. This is good but different. I am talking about avoiding iterating over all processes, but just open one file: /proc/net/tcp (or whatever protocol you want). BTW that file contains UIDs anyway.

@Lomanic
Copy link
Collaborator

Lomanic commented Nov 1, 2019

OK, thanks, sorry for misunderstanding, I'm not very familiar with this part of the code.

@danielnelson
Copy link
Contributor

Since the []ConnectionStat type includes the pid, I believe we do need to walk the pid section of procfs in order to fill this field. This appears to be how netstat and even ss works when using the -p flag.

@jimis
Copy link
Author

jimis commented Dec 6, 2019

@danielnelson I agree that the PID can not be traced just with /proc/net/tcp*. So we can't really avoid parsing the whole hierarchy, because that would break the current API. Unless we add a parameter, or even a new API call, that would exclude the PID from the returned data.

ghost pushed a commit to signalfx/signalfx-agent that referenced this issue Jan 21, 2020
This skips the uid lookup since we don't use it. See upstream PRs
for some more background:

shirou/gopsutil#784
shirou/gopsutil#783

This requires upgrading gopsutil which changed some structs which in
turn requires upgrading telegraf. Apparently the `Stolen` field was never set
(shirou/gopsutil#677)

Even with this change there are still some performance issues in the host
observer on machines with a large number of connections. Something like this
change would be required to deal with the remaining issues:

shirou/gopsutil#695
ghost pushed a commit to signalfx/signalfx-agent that referenced this issue Jan 21, 2020
This skips the uid lookup since we don't use it. See upstream PRs
for some more background:

shirou/gopsutil#784
shirou/gopsutil#783

This requires upgrading gopsutil which changed some structs which in
turn requires upgrading telegraf. Apparently the `Stolen` field was never set
(shirou/gopsutil#677)

Even with this change there are still some performance issues in the host
observer on machines with a large number of connections. Something like this
change would be required to deal with the remaining issues:

shirou/gopsutil#695
ghost pushed a commit to signalfx/signalfx-agent that referenced this issue Jan 21, 2020
This skips the uid lookup since we don't use it. See upstream PRs
for some more background:

shirou/gopsutil#784
shirou/gopsutil#783

This requires upgrading gopsutil which changed some structs which in
turn requires upgrading telegraf. Apparently the `Stolen` field was never set
(shirou/gopsutil#677)

Even with this change there are still some performance issues in the host
observer on machines with a large number of connections. Something like this
change would be required to deal with the remaining issues:

shirou/gopsutil#695
ghost pushed a commit to signalfx/signalfx-agent that referenced this issue Jan 27, 2020
This skips the uid lookup since we don't use it. See upstream PRs
for some more background:

shirou/gopsutil#784
shirou/gopsutil#783

This requires upgrading gopsutil which changed some structs which in
turn requires upgrading telegraf. Apparently the `Stolen` field was never set
(shirou/gopsutil#677)

Even with this change there are still some performance issues in the host
observer on machines with a large number of connections. Something like this
change would be required to deal with the remaining issues:

shirou/gopsutil#695
ghost pushed a commit to signalfx/signalfx-agent that referenced this issue Jan 27, 2020
This skips the uid lookup since we don't use it. See upstream PRs
for some more background:

shirou/gopsutil#784
shirou/gopsutil#783

This requires upgrading gopsutil which changed some structs which in
turn requires upgrading telegraf. Apparently the `Stolen` field was never set
(shirou/gopsutil#677)

Even with this change there are still some performance issues in the host
observer on machines with a large number of connections. Something like this
change would be required to deal with the remaining issues:

shirou/gopsutil#695
@shirou shirou added v4 and removed v4 labels Nov 5, 2023
florianl added a commit to florianl/gopsutil that referenced this issue Jun 5, 2024
To retrieve network information use NETLINK_SOCK_DIAG instead of walking
/proc/<PID> and parsing files.
Consequently this reduces the number of syscalls, as walking /proc/<PID> and
opening, reading and closing files is no longer required. But it also reduces
the memory footprint as reading files into memory for processing is no longer
required.

Related issues:
- shirou#695
- shirou#784

Supersedes shirou#809

Signed-off-by: Florian Lehner <[email protected]>
florianl added a commit to florianl/gopsutil that referenced this issue Jun 8, 2024
To retrieve network information use NETLINK_SOCK_DIAG instead of walking
/proc/<PID> and parsing files.
Consequently this reduces the number of syscalls, as walking /proc/<PID> and
opening, reading and closing files is no longer required. But it also reduces
the memory footprint as reading files into memory for processing is no longer
required.

Related issues:
- shirou#695
- shirou#784

Supersedes shirou#809

Signed-off-by: Florian Lehner <[email protected]>
@florianl
Copy link

In #1660 a proper netlink based implementation for net.Connections() is proposed. I'm happy to get your feedback on this proposed change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants