Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use netlink API to get net connections for better performance #695

Open
wcc526 opened this issue May 28, 2019 · 8 comments
Open

Use netlink API to get net connections for better performance #695

wcc526 opened this issue May 28, 2019 · 8 comments

Comments

@wcc526
Copy link
Contributor

wcc526 commented May 28, 2019

/proc interface is inadequate, unfortunately. When amount of sockets is enough large, netstat or even plain cat /proc/net/tcp/ cause nothing but pains and curses. In linux-2.4 the desease became worse: even if amount of sockets is small reading /proc/net/tcp/ is slow enough https://www.cyberciti.biz/files/ss.html

https://github.com/shirou/gopsutil/blob/master/net/net_linux.go

like ss command

https://github.com/raboof/connbeat/blob/master/sockets/tcp_diag/tcp_diag.go

https://stackoverflow.com/questions/11763376/difference-between-netstat-and-ss-in-linux

https://pcarleton.com/2018/05/31/netstat-vs-ss/

@Lomanic Lomanic changed the title Please use the tcp_diag to get net connections,which is more quciker Use netlink API to get net connections for better performance May 28, 2019
@Lomanic
Copy link
Collaborator

Lomanic commented May 28, 2019

That looks doable, there is even quite a few netlink libraries available for Go. Would you be interested working on it @wcc526?

@wcc526
Copy link
Contributor Author

wcc526 commented May 29, 2019

OK,I tried it.

@Brian-Williams
Copy link

@shirou Expanding on the idea to use gosigar from this comment.

I briefly looked into this and I think there are some open questions about how this can/should be implemented.

Here is a non-working implementation outline to create talking points:

func sigarLookup(ctx context.Context, kind string) ([]ConnectionStat, error) {
	tmap, ok := familyKindMap[kind]
	if !ok {
		return nil, fmt.Errorf("invalid kind, %s", kind)
	}
	msg := sigar.NewInetDiagReqV2(tmap)
	// TODO: set correct sequence 
	// msg.Header.Seq = 
	msgs, err := sigar.NetlinkInetDiag(msg)
	if err != nil {
		return nil, fmt.Errorf("couldn't send netlink message: %v", err)
	}
	var cs []ConnectionStat
	for _, diag := range msgs {
		conn := ConnectionStat{
			//Fd:     c.fd,
			Family: uint32(diag.Family),
			// Breaking change: this will always be netlink; for backwards compat we would have to loop over previous
			// tmap and use that info
			Type:   syscall.AF_NETLINK,
			Laddr:  Addr{diag.SrcIP().String(), uint32(diag.SrcPort())},
			Raddr:  Addr{diag.DstIP().String(), uint32(diag.DstPort())},
			Status: sigar.TCPState(diag.State).String(),
			// todo: I believe int32(diag.UID) is the UID of the req, so can't be used as the Uids field this may mean
			// that WithoutUids functions may still be relevant post sigar change
			//Uids: []int32{int32(diag.UID)},
			//Pid:    c.pid,
		}
		cs = append(cs, conn)
	}
	return cs, nil
}

Do we make a breaking change to the Type field of ConnectionStat? If we report netlink it is accurate, but changes the struct. If we don't then the response is lying about how the information was collected.

Do we have a fall-through if netlink request fails to previous implementation?

func ConnectionsPidMaxWithContext(ctx context.Context, kind string, pid int32, max int) ([]ConnectionStat, error) {
	ret, err := sigarLookup(ctx, kind)
	if err != nil {
		return ret, nil
	}
	# current ConnectionsPidMaxWithContext implementation here?

I would generally be against this because it is basically silently failing and attempting sigar every time even if it will never work on the host.

Based on my parsing of netlink I believe we may still need to parse /proc/ files to get pid and uid information. Would we potentially deprecate the WithoutUids funcs and do a functional replacement of WithNetlink, which only gathers the information that netlink can query?

@shirou
Copy link
Owner

shirou commented Nov 14, 2019

Thank you for the info. I still can not have a time but I agree to gosigar can not satisfy gopsutil current feature. So I am planing to re-implement by myself (or any contribution is always welcome!)

ghost pushed a commit to signalfx/signalfx-agent that referenced this issue Jan 21, 2020
This skips the uid lookup since we don't use it. See upstream PRs
for some more background:

shirou/gopsutil#784
shirou/gopsutil#783

This requires upgrading gopsutil which changed some structs which in
turn requires upgrading telegraf. Apparently the `Stolen` field was never set
(shirou/gopsutil#677)

Even with this change there are still some performance issues in the host
observer on machines with a large number of connections. Something like this
change would be required to deal with the remaining issues:

shirou/gopsutil#695
ghost pushed a commit to signalfx/signalfx-agent that referenced this issue Jan 21, 2020
This skips the uid lookup since we don't use it. See upstream PRs
for some more background:

shirou/gopsutil#784
shirou/gopsutil#783

This requires upgrading gopsutil which changed some structs which in
turn requires upgrading telegraf. Apparently the `Stolen` field was never set
(shirou/gopsutil#677)

Even with this change there are still some performance issues in the host
observer on machines with a large number of connections. Something like this
change would be required to deal with the remaining issues:

shirou/gopsutil#695
ghost pushed a commit to signalfx/signalfx-agent that referenced this issue Jan 21, 2020
This skips the uid lookup since we don't use it. See upstream PRs
for some more background:

shirou/gopsutil#784
shirou/gopsutil#783

This requires upgrading gopsutil which changed some structs which in
turn requires upgrading telegraf. Apparently the `Stolen` field was never set
(shirou/gopsutil#677)

Even with this change there are still some performance issues in the host
observer on machines with a large number of connections. Something like this
change would be required to deal with the remaining issues:

shirou/gopsutil#695
ghost pushed a commit to signalfx/signalfx-agent that referenced this issue Jan 27, 2020
This skips the uid lookup since we don't use it. See upstream PRs
for some more background:

shirou/gopsutil#784
shirou/gopsutil#783

This requires upgrading gopsutil which changed some structs which in
turn requires upgrading telegraf. Apparently the `Stolen` field was never set
(shirou/gopsutil#677)

Even with this change there are still some performance issues in the host
observer on machines with a large number of connections. Something like this
change would be required to deal with the remaining issues:

shirou/gopsutil#695
ghost pushed a commit to signalfx/signalfx-agent that referenced this issue Jan 27, 2020
This skips the uid lookup since we don't use it. See upstream PRs
for some more background:

shirou/gopsutil#784
shirou/gopsutil#783

This requires upgrading gopsutil which changed some structs which in
turn requires upgrading telegraf. Apparently the `Stolen` field was never set
(shirou/gopsutil#677)

Even with this change there are still some performance issues in the host
observer on machines with a large number of connections. Something like this
change would be required to deal with the remaining issues:

shirou/gopsutil#695
@wcc526
Copy link
Contributor Author

wcc526 commented Nov 2, 2020

there is open source project https://github.com/dean2021/goss

florianl added a commit to florianl/gopsutil that referenced this issue Jun 5, 2024
To retrieve network information use NETLINK_SOCK_DIAG instead of walking
/proc/<PID> and parsing files.
Consequently this reduces the number of syscalls, as walking /proc/<PID> and
opening, reading and closing files is no longer required. But it also reduces
the memory footprint as reading files into memory for processing is no longer
required.

Related issues:
- shirou#695
- shirou#784

Supersedes shirou#809

Signed-off-by: Florian Lehner <[email protected]>
florianl added a commit to florianl/gopsutil that referenced this issue Jun 8, 2024
To retrieve network information use NETLINK_SOCK_DIAG instead of walking
/proc/<PID> and parsing files.
Consequently this reduces the number of syscalls, as walking /proc/<PID> and
opening, reading and closing files is no longer required. But it also reduces
the memory footprint as reading files into memory for processing is no longer
required.

Related issues:
- shirou#695
- shirou#784

Supersedes shirou#809

Signed-off-by: Florian Lehner <[email protected]>
@florianl
Copy link

In #1660 a proper netlink based implementation for net.Connections() is proposed. I'm happy to get your feedback on this proposed change.

@cforce
Copy link

cforce commented Jul 4, 2024

In #1660 a proper netlink based implementation for net.Connections() is proposed. I'm happy to get your feedback on this proposed change.

Is the performance/cpu load better or worse finally? There seems to be different opinions #1660 (comment)

Will that as well improve the performance of accessing process statistics via the /proc filesystem ?
see related effort
open-telemetry/opentelemetry-collector#8789

@florianl
Copy link

florianl commented Jul 19, 2024

@cforce if using #1660 in a real world application, CPU consumption is reduced by 10% and heap size is also reduced by 8 to 10%.
Data accuracy is another reason for #1660. With netlink one gets all network connections at once. While with the current approach short the returned resuts can contain network connections that no longer exists or even miss network connections.

Once mdlayher/netlink#215 got merged, the memory consumption is expected to be even less.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants