Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fluentd workers SIGSEGV for specific Windows events channels (security and powershell events) #106

Open
iivvss opened this issue Jun 5, 2024 · 4 comments

Comments

@iivvss
Copy link

iivvss commented Jun 5, 2024

Hi team,

I am experiencing regular crashes from some ruby workers collecting windows events using windows event forwarding and custom event channels.

I am seeing worker 2 (WEC_Security-OS-Security-Low-3 - Specific security event IDs) and 4 (WEC_AppAndServices-OS-PowerShell - Powershell events) and worker crashing in a loop:

2024-06-05 15:09:47 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV 2024-06-05 15:09:56 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV 2024-06-05 15:10:01 +0000 [error]: Worker 2 exited unexpectedly with signal SIGSEGV 2024-06-05 15:10:09 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV 2024-06-05 15:10:18 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV 2024-06-05 15:10:20 +0000 [error]: Worker 2 exited unexpectedly with signal SIGSEGV 2024-06-05 15:10:28 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV 2024-06-05 15:10:30 +0000 [error]: Worker 2 exited unexpectedly with signal SIGSEGV 2024-06-05 15:10:40 +0000 [error]: Worker 2 exited unexpectedly with signal SIGSEGV 2024-06-05 15:10:43 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV 2024-06-05 15:10:57 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV 2024-06-05 15:11:01 +0000 [error]: Worker 2 exited unexpectedly with signal SIGSEGV

I suspect the windows_eventlog2 plugin to reach a limit somewhere but it's not correlated to the number of events, as neither Security low 3 nor Powershell have the biggest number of events accross all other channels.

Here is the fluentd configuration I am working with:


<system>
	log_level info
	workers 7
	# restart_worker_interval 1s
	<log>
		rotate_size 1048576
	</log>
</system>

<worker 0>
	<source>
		# @log_level trace
		@type windows_eventlog2
		@id windows_eventlog2 #id must be unique
		channels WEC_AppAndServices-OS-AppLocker,WEC_AppAndServices-OS-DNSServer,WEC_AppAndServices-OS-RemoteAccess,WEC_AppAndServices-OS-ADFS,WEC_AppAndServices-OS-Application,WEC_Security-OS-Security-Medium
		read_existing_events false
		tag windows_event_logs
		rate_limit 1000
		read_interval 1s
		render_as_xml true
		<storage>
			persistent false #in memory storage
		</storage>
		<parse>
			@type none
		</parse>
	</source>

	<filter windows_event_logs>
		@type record_transformer
		enable_ruby
		<record>
			message ${record["message"].gsub("\r\n", "\\r\\n").gsub("\t", "\\t")}
		</record>
	</filter>

	<match windows_event_logs>
		@type forward
		send_timeout 60s
		recover_wait 10s
		hard_timeout 60s
		keepalive true
		keepalive_timeout 180
		#compress gzip
		<server>
			name device_name # Set  Name
			host x.x.x.x  of the  provisioned for the client
			port 10000
			weight 60
		</server>
		<buffer>
			@type memory
			flush_mode immediate #no flush_interval if mode is immediate
			chunk_limit_size 512m
			flush_thread_count 16
			total_limit_size 4GB #max total RAM allocation
		</buffer>
	</match>
</worker>

<worker 1>
	<source>
		# @log_level trace
		@type windows_eventlog2
		@id windows_eventlog21 #id must be unique
		channels WEC_Security-OS-Security-Low-2
		read_existing_events false
		tag windows_event_logs
		rate_limit 100
		read_interval 1s
		render_as_xml true
		<storage>
			persistent false
		</storage>
		<parse>
			@type none
		</parse>
	</source>

	<filter windows_event_logs>
		@type record_transformer
		enable_ruby
		<record>
			message ${record["message"].gsub("\r\n", "\\r\\n").gsub("\t", "\\t")}
		</record>
	</filter>

	<match windows_event_logs>
		@type forward
		send_timeout 60s
		recover_wait 10s
		hard_timeout 60s
		keepalive true
		keepalive_timeout 180
		#compress gzip
		<server>
			name device_name # Set  Name
			host x.x.x.x  of the  provisioned for the client
			port 10000
			weight 60
		</server>
		<buffer>
			@type memory
			flush_mode immediate #no flush_interval if mode is immediate
			chunk_limit_size 512m
			flush_thread_count 16
			total_limit_size 4GB #max total RAM allocation
		</buffer>
	</match>
</worker>

<worker 2>
	<source>
		# @log_level trace
		@type windows_eventlog2
		@id windows_eventlog22 #id must be unique
		channels WEC_Security-OS-Security-Low-3
		read_existing_events false
		tag windows_event_logs
		rate_limit 50
		read_interval 1s
		render_as_xml true
		<storage>
			persistent false
		</storage>
		<parse>
			@type none
		</parse>
	</source>

	<filter windows_event_logs>
		@type record_transformer
		enable_ruby
		<record>
			message ${record["message"].gsub("\r\n", "\\r\\n").gsub("\t", "\\t")}
		</record>
	</filter>

	<match windows_event_logs>
		@type forward
		send_timeout 60s
		recover_wait 10s
		hard_timeout 60s
		keepalive true
		keepalive_timeout 180
		#compress gzip
		<server>
			name device_name # Set  Name
			host x.x.x.x  of the  provisioned for the client
			port 10000
			weight 60
		</server>
		<buffer>
			@type memory
			flush_mode immediate #no flush_interval if mode is immediate
			chunk_limit_size 512m
			flush_thread_count 16
			total_limit_size 4GB #max total RAM allocation
		</buffer>
	</match>
</worker>

<worker 3>
	<source>
		@type windows_eventlog2
		@id windows_eventlog23 #id must be unique
		channels WEC_Security-OS-Security-High,WEC_System-OS-System,WEC_AppAndServices-OS-MicrosoftDefender
		read_existing_events false
		tag windows_event_logs
		rate_limit 1000
		read_interval 1s
		render_as_xml true
		<storage>
			persistent false
		</storage>
		<parse>
			@type none
		</parse>
	</source>

	<filter windows_event_logs>
		@type record_transformer
		enable_ruby
		<record>
			message ${record["message"].gsub("\r\n", "\\r\\n").gsub("\t", "\\t")}
		</record>
	</filter>

	<match windows_event_logs>
		@type forward
		send_timeout 60s
		recover_wait 10s
		hard_timeout 60s
		keepalive true
		keepalive_timeout 180
		#compress gzip
		<server>
			name device_name # Set  Name
			host x.x.x.x  of the  provisioned for the client
			port 10000
			weight 60
		</server>
		<buffer>
			@type memory
			flush_mode immediate #no flush_interval if mode is immediate
			chunk_limit_size 512m
			flush_thread_count 16
			total_limit_size 4GB #max total RAM allocation
		</buffer>
	</match>
</worker>

<worker 4>
	<source>
		@type windows_eventlog2
		@id windows_eventlog24 #id must be unique
		channels WEC_AppAndServices-OS-PowerShell
		read_existing_events false
		tag windows_event_logs
		rate_limit 50
		read_interval 1s
		render_as_xml true
		<storage>
			persistent false
		</storage>
		<parse>
			@type none
		</parse>
	</source>

	<filter windows_event_logs>
		@type record_transformer
		enable_ruby
		<record>
			message ${record["message"].gsub("\r\n", "\\r\\n").gsub("\t", "\\t")}
		</record>
	</filter>

	<match windows_event_logs>
		@type forward
		send_timeout 60s
		recover_wait 10s
		hard_timeout 60s
		keepalive true
		keepalive_timeout 180
		#compress gzip
		<server>
			name device_name # Set  Name
			host x.x.x.x  of the  provisioned for the client
			port 10000
			weight 60
		</server>
		<buffer>
			@type memory
			flush_mode immediate #no flush_interval if mode is immediate
			chunk_limit_size 512m
			flush_thread_count 16
			total_limit_size 4GB #max total RAM allocation
		</buffer>
	</match>
</worker>

<worker 5>
	<source>
		@type windows_eventlog2
		@id windows_eventlog25 #id must be unique
		channels WEC_Security-OS-Security-Low-1
		read_existing_events false
		tag windows_event_logs
		rate_limit 50
		read_interval 1s
		render_as_xml true
		<storage>
			persistent false
		</storage>
		<parse>
			@type none
		</parse>
	</source>

	<filter windows_event_logs>
		@type record_transformer
		enable_ruby
		<record>
			message ${record["message"].gsub("\r\n", "\\r\\n").gsub("\t", "\\t")}
		</record>
	</filter>

	<match windows_event_logs>
		@type forward
		send_timeout 60s
		recover_wait 10s
		hard_timeout 60s
		keepalive true
		keepalive_timeout 180
		#compress gzip
		<server>
			name device_name
			host x.x.x.x 
			port 10000
			weight 60
		</server>
		<buffer>
			@type memory
			flush_mode immediate #no flush_interval if mode is immediate
			chunk_limit_size 512m
			flush_thread_count 16
			total_limit_size 4GB #max total RAM allocation
		</buffer>
	</match>
</worker>

Could you help ?

Cheers,

Ivs

@daipom
Copy link
Contributor

daipom commented Jun 6, 2024

Hmm, I wonder what could be the cause...
I know several cases that work without these problems, so maybe there is some unique cause.
Could you please tell me the following info?

  • The info about Fluentd (fluent-package? td-agent? Docker container?).
  • The gem version of fluent-plugin-windows-eventlog and winevt_c.
  • Are there any resource problems such as memory exhaustion?

@iivvss
Copy link
Author

iivvss commented Jun 6, 2024

Hi daipom,

Thanks for the quick answer. Sure, here is the information:

  • The info about Fluentd (fluent-package? td-agent? Docker container?).
    fluent-package-5.0.3-x64.msi
  • The gem version of fluent-plugin-windows-eventlog and winevt_c.
    fluent-plugin-windows-eventlog (0.8.3)
    winevt_c (0.10.1)
  • Are there any resource problems such as memory exhaustion?
    The Windows Server 2022 azure VM has 4 vCPU and 16 GB RAM, the average CPU usage is 20% and the average RAM usage is 4.6 GB. There is plenty of disk space for the system (84.4 GB free) and the other drive where fluentd is installed (245 GB free). There doesn't seem to be any IO bottlenecks on the disks.

@daipom
Copy link
Contributor

daipom commented Jun 7, 2024

Thanks!
I don't see anything wrong with the environment...

I am experiencing regular crashes

So, there's some reproducibility.
We need to narrow down the cause.

Is it caused by a specific event log?
(Like, it contains characters that are not supposed...)

Is there a particular setting we can change to avoid this issue?
(Such as render_as_xml)

@iivvss
Copy link
Author

iivvss commented Jun 10, 2024

So I have set therender_as_xml to false for workers 2 and 4 but I see no improvements (I had to remove the record transformer section also) :

<worker 2>
	<source>
		# @log_level trace
		@type windows_eventlog2
		@id windows_eventlog22 #id must be unique
		channels WEC_Security-OS-Security-Low-3
		read_existing_events false
		tag windows_event_logs
		rate_limit 50
		read_interval 1s
		render_as_xml false
		<storage>
			persistent false
		</storage>
		<parse>
			@type none
		</parse>
	</source>

	# <filter windows_event_logs>
		# @type record_transformer
		# enable_ruby
		# <record>
			# message ${record["message"].gsub("\r\n", "\\r\\n").gsub("\t", "\\t")}
		# </record>
	# </filter>

	<match windows_event_logs>
		@type forward
		send_timeout 60s
		recover_wait 10s
		hard_timeout 60s
		keepalive true
		keepalive_timeout 180
		#compress gzip
		<server>
			name <hostname># Set FPE Name
			host x.x.x.x 
			port 10000
			weight 60
		</server>
		<buffer>
			@type memory
			flush_mode immediate #no flush_interval if mode is immediate
			chunk_limit_size 512m
			flush_thread_count 16
			total_limit_size 4GB #max total RAM allocation
		</buffer>
	</match>
</worker>

I still see the workers crashing:

2024-06-10 14:23:36 +0000 [info]: spawn command to main:  cmdline=["E:/opt/fluent/bin/ruby.exe", "-Eascii-8bit:ascii-8bit", "E:/opt/fluent/bin/fluentd", "-c", "E:\\opt\\fluent\\etc\\fluent\\fluentd.conf", "-o", "E:\\opt\\fluent\\fluentd.log", "-x", "fluentdwinsvc", "--under-supervisor"]
2024-06-10 14:24:03 +0000 [error]: Worker 2 exited unexpectedly with signal SIGSEGV
2024-06-10 14:24:03 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV
2024-06-10 14:24:15 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV
2024-06-10 14:24:22 +0000 [error]: Worker 2 exited unexpectedly with signal SIGSEGV
2024-06-10 14:24:24 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV
2024-06-10 14:24:49 +0000 [error]: Worker 2 exited unexpectedly with signal SIGSEGV
2024-06-10 14:24:51 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV
2024-06-10 14:25:02 +0000 [error]: Worker 2 exited unexpectedly with signal SIGSEGV
2024-06-10 14:25:02 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV
2024-06-10 14:25:13 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV
2024-06-10 14:25:22 +0000 [error]: Worker 2 exited unexpectedly with signal SIGSEGV
2024-06-10 14:25:23 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV
2024-06-10 14:25:32 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV
2024-06-10 14:25:44 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV
2024-06-10 14:25:57 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV
2024-06-10 14:26:07 +0000 [error]: Worker 2 exited unexpectedly with signal SIGSEGV
2024-06-10 14:26:09 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV
2024-06-10 14:26:21 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV
2024-06-10 14:26:35 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV
2024-06-10 14:26:48 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV
2024-06-10 14:26:58 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV
2024-06-10 14:27:00 +0000 [error]: Worker 2 exited unexpectedly with signal SIGSEGV

Is it caused by a specific event log?

I couldn't pinpoint any specific windows event. Worker 2 and 4 are configured to read .evtx files containing specific subset of security and powershell events.

The WEC_Security-OS-Security-Low-3 evtx has a majority of 4627 "Group Membership" events.
The WEC_AppAndServices-OS-PowerShell evtx file has majority of event id 400 and 600 events.

I see no particular event linked to the frequency of the worker crashes. I can try suppressing events from the subscriptions one by one but perhaps you have other ideas before I go this way :) (this will be tedious).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants