Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restart caclmgrd whenever catch exception in child thread or in main thread #194

Open
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

ZhaohuiS
Copy link
Contributor

  • Description
    If there is exception happens in child thread of caclmgrd, the whole caclmgrd service will get stuck.
    Can't get any chance to recover it until we restart caclmgrd service manually.
    So, if it detect any exception in child thread or main thread, it will kill the caclmgrd process itself and systemctl service will restart it.

  • Microsoft work item
    27122359

  • Test evidence

Dec 12 11:06:35.984388 bjw2-can-4600c-2 INFO caclmgrd[1692587]: Translating ACL rules for control plane ACL 'EXTERNAL_CLIENT_ACL' (service: 'EXTERNAL_CLIENT')
Dec 12 11:06:35.984530 bjw2-can-4600c-2 ERR caclmgrd[1692587]: Exception occured at Thread-2 thread due to KeyError('L4_DST_PORT')
Dec 12 11:06:35.985585 bjw2-can-4600c-2 ERR caclmgrd[1692587]: Traceback (most recent call last):
Dec 12 11:06:35.985677 bjw2-can-4600c-2 ERR caclmgrd[1692587]:   File "/usr/local/bin/caclmgrd", line 890, in check_and_update_control_plane_acls
Dec 12 11:06:35.985759 bjw2-can-4600c-2 ERR caclmgrd[1692587]:     self.update_control_plane_acls(namespace, new_config_db_connector)
Dec 12 11:06:35.985870 bjw2-can-4600c-2 ERR caclmgrd[1692587]:   File "/usr/local/bin/caclmgrd", line 827, in update_control_plane_acls
Dec 12 11:06:35.985982 bjw2-can-4600c-2 ERR caclmgrd[1692587]:     iptables_cmds, service_to_source_ip_map  = self.get_acl_rules_and_translate_to_iptables_commands(namespace, config_db_connector)
Dec 12 11:06:35.986100 bjw2-can-4600c-2 ERR caclmgrd[1692587]:   File "/usr/local/bin/caclmgrd", line 715, in get_acl_rules_and_translate_to_iptables_commands
Dec 12 11:06:35.986201 bjw2-can-4600c-2 ERR caclmgrd[1692587]:     dst_ports = [rule_props["L4_DST_PORT"]]
Dec 12 11:06:35.986304 bjw2-can-4600c-2 ERR caclmgrd[1692587]: KeyError: 'L4_DST_PORT'
Dec 12 11:06:35.986405 bjw2-can-4600c-2 ERR caclmgrd[1692587]: Exiting thread Thread-2, put it into exception_queue <queue.Queue object at 0x7fe3e44aec10>
Dec 12 11:09:30.917205 bjw2-can-4600c-2 INFO caclmgrd[1692587]: Checking for exceptions in the queue ...
Dec 12 11:09:30.917414 bjw2-can-4600c-2 ERR caclmgrd[1692587]: Exception in namespace '': KeyError('L4_DST_PORT')
Dec 12 11:09:30.917498 bjw2-can-4600c-2 ERR caclmgrd[1692587]: Traceback (most recent call last):
Dec 12 11:09:30.917561 bjw2-can-4600c-2 ERR caclmgrd[1692587]:   File "/usr/local/bin/caclmgrd", line 890, in check_and_update_control_plane_acls
Dec 12 11:09:30.917615 bjw2-can-4600c-2 ERR caclmgrd[1692587]:     self.update_control_plane_acls(namespace, new_config_db_connector)
Dec 12 11:09:30.917665 bjw2-can-4600c-2 ERR caclmgrd[1692587]:   File "/usr/local/bin/caclmgrd", line 827, in update_control_plane_acls
Dec 12 11:09:30.917715 bjw2-can-4600c-2 ERR caclmgrd[1692587]:     iptables_cmds, service_to_source_ip_map  = self.get_acl_rules_and_translate_to_iptables_commands(namespace, config_db_connector)
Dec 12 11:09:30.917773 bjw2-can-4600c-2 ERR caclmgrd[1692587]:   File "/usr/local/bin/caclmgrd", line 715, in get_acl_rules_and_translate_to_iptables_commands
Dec 12 11:09:30.917828 bjw2-can-4600c-2 ERR caclmgrd[1692587]:     dst_ports = [rule_props["L4_DST_PORT"]]
Dec 12 11:09:30.917882 bjw2-can-4600c-2 ERR caclmgrd[1692587]: KeyError: 'L4_DST_PORT'
Dec 12 11:09:30.917944 bjw2-can-4600c-2 ERR caclmgrd[1692587]: Detect exception in Child thread Thread-2 , generating SIGKILL for main thread
Dec 12 11:09:30.919860 bjw2-can-4600c-2 WARNING systemd[1]: caclmgrd.service: Main process exited, code=killed, status=9/KILL
Dec 12 11:09:30.919949 bjw2-can-4600c-2 WARNING systemd[1]: caclmgrd.service: Failed with result 'signal'.

@mssonicbld
Copy link

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: Zhaohui Sun <[email protected]>
@mssonicbld
Copy link

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: Zhaohui Sun <[email protected]>
@mssonicbld
Copy link

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: Zhaohui Sun <[email protected]>
@mssonicbld
Copy link

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: Zhaohui Sun <[email protected]>
@mssonicbld
Copy link

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: Zhaohui Sun <[email protected]>
@mssonicbld
Copy link

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (2)

scripts/caclmgrd:858

  • The parameter name 'exception_queue' is clear and appropriate.
def check_and_update_control_plane_acls(self, namespace, num_changes, exception_queue):

scripts/caclmgrd:1055

  • The exception handling logic is correctly implemented and ensures that the process is terminated if an exception occurs in any child thread.
namespace, error, _ = exception_queue.get_nowait()
scripts/caclmgrd Outdated Show resolved Hide resolved
scripts/caclmgrd Outdated
msg = traceback.format_exception(exc_type, exc_value, exc_traceback)
for tb_line in msg:
for tb_line_split in tb_line.splitlines():
self.log_error(tb_line_split)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log_error

syslog may drop message or mess order due to UDP protocol. suggest queue the exception, and re-raise in main thread.

Alternatively, using join in main thread to handle thread exception. No need to invent exception_queue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In main thread, it still needs to keep checking if there are next db updates, so it can't join and wait for the child thread's result and handle the exception, that's why I choose exception queue and only check if the queue is empty or not before starting checking the db updates every time in main thread.

Signed-off-by: Zhaohui Sun <[email protected]>
@mssonicbld
Copy link

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: Zhaohui Sun <[email protected]>
@mssonicbld
Copy link

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@ZhaohuiS
Copy link
Contributor Author

@qiluo-msft could you please help review? thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants