Turns out firewalld does NOT play nice with cloud-init and will stochastically KILL AWS EC2 instances on startup #49

Xunnamius · 2023-09-05T16:53:47Z

CyberPanel installs firewalld by default. On a system that uses cloud-init, this causes a race condition that, around 75% of the time for me, resulted in an instance with a broken network stack that I could not SSH into.

Since I thought it was a problem with my custom netplan, I kept spinning up new instances with different init scripts trying to disable the netplan... and eventually got in! ... only to realize that my init scripts were never actually running, and netplan had never been changed. It wasn't netplan being problematic; it was something else. Further investigation led me to the following with journalctl:

Sep 05 07:37:40 X systemd[1]: network-pre.target: Found ordering cycle on firewalld.service/start
Sep 05 07:37:40 X systemd[1]: network-pre.target: Found dependency on basic.target/start
Sep 05 07:37:40 X systemd[1]: network-pre.target: Found dependency on sockets.target/start
Sep 05 07:37:40 X systemd[1]: network-pre.target: Found dependency on apport-forward.socket/start
Sep 05 07:37:40 X systemd[1]: network-pre.target: Found dependency on sysinit.target/start
Sep 05 07:37:40 X systemd[1]: network-pre.target: Found dependency on cloud-init.service/start
Sep 05 07:37:40 X systemd[1]: network-pre.target: Found dependency on systemd-networkd-wait-online.service/start
Sep 05 07:37:40 X systemd[1]: network-pre.target: Found dependency on systemd-networkd.service/start
Sep 05 07:37:40 X systemd[1]: network-pre.target: Found dependency on network-pre.target/start
Sep 05 07:37:40 X systemd[1]: network-pre.target: Job firewalld.service/start deleted to break ordering cycle starting with network-pre.target/start

It was different services being "deleted to break [the] ordering cycle" on different snapshots, such as cloud-init, dbus (which I'm pretty sure caused this for this poor fellow), etc. I kept restarting various snapshots to see when next they'd let me in. About 25% of the time they did.
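For anyone triaging the same symptom, a minimal Python sketch like the one below can pull the cycle out of saved journalctl output and report which job systemd dropped on a given boot. The function name and regexes are illustrative assumptions, written only against the log lines quoted above; other systemd versions may word these messages differently.

```python
import re

# Patterns written against the systemd messages quoted in this issue.
# Assumption: other systemd versions may phrase these lines differently.
CYCLE_OR_DEP = re.compile(r"Found (?:ordering cycle|dependency) on (\S+)/start")
DELETED = re.compile(r"Job (\S+)/start deleted to break ordering cycle")


def parse_ordering_cycle(lines):
    """Return (units_in_cycle, deleted_unit) from an iterable of journal lines.

    units_in_cycle lists the units in the order systemd reported them;
    deleted_unit is the job systemd dropped, or None if no such line was seen.
    """
    units = []
    deleted = None
    for line in lines:
        match = CYCLE_OR_DEP.search(line)
        if match:
            units.append(match.group(1))
            continue
        match = DELETED.search(line)
        if match:
            deleted = match.group(1)
    return units, deleted
```

Feeding it the lines from one boot (for example, the output of `journalctl -b` saved to a file and read with `open(...)`) makes it easy to compare which unit was sacrificed across snapshots.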

This strange behavior and the weird error logs eventually led me to this forum post from 2019, which led me to firewalld/firewalld#414. Quite the ~~three day time sink~~ debugging adventure.

So: CP either shouldn't install firewalld on systems where cloud-init is present, or CP should delete the firewalld.service file and supply its own (e.g. patch firewalld to run later). AWS already has an instance-level firewall, so firewalld isn't useful. And fail2ban already works with iptables by default.

For now, uninstalling firewalld will do.

Enhancement 1: instead of firewalld as the backend, just use AWS CLI/API firewall controls and let Amazon deal with running the firewall.
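A rough sketch of what Enhancement 1 could look like, assuming "AWS firewall controls" means EC2 security groups and that boto3 is the client library (neither is specified in this issue). The helper names `tcp_ingress_rule` and `allow_ingress` are hypothetical; `authorize_security_group_ingress` is the existing boto3 EC2 client call.

```python
def tcp_ingress_rule(cidr, port, description="managed by panel"):
    """Build one IpPermissions entry allowing TCP traffic on a single port."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "IpRanges": [{"CidrIp": cidr, "Description": description}],
    }


def allow_ingress(group_id, rule, region_name=None):
    """Add the rule to a security group.

    Assumption: boto3 is installed and AWS credentials are configured.
    group_id is the ID of the security group attached to the instance.
    """
    import boto3  # third-party; imported lazily so the rule builder works without it

    ec2 = boto3.client("ec2", region_name=region_name)
    return ec2.authorize_security_group_ingress(
        GroupId=group_id, IpPermissions=[rule]
    )
```

Keeping the rule builder separate from the API call means the panel could validate or display rules without touching AWS at all.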

Enhancement 2: the CP installer must ask if firewalld should be installed, and must default to "no" on systems with cloud-init installed. Warnings should be given about installing firewalld on a cloud-init system (like AWS VPSes) and how it can break the network stack (so take a snapshot before continuing!).
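The check behind Enhancement 2 could be sketched roughly as follows. This is a heuristic under stated assumptions, not CyberPanel's installer code: it treats cloud-init as present when a `cloud-init` executable is on the PATH or `/etc/cloud/cloud.cfg` exists, and the function names are hypothetical.

```python
import shutil
from pathlib import Path


def cloud_init_present(which=shutil.which, marker=Path("/etc/cloud/cloud.cfg")):
    """Heuristic: cloud-init counts as present if its binary or main config exists.

    The lookup function and marker path are parameters so the check can be
    exercised without a real cloud-init install.
    """
    return which("cloud-init") is not None or marker.exists()


def firewalld_default_answer(has_cloud_init):
    """Default answer for the 'install firewalld?' prompt."""
    return "no" if has_cloud_init else "yes"


def firewalld_warning(has_cloud_init):
    """Warning text to show before the prompt, or None if not needed."""
    if not has_cloud_init:
        return None
    return (
        "cloud-init detected: installing firewalld can break the network "
        "stack on boot. Take a snapshot before continuing."
    )
```

An installer would call `cloud_init_present()` once, print the warning if there is one, and pre-fill the prompt with `firewalld_default_answer(...)`.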

Xunnamius added the bug, priority:high, and enhancement labels Sep 5, 2023