Turns out firewalld does NOT play nice with cloud-init and will stochastically KILL AWS EC2 instances on startup #49

Open
Xunnamius opened this issue Sep 5, 2023 · 0 comments
Labels: bug (Something isn't working), enhancement (New feature or request), priority:high

Xunnamius commented Sep 5, 2023

Holy shit what the fuck is this?

Installing firewalld (which CyberPanel does by default) on a system that uses cloud-init causes a race condition. Around 75% of the time for me, it resulted in an instance with a broken network stack that I could not SSH into.

Since I thought it was a problem with my custom netplan, I kept spinning up new instances with different init scripts trying to disable the netplan... and eventually got in! ... only to realize that my init scripts were never actually running, and netplan had never been changed. It wasn't netplan being problematic, it was something else. Further investigation with journalctl led me to the following:

Sep 05 07:37:40 X systemd[1]: network-pre.target: Found ordering cycle on firewalld.service/start
Sep 05 07:37:40 X systemd[1]: network-pre.target: Found dependency on basic.target/start
Sep 05 07:37:40 X systemd[1]: network-pre.target: Found dependency on sockets.target/start
Sep 05 07:37:40 X systemd[1]: network-pre.target: Found dependency on apport-forward.socket/start
Sep 05 07:37:40 X systemd[1]: network-pre.target: Found dependency on sysinit.target/start
Sep 05 07:37:40 X systemd[1]: network-pre.target: Found dependency on cloud-init.service/start
Sep 05 07:37:40 X systemd[1]: network-pre.target: Found dependency on systemd-networkd-wait-online.service/start
Sep 05 07:37:40 X systemd[1]: network-pre.target: Found dependency on systemd-networkd.service/start
Sep 05 07:37:40 X systemd[1]: network-pre.target: Found dependency on network-pre.target/start
Sep 05 07:37:40 X systemd[1]: network-pre.target: Job firewalld.service/start deleted to break ordering cycle starting with network-pre.target/start
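
For anyone wanting to check their own instances for the same thing, here is a rough sketch that pulls the cycle chain and the dropped job out of journal output. The regexes are inferred only from the log lines quoted above, and `summarize_cycle` is a hypothetical helper name, not part of any tool.

```python
import re

# Patterns inferred from the sample journal lines above (an assumption;
# other systemd versions may word these messages differently).
_DEP = re.compile(r"Found (?:ordering cycle|dependency) on (\S+)/start")
_DEL = re.compile(r"Job (\S+)/start deleted to break ordering cycle")


def summarize_cycle(lines):
    """Return (chain, deleted): the units named in the reported cycle,
    in log order, and the unit whose start job systemd deleted."""
    chain, deleted = [], None
    for line in lines:
        dropped = _DEL.search(line)
        if dropped:
            deleted = dropped.group(1)
            continue
        dep = _DEP.search(line)
        if dep:
            chain.append(dep.group(1))
    return chain, deleted
```

Feeding it the lines of `journalctl -b` from a boot that happened to let you in should show which unit lost out on that particular boot.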

Different services were being "deleted to break [the] ordering cycle" on different snapshots, such as cloud-init, dbus (which I'm pretty sure caused this for this poor fellow), etc. I kept restarting various snapshots to see when they'd next let me in. About 25% of the time they did.

This strange behavior and the weird error logs eventually led me to this forum post from 2019, which led me to firewalld/firewalld#414. Quite the three-day time-sink debugging adventure.

So: CyberPanel (CP) either shouldn't install firewalld on systems where cloud-init is present, or CP should delete the firewalld.service file and supply its own (e.g. patch firewalld to run later). AWS already has an instance-level firewall, so firewalld isn't useful. And fail2ban already works with iptables by default.

For now, uninstalling firewalld will do.

Enhancement 1: instead of firewalld as the backend, just use AWS CLI/API firewall controls and let Amazon deal with running the firewall.
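
A minimal sketch of what that could look like, assuming the "instance-level firewall" means EC2 security groups and that boto3 is available with credentials configured. `tcp_ingress_rule` and `open_port` are hypothetical names for illustration; nothing like this exists in CP today.

```python
def tcp_ingress_rule(cidr, port):
    """Build one IpPermissions entry allowing TCP `port` from `cidr`."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "IpRanges": [{"CidrIp": cidr}],
    }


def open_port(group_id, cidr, port):
    """Ask EC2 to add the rule to a security group (assumed backend)."""
    # Imported lazily so the rule builder above works without boto3.
    import boto3

    ec2 = boto3.client("ec2")
    return ec2.authorize_security_group_ingress(
        GroupId=group_id,
        IpPermissions=[tcp_ingress_rule(cidr, port)],
    )
```

The point of the split is that the panel would only ever describe rules; Amazon enforces them outside the instance, so nothing firewall-related has to start during boot.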

Enhancement 2: CP installer must ask if firewalld should be installed, and must default to "no" on systems with cloud-init installed. Warnings should be given about installing firewalld on a cloud-init system (like AWS VPSes) and how it can break the network stack (so take a snapshot before continuing!)
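
Roughly what I have in mind for the prompt logic, as a sketch. The cloud-init detection is a heuristic I'm assuming (binary on PATH or `/etc/cloud/cloud.cfg` present), and both function names are made up for illustration.

```python
import os
import shutil


def cloud_init_present(which=shutil.which, exists=os.path.exists):
    """Heuristic (assumption): cloud-init is on PATH or its config file exists."""
    return which("cloud-init") is not None or exists("/etc/cloud/cloud.cfg")


def should_install_firewalld(answer, has_cloud_init):
    """Interpret a y/n answer. An empty answer takes the default,
    which is "no" whenever cloud-init is present."""
    default = not has_cloud_init
    answer = answer.strip().lower()
    if not answer:
        return default
    return answer in ("y", "yes")
```

The installer would print the warning (and the "take a snapshot first" advice) before reading the answer, then call something like `should_install_firewalld(input("Install firewalld? "), cloud_init_present())`.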
