Please document in a "known issues" section that installing OKD v4.16/v4.17 in UPI+ABI/AI on baremetal fails #47

Open
GingerGeek opened this issue Dec 17, 2024 · 3 comments

Comments

@GingerGeek
Member

Discussed in okd-project/okd#2068

Originally posted by titou10titou10 December 17, 2024
So OKD v4.16/4.17 is officially out: https://github.com/okd-project/okd/releases/tag/4.17.0-okd-scos.0
Good work!

Could you add somewhere that installing those versions in UPI with the Agent Based Installer (ABI) or the Assisted Installer (AI) on baremetal does not work?
The problem and a possible workaround are described here and in many other discussions (#1938, #2015, #1983 ...)

@JaimeMagiera suggested this in a comment here: okd-project/okd#2019 (reply in thread)

IMHO it could save time for people who use this method to install OKD

@GingerGeek GingerGeek transferred this issue from okd-project/okd Dec 17, 2024
@titou10titou10

@GingerGeek I don't know how you want me to send you the page... Here is a draft

@titou10titou10

titou10titou10 commented Dec 18, 2024

Workaround to install OKD v4.17.0-okd-scos.0 on UPI baremetal with the Agent Based Installer (ABI) or the Assisted Installer (AI)

(This page does not explain how to install OKD with the ABI or AI tools. Check the official docs for the procedure.)

The problems and workarounds described here are covered in much more detail in many "discussions" and "issues" in the OKD GitHub project, e.g. #2015, #2018, #2035 ...

Context and problems

The current stable version of OKD (v4.17.0-okd-scos.0) is based on SCOS. However, the bootstrap image is still based on an "old" FCOS image.

Installing v4.17.0-okd-scos.0 on baremetal in UPI using the ABI and the bundled FCOS bootstrap image (39.20231101.3.0) fails.

Two problems arise:

  • the "assisted-service-db" service fails to start because the PostgreSQL startup script it runs is invalid

    systemctl status assisted-service
    journalctl -f -u assisted-service-db
    > FATAL:  could not create lock file "/var/run/postgresql/.s.PGSQL.5432.lock": No such file or directory
    
  • the "release-image-pivot" service is unable to pivot the OS to the final SCOS image ("/sysroot read-write: Permission denied")

    systemctl status release-image-pivot.service
    bootstrap-pivot.sh[68605]: error: Remounting /sysroot read-write: Permission denied
    release-image-pivot.service: Main process exited, code=exited, status=1/FAILURE
    release-image-pivot.service: Failed with result 'exit-code'.
    Failed to start release-image-pivot.service - Pivot bootstrap to the OpenShift Release Image.
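
To tell quickly which of the two symptoms a node is showing, the log lines quoted above can be matched mechanically. This is only a sketch: the function name `classify_bootstrap_failure` is hypothetical, and it assumes the journal messages are worded exactly as in the excerpts above.

```shell
# Hypothetical helper: reads log text on stdin and reports which of the
# two known symptoms it contains. Returns 0 if at least one was found.
classify_bootstrap_failure() {
  cbf_log=$(cat)
  cbf_status=1
  if printf '%s\n' "$cbf_log" | grep -qF 'could not create lock file "/var/run/postgresql/'; then
    echo 'assisted-service-db: PostgreSQL socket directory is missing'
    cbf_status=0
  fi
  if printf '%s\n' "$cbf_log" | grep -qF 'Remounting /sysroot read-write: Permission denied'; then
    echo 'release-image-pivot: cannot remount /sysroot read-write'
    cbf_status=0
  fi
  return "$cbf_status"
}

# Possible usage on the bootstrap/rendezvous node (not run here):
#   journalctl -b --no-pager -u assisted-service-db -u release-image-pivot \
#     | classify_bootstrap_failure
```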
    

Until the FCOS bootstrap image is replaced by a working SCOS one, the following workaround can be used:

For ABI install

Override the OS image by setting "OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE" before generating the install ISO image:

   export OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE=https://mirror.openshift.com/pub/openshift-v4/x86_64/dependencies/rhcos/4.17/4.17.0/rhcos-4.17.0-x86_64-live.x86_64.iso
   oc adm release extract --command=openshift-install quay.io/okd/scos-release:4.17.0-okd-scos.0
   #  prepare the install directory with agent-config.yaml and install-config.yaml...
   ./openshift-install agent create image --dir install --log-level=debug    
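
The override only has an effect if the variable is exported in the same shell session that later runs `openshift-install`. A small pre-flight check, offered as a sketch (the function name `check_os_image_override` is hypothetical, and it only checks visibility of the variable, not that the URL is reachable):

```shell
# Hypothetical pre-flight check: verify that a child process can actually
# see the override. A variable that is set but not exported is invisible
# to openshift-install, which would then use its bundled bootstrap image.
check_os_image_override() {
  coio_seen=$(sh -c 'printf %s "${OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE:-}"')
  if [ -z "$coio_seen" ]; then
    echo 'OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE is not exported in this shell' >&2
    return 1
  fi
  echo "override in effect: $coio_seen"
}

# Possible usage, right before generating the ISO (not run here):
#   check_os_image_override && ./openshift-install agent create image --dir install
```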

For Assisted Installer install

Edit the "okd-configmap.yml" or your own AI ConfigMap file (from https://github.com/openshift/assisted-service/blob/master/deploy/podman/okd-configmap.yml) and set these two entries:

   OS_IMAGES: '[{"openshift_version":"4.17","cpu_architecture":"x86_64","url":"https://mirror.openshift.com/pub/openshift-v4/x86_64/dependencies/rhcos/4.17/4.17.0/rhcos-4.17.0-x86_64-live.x86_64.iso","version":"417.94.202408270355-0"}]'
   RELEASE_IMAGES: '[{"openshift_version":"4.17","cpu_architecture":"x86_64","cpu_architectures":["x86_64"],"url":"quay.io/okd/scos-release:4.17.0-okd-scos.0","version":"4.17.0-okd-scos.0","default":true,"support_level":"beta"}]'

(The selected images come from the OCP assisted-service ConfigMap maintained here: https://github.com/openshift/assisted-service/blob/master/deploy/podman/configmap.yml)
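
Both ConfigMap values are JSON arrays embedded in YAML strings, so a mismatched bracket is easy to miss when editing by hand. A minimal sketch for checking a value before deploying, assuming `python3` is available on the machine where the file is edited (the helper name `is_valid_json` is hypothetical):

```shell
# Hypothetical helper: succeeds only if its argument parses as JSON.
# Assumes python3 is installed; only the standard json module is used.
is_valid_json() {
  printf '%s' "$1" | python3 -c 'import json, sys; json.load(sys.stdin)' 2>/dev/null
}

# Paste the value exactly as it appears between the single quotes in the ConfigMap.
os_images='[{"openshift_version":"4.17","cpu_architecture":"x86_64","url":"https://mirror.openshift.com/pub/openshift-v4/x86_64/dependencies/rhcos/4.17/4.17.0/rhcos-4.17.0-x86_64-live.x86_64.iso","version":"417.94.202408270355-0"}]'

if is_valid_json "$os_images"; then
  echo 'OS_IMAGES: valid JSON'
else
  echo 'OS_IMAGES: invalid JSON, check brackets and quotes' >&2
fi
```

The same check can be applied to the RELEASE_IMAGES value.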

Then, for both methods

Boot the nodes and start the install.
After a few minutes, the installation on the bootstrap/rendezvous node stalls because the "assisted-service-db" service fails to start:

  systemctl status assisted-service
  journalctl -f -u assisted-service-db
  > FATAL:  could not create lock file "/var/run/postgresql/.s.PGSQL.5432.lock": No such file or directory

The "/var/run/postgresql/" directory, which PostgreSQL uses for its Unix socket, does not exist. This is most likely a bug in the "quay.io/okd/scos-content@sha256:..." image used for the service, or a problem with the definition of the systemd service.

The solution is to tell PostgreSQL to use another directory for its Unix socket. To do that, log in to the bootstrap node:

ssh core@<bootstrap/rendez-vous node>
sudo su -
vi /etc/systemd/system/assisted-service-db.service
>  replace the ExecStart=... line with this one
ExecStart=/usr/bin/podman run --net host --user=postgres --cidfile=%t/%n.ctr-id --cgroups=no-conmon --log-driver=journald --rm --pod-id-file=%t/assisted-service-pod.pod-id --sdnotify=conmon --replace -d --name=assisted-db --env-file=/usr/local/share/assisted-service/assisted-db.env $SERVICE_IMAGE /bin/bash -c '/usr/bin/pg_ctl -D /tmp/postgres/data/ -l /tmp/postgres/logfile start -w -o "-k /tmp"; createuser -s admin -h localhost; createdb installer -h localhost; /usr/bin/pg_ctl -D /tmp/postgres/data/ -l /tmp/postgres/logfile stop -w -o "-k /tmp"; exec postgres -D /tmp/postgres/data/ -k /tmp'
> save, exit :wq

systemctl daemon-reload && systemctl restart assisted-service-db

Basically, this replaces the command "/bin/bash start_db.sh" used in the systemd unit with the content of "start_db.sh", and adds a parameter ("-k /tmp") that changes the location of the socket to each "pg_ctl" command and to the final "postgres" invocation.
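
If editing the unit by hand in vi is inconvenient (for example when several installs have to be patched), the same replacement can be scripted. This is a sketch only: the function name `replace_execstart` is hypothetical, and it assumes the unit has its ExecStart= directive on a single line.

```shell
# Hypothetical helper: replace the first ExecStart= line of a unit file.
#   $1 = path to the unit file
#   $2 = new command (everything that should follow "ExecStart=")
# The command is passed to awk through the environment so that quotes,
# backslashes and %-specifiers in it are copied verbatim.
replace_execstart() {
  re_unit=$1
  re_tmp=$(mktemp) || return 1
  NEW_EXECSTART=$2 awk '
    /^ExecStart=/ && !replaced { print "ExecStart=" ENVIRON["NEW_EXECSTART"]; replaced = 1; next }
    { print }
  ' "$re_unit" > "$re_tmp" && cat "$re_tmp" > "$re_unit"   # cat keeps the file's owner and mode
  re_status=$?
  rm -f "$re_tmp"
  return "$re_status"
}

# Possible usage as root on the bootstrap node (not run here), with
# new_cmd holding the podman command shown above:
#   replace_execstart /etc/systemd/system/assisted-service-db.service "$new_cmd"
#   systemctl daemon-reload && systemctl restart assisted-service-db
```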

Then the installation continues and after some time you can enjoy your freshly installed v4.17.0-okd-scos.0 cluster

@GingerGeek
Member Author

Hi, thank you for writing this. I don't think we can use this workaround, since it relies on an RHCOS image, which I think is subject to RHEL licensing. I understand that it's only used temporarily, for the initial bootstrap.

Perhaps it's allowed under the usual RHEL developer license, which I think allows X amount of machines? I will need to take this to the working group.
