Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Fix 20284]: Enhance smartswitch environment variables parsing #21209

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

Pterosaur
Copy link
Contributor

@Pterosaur Pterosaur commented Dec 18, 2024

Why I did it

Fix issue: #20284

In 202405 and above, two extra steps are added before the start of every container which checks NUM_DPU and IS_DPU_DEVICE by parsing the platform.json file using the jq tool. This is only relevant for Smartswitch. However, this is adding some delay during the reconciliation phase of WR/FR resulting

Work item tracking
  • Microsoft ADO (number only):

How I did it

Set the environment variables for systemd by systemd-sonic-generator.

How to verify it

  1. Check there is no jq command under the swss.sh start and syncd.sh start from the sonic-bootchart
  2. Regression test:
echo '{"DPUS":{"dpu0":{}, "dpu1":{}}}' | sudo tee /usr/share/sonic/device/x86_64-kvm_x86_64-r0/platform.json
docker rm -f database
sudo reboot

...

admin@vlab-01:~$ sudo systemctl list-unit-files '*midplane*'
UNIT FILE                    STATE           PRESET
midplane-network-dpu.service disabled        enabled
midplane-network-npu.service enabled-runtime  enabled


# The DPU databases should be created
admin@vlab-01:~$ docker ps -a
CONTAINER ID   IMAGE                                COMMAND                  CREATED          STATUS         PORTS     NAMES
4455f7ab611c   docker-database:latest               "/usr/local/bin/dock…"   5 minutes ago    Up 5 minutes             databasedpu0
8e983b5beb27   docker-database:latest               "/usr/local/bin/dock…"   5 minutes ago    Up 5 minutes             databasedpu1

# Environment variables should be inserted to the service file
admin@vlab-01:~$ cat $(sudo find /etc -name 'swss*service')
[Unit]
Description=switch state service
[email protected]

[email protected]

[email protected]

[email protected]

Requires=config-setup.service
After=config-setup.service
BindsTo=sonic.target
After=sonic.target
Before=ntp-config.service
StartLimitIntervalSec=1200

StartLimitBurst=3

[Service]
Environment="NUM_DPU=2"
Environment="IS_DPU_DEVICE=false"
User=root
Environment=sonic_asic_platform=vs
ExecStartPre=/usr/local/bin/swss.sh start
ExecStart=/usr/local/bin/swss.sh wait
ExecStop=/usr/local/bin/swss.sh stop
RestartSec=30

[Install]
WantedBy=sonic.target

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202211
  • 202305

Tested branch (Please provide the tested image version)

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

@Pterosaur Pterosaur requested a review from lguohan as a code owner December 18, 2024 09:47
@mssonicbld
Copy link
Collaborator

/azp run Azure.sonic-buildimage

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@Pterosaur
Copy link
Contributor Author

/azpw ms_conflict

@Pterosaur
Copy link
Contributor Author

/azpw ms_conflict

@liushilongbuaa
Copy link
Contributor

/azpw ms_conflict

env_vars["IS_DPU_DEVICE"] = (smart_switch_dpu ? "true" : "false");
env_vars["NUM_DPU"] = std::to_string(num_dpus);

for (const auto& [key, value] : env_vars) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think services files such as database, swss, syncd are enabled at build time.
Are we certain that, these will be started only after systemd-sonic-generator is run atleast once?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also verify the changes on smartswitch platform and make sure multiple database instances are created

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your comments.
Yes, the generator will be run only once and before all services started.
I did confirm the smartswitch scenario and pasted the results on the PR description.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for testing


std::unordered_map<std::string, std::string> env_vars;
env_vars["IS_DPU_DEVICE"] = (smart_switch_dpu ? "true" : "false");
env_vars["NUM_DPU"] = std::to_string(num_dpus);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good approach to solve this issue but i see one setback. If we need to add new env variables, ssg code should be updated. which IMO is not very flexible. I have an idea for generic solution, let me know what you think

  1. Add a oneshot service very early in the boot. Read static env variables (Eg: $PLATFORM, $NUM_ASIC, $NUM_DPU, $SONIC_BOOT_TYPE etc) and write them to a common file /etc/sonic/static-env-variables
  2. We can leverage ssg to write EnvironmentFile=/etc/sonic/static-env-variables option to all the services. making the ssg code minimal and flexible.
  3. We can potentially clean a lot of code under docker_image_ctl with this approach

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, I can have a try.

Copy link
Contributor Author

@Pterosaur Pterosaur Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried it, but I feel it wasn't easy yet.

  1. The shell of oneshot service you mentioned might look like the following. But at this time, the database service hasn't ready, so the sonic-cfggen would not work and I have to parse the /host/machine.conf as same as what systemd-sonic-generator did if I would like to get the variable. The SSG has done this by an efficient function, C function, why should I do it again by a shell?
SYSTEMD_ENV_FILE="/etc/sonic/static_env"

# Load platform from sonic-cfggen

PLATFORM=${PLATFORM:-`sonic-cfggen -H -v DEVICE_METADATA.localhost.platform`}

echo "PLATFORM='${PLATFORM}'" >> "$SYSTEMD_ENV_FILE"

# Parse environment from platform.json

PLATFORM_JSON=/usr/share/sonic/device/$PLATFORM/platform.json

if [ -f "$PLATFORM_JSON" ]; then

# Environment variables for Smart Switch

    NUM_DPU=$(jq -r '.DPUS | length' $PLATFORM_JSON 2>/dev/null)
    if [[ -z "$NUM_DPU" ]]; then
        NUM_DPU=0
    fi

    jq -e '.DPU' $PLATFORM_JSON >/dev/null
    if [[ $? -eq 0 ]]; then
        IS_DPU_DEVICE="true"
    else
        IS_DPU_DEVICE="false"
    fi

    echo "NUM_DPU='${NUM_DPU}'" >> "$SYSTEMD_ENV_FILE"
    echo "IS_DPU_DEVICE='${IS_DPU_DEVICE}'" >> "$SYSTEMD_ENV_FILE"
fi
  1. As your proposal, if we want to add new env variables, we have to update the shell script. I don't see any difference or challenge in doing this via SSG. If you feel that C code isn't flexible, I have to say we had an old SSG with python code previously, But we discarded it and rewrote it via C due to the efficiency issues.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, script couldn't use DB just like SSG

Hmm, difference is we need to run the script once. Since the oneshot service runs before other services are even started, CPU should be fairly free and should be executed quickly. we do need to benchmark this solution to measure impact.

Advantage being load on SSG is less, all it does it to add EnvironmentFIle= and with some optimization we don't even need to edit and write to the .service file after the first

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In any case, i'm okay with the current solution. We might move to the generic oneshot service if required in the future

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this PR looks good to you, could you please help to approve it.

env_vars["NUM_DPU"] = std::to_string(num_dpus);

for (const auto& [key, value] : env_vars) {
tmp_file << "Environment=\"" << key << "=" << value << "\"" << std::endl;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to add Environment= option if the file doesn't have[Service] section in the file

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I worry that the Service section might be introduced by other functions or modules in the future. And I don't see any side-effects if I define some environment variables at this point.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make sure, systemd won't throw an error if we add Environment= values after the last section without [Service]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is no Service section, I will add it.

image

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me

@vivekrnv vivekrnv requested a review from Yakiv-Huryk December 20, 2024 06:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants