Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue/8653 fix quickstart failures #141

Merged
merged 91 commits into from
Feb 4, 2025

Conversation

Hugo-Inmanta
Copy link
Contributor

@Hugo-Inmanta Hugo-Inmanta commented Jan 20, 2025

Copy link
Contributor

@sanderr sanderr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Preliminary review. Reviewed all but the do_test_deployment_and_verify script for OSS.

Networking/SR Linux/ci/Jenkinsfile Outdated Show resolved Hide resolved
Networking/SR Linux/ci/Jenkinsfile Outdated Show resolved Hide resolved
Networking/SR Linux/ci/do_test_deployment_and_verify.py Outdated Show resolved Hide resolved
Networking/SR Linux/ci/Jenkinsfile Outdated Show resolved Hide resolved
Networking/SR Linux/containerlab/topology.yml Outdated Show resolved Hide resolved
Comment on lines 73 to 75
matcher = isoProductVersion =~ /dev/
if (matcher.matches()) {
return "code.inmanta.com:4567/solutions/containers/service-orchestrator:dev"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be simpler to move this to the second check? i.e. change \d+ to \d*, and check if that part matched anything?

@sanderr sanderr self-requested a review January 24, 2025 10:57
@sanderr sanderr mentioned this pull request Jan 24, 2025
@Hugo-Inmanta
Copy link
Contributor Author

doc pr on core: inmanta/inmanta-core#8693

Networking/SR Linux/ci/do_test_deployment_and_verify.py Outdated Show resolved Hide resolved
Networking/SR Linux/ci/do_test_deployment_and_verify.py Outdated Show resolved Hide resolved
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be good to additionally request review from someone of the solutions team for this file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have anything specific in mind to pay attention to ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just whether this is a good approach to validate that everything was deployed. But looking it over once more, I don't think it necessarily has to be someone from the solutions team. I would add a second reviewer, but rather for the full PR, not just this file.

@Hugo-Inmanta
Copy link
Contributor Author

@arnaudsjs Marking you as a second reviewer since you already worked on this issue iirc

Networking/SR Linux/ci/Jenkinsfile Show resolved Hide resolved
print(stdout.decode())
print(stderr.decode())

async def check_successful_deploy(file: str, expected_resources: set[str]):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name of this method is not very well chosen. It does more than just checking.

Comment on lines 140 to 153
subprocess.check_call(
"sudo docker logs clab-srlinux-inmanta-server >server.log", shell=True
)
subprocess.check_call(
"sudo docker logs clab-srlinux-postgres >postgres.log", shell=True
)
subprocess.check_call(
"sudo docker exec -i clab-srlinux-inmanta-server sh -c cat /var/log/inmanta/resource-*.log >resource-actions.log",
shell=True,
)
subprocess.check_call(
"sudo docker exec -i clab-srlinux-inmanta-server sh -c cat /var/log/inmanta/agent-*.log >agents.log",
shell=True,
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit weird that this file creates sub-processes using both the blocking and the non-blocking API. I think we can rely on the blocking API only. (Not necessarily a change request)

Comment on lines 140 to 153
subprocess.check_call(
"sudo docker logs clab-srlinux-inmanta-server >server.log", shell=True
)
subprocess.check_call(
"sudo docker logs clab-srlinux-postgres >postgres.log", shell=True
)
subprocess.check_call(
"sudo docker exec -i clab-srlinux-inmanta-server sh -c cat /var/log/inmanta/resource-*.log >resource-actions.log",
shell=True,
)
subprocess.check_call(
"sudo docker exec -i clab-srlinux-inmanta-server sh -c cat /var/log/inmanta/agent-*.log >agents.log",
shell=True,
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can see we only collect these logs if the deployment was successful, which pretty much defeats the purpose of collecting them.

Comment on lines 121 to 130
return set(
[
f"yang::GnmiResource[spine,name=global],v={version}",
f"yang::GnmiResource[leaf2,name=global],v={version}",
f"yang::GnmiResource[leaf1,name=global],v={version}",
f"std::AgentConfig[internal,agentname=spine],v={version}",
f"std::AgentConfig[internal,agentname=leaf2],v={version}",
f"std::AgentConfig[internal,agentname=leaf1],v={version}",
]
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return set(
[
f"yang::GnmiResource[spine,name=global],v={version}",
f"yang::GnmiResource[leaf2,name=global],v={version}",
f"yang::GnmiResource[leaf1,name=global],v={version}",
f"std::AgentConfig[internal,agentname=spine],v={version}",
f"std::AgentConfig[internal,agentname=leaf2],v={version}",
f"std::AgentConfig[internal,agentname=leaf1],v={version}",
]
)
return {
f"yang::GnmiResource[spine,name=global],v={version}",
f"yang::GnmiResource[leaf2,name=global],v={version}",
f"yang::GnmiResource[leaf1,name=global],v={version}",
f"std::AgentConfig[internal,agentname=spine],v={version}",
f"std::AgentConfig[internal,agentname=leaf2],v={version}",
f"std::AgentConfig[internal,agentname=leaf1],v={version}",
}

Comment on lines 103 to 104
async def done_deploying(expected_resources: set[str]) -> bool:
result = await client.resource_list(tid=environment_id, deploy_summary=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
async def done_deploying(expected_resources: set[str]) -> bool:
result = await client.resource_list(tid=environment_id, deploy_summary=True)
async def done_deploying(expected_resources: set[str]) -> bool:
if not expected_resources:
return True
result = await client.resource_list(tid=environment_id, deploy_summary=True)

Comment on lines 62 to 73
process = await asyncio.subprocess.create_subprocess_exec(
*cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE
)
try:
(stdout, stderr) = await asyncio.wait_for(process.communicate(), timeout=30)
except asyncio.TimeoutError as e:
process.kill()
(stdout, stderr) = await process.communicate()
raise e
finally:
print(stdout.decode())
print(stderr.decode())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why first buffer all output in-memory and then only print if when the process finished. This means that we:

  • Don't see the output live in the Jenkins output
  • It's more heavy on memory consumption.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored to use subprocess.check_call instead

print(stderr.decode())

async def done_deploying(expected_resources: set[str]) -> bool:
result = await client.resource_list(tid=environment_id, deploy_summary=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a race condition between the export command and this resource_list() command. The export command releases the version, but this happens asynchronously, so the resource_list() might be executed before the version is released. This way we might be checking an outdated state.

Copy link
Contributor Author

@Hugo-Inmanta Hugo-Inmanta Feb 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be missing something but I think this is currently acceptable since the done_deploying check is performed inside a retry_limited and the expected_resources parameter holds the version we want to check for ? ie

  1. we release version N
  2. we keep checking until resources for version N were correctly deployed

Copy link
Contributor

@arnaudsjs arnaudsjs Feb 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently the done_deploying method does an assert on the expected_resources, which means that it raises an exception in case of failure. The retry_limited doesn't catch this exception so it will crash. If we change this assertion to an if-condition that returns False in case of a mismatch, I agree with the above-mentioned statement.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oof nice catch

@Hugo-Inmanta Hugo-Inmanta merged commit f66765b into master Feb 4, 2025
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

investigate quickstart failures
3 participants