Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scalable async RPC #8048

Merged
merged 3 commits into from
Aug 8, 2024
Merged

Scalable async RPC #8048

merged 3 commits into from
Aug 8, 2024

Conversation

ElysaSrc
Copy link
Contributor

@ElysaSrc ElysaSrc commented Jul 14, 2024

Fixes #6117.

(This PR replaces the closed #7103 PR. Check the log of the previous PR for the reason).

This PR implements the Scalable Async RPC design.

Changes

  • Modify local environment to provide RabbitMQ
  • Implement OSRDyne
    • Implement Kubernetes driver
    • Implement Docker driver
    • Implement RabbitMQ driver
    • Implement logic
  • Modify Core to use RabbitMQ queues
  • Modify Editoast to use RabbitMQ queues

Tests

  • Against Docker
  • Against Kubernetes (we deployed a custom Kubernetes cluster to test it)

@ElysaSrc ElysaSrc requested review from a team as code owners July 14, 2024 20:10
@ElysaSrc ElysaSrc requested a review from Khoyo July 14, 2024 20:10
@ElysaSrc ElysaSrc marked this pull request as draft July 14, 2024 20:10
@codecov-commenter
Copy link

codecov-commenter commented Jul 14, 2024

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 5.31915% with 1691 lines in your changes missing coverage. Please review.

Project coverage is 37.95%. Comparing base (684fe3d) to head (58fbea5).
Report is 3 commits behind head on dev.

Files Patch % Lines
osrdyne/src/pool.rs 0.00% 404 Missing ⚠️
osrdyne/src/drivers/kubernetes.rs 0.00% 221 Missing ⚠️
editoast/src/core/mq_client.rs 0.00% 152 Missing ⚠️
osrdyne/src/drivers/docker.rs 0.00% 150 Missing ⚠️
...re/src/main/java/fr/sncf/osrd/cli/WorkerCommand.kt 0.00% 145 Missing ⚠️
osrdyne/src/queue_controller.rs 0.00% 129 Missing ⚠️
osrdyne/src/target_tracker.rs 0.00% 125 Missing ⚠️
osrdyne/src/target_tracker/actor.rs 0.00% 74 Missing ⚠️
osrdyne/src/api.rs 0.00% 69 Missing ⚠️
osrdyne/src/management_client.rs 50.51% 48 Missing ⚠️
... and 23 more

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@             Coverage Diff              @@
##                dev    #8048      +/-   ##
============================================
- Coverage     38.42%   37.95%   -0.47%     
  Complexity     2163     2163              
============================================
  Files          1242     1243       +1     
  Lines        113599   115008    +1409     
  Branches       3131     3145      +14     
============================================
+ Hits          43648    43652       +4     
- Misses        68107    69512    +1405     
  Partials       1844     1844              
Flag Coverage Δ
core 74.80% <1.35%> (-0.79%) ⬇️
editoast 66.35% <16.54%> (-0.53%) ⬇️
front 16.69% <ø> (ø)
gateway ?
osrdyne 2.71% <3.57%> (?)
railjson_generator 87.49% <ø> (ø)
tests 72.98% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ElysaSrc
Copy link
Contributor Author

This PR is still in draft because we have some rough edges and various fixes. I've opened the new early in order to facilitate early reviews.

Copy link
Contributor

@flomonster flomonster left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't finished my first review. But here are a few comments.

osrdyne/Dockerfile Show resolved Hide resolved
osrdyne/src/drivers/docker.rs Outdated Show resolved Hide resolved
osrdyne/src/drivers/docker.rs Outdated Show resolved Hide resolved
@flomonster
Copy link
Contributor

We don't have check_osrdyne_tests where we run tests, fmt and clippy.
There are currently several clippy warnings.

@woshilapin woshilapin self-assigned this Jul 16, 2024
Copy link
Contributor

@woshilapin woshilapin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, great work. Here is a first iteration of general remarks.

It seems that even if we have an abstract worker driver API, each of the implementations provide a different set of features (I can only have on core worker per infra for Docker, but autoscaling allow for more than one in k8s, I can setup the machine capacity for k8s, but not setup any limit for Docker). It's like implementation details leaks out of this API. Maybe number of workers per infra and capacities of these nodes should be part of the API?

osrdyne/Cargo.toml Show resolved Hide resolved
core/src/main/java/fr/sncf/osrd/cli/WorkerCommand.kt Outdated Show resolved Hide resolved
docker-compose.yml Show resolved Hide resolved
docker/docker-bake.hcl Show resolved Hide resolved
osrdyne/src/config.rs Show resolved Hide resolved
osrdyne/src/drivers/docker.rs Outdated Show resolved Hide resolved
osrdyne/src/drivers/kubernetes.rs Outdated Show resolved Hide resolved
osrdyne/src/main.rs Outdated Show resolved Hide resolved
osrdyne/src/main.rs Outdated Show resolved Hide resolved
osrdyne/src/drivers/mod.rs Outdated Show resolved Hide resolved
@ElysaSrc
Copy link
Contributor Author

It seems that even if we have an abstract worker driver API, each of the implementations provide a different set of features (I can only have on core worker per infra for Docker, but autoscaling allow for more than one in k8s, I can setup the machine capacity for k8s, but not setup any limit for Docker). It's like implementation details leaks out of this API. Maybe number of workers per infra and capacities of these nodes should be part of the API?

I disagree on this : we've considered integrating the scaling of cores in osrdyne but rejected it. It was clearly not easy to determine when go up or down, and it requires taking account metrics that are not easy to handle properly.

From my perspective, it makes way more sense that for a Kubernetes deployment the scaling happens at the orchestrator level and not the applicative level. We will probably, in the near future, add an option to leverage Keda instead of the HPA.

The Kubernetes Driver exposes features of the orchestrator, and for a Docker deployment we can only work with a single host so horizontally scaling the core makes little sense. We could add a driver for docker swarm clusters but they are marginal (and not what we use).

To finish, I'll say that I do not feel like the API is leaking, it just allows to pass options to the driver.

@woshilapin
Copy link
Contributor

woshilapin commented Jul 16, 2024

From my perspective, it makes way more sense that for a Kubernetes deployment the scaling happens at the orchestrator level and not the applicative level.

This sentence did make me understand your point of view. And indeed, I think I do agree. It also made me realize that osrdyne doesn't even handle workers, only workers' pools groups. So let's make that even more obvious by renaming WorkerMetadata to WorkerPoolMetadata WorkerGroupMetadata and maybe the field core_id should be worker_pool_id worker_group_id (if I understand correctly, kubernetes returns the deployment, so one object per worker pool group)?

@Khoyo
Copy link
Contributor

Khoyo commented Jul 16, 2024

To be clear, we should use worker_group, not worker_pool (https://osrd.fr/en/docs/reference/design-docs/scalable-async-rpc/#conceps)

Currently osrdyne manages one worker pool, core, and starts worker groups (eg. core-1). The docker driver does that by directly starting a single worker, but the k8s driver can create autoscaled deployments.

(yes, fields referring to core are inherited from a previous design - core_controller - and should disappear)

@ElysaSrc
Copy link
Contributor Author

From my perspective, it makes way more sense that for a Kubernetes deployment the scaling happens at the orchestrator level and not the applicative level.

This sentence did make me understand your point of view. And indeed, I think I do agree. It also made me realize that osrdyne doesn't even handle workers, only workers' pools. So let's make that even more obvious by renaming WorkerMetadata to WorkerPoolMetadata and maybe the field core_id should be worker_pool_id (if I understand correctly, kubernetes returns the deployment, so one object per worker pool)?

Yes, that's a really good point that @flomonster also raised during its review so I think we're going to modify this. I'll commit something about this later today.

Copy link
Contributor

@leovalais leovalais left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for the most part. Haven't tested yet (might be useful to check the dev setup on macOS where no host networking is available).

Thanks for guiding me through the review and answering my questions :)

editoast/src/core/mod.rs Outdated Show resolved Hide resolved
editoast/src/core/mq_client.rs Outdated Show resolved Hide resolved
editoast/src/core/mq_client.rs Show resolved Hide resolved
editoast/src/core/mq_client.rs Outdated Show resolved Hide resolved
osrdyne/src/main.rs Outdated Show resolved Hide resolved
@Khoyo Khoyo force-pushed the osrd-async branch 5 times, most recently from b6b8366 to 1487784 Compare July 23, 2024 13:08
Copy link
Contributor

@clarani clarani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, not tested ✅

front/public/locales/fr/errors.json Show resolved Hide resolved
Copy link
Contributor

@SharglutDev SharglutDev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm (not tested)

Copy link
Contributor

@Erashin Erashin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. ApiServerCommand to be deleted at a later stage (+ hopefully at some point we'll drop the v1 endpoints).

osrdyne/Dockerfile Outdated Show resolved Hide resolved
Copy link
Contributor

@Khoyo Khoyo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image

Copy link
Contributor

@bloussou bloussou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work !! Tested on the dev environment it works 🎉 !!!

Copy link
Contributor

@flomonster flomonster left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

editoast/src/core/mq_client.rs Show resolved Hide resolved
@ElysaSrc ElysaSrc added this pull request to the merge queue Aug 8, 2024
Merged via the queue into dev with commit 597260b Aug 8, 2024
22 checks passed
@ElysaSrc ElysaSrc deleted the osrd-async branch August 8, 2024 09:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[meta] Scalable async RPC between editoast and core