Envoy core changes for reverse connections #37368

Open: wants to merge 13 commits into base: main

@basundhara-c basundhara-c commented Nov 26, 2024

Commit Message: This commit collates the envoy core changes for reverse connections, described in this github issue. A detailed description of reverse connections concepts and workflows is provided in both the github issue and in the examples section.

Additional Description: This PR involves several working components, that are added as part of the following extensions:

Risk Level:
Testing:
Docs Changes:
Release Notes:
Platform Specific Features:
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Fixes commit #PR or SHA]
[Optional Deprecated:]
[Optional API Considerations:]

Signed-off-by: Basundhara Chakrabarty [email protected]
Co-authored-by: Arun Vasudevan [email protected]
Co-authored-by: Tejas Sangol [email protected]
Co-authored-by: Aditya Jaltade [email protected]

Hi @basundhara-c, welcome and thank you for your contribution.

We will try to review your Pull Request as quickly as possible.

In the meantime, please take a look at the contribution guidelines if you have not done so already.

@basundhara-c basundhara-c force-pushed the reverse_conn_envoy_core branch from 00a138f to e8d32a5 Compare November 26, 2024 20:57
Contributor

@alyssawilk alyssawilk left a comment

I'm really glad you got this working! Tossed in a few high level comments but I'm going to assign a first pass reviewer for the rest

/**
* @return the cluster manager pointer.
*/
virtual Upstream::ClusterManager* getClusterManager() PURE;
Contributor

I don't think cluster manager APIs belong in the dispatcher. I think it's worth finding a cleaner way to get the cluster manager you need.

Member

@basundhara-c, could you pass the singleton ClusterManager() from server when creating the RCThreadLocalRegistry in your extension, so that the RCmanager has the thread_local_cluster() api, and we don't need to use this worker_.dispatcher().set/getClusterManager()?
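
A minimal sketch of the suggestion above, under stated assumptions: the `ClusterManager`, `RCManager`, and `RCThreadLocalRegistry` types below are simplified, hypothetical stand-ins (not Envoy's real classes), and `canInitiate` is an invented helper. It only illustrates handing the cluster manager to the registry at construction, so nothing has to be stashed on the dispatcher.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_set>

// Hypothetical stand-in for Upstream::ClusterManager; not Envoy's real interface.
class ClusterManager {
public:
  void addCluster(const std::string& name) { clusters_.insert(name); }
  bool hasCluster(const std::string& name) const { return clusters_.count(name) > 0; }

private:
  std::unordered_set<std::string> clusters_;
};

// Hypothetical reverse connection manager that receives its dependency explicitly.
class RCManager {
public:
  explicit RCManager(ClusterManager& cluster_manager) : cluster_manager_(cluster_manager) {}

  // Invented helper: a reverse connection can only target a known cluster.
  bool canInitiate(const std::string& cluster) const {
    return cluster_manager_.hasCluster(cluster);
  }

private:
  ClusterManager& cluster_manager_;
};

// Hypothetical per-worker registry, created by the extension once per worker thread.
class RCThreadLocalRegistry {
public:
  explicit RCThreadLocalRegistry(ClusterManager& cluster_manager)
      : rc_manager_(std::make_unique<RCManager>(cluster_manager)) {}

  RCManager& rcManager() { return *rc_manager_; }

private:
  std::unique_ptr<RCManager> rc_manager_;
};
```

With this shape, the code that builds the registry is the only place that needs to know where the cluster manager comes from.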

* Provides filters access to connection handler to save outgoing connections as
* incoming connections for reverse tunnels
*/
virtual void setConnectionHandler(Network::ConnectionHandler* connection_handler) PURE;
Contributor

I don't see what this function has to do with the dispatcher - did you stash this here as a convenient way to get the handler local to your thread? I think you want to look into the thread local state (tls) getters and setters.

@@ -19,6 +21,26 @@
namespace Envoy {
namespace Network {

// The thread local registry.
class LocalRevConnRegistry {
Contributor

I suspect you should be able to do this PR with minimal APIs in core Envoy, keeping the rest in your extension directory.

Author

@alyssawilk The main set of changes supports creating the ReverseConnectionManager and ReverseConnectionHandler thread-locally, implemented using a thread-local registry. Both need to come up right after the workers are created because they are involved in accepting reverse connections from other initiating Envoy instances, even if the local Envoy is not initiating any reverse connections, as long as the local Envoy has the reverse connections bootstrap extension enabled.

So, currently, we check whether the reverse connection bootstrap extension is enabled by checking whether the singleton is present. If so, right after the workers are created, we post to each worker's dispatcher a functor that creates the thread-local registries (creating the ReverseConnectionManager and ReverseConnectionHandler). This thread local registry is then parked with the Connection Handler and is accessed later wherever we need access to the ReverseConnectionManager and ReverseConnectionHandler like here, to initiate reverse conns, here in the reverse conn filter, which accepts reverse conns, and in multiple other places where only the thread local dispatcher is otherwise available to us. Therefore, I have defined abstract classes for LocalRevConnRegistry, RevConnRegistry, etc., so that these two entities can be stored in the form of a ThreadLocalRegistry and accessed later. We briefly touched upon this in this comment and on Slack. In summary:

  1. We have created two thread local entities which get initialized right after the workers are created. They need to be parked somewhere so that they can be accessed later in code paths where only the thread local dispatcher is available. We have parked them with the thread local connection handler and therefore added the abstract classes under Envoy::Network.
  2. If the above is not the best approach, what would you suggest?

Member

This thread local registry is then parked with the Connection Handler and is accessed later wherever we need access to the ReverseConnectionManager and ReverseConnectionHandler like here, to initiate reverse conns, here in the reverse conn filter, which accepts reverse conns, and in multiple other places where only the thread local dispatcher is otherwise available to us

For filters, we can get the slot from the serverFactoryContext of the Server::Configuration::FactoryContext& context during filter chain creation. Could you check whether this avoids the setConnectionHandler API changes?
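
To illustrate the idea of a filter reaching per-thread state through a slot instead of through the dispatcher, here is a simplified sketch. It assumes hypothetical types throughout: `Registry`, `RegistrySlot`, and `ReverseConnFilter` are invented for this example and are not Envoy's thread local storage API. The filter captures the slot when it is created and resolves the calling thread's instance on use.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <unordered_map>
#include <utility>

// Hypothetical per-thread state; stands in for the reverse connection registry.
struct Registry {
  int accepted_connections = 0;
};

// Simplified illustration of a thread-local slot: every thread lazily builds its
// own Registry from the same factory. This is NOT Envoy's ThreadLocal::Slot.
class RegistrySlot {
public:
  using Factory = std::function<std::unique_ptr<Registry>()>;

  explicit RegistrySlot(Factory factory) : factory_(std::move(factory)) {}

  Registry& get() {
    // One map per thread, keyed by slot, so several slots can coexist.
    thread_local std::unordered_map<const RegistrySlot*, std::unique_ptr<Registry>> instances;
    auto& instance = instances[this];
    if (!instance) {
      instance = factory_();
    }
    return *instance;
  }

private:
  Factory factory_;
};

// Hypothetical filter that is handed the slot at creation time.
class ReverseConnFilter {
public:
  explicit ReverseConnFilter(RegistrySlot& slot) : slot_(slot) {}

  void onReverseConnectionAccepted() { ++slot_.get().accepted_connections; }

private:
  RegistrySlot& slot_;
};
```

Because the filter only depends on the slot, no getter or setter on the dispatcher or connection handler is needed in this sketch.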

Author

@botengyao I have implemented this in the last set of commits.

Member

@botengyao botengyao left a comment

Thanks @basundhara-c, very interesting feature!
Here is a first pass to kick off the process.

"Thread local rverse conn registry should not be null.");
}

void ConnectionHandlerImpl::saveUpstreamConnection(Network::ConnectionSocketPtr&& upstream_socket,
Member

@botengyao botengyao Dec 4, 2024

We want to store the upstream socket in a dedicated REVERSE_CLUSTER cluster so that it can be reused when the public Envoy wants to establish connections to the on-prem one. And the reason we also want to save the upstream connection here is just for PING and other handshake processes.

The request flow is like:

  1. on-prem Envoy -> public Envoy
  2. Then the public Envoy stores the downstream socket in the REVERSE_CLUSTER,
  3. and also initializes the listener to do RPING.
  4. Then the service behind the public Envoy can send a request -> normal listener -> filter chain / router -> the stored socket in REVERSE_CLUSTER.

Am I understanding correctly?

Author

@botengyao yes, most of the above is correct, the request flow is as follows:

Onprem -> Cloud envoy

  1. Onprem initiates reverse connections; it creates http connections to cloud envoy and sends a reverse connection initiation request through it.
  2. This request is intercepted by the reverse conn http filter and cloud envoy stores the sockets with the ReverseConnectionHandler. This ReverseConnectionHandler periodically sends RPINGs over all such cached sockets.
  3. On cloud envoy, a REVERSE_CONNECTION cluster type is defined and is used for all requests that need to be sent over a reverse connection. When a request arrives and this REVERSE_CONNECTION cluster is picked for the route, we interface with the ReverseConnectionHandler described above, get the cached socket and send the request over it.
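
The steps above can be sketched roughly as a socket cache that the cloud side fills when a reverse connection arrives and drains when a request is routed to the REVERSE_CONNECTION cluster. This is only an illustration under assumptions: `ReverseConnectionCache`, keying by a node id string, and modelling sockets as plain integers are all hypothetical choices, not the PR's actual ReverseConnectionHandler.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a cached socket; real code would own a socket object.
using SocketFd = int;

class ReverseConnectionCache {
public:
  // Step 2: cloud Envoy stores the socket received from an initiating node.
  void addSocket(const std::string& node_id, SocketFd fd) { sockets_[node_id].push_back(fd); }

  // Step 3: the REVERSE_CONNECTION cluster asks for a cached socket to that node.
  // Returns false when no socket is cached for the node.
  bool takeSocket(const std::string& node_id, SocketFd& out) {
    auto it = sockets_.find(node_id);
    if (it == sockets_.end() || it->second.empty()) {
      return false;
    }
    out = it->second.front();
    it->second.pop_front();
    return true;
  }

  // Number of idle sockets for a node, e.g. the set a periodic RPING would cover.
  std::size_t idleCount(const std::string& node_id) const {
    auto it = sockets_.find(node_id);
    return it == sockets_.end() ? 0 : it->second.size();
  }

private:
  std::unordered_map<std::string, std::deque<SocketFd>> sockets_;
};
```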

@@ -263,6 +278,10 @@ void ConnectionImpl::setDetectedCloseType(DetectedCloseType close_type) {
}

void ConnectionImpl::closeSocket(ConnectionEvent close_type) {
if (connection_reused_ || !ConnectionImpl::ioHandle().isOpen()) {
Member

When connection_reused_ is always true, how do we close the socket?

Author

@basundhara-c basundhara-c Jan 6, 2025

@botengyao apologies for the late reply. In this case we don't, since reverse connections are a very small number of long-lived connections. However, we might be able to mark sockets dead in the listener filter that will be passed these sockets (#37821); I will test it out and update it there.

Member

@botengyao botengyao Jan 7, 2025

We may need a red_button to clean up the resources even when connection_reused_ is true.

Member

@botengyao botengyao Jan 9, 2025

Hmm, the changes in the connection socket with the customized connection reuse check seem hacky. Would it be cleaner to create a dedicated close sequence only for the reused connection? Besides, the close can also be initiated by the peer; what will happen in that case? The socket will not close but the remote is closed, so will we enter an infinite loop when we find we cannot read/write data?
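
One possible shape for such a dedicated close sequence is sketched below. It is not the PR's code and not a proposal from the thread: `FakeSocket`, `Connection`, and `CloseEvent` are hypothetical, and the policy shown (leave the socket open only on a local close of a reused connection, always release it on a remote close) is an assumption made for illustration.

```cpp
#include <cassert>

// Hypothetical close reasons for this sketch.
enum class CloseEvent { LocalClose, RemoteClose };

// Hypothetical socket stand-in that only tracks whether it is still open.
struct FakeSocket {
  bool open = true;
};

class Connection {
public:
  Connection(FakeSocket& socket, bool reused) : socket_(socket), reused_(reused) {}

  void close(CloseEvent event) {
    if (closed_) {
      return; // Closing twice is a no-op.
    }
    closed_ = true;
    if (reused_ && event == CloseEvent::LocalClose) {
      // The socket was handed off for reuse; leave it open for its new owner.
      return;
    }
    // The peer went away or the connection is not reused: release the socket so
    // nothing keeps polling a dead descriptor.
    socket_.open = false;
  }

  // A closed connection must never be read from or written to again.
  bool usable() const { return !closed_ && socket_.open; }

private:
  FakeSocket& socket_;
  const bool reused_;
  bool closed_ = false;
};
```

Under this assumed policy, a peer-initiated close always ends in a released socket, so the read/write path cannot spin on a connection whose remote end is gone.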

Author

@botengyao I am trying to find a cleaner way of running the close sequence.

@alyssawilk alyssawilk added waiting and removed waiting labels Dec 5, 2024
@basundhara-c
Author

@botengyao thanks a lot for the suggestion on obtaining the slot from the context! I am trying that out along with an attempt to move the code in extensions to contrib as much as possible and will be sharing the changes shortly!

…er to the RCManager

2. Deleting unwanted APIs and ReverseConnectionManager and Handler header files

Signed-off-by: Basundhara Chakrabarty <[email protected]>
Signed-off-by: Basundhara Chakrabarty <[email protected]>
Signed-off-by: Basundhara Chakrabarty <[email protected]>
@basundhara-c
Author

@botengyao we have added a bunch of changes according to your suggestion, namely:

  • Removed the getters and setters for connection handler and cluster manager
  • The RCManager and RCHandler are now self contained within contrib/ and not parked with the Dispatcher. The filters and other entities obtain them through TLS getters. This has reduced the envoy core code.
  • Have added changes required to remove references to extensions code as the reverse connections code is now in contrib/
    Will await further suggestions on this PR, thank you in advance!

@alyssawilk
Contributor

@botengyao can you take a look today?
I'm going to remove myself for now just to get it off PR notifier but please add me back once you're done with your pass.

@alyssawilk alyssawilk removed their assignment Jan 7, 2025
Member

@botengyao botengyao left a comment

Thanks for working on this! Here is another pass, and some changes need at least unit tests to improve the coverage.

/wait

envoy/network/listener.h (outdated, resolved)
envoy/network/connection_handler.h (outdated, resolved)
source/common/listener_manager/connection_handler_impl.h (outdated, resolved)
source/common/json/json_loader.cc (resolved)
source/common/listener_manager/connection_handler_impl.cc (outdated, resolved)
config.name()));
}
// Reverse connection listener should not bind to port.
bind_to_port_ = false;
Member

@botengyao botengyao Jan 7, 2025

It would be better to add the invalid config check here, and we can reject it if the config contains bind_to_port.

Author

@botengyao The bind_to_port_ field is set to true by default in shouldBindToPort(config). The idea here is to set bind_to_port_ to false for any listener with a reverse connection config. Are you suggesting that we make it mandatory for bind_to_port to be explicitly set to false in the listener config for reverse connection listener?

Member

I see. I don't see a reverse_connection_listener_config in the v3::Listener proto; is it missing? A dedicated sub-config for reverse connections is needed, and if some conditions are required we can update and validate the config.

Author

The reverse_connection_listener_config was indeed missing, my apologies! Added it in listener.proto

Member

This is an indicator that a test is missing. Could you add a test to cover the config, please?
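
As a rough illustration of the validation and test coverage discussed in this thread, the sketch below rejects a listener that carries a reverse connection config and also explicitly asks to bind to a port. Everything here is hypothetical: `ListenerConfig` is a simplified struct rather than the v3::Listener proto, and `validateListener` and `effectiveBindToPort` are invented names.

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified listener config; not the real v3::Listener proto.
struct ListenerConfig {
  std::string name;
  bool bind_to_port_set = false; // Whether bind_to_port was given explicitly.
  bool bind_to_port = true;      // Explicit value; binding is the default.
  bool has_reverse_connection_config = false;
};

// Returns an error message for an invalid combination, or an empty string.
std::string validateListener(const ListenerConfig& config) {
  if (config.has_reverse_connection_config && config.bind_to_port_set && config.bind_to_port) {
    return "listener " + config.name + ": reverse connection listeners must not bind to a port";
  }
  return "";
}

// Reverse connection listeners never bind; other listeners follow the config.
bool effectiveBindToPort(const ListenerConfig& config) {
  if (config.has_reverse_connection_config) {
    return false;
  }
  return config.bind_to_port_set ? config.bind_to_port : true;
}
```

A unit test for the real change would exercise the same three cases: a reverse connection listener with the field unset, one with it explicitly enabled, and an ordinary listener.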

…eck for reverse conn cluster

Signed-off-by: Basundhara Chakrabarty <[email protected]>
…ptr and some minor nits

Signed-off-by: Basundhara Chakrabarty <[email protected]>
Signed-off-by: Basundhara Chakrabarty <[email protected]>
Member

@botengyao botengyao left a comment

Much better! Thanks for the contribution and here is another pass.

/wait

source/common/upstream/upstream_impl.h (outdated, resolved)

@@ -765,6 +766,7 @@ absl::Status InstanceBase::initializeOrThrow(Network::Address::InstanceConstShar
ASSERT(config_.clusterManager());
xds_manager_ =
std::make_unique<Config::XdsManagerImpl>(*config_.clusterManager(), validation_context_);
listener_manager_->setClusterManagerForWorkers(config_.clusterManager());
Contributor

Looks like there is no setClusterManagerForWorkers() anymore.

Author

Thanks for pointing this out! I fixed some issues caused by missed code, including removing this call and adding the reverse_connection_listener_config field in listener.proto. @botengyao

CC @envoyproxy/runtime-guard-changes: FYI only for changes made to (source/common/runtime/runtime_features.cc).
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @mattklein123
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).

Signed-off-by: Basundhara Chakrabarty <[email protected]>
Member

@botengyao botengyao left a comment

Hi @phlax, do you have any insight into why the CI doesn't work for this PR?
It is showing "Run # Check if the merge commit SHA is not null Merge commit information not found for pull request 37368."

Member

@botengyao botengyao left a comment

Thanks for the dedicated effort! Looks good at a high level except for the runtime guard and the connection close sequence.

/wait

@@ -1000,10 +1000,12 @@ void DownstreamFilterManager::sendLocalReply(
// route refreshment in the response filter chain.
cb->route(nullptr);
}


bool reverse_conn_force_local_reply =
Member

We prefer latching the runtime value to avoid the cost of fetching it every time a local reply is sent.
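
"Latching" here means reading the value once and reusing the stored copy. A minimal sketch, assuming a hypothetical `FilterManager` class and a generic `std::function<bool()>` standing in for the runtime lookup (neither is Envoy's real API):

```cpp
#include <cassert>
#include <functional>

// Hypothetical filter manager that latches a runtime flag at construction.
class FilterManager {
public:
  // runtime_lookup stands in for whatever call reads the runtime feature flag.
  explicit FilterManager(const std::function<bool()>& runtime_lookup)
      : force_local_reply_(runtime_lookup()) {}

  // Hot path: returns the latched copy, so the lookup is not repeated.
  bool forceLocalReply() const { return force_local_reply_; }

private:
  const bool force_local_reply_;
};
```

The trade-off is that a later change to the runtime value is not seen by objects that already latched it.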

// We only prepare a local reply to execute later if we're actively
// invoking filters to avoid re-entrant in filters.
if (state_.filter_call_state_ & FilterCallState::IsDecodingMask) {
if (!reverse_conn_force_local_reply && state_.filter_call_state_ & FilterCallState::IsDecodingMask) {
Member

Actually, a runtime guard has a 6-month deletion policy, and it is usually used to guard a data plane behavior change. I think what you want here is a parameter that tells whether this is going to be a local reply on a reverse connection. Could you point me to where you detect this use case? I assume it is in your HTTP filter extension.

@phlax
Member

phlax commented Jan 15, 2025

@botengyao

This branch has conflicts that must be resolved

@jmarantz
Contributor

/wait (Boteng has unresolved comments)

7 participants