Proposal for accommodating DTR changes in UCP 1.1.1 #212

Open
vvb opened this issue May 31, 2016 · 14 comments
@vvb
Contributor

vvb commented May 31, 2016

UCP 1.1.1 tightly binds DTR with UCP. DTR uses UCP for certificates/auth and runs as containers on UCP worker nodes. These are my initial thoughts on supporting it in contiv.


  • Presently we have a “service-master” host-group that installs UCP in master mode. We recommend using 3 service-master nodes, which form the UCP HA cluster.
  • It is recommended that DTR and its replicas be installed on new nodes and not on one of the UCP controller/replica nodes.
  • B-Series considerations:
    • Should DTR run on blade servers or rack servers?
      • If we are going to use local storage for the DTR image repository, then it should run on a rack server, as we could have more memory on it.
      • If we are going to use volplugin-based network storage (ceph/NFS), then it can run on blade servers as well. This would mean tying DTR to volplugin and enforcing that volplugin services come up before DTR.
  • ansible considerations:
    • we can use the existing service-master host-group for installing DTR.
    • DTR needs base, ucarp(?), docker, scheduler_stack (UCP). It might need volplugin depending on what sort of storage we want for the image repository. contiv_network should not be required.
      • The UCP install will need a change; today on a service-master, we bring up UCP in --replica mode by default for non-bootstrap master nodes (when scheduler = ucp). This should not be done when the node is a dtr_bootstrap_node or a dtr_replica. UCP replicas and DTR replicas will be separate.
      • We might need a new host_capability string to make sure that some roles (contiv_network) are not run.
    • dtr_bootstrap_node_name - the first node that should be brought up with DTR functionality
    • dtr_replica=True - thereafter a user can pass the dtr_replica flag to control whether a new node should be a DTR replica or not (see the sketch after this list).
      • This is a slightly different approach from how we bring up UCP replicas, where we assume that every non-bootstrap master node will be a UCP replica. But since UCP and DTR nodes should not overlap, we should let the user choose which node should have which behaviour.
  • Load balancers for DTR:
    • TBD!
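A rough sketch of the extra-vars this proposal implies; the variable names follow the bullets above and are illustrative rather than final:

# illustrative extra-vars for the proposal above; node names are placeholders
dtr_bootstrap_node_name: "service-node-4"   # first node brought up with DTR
dtr_replica: true                           # set on each node that should become a DTR replica
host_capability: "dtr"                      # hypothetical capability string so roles like contiv_network are skipped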

/cc @rkharya @mapuri @jainvipin

@rkharya

rkharya commented May 31, 2016

  • Regarding storage for DTR replicas - if storage is local and not shared between the replicas, who takes care of image repo consistency? I do not see any reference to this in Docker's DTR storage configuration resources. We may need to get this clarified with Docker if we want to use local storage on C-series servers for DTR replicas.

@vvb
Contributor Author

vvb commented May 31, 2016

@rkharya From what I understand, the image repo consistency is left to the customer. Docker does not replicate images. The excerpt below points to using NFS or cloud storage for image repo HA (a rough config sketch follows it).

By default, Docker Trusted Registry stores images on the filesystem of the host where it is running.

You can also configure DTR to use these cloud storage backends:

Amazon S3
OpenStack Swift
Microsoft Azure
For highly available installations, configure DTR to use a cloud storage backend or a network filesystem like NFS.
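For reference, a minimal sketch of the two storage layouts being compared, written as a Docker Registry style storage section (assuming DTR's storage settings map onto the standard registry drivers; paths and bucket names are placeholders):

# shared filesystem, e.g. an NFS export mounted at the same path on every DTR replica
storage:
  filesystem:
    rootdirectory: /var/local/dtr/image-storage

# or a cloud backend such as S3 (credentials omitted)
# storage:
#   s3:
#     region: us-east-1
#     bucket: dtr-images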

@mapuri
Contributor

mapuri commented May 31, 2016

@vvb , thanks for outlining this.

Before jumping to my comments on the proposal, I thought I would clarify my understanding of a few use-cases for setting up a DTR in the cluster:

  • Is this DTR going to be used only for user container images or for storing infra service containers as well (like etcd, volplugin, netplugin etc)? I assume it is just the former. If so, is it fine for infra service images to come from docker-hub or yet another DTR? Again, I am assuming the answer is yes.
  • How are the images pushed to this DTR? Is it only from within the UCP cluster, i.e. we expect the users to build the images and push them to this DTR before being able to run their containers? I am assuming the answer is yes.
  • Do we need to support a public DTR like docker-hub? I am assuming the answer is yes, for infra services at least.
  • Do we need to support a pre-existing on-premise DTR from a different cluster? I am assuming the answer is yes, and we may need to handle this case.
  • Would DTR work correctly by itself if we were to set up multiple instances with local storage? I am assuming not, without user intervention for every built image. If not, then it might mandate having NFS set up. I don't think we can use ceph (see my comments below).

I have a few comments, but I think this proposal might also need a few tweaks based on some of the use cases above.

It is recommended that DTR and its replicas be installed on new nodes and not on nodes which are UCP controller replicas

While this may make sense in a cloud environment, in a baremetal deployment I wonder if this separation really needs to be physical. Won't we be wasting a lot of capacity by doing this and adding to the overall cost? Instead, should we look into carving out the right amount of resources on the physical host itself, like setting up cgroups etc. for the UCP and DTR containers? BTW, this applies to all infra services (and not just UCP and DTR) that run in a cluster, i.e. they need to be given enough resources and protection from each other and from user apps.

If we carve out resources correctly, I think we will be able to get rid of some of the special handling for DTR nodes as proposed.
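As an illustration of the carving-out idea, standard docker resource flags already give cgroup-level caps; the image name and limits below are placeholders, not the real UCP/DTR invocations:

# illustrative only: cap an infra container's memory and CPU weight so that
# UCP and DTR containers can share a physical node without starving each other
- name: start an infra service with cgroup limits
  shell: >
    docker run -d --name infra-svc
    --memory 8g --cpu-shares 512
    example/infra-service:latest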

If we are going to use volplugin-based network storage (ceph/NFS), then it can run on blade servers as well.

I see a few concerns trying to use volplugin, but correct me if I am wrong:

  • One general issue I see with using infra services like volplugin, netplugin etc. to provide for other infra services is that it can lead us to a chicken-and-egg situation, especially if we are going to pull the infra containers from the DTR as well. This is highly unlikely, but putting it here just so we can eliminate this possibility.
  • ceph might not be a possible storage solution for DTR as we can't mount the same volume across hosts. But NFS does look promising; however, AFAIK a lot of the work for setting up NFS and creating NFS volumes happens outside volplugin, i.e. volplugin will only help with mounting the volume into the container (see the sketch after this list). So really I don't see much value add at the moment for this added complexity.
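To illustrate the point about the NFS legwork living outside volplugin, something like the following would still be needed on each DTR node; the server name and paths are placeholders:

# illustrative: mount a shared NFS export for the DTR image store,
# which is work that happens outside volplugin today
- name: mount the shared DTR image store
  shell: mount -t nfs nfs-server:/exports/dtr /var/local/dtr/image-storage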

dtr_bootstrap_node_name - First node that should be brought up with DTR functionality

I am not sure if we really need a bootstrap node for DTR; looking at the DTR installation steps, it looks like all it needs is UCP to be up. In general, bootstrap nodes present a huge obstacle for multi-node cluster bootstrap, needing us to do tricks like we do for UCP. We should avoid having a single bootstrap node if possible, but it might not always be in our hands ;).

Load balancers for DTR:

The HA model for our cluster has been VIP based, provided by UCARP (i.e. active/standby-like), but we can possibly replace it with haproxy or equivalent. Also, I think for docker containers, docker would use its built-in load balancing (coming soon).

@vvb
Contributor Author

vvb commented May 31, 2016

@mapuri thanks for the comments, all your assumptions seem correct to me.
The above proposal was based on Uday mentioning that DTR should be on a separate node.

Instead, should we look into carving out the right amount of resources on the physical host itself, like setting up cgroups etc. for the UCP and DTR containers?

I will look into how we can carve out CPU separation within the same host. If this is done, then we would not need much change from the existing design. The only unknown that I see is performance. Do you remember the scenario where the time taken to launch a container was (and I guess still is) really high (about 2 mins per container), just from the existence of 2 UCP replicas? There was a lot of east-west traffic between them and hence high CPU usage for etcd. We need to see if DTR behaves similarly; in that case we might double the east-west traffic on the 3 master nodes. So this needs some experimentation.

On the load balancer, let us continue using ucarp for now.

@vvb
Contributor Author

vvb commented Jun 3, 2016

One more thing to consider is that both UCP and DTR have web UIs that use ports 80/443 by default. So, if DTR runs on the same node as UCP, it will need a different incoming port. Would a user generally be open to that? For example, https://dtr:9002.

@vvb
Contributor Author

vvb commented Jun 3, 2016

DTR also supports an upgrade command,
docker/dtr install
docker/dtr upgrade

We need to discuss the general guideline for handling the upgrade scenario: if a user has DTR version x installed and running, how do we go to DTR version x+1? A user can do that manually on every node. Putting it out here to discuss whether this should be functionality in clusterctl. More like maintenance mode, but without bringing any other services down; only a single service (with no dependencies) gets upgraded.
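A rough sketch of what that per-node step could look like in ansible; dtr_version and dtr_upgrade_args are hypothetical variables, and docker/dtr upgrade would need its arguments supplied non-interactively:

# sketch only: run the new version's bootstrapper in upgrade mode on one node
- name: upgrade DTR on this node
  shell: "docker run --rm docker/dtr:{{ dtr_version }} upgrade {{ dtr_upgrade_args }}"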

@mapuri
Contributor

mapuri commented Jun 3, 2016

@vvb, these are good considerations. In general I like to think about what cluster-management offers, i.e. does it take care of installing and managing all infra services vs. doing some and leaving some? I think it is the former; however, where applicable we may need to support easier integration with services that the user might need to bring in. And DTR seems to be one of those services.

wrt separate UIs specifically, I think it is more of a problem for UCP if it doesn't provide access to DTR's UI through a unified UCP UI. I feel UCP would present a consistent UI, but I would defer that to their good judgement and business call.

Would a user be generally open to that?

I think for us there are two cases:

  • The user has their own DTR and wants to use it with the cluster; in that case maybe it is better to have the UCP ports configurable (which they are today), so that the user gets to control them. Also, in this scenario the user can choose to run DTR on a separate machine or in a different cluster.
  • cluster-management takes care of setting up DTR and UCP. In this case we can take care of the port conflicts, and again, by exposing each port as configurable we keep it in the user's control.

I am not sure if port conflicts are a good reason to leave idle capacity in the cluster or to provision special nodes for each such service. But this might be just me.

We need to discuss the general guideline to handle the upgrade scenario, if a user has DTR version x installed and running, how do we go to DTR version x+1.

Yes, this is where the maintenance workflow of clustermgr can be used. Right now it is a no-op in cluster-manager. If one is OK taking a node temporarily out of the cluster for maintenance, then a simple workflow could be cleanup followed by provision, which would do this task of service upgrade. But I can see this might not always be desirable.

If a specific service needs to be upgraded, then it will involve a bit of ansible design as well, like tagging each service and just running cleanup/provision for that tag. But this needs to be thought through more. Maybe track it as an issue?
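One possible shape for the tag-based approach; the role and tag names below are hypothetical:

# site.yml: tag each service role so it can be provisioned in isolation
- hosts: service-master
  roles:
    - { role: ucp, tags: ['ucp'] }
    - { role: dtr, tags: ['dtr'] }

# a single-service upgrade would then be, for example:
#   ansible-playbook site.yml --tags dtr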

@erikh
Contributor

erikh commented Jun 20, 2016

We could probably do something about that if the registry ran... on contiv storage. :D


@vvb
Contributor Author

vvb commented Jul 13, 2016

@mapuri @rkharya @jainvipin @erikh There are a few things that have changed since we discussed this last.

DTR can only be installed on a UCP worker node now. Docker blocks installation of DTR on any UCP controller/replica nodes.

We will need to rethink the design now:

At present we make every service-master a UCP controller/replica. So only service-worker nodes are UCP worker nodes.
Having an infra service run as service-worker does not fit very well into the model we want to have.

So, we should now let the user define dtr_controller_node and dtr_replica_node booleans in the service-master flow itself. Based on these, we will run UCP in worker mode on those nodes and install DTR on them.
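A sketch of what that could look like per node; the variable names are as proposed above and the node names are placeholders:

# host_vars/mgmt-node-4.yml (hypothetical): UCP worker + DTR controller
dtr_controller_node: true

# host_vars/mgmt-node-5.yml (hypothetical): UCP worker + DTR replica
dtr_replica_node: true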

Also,
In parallel, I tried the DTR workflow and have the ansible bits partly ready.
I see some issues with the DTR install in the final stages, where it tries to create an internal overlay network for DTR communication.

ERRO[0003] Couldn't attach phase2 container to 2040230f54f575263f604ba7c6c17cb33076c5c25a33c6d9c638357ed7e27677: Error response from daemon: subnet sandbox join failed for "10.1.0.0/24": vxlan interface creation failed for subnet "10.1.0.0/24": failed to set link up: address already in use
FATA[0003] Error response from daemon: subnet sandbox join failed for "10.1.0.0/24": vxlan interface creation failed for subnet "10.1.0.0/24": failed to set link up: address already in use

I have let Uday know of this failure.
I remember seeing this error earlier as well, while running docker overlay timing tests, when I tried to create a docker overlay network (not contiv overlay).

@mapuri
Contributor

mapuri commented Jul 13, 2016

@vvb, thanks for looking into this

DTR can only be installed on a UCP worker node now. Docker blocks installation of DTR on any UCP controller/replica nodes.

hmm, this is interesting. It certainly makes configuration harder. Do you know why this restriction exists? Maybe we are better off understanding whether this is interim or permanent behavior from Docker's perspective, as it seems to go against the shared infrastructure philosophy.

Having an infra service run as service-worker does not fit very well into the model we want to have.

can you explain a bit more? service-worker nodes do run infra services, but just their agent parts. I agree dtr components won't fit the service-master role as is, but if my understanding based on your comment is correct, from a service-worker perspective it looks like ucp and dtr agents can co-exist?

So, we should now let user define dtr_controller_node and dtr_replica_node booleans in the service-master flow itself. Based on these we will run UCP in worker mode on those nodes and install DTR on them.

Hmmm, this kind of makes the service-master role complex (in some cases a master can run worker processes!)

Instead, not ideal, but maybe we could just solve this by allowing selective placement of master processes.
Say, instead of booleans, maybe we could use the pre-existing node-capabilities variable to let the user select the nodes where ucp and dtr controllers/replicas are placed. And optionally we can make the dtr-controller and ucp-controller capabilities mutually exclusive as part of some basic checks. wdyt?

@vvb
Contributor Author

vvb commented Jul 13, 2016

hmm, this is interesting. It certainly makes configuration harder. Do you know why this restriction exists? Maybe we are better off understanding whether this is interim or permanent behavior from Docker's perspective, as it seems to go against the shared infrastructure philosophy.

This seems to be the way docker is going. I have seen from the extensive debug logs that they are indeed specifically keeping track of all the UCP controller/replica nodes and adding a constraint not to use them for DTR. There is an extra check now, which checks whether the node is suitable for DTR installation. I am not sure if there were any internal reasons for docker to separate them physically.

DEBU[0001] node constraints: [ucsb-blade4 ucsb-blade2 ucsb-blade3]

can you explain a bit more, service-worker nodes do run infra services, but just their agent parts. I agree dtr components won't fit service-master role as is but if my understanding based on your comment is correct, from service-worker perspective looks like ucp and dtr agents can co-exist?

Yes, service-worker nodes just run the agent parts as of today. The DTR controller and replicas fit more into the service-master type. A DTR node also needs to be a UCP worker node (as it stands today). So there is a conflict: the DTR controller/replicas should ideally be on service-master nodes, but today our ansible design of making every service-master a UCP controller/replica restricts that.

Instead, maybe we could just solve this by allowing selective placement of master processes.
Say, instead of booleans, maybe we could use the pre-existing node-capabilities variable to let the user select the nodes where ucp and dtr controllers/replicas are placed. And optionally we can make the dtr-controller and ucp-controller capabilities mutually exclusive as part of some basic checks. wdyt?

I need to think more about this - in the end it needs to be simple for the user to specify, so I was thinking that a flag in extra_vars was probably simplest, and we could do all the checks internally in ucp.j2 and dtr.j2 to enforce mutual exclusion.

But yes, I agree the same is possible via the host-capability strings too, where we would run the dtr role only when the host-capability specifies dtr_controller/dtr_replica.

Would you agree with the below statement, which is required to run DTR when runas == master?
ucp-swarm needs to run in join mode when the capability matches dtr-controller/dtr-replica, even when runas == master

As far as ensuring exclusion goes, Docker itself fails the operation if the DTR/UCP controllers/replicas overlap. So the exclusion check can come later too.
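A rough sketch of that condition; run_as and host_capability are illustrative variable names, not necessarily the exact ones in contiv-ansible, and the ucp join connection arguments are omitted:

# join UCP as a worker instead of a controller on DTR-designated master nodes
- name: run UCP in join (worker) mode on DTR nodes
  shell: docker run --rm docker/ucp join
  when:
    - run_as == "master"
    - "'dtr-controller' in host_capability or 'dtr-replica' in host_capability"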

@mapuri
Contributor

mapuri commented Jul 13, 2016

@vvb , thanks for clarifying.

which is required to run DTR when runas == master?
ucp-swarm needs to run in join mode when capability matches dtr-controller/dtr-replica even when runas == master

I agree that if we map DTR to the master role, we could do what you mentioned here, i.e. deal with selectively placing dtr and ucp.

Alternatively, reading through the DTR installation docs (https://docs.docker.com/docker-trusted-registry/install/install-dtr/), it looks like they don't have an agent (or worker) mode for DTR, i.e. it's just a controller and replicas.

So maybe it will just be simpler to always install DTR on worker nodes, but to limit the number of such workers by using a boolean flag or a host capability variable. From a user workflow perspective, they will just mark a node (at the time of commission) to be a dtr node.

@vvb
Contributor Author

vvb commented Jul 14, 2016

Alternatively, reading through the DTR installation docs (https://docs.docker.com/docker-trusted-registry/install/install-dtr/), it looks like they don't have an agent (or worker) mode for DTR, i.e. it's just a controller and replicas.

Yes only DTR Controller and replicas. No DTR agent.

So maybe it will just be simpler to always install DTR on worker nodes, but to limit the number of such workers by using a boolean flag or a host capability variable. From a user workflow perspective, they will just mark a node (at the time of commission) to be a dtr node.

So, from what I understand, we need to know the following -

  1. Which service-worker node should be dtr_controller
  2. Which service-worker nodes should be replicas.

So, we would need one of the below:

(1) is_dtr_controller - boolean - needs to be set for the first worker node to bring it up as the dtr_controller
is_dtr_replica - boolean - needs to be set for subsequent worker nodes that should be replicas
Both flags are mutually exclusive, with is_dtr_controller taking precedence.
(or a host-capability version of the same)

OR

(2) dtr_bootstrap_node_name - string - the worker node_name for the DTR controller node
is_dtr_replica - boolean - whether the worker node should be a dtr_replica
(a sketch of this option follows the list)

OR

(3) is_dtr_node - boolean - keep track of DTR nodes internally in etcd state, which should track how many DTR nodes are already scheduled. If it is the first node, then it should come up as the controller, otherwise as a replica.
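For option (2), the inventory/extra-vars could look roughly like this; the names follow the option text and the node names are placeholders:

# global: the worker node that runs the DTR controller install
dtr_bootstrap_node_name: "worker-node-1"

# host_vars/worker-node-2.yml (and any other replica node)
is_dtr_replica: true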

@mapuri
Contributor

mapuri commented Jul 14, 2016

@vvb , I think 1 and 2 are similar in that they both give a way to identify the dtr controller and then the dtr nodes among all workers. 2 is closer to what we do for UCP, so it might be preferable for consistency between plays. 3 requires a dependency on etcd and logic to populate/maintain state there, which might be a tad over-engineered for this imo :)

let's go with 2? Maybe we can use host capabilities instead of a bool, but up to you.

@mapuri mapuri added this to the 0.2 milestone Jul 14, 2016