Proposal for accommodating DTR changes in UCP 1.1.1 #212
Comments
@rkharya From what I understand, the image repo consistency is left to the customer. Docker does not replicate images. The below points to using NFS or cloud storage for image repo HA.
@vvb, thanks for outlining this. Before jumping to my comments on the proposal, I thought I would clarify my understanding of a few use-cases for setting up a DTR in the cluster:
I have a few comments, but I think this proposal might also need a few tweaks based on some of the use cases above.
While this may make sense in a cloud environment, in a bare-metal deployment I am wondering whether this separation really needs to be physical. Won't we be wasting a lot of capacity by doing this and adding to the overall cost? Instead, should we look into carving out the right amount of resources on the physical host itself, e.g. setting up cgroups for the UCP and DTR containers? BTW this applies to all infra services (not just UCP and DTR) that run in a cluster, i.e. they need to be given enough resources and protection from each other and from user apps. If we carve out resources correctly, I think we will be able to get rid of some of the special handling for DTR nodes as proposed.
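For illustration, a minimal sketch of what such per-container resource carving could look like with plain Docker flags. The container name, image, and limit values below are made up; UCP/DTR containers are created by their own installers, so in practice the limits would have to be applied through the installer or host-level cgroup configuration.

```bash
# Hypothetical sketch: cap CPU weight and memory for an infra-service
# container so it cannot starve co-located services.
docker run -d --name infra-svc \
  --cpu-shares 512 \
  --memory 2g --memory-swap 2g \
  nginx:alpine

# Limits can also be adjusted on an already-running container:
docker update --cpu-shares 256 --memory 1g --memory-swap 1g infra-svc
```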
I see a few concerns trying to use volplugin, but correct me if I am wrong:
I am not sure we really need a bootstrap node for DTR; looking at the DTR installation steps, it looks like all it needs is for UCP to be up. In general, bootstrap nodes present a huge obstacle for multi-node cluster bootstrap, needing us to do tricks like we do for UCP. We should avoid having a single bootstrap node if possible, but it might not always be in our hands ;) .
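For reference, a rough sketch of the DTR install invocation, which only needs a reachable, already-running UCP rather than a dedicated bootstrap node. The URL, node name, and credentials are placeholders, and the flags follow the DTR 2.x docs as I recall them; verify against the DTR version actually in use.

```bash
# Placeholder values; DTR only needs to reach an already-running UCP.
docker run -it --rm docker/dtr install \
  --ucp-url https://ucp.example.com:443 \
  --ucp-node worker-node-1 \
  --ucp-username admin \
  --ucp-password '<password>' \
  --ucp-insecure-tls
```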
The HA model for our cluster has been VIP based, provided by UCARP (i.e. active/standby like), but we can possibly replace it with haproxy or equivalent. Also, I think for docker containers, docker would use its built-in load balancing (coming soon).
@mapuri thanks for the comments.. all your assumptions seem correct to me.
I will look into how we can carve out CPU separation within the same host. If this is done, then we would not need much change from the existing design. The only unknown that I see is performance. Do you remember the scenario where the time taken to launch a container was (and I guess still is) really high (about 2 mins per container), just from the mere existence of 2 UCP replicas? There is a lot of east-west traffic between them and hence high CPU usage for etcd. We need to see if DTR behaves the same way; in that case we might double the east-west traffic on the 3 master nodes. So this needs some experimentation. On the load balancer, let us continue using ucarp for now..
One more thing to consider is that both UCP and DTR have web UIs that use ports 80/443 by default. So, if DTR runs on the same node as UCP, it will need a different incoming port. Would a user generally be open to that? For example, …
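Not the example originally referenced (that part is missing), but a hedged sketch of what co-locating would require from the DTR installer, assuming it exposes replica port flags as in the DTR 2.x docs (flag names may differ between releases; URL and credentials are placeholders):

```bash
# Assumed flags; moves the DTR UI off 80/443 so it does not clash with UCP.
docker run -it --rm docker/dtr install \
  --ucp-url https://ucp.example.com:443 \
  --ucp-node master-node-1 \
  --ucp-username admin --ucp-password '<password>' \
  --replica-http-port 8080 \
  --replica-https-port 8443
```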
DTR also supports an upgrade command. We need to discuss the general guideline for handling the upgrade scenario: if a user has DTR version x installed and running, how do we go to DTR version x+1? A user can manually do that on every node. Putting it out here to discuss whether this should be a functionality in clusterctl. More like maintenance mode, but without bringing any other services down; only a single service (with no dependencies) gets upgraded.
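For context, a rough sketch of what a single upgrade invocation might look like, so the open question is mostly how clusterctl orchestrates it per node. The version and credentials are placeholders, and the docker/dtr bootstrapper's exact flags should be checked against the specific DTR release.

```bash
# Placeholder version and credentials; flags depend on the DTR release.
docker pull docker/dtr:<new-version>
docker run -it --rm docker/dtr:<new-version> upgrade \
  --ucp-url https://ucp.example.com:443 \
  --ucp-username admin \
  --ucp-password '<password>'
```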
@vvb, these are good considerations. In general I like to think about what cluster-management offers, i.e. does it take care of installing and managing infra services, or does it do some and leave some? I think it is the former; however, where applicable we may need to support easier integration with services that the user might need to bring in. And DTR seems to be one of those services. W.r.t. separate UIs specifically, I think it is more of a problem for UCP if it doesn't provide access to DTR's UI through a unified UCP UI. I feel UCP would present a consistent UI, but I would defer that to their good judgement and business call.
I think for us there are two cases:
I am not sure port conflicts are a good reason to leave idle capacity in the cluster or to provision special nodes for each such service. But this might be just me.
Yes, this is where the maintenance workflow of clustermgr can be used. Right now it is a no-op in cluster-manager. If one is OK taking a node temporarily out of the cluster for maintenance, then a simple workflow could be cleanup followed by provision, which will do this task of service upgrade. But I can see this might not always be desirable. If a specific service needs to be upgraded, then it will involve a bit of ansible design as well, like tagging each service and just running cleanup/provision for that tag. But this needs to be thought through more. Maybe track it as an issue?
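As a sketch of the tag-based approach: the playbook name, tag, inventory path, and variable below are hypothetical (contiv's ansible layout and tag names may differ), but the mechanics of re-running only the tasks for one service would look roughly like this.

```bash
# Re-run only the dtr-tagged tasks against a single host.
# 'site.yml', the 'dtr' tag, inventory path and dtr_version are illustrative.
ansible-playbook -i inventory/hosts site.yml \
  --limit worker-node-1 \
  --tags "dtr" \
  --extra-vars "dtr_version=<new-version>"
```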
We could probably do something about that if the registry ran… on…
@mapuri @rkharya @jainvipin @erikh There are a few things that have changed since we discussed this last.. DTR can only be installed on a UCP worker node now. Docker … We will need to rethink the design now. At present we make every … So, we should now let the user define … Also, …
I have let Uday know of this failure. |
@vvb, thanks for looking into this
Hmm, this is interesting. It certainly makes configuration harder. Do you know the reason for this restriction? Maybe we are better off understanding if this is interim or permanent behavior from Docker's perspective, as it sounds to go against the shared infrastructure philosophy.
Can you explain a bit more, …
Hmmm, this kind of makes the … Instead, not ideal, but maybe we could just solve this by allowing selective placement of master processes.
This seems to be the way Docker is going. I have verified from the extensive debug logs that they are indeed specifically keeping track of all the UCP controller/replica nodes and adding a constraint not to use them for DTR. There is an extra check now which checks if the node is suitable for DTR installation. I am not sure if there were any internal reasons for Docker to separate them physically.
Yes
I need to think more about this - in the end it needs to be simple for the user to specify …, and so I was thinking that a flag in extra_vars was probably simplest, and we could do all the checks internally in … But yes, I agree the same is possible via the host-capability strings too, where we would run … Would you agree with the below statement: … which is required to run DTR when … As far as ensuring exclusion goes, Docker itself fails the operation if DTR/UCP controllers/replicas overlap. So, the exclusion check can come later too.
@vvb , thanks for clarifying.
I agree that if we map DTR to the master role, we could do what you mentioned here, i.e. deal with selectively placing dtr and ucp. Alternatively, reading through the DTR installation docs (https://docs.docker.com/docker-trusted-registry/install/install-dtr/), it looks like they don't have an agent (or worker) mode for DTR, i.e. it's just controller and replicas. So maybe it will just be simpler to always install DTR only on worker nodes, but to limit the set of workers by using a boolean flag or host capability variable. From a user workflow perspective, they would just mark a node (at time of commission) to be a dtr node.
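A rough sketch of that user workflow. The flags, host-group name, and the dtr_replica variable are assumptions for illustration, not necessarily the current clusterctl interface:

```bash
# Illustrative only: mark a worker as a DTR node at commission time.
# Flag names and the dtr_replica variable are assumptions.
clusterctl node commission worker-node-1 \
  --host-group=service-worker \
  --extra-vars='{"dtr_replica": true}'
```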
Yes only DTR Controller and replicas. No DTR agent.
So, from what I understand, we need to know the following -
So, we would need one of the below: (1) …, OR (2) …, OR (3) …
@vvb, I think let's go with …
UCP 1.1.1 tightly binds DTR with UCP. DTR uses UCP for certificates/auth and runs as containers on UCP worker nodes. These are my initial thoughts on supporting it for contiv.

- volplugin based network storage (ceph/nfs): DTR can then run on blade servers as well. This would mean tying DTR with volplugin and enforcing that volplugin services come up before DTR.
- service-master host-group for installing DTR.
- volplugin configured based on what sort of storage we want for the image repository. contiv_network should not be required.
- --replica mode by default for non-bootstrap master nodes (when scheduler = ucp). This should not be done when the node is a dtr_bootstrap_node or a dtr_replica. UCP replicas and DTR replicas will be separate.
- host_capability string to make sure that some roles (contiv_network) are not run.
- dtr_bootstrap_node_name - first node that should be brought up with DTR functionality.
- dtr_replica=True - from there on, a user can pass the dtr_replica flag to control whether a new node should be a DTR replica or not.

/cc @rkharya @mapuri @jainvipin
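For illustration, a hedged sketch of how these variables might be surfaced through ansible extra vars. The playbook name, inventory path, and node name are placeholders; only the dtr_bootstrap_node_name and dtr_replica variable names come from the proposal above.

```bash
# Sketch only: site.yml and the inventory path are placeholders; the
# variables are the ones proposed above.
ansible-playbook -i inventory/hosts site.yml \
  --extra-vars '{"dtr_bootstrap_node_name": "worker-node-1", "dtr_replica": true}'
```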