GRPC traffic is not load balanced correctly #142

@deenav

Description

While running a longevity test in a PKS cluster with a moderate-IO (medium-scale) workload, I am observing time lapses in the controller logs whenever there is more than one controller in the cluster.

Environment details: PKS / K8s, medium cluster:

3 masters: xlarge: 4 CPU, 16 GB RAM, 32 GB disk
5 workers: 2xlarge: 8 CPU, 32 GB RAM, 64 GB disk
Tier-1 storage is from a vSAN datastore
Tier-2 storage is carved out on the NFS client provisioner using Isilon as the backend
Pravega version: zk-closed-client-issue-0.5.0-2161.60655bf
Zookeeper Operator : pravega/zookeeper-operator:0.2.1
Pravega Operator: adrianmo/pravega-operator:issue-134-5

Steps to reproduce

  1. Deploy the Pravega operator and a Pravega cluster with at least 2 controller instances on the PKS cluster (a minimal manifest sketch follows these steps).
  2. Start the longevity test with a moderate-IO (medium-scale) workload and observe that both controllers are serving read/write requests from the longevity run. Use the kubectl logs -f po/<controller-pod-name> command to monitor the controller logs.
  3. Induce a restart of one controller using the kubectl exec -it <controller1-pod-name> reboot command.
  4. After the reboot, once the controller completes its initial configuration stages such as ControllerServiceStarter STARTING, it stops logging anything and does not serve read/write requests from the client.
  5. The restarted pod remains idle until the active controller pod hits some kind of error/event (I tested by triggering a reboot on the active controller), which makes the idle controller resume its services.
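
For reference, a minimal sketch of the cluster spec for step 1, assuming the PravegaCluster CRD shape of the 0.x pravega-operator (controllerReplicas under spec.pravega); the metadata name, ZooKeeper URI, and the segment store replica count are placeholders, not the exact values from this run:

  apiVersion: "pravega.pravega.io/v1alpha1"
  kind: PravegaCluster
  metadata:
    name: pravega                     # placeholder name
  spec:
    zookeeperUri: zk-client:2181      # placeholder ZooKeeper endpoint
    pravega:
      controllerReplicas: 2           # at least 2 controllers to reproduce
      segmentStoreReplicas: 3         # placeholder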

Snippet of the time lapses in the controller log (note the roughly 1.5-hour gap between 10:20:57 and 11:53:37):

2019-03-21 10:20:56,286 3023 [ControllerServiceStarter STARTING] INFO  i.p.c.s.ControllerServiceStarter - Awaiting start of controller event processors
2019-03-21 10:20:56,286 3023 [ControllerServiceStarter STARTING] INFO  i.p.c.s.ControllerServiceStarter - Awaiting start of controller cluster listener
2019-03-21 10:20:57,456 4193 [controllerpool-16] INFO  i.p.c.e.i.EventProcessorGroupImpl - Notifying failure of process 172.24.4.7-e1620132-57fb-4e3f-a74b-4a359aca27c6 participating in reader group EventProcessorGroup[abortStreamReaders]
2019-03-21 10:20:57,456 4193 [controllerpool-19] INFO  i.p.c.e.i.EventProcessorGroupImpl - Notifying failure of process 172.24.4.7-e1620132-57fb-4e3f-a74b-4a359aca27c6 participating in reader group EventProcessorGroup[scaleGroup]
2019-03-21 10:20:57,456 4193 [controllerpool-30] INFO  i.p.c.e.i.EventProcessorGroupImpl - Notifying failure of process 172.24.4.7-e1620132-57fb-4e3f-a74b-4a359aca27c6 participating in reader group EventProcessorGroup[commitStreamReaders]
2019-03-21 11:53:37,489 5564226 [controllerpool-18] INFO  i.p.c.cluster.zkImpl.ClusterZKImpl - Node 172.24.4.6:9090:eff7c8dd-8794-4dc4-8dc0-c361cb2b2eb4 removed from cluster
2019-03-21 11:53:37,492 5564229 [controllerpool-18] INFO  i.p.c.f.ControllerClusterListener - Received controller cluster event: HOST_REMOVED for host: 172.24.4.6:9090:eff7c8dd-8794-4dc4-8dc0-c361cb2b2eb4
2019-03-21 11:53:37,494 5564231 [Curator-LeaderSelector-0] INFO  i.p.c.fault.SegmentMonitorLeader - Obtained leadership to monitor the Host to Segment Container Mapping

Problem location

My observation is that load balancing does not happen properly when a restarted/failed controller resumes operation while there is more than one controller in the cluster.
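
One plausible mechanism, consistent with the behavior above: a standard Kubernetes ClusterIP Service balances at TCP-connection granularity, while gRPC multiplexes all calls over a single long-lived HTTP/2 connection, so established clients stay pinned to the surviving controller and the restarted pod receives no traffic until that pinned connection breaks. A minimal sketch of one common mitigation for this class of problem, assuming clients can resolve every controller pod via DNS and apply client-side round-robin; the Service name and selector labels here are hypothetical, not taken from the operator:

  apiVersion: v1
  kind: Service
  metadata:
    name: pravega-controller-headless   # hypothetical name
  spec:
    clusterIP: None                     # headless: DNS returns every controller pod IP
    selector:
      app: pravega-cluster              # hypothetical selector labels
      component: pravega-controller
    ports:
      - name: grpc
        port: 9090                      # controller gRPC port, per the log above
        targetPort: 9090

Whether the Pravega client stack can be pointed at such a target with a round-robin load-balancing policy is part of what needs investigation here.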

Labels

kind/enhancement: Enhancement of an existing feature
priority/P2: Slight inconvenience or annoyance to applications, system continues to function
status/needs-investigation: Further investigation is required
