GRPC traffic is not load balanced correctly #142

@deenav

Description

While running a longevity test in a PKS cluster with a moderate-IO (medium-scale) workload, I am observing time lapses in the controller logs whenever there is more than one controller in the cluster.

Environment details: PKS / K8s, medium cluster:

3 masters: xlarge: 4 CPU, 16 GB RAM, 32 GB disk
5 workers: 2xlarge: 8 CPU, 32 GB RAM, 64 GB disk
Tier-1 storage is from a vSAN datastore
Tier-2 storage is carved out on the NFS client provisioner using Isilon as the backend
Pravega version: zk-closed-client-issue-0.5.0-2161.60655bf
Zookeeper Operator : pravega/zookeeper-operator:0.2.1
Pravega Operator: adrianmo/pravega-operator:issue-134-5

Steps to reproduce

  1. Deploy the Pravega operator and a Pravega cluster with at least 2 controller instances on the PKS cluster (a minimal manifest sketch follows these steps).
  2. Start the longevity test with a moderate-IO (medium-scale) workload and observe that both controllers are serving read/write requests from the longevity run. Use the kubectl logs -f po/<controller-pod-name> command to monitor the controller logs.
  3. Induce a restart of one controller using the kubectl exec -it <controller1-pod-name> reboot command.
  4. After the reboot, once the controller completes its initial configuration stages such as ControllerServiceStarter STARTING, it stops logging anything and does not serve read/write requests from the client.
  5. The restarted pod remains idle until the active controller pod hits some kind of error/event (I tested by triggering a reboot on the active controller), which makes the idle controller resume its services.
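
For reference, a minimal sketch of the cluster spec for step 1, assuming the PravegaCluster CRD shape of the 0.x pravega-operator (controllerReplicas under spec.pravega); the metadata name, ZooKeeper URI, and the segment store replica count are placeholders, not the exact values from this run:

  apiVersion: "pravega.pravega.io/v1alpha1"
  kind: PravegaCluster
  metadata:
    name: pravega                     # placeholder name
  spec:
    zookeeperUri: zk-client:2181      # placeholder ZooKeeper endpoint
    pravega:
      controllerReplicas: 2           # at least 2 controllers to reproduce
      segmentStoreReplicas: 3         # placeholder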

Snippet of the time lapses in the controller log (note the roughly 1.5-hour gap between 10:20:57 and 11:53:37):

2019-03-21 10:20:56,286 3023 [ControllerServiceStarter STARTING] INFO  i.p.c.s.ControllerServiceStarter - Awaiting start of controller event processors
2019-03-21 10:20:56,286 3023 [ControllerServiceStarter STARTING] INFO  i.p.c.s.ControllerServiceStarter - Awaiting start of controller cluster listener
2019-03-21 10:20:57,456 4193 [controllerpool-16] INFO  i.p.c.e.i.EventProcessorGroupImpl - Notifying failure of process 172.24.4.7-e1620132-57fb-4e3f-a74b-4a359aca27c6 participating in reader group EventProcessorGroup[abortStreamReaders]
2019-03-21 10:20:57,456 4193 [controllerpool-19] INFO  i.p.c.e.i.EventProcessorGroupImpl - Notifying failure of process 172.24.4.7-e1620132-57fb-4e3f-a74b-4a359aca27c6 participating in reader group EventProcessorGroup[scaleGroup]
2019-03-21 10:20:57,456 4193 [controllerpool-30] INFO  i.p.c.e.i.EventProcessorGroupImpl - Notifying failure of process 172.24.4.7-e1620132-57fb-4e3f-a74b-4a359aca27c6 participating in reader group EventProcessorGroup[commitStreamReaders]
2019-03-21 11:53:37,489 5564226 [controllerpool-18] INFO  i.p.c.cluster.zkImpl.ClusterZKImpl - Node 172.24.4.6:9090:eff7c8dd-8794-4dc4-8dc0-c361cb2b2eb4 removed from cluster
2019-03-21 11:53:37,492 5564229 [controllerpool-18] INFO  i.p.c.f.ControllerClusterListener - Received controller cluster event: HOST_REMOVED for host: 172.24.4.6:9090:eff7c8dd-8794-4dc4-8dc0-c361cb2b2eb4
2019-03-21 11:53:37,494 5564231 [Curator-LeaderSelector-0] INFO  i.p.c.fault.SegmentMonitorLeader - Obtained leadership to monitor the Host to Segment Container Mapping

Problem location

My observation is that load balancing does not happen properly when a restarted/failed controller resumes operation while there is more than one controller in the cluster.
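
One plausible mechanism, consistent with the behavior above: a standard Kubernetes ClusterIP Service balances at TCP-connection granularity, while gRPC multiplexes all calls over a single long-lived HTTP/2 connection, so established clients stay pinned to the surviving controller and the restarted pod receives no traffic until that pinned connection breaks. A minimal sketch of one common mitigation for this class of problem, assuming clients can resolve every controller pod via DNS and apply client-side round-robin; the Service name and selector labels here are hypothetical, not taken from the operator:

  apiVersion: v1
  kind: Service
  metadata:
    name: pravega-controller-headless   # hypothetical name
  spec:
    clusterIP: None                     # headless: DNS returns every controller pod IP
    selector:
      app: pravega-cluster              # hypothetical selector labels
      component: pravega-controller
    ports:
      - name: grpc
        port: 9090                      # controller gRPC port, per the log above
        targetPort: 9090

Whether the Pravega client stack can be pointed at such a target with a round-robin load-balancing policy is part of what needs investigation here.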

Labels

kind/enhancement: Enhancement of an existing feature
priority/P2: Slight inconvenience or annoyance to applications, system continues to function
status/needs-investigation: Further investigation is required
