Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consolidate tigera status for all log storage controllers #3118

Merged
merged 4 commits into from
Jan 29, 2024

Conversation

vara2504
Copy link
Contributor

@vara2504 vara2504 commented Jan 12, 2024

Description

  status:
    conditions:
    - lastTransitionTime: "2024-01-25T22:17:31Z"
      message: ""
      reason: Unknown
      status: "False"
      type: Degraded
    - lastTransitionTime: "2024-01-25T22:17:31Z"
      message: All sub-controllers are available
      reason: AllObjectsAvailable
      status: "True"
      type: Ready
    - lastTransitionTime: "2024-01-25T22:17:31Z"
      message: ""
      reason: Unknown
      status: "False"
      type: Progressing

  status:
    conditions:
    - lastTransitionTime: "2024-01-25T22:55:42Z"
      message: ""
      reason: Unknown
      status: "False"
      type: Progressing
    - lastTransitionTime: "2024-01-25T23:02:49Z"
      message: 'The following sub-controllers are in this condition: [log-storage-access
        log-storage-elastic log-storage-esmetrics log-storage-kubecontrollers]'
      reason: ResourceNotReady
      status: "True"
      type: Degraded
    - lastTransitionTime: "2024-01-25T22:55:03Z"
      message: ""
      reason: ResourceNotReady
      status: "False"
      type: Ready

For PR author

  • Tests for change.
  • If changing pkg/apis/, run make gen-files
  • If changing versions, run make gen-versions

For PR reviewers

A note for code reviewers - all pull requests must have the following:

  • Milestone set according to targeted release.
  • Appropriate labels:
    • kind/bug if this is a bugfix.
    • kind/enhancement if this is a a new feature.
    • enterprise if this PR applies to Calico Enterprise only.

Copy link
Member

@caseydavenport caseydavenport left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass - It's a bit of a tricky one, but I suggested some changes to the logic to simplify things. I think there are a couple of small bugs regarding how we set certain fields as well.


// Update tigera status conditions
switch {
case statusMap[operatorv1.ComponentDegraded]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think these are mutually exclusive, so a switch statement doesn't quite work here since it will only ever handle one at a time. It's possible for a tigerastatus to be both degraded and progressing, for example.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Okay, so what could be the combinations?

Available = true, only when aggregated Progressing and Degraded conditions are false.
When Progressing and/or Degraded is true: What if we have Available=true for one of the components? Do we need to set true with the message "the following components are available," or should Available be simply marked as false?"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just looked back at the semantics that we implement for our other TigeraStatus resources: https://github.com/tigera/operator-cloud/blob/9aabd16a94b6a5c7306cb8e1f7caf85f909b8159/pkg/controller/status/status.go#L377-L429

I believe that it is something like this:

  • Degraded if any sub-component/deployment is crashing, or explicitly marked degraded.
  • Progressing if any sub-component/deployment is progressing, and none are crashing.
  • Available if no sub-components are progressing, not explicitly marked degraded, and nothing is crashing.

For this particular controller, we don't really have the ability to distinguish between "explicitly marked degraded by the controller" and "some pods are crashing", so in practice it may actually be closer to mutually exclusive here.

I think it would be good to aggregate the Degraded and Progressing fields, and then Available will be true if every sub-controller is Available, and false otherwise.

for _, condition := range tsConditions {
for i := range aggTigeraStatusConditions {
if aggTigeraStatusConditions[i].Type == condition.Type {
aggTigeraStatusConditions[i].Status = condition.Status
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not quite following this logic - we seem to unconditionally update aggTigeraStatusConditions here regardless of its current value. So, if the previous call to updateAggregatedConditions set degraded to "true", but this call sets it to "false", which one is correct?

It seems the code tracks this separately via the statusMap, which is then used to override whatever value is set here in setAndClearTigeraStatus, so I don't think the value set here on line 157 is ever used?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feels like perhaps this loop should perhaps not set anything in aggTigeraStatusConditions and just populate the status map. Then we can use the statusMap to set the correct conditions.

statusMap[condition.Type] = statusMap[condition.Type] || (condition.Status == operatorv1.ConditionTrue)

// Set the most recent Observed Generation
if condition.ObservedGeneration > aggTigeraStatusConditions[i].ObservedGeneration {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should be setting this to the oldest observed generation - that's the only generation that we're confident all of the sub-controllers have observed.

if aggTigeraStatusConditions[i].Type == condition.Type {
aggTigeraStatusConditions[i].Status = condition.Status
aggTigeraStatusConditions[i].Message = fmt.Sprintf("%s%s for %s;", aggTigeraStatusConditions[i].Message, condition.Message, logStorageType)
if aggTigeraStatusConditions[i].LastTransitionTime.Time.Before(condition.LastTransitionTime.Time) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the transition time should be done based on the aggregated condition, and not within this loop.

Instead of initializing the aggregated conditions to empty values, we should initialize them to whatever we set last time this controller executed. Then, whenever one of the conditions changes (i.e., in setAndClearTigeraStatus) we can update the transition time there.

For example, it's possible that one sub-controller is consistently Degraded, and other sub-controllers flicker between Available and Progressing and back again. In that case, the logic as written would update the transition time for Available and Progressing here even though the aggregated statuses never changed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.
Updating the latest transition time here. However reverting it when there is no change in the object status in resetLastTransitionTimeStamp. So that i can compare the status and message once all the subcontrollers are aggregated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's confusing me a bit that we set the value here and then overwrite it elsewhere.

I also think that this muddies the water in terms of what the LastTransitionTime means - does it mean the last transition time of the the sub-controllers? Or does it mean the last transition time of the aggregate? It seems like it could mean either, depending on the situation.

I think we should just declare this to be the last time the field changed state, and not try to be too clever by merging in the last transition time from the sub status.

for i := range aggTigeraStatusConditions {
if aggTigeraStatusConditions[i].Type == condition.Type {
aggTigeraStatusConditions[i].Status = condition.Status
aggTigeraStatusConditions[i].Message = fmt.Sprintf("%s%s for %s;", aggTigeraStatusConditions[i].Message, condition.Message, logStorageType)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be good if we did something like this:

  • Track which conditions are true, and which sub-controllers they are true for.
  • Build a Message based on that after iterating all of them.

Then, we can produce a simple message like this:

"The following sub-controllers are degraded: log-storage-access, log-storage-users"

or

"The following-sub-controllers are progressing: log-storage-elastic"

@vara2504
Copy link
Contributor Author

vara2504 commented Jan 19, 2024

@caseydavenport Do the scenarios below look okay?

  1. When the TigeraStatus object contains only 1 item (e.g., Available):
    Currently, the LogStorage condition array will also have the same 1 item (e.g., Available). With the aggregation changes, the condition array will have all 3 items, marking the unavailable items (Progressing, Degraded) in TigeraStatus as false.

  2. When the TigeraStatus condition is empty:
    Currently, the LogStorage condition array will be empty. With the aggregation changes, the condition array will have all 3 items, marking Available and Progressing as false, and Degraded as true, with the following message.
    "The following sub-controllers are degraded:log-storage-access,log-storage-elastic,log-storage-secrets"

          {
              Type: "Ready",
              Status: "False",
              ObservedGeneration: 0,
              LastTransitionTime: {
                  Time: 2024-01-19T11:40:10-08:00,
              },
              Reason: "ResourceNotReady",
              Message: "",
          },
          {
              Type: "Progressing",
              Status: "False",
              ObservedGeneration: 0,
              LastTransitionTime: {
                  Time: 2024-01-19T11:40:10-08:00,
              },
              Reason: "ResourceNotReady",
              Message: "",
          },
          {
              Type: "Degraded",
              Status: "True",
              ObservedGeneration: 0,
              LastTransitionTime: {
                  Time: 2024-01-19T11:40:10-08:00,
              },
              Reason: "ResourceNotReady",
              Message: "The following sub-controllers are degraded:log-storage-access,log-storage-elastic,log-storage-secrets",
          },
          ```

@vara2504
Copy link
Contributor Author

@caseydavenport Incorporate the suggested changes. Please take a look

@vara2504 vara2504 marked this pull request as ready for review January 24, 2024 17:57
@vara2504 vara2504 requested a review from a team as a code owner January 24, 2024 17:57
if aggTigeraStatusConditions[i].Type == condition.Type {
aggTigeraStatusConditions[i].Status = condition.Status
aggTigeraStatusConditions[i].Message = fmt.Sprintf("%s%s for %s;", aggTigeraStatusConditions[i].Message, condition.Message, logStorageType)
if aggTigeraStatusConditions[i].LastTransitionTime.Time.Before(condition.LastTransitionTime.Time) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's confusing me a bit that we set the value here and then overwrite it elsewhere.

I also think that this muddies the water in terms of what the LastTransitionTime means - does it mean the last transition time of the the sub-controllers? Or does it mean the last transition time of the aggregate? It seems like it could mean either, depending on the situation.

I think we should just declare this to be the last time the field changed state, and not try to be too clever by merging in the last transition time from the sub status.

logStorageInstances = append(logStorageInstances, TigeraStatusLogStorageESMetrics, TigeraStatusLogStorageKubeController)
}

for _, logStorage := range logStorageInstances {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vara2504 I think this particular function can probably still be simplified. The thing I'd like to avoid is the many layers of helper function that modify the conditions. I think we should be able to factor this so that we build each condition exactly once in once place. Let me know what you think of logic like this:

// Keep track of which instances are in which state.
states := map[string][]string {
  ComponentDegraded: []string{},
  ComponentAvailable: []string{},
  ComponentProgressing: []string{},
}

// We also keep track of the oldest observed generation here so we can add it to the conditions later.
var observedGeneration int64

// Build up the lists of which components are in which state.
for _, instance := range expectedInstances {
  ts := &operatorv1.TigeraStatus{}
  if err := r.client.Get(ctx, namespacedName, ts); errors.IsNotFound(err) {
    // Expected status not found, treat it as an error.
    states[ComponentDegraded] = append(states[ComponentDegraded], instance.Name)
    continue
  } else if err != nil {
    return nil, err
  }

  // Found it - add to states map, which we use to track which TigeraStatuses are in which state(s).
  for _, condition := range ts.Status.Conditions {
    if condition.Status == ConditionTrue {
      states[condition.Status] = append(states[condition.Status], instance.Name)
    }
    if observedGeneration == 0 || condition.ObservedGeneration < observedGeneration {
      observedGeneration = condition.ObservedGeneration
    }
  }
}

// Convert the states map to a set of conditions.
conditions := map[string]metav1.Condition{}
for _, statusType, instances := range states {

  condition := metav1.Condition{
    Type: statusType, 
    ObservedGeneration: observedGeneration,
  }  

  if statusType != ComponentAvailable && len(instances) != 0 {
    // This condition is true.
    condition.Status = operatorv1.ConditionTrue
    conditoin.Reason = operatorv1.NotReady
    condition.Message = fmt.Sprintf("The following sub-controllers are in this condition: %+v", instances)
  } else if statusType != ConditionAvailable {
    // This condition is false.
    condition.Status = operatorv1.ConditionFalse
  } else {
    // This is the available condition, which is only true if all statuses are available.
    if len(instances) == len(expectedInstances) {
      condition.Status = operatorv1.ConditionTrue
      condition.Message = "All sub-controllers are available"
    } else {
      condition.Status = operatorv1.ConditionFalse
    }
  }

  // Store the condition.
  conditions[statusType] = condition
} 
return conditions

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should remove the need for a number of the helper functions, like:

  • transformIntoLogStorageConditions
  • setAvailable, setProgressing, setDegraded
  • clearAvailable, clearProgressing, clearDegraded
  • mergeCondition

As always, I could be missing an edge case though so let me know if this is too simple for what we need!

Status: operatorv1.ConditionFalse,
Reason: string(operatorv1.ResourceNotReady),
Message: "",
ObservedGeneration: generation,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have any tests that handle different ObservedGeneration values for different sub-controllers?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope will add it.

@vara2504 vara2504 force-pushed the ls_ts branch 2 times, most recently from 7372dfe to a189f1d Compare January 27, 2024 00:56
condition := metav1.Condition{
Type: statusType,
ObservedGeneration: observedGeneration,
Reason: string(operatorv1.Unknown),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setting Unknown as a default reason. Because It was straightforward while handling use case with multiple condition set as true.

eg: deg = true(one component), prog =false, available= true(all component).
Then , we would be setting as such in resultant ls storage conditions deg=true, prog=false and available=true.

If we need to set reason for all condition types as AllObjectsAvailable when available=true, we might need to handle deg!=true similarly for progressing.

I can add those check and set all reasons as "AllObjectsAvailable if it is a preferred option.

@caseydavenport caseydavenport merged commit f7c7aaf into tigera:master Jan 29, 2024
5 checks passed
@danudey danudey modified the milestones: v1.33.0, v1.34.0 Jan 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants