Conversation

@karina-ranadive commented on Oct 17, 2025

Reason for Change:
Test Steps:

  1. Initial Validation - Performs an initial DNS test to verify LRP is working correctly with metrics validation
  2. Cilium Command Validation - Uses Cilium CLI commands to validate that LRP is properly configured in the Cilium dataplane, checking both cilium lrp list and cilium service list outputs
  3. Pod Restart Testing - Restarts client pods and verifies that LRP functionality persists after pod restarts, ensuring the policy survives pod lifecycle changes
  4. Post-Restart Validation - Re-runs DNS tests and Cilium validation to confirm LRP still works after client pod restarts
  5. Resource Recreation Testing:
  • Deleting client daemonsets and LRP resources
  • Recreating the LRP policy
  • Restarting node-local-dns pods to pick up the new configuration
  • Recreating client daemonsets
  6. Port Forward Re-establishment - Sets up a new port forward to the newly restarted node-local-dns pod (using a different local port to avoid conflicts)
  7. Final Metrics Validation - Validates that DNS requests are properly redirected to the new node-local-dns pod through metrics monitoring
  8. Final Cilium Validation - Performs a final check using Cilium commands to ensure LRP is still active after all the resource recreation

Purpose: This lifecycle test ensures that Local Redirect Policy functionality is robust and survives common operational scenarios, including pod restarts, resource deletion/recreation, and service restarts, validating both functional behavior through DNS metrics and dataplane configuration through Cilium CLI inspection. A condensed sketch of the flow follows.
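For orientation, here is how the steps above could hang together in Go. Helper names such as testLRPCase, validateCiliumLRP, restartClientPodsAndGetPod, and deleteAndRecreateResources come from the review discussion below; setupLRPTest, portForwardNodeLocalDNS, and the DNS command are assumptions, and the wiring is illustrative rather than the actual test body.

package lrp

import (
	"context"
	"testing"
)

func TestLRPLifecycle(t *testing.T) {
	ctx := context.Background()
	dnsCmd := []string{"nslookup", "example.com"} // illustrative command

	// Hypothetical setup: clientset, a client pod, and a port-forwarded
	// Prometheus endpoint on the node-local-dns pod.
	cs, clientPod, promAddr := setupLRPTest(t, ctx)

	// Steps 1-2: baseline DNS behavior plus Cilium dataplane state.
	testLRPCase(t, ctx, clientPod, dnsCmd, promAddr, true)
	validateCiliumLRP(t, ctx, cs)

	// Steps 3-4: the policy must survive client pod restarts.
	clientPod = restartClientPodsAndGetPod(t, ctx, cs, clientPod)
	testLRPCase(t, ctx, clientPod, dnsCmd, promAddr, true)
	validateCiliumLRP(t, ctx, cs)

	// Step 5: delete and recreate daemonsets, the LRP, and node-local-dns.
	clientPod = deleteAndRecreateResources(t, ctx, cs, clientPod)

	// Step 6: re-establish a port forward on a fresh local port.
	promAddr = portForwardNodeLocalDNS(t, ctx, cs)

	// Steps 7-8: metrics must show redirection to the new pod, and the
	// LRP must still be present in the Cilium dataplane.
	testLRPCase(t, ctx, clientPod, dnsCmd, promAddr, true)
	validateCiliumLRP(t, ctx, cs)
}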

Issue Fixed:

Requirements:

Notes:

@karina-ranadive requested a review from a team as a code owner on October 17, 2025 20:32
Copilot AI (Contributor) left a comment

Pull Request Overview

Adds negative Local Redirect Policy (LRP) test scenarios and parameterizes Prometheus endpoint usage in existing LRP tests.

  • Extends testLRPCase to accept a Prometheus address for flexible metric validation (a sketch of the parameterized helper follows this list).
  • Introduces comprehensive negative and resilience LRP tests (resource recreation, pod restarts, cilium validation).
  • Adds helper functions for validating Cilium LRP state and recreating test resources.
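A plausible shape for the parameterized helper is sketched below; the prometheusAddr name, the trailing bool, and the scrape/exec helpers are assumptions inferred from the call sites and assertions quoted later in this thread:

// testLRPCase runs cmd in clientPod and checks the node-local-dns request
// counter at prometheusAddr, expecting it to rise only when traffic is
// redirected by the LRP. (Sketch; parameter names are assumptions.)
func testLRPCase(t *testing.T, ctx context.Context, clientPod corev1.Pod,
	cmd []string, prometheusAddr string, expectDNSMetricIncrease bool) {
	before := scrapeDNSRequestCount(t, prometheusAddr) // hypothetical scrape helper
	execInPod(t, ctx, clientPod, cmd)                  // hypothetical exec helper
	after := scrapeDNSRequestCount(t, prometheusAddr)
	if expectDNSMetricIncrease {
		require.Greaterf(t, after, before, "dns metric count did not increase - before: %.0f, after: %.0f", before, after)
	} else {
		require.Equalf(t, after, before, "dns metric count increased - before: %.0f, after: %.0f", before, after)
	}
}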

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Changed files:
  • test/integration/lrp/lrp_test.go - Adds negative LRP test flow, Prometheus address parameter, and Cilium/LRP validation helpers.
  • test/integration/lrp/lrp_fqdn_test.go - Updates calls to testLRPCase to include the new Prometheus address parameter.


require.Greater(t, afterValue, beforeValue, "dns metric count did not increase after command - before: %.0f, after: %.0f", beforeValue, afterValue)
} else {
require.Equal(t, afterMetric.GetCounter().GetValue(), beforeMetric.GetCounter().GetValue(), "dns metric count increased after command")
require.Equal(t, afterValue, beforeValue, "dns metric count increased after command - before: %.0f, after: %.0f", beforeValue, afterValue)
Copilot AI commented on Oct 17, 2025

The testify assertions use formatted messages with placeholders but require.Greater and require.Equal do not perform format substitution; use require.Greaterf / require.Equalf or pre-format the message with fmt.Sprintf to ensure values appear in failure output.

Suggested change
require.Equal(t, afterValue, beforeValue, "dns metric count increased after command - before: %.0f, after: %.0f", beforeValue, afterValue)
require.Equalf(t, afterValue, beforeValue, "dns metric count increased after command - before: %.0f, after: %.0f", beforeValue, afterValue)
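If the Equal assertion gets this fix, the Greater assertion on the first quoted line would presumably need the matching change:

require.Greaterf(t, afterValue, beforeValue, "dns metric count did not increase after command - before: %.0f, after: %.0f", beforeValue, afterValue)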

Comment on lines +475 to +511
// Wait for deletion to complete
time.Sleep(10 * time.Second)
Copilot AI commented on Oct 17, 2025

Fixed sleep introduces unnecessary delay and potential flakiness; replace with a poll-based wait (e.g., repeatedly checking for resource absence/recreation with a timeout) to reduce test duration and improve reliability.

Suggested change
// Wait for deletion to complete
time.Sleep(10 * time.Second)
// Wait for deletion to complete (poll for absence of client DaemonSet and LRP)
retry.DoWithTimeout(ctx, "wait for client DaemonSet deletion", 30*time.Second, func(ctx context.Context) (bool, error) {
_, err := dsClient.Get(ctx, clientDS.Name, metav1.GetOptions{})
if err != nil {
// DaemonSet not found
return true, nil
}
// Still exists
return false, nil
})
retry.DoWithTimeout(ctx, "wait for LRP deletion", 30*time.Second, func(ctx context.Context) (bool, error) {
_, err := lrpClient.Get(ctx, lrp.Name, metav1.GetOptions{})
if err != nil {
// LRP not found
return true, nil
}
// Still exists
return false, nil
})
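Note that retry.DoWithTimeout is not a stock Kubernetes helper; an equivalent using the standard apimachinery wait package (reusing the dsClient and clientDS names from the suggestion above, which are themselves illustrative) might look like:

// Requires:
//   apierrors "k8s.io/apimachinery/pkg/api/errors"
//   "k8s.io/apimachinery/pkg/util/wait"
err := wait.PollUntilContextTimeout(ctx, 2*time.Second, 30*time.Second, true,
	func(ctx context.Context) (bool, error) {
		_, err := dsClient.Get(ctx, clientDS.Name, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return true, nil // DaemonSet is gone
		}
		return false, err // still exists, or propagate a real error
	})
require.NoError(t, err)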

@karina-ranadive force-pushed the karanadive/lrp-negative-test branch from 321a3ea to ba2b602 on October 17, 2025 20:39
@karina-ranadive force-pushed the karanadive/lrp-negative-test branch from ba2b602 to b947a40 on October 27, 2025 21:25
@karina-ranadive force-pushed the karanadive/lrp-negative-test branch from 49f04a2 to 1b852ce on November 7, 2025 20:34
@karina-ranadive force-pushed the karanadive/lrp-negative-test branch from dfb53d7 to 4e6b260 on November 10, 2025 17:42
@karina-ranadive changed the title from "LRP negative test" to "LRP lifecycle test" on Nov 10, 2025
@karina-ranadive changed the title from "LRP lifecycle test" to "test: expand LRP test to include lifecycle events" on Nov 12, 2025
@jpayne3506 requested a review from Copilot on November 12, 2025 22:20
Copilot finished reviewing on behalf of jpayne3506 on November 12, 2025 22:24
Copilot AI (Contributor) left a comment

Pull Request Overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

test/integration/lrp/lrp_test.go:517

  • The rand.Seed() call is made every time TakeOne() is invoked, which can cause issues in concurrent tests or when called multiple times in quick succession (as the seed would be very similar). Additionally, rand.Seed() has been deprecated since Go 1.20 in favor of using the global generator which is automatically seeded.

Recommendation: Remove the rand.Seed() call and use the global random generator directly, which is automatically seeded in Go 1.20+:

func TakeOne[T any](slice []T) T {
    if len(slice) == 0 {
        var zero T
        return zero
    }
    return slice[rand.Intn(len(slice))]
}

Alternatively, if you need better randomness, use crypto/rand or math/rand/v2 for Go 1.22+.

// TakeOne takes one item from the slice randomly; if empty, it returns the empty value for the type
// Use in testing only
func TakeOne[T any](slice []T) T {
	if len(slice) == 0 {
		var zero T
		return zero
	}
	rand.Seed(uint64(time.Now().UnixNano()))
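For the math/rand/v2 option mentioned in the recommendation, the v2 package seeds its global generator automatically and renames Intn to IntN; a sketch:

import "math/rand/v2"

// TakeOne takes one item from the slice randomly; if empty, it returns the
// zero value for the type. Use in testing only.
func TakeOne[T any](slice []T) T {
	if len(slice) == 0 {
		var zero T
		return zero
	}
	return slice[rand.IntN(len(slice))] // no manual seeding in rand/v2
}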


Comment on lines +482 to +483
// Wait for deletion to complete
time.Sleep(10 * time.Second)
Copilot AI commented on Nov 12, 2025

[nitpick] Using a hardcoded 10-second sleep to wait for resource deletion is fragile and can lead to flaky tests. The sleep might be too short in slow environments or unnecessarily long in fast environments.

Recommendation: Replace the hardcoded sleep with proper wait functions to ensure deletion has completed. You can use WaitForPodsDelete() for pods and consider implementing similar wait functions for LRP deletion, or use Kubernetes' metadata.deletionTimestamp to poll for actual deletion:

// Wait for client pods to be deleted
err = kubernetes.WaitForPodsDelete(ctx, cs, originalPod.Namespace, clientLabelSelector)
require.NoError(t, err)

// Wait for LRP to be deleted by polling
err = waitForLRPDelete(ctx, lrpClient, lrp.Name)
require.NoError(t, err)
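The waitForLRPDelete helper referenced above does not exist in the diff; a minimal sketch using the generated Cilium clientset and apimachinery polling (the type and package names are assumptions):

// waitForLRPDelete polls until the named CiliumLocalRedirectPolicy is gone.
func waitForLRPDelete(ctx context.Context, lrpClient ciliumv2.CiliumLocalRedirectPolicyInterface, name string) error {
	return wait.PollUntilContextTimeout(ctx, 2*time.Second, 30*time.Second, true,
		func(ctx context.Context) (bool, error) {
			_, err := lrpClient.Get(ctx, name, metav1.GetOptions{})
			if apierrors.IsNotFound(err) {
				return true, nil // LRP deleted
			}
			return false, err
		})
}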

A contributor commented:

+1.


// Step 1: Initial DNS test to verify LRP is working
t.Log("Step 1: Initial DNS test - verifying LRP functionality")
testLRPCase(t, ctx, clientPod, []string{
A contributor commented:

is this the same as the basic lrp test called in TestLRP?

if strings.Contains(line, "nodelocaldns") && strings.Contains(line, "kube-system") {
// Validate that the line contains expected components
require.Contains(t, line, "kube-system", "LRP line should contain kube-system namespace")
require.Contains(t, line, "nodelocaldns", "LRP line should contain nodelocaldns name")
A contributor commented:

these two require checks are redundant right? If we are in the if statement the line should already contain kube-system and nodelocaldns.
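In other words, the body could drop both asserts and just record the match; a sketch (the foundLRP flag is an assumption about the surrounding loop):

if strings.Contains(line, "nodelocaldns") && strings.Contains(line, "kube-system") {
	// Both substrings are guaranteed by the condition above; assert only
	// what it does not already cover, or simply record the match.
	foundLRP = true
}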


// restartClientPodsAndGetPod restarts the client daemonset and returns a new pod reference
func restartClientPodsAndGetPod(t *testing.T, ctx context.Context, cs *k8sclient.Clientset, originalPod corev1.Pod) corev1.Pod {
// Find the daemonset name by looking up the pod's owner
A contributor commented:

is this comment still accurate? we seem to be finding the pod's node name with the pod detail information

ciliumCS, err := ciliumClientset.NewForConfig(config)
require.NoError(t, err)

nodeName := originalPod.Spec.NodeName
A contributor commented:

Comparing this with restartClientPodsAndGetPod-- this seems like a simpler way to get the original pod's nodename. Is there a difference between this and doing podDetails, err := cs.CoreV1().Pods(originalPod.Namespace).Get(ctx, originalPod.Name, metav1.GetOptions{}) in restartClientPodsAndGetPod ?

recreatedPod := deleteAndRecreateResources(t, ctx, cs, clientPod)

// Step 7: Final verification after recreation
t.Log("Step 7: Final verification after resource recreation - skipping basic DNS test, will validate with metrics in Step 8")
A contributor commented:

is there supposed to be code for step 7?
