Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix continueAsNew with large snapshot(>4MB) #482

Merged
merged 13 commits into from
Dec 10, 2024
Merged

Fix continueAsNew with large snapshot(>4MB) #482

merged 13 commits into from
Dec 10, 2024

Conversation

longquanzheng
Copy link
Contributor

@longquanzheng longquanzheng commented Nov 15, 2024

Description

Checklist

  • Code compiles correctly
  • Tests for the changes have been added
  • All tests passing
  • This PR change is backwards-compatible
  • This PR CONTAINS a (planned) breaking change (it is not backwards compatible)

Related Issue

Closes #397

- name: Dump docker logs
if: always()
uses: jwalton/gh-docker-logs@v2
# TODO cadence sticky test is flaky
Copy link
Contributor Author

@longquanzheng longquanzheng Dec 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @lwolczynski disabled this as you saw today that it started to be flaky.
Having the sticky disabled for Temporal should be enough for protecting determinism.

// NOTE: using DeterministicKeys so that the JSON snapshot for continueAsNew is stable for pagination
// TODO: we should use DeterministicKeys for every map iteration in interpreter for safety
// https://github.com/indeedeng/iwf/issues/510
for _, k := range DeterministicKeys(am.searchAttributes) {
Copy link
Contributor Author

@longquanzheng longquanzheng Dec 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is necessary otherwise the snapshot will have different order for different pages.
It wasn't a problem becase I didn't test more than one page for continueAsNew (and there wasn't prod use case running into this, luckily)

@longquanzheng longquanzheng requested review from ktrops and lwolczynski and removed request for ktrops December 6, 2024 22:24
@longquanzheng longquanzheng marked this pull request as ready for review December 6, 2024 22:24
StaleSkipTimerSignals: c.timerProcessor.Dump(),
}, nil
})
return c.provider.SetQueryHandler(ctx, service.ContinueAsNewDumpQueryType,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm having a hard time understanding how/where this pagination code (SetQueryHandler) is called when the workflow is queried. What is the code path?

Copy link
Contributor Author

@longquanzheng longquanzheng Dec 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to use IDE to browse the code path:

  • workflowImpl.go: if input.IsResumeFromContinueAsNew { LoadInternalsFromPreviousRun(...) }
  • continueAsNewer.go: provider.ExecuteActivity(&resp, false, ctx, DumpWorkflowInternal...
  • activityImpl.go: apiClient.DefaultApi.ApiV1WorkflowInternalDumpPost(ctx)
  • service.go: s.client.QueryWorkflow(ctx, &pageOfSnapshot, request.GetWorkflowId(), request.GetWorkflowRunId(), service.ContinueAsNewDumpQueryType, request)
  • continueAsNewer.go : the registered handler of SetQueryHandler(ctx, service.ContinueAsNewDumpQueryType,

@longquanzheng longquanzheng changed the title Fix continueAsNew with large snapshot Fix continueAsNew with large snapshot(>4MB) Dec 7, 2024
assertions := assert.New(t)

// start test workflow server
wfHandler := signal.NewHandler()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Wouldn't it be better to create a handler class for this specific test?

Copy link
Contributor Author

@longquanzheng longquanzheng Dec 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like using the existing is better. The test only need a workflow to do nothing, so that we can set some large DAs. The existing signal workflow is exactly like that. I would just copy the code if so.

str, err := json.Marshal(req)
if len(str) > 1000 {
str = str[0:1000]
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought this logging went to Datadog, is there a specific reason why we need to limit its size? I don't see a reason to limit this if it's not going to Temporal

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for the request object. Logging will be too expensive too if the req is too big (>100 bytes).

This one is actually not going to Datadog because we have another internal handler (see handler.go in internal repo).

I wanted to limit this otherwise my local dev will look pretty bad when sending the 1MB request.

if request.PageSizeInBytes > 0 {
pageSize = request.PageSizeInBytes
}
lenInDouble := float64(len(wholeData))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: no need to declare this since it's only used in line 151. I also believe they could be float32s since you are really dividing ints and just doing type conversion

Copy link
Contributor Author

@longquanzheng longquanzheng Dec 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think having two lines is better for reading. and easier to debug when we want to set a breakpoint.

The compiler should be smart enough to optimize this to save a local varable.

float64 is required by the library. Golang doesn't allow int32 to divided by float like Java.

Screenshot 2024-12-09 at 3 42 27 PM

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Screenshot 2024-12-10 at 9 53 45 AM

Okay, I see that Ceil requires float64, interesting... TIL

bump sleep

add test

rename

fix static check

tweak config

done

minor fix

done

fix

comment

fix test

disable cadence test

reduce test load

reduce test load

reduce test load

fix test

remove test

remove test

use copy
@longquanzheng longquanzheng merged commit 1bed2dc into main Dec 10, 2024
8 checks passed
@longquanzheng longquanzheng deleted the IWF-130 branch December 10, 2024 00:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Reliability enhancement: move dump internal pagination to query method (IWF-130)
4 participants