Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SNOW-569611 Make channel close future idempotent #162

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

sfc-gh-kkloudas
Copy link
Contributor

@sfc-gh-kkloudas sfc-gh-kkloudas commented Apr 11, 2022

SNOW-569611

@sfc-gh-tzhang I also left a question as a comment so that we can discuss it either in the PR or we can chat.

@@ -65,7 +65,7 @@ void removeChannelIfSequencersMatch(SnowflakeStreamingIngestChannelInternal chan
// We need to compare the channel sequencer in case the old channel was already been
// removed
return channelInCache != null
&& channelInCache.getChannelSequencer() == channel.getChannelSequencer()
&& channelInCache.getChannelSequencer().equals(channel.getChannelSequencer())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this change out of curiosity? These should be long

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use Long and i believe we discussed this long time ago, basically we want to avoid the case when the server side sends us a bad request without the channel sequencer and we translate it into 0 if using long instead of NULL

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes and the reason is that == on objects checks if the object is the same while equals() (if correctly implemented) checks if the value is the same, irrespective of the object that caries it. So this could lead to bugs.

@@ -65,6 +66,8 @@ class SnowflakeStreamingIngestChannelInternal implements SnowflakeStreamingInges
// ON_ERROR option for this channel
private final OpenChannelRequest.OnErrorOption onErrorOption;

private final CompletableFuture<Void> terminationFuture;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, add one-liner explaining what this is.

@@ -112,6 +115,9 @@ class SnowflakeStreamingIngestChannelInternal implements SnowflakeStreamingInges
this.encryptionKey = encryptionKey;
this.encryptionKeyId = encryptionKeyId;
this.onErrorOption = onErrorOption;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, remove space.

* Mark the channel as closed atomically.
* @return {@code true} if the channel was already closed, {@code false} otherwise.
*/
boolean markClosed() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, could this be a synchronized method?

Copy link
Contributor

@sfc-gh-tjones sfc-gh-tjones left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, let's have Toby or Matt give their approval before landing.

Copy link
Contributor

@sfc-gh-tzhang sfc-gh-tzhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the change, two questions:

  1. Could you update SnowflakeStreamingIngestClientInternal.close as well?
  2. I think your change makes close() idempotent, but it's still possible that others call isClosed() before the close() finishes, right? So the same issue would still be possible.

@@ -65,7 +65,7 @@ void removeChannelIfSequencersMatch(SnowflakeStreamingIngestChannelInternal chan
// We need to compare the channel sequencer in case the old channel was already been
// removed
return channelInCache != null
&& channelInCache.getChannelSequencer() == channel.getChannelSequencer()
&& channelInCache.getChannelSequencer().equals(channel.getChannelSequencer())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use Long and i believe we discussed this long time ago, basically we want to avoid the case when the server side sends us a bad request without the channel sequencer and we translate it into 0 if using long instead of NULL

if (isClosed()) {
return CompletableFuture.completedFuture(null);
}
if (!markClosed()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the function name might be confusing, might better be markAndGetClosed?

Comment on lines +322 to +329
// here we flush even if removal fails because, for some reason,
// the client sequencers did not match.
// This does not seem like the desirable behavior.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I understand what you mean here, could you elaborate? But I notice a potential wrong result issue from your question.... please take a look if you're interested. PR: https://github.com/snowflakedb/snowflake-ingest-java/pull/163/files

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sfc-gh-tzhang The removeChannelIfSequencersMatch() removes the channel from the cache if 1) it was there in the first place and 2) if the client seq matches. If all those conditions are satisfied, then it makes sense to call flush(), upload and register the created blob. If any of these conditions does not match, and the removal fails (e.g. the client seq is old), then should we flush and upload? or should we just invalidate the channel and move on?

I am not sure if this scenario is realistic but I just wanted to point this out.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it makes sense, the validation check is actually in checkValidation() at the top of this function. If the channel is not in the channel cache, it must be invalid, you could add a logWarn to verify this logic

@@ -45,7 +46,7 @@ class SnowflakeStreamingIngestChannelInternal implements SnowflakeStreamingInges
private volatile boolean isValid;

// Indicates whether the channel is closed
private volatile boolean isClosed;
private final AtomicReference<Boolean> isClosed;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the difference between AtomicReference and volatile in this case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we only use the functionality that it allows to atomically set the value of the ref and get the previous value. This allows us to be sure that do the state transitioning at most once and we perform the actions that should follow the state transition at most once.

Another nice thing about AtomicReference is that it allows you to do an atomic compare-and-swap. So this is useful for state machine implementation.

Before, we had:

 if (isClosed()) {
      return CompletableFuture.completedFuture(null);
 }
 markClosed();

which could have led to two (or more) threads calling close() and both of them seeing isClosed() == false due to timing.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't AtomicReference<Boolean> same as AtomicBoolean?

@sfc-gh-kkloudas
Copy link
Contributor Author

sfc-gh-kkloudas commented Apr 12, 2022

@sfc-gh-tzhang

Thanks for the change, two questions:

  1. Could you update SnowflakeStreamingIngestClientInternal.close as well?

The reason I did not change this is because I do not know what do we want the semantics of SnowflakeStreamingIngestClientInternal.close() to be? Judging by the signature, we want it to wait for all channels to have closed before exiting, right? I will implement that and we can discuss later if we want something else.

  1. I think your change makes close() idempotent, but it's still possible that others call isClosed() before the close() finishes, right? So the same issue would still be possible.

Could you please clarify what do you mean by the second point? If someone calls close() then everyone else will see the channel as closed. The isClosed() does not perform any action. It simply reads the value.

@sfc-gh-kkloudas
Copy link
Contributor Author

Actually I think that for the client.close() I think that we should chat because I have some questions that I need to clarify before figuring out a solution. For example, we do not seem to close this.arrowBuffer.close(); for each channel and the javadoc of the method says that it releases resources. Do we risk of memory leaks given that in there we also call Utils.closeAllocator(this.allocator);?

Also we do not do any verifyChannelsAreFullyCommitted() or remove the channel from the cache. Are these things something we should add?

Finally, should we have for each channel a close-and-flush() method which is the close() now, and a close-and-do-nothing which will be called by all the close methods and gives us the ChannelData to put in the blob?

@sfc-gh-tzhang
Copy link
Contributor

@sfc-gh-tzhang

Thanks for the change, two questions:

  1. Could you update SnowflakeStreamingIngestClientInternal.close as well?

The reason I did not change this is because I do not know what do we want the semantics of SnowflakeStreamingIngestClientInternal.close() to be? Judging by the signature, we want it to wait for all channels to have closed before exiting, right? I will implement that and we can discuss later if we want something else.

  1. I think your change makes close() idempotent, but it's still possible that others call isClosed() before the close() finishes, right? So the same issue would still be possible.

Could you please clarify what do you mean by the second point? If someone calls close() then everyone else will see the channel as closed. The isClosed() does not perform any action. It simply reads the value.

I think we could just make close() synchronized, so any subsequent call to it will wait for the first one to finish. I believed the issue you mentioned to me the other day was that with the current implementation, the first call to channel.close() will mark the channel as closed and flush, then subsequent call to channel.close() or channel.isClose() will simply return and it's possible that customer does something with it before the first call finish, and that's why we want to make this change, does that sound right?

@sfc-gh-kkloudas
Copy link
Contributor Author

@sfc-gh-tzhang

Thanks for the change, two questions:

  1. Could you update SnowflakeStreamingIngestClientInternal.close as well?

The reason I did not change this is because I do not know what do we want the semantics of SnowflakeStreamingIngestClientInternal.close() to be? Judging by the signature, we want it to wait for all channels to have closed before exiting, right? I will implement that and we can discuss later if we want something else.

  1. I think your change makes close() idempotent, but it's still possible that others call isClosed() before the close() finishes, right? So the same issue would still be possible.

Could you please clarify what do you mean by the second point? If someone calls close() then everyone else will see the channel as closed. The isClosed() does not perform any action. It simply reads the value.

I think we could just make close() synchronized, so any subsequent call to it will wait for the first one to finish. I believed the issue you mentioned to me the other day was that with the current implementation, the first call to channel.close() will mark the channel as closed and flush, then subsequent call to channel.close() or channel.isClose() will simply return and it's possible that customer does something with it before the first call finish, and that's why we want to make this change, does that sound right?

You are right that the problem was that the first call to channel.close() will mark the channel as closed and flush, then subsequent call to channel.close() will simply return and it's possible that customer does something with it before the first call finish. The isClosed() will return always true during the process of closing because we first set the state to closed and then start. But now, every channel.close() will wait for the first one to finish.

Did you have a chance to see my other questions about the client.close()?

@sfc-gh-tzhang
Copy link
Contributor

sfc-gh-tzhang commented Apr 12, 2022

Actually I think that for the client.close() I think that we should chat because I have some questions that I need to clarify before figuring out a solution. For example, we do not seem to close this.arrowBuffer.close(); for each channel and the javadoc of the method says that it releases resources. Do we risk of memory leaks given that in there we also call Utils.closeAllocator(this.allocator);?

Also we do not do any verifyChannelsAreFullyCommitted() or remove the channel from the cache. Are these things something we should add?

Finally, should we have for each channel a close-and-flush() method which is the close() now, and a close-and-do-nothing which will be called by all the close methods and gives us the ChannelData to put in the blob?

Allocator.close should be sufficient since we use it the parent allocator when creating the arrow buffers. We were actually calling verifyChannelsAreFullyCommitted() on all the channels before, but I removed it as part of #117, and the reason is that we ran into issues like a channel is still in the cache and being inactive for a long time, it will be cleaned up at server side and causes false alarm. Also I feel like checking all the channels in cache during close doesn't make sense, so the recommendation is to always use channel.close. I was actual thinking about the same thing like you suggested -- support multiple close function with different functionalities, but got pushed back from folks that let's keep the APIs simple for now. I think it makes sense to add them in longer term (maybe PuPr or GA), based on customer's feedback.

@sfc-gh-tzhang
Copy link
Contributor

sfc-gh-tzhang commented Apr 12, 2022

@sfc-gh-tzhang

Thanks for the change, two questions:

  1. Could you update SnowflakeStreamingIngestClientInternal.close as well?

The reason I did not change this is because I do not know what do we want the semantics of SnowflakeStreamingIngestClientInternal.close() to be? Judging by the signature, we want it to wait for all channels to have closed before exiting, right? I will implement that and we can discuss later if we want something else.

  1. I think your change makes close() idempotent, but it's still possible that others call isClosed() before the close() finishes, right? So the same issue would still be possible.

Could you please clarify what do you mean by the second point? If someone calls close() then everyone else will see the channel as closed. The isClosed() does not perform any action. It simply reads the value.

I think we could just make close() synchronized, so any subsequent call to it will wait for the first one to finish. I believed the issue you mentioned to me the other day was that with the current implementation, the first call to channel.close() will mark the channel as closed and flush, then subsequent call to channel.close() or channel.isClose() will simply return and it's possible that customer does something with it before the first call finish, and that's why we want to make this change, does that sound right?

You are right that the problem was that the first call to channel.close() will mark the channel as closed and flush, then subsequent call to channel.close() will simply return and it's possible that customer does something with it before the first call finish. The isClosed() will return always true during the process of closing because we first set the state to closed and then start. But now, every channel.close() will wait for the first one to finish.

Did you have a chance to see my other questions about the client.close()?

Correct, then it seems like a problem that subsequent close() and isClose() behaves differently? close() will wait but isClose() doesn't, customer could use isClose() and still have the same issue you mentioned. I think we want to keep the behavior consistent, right?

@sfc-gh-kkloudas
Copy link
Contributor Author

Oh now I see your point.

I think that the best solution to this is to not expose the isClosed() method at all and either make the close() blocking (which I do not really like), or leave it as it (asynchronous) and only let the user check if the channel is closed by checking if the returned future is completed. From the two, I like more the solution of having the close() asynchronous (as is).

If we still want to expose the isClosed() we can always make it:

public boolean isClosed() {
    return this.terminationFuture.isDone();
  }

I did this in the PR btw but I still believe that we should not expose the isClosed() to avoid such confusion. I think that it is better to give users only one way to do things. It saves us some trouble later on.

@sfc-gh-kkloudas
Copy link
Contributor Author

Actually I think that for the client.close() I think that we should chat because I have some questions that I need to clarify before figuring out a solution. For example, we do not seem to close this.arrowBuffer.close(); for each channel and the javadoc of the method says that it releases resources. Do we risk of memory leaks given that in there we also call Utils.closeAllocator(this.allocator);?
Also we do not do any verifyChannelsAreFullyCommitted() or remove the channel from the cache. Are these things something we should add?
Finally, should we have for each channel a close-and-flush() method which is the close() now, and a close-and-do-nothing which will be called by all the close methods and gives us the ChannelData to put in the blob?

Allocator.close should be sufficient since we use it the parent allocator when creating the arrow buffers. We were actually calling verifyChannelsAreFullyCommitted() on all the channels before, but I removed it as part of #117, and the reason is that we ran into issues like a channel is still in the cache and being inactive for a long time, it will be cleaned up at server side and causes false alarm. Also I feel like checking all the channels in cache during close doesn't make sense, so the recommendation is to always use channel.close. I was actual thinking about the same thing like you suggested -- support multiple close function with different functionalities, but got pushed back from folks that let's keep the APIs simple for now. I think it makes sense to add them in longer term (maybe PuPr or GA), based on customer's feedback.

I am not suggesting to expose those "close" methods to the user. The only method the user will see is the "close-and-flush". The others will be internal. I have to think a bit about this and come back to you about the client.close().

@sfc-gh-tzhang
Copy link
Contributor

Oh now I see your point.

I think that the best solution to this is to not expose the isClosed() method at all and either make the close() blocking (which I do not really like), or leave it as it (asynchronous) and only let the user check if the channel is closed by checking if the returned future is completed. From the two, I like more the solution of having the close() asynchronous (as is).

If we still want to expose the isClosed() we can always make it:

public boolean isClosed() {
    return this.terminationFuture.isDone();
  }

I did this in the PR btw but I still believe that we should not expose the isClosed() to avoid such confusion. I think that it is better to give users only one way to do things. It saves us some trouble later on.

Makes sense to me, let's do that! @sfc-gh-japatel Are you using the isClosed() function in any KC logic?

@sfc-gh-tzhang
Copy link
Contributor

Actually I think that for the client.close() I think that we should chat because I have some questions that I need to clarify before figuring out a solution. For example, we do not seem to close this.arrowBuffer.close(); for each channel and the javadoc of the method says that it releases resources. Do we risk of memory leaks given that in there we also call Utils.closeAllocator(this.allocator);?
Also we do not do any verifyChannelsAreFullyCommitted() or remove the channel from the cache. Are these things something we should add?
Finally, should we have for each channel a close-and-flush() method which is the close() now, and a close-and-do-nothing which will be called by all the close methods and gives us the ChannelData to put in the blob?

Allocator.close should be sufficient since we use it the parent allocator when creating the arrow buffers. We were actually calling verifyChannelsAreFullyCommitted() on all the channels before, but I removed it as part of #117, and the reason is that we ran into issues like a channel is still in the cache and being inactive for a long time, it will be cleaned up at server side and causes false alarm. Also I feel like checking all the channels in cache during close doesn't make sense, so the recommendation is to always use channel.close. I was actual thinking about the same thing like you suggested -- support multiple close function with different functionalities, but got pushed back from folks that let's keep the APIs simple for now. I think it makes sense to add them in longer term (maybe PuPr or GA), based on customer's feedback.

I am not suggesting to expose those "close" methods to the user. The only method the user will see is the "close-and-flush". The others will be internal. I have to think a bit about this and come back to you about the client.close().

I might miss something, but if we make the function synchronized then we should be good?

@sfc-gh-japatel
Copy link
Collaborator

Oh now I see your point.
I think that the best solution to this is to not expose the isClosed() method at all and either make the close() blocking (which I do not really like), or leave it as it (asynchronous) and only let the user check if the channel is closed by checking if the returned future is completed. From the two, I like more the solution of having the close() asynchronous (as is).
If we still want to expose the isClosed() we can always make it:

public boolean isClosed() {
    return this.terminationFuture.isDone();
  }

I did this in the PR btw but I still believe that we should not expose the isClosed() to avoid such confusion. I think that it is better to give users only one way to do things. It saves us some trouble later on.

Makes sense to me, let's do that! @sfc-gh-japatel Are you using the isClosed() function in any KC logic?

I am using channel.isClosed() to check before reopening a channel.

sfc-gh-tzhang
sfc-gh-tzhang previously approved these changes Apr 13, 2022
Copy link
Contributor

@sfc-gh-tzhang sfc-gh-tzhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, please make sure you pass all the merge gate, as well as StreamingIngestIT.java

@@ -258,17 +264,23 @@ void invalidate() {
/** @return a boolean to indicate whether the channel is closed or not */
@Override
public boolean isClosed() {
return this.isClosed;
return this.terminationFuture.isDone();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like KC is using it, I'm fine with both approach depends on whether KC has a workaround

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, then I can leave it given that people are using it and also now it is safer :)

@sfc-gh-kkloudas
Copy link
Contributor Author

For the tests now,

LGTM, please make sure you pass all the merge gate, as well as StreamingIngestIT.java

Now that we changed the semantics of the isClosed() to only return true when the channel.close() succeeds, the tests fail because the markClosed() simply changes the state of the channel to closed, but does not complete the termination future or anything. I am wondering what should be now the semantics.

@sfc-gh-tzhang
Copy link
Contributor

For the tests now,

LGTM, please make sure you pass all the merge gate, as well as StreamingIngestIT.java

Now that we changed the semantics of the isClosed() to only return true when the channel.close() succeeds, the tests fail because the markClosed() simply changes the state of the channel to closed, but does not complete the termination future or anything. I am wondering what should be now the semantics.

You could modify the test to call .close() first, it shouldn't be an issue

For the tests now,

LGTM, please make sure you pass all the merge gate, as well as StreamingIngestIT.java

Now that we changed the semantics of the isClosed() to only return true when the channel.close() succeeds, the tests fail because the markClosed() simply changes the state of the channel to closed, but does not complete the termination future or anything. I am wondering what should be now the semantics.

Ah, yes! i think we have a bigger issue (wrong result)! We need isClose to return true right after markClose is called, otherwise we may still insert rows into the buffer even after we mark the channel as closed and start flushing, so we need to change the isClose back. I don't have a good idea now, we could just merge your change to fix part of the idempotent issue, or we could just keep my old logic, both options work for me.

@sfc-gh-kkloudas sfc-gh-kkloudas force-pushed the kkloudas-SNOW-569611-client-close branch 2 times, most recently from ed34e34 to 1ea2ac3 Compare April 22, 2022 13:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants