
Delete: remove the metafile check #209

Closed
wants to merge 1 commit into from

Conversation

lxbsz
Collaborator

@lxbsz lxbsz commented Apr 16, 2019

What does this PR achieve? Why do we need it?

For some reason, after updating the metafile and then reading it back
immediately, the new contents are sometimes not visible yet. As a result,
even though all the deletions succeed, the command still returns a failure.

Checking the ->exit status is now enough.

For the case:

  • When gluster-block delete was executed for the first time on this
    node, the deletion was successful. But before the node could send the
    response, gluster-blockd died (the addr meta status would be
    CLEANUPFAIL).

Even for this case we can check the result from the ->exit status; there is
no need to check it in the metafile.

This deletion failure has also been seen in customer cases.
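
To illustrate, here is a minimal sketch of what relying on the ->exit status amounts to; the struct and function names are hypothetical, not the actual gluster-block code:

```c
/*
 * Illustrative sketch only: hypothetical names, not the real gluster-block
 * structures. The idea is to trust the exit status each remote node already
 * returned for the delete, instead of re-reading the metafile afterwards.
 */
#include <stddef.h>

struct remote_delete_reply {
    int   exit;   /* exit status reported back by the remote node */
    char *out;    /* captured command output, unused here */
};

static int
all_deletes_succeeded(const struct remote_delete_reply *replies, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (replies[i].exit != 0)
            return 0;   /* at least one node reported a real failure */
    }
    return 1;           /* every node succeeded; no metafile re-read needed */
}
```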

Does this PR fix issues?

Fixes: #204

Signed-off-by: Xiubo Li <[email protected]>
@ghost ghost assigned lxbsz Apr 16, 2019
@ghost ghost added the in progress label Apr 16, 2019
@lxbsz
Collaborator Author

lxbsz commented Apr 16, 2019

@amarts @pkalever
Please review, thanks.

Contributor

@pkalever pkalever left a comment

For some reason, after updating the metafile and then reading it back
immediately, the new contents are sometimes not visible yet. As a result,
even though all the deletions succeed, the command still returns a failure.

@lxbsz This shouldn't happen, please see the implementation of GB_METAUPDATE_OR_GOTO().
We basically open the metafile and close it in place, which should flush the data to the metafile.
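
For reference, the open/write/close pattern described here looks roughly like the gfapi sketch below. It is not the actual GB_METAUPDATE_OR_GOTO() macro (which also handles formatting, locking, and error labels), and the O_APPEND | O_SYNC flags are an assumption based on this thread:

```c
/*
 * Rough sketch of an append-and-close metafile update via gfapi; NOT the
 * real GB_METAUPDATE_OR_GOTO() macro. The open flags are an assumption.
 */
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <glusterfs/api/glfs.h>

static int
meta_append(glfs_t *fs, const char *metafile, const char *line)
{
    glfs_fd_t *fd;
    ssize_t    ret;

    fd = glfs_open(fs, metafile, O_WRONLY | O_APPEND | O_SYNC);
    if (!fd)
        return -1;

    ret = glfs_write(fd, line, strlen(line), 0);
    glfs_close(fd);    /* closed in place; the data should be flushed */

    return (ret < 0) ? -1 : 0;
}
```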

Can you check whether we were using the previously read values, or deferring the blockGetMetaInfo() call until after the update?

For the case:

  • When gluster-block delete was executed for the first time on this
    node, the deletion was successful. But before the node could send the
    response, gluster-blockd died (the addr meta status would be
    CLEANUPFAIL).

Even for this case we can check the result from the ->exit status; there is
no need to check it in the metafile.

Yes, probably this was achieved by 0a65c01?

So with the above PR, we can probably remove this extra read.

I will test this case and get back to you; for now this PR looks good to me.

Thanks!

@lxbsz
Collaborator Author

lxbsz commented Apr 16, 2019

For some reason, after updating the metafile and then reading it back
immediately, the new contents are sometimes not visible yet. As a result,
even though all the deletions succeed, the command still returns a failure.

@lxbsz This shouldn't happen, please see the implementation of GB_METAUPDATE_OR_GOTO().
We basically open the metafile and close it in place, which should flush the data to the metafile.

Yeah, I saw that. The SYNC flag is there, but it doesn't seem to be working here with gluster.

Can you check whether we were using the previously read values, or deferring the blockGetMetaInfo() call until after the update?

I checked this already: it calls blockGetMetaInfo() again and allocates a new info struct.
In my test case the delete does not always fail, only sometimes. I added some logs and my own code to read the metafile again, but I still couldn't see the new data.

Did you see the code I removed in this PR? That is where it does this.

The same applies to the modify-size case. In tcmu-runner I added a retry that waits up to 5 seconds, and that works well; it is the same situation as the delete case.
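
For reference, that tcmu-runner workaround amounts to roughly the retry loop below. It is only a sketch, not the actual change in open-iscsi/tcmu-runner#546, and read_meta_status() is a hypothetical helper using plain POSIX I/O:

```c
/*
 * Sketch of a "retry for ~5 seconds" workaround for a stale metafile read.
 * read_meta_status() is a hypothetical helper, not real tcmu-runner code.
 */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Re-read the metafile and look for the expected status line
 * (e.g. "CLEANUPSUCCESS"). Plain POSIX I/O here for brevity. */
static int
read_meta_status(const char *metafile, const char *expected)
{
    char    buf[4096] = {0};
    int     fd = open(metafile, O_RDONLY);
    ssize_t n;

    if (fd < 0)
        return -1;
    n = read(fd, buf, sizeof(buf) - 1);
    close(fd);
    return (n > 0 && strstr(buf, expected)) ? 0 : -1;
}

static int
wait_for_meta_update(const char *metafile, const char *expected)
{
    for (int waited = 0; waited < 5; waited++) {
        if (read_meta_status(metafile, expected) == 0)
            return 0;      /* the updated data finally became visible */
        sleep(1);          /* wait a bit and re-read */
    }
    return -1;             /* still stale after ~5 seconds */
}
```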

I can reproduce this very easily with the tests/basic.t script on RHEL 7.

Thanks.

@lxbsz
Collaborator Author

lxbsz commented Apr 16, 2019

The test output:


[root@rhel3 gluster-block]# ./tests/basic.t 
TEST : gluster vol create hosting-volume 192.168.195.164:/brick force
volume create: hosting-volume: success: please start the volume to access data
line 69 : OK

TEST : gluster vol start hosting-volume
volume start: hosting-volume: success
line 72 : OK

TEST : ./tests/gfapi-test hosting-volume 192.168.195.164
Test works as expected!
line 77 : OK

TEST : systemctl restart gluster-block-target.service gluster-blockd.service
line 83 : OK

TEST : gluster-block create hosting-volume/block-volume 192.168.195.164 1MiB
IQN: iqn.2016-12.org.gluster-block:c08d016e-3776-4c62-9dc6-bb0695ceb3c6
PORTAL(S):  192.168.195.164:3260
RESULT: SUCCESS
line 90 : OK

TEST : gluster-block delete hosting-volume/block-volume
SUCCESSFUL ON:   192.168.195.164
RESULT: SUCCESS
line 91 : OK

TEST : gluster-block create hosting-volume/block-volume auth enable 192.168.195.164 1MiB
IQN: iqn.2016-12.org.gluster-block:933a0039-48be-4b2e-881d-afa395e7144f
USERNAME: 933a0039-48be-4b2e-881d-afa395e7144f
PASSWORD: ea3bb915-84b8-4f14-96a9-66540909e7c3
PORTAL(S):  192.168.195.164:3260
RESULT: SUCCESS
line 94 : OK

TEST : gluster-block delete hosting-volume/block-volume
SUCCESSFUL ON:   192.168.195.164
RESULT: SUCCESS
line 95 : OK

TEST : gluster-block create hosting-volume/block-volume ha 1 192.168.195.164 1MiB
IQN: iqn.2016-12.org.gluster-block:58045d91-b0c7-441a-9e3d-e2e05e0375ce
PORTAL(S):  192.168.195.164:3260
RESULT: SUCCESS
line 98 : OK

TEST : gluster-block delete hosting-volume/block-volume
SUCCESSFUL ON:   192.168.195.164
RESULT: SUCCESS
line 99 : OK

TEST : gluster-block create hosting-volume/block-volume prealloc full 192.168.195.164 1MiB
IQN: iqn.2016-12.org.gluster-block:52c4025c-03b9-417f-9b49-eedd3dbd19a3
PORTAL(S):  192.168.195.164:3260
RESULT: SUCCESS
line 102 : OK

TEST : gluster-block delete hosting-volume/block-volume
SUCCESSFUL ON:   192.168.195.164
RESULT: SUCCESS
line 103 : OK

TEST : gluster-block create hosting-volume/block-volume ring-buffer 32 192.168.195.164 1MiB
IQN: iqn.2016-12.org.gluster-block:47a771ad-88cd-4a11-9a52-057afb60cd5e
PORTAL(S):  192.168.195.164:3260
RESULT: SUCCESS
line 106 : OK

TEST : gluster-block delete hosting-volume/block-volume
SUCCESSFUL ON:   192.168.195.164
RESULT: FAIL
line 107 : NOT OK

[root@rhel3 gluster-block]# 

The logs:


[2019-04-16 07:36:04.219071] INFO: delete cli request, volume=hosting-volume blockname=block-volume [at block_svc_routines.c+4612 :<block_delete_cli_1_svc_st>]
[2019-04-16 07:36:04.232500] INFO: delete request, blockname=block-volume filename=47a771ad-88cd-4a11-9a52-057afb60cd5e [at block_svc_routines.c+4729 :<block_delete_1_svc_st>]
[2019-04-16 07:36:04.976347] INFO: command exit code, 0 [at block_svc_routines.c+4760 :<block_delete_1_svc_st>]
[2019-04-16 07:36:04.980828] WARNING: glusterBlockDeleteRemoteAsync: return -1 failed in remote async delete for block block-volume on volume hosting-volume [at block_svc_routines.c+2970 :<glusterBlockCleanUp>]
[2019-04-16 07:36:04.980873] WARNING: glusterBlockCleanUp: return -1 on block block-volume for volume hosting-volume [at block_svc_routines.c+4686 :<block_delete_cli_1_svc_st>]

From the mountpoint:

[root@rhel3 mnt]# cat block-meta/block-volume 
VOLUME: hosting-volume
GBID: 47a771ad-88cd-4a11-9a52-057afb60cd5e
HA: 1
ENTRYCREATE: INPROGRESS
PRIOPATH: 192.168.195.164
SIZE: 1048576
RINGBUFFER: 32
ENTRYCREATE: SUCCESS
192.168.195.164: CONFIGINPROGRESS
192.168.195.164: CONFIGSUCCESS
192.168.195.164: CLEANUPINPROGRESS
192.168.195.164: CLEANUPSUCCESS
[root@rhel3 mnt]# 


@pkalever
Contributor

@pranithk are you aware of any similar bug in the glusterfs api?

Many Thanks!

@pkalever
Contributor

@lxbsz BTW I'm using the latest master gluster-block (+ PR#205) with glusterfs-api-5.5-1.fc29.x86_64 on Fedora 29, kernel 5.0.3-200.fc29.x86_64.

I have run this test at least 25 times and still have not hit this issue.

Can you confirm your glusterfs-api version and check whether the above-mentioned version works for you as well?

@lxbsz
Collaborator Author

lxbsz commented Apr 16, 2019

@lxbsz BTW I'm using the latest master gluster-block (+ PR#205) with glusterfs-api-5.5-1.fc29.x86_64 on Fedora 29, kernel 5.0.3-200.fc29.x86_64.

I have run this test at least 25 times and still have not hit this issue.

Can you confirm your glusterfs-api version and check whether the above-mentioned version works for you as well?

I am using RHEL 7 with glusterfs-api-devel-6.0-0.4.rc1.fc29.x86_64 and the upstream gluster-block/tcmu-runner code.

Will try this later.

@pranithk
Member

@pranithk are you aware of any similar bug in the glusterfs api?

Many Thanks!

Sorry, didn't get the context. Which bug in glusterfs api are you referring to?

@pkalever
Contributor

Sorry, didn't get the context. Which bug in glusterfs api are you referring to?

@pranithk Xiubo had hit a case where he got old block metadata from blockGetMetaInfo() even after updating the block metadata with GB_METAUPDATE_OR_GOTO().

If you remember, in GB_METAUPDATE_OR_GOTO we basically open the metafile, write the new details, and close it in place, which should flush the data to the metafile. But it looks like with glusterfs-api-devel-6.0-0.4.rc1.fc29.x86_64 that is not happening.

I have tested the same with glusterfs-api-5.5-1.fc29.x86_64 and it works for me as expected.

@pranithk so I wanted to check with you whether you are aware of any such bugs.

@lxbsz A hack like open-iscsi/tcmu-runner#546 is not preferred. Instead we need to report this and get it fixed in the filesystem itself.

Thanks!

@lxbsz
Collaborator Author

lxbsz commented Apr 16, 2019

Sorry, didn't get the context. Which bug in glusterfs api are you referring to?

@pranithk Xiubo had hit a case where he got old block metadata from blockGetMetaInfo() even after updating the block metadata with GB_METAUPDATE_OR_GOTO().

If you remember, in GB_METAUPDATE_OR_GOTO we basically open the metafile, write the new details, and close it in place, which should flush the data to the metafile. But it looks like with glusterfs-api-devel-6.0-0.4.rc1.fc29.x86_64 that is not happening.

I have tested the same with glusterfs-api-5.5-1.fc29.x86_64 and it works for me as expected.

@pranithk so I wanted to check with you whether you are aware of any such bugs.

@lxbsz A hack like open-iscsi/tcmu-runner#546 is not preferred. Instead we need to report this and get it fixed in the filesystem itself.

Yeah, agreed. But we cannot be sure that users or customers will be aware of this. In tcmu-runner we just try to work around it when a buggy glusterfs version is in use, but that does not fix it completely.

Does that make sense?

Thanks!

@pranithk
Member

Sorry, didn't get the context. Which bug in glusterfs api are you referring to?

@pranithk Xiubo had hit a case where he got old block metadata from blockGetMetaInfo() even after updating the block metadata with GB_METAUPDATE_OR_GOTO().

If you remember, in GB_METAUPDATE_OR_GOTO we basically open the metafile, write the new details, and close it in place, which should flush the data to the metafile. But it looks like with glusterfs-api-devel-6.0-0.4.rc1.fc29.x86_64 that is not happening.

I have tested the same with glusterfs-api-5.5-1.fc29.x86_64 and it works for me as expected.

@pranithk so I wanted to check with you whether you are aware of any such bugs.

@lxbsz A hack like open-iscsi/tcmu-runner#546 is not preferred. Instead we need to report this and get it fixed in the filesystem itself.

Thanks!

@pkalever / @lxbsz Okay, makes sense. Could you raise a bug with the steps on the Gluster Bugzilla with the component set to 'core'? I am not sure which xlator could have led to this problem; I will need to git-bisect to figure it out and may have to reassign the bug to another developer based on that. Please make sure to give the exact steps, since the bug may get assigned to a component whose developer does not know gluster-block well enough. If you have an automated test/script, that would be best.

@lxbsz
Collaborator Author

lxbsz commented Apr 16, 2019

Sorry, didn't get the context. Which bug in glusterfs api are you referring to?

@pranithk Xiubo had hit a case where he got old block metadata from blockGetMetaInfo() even after updating the block metadata with GB_METAUPDATE_OR_GOTO().
If you remember, in GB_METAUPDATE_OR_GOTO we basically open the metafile, write the new details, and close it in place, which should flush the data to the metafile. But it looks like with glusterfs-api-devel-6.0-0.4.rc1.fc29.x86_64 that is not happening.
I have tested the same with glusterfs-api-5.5-1.fc29.x86_64 and it works for me as expected.
@pranithk so I wanted to check with you whether you are aware of any such bugs.
@lxbsz A hack like open-iscsi/tcmu-runner#546 is not preferred. Instead we need to report this and get it fixed in the filesystem itself.
Thanks!

@pkalever / @lxbsz Okay, makes sense. Could you raise a bug with the steps on the Gluster Bugzilla with the component set to 'core'? I am not sure which xlator could have led to this problem; I will need to git-bisect to figure it out and may have to reassign the bug to another developer based on that. Please make sure to give the exact steps, since the bug may get assigned to a component whose developer does not know gluster-block well enough. If you have an automated test/script, that would be best.

@pranithk Yeah, makes sense. I will raise a BZ for this today.

Thanks.

@pkalever
Contributor

@pranithk thanks for the quick turnaround.

Yes, Xiubo seems to have hit it easily just by running ./tests/basic.t; hopefully that will be enough.

@lxbsz please make sure to point to our installation guide from the README in the BZ description, to help the gluster core developers quickly come up to speed with the setup.

And we will be happy to further assist on the BZ in case any other info is required.

Thanks!

@lxbsz
Collaborator Author

lxbsz commented Apr 16, 2019

@pranithk thanks for the quick turnaround.

Yes, Xiubo seems to have hit it easily just by running ./tests/basic.t; hopefully that will be enough.

@lxbsz please make sure to point to our installation guide from the README in the BZ description, to help the gluster core developers quickly come up to speed with the setup.

And we will be happy to further assist on the BZ in case any other info is required.

Sure.

The bugzilla:
https://bugzilla.redhat.com/show_bug.cgi?id=1700295

Thanks!

@pkalever
Contributor

@lxbsz From the BZ, it looks like you were not using the recommended settings on the block-hosting volume.

If that is the case, can we close it?

Thanks!

@lxbsz
Collaborator Author

lxbsz commented May 15, 2020

Yeah, this issue has been resolved with the recommended settings.
I will close it.

@lxbsz lxbsz closed this May 15, 2020