
Delete: remove the metafile check #209

Closed
wants to merge 1 commit into from

Conversation

lxbsz
Collaborator

@lxbsz lxbsz commented Apr 16, 2019

What does this PR achieve? Why do we need it?

For some reason, after updating the metafile and then reading it back
immediately, the new contents are sometimes not visible yet. As a result,
even though all the deletions succeed, the command still returns a failure.

Checking the ->exit status is now enough.

For the case:

  • When gluster-block delete was executed for the first time on this
    node, the deletion was successful. But before the node could send the
    response, gluster-blockd died (the addr meta status would be
    CLEANUPFAIL).

Even for this case we can check the result from the ->exit status; there is
no need to check it in the metafile.

This deletion failure has also been seen in customer cases.
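
To illustrate, here is a minimal sketch of what relying on the ->exit status amounts to; the struct and function names are hypothetical, not the actual gluster-block code:

```c
/*
 * Illustrative sketch only: hypothetical names, not the real gluster-block
 * structures. The idea is to trust the exit status each remote node already
 * returned for the delete, instead of re-reading the metafile afterwards.
 */
#include <stddef.h>

struct remote_delete_reply {
    int   exit;   /* exit status reported back by the remote node */
    char *out;    /* captured command output, unused here */
};

static int
all_deletes_succeeded(const struct remote_delete_reply *replies, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (replies[i].exit != 0)
            return 0;   /* at least one node reported a real failure */
    }
    return 1;           /* every node succeeded; no metafile re-read needed */
}
```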

Does this PR fix issues?

Fixes: #204

Signed-off-by: Xiubo Li <[email protected]>
@ghost ghost assigned lxbsz Apr 16, 2019
@ghost ghost added the in progress label Apr 16, 2019
@lxbsz
Collaborator Author

lxbsz commented Apr 16, 2019

@amarts @pkalever
Please review, thanks.

Contributor

@pkalever pkalever left a comment

For some reason, after updating the metafile and then reading it back
immediately, the new contents are sometimes not visible yet. As a result,
even though all the deletions succeed, the command still returns a failure.

@lxbsz This shouldn't happen, please see the implementation of GB_METAUPDATE_OR_GOTO().
We basically open the metafile and close it in place, which should flush the data to the metafile.
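
For reference, the open/write/close pattern described here looks roughly like the gfapi sketch below. It is not the actual GB_METAUPDATE_OR_GOTO() macro (which also handles formatting, locking, and error labels), and the O_APPEND | O_SYNC flags are an assumption based on this thread:

```c
/*
 * Rough sketch of an append-and-close metafile update via gfapi; NOT the
 * real GB_METAUPDATE_OR_GOTO() macro. The open flags are an assumption.
 */
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <glusterfs/api/glfs.h>

static int
meta_append(glfs_t *fs, const char *metafile, const char *line)
{
    glfs_fd_t *fd;
    ssize_t    ret;

    fd = glfs_open(fs, metafile, O_WRONLY | O_APPEND | O_SYNC);
    if (!fd)
        return -1;

    ret = glfs_write(fd, line, strlen(line), 0);
    glfs_close(fd);    /* closed in place; the data should be flushed */

    return (ret < 0) ? -1 : 0;
}
```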

Can you check whether we were using the previously read values, or deferring the blockGetMetaInfo() call until after the update?

For the case:

  • When gluster-block delete was executed for the first time on this
    node, the deletion was successful. But before the node could send the
    response, gluster-blockd died (the addr meta status would be
    CLEANUPFAIL).

Even for this case we can check the result from the ->exit status; there is
no need to check it in the metafile.

Yes, probably this was achieved by 0a65c01?

So with the above PR, we can probably remove this extra read.

I will test this case and get back to you; for now this PR looks good to me.

Thanks!

@lxbsz
Collaborator Author

lxbsz commented Apr 16, 2019

For some reason, after updating the metafile and then reading it back
immediately, the new contents are sometimes not visible yet. As a result,
even though all the deletions succeed, the command still returns a failure.

@lxbsz This shouldn't happen, please see the implementation of GB_METAUPDATE_OR_GOTO().
We basically open the metafile and close it in place, which should flush the data to the metafile.

Yeah, I saw that. The SYNC flag is there, but it doesn't seem to be working here with gluster.

Can you check whether we were using the previously read values, or deferring the blockGetMetaInfo() call until after the update?

I checked this already: it calls blockGetMetaInfo() again and allocates a new info struct.
In my test case the delete does not always fail, only sometimes. I added some logs and my own code to read the metafile again, but I still couldn't see the new data.

Did you see the code I removed in this PR? That is where it does this.

The same applies to the modify-size case. In tcmu-runner I added a retry that waits up to 5 seconds, and that works well; it is the same situation as the delete case.
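
For reference, that tcmu-runner workaround amounts to roughly the retry loop below. It is only a sketch, not the actual change in open-iscsi/tcmu-runner#546, and read_meta_status() is a hypothetical helper using plain POSIX I/O:

```c
/*
 * Sketch of a "retry for ~5 seconds" workaround for a stale metafile read.
 * read_meta_status() is a hypothetical helper, not real tcmu-runner code.
 */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Re-read the metafile and look for the expected status line
 * (e.g. "CLEANUPSUCCESS"). Plain POSIX I/O here for brevity. */
static int
read_meta_status(const char *metafile, const char *expected)
{
    char    buf[4096] = {0};
    int     fd = open(metafile, O_RDONLY);
    ssize_t n;

    if (fd < 0)
        return -1;
    n = read(fd, buf, sizeof(buf) - 1);
    close(fd);
    return (n > 0 && strstr(buf, expected)) ? 0 : -1;
}

static int
wait_for_meta_update(const char *metafile, const char *expected)
{
    for (int waited = 0; waited < 5; waited++) {
        if (read_meta_status(metafile, expected) == 0)
            return 0;      /* the updated data finally became visible */
        sleep(1);          /* wait a bit and re-read */
    }
    return -1;             /* still stale after ~5 seconds */
}
```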

I can reproduce this very easily with the tests/basic.t script on RHEL 7.

Thanks.

@lxbsz
Collaborator Author

lxbsz commented Apr 16, 2019

The test output:


[root@rhel3 gluster-block]# ./tests/basic.t 
TEST : gluster vol create hosting-volume 192.168.195.164:/brick force
volume create: hosting-volume: success: please start the volume to access data
line 69 : OK

TEST : gluster vol start hosting-volume
volume start: hosting-volume: success
line 72 : OK

TEST : ./tests/gfapi-test hosting-volume 192.168.195.164
Test works as expected!
line 77 : OK

TEST : systemctl restart gluster-block-target.service gluster-blockd.service
line 83 : OK

TEST : gluster-block create hosting-volume/block-volume 192.168.195.164 1MiB
IQN: iqn.2016-12.org.gluster-block:c08d016e-3776-4c62-9dc6-bb0695ceb3c6
PORTAL(S):  192.168.195.164:3260
RESULT: SUCCESS
line 90 : OK

TEST : gluster-block delete hosting-volume/block-volume
SUCCESSFUL ON:   192.168.195.164
RESULT: SUCCESS
line 91 : OK

TEST : gluster-block create hosting-volume/block-volume auth enable 192.168.195.164 1MiB
IQN: iqn.2016-12.org.gluster-block:933a0039-48be-4b2e-881d-afa395e7144f
USERNAME: 933a0039-48be-4b2e-881d-afa395e7144f
PASSWORD: ea3bb915-84b8-4f14-96a9-66540909e7c3
PORTAL(S):  192.168.195.164:3260
RESULT: SUCCESS
line 94 : OK

TEST : gluster-block delete hosting-volume/block-volume
SUCCESSFUL ON:   192.168.195.164
RESULT: SUCCESS
line 95 : OK

TEST : gluster-block create hosting-volume/block-volume ha 1 192.168.195.164 1MiB
IQN: iqn.2016-12.org.gluster-block:58045d91-b0c7-441a-9e3d-e2e05e0375ce
PORTAL(S):  192.168.195.164:3260
RESULT: SUCCESS
line 98 : OK

TEST : gluster-block delete hosting-volume/block-volume
SUCCESSFUL ON:   192.168.195.164
RESULT: SUCCESS
line 99 : OK

TEST : gluster-block create hosting-volume/block-volume prealloc full 192.168.195.164 1MiB
IQN: iqn.2016-12.org.gluster-block:52c4025c-03b9-417f-9b49-eedd3dbd19a3
PORTAL(S):  192.168.195.164:3260
RESULT: SUCCESS
line 102 : OK

TEST : gluster-block delete hosting-volume/block-volume
SUCCESSFUL ON:   192.168.195.164
RESULT: SUCCESS
line 103 : OK

TEST : gluster-block create hosting-volume/block-volume ring-buffer 32 192.168.195.164 1MiB
IQN: iqn.2016-12.org.gluster-block:47a771ad-88cd-4a11-9a52-057afb60cd5e
PORTAL(S):  192.168.195.164:3260
RESULT: SUCCESS
line 106 : OK

TEST : gluster-block delete hosting-volume/block-volume
SUCCESSFUL ON:   192.168.195.164
RESULT: FAIL
line 107 : NOT OK

[root@rhel3 gluster-block]# 

The logs:


[2019-04-16 07:36:04.219071] INFO: delete cli request, volume=hosting-volume blockname=block-volume [at block_svc_routines.c+4612 :<block_delete_cli_1_svc_st>]
[2019-04-16 07:36:04.232500] INFO: delete request, blockname=block-volume filename=47a771ad-88cd-4a11-9a52-057afb60cd5e [at block_svc_routines.c+4729 :<block_delete_1_svc_st>]
[2019-04-16 07:36:04.976347] INFO: command exit code, 0 [at block_svc_routines.c+4760 :<block_delete_1_svc_st>]
[2019-04-16 07:36:04.980828] WARNING: glusterBlockDeleteRemoteAsync: return -1 failed in remote async delete for block block-volume on volume hosting-volume [at block_svc_routines.c+2970 :<glusterBlockCleanUp>]
[2019-04-16 07:36:04.980873] WARNING: glusterBlockCleanUp: return -1 on block block-volume for volume hosting-volume [at block_svc_routines.c+4686 :<block_delete_cli_1_svc_st>]

From the mountpoint:

[root@rhel3 mnt]# cat block-meta/block-volume 
VOLUME: hosting-volume
GBID: 47a771ad-88cd-4a11-9a52-057afb60cd5e
HA: 1
ENTRYCREATE: INPROGRESS
PRIOPATH: 192.168.195.164
SIZE: 1048576
RINGBUFFER: 32
ENTRYCREATE: SUCCESS
192.168.195.164: CONFIGINPROGRESS
192.168.195.164: CONFIGSUCCESS
192.168.195.164: CLEANUPINPROGRESS
192.168.195.164: CLEANUPSUCCESS
[root@rhel3 mnt]# 


@pkalever
Contributor

@pranithk are you aware of any similar bug in the glusterfs api?

Many Thanks!

@pkalever
Contributor

@lxbsz BTW I'm using the latest master gluster-block (+ PR#205) with glusterfs-api-5.5-1.fc29.x86_64 on Fedora 29, kernel 5.0.3-200.fc29.x86_64.

I have run this test at least 25 times and still have not hit this issue.

Can you confirm your glusterfs-api version and check whether the above-mentioned version works for you as well?

@lxbsz
Collaborator Author

lxbsz commented Apr 16, 2019

@lxbsz BTW I'm using the latest master gluster-block (+ PR#205) with glusterfs-api-5.5-1.fc29.x86_64 on Fedora 29, kernel 5.0.3-200.fc29.x86_64.

I have run this test at least 25 times and still have not hit this issue.

Can you confirm your glusterfs-api version and check whether the above-mentioned version works for you as well?

I am using RHEL 7 with glusterfs-api-devel-6.0-0.4.rc1.fc29.x86_64 and the upstream gluster-block/tcmu-runner code.

Will try this later.

@pranithk
Member

@pranithk are you aware of any similar bug in the glusterfs api?

Many Thanks!

Sorry, didn't get the context. Which bug in glusterfs api are you referring to?

@pkalever
Contributor

Sorry, didn't get the context. Which bug in glusterfs api are you referring to?

@pranithk Xiubo had hit a case where he got old block metadata from blockGetMetaInfo() even after updating the block metadata with GB_METAUPDATE_OR_GOTO().

If you remember, in GB_METAUPDATE_OR_GOTO we basically open the metafile, write the new details, and close it in place, which should flush the data to the metafile. But it looks like with glusterfs-api-devel-6.0-0.4.rc1.fc29.x86_64 that is not happening.

I have tested the same with glusterfs-api-5.5-1.fc29.x86_64 and it works for me as expected.

@pranithk so I wanted to check with you whether you are aware of any such bugs.

@lxbsz A hack like open-iscsi/tcmu-runner#546 is not preferred. Instead we need to report this and get it fixed in the filesystem itself.

Thanks!

@lxbsz
Collaborator Author

lxbsz commented Apr 16, 2019

Sorry, didn't get the context. Which bug in glusterfs api are you referring to?

@pranithk Xiubo had hit a case where he got old block metadata from blockGetMetaInfo() even after updating the block metadata with GB_METAUPDATE_OR_GOTO().

If you remember, in GB_METAUPDATE_OR_GOTO we basically open the metafile, write the new details, and close it in place, which should flush the data to the metafile. But it looks like with glusterfs-api-devel-6.0-0.4.rc1.fc29.x86_64 that is not happening.

I have tested the same with glusterfs-api-5.5-1.fc29.x86_64 and it works for me as expected.

@pranithk so I wanted to check with you whether you are aware of any such bugs.

@lxbsz A hack like open-iscsi/tcmu-runner#546 is not preferred. Instead we need to report this and get it fixed in the filesystem itself.

Yeah, agreed. But we cannot be sure that users or customers will be aware of this. In tcmu-runner we just try to work around it when a buggy glusterfs version is in use, but that does not fix it completely.

Does that make sense?

Thanks!

@pranithk
Member

Sorry, didn't get the context. Which bug in glusterfs api are you referring to?

@pranithk Xiubo had hit a case where he got old block metadata from blockGetMetaInfo() even after updating the block metadata with GB_METAUPDATE_OR_GOTO().

If you remember, in GB_METAUPDATE_OR_GOTO we basically open the metafile, write the new details, and close it in place, which should flush the data to the metafile. But it looks like with glusterfs-api-devel-6.0-0.4.rc1.fc29.x86_64 that is not happening.

I have tested the same with glusterfs-api-5.5-1.fc29.x86_64 and it works for me as expected.

@pranithk so I wanted to check with you whether you are aware of any such bugs.

@lxbsz A hack like open-iscsi/tcmu-runner#546 is not preferred. Instead we need to report this and get it fixed in the filesystem itself.

Thanks!

@pkalever / @lxbsz Okay, makes sense. Could you raise a bug with the steps on the Gluster Bugzilla with the component set to 'core'? I am not sure which xlator could have led to this problem; I will need to git-bisect to figure it out and may have to reassign the bug to another developer based on that. Please make sure to give the exact steps, since the bug may get assigned to a component whose developer does not know gluster-block well enough. If you have an automated test/script, that would be best.

@lxbsz
Collaborator Author

lxbsz commented Apr 16, 2019

Sorry, didn't get the context. Which bug in glusterfs api are you referring to?

@pranithk Xiubo had hit a case where he got old block metadata from blockGetMetaInfo() even after updating the block metadata with GB_METAUPDATE_OR_GOTO().
If you remember, in GB_METAUPDATE_OR_GOTO we basically open the metafile, write the new details, and close it in place, which should flush the data to the metafile. But it looks like with glusterfs-api-devel-6.0-0.4.rc1.fc29.x86_64 that is not happening.
I have tested the same with glusterfs-api-5.5-1.fc29.x86_64 and it works for me as expected.
@pranithk so I wanted to check with you whether you are aware of any such bugs.
@lxbsz A hack like open-iscsi/tcmu-runner#546 is not preferred. Instead we need to report this and get it fixed in the filesystem itself.
Thanks!

@pkalever / @lxbsz Okay, makes sense. Could you raise a bug with the steps on the Gluster Bugzilla with the component set to 'core'? I am not sure which xlator could have led to this problem; I will need to git-bisect to figure it out and may have to reassign the bug to another developer based on that. Please make sure to give the exact steps, since the bug may get assigned to a component whose developer does not know gluster-block well enough. If you have an automated test/script, that would be best.

@pranithk Yeah, makes sense. I will raise a BZ for this today.

Thanks.

@pkalever
Contributor

@pranithk thanks for the quick turnaround.

Yes, Xiubo seems to have hit it easily just by running ./tests/basic.t; hopefully that will be enough.

@lxbsz please make sure to point to our installation guide from the README in the BZ description, to help the gluster core developers quickly come up to speed with the setup.

And we will be happy to further assist on the BZ in case any other info is required.

Thanks!

@lxbsz
Collaborator Author

lxbsz commented Apr 16, 2019

@pranithk thanks for the quick turnaround.

Yes, Xiubo seems to have hit it easily just by running ./tests/basic.t; hopefully that will be enough.

@lxbsz please make sure to point to our installation guide from the README in the BZ description, to help the gluster core developers quickly come up to speed with the setup.

And we will be happy to further assist on the BZ in case any other info is required.

Sure.

The bugzilla:
https://bugzilla.redhat.com/show_bug.cgi?id=1700295

Thanks!

@pkalever
Contributor

@lxbsz From the BZ, it looks like you were not using the recommended settings on the block-hosting volume.

If that is the case, can we close it?

Thanks!

@lxbsz
Collaborator Author

lxbsz commented May 15, 2020

Yeah, this issue has been resolved with the recommended settings.
I will close it.

@lxbsz lxbsz closed this May 15, 2020