task manager: task id is reported as not found when the task id is available in system.tablets table #21975

Open
asias opened this issue Dec 19, 2024 · 1 comment · May be fixed by #21995
Labels
backport/none Backport is not required

Comments

Contributor

asias commented Dec 19, 2024

  • Start a tablet repair.
  • Check that the task id is present in the system.tablets table.
  • Abort the task using that id.
  • abort_task reports an error saying that the task id was not found.

This can be reproduced with the test code in PR #21973.
To trigger it, remove the get_task_status loop below and abort as soon as the tablet_task_id is obtained.

@pytest.mark.asyncio
@skip_mode('release', 'error injections are not supported in release mode')
async def test_tablet_repair_error_delete(manager: ManagerClient):
    servers, cql, hosts, table_id = await create_table_insert_data_for_repair(manager)

    token = -1
    async def repair_task():
        await inject_error_on(manager, "repair_tablet_fail_on_rpc_call", servers)
        # Check failed repair request can be deleted
        await manager.api.tablet_repair(servers[0].ip_addr, "test", "test", token)

    async def del_repair_task():
        # Poll system.tablets until the repair request's task id shows up.
        tablet_task_id = None
        while tablet_task_id is None:
            tablet_task_id = await get_tablet_task_id(cql, hosts[0], table_id, token)
        # Wait until the task manager can see the task before aborting it.
        status = None
        while status is None:
            try:
                status = await manager.api.get_task_status(servers[0].ip_addr, tablet_task_id)
            except Exception:
                # Task not visible to the task manager yet; retry.
                status = None
        await manager.api.abort_task(servers[0].ip_addr, tablet_task_id)

    await asyncio.gather(repair_task(), del_repair_task())
    await inject_error_off(manager, "repair_tablet_fail_on_rpc_call", servers)
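
The failing variant described above (no get_task_status wait before the abort) can be sketched as a small helper. This is only an illustration: `abort_as_soon_as_visible` and its two callable parameters are hypothetical names, not part of the test suite. In the test they would presumably wrap `get_tablet_task_id(cql, hosts[0], table_id, token)` and `manager.api.abort_task(servers[0].ip_addr, ...)`.

```python
import asyncio


async def abort_as_soon_as_visible(get_task_id, abort_task):
    """Poll for the task id, then abort immediately (hypothetical helper).

    get_task_id: async callable returning the task id, or None if not yet set.
    abort_task:  async callable taking the task id.
    """
    task_id = None
    while task_id is None:
        task_id = await get_task_id()
    # No get_task_status loop here: the abort is sent as soon as the id is
    # known, which is the window in which the issue reports "not found".
    await abort_task(task_id)
    return task_id
```

With the real API calls plugged in, the abort is expected to hit the HTTP 400 reported in this issue whenever it arrives before the task manager can see the task.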
asias changed the title from "task manger: task id is reported as not found when the task id is available in system.tablets table" to "task manager: task id is reported as not found when the task id is available in system.tablets table" on Dec 19, 2024
Contributor Author

asias commented Dec 19, 2024

The reproducer can be found here:

git@github.com:asias/scylla.git reproducer_for_21975

            if allow_failed:
                return await resp.json()
            if resp.status != 200:
                text = await resp.text()
>               raise HTTPError(uri, resp.status, params, json, text)
E               test.pylib.rest_client.HTTPError: HTTP error 400, uri: http://127.226.137.1:10000/task_manager/abort_task/26614b60-bddc-11ef-8112-adb2f382c014, params: None, json: None, body:
E               {"message": "task with id 26614b60-bddc-11ef-8112-adb2f382c014 not found", "code": 400}

Deexie added a commit to Deexie/scylla that referenced this issue Dec 19, 2024
Currently, when the tablet repair is started, info regarding
the operation is kept in the system.tablets. The new tablet states
are reflected in memory after load_topology_state is called.
Before that, the data in the table and the memory aren't consistent.

To check the supported operations, tablet_virtual_task uses in-memory
tablet_metadata. Hence, it may not see the operation, even though
its info is already kept in system.tablets table.

Run read barrier in tablet_virtual_task::contains to ensure it will
see the latest data. Add a test to check it.

Fixes: scylladb#21975.
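
The inconsistency described in this commit message can be illustrated with a toy model. This is a sketch under heavy simplification, not Scylla code: `ToyNode`, `reload_state`, and the `read_barrier` flag are hypothetical names, and the "barrier" here is just a reload of the in-memory copy, whereas the real read barrier is a cluster-level operation.

```python
import asyncio


class ToyNode:
    """Toy model: a persisted table plus an in-memory copy refreshed on demand."""

    def __init__(self):
        self.system_tablets = {}  # stands in for the system.tablets table
        self.in_memory = {}       # stands in for the in-memory tablet_metadata

    def start_repair(self, task_id):
        # The request is persisted first; the in-memory copy is not updated yet.
        self.system_tablets[task_id] = "repair"

    async def reload_state(self):
        # Stands in for load_topology_state: copy the table into memory.
        await asyncio.sleep(0)
        self.in_memory = dict(self.system_tablets)

    async def contains(self, task_id, read_barrier=False):
        # Without the barrier, the lookup may run against stale memory.
        if read_barrier:
            await self.reload_state()
        return task_id in self.in_memory


async def demo():
    node = ToyNode()
    node.start_repair("t1")
    stale = await node.contains("t1")                     # before any reload
    fresh = await node.contains("t1", read_barrier=True)  # after forced reload
    return stale, fresh
```

In this model the first lookup misses the task even though it is already in the table, and the lookup that first synchronizes the in-memory state finds it, which mirrors the reasoning behind adding the barrier to `tablet_virtual_task::contains`.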
@github-actions github-actions bot added the backport/none Backport is not required label Dec 19, 2024
Deexie added a commit to Deexie/scylla that referenced this issue Dec 20, 2024