CI failed with `database is locked` #6066

link2xt · 2024-10-19T15:10:40Z

Here is a failure log:

         snapshot = group.get_basic_snapshot()
        assert snapshot.name == "test group"
        group.set_name("new name")
        snapshot = group.get_full_snapshot()
        assert snapshot.name == "new name"
    
>       msg = group.send_message(text="hi")

tests/test_something.py:221: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.tox/py/lib/python3.13/site-packages/deltachat_rpc_client/chat.py:144: in send_message
    msg_id = self._rpc.send_msg(self.account.id, self.id, draft)
.tox/py/lib/python3.13/site-packages/deltachat_rpc_client/rpc.py:42: in __call__
    return future()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <deltachat_rpc_client.rpc.RpcFuture object at 0x7f6be33ff230>

    def __call__(self):
        self.event.wait()
        response = self.rpc.request_results.pop(self.request_id)
        if "error" in response:
>           raise JsonRpcError(response["error"])
E           deltachat_rpc_client.rpc.JsonRpcError: {'code': -1, 'message': 'database is locked\n\nCaused by:\n    Error code 5: The database file is locked'}

.tox/py/lib/python3.13/site-packages/deltachat_rpc_client/rpc.py:28: JsonRpcError
------------------------------ Captured log call -------------------------------

We should at least add more context to this error to know which SQL query failed. But if it is an SQLITE_LOCKED, this should never happen and it indicates there is some problem with converting read transaction to write transaction. Maybe some INSERT with SELECT that should be wrapped into immediate transaction to avoid SELECT getting a read lock that needs to be later converted to write lock or something like this?

The text was updated successfully, but these errors were encountered:

iequidoo · 2024-10-19T15:57:38Z

5 is SQLITE_BUSY, but it indeed should never happen, probably write_lock() is forgotten for a write transaction somewhere or so.

link2xt · 2024-10-19T20:49:34Z

Another failure (on ubuntu-latest), this time about remove_contact:

         assert (msg.get_snapshot()).text == "hi"
        group.forward_messages([msg])
    
        group.set_draft(text="test draft")
        draft = group.get_draft()
        assert draft.text == "test draft"
        group.remove_draft()
        assert not group.get_draft()
    
        assert group.get_messages()
        group.get_fresh_message_count()
        group.mark_noticed()
        assert group.get_contacts()
>       group.remove_contact(alice_chat_bob)

tests/test_something.py:235: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.tox/py/lib/python3.7/site-packages/deltachat_rpc_client/chat.py:231: in remove_contact
    self._rpc.remove_contact_from_chat(self.account.id, self.id, contact_id)
.tox/py/lib/python3.7/site-packages/deltachat_rpc_client/rpc.py:42: in __call__
    return future()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <deltachat_rpc_client.rpc.RpcFuture object at 0x7f661c481650>

    def __call__(self):
        self.event.wait()
        response = self.rpc.request_results.pop(self.request_id)
        if "error" in response:
>           raise JsonRpcError(response["error"])
E           deltachat_rpc_client.rpc.JsonRpcError: {'code': -1, 'message': 'database is locked\n\nCaused by:\n    Error code 5: The database file is locked'}

It sends a message internally, so could be the same problem.

link2xt · 2024-10-20T08:27:33Z

Running while pytest -s tests/test_something.py::test_chat ; do :;done locally in 3 terminals in parallel eventually fails. Probably one is enough as two failed, running in parallel just makes it fail faster.

link2xt · 2024-10-20T11:18:54Z

With much more logging I collected these errors:

deltachat_rpc_client.rpc.JsonRpcError: {'code': -1, 'message': 'anyhow: Failed to remove contact from chats_contacts table: Failed to EXECUTE SQL query "DELETE FROM chats_contacts WHERE chat_id=? AND contact_id=?": database is locked: Error code 5: The database file is locked'}

deltachat_rpc_client.rpc.JsonRpcError: {'code': -1, 'message': 'anyhow: Failed to send created message: send_msg_inner: prepare_send_msg: Failed to create send jobs: Failed to update message subject: Failed to EXECUTE SQL query "UPDATE msgs SET subject=? WHERE id=?;": database is locked: Error code 5: The database file is locked'}

deltachat_rpc_client.rpc.JsonRpcError: {'code': -1, 'message': 'anyhow: Failed to create group: Failed to INSERT SQL query "INSERT INTO chats\\n        (type, name, grpid, param, created_timestamp)\\n        VALUES(?, ?, ?, \'U=1\', ?);": database is locked: Error code 5: The database file is locked'}

deltachat_rpc_client.rpc.JsonRpcError: {'code': -1, 'message': 'anyhow: Failed to add contact to chat: Transaction body failed: database is locked: Error code 5: The database file is locked'}

So all sorts of JSON-RPC requests fail, even plain INSERT.

link2xt · 2024-10-20T11:24:57Z

WAL documentation has this list of reasons SQLITE_BUSY may be returned:
https://www.sqlite.org/wal.html#sometimes_queries_return_sqlite_busy_in_wal_mode
Seems the list is non-exhaustive. Maybe some connection merges the WAL into the database or something like this.

link2xt · 2024-10-20T11:50:16Z

probably write_lock() is forgotten for a write transaction somewhere or so.

We have a very limited API in sql.rs that does not allow to just get a connection without write lock and start writing there. It's possible to execute something like INSERT or UPDATE inside query_row (and we actually have two RETURNING statements that do so, but in a transaction), but I don't think this is what happens here.

Currently I plan to try:

~~Replacing prepare_cached with prepare to avoid long-lived prepared statements.~~ (edit: did not help)
Try reducing the testcase

link2xt · 2024-10-20T12:42:36Z

I tried setting PRAGMA wal_autocheckpoint=0; next to busy_timeout = 0 and it did not help, so it is not about some connection doing WAL checkpointing.

link2xt · 2024-10-20T12:50:48Z

Will try to set pragma query_only=0 or pragma query_only=1 when I get a write or read-only connection to make sure we never write into read-only connection.

link2xt · 2024-10-20T13:51:54Z

5 is SQLITE_BUSY, but it indeed should never happen, probably write_lock() is forgotten for a write transaction somewhere or so.

I made a PR that rules out the possibility of us accidentally writing to a read connection: #6076

link2xt · 2024-10-20T17:38:13Z

I cannot reproduce it anymore, but probably because of the network conditions change or something like that. Don't think it is fixed for real, but without the ability to reproduce it cannot debug it further.

link2xt · 2024-10-21T10:56:57Z

Now that we have query_only argument to Pool.get() anyway since #6076, maybe we should indeed dedicate a single write connection in the Pool and instead of having write mutex in SQL make Pool.get() wait until write connection is returned. Pool locks connections behind mutex anyway, can as well have a write_connection: Mutex<Option<Connection>>.

If it does not negatively affect performance, having a separate connection instead of toggling query_only all the time looks a bit cleaner.

Even without dedicated write connection, write mutex is better moved inside the Pool so it never returns two query_only=0 connections at the same time. EDIT: made a PR #6084 implementing it.

See discussion at #6076 (comment)

iequidoo · 2024-10-21T18:20:40Z

I tried setting PRAGMA wal_autocheckpoint=0; next to busy_timeout = 0 and it did not help, so it is not about some connection doing WAL checkpointing.

Still, seems we need to be prepared for this. Just in case citing https://www.sqlite.org/wal.html#sometimes_queries_return_sqlite_busy_in_wal_mode here to not forget about it:

When the last connection to a particular database is closing, that connection will acquire an exclusive lock for a short time while it cleans up the WAL and shared-memory files. If a separate attempt is made to open and query the database while the first connection is still in the middle of its cleanup process, the second connection might get an SQLITE_BUSY error.

link2xt · 2024-10-21T18:48:06Z

But we are never closing connections, so how can this happen? Connections returned to the pool are not closed.

iequidoo · 2024-10-21T19:16:10Z

You're right, but then there's another problem that WAL swells indefinitely. E.g. my WAL now is ~200M which is much more than 1000 pages (SQLITE_DEFAULT_WAL_AUTOCHECKPOINT macro).

link2xt · 2024-10-21T19:26:50Z

Strange, the largest WAL I have here is 8.7M.

iequidoo · 2024-10-21T19:36:49Z

My DC Desktop is running since Sep 21, maybe this is the reason?

link2xt · 2024-10-21T20:27:05Z

I think I found the reason: we run PRAGMA incremental_vacuum via query_row_optional that uses read-only connection. With the recent changes it will always fail:

sqlite> PRAGMA query_only=1;
sqlite> PRAGMA incremental_vacuum;
Runtime error: attempt to write a readonly database (8)

Here is the code that should run from call_write or similar:

deltachat-core-rust/src/sql.rs

Lines 718 to 725 in 61fd0d4

    
           match context 
        
               .sql 
        
               .query_row_optional("PRAGMA incremental_vacuum", (), |_row| Ok(())) 
        
               .await 
        
           { 
        
               Err(err) => { 
        
                   warn!(context, "Failed to run incremental vacuum: {err:#}."); 
        
               }

This is also what happens in recent CI failures:

11.66 [events-ac4] INFO src/sql.rs:843: Start housekeeping...
11.66 [events-ac4] INFO src/sql.rs:882: 3 files in use.
11.66 [events-ac4] WARNING src/sql.rs:769: Failed to run incremental vacuum: attempt to write a readonly database: Error code 8: Attempt to write a readonly database.
11.66 [events-ac4] INFO src/sql.rs:824: Housekeeping done.

We run housekeeping once on just created contexts, so it makes sense that this affects the tests.

link2xt · 2024-10-21T20:46:50Z

We can already close this issue as query_only prevents locking the database with a read connection, but for PRAGMA incremental_vacuum to work I made a PR #6087.

link2xt · 2024-10-21T21:09:01Z

Regarding WAL I opened an issue #6089

link2xt · 2024-10-22T23:35:13Z

#6087 is merged but there was another failure:
#6087 (comment)

link2xt · 2024-10-25T19:34:07Z

Let's close this until it fails again, at least now failures should be greatly reduced as incremental_vacuum does not happen on read connection immediately after account creation.

link2xt changed the title ~~CI failed with 'database is locked~~ CI failed with database is locked Oct 19, 2024

link2xt added the bug Something is not working label Oct 19, 2024

link2xt mentioned this issue Oct 19, 2024

feat: add more context to send_msg errors #6067

Merged

link2xt mentioned this issue Oct 19, 2024

More context for remove_contact_from_chat #6071

Draft

link2xt self-assigned this Oct 20, 2024

link2xt mentioned this issue Oct 20, 2024

feat(sql): set PRAGMA query_only to avoid writing on read-only connections #6076

Merged

link2xt removed their assignment Oct 20, 2024

link2xt self-assigned this Oct 21, 2024

This was referenced Oct 21, 2024

refactor(sql): move write mutex into connection pool #6084

Merged

test_markseen_message_and_mdn failed with "message could not be send, does chat exist?" #6078

Closed

link2xt mentioned this issue Oct 21, 2024

fix(sql): run PRAGMA incremental_vacuum on a write connection #6087

Merged

link2xt mentioned this issue Oct 21, 2024

Run wal_checkpoint during housekeeping #6089

Open

link2xt closed this as completed Oct 25, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CI failed with `database is locked` #6066

CI failed with `database is locked` #6066

link2xt commented Oct 19, 2024

iequidoo commented Oct 19, 2024

link2xt commented Oct 19, 2024 •

edited

Loading

link2xt commented Oct 20, 2024

link2xt commented Oct 20, 2024

link2xt commented Oct 20, 2024 •

edited

Loading

link2xt commented Oct 20, 2024 •

edited

Loading

link2xt commented Oct 20, 2024

link2xt commented Oct 20, 2024

link2xt commented Oct 20, 2024

link2xt commented Oct 20, 2024

link2xt commented Oct 21, 2024 •

edited

Loading

iequidoo commented Oct 21, 2024

link2xt commented Oct 21, 2024

iequidoo commented Oct 21, 2024

link2xt commented Oct 21, 2024

iequidoo commented Oct 21, 2024

link2xt commented Oct 21, 2024 •

edited

Loading

link2xt commented Oct 21, 2024

link2xt commented Oct 21, 2024

link2xt commented Oct 22, 2024

link2xt commented Oct 25, 2024

CI failed with database is locked #6066

CI failed with database is locked #6066

Comments

link2xt commented Oct 19, 2024

iequidoo commented Oct 19, 2024

link2xt commented Oct 19, 2024 • edited Loading

link2xt commented Oct 20, 2024

link2xt commented Oct 20, 2024

link2xt commented Oct 20, 2024 • edited Loading

link2xt commented Oct 20, 2024 • edited Loading

link2xt commented Oct 20, 2024

link2xt commented Oct 20, 2024

link2xt commented Oct 20, 2024

link2xt commented Oct 20, 2024

link2xt commented Oct 21, 2024 • edited Loading

iequidoo commented Oct 21, 2024

link2xt commented Oct 21, 2024

iequidoo commented Oct 21, 2024

link2xt commented Oct 21, 2024

iequidoo commented Oct 21, 2024

link2xt commented Oct 21, 2024 • edited Loading

link2xt commented Oct 21, 2024

link2xt commented Oct 21, 2024

link2xt commented Oct 22, 2024

link2xt commented Oct 25, 2024

CI failed with `database is locked` #6066

CI failed with `database is locked` #6066

link2xt commented Oct 19, 2024 •

edited

Loading

link2xt commented Oct 20, 2024 •

edited

Loading

link2xt commented Oct 20, 2024 •

edited

Loading

link2xt commented Oct 21, 2024 •

edited

Loading

link2xt commented Oct 21, 2024 •

edited

Loading