-
Thank you for testing it!
Even when Elasticsearch is used, Stalwart needs to fetch the message and decode the text parts to be indexed.
Do you have the possibility to recompile Stalwart to test this? If so, you can edit `fts_index_queued` and add an early `return` at the top:

```rust
pub async fn fts_index_queued(&self) {
    return; // temporarily skip all full-text indexing (for testing only)
    // ...
}
```

If you are still seeing slow insertion speeds after this change, then it might be related to the RocksDB merge operator on bitmaps. In any case, I am going to test this in detail once I run the benchmarks.
-
-
The change disables full-text indexing for all FTS backends. That function parses the message and then calls the indexing code for the configured backend.

In this case, since you mentioned it getting slower after a few inserts, I suspect this might be related to the RocksDB merge operator on bitmaps. If this is really due to the bitmaps, then it is a tradeoff between read and write performance.

Recently I discussed the pros and cons of working with RoaringBitmaps on key-value stores with the developers behind SurrealDB. They've chosen to go with the fast-reads, slow-inserts approach: each time a bitmap is updated they need to retrieve it, modify it and store it. On Stalwart a per-backend approach was chosen; on RocksDB, bitmaps are updated through the merge operator.
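To illustrate the tradeoff, here is a toy sketch (not Stalwart or SurrealDB code): a plain Python set stands in for a RoaringBitmap, and a dict for the key-value store.

```python
kv = {}  # toy key-value store

def insert_read_modify_write(key: str, doc_id: int) -> None:
    """Fast reads, slow inserts: fetch the whole bitmap, mutate, store."""
    bitmap = kv.get(key, set())
    bitmap.add(doc_id)
    kv[key] = bitmap  # the full serialized value is rewritten on every insert

def insert_merge_style(key: str, doc_id: int) -> None:
    """Fast inserts, slower reads: append a small delta; a merge
    operator (as in RocksDB) folds deltas into the bitmap lazily."""
    kv.setdefault(key + ":deltas", []).append(doc_id)

def read_merge_style(key: str) -> set:
    """Reads must apply any pending deltas before returning the bitmap."""
    return kv.get(key, set()) | set(kv.get(key + ":deltas", []))
```

The read-modify-write variant rewrites the whole serialized bitmap on every insert, while the merge-style variant defers that cost to reads or background compaction, which is essentially what a RocksDB merge operator enables.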
-
I observe similar behavior with sqlite3, thus I doubt that RocksDB is really the problem. Even with RoaringBitmaps it cannot (well, should not) be so bad - all my tests were with an in-RAM database. I would say this is insanely slow: spending almost one hour to process 2.5G of data in RAM (where any fsync is a no-op), especially on a high-end system. There should be a better way (or better storage).
-
All SQL databases are much slower than a key-value store such as RocksDB or FoundationDB, even at the beginning of the inserts.

If you have time, you could try importing the messages using the CLI tool, because another suspicion I have is that insertions over IMAP are slower because the UIDs need to be calculated constantly.
-
Well, since my original account is stored in Dovecot, and it is neither Maildir nor mbox, I could not use the CLI tool directly; my only option would be to export it first to a compatible format. This would work once or twice, for cases like mine, but it does not scale well. On the other hand, importing the same data to Dovecot (+ Xapian FTS) over IMAP is 5(!) times faster and consumes less CPU: 10 minutes and we are done, no excessive resource usage, and this is not even a RAM-fs. Somehow Dovecot is not affected by UID calculation, so why is Stalwart?
-
Ok, it looks like Dovecot was not so quick after all - the indexer took an additional 20 minutes to finish, but at least it was done in the background (and not consuming several CPU cores). Still, it is faster than Stalwart's FTS. Thus, first optimization tip: don't do it "inline", send it to a queue and do it later (at least independently from storing) :)

Everything else that relates to the storage itself: at such a small scale (1 message at a time) there is no significant difference which storage backend is in use - key-value or SQL - storing a single message (even with metadata parsing/generation) takes a few milliseconds at worst, not hundreds (especially on SSD, or even in RAM). Now back to UIDs... RFC 9051 says that UIDs must be assigned in a strictly ascending fashion.

So there is no need for complicated machinery to "calculate" UIDs - we start with 1 for every new account, increase by 1 with every new message created, that's all. In any case, this cannot be the cause of extreme CPU usage, unless you mean something different when you say "UID" - but even if we calculated a SHA512 hash on every message, it would not noticeably affect processing speed.
-
It is not done inline. There is actually a queue and messages are indexed sequentially in the background. Also, since you mentioned it was slow even after disabling FTS, this seems to be something else.

There is indeed a big difference, because Stalwart is designed to work with key-value stores. Using an SQL server as a key-value store is much slower than RocksDB or FoundationDB, about 10 times slower for concurrent inserts on the same mailbox.

Stalwart is a JMAP and IMAP server and needs to do more work than Dovecot when inserting a message. Among other things, it needs to find a JMAP threadId for every message.

There is no complicated machinery; UIDs are assigned sequentially. If this is related to UIDs, it might be due to concurrency and transaction retries, not the incrementing of a counter (a sketch of how such retries hurt is below). Anyway, I will profile this once I run the benchmarks to find out what is causing it.
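A toy sketch of that failure mode (hypothetical code, not Stalwart's): with optimistic concurrency, each conflicting APPEND repeats the whole transaction, so even a plain counter increment can become expensive under contention.

```python
import threading

store = {"inbox/uidnext": (1, 0)}   # value, version
lock = threading.Lock()             # stands in for the storage engine

def compare_and_swap(key, expected_version, new_value) -> bool:
    """Commit only if nobody else committed since we read the key."""
    with lock:
        value, version = store[key]
        if version != expected_version:
            return False            # someone else committed first
        store[key] = (new_value, version + 1)
        return True

def append_message(build_transaction) -> int:
    while True:                     # each retry repeats all this work
        with lock:
            uid, version = store["inbox/uidnext"]
        build_transaction(uid)      # indexes, threadId, changelog, ...
        if compare_and_swap("inbox/uidnext", version, uid + 1):
            return uid
```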
-
Not really, JMAP needs both the threadId and the Id when accessing the messages. Doing it when the user opens the mailbox (in case the background task isn't done) is a bad user experience.

It's not about size or caching. A unique threadId needs to be generated taking into account all the message ids that the message references, as well as its subject. So each time a message is inserted, it is necessary to find all messages that reference the same message ids and have the same thread name (base subject) as the new message. If those matching messages in the store share a single threadId, then that threadId is reused. However, if messages arrive out of order (or are inserted out of order by imapsync), it is necessary to perform a thread merge operation, which involves finding the most common threadId and moving all messages to it. This requires multiple write operations, and every moved message needs to be logged in JMAP's changelog. The algorithm is briefly explained in the JMAP Mail RFC and sketched below.
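A rough Python sketch of that algorithm, simplified from the description above (the `db` object and its `messages`, `new_thread_id` and `changelog` members are hypothetical, not Stalwart's actual data model):

```python
from collections import Counter

def assign_thread(db, msg_refs: set[str], base_subject: str) -> str:
    # messages that reference any of the same ids and share the base subject
    matches = [m for m in db.messages
               if (m.refs & msg_refs) and m.base_subject == base_subject]
    thread_ids = {m.thread_id for m in matches}
    if not thread_ids:
        return db.new_thread_id()      # no match: start a new thread
    if len(thread_ids) == 1:
        return thread_ids.pop()        # single match: reuse that threadId
    # messages arrived out of order: merge into the most common threadId,
    # rewriting every moved message and logging it in the changelog
    winner = Counter(m.thread_id for m in matches).most_common(1)[0][0]
    for m in matches:
        if m.thread_id != winner:
            m.thread_id = winner
            db.changelog.append(("Email/changed", m.id))
    return winner
```

The expensive path is the merge: every message moved to the winning threadId costs an extra write plus a changelog entry.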
-
Even this should be nearly instant at such a small volume, especially initially while the mailbox is small. Insertions slow down significantly after about the first ~10K messages. I wonder what would happen on a mailbox with 200K-500K messages (like archives of mailing lists): if it takes seconds to insert just one message then it will be unusable.
Why multiple? One (modified) message = one write. The RFC says nothing about implementation; surely it could be optimized as you see fit.

I did a small test of this "theory" with a simple Python script which creates 10K unique messages. Each message is created with a unique subject and message-id, so no threading should take place, and at least no updates to already existing messages:

```python
import imaplib
import socket
from email.message import Message

conn = imaplib.IMAP4('localhost')
conn.login('admin', 'adminpass')
conn.socket().setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

for n in range(10000):
    msg = Message()
    msg['From'] = '[email protected]'
    msg['To'] = '[email protected]'
    msg['Message-Id'] = f'unique.message.id.{n}@nowhere'
    msg['Subject'] = f"This is message #{n}"
    msg.set_payload('...nothing...')
    # APPEND without flags or internal date
    conn.append('INBOX', '', None, str(msg).encode('utf-8'))
    if n % 100 == 0:
        print(f'{n:6}', end='\r', flush=True)

conn.logout()
```

... and I was surprised again. It took 809 seconds to create 10K messages (12 msg/s on average) even though no updates to already existing messages were expected to happen (assuming that was the problem). The CPU usage profile was actually more intriguing (at least half of this time is kernel time, by the way).

Feels like something has at least O(n) complexity, which should not be the case for indexed operations (assuming message ids and subjects are hashed and indexed, of course). But even a linear search over 10K objects is near-instant on modern hardware.
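To check the O(n) suspicion directly, one could extend the script above to time every APPEND and watch how the latency drifts as the mailbox grows. A minimal sketch, reusing `conn` from the script above and a hypothetical `make_message(n)` helper standing in for the message-building loop:

```python
import time

timings = []
for n in range(10000):
    msg = make_message(n)  # hypothetical: builds the message as above
    t0 = time.monotonic()
    conn.append('INBOX', '', None, str(msg).encode('utf-8'))
    timings.append(time.monotonic() - t0)
    if n and n % 1000 == 0:
        window = timings[-1000:]  # average over the last thousand appends
        print(f'{n:6}: avg {sum(window) / len(window) * 1000:.1f} ms/append')
```

If the per-append average climbs roughly linearly with n, some per-insert step is scanning existing messages instead of hitting an index.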
-
After disabling both full-text indexing and …
-
Most likely it is not only RocksDB but something before the storage layer. I re-tested it with sqlite3 (all in tmpfs).

CPU usage was constant and Stalwart consumed all 4 available cores during the whole process; as before, half of the time was spent in kernel mode (syscalls?). The resulting database size is ca. 21M - nothing to talk about.

Regardless of how bad SQL databases may be as key/value stores, in no-sync mode they are really fast - at least tens of thousands of inserts/updates per second (even with indexes), and in a RAM db even faster - so I believe the problem is somewhere else.
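For reference, that raw-sqlite3 claim is easy to check with a standalone micro-benchmark (a sketch; the key/value table layout is made up, not Stalwart's schema, and all inserts are batched in one transaction):

```python
import sqlite3, time

db = sqlite3.connect(':memory:')  # or a file on tmpfs
db.executescript('''
    PRAGMA journal_mode=WAL;
    PRAGMA synchronous=OFF;
    CREATE TABLE kv (k BLOB PRIMARY KEY, v BLOB);
''')
t0 = time.monotonic()
with db:  # a single transaction, committed once at the end
    db.executemany('INSERT INTO kv VALUES (?, ?)',
                   ((f'key-{i}'.encode(), b'x' * 200) for i in range(100_000)))
print(f'{100_000 / (time.monotonic() - t0):,.0f} inserts/s')
```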
-
Sorry if I said something completely wrong, but... could this be related to the other discussion? @infrequently mentioned that during import/export "1500 DNS requests per minute were submitted". If the IMAP proxy and the JMAP server run on the same machine, I would expect only one DNS syscall every few minutes. @aldem said that "half of the time was spent in kernel mode (syscalls?)". Does this sound like the same pattern?
-
No, those are different import methods:
I have identified what is causing the slowdown on IMAP and there are multiple factors:
So, to summarize:
-
Ok, I did a quick test with …

In the latter case, though, there is huge write amplification: importing 10K messages of less than 200 bytes each produced 2 GB(!) of disk writes. I guess anything else will be in the same range, so some control over the commit/fsync interval is needed, or SSDs will be killed fast on big imports and high-traffic installs.

This is negligible compared to the processing overhead and storage latency (FTS index building etc.), especially in low-concurrency scenarios (few users). Pure inserts/updates in sqlite3 are way faster - maybe not comparable to RocksDB, but definitely faster than FoundationDB (since in the latter case we again have protocol and network overhead). With simple key/value tables sqlite3 is just a bit worse than anything else, maybe even better in WAL mode when we mostly append; with updates things become a bit more complicated, but still, WAL to the rescue. The only case where a key-value store clearly wins is when most of the data is read/written entirely in RAM (like Redis); once we hit the disk (any disk, even SSD) we are bound to the disk, which is orders of magnitude slower than RAM. And since you commit (fsync) after (at least) each and every message appended or modified, this kills any practical advantage of "pure" key/value stores, at least while writing data.
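A minimal sketch of that commit-interval idea (the `store` object with its `write_all`/`fsync` methods is hypothetical): buffer appends and flush every N messages or T seconds, whichever comes first.

```python
import time

class GroupCommitter:
    def __init__(self, store, max_batch=100, max_delay=0.5):
        self.store, self.max_batch, self.max_delay = store, max_batch, max_delay
        self.batch, self.last_flush = [], time.monotonic()

    def append(self, message: bytes) -> None:
        self.batch.append(message)
        if (len(self.batch) >= self.max_batch
                or time.monotonic() - self.last_flush >= self.max_delay):
            self.flush()

    def flush(self) -> None:
        if self.batch:
            self.store.write_all(self.batch)  # one write...
            self.store.fsync()                # ...and one fsync per batch
            self.batch.clear()
        self.last_flush = time.monotonic()
```

The tradeoff is a small durability window: a crash can lose the messages buffered since the last flush, which is why such a mode would have to be opt-in for bulk imports.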
-
Lowering the pool size to 1-2 improves results, and judging from the fact that sqlite's inserts take only 67s when run on tmpfs (on my system), I believe the main issue with "6 times slower" can be attributed to fs latency and bandwidth. RocksDB's inserts in my test take 28s (also on tmpfs), so sqlite is "just" 2.4 times slower (but still good enough at more than 1000 op/s).

Anyway, even in your test this means > 500 op/s, which is quite a good result, at least for a single account without concurrency, but what we observe when inserting mails over IMAP is several times less. I have to study the code to understand what exactly is going on when a message is ingested; maybe there are some ways to improve it.

By the way, a slightly different performance issue: the export of my account (48K messages, …
-
The release looks great, a lot of interesting features - feels very promising - thank you!
But... there is always "but..." :)
I installed it with the sqlite3 backend first and tried to import an existing account (using imapsync). The import rate quickly dropped to less than 20 msg/s, and even after setting "pragma synchronous=off" (I had to recompile) it didn't improve much.

The next attempt was with RocksDB: it started quick (> 50 msg/s) but again quickly (after ca. 12K messages) dropped to less than 20 msg/s, and to ~10 msg/s close to the end.

My IMAP account has ~48K messages, and importing all of them is a pain - it takes ages. And this is on a high-end Ryzen 9 5900X with a PCIe x4 NVMe SSD.

I have the impression that FTS indexing is the bottleneck: Stalwart's CPU usage is > 300% (3 cores fully utilized) while disk activity is very low. I even experimented with RocksDB on tmpfs - that also didn't change a thing, so disk or fsync is definitely not the bottleneck (only with sqlite3 does it make a difference).

I also tried Elasticsearch - it didn't change a thing either, and judging from Stalwart's CPU load it still does internal indexing (while Elastic does almost nothing).

On new setups, assuming low-traffic scenarios, this probably would not be a problem, but any install that sees more than 50 msg/s will most likely die on the spot or will require high-end clusters to operate.

Unfortunately, I didn't find a way to disable FTS indexing (or at least postpone it), so I could not test whether this is really the issue. Internal FTS is quite limited anyway, as it does not allow partial (sub-string) search.

Background indexing, or indexing on demand, probably would improve things, but as long as it is done on-the-fly it needs enormous resources. And to be honest, I am really surprised that there is no (documented) way to disable FTS indexing completely.

Nevertheless, once everything was imported it feels very fast on message access over IMAP4; read rates/speeds are very impressive. Only the (mass) import (and the high CPU usage while it runs) is extremely disappointing.
If there is something that I could/should try to improve results - I am all eyes :)