-
Notifications
You must be signed in to change notification settings - Fork 3.4k
Description
The string_pool
is maintained as a sorted list of offsets, which when create_shared_string
is called are searched using slice::binary_search_by
If a match is not found, then the string is stored using create_byte_string
and the index is inserted into string_pool
using Vec::insert
So the happy path is an O(LogN) binary search
The unhappy path is an O(N) vector insertion
Every happy path requires a preceding unhappy path, so if you have N unique strings and M duplicates, then the cost of all calls to create_shared_string
becomes O(N^2) + O(MLogN) or simply O(N^2)
I have a stress test that is writing a flatbuffer with ~500,000 unique strings, 80% of our serialization time is in create_shared_string
I can reduce this to ~30% of our serialization time (total serialization time goes from to 20.4s to 4.1s) by
- changing the type of
string_pool
toHashMap<String, WIPOffset<&'fbb str>>
, and - changing the implementation of
create_shared_string
to
#[inline]
pub fn create_shared_string<'a: 'b, 'b>(&'a mut self, s: &'b str) -> WIPOffset<&'fbb str> {
self.assert_not_nested(
"create_shared_string can not be called when a table or vector is under construction",
);
match self.strings_pool.get(s) {
Some(address) => address,
None => {
let address = WIPOffset::new(self.create_byte_string(s.as_bytes()).value());
self.strings_pool.insert(s.to_owned(), address);
address
}
}
}