RocksDB Python API #991

Closed
wants to merge 10 commits into from

Conversation

@joe-iddon (Collaborator) commented Oct 23, 2023

Reference Issues/PRs

Closes #913
Extends #945 (C++ tests) and #961 (Conda mac problems)

What does this implement or fix?

Implements the RocksDB Python API. Update: We will not advertise this yet, since conda support is not there, and the documentation still needs writing.

Since the C++ code for RocksDB is excluded on conda, a static read-only property is attached to LibraryManager in python_bindings.cpp so that the Python code knows whether or not to offer up rocksdb_library_adapter.py.

I.e.

>>> arcticdb_ext.storage.LibraryManager.rocksdb_support
False  # on conda; on PyPI (Windows + Linux) this gives True

Still to do:

Any other comments?

Removed the std::make_unique anti-pattern that was introduced to storage_factory.cpp in #625, which required Storage implementations to be move-constructible. This was highlighted to me because I defined a custom destructor for the RocksDBStorage class, which suppressed its default move constructor and so gave compiler errors.

Ideas for Profiling + Improving Performance / Configuring RocksDB

Profiling ideas:

  • Need to compile in release rather than debug mode, since RocksDB performs much better in release
  • Flamegraph + perf
  • Enable performance tracing with #define. Note: might want to add more ARCTICDB_SAMPLE macros to rocksdb_storage.cpp for this.
  • Use py-spy with a PID input and speedscope output, passing --native to profile C++ code and --idle to include time spent waiting on I/O operations, etc.
  • Edit: We now have a Wiki page to explain our profiling options.
  • Ideas for graphs/parameters to vary in comparing LMDB and RocksDB:
    • write + read for different data sizes, varying segment size (rows + columns), since RocksDB's LSM tree should outperform LMDB's B-tree when there are a large number of key/value pairs.
    • batch_write and batch_read operations to compare parallel performance
    • append and update, as these produce smaller segments and hence more key/value pairs, which may favor RocksDB

Performance ideas

  • As noted in the code, we should switch from Slice to PinnableSlice. This is only possible once #975 (Add a mechanism to extend storage transaction lifetime) has been merged, but will enable us to eliminate a copy when creating the Segment from the bytes during a read.
  • Writes cannot be optimised in this way: there is a necessary copy which, unlike in LMDB, cannot be avoided.
  • Optimise the BloomFilter (already enabled, but could be tuned).
  • Add block caching
  • Configure number of SST files in L0, L1, etc.
  • Configure how many I/O threads to give RocksDB. Currently set to 16 (from the default of 1) using IncreaseParallelism, but we don't want too many, because RocksDB might then try too hard to distribute work and harm performance through context switching.
  • Use a WriteBatch when deleting/putting multiple key/values, rather than calling Put and Delete one by one.
  • Add a WaitForCompact before closing, as this will mean the DB can be opened faster next time.

Documentation and reference material on RocksDB:

Random other idea:

  • The C++ test parameterization in test_embedded.cpp could be made neater in two ways: either spread the list definition over multiple lines, so the single RocksDB entry can literally be #defined out on conda; or give the BackendGenerator class a constexpr constructor, which should allow a constexpr version of emplace_back to be used ... except that is not available on C++20, so we would have to write a constexpr version of emplace_back ourselves.

Checklist

Checklist for code changes...
  • Have you updated the relevant docstrings, documentation and copyright notice?
  • Is this contribution tested against all ArcticDB's features?
  • Do all exceptions introduced raise appropriate error messages?
  • Are API changes highlighted in the PR description?
  • Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?

@joe-iddon joe-iddon added the enhancement New feature or request label Oct 23, 2023
This was referenced Oct 23, 2023
Base automatically changed from initial_rocksdb_work to master October 24, 2023 13:44
@joe-iddon joe-iddon changed the title WIP: RocksDB Python API RocksDB Python API Nov 1, 2023
@joe-iddon joe-iddon marked this pull request as ready for review November 1, 2023 10:02
@@ -38,6 +38,14 @@ namespace arcticdb::storage {

[[nodiscard]] bool has_library(const LibraryPath& path) const;

[[nodiscard]] static inline bool rocksdb_support() {
Collaborator review comment:

We can just hang things off the module, like here:

storage.attr("CONFIG_LIBRARY_NAME") = py::str(arcticdb::storage::CONFIG_LIBRARY_NAME);

can't we? There's no need for this to be tied to the library_manager.

(fg::from(ks.as_range()) | fg::move | fg::groupBy(grouper)).foreach([&](auto &&group) {
auto key_type_name = fmt::format("{}", group.key());
auto handle = handles_by_key_type_.at(key_type_name);
for (const auto &k : group.values()) {
std::string k_str = to_serialized_key(k);
std::string value;
// TODO: Once PR: 975 has been merged we can use ::rocksdb::PinnableSlice to avoid the copy in
// the consturction of the segment
Collaborator review comment:

Suggested change
// the consturction of the segment
// the construction of the segment

storage = std::make_unique<s3::S3Storage>(
s3::S3Storage(library_path, mode, s3_config)
);
storage = std::make_unique<s3::S3Storage>(library_path, mode, s3_config);
Collaborator review comment:

Ah wow, nice cleanup, much better.

import re
import os

# from dataclasses import dataclass
Collaborator review comment:

rm



class RocksDBLibraryAdapter(ArcticLibraryAdapter):
"""
Collaborator review comment:

Can you document in here that this is a work-in-progress, beta API and should not be relied upon?
Also log some warnings if anyone does actually use the RocksDB adapter, so they see at runtime that we're not making any promises about it.

@poodlewars poodlewars closed this Aug 12, 2024
Labels
enhancement New feature or request
Successfully merging this pull request may close these issues.

RocksDB Backend