Skip to content

drexlerd/tree-compression

Repository files navigation

Dynamic Tree Databases

This repository contains a C++20 library for dynamic tree databases. A tree database is a data structure D for representing sequences of elements over a finite alphabet with two operations: 1) insert_sequence(z) inserts a sequence z of length k into D and returns a lightweight handle i, and 2) read_sequence(i) returns the sequence z corresponding to the handle i, both aiming to be efficient at amortized O(k) time. Tree databases encode a sequence z = z1, z2, ..., zk as a perfectly balanced binary tree, where subtrees may be shared among trees. The best-case compression of tree databases over approaches that store sequences independently of each other can approach the information-theoretic lower bound of one word.

As an example application, tree databases can compress the set of generated states in explicit-state search, particularly in model checking and automated planning.

Tree databases were initially proposed as statically sized data structures, requiring the user to estimate the amount of memory needed. Our dynamic tree databases allocate memory dynamically as needed. Therefore, they do not rely on user estimates of the required memory, but have the same desired properties. The available implementations are based on modern Swiss tables and fixed word sizes. However, more memory-efficient implementations based on succinct data structures and compact hashing are possible, but are also significantly more challenging to implement efficiently at runtime. We are currently working on implementations in this direction.

API

The following example illustrates the API of the library. First, it instantiates several indexed hash sets for deduplications. Second, it inserts a sequence over uint32_t that precisely matches the data type used for node pointers. Last, it inserts a sequence over double that requires the additional leaf table to internally map the sequence to a sequence over uint32_t.

#include <valla/valla.hpp>

/* Common setup */

// Deduplicate root nodes, i.e., pair of uint32_t representing pointer to inner node and sequence length, by bijectively mapping it to uint32_t
auto root_table = valla::IndexedHashSet<valla::Slot<uint32_t>, uint32_t>();

// Insert the root representing the empty sequence.
const auto z_empty_index = root_table.insert(valla::Slot<uint32_t>());
assert(z_empty_index == 0);

// Deduplicate inner nodes, i.e., pair of uint32_t representing pointer to left and right child node, by bijectively mapping it to uint32_t
auto inner_table = valla::IndexedHashSet<valla::Slot<uint32_t>, uint32_t>();

// Deduplicate doubles by bijectively mapping it to uint32_t
auto leaf_table = valla::IndexedHashSet<double, uint32_t>();

/* Sequence of uint32_t (special case) */
{
    // Create a sequence over uint32_t
    const auto z = std::vector<uint32_t> { 3, 1, 2, 4, 7, 10 };

    // Insert the sequence that satisfies the std::input_range concept
    const auto z_root = valla::insert_sequence(z, inner_table);

    // Read the sequence from the given handle z_root.
    auto z_out = std::vector<uint32_t>{};
    valla::read_sequence(z_root, inner_table, std::back_inserter(z_out));

    // Optional: deduplicate the sequence to obtain an indexing scheme
    const auto z_index = root_table.insert(valla::Slot<uint32_t>(z_root, z.size()));
    assert(z_index == 1);
}

/* Sequence of double (general case) */
{
    // Create a sequence over double (requires the leaf table for deduplication of elements from the alphabet)
    const auto z = std::vector<double> { 4.2, 1.7, 3, 4, 7.7, 0.41 };

    // Translate the sequence over double to sequence over uint32_t.
    auto z_tmp = std::vector<uint32_t>{};
    valla::encode_as_unsigned_integrals(z, leaf_table, std::back_inserter(z_tmp));

    // Insert the sequence that satisfies the std::input_range concept
    const auto z_root = valla::insert_sequence(z_tmp, inner_table);

    // Read the sequence over uint32_t from the given handle z_root.
    auto z_tmp_out = std::vector<uint32_t>{};
    valla::read_sequence(z_root, inner_table, std::back_inserter(z_tmp_out));

    // Translate the sequence over uint32_t to sequence over double.
    auto z_out = std::vector<double>{};
    valla::decode_from_unsigned_integrals(z_tmp_out, leaf_table, std::back_inserter(z_out));

    // Optional: deduplicate the sequence to obtain an indexing scheme
    const auto z_index = root_table.insert(valla::Slot<uint32_t>(z_root, z.size()));
    assert(z_index == 2);  // We could store roots for sequences over doubles and uint32_t separately as well.
}

Getting Started

Installing the Dependencies

  1. Configure the dependencies CMake project with the desired installation path:
cmake -S dependencies -B dependencies/build -DCMAKE_INSTALL_PREFIX=dependencies/installs -DCMAKE_PREFIX_PATH=$PWD/dependencies/installs
  1. Download, build, and install all dependencies:
cmake --build dependencies/build -j$(nproc)

Building the Project

  1. Configure
cmake -S . -B build -DCMAKE_PREFIX_PATH=${PWD}/dependencies/installs
  1. Build
cmake --build build -j$(nproc)
  1. (Optional) Install the project from the build directory to the desired installation prefix directory:
cmake --install build --prefix=<path/to/installation-directory>

Including the Project

When the project is installed, it can easily be included in a cmake project using find_package, for example with the following code:

find_package(valla CONFIG REQUIRED PATHS ${CMAKE_PREFIX_PATH} NO_DEFAULT_PATH)
if(valla_FOUND)
  message(STATUS "Found valla: ${valla_DIR} (found version ${valla_VERSION})")
endif()

It is also possible to omit the CMAKE_PREFIX_PATH if the installation directory is visible to the compiler.

Testing

The testing framework depends on GoogleTest and requires the additional compile flag -DBUILD_TESTS=ON to be set in the cmake configure step. The tests can be run by executing the following commands from the root.

./build/tests/unit/dtdb_h
./build/tests/unit/dtdb_s

Benchmarking

The benchmark framework depends on GoogleBenchmark and requires the additional compile flag -DBUILD_PROFILING=ON to be set in the cmake configure step. The benchmarks can be run by executing the following commands from the root.

./build/benchmark/plain_swiss_insert
./build/benchmark/plain_swiss_read
./build/benchmark/plain_swiss_iterator
./build/benchmark/plain_hash_id_map_insert
./build/benchmark/plain_hash_id_map_read
./build/benchmark/plain_hash_id_map_iterator

The results from the GitHub action can be viewed here.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •