Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correct lexicographic ordering of items from the local vocab #1390

Merged
merged 64 commits into from
Jul 27, 2024
Merged
Show file tree
Hide file tree
Changes from 58 commits
Commits
Show all changes
64 commits
Select commit Hold shift + click to select a range
78ec2b2
A rough draft. Working: Joins with local vocab.
joka921 Jul 5, 2024
79d90b9
The sorting now works between
joka921 Jul 5, 2024
9c73759
A lot of cases work now.
joka921 Jul 5, 2024
05089b7
Everything compiles.
joka921 Jul 5, 2024
aab72ec
It seems to work, now fix and add tests.
joka921 Jul 8, 2024
e763d98
Fixing more tests...
joka921 Jul 10, 2024
1f38f43
All tests working.
joka921 Jul 10, 2024
13fd1ae
NexistePasMore stuff.
joka921 Jul 10, 2024
eae3725
Tests not working at all, but the local vocab is messy....
joka921 Jul 10, 2024
4329a4c
Another bug...
joka921 Jul 10, 2024
aa2bfad
A lot of stuff not working with the external vocabulary...
joka921 Jul 10, 2024
62a0800
In the middle of a review....
joka921 Jul 11, 2024
634400a
Fix a small bug in the pattern creation.
joka921 Jul 11, 2024
33c64b6
Fix the unit test.
joka921 Jul 12, 2024
59b64b0
Some fixes.
joka921 Jul 12, 2024
6796af1
Merge branch 'refs/heads/small-bug-pattern-creation' into correct-com…
joka921 Jul 12, 2024
0d34669
Fix the unit tests, so that we can work alongside our beloved tools f…
joka921 Jul 12, 2024
3072377
Fix the unit test.
joka921 Jul 12, 2024
baa30f1
Fixed unused variables.
joka921 Jul 15, 2024
ef940e8
Add some comments.
joka921 Jul 15, 2024
fa50324
Try to fix the E2E tests by using different settings...
joka921 Jul 15, 2024
e550067
A first draft of a correctly interleaved vocab.
joka921 Jul 17, 2024
7145b80
Start integrating into the vocabulary.
joka921 Jul 17, 2024
849b3d4
TODO:
joka921 Jul 17, 2024
9d5d2fa
TODO<joka921> pass On ethe information about compression to the inter…
joka921 Jul 17, 2024
0d59e1d
This compiles...
joka921 Jul 17, 2024
bfb6be9
We are finding bugs through unit tests.
joka921 Jul 17, 2024
0a93ebb
More unit tests passing, everything compiles.
joka921 Jul 18, 2024
1ab304d
Implement caching.
joka921 Jul 18, 2024
1caf71c
Fix bug.
joka921 Jul 18, 2024
97d2052
Smaller default milestone distance.
joka921 Jul 18, 2024
4cb95b7
Make gcc happy
joka921 Jul 18, 2024
ac67c7a
Unnecessary complexity in the comparison.
joka921 Jul 18, 2024
a92c93c
We are getting there, need to prove some things.
joka921 Jul 18, 2024
7541116
It seems to work, we improve one after the other.
joka921 Jul 18, 2024
b97e782
Let the unit tests pass again....
joka921 Jul 18, 2024
72a3cdf
This is really getting into shape.
joka921 Jul 18, 2024
002354e
Refactoring etc.
joka921 Jul 18, 2024
e741338
Move to anonymous namespace for testing.
joka921 Jul 18, 2024
b8a1daf
Move to anonymous namespace for testing.
joka921 Jul 18, 2024
4054973
Merge branch 'refs/heads/master' into correct-vocabulary
joka921 Jul 18, 2024
c241eca
Merge branch 'refs/heads/master' into correct-comparison-local-vocab
joka921 Jul 18, 2024
7ffe7df
A round of reviews with Hannah.
joka921 Jul 18, 2024
ce94613
Completely refactor the wordAndIndex class.
joka921 Jul 19, 2024
79eb1f5
This works.
joka921 Jul 19, 2024
fc9b073
More cleanups.
joka921 Jul 19, 2024
949f1f6
Might this word?
joka921 Jul 19, 2024
9b9b85c
Further cleanup for sonarcloud
joka921 Jul 19, 2024
dd97717
Bugfix
joka921 Jul 19, 2024
b2227f5
Further cleanups.
joka921 Jul 19, 2024
d16bbac
Further further cleanups.
joka921 Jul 19, 2024
89f496b
A review.
joka921 Jul 25, 2024
9582b19
Tiny last round of reviews.
joka921 Jul 25, 2024
edfeb81
Merge branch 'refs/heads/master' into correct-comparison-local-vocab
joka921 Jul 26, 2024
694cffd
Merge branch 'refs/heads/correct-vocabulary' into correct-comparison-…
joka921 Jul 26, 2024
118f50b
Fix all the unit tests....
joka921 Jul 26, 2024
af036c4
Merge branch 'refs/heads/master' into correct-comparison-local-vocab
joka921 Jul 26, 2024
d933852
Fix some sonarcloud issues.
joka921 Jul 26, 2024
0636b6e
Changes from a review.
joka921 Jul 26, 2024
1ed9ab2
Tiny final reviews...
joka921 Jul 26, 2024
e3dbc9f
Add member for explicit slicing (to make SonarCloud happy)
Jul 26, 2024
745c521
Remove std::move in CopyableAtomic move constructor
Jul 27, 2024
f4b2a12
Also removing the std::move in the move assignment
Jul 27, 2024
7dd4906
Also make the noecxept unconditional
Jul 27, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -408,7 +408,8 @@ add_subdirectory(test)

# Add the library with the constants declared in `CompilationInfo.h` and defined
# in `CompilationInfo.cpp` created by `CompilationInfo.cmake`.
add_library(compilationInfo ${CMAKE_CURRENT_BINARY_DIR}/CompilationInfo.cpp)
add_library(compilationInfo ${CMAKE_CURRENT_BINARY_DIR}/CompilationInfo.cpp
src/global/VocabIndex.h)

add_executable(IndexBuilderMain src/index/IndexBuilderMain.cpp)
qlever_target_link_libraries(IndexBuilderMain index ${CMAKE_THREAD_LIBS_INIT} Boost::program_options)
Expand Down
2 changes: 1 addition & 1 deletion benchmark/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -31,6 +31,6 @@ addAndLinkBenchmark(JoinAlgorithmBenchmark testUtil memorySize)

addAndLinkBenchmark(IdTableCompressedWriterBenchmark engine testUtil)

addAndLinkBenchmark(ParallelMergeBenchmark)
addAndLinkBenchmark(ParallelMergeBenchmark testUtil)

addAndLinkBenchmark(GroupByHashMapBenchmark engine testUtil gtest gmock)
4 changes: 3 additions & 1 deletion e2e/e2e-build-settings.json
Original file line number Diff line number Diff line change
@@ -1,5 +1,7 @@
{
"num-triples-per-batch" : 40000,
"parser-batch-size" : 1000,
"ascii-prefixes-only":false
"ascii-prefixes-only":false,
"languages-internal": [""],
"prefixes-external": [""]
}
3 changes: 2 additions & 1 deletion src/engine/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -13,5 +13,6 @@ add_library(engine
VariableToColumnMap.cpp ExportQueryExecutionTrees.cpp
CartesianProductJoin.cpp TextIndexScanForWord.cpp TextIndexScanForEntity.cpp
TextLimit.cpp
idTable/CompressedExternalIdTable.h)
idTable/CompressedExternalIdTable.h
LocalVocabEntry.cpp)
qlever_target_link_libraries(engine util index parser sparqlExpressions http SortPerformanceEstimator Boost::iostreams)
3 changes: 2 additions & 1 deletion src/engine/ExportQueryExecutionTrees.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -188,7 +188,8 @@ ExportQueryExecutionTrees::getLiteralOrIriFromVocabIndex(
using LiteralOrIri = ad_utility::triple_component::LiteralOrIri;
switch (id.getDatatype()) {
case Datatype::LocalVocabIndex:
return localVocab.getWord(id.getLocalVocabIndex());
return static_cast<LiteralOrIri>(
localVocab.getWord(id.getLocalVocabIndex()));
case Datatype::VocabIndex: {
auto entity = index.indexToString(id.getVocabIndex());
return LiteralOrIri::fromStringRepresentation(entity);
Expand Down
1 change: 1 addition & 0 deletions src/engine/LocalVocab.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,7 @@
#include "absl/strings/str_cat.h"
#include "global/Id.h"
#include "global/ValueId.h"
#include "index/IndexImpl.h"

// _____________________________________________________________________________
LocalVocab LocalVocab::clone() const {
Expand Down
4 changes: 3 additions & 1 deletion src/engine/LocalVocab.h
Original file line number Diff line number Diff line change
Expand Up @@ -18,9 +18,11 @@
// is meant for words that are not part of the normal vocabulary (constructed
// from the input data at indexing time).
//

class LocalVocab {
private:
using LiteralOrIri = ad_utility::triple_component::LiteralOrIri;
using Entry = LocalVocabEntry;
using LiteralOrIri = LocalVocabEntry;
// A map of the words in the local vocabulary to their local IDs. This is a
// node hash map because we need the addresses of the words (which are of type
// `LiteralOrIri`) to remain stable over their lifetime in the hash map
Expand Down
31 changes: 31 additions & 0 deletions src/engine/LocalVocabEntry.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,31 @@
// Copyright 2024, University of Freiburg,
// Chair of Algorithms and Data Structures.
// Author: Johannes Kalmbach <[email protected]>

#include "LocalVocabEntry.h"

#include "index/IndexImpl.h"

// TODO<joka921> Consider moving the cheap case (if precomputed) into the
// header.
auto LocalVocabEntry::positionInVocab() const -> PositionInVocab {
// Immediately return if we have previously computed and cached the position.
if (positionInVocabKnown_.load(std::memory_order_acquire)) {
return {lowerBoundInVocab_.load(std::memory_order_relaxed),
upperBoundInVocab_.load(std::memory_order_relaxed)};
}

// Lookup the lower and upper bound from the vocabulary of the index,
// cache and return them.
const IndexImpl& index = IndexImpl::staticGlobalSingletonIndex();
PositionInVocab positionInVocab;
const auto& vocab = index.getVocab();
positionInVocab.lowerBound_ = vocab.lower_bound(toStringRepresentation());
positionInVocab.upperBound_ = vocab.upper_bound(toStringRepresentation());
lowerBoundInVocab_.store(positionInVocab.lowerBound_,
std::memory_order_relaxed);
upperBoundInVocab_.store(positionInVocab.upperBound_,
std::memory_order_relaxed);
positionInVocabKnown_.store(true, std::memory_order_release);
return positionInVocab;
}
70 changes: 70 additions & 0 deletions src/engine/LocalVocabEntry.h
Original file line number Diff line number Diff line change
@@ -0,0 +1,70 @@
// Copyright 2024, University of Freiburg,
// Chair of Algorithms and Data Structures.
// Author: Johannes Kalmbach <[email protected]>

#pragma once

#include <atomic>

#include "global/VocabIndex.h"
#include "parser/LiteralOrIri.h"
#include "util/CopyableSynchronization.h"

// This is the type we use to store literals and IRIs in the `LocalVocab`.
// It consists of a `LiteralOrIri` and a cache to store the position, where
// the entry would be in the global vocabulary of the Index. This position is
// used for efficient comparisons between entries in the local and global
// vocabulary because we only have to lookup the position once per
// `LocalVocabEntry`, and all subsequent comparisons are cheap.
class alignas(16) LocalVocabEntry
: public ad_utility::triple_component::LiteralOrIri {
private:
using Base = ad_utility::triple_component::LiteralOrIri;

// The cache for the position in the vocabulary. As usual, the `lowerBound` is
// inclusive, the `upperBound` is not, so if `lowerBound == upperBound`, then
// the entry is not part of the globalVocabulary, an `lowerBound` points to
// the first *larger* word in the vocabulary. Note: we store the cache as
// three separate atomics to avoid mutexes. The downside is, that in parallel
// code multiple threads might look up the position concurrently, which wastes
// a bit of resources. We however don't consider this case to be likely.
mutable ad_utility::CopyableAtomic<VocabIndex> lowerBoundInVocab_;
mutable ad_utility::CopyableAtomic<VocabIndex> upperBoundInVocab_;
mutable ad_utility::CopyableAtomic<bool> positionInVocabKnown_ = false;

public:
// Inherit the constructors from `LiteralOrIri`
using Base::Base;

// Deliberately allow implicit conversion from `LiteralOrIri`.
explicit(false) LocalVocabEntry(const Base& base) : Base{base} {}
explicit(false) LocalVocabEntry(Base&& base) noexcept
: Base{std::move(base)} {}

// Return the position in the vocabulary. If it is not already cached, then
// the call to `positionInVocab()` first computes the position and then
// caches it.
// Note:: We use `lower` and `upperBound` because depending on the Local
// settings there might be a range of words that are considered equal for the
// purposes of comparing and sorting them.
struct PositionInVocab {
VocabIndex lowerBound_;
VocabIndex upperBound_;
};
PositionInVocab positionInVocab() const;

// It suffices to hash the base class `LiteralOrIri` as the position in the
// vocab is redundant for those purposes.
template <typename H>
friend H AbslHashValue(H h, const std::same_as<LocalVocabEntry> auto& entry) {
return AbslHashValue(std::move(h), static_cast<const Base&>(entry));
}

// Comparison between two entries could in theory also be sped up using the
// cached `position` if it has previously been computed for both of the
// entries, but it is currently questionable whether this gains much
// performance.
auto operator<=>(const LocalVocabEntry& rhs) const {
return static_cast<const Base&>(*this) <=> static_cast<const Base&>(rhs);
}
};
33 changes: 9 additions & 24 deletions src/engine/sparqlExpressions/RelationalExpressionHelpers.h
Original file line number Diff line number Diff line change
Expand Up @@ -130,38 +130,27 @@ inline std::pair<ValueId, ValueId> getRangeFromVocab(
// consecutive range of IDs. For its usage see below.
template <typename S>
concept StoresStringOrId =
ad_utility::SimilarToAny<S, ValueId,
ad_utility::triple_component::LiteralOrIri,
IdOrLiteralOrIri, std::pair<Id, Id>>;
ad_utility::SimilarToAny<S, ValueId, LocalVocabEntry, IdOrLiteralOrIri,
std::pair<Id, Id>>;
// Convert a string or `IdOrLiteralOrIri` value into the (possibly empty) range
// of corresponding `ValueIds` (denoted by a `std::pair<Id, Id>`, see
// `getRangeFromVocab` above for details). This function also takes `ValueId`s
// and `pair<ValuedId, ValueId>` which are simply returned unchanged. This makes
// the usage of this function easier.
template <StoresStringOrId S>
auto makeValueId(const S& value, const EvaluationContext* context) {
if constexpr (ad_utility::isSimilar<S, ValueId>) {
return value;
} else if constexpr (ad_utility::isSimilar<S, std::pair<Id, Id>>) {
if constexpr (ad_utility::SimilarToAny<S, ValueId, std::pair<Id, Id>>) {
return value;
} else if constexpr (ad_utility::isSimilar<S, IdOrLiteralOrIri>) {
auto visitor = [context](const auto& x) {
auto res = makeValueId(x, context);
if constexpr (ad_utility::isSimilar<decltype(res), Id>) {
// We need the same return type on all cases when visiting a variant, so
// we need to return a pair here. As the second element is an upper
// bound, we have to increment it by one.
return std::pair{res, ValueId::fromBits(res.getBits() + 1)};
} else {
return res;
}
return res;
};
return std::visit(visitor, value);

} else {
static_assert(
ad_utility::isSimilar<S, ad_utility::triple_component::LiteralOrIri>);
return getRangeFromVocab(value, context);
static_assert(ad_utility::isSimilar<S, LocalVocabEntry>);
return Id::makeFromLocalVocabIndex(&value);
}
};

Expand All @@ -176,13 +165,9 @@ inline const auto compareIdsOrStrings =
[]<StoresStringOrId T, StoresStringOrId U>(
const T& a, const U& b,
const EvaluationContext* ctx) -> valueIdComparators::ComparisonResult {
using LiteralOrIri = ad_utility::triple_component::LiteralOrIri;
if constexpr (ad_utility::isSimilar<LiteralOrIri, T> &&
ad_utility::isSimilar<LiteralOrIri, U>) {
// TODO<joka921> integrate comparison via ICU and proper handling for
// IRIs/ Literals/etc.
return valueIdComparators::fromBool(applyComparison<Comp>(
a.toStringRepresentation(), b.toStringRepresentation()));
if constexpr (ad_utility::isSimilar<LocalVocabEntry, T> &&
ad_utility::isSimilar<LocalVocabEntry, U>) {
return valueIdComparators::fromBool(applyComparison<Comp>(a, b));
} else {
auto x = makeValueId(a, ctx);
auto y = makeValueId(b, ctx);
Expand Down
6 changes: 2 additions & 4 deletions src/engine/sparqlExpressions/SparqlExpressionTypes.h
Original file line number Diff line number Diff line change
Expand Up @@ -85,8 +85,7 @@ static_assert(!std::default_initializable<VectorWithMemoryLimit<int>>);
// A class to store the results of expressions that can yield strings or IDs as
// their result (for example IF and COALESCE). It is also used for expressions
// that can only yield strings.
using IdOrLiteralOrIri =
std::variant<ValueId, ad_utility::triple_component::LiteralOrIri>;
using IdOrLiteralOrIri = std::variant<ValueId, LocalVocabEntry>;
// Printing for GTest.
void PrintTo(const IdOrLiteralOrIri& var, std::ostream* os);

Expand Down Expand Up @@ -223,8 +222,7 @@ Id constantExpressionResultToId(T&& result, LocalVocabT& localVocab) {
} else if constexpr (ad_utility::isSimilar<T, IdOrLiteralOrIri>) {
return std::visit(
[&localVocab]<typename R>(R&& el) mutable {
if constexpr (ad_utility::isSimilar<
R, ad_utility::triple_component::LiteralOrIri>) {
if constexpr (ad_utility::isSimilar<R, LocalVocabEntry>) {
return Id::makeFromLocalVocabIndex(
localVocab.getIndexAndAddIfNotContained(AD_FWD(el)));
} else {
Expand Down
15 changes: 4 additions & 11 deletions src/global/IndexTypes.h
Original file line number Diff line number Diff line change
Expand Up @@ -2,22 +2,15 @@
// Chair of Algorithms and Data Structures.
// Author: Johannes Kalmbach <[email protected]>

#ifndef QLEVER_INDEXTYPES_H
#define QLEVER_INDEXTYPES_H
#pragma once

#include "./TypedIndex.h"
#include "parser/LiteralOrIri.h"
#include "./VocabIndex.h"
#include "engine/LocalVocabEntry.h"

// Typedefs for several kinds of typed indices that are used across QLever.

// TODO<joka921> Rename `VocabIndex` to `RdfVocabIndex` at a point in time where
// this (very intrusive) renaming doesn't interfere with too many open pull
// requests.
using VocabIndex = ad_utility::TypedIndex<uint64_t, "VocabIndex">;

using LocalVocabIndex = const ad_utility::triple_component::LiteralOrIri*;
using LocalVocabIndex = const LocalVocabEntry*;
using TextRecordIndex = ad_utility::TypedIndex<uint64_t, "TextRecordIndex">;
using WordVocabIndex = ad_utility::TypedIndex<uint64_t, "WordVocabIndex">;
using BlankNodeIndex = ad_utility::TypedIndex<uint64_t, "BlankNodeIndex">;

#endif // QLEVER_INDEXTYPES_H
64 changes: 47 additions & 17 deletions src/global/ValueId.h
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,7 @@
#include <bit>
#include <cstdint>
#include <functional>
#include <iostream>
#include <limits>

#include "global/IndexTypes.h"
Expand Down Expand Up @@ -88,6 +89,19 @@ class ValueId {

// The largest representable integer value.
static constexpr int64_t maxInt = IntegerType::max();
// These types store strings. When sorting IDs, then these types are still
// interleaved (meaning that there is a consecutive range of IDs that contains
// VocabIndex and LocalVocabIndex, but those two datatypes do not necessarily
// form contiguous ranges).
static constexpr std::array<Datatype, 2> stringTypes_{
Datatype::VocabIndex, Datatype::LocalVocabIndex};
// Assert, that the `stringTypes_` are directly adjacent, this is required
// to make the comparison of IDs in `ValueIdComparators.h` work.
static constexpr Datatype maxStringType_ = std::ranges::max(stringTypes_);
static constexpr Datatype minStringType_ = std::ranges::min(stringTypes_);
static_assert(static_cast<size_t>(maxStringType_) -
static_cast<size_t>(minStringType_) + 1 ==
stringTypes_.size());

/// This exception is thrown if we try to store a value of an index type
/// (VocabIndex, LocalVocabIndex, TextRecordIndex) that is larger than
Expand Down Expand Up @@ -120,21 +134,6 @@ class ValueId {
/// Default construction of an uninitialized id.
ValueId() = default;

/// Equality comparison is performed directly on the underlying
/// representation.
// NOTE: (Also for the operator<=> below: These comparisons only work
// correctly if we only store entries in the local vocab that are NOT part of
// the vocabulary. This is currently not true for expression results from
// GROUP BY and BIND operations (for performance reasons). So a join with such
// results will currently lead to wrong results.
constexpr bool operator==(const ValueId& other) const {
if (getDatatype() == Datatype::LocalVocabIndex &&
other.getDatatype() == Datatype::LocalVocabIndex) [[unlikely]] {
return *getLocalVocabIndex() == *other.getLocalVocabIndex();
}
return _bits == other._bits;
}

/// Comparison is performed directly on the underlying representation. Note
/// that because the type bits are the most significant bits, all values of
/// the same `Datatype` will be adjacent to each other. Unsigned index types
Expand All @@ -144,13 +143,44 @@ class ValueId {
/// doubles in reversed order. This is a direct consequence of comparing the
/// bit representation of these values as unsigned integers.
constexpr auto operator<=>(const ValueId& other) const {
if (getDatatype() == Datatype::LocalVocabIndex &&
other.getDatatype() == Datatype::LocalVocabIndex) [[unlikely]] {
using enum Datatype;
auto type = getDatatype();
auto otherType = other.getDatatype();
if (type != LocalVocabIndex && otherType != LocalVocabIndex) {
return _bits <=> other._bits;
}
if (type == LocalVocabIndex && otherType == LocalVocabIndex) [[unlikely]] {
return *getLocalVocabIndex() <=> *other.getLocalVocabIndex();
}
auto compareVocabAndLocalVocab =
[](::VocabIndex vocabIndex,
::LocalVocabIndex localVocabIndex) -> std::strong_ordering {
auto [lowerBound, upperBound] = localVocabIndex->positionInVocab();
if (vocabIndex < lowerBound) {
return std::strong_ordering::less;
} else if (vocabIndex >= upperBound) {
return std::strong_ordering::greater;
} else {
return std::strong_ordering::equal;
}
};
if (type == VocabIndex) {
return compareVocabAndLocalVocab(getVocabIndex(),
other.getLocalVocabIndex());
} else if (otherType == VocabIndex) {
auto inverseOrder = compareVocabAndLocalVocab(other.getVocabIndex(),
getLocalVocabIndex());
return 0 <=> inverseOrder;
}
return _bits <=> other._bits;
}

// For some reason which I (joka921) don't understand, we still need
// operator== although we already have operator <=>.
constexpr bool operator==(const ValueId& other) const {
return (*this <=> other) == 0;
}

/// Get the underlying bit representation, e.g. for compression etc.
[[nodiscard]] constexpr T getBits() const noexcept { return _bits; }
/// Construct from the underlying bit representation. `bits` must have been
Expand Down
Loading
Loading