Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correct lexicographic ordering of items from the local vocab #1390

Merged
merged 64 commits into from
Jul 27, 2024
Merged
Show file tree
Hide file tree
Changes from 11 commits
Commits
Show all changes
64 commits
Select commit Hold shift + click to select a range
78ec2b2
A rough draft. Working: Joins with local vocab.
joka921 Jul 5, 2024
79d90b9
The sorting now works between
joka921 Jul 5, 2024
9c73759
A lot of cases work now.
joka921 Jul 5, 2024
05089b7
Everything compiles.
joka921 Jul 5, 2024
aab72ec
It seems to work, now fix and add tests.
joka921 Jul 8, 2024
e763d98
Fixing more tests...
joka921 Jul 10, 2024
1f38f43
All tests working.
joka921 Jul 10, 2024
13fd1ae
NexistePasMore stuff.
joka921 Jul 10, 2024
eae3725
Tests not working at all, but the local vocab is messy....
joka921 Jul 10, 2024
4329a4c
Another bug...
joka921 Jul 10, 2024
aa2bfad
A lot of stuff not working with the external vocabulary...
joka921 Jul 10, 2024
62a0800
In the middle of a review....
joka921 Jul 11, 2024
634400a
Fix a small bug in the pattern creation.
joka921 Jul 11, 2024
33c64b6
Fix the unit test.
joka921 Jul 12, 2024
59b64b0
Some fixes.
joka921 Jul 12, 2024
6796af1
Merge branch 'refs/heads/small-bug-pattern-creation' into correct-com…
joka921 Jul 12, 2024
0d34669
Fix the unit tests, so that we can work alongside our beloved tools f…
joka921 Jul 12, 2024
3072377
Fix the unit test.
joka921 Jul 12, 2024
baa30f1
Fixed unused variables.
joka921 Jul 15, 2024
ef940e8
Add some comments.
joka921 Jul 15, 2024
fa50324
Try to fix the E2E tests by using different settings...
joka921 Jul 15, 2024
e550067
A first draft of a correctly interleaved vocab.
joka921 Jul 17, 2024
7145b80
Start integrating into the vocabulary.
joka921 Jul 17, 2024
849b3d4
TODO:
joka921 Jul 17, 2024
9d5d2fa
TODO<joka921> pass On ethe information about compression to the inter…
joka921 Jul 17, 2024
0d59e1d
This compiles...
joka921 Jul 17, 2024
bfb6be9
We are finding bugs through unit tests.
joka921 Jul 17, 2024
0a93ebb
More unit tests passing, everything compiles.
joka921 Jul 18, 2024
1ab304d
Implement caching.
joka921 Jul 18, 2024
1caf71c
Fix bug.
joka921 Jul 18, 2024
97d2052
Smaller default milestone distance.
joka921 Jul 18, 2024
4cb95b7
Make gcc happy
joka921 Jul 18, 2024
ac67c7a
Unnecessary complexity in the comparison.
joka921 Jul 18, 2024
a92c93c
We are getting there, need to prove some things.
joka921 Jul 18, 2024
7541116
It seems to work, we improve one after the other.
joka921 Jul 18, 2024
b97e782
Let the unit tests pass again....
joka921 Jul 18, 2024
72a3cdf
This is really getting into shape.
joka921 Jul 18, 2024
002354e
Refactoring etc.
joka921 Jul 18, 2024
e741338
Move to anonymous namespace for testing.
joka921 Jul 18, 2024
b8a1daf
Move to anonymous namespace for testing.
joka921 Jul 18, 2024
4054973
Merge branch 'refs/heads/master' into correct-vocabulary
joka921 Jul 18, 2024
c241eca
Merge branch 'refs/heads/master' into correct-comparison-local-vocab
joka921 Jul 18, 2024
7ffe7df
A round of reviews with Hannah.
joka921 Jul 18, 2024
ce94613
Completely refactor the wordAndIndex class.
joka921 Jul 19, 2024
79eb1f5
This works.
joka921 Jul 19, 2024
fc9b073
More cleanups.
joka921 Jul 19, 2024
949f1f6
Might this word?
joka921 Jul 19, 2024
9b9b85c
Further cleanup for sonarcloud
joka921 Jul 19, 2024
dd97717
Bugfix
joka921 Jul 19, 2024
b2227f5
Further cleanups.
joka921 Jul 19, 2024
d16bbac
Further further cleanups.
joka921 Jul 19, 2024
89f496b
A review.
joka921 Jul 25, 2024
9582b19
Tiny last round of reviews.
joka921 Jul 25, 2024
edfeb81
Merge branch 'refs/heads/master' into correct-comparison-local-vocab
joka921 Jul 26, 2024
694cffd
Merge branch 'refs/heads/correct-vocabulary' into correct-comparison-…
joka921 Jul 26, 2024
118f50b
Fix all the unit tests....
joka921 Jul 26, 2024
af036c4
Merge branch 'refs/heads/master' into correct-comparison-local-vocab
joka921 Jul 26, 2024
d933852
Fix some sonarcloud issues.
joka921 Jul 26, 2024
0636b6e
Changes from a review.
joka921 Jul 26, 2024
1ed9ab2
Tiny final reviews...
joka921 Jul 26, 2024
e3dbc9f
Add member for explicit slicing (to make SonarCloud happy)
Jul 26, 2024
745c521
Remove std::move in CopyableAtomic move constructor
Jul 27, 2024
f4b2a12
Also removing the std::move in the move assignment
Jul 27, 2024
7dd4906
Also make the noecxept unconditional
Jul 27, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion benchmark/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -31,6 +31,6 @@ addAndLinkBenchmark(JoinAlgorithmBenchmark testUtil memorySize)

addAndLinkBenchmark(IdTableCompressedWriterBenchmark engine testUtil)

addAndLinkBenchmark(ParallelMergeBenchmark)
addAndLinkBenchmark(ParallelMergeBenchmark testUtil)

addAndLinkBenchmark(GroupByHashMapBenchmark engine testUtil gtest gmock)
17 changes: 17 additions & 0 deletions src/engine/LocalVocab.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,7 @@
#include "absl/strings/str_cat.h"
#include "global/Id.h"
#include "global/ValueId.h"
#include "index/IndexImpl.h"

// _____________________________________________________________________________
LocalVocab LocalVocab::clone() const {
Expand Down Expand Up @@ -77,3 +78,19 @@ std::vector<LocalVocab::LiteralOrIri> LocalVocab::getAllWordsForTesting()
}
return result;
}

// TODO<joka921> Consider moving the cheap case (if precomputed) into the
// header.
auto LocalVocabEntry::lowerBoundInIndex() const -> BoundsInIndex {
if (positionInVocabKnown) {
return {lowerBoundInIndex_, upperBoundInIndex_};
}
const IndexImpl& index = IndexImpl::staticGlobalSingletonIndex();
BoundsInIndex result;
const auto& vocab = index.getVocab();
result.lowerBound_ = vocab.lower_bound_external(toStringRepresentation());
result.upperBound_ = vocab.upper_bound_external(toStringRepresentation());
lowerBoundInIndex_ = result.lowerBound_;
upperBoundInIndex_ = result.upperBound_;
return result;
}
4 changes: 3 additions & 1 deletion src/engine/LocalVocab.h
Original file line number Diff line number Diff line change
Expand Up @@ -18,9 +18,11 @@
// is meant for words that are not part of the normal vocabulary (constructed
// from the input data at indexing time).
//

class LocalVocab {
private:
using LiteralOrIri = ad_utility::triple_component::LiteralOrIri;
using Entry = LocalVocabEntry;
using LiteralOrIri = LocalVocabEntry;
// A map of the words in the local vocabulary to their local IDs. This is a
// node hash map because we need the addresses of the words (which are of type
// `LiteralOrIri`) to remain stable over their lifetime in the hash map
Expand Down
26 changes: 12 additions & 14 deletions src/engine/sparqlExpressions/RelationalExpressionHelpers.h
Original file line number Diff line number Diff line change
Expand Up @@ -130,23 +130,22 @@ inline std::pair<ValueId, ValueId> getRangeFromVocab(
// consecutive range of IDs. For its usage see below.
template <typename S>
concept StoresStringOrId =
ad_utility::SimilarToAny<S, ValueId,
ad_utility::triple_component::LiteralOrIri,
IdOrLiteralOrIri, std::pair<Id, Id>>;
ad_utility::SimilarToAny<S, ValueId, LocalVocabEntry, IdOrLiteralOrIri,
std::pair<Id, Id>>;
// Convert a string or `IdOrLiteralOrIri` value into the (possibly empty) range
// of corresponding `ValueIds` (denoted by a `std::pair<Id, Id>`, see
// `getRangeFromVocab` above for details). This function also takes `ValueId`s
// and `pair<ValuedId, ValueId>` which are simply returned unchanged. This makes
// the usage of this function easier.
template <StoresStringOrId S>
auto makeValueId(const S& value, const EvaluationContext* context) {
if constexpr (ad_utility::isSimilar<S, ValueId>) {
return value;
} else if constexpr (ad_utility::isSimilar<S, std::pair<Id, Id>>) {
if constexpr (ad_utility::SimilarToAny<S, ValueId, std::pair<Id, Id>>) {
return value;
} else if constexpr (ad_utility::isSimilar<S, IdOrLiteralOrIri>) {
auto visitor = [context](const auto& x) {
auto res = makeValueId(x, context);
return res;
/*
if constexpr (ad_utility::isSimilar<decltype(res), Id>) {
// We need the same return type on all cases when visiting a variant, so
// we need to return a pair here. As the second element is an upper
Expand All @@ -155,13 +154,13 @@ auto makeValueId(const S& value, const EvaluationContext* context) {
} else {
return res;
}
*/
};
return std::visit(visitor, value);

} else {
static_assert(
ad_utility::isSimilar<S, ad_utility::triple_component::LiteralOrIri>);
return getRangeFromVocab(value, context);
static_assert(ad_utility::isSimilar<S, LocalVocabEntry>);
return Id::makeFromLocalVocabIndex(&value);
}
};

Expand All @@ -176,13 +175,12 @@ inline const auto compareIdsOrStrings =
[]<StoresStringOrId T, StoresStringOrId U>(
const T& a, const U& b,
const EvaluationContext* ctx) -> valueIdComparators::ComparisonResult {
using LiteralOrIri = ad_utility::triple_component::LiteralOrIri;
if constexpr (ad_utility::isSimilar<LiteralOrIri, T> &&
ad_utility::isSimilar<LiteralOrIri, U>) {
if constexpr (ad_utility::isSimilar<LocalVocabEntry, T> &&
ad_utility::isSimilar<LocalVocabEntry, U>) {
// TODO<joka921> integrate comparison via ICU and proper handling for
// IRIs/ Literals/etc.
return valueIdComparators::fromBool(applyComparison<Comp>(
a.toStringRepresentation(), b.toStringRepresentation()));
// TODO<joka921> This is now the place to fix it....
return valueIdComparators::fromBool(applyComparison<Comp>(a, b));
} else {
auto x = makeValueId(a, ctx);
auto y = makeValueId(b, ctx);
Expand Down
6 changes: 2 additions & 4 deletions src/engine/sparqlExpressions/SparqlExpressionTypes.h
Original file line number Diff line number Diff line change
Expand Up @@ -85,8 +85,7 @@ static_assert(!std::default_initializable<VectorWithMemoryLimit<int>>);
// A class to store the results of expressions that can yield strings or IDs as
// their result (for example IF and COALESCE). It is also used for expressions
// that can only yield strings.
using IdOrLiteralOrIri =
std::variant<ValueId, ad_utility::triple_component::LiteralOrIri>;
using IdOrLiteralOrIri = std::variant<ValueId, LocalVocabEntry>;
// Printing for GTest.
void PrintTo(const IdOrLiteralOrIri& var, std::ostream* os);

Expand Down Expand Up @@ -223,8 +222,7 @@ Id constantExpressionResultToId(T&& result, LocalVocabT& localVocab) {
} else if constexpr (ad_utility::isSimilar<T, IdOrLiteralOrIri>) {
return std::visit(
[&localVocab]<typename R>(R&& el) mutable {
if constexpr (ad_utility::isSimilar<
R, ad_utility::triple_component::LiteralOrIri>) {
if constexpr (ad_utility::isSimilar<R, LocalVocabEntry>) {
return Id::makeFromLocalVocabIndex(
localVocab.getIndexAndAddIfNotContained(AD_FWD(el)));
} else {
Expand Down
44 changes: 43 additions & 1 deletion src/global/IndexTypes.h
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,49 @@
// requests.
using VocabIndex = ad_utility::TypedIndex<uint64_t, "VocabIndex">;

using LocalVocabIndex = const ad_utility::triple_component::LiteralOrIri*;
// TODO<joka921> Move to util library
template <typename T>
class CopyableAtomic : public std::atomic<T> {
using Base = std::atomic<T>;

public:
using Base::Base;
CopyableAtomic(const CopyableAtomic& rhs) : Base{rhs.load()} {}
CopyableAtomic& operator=(const CopyableAtomic& rhs) {
static_cast<Base&>(*this) = rhs.load();
return *this;
}
};

class alignas(16) LocalVocabEntry
: public ad_utility::triple_component::LiteralOrIri {
private:
using Base = ad_utility::triple_component::LiteralOrIri;
mutable CopyableAtomic<VocabIndex> lowerBoundInIndex_;
mutable CopyableAtomic<VocabIndex> upperBoundInIndex_;
mutable CopyableAtomic<bool> positionInVocabKnown = false;

public:
struct BoundsInIndex {
VocabIndex lowerBound_;
VocabIndex upperBound_;
};
using Base::Base;
BoundsInIndex lowerBoundInIndex() const;

LocalVocabEntry(const Base& base) : Base{base} {}
LocalVocabEntry(Base&& base) noexcept : Base{std::move(base)} {}

template <typename H>
friend H AbslHashValue(H h, const std::same_as<LocalVocabEntry> auto& entry) {
return AbslHashValue(std::move(h), static_cast<const Base&>(entry));
}

auto operator<=>(const LocalVocabEntry& rhs) const {
return static_cast<const Base&>(*this) <=> static_cast<const Base&>(rhs);
}
};
using LocalVocabIndex = const LocalVocabEntry*;
using TextRecordIndex = ad_utility::TypedIndex<uint64_t, "TextRecordIndex">;
using WordVocabIndex = ad_utility::TypedIndex<uint64_t, "WordVocabIndex">;
using BlankNodeIndex = ad_utility::TypedIndex<uint64_t, "BlankNodeIndex">;
Expand Down
56 changes: 39 additions & 17 deletions src/global/ValueId.h
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,7 @@
#include <bit>
#include <cstdint>
#include <functional>
#include <iostream>
#include <limits>

#include "global/IndexTypes.h"
Expand Down Expand Up @@ -88,6 +89,12 @@ class ValueId {

// The largest representable integer value.
static constexpr int64_t maxInt = IntegerType::max();
// These types store strings. When sorting IDs, then these types are still
// interleaved (meaning that there is a consecutive range of IDs that contains
// VocabIndex and LocalVocabIndex, but those two datatypes do not necessarily
// form contiguous ranges).
static constexpr std::array<Datatype, 2> stringTypes_{
Datatype::VocabIndex, Datatype::LocalVocabIndex};

/// This exception is thrown if we try to store a value of an index type
/// (VocabIndex, LocalVocabIndex, TextRecordIndex) that is larger than
Expand Down Expand Up @@ -120,21 +127,6 @@ class ValueId {
/// Default construction of an uninitialized id.
ValueId() = default;

/// Equality comparison is performed directly on the underlying
/// representation.
// NOTE: (Also for the operator<=> below: These comparisons only work
// correctly if we only store entries in the local vocab that are NOT part of
// the vocabulary. This is currently not true for expression results from
// GROUP BY and BIND operations (for performance reasons). So a join with such
// results will currently lead to wrong results.
constexpr bool operator==(const ValueId& other) const {
if (getDatatype() == Datatype::LocalVocabIndex &&
other.getDatatype() == Datatype::LocalVocabIndex) [[unlikely]] {
return *getLocalVocabIndex() == *other.getLocalVocabIndex();
}
return _bits == other._bits;
}

/// Comparison is performed directly on the underlying representation. Note
/// that because the type bits are the most significant bits, all values of
/// the same `Datatype` will be adjacent to each other. Unsigned index types
Expand All @@ -144,13 +136,43 @@ class ValueId {
/// doubles in reversed order. This is a direct consequence of comparing the
/// bit representation of these values as unsigned integers.
constexpr auto operator<=>(const ValueId& other) const {
if (getDatatype() == Datatype::LocalVocabIndex &&
other.getDatatype() == Datatype::LocalVocabIndex) [[unlikely]] {
using enum Datatype;
auto type = getDatatype();
auto otherType = other.getDatatype();
if (type != LocalVocabIndex && otherType != LocalVocabIndex) {
return _bits <=> other._bits;
}
if (type == LocalVocabIndex && otherType == LocalVocabIndex) [[unlikely]] {
return *getLocalVocabIndex() <=> *other.getLocalVocabIndex();
}
auto compareVocabAndLocalVocab = [](::VocabIndex vocabIndex,
::LocalVocabIndex localVocabIndex) {
auto [lowerBound, upperBound] = localVocabIndex->lowerBoundInIndex();
if (vocabIndex < lowerBound) {
return std::strong_ordering::less;
} else if (vocabIndex >= upperBound) {
return std::strong_ordering::greater;
} else {
return std::strong_ordering::equal;
}
};
if (type == VocabIndex) {
return compareVocabAndLocalVocab(getVocabIndex(),
other.getLocalVocabIndex());
} else if (otherType == VocabIndex) {
auto inverseOrder = compareVocabAndLocalVocab(other.getVocabIndex(),
getLocalVocabIndex());
return 0 <=> inverseOrder;
}
return _bits <=> other._bits;
}

// For some reason which I (joka921) don't understand, we still need
// operator== although we already have operator <=>.
constexpr bool operator==(const ValueId& other) const {
return (*this <=> other) == 0;
}

/// Get the underlying bit representation, e.g. for compression etc.
[[nodiscard]] constexpr T getBits() const noexcept { return _bits; }
/// Construct from the underlying bit representation. `bits` must have been
Expand Down
48 changes: 39 additions & 9 deletions src/global/ValueIdComparators.h
Original file line number Diff line number Diff line change
Expand Up @@ -106,9 +106,25 @@ template <typename RandomIt>
inline std::pair<RandomIt, RandomIt> getRangeForDatatype(RandomIt begin,
RandomIt end,
Datatype datatype) {
return std::equal_range(
begin, end, datatype,
detail::makeSymmetricComparator(&ValueId::getDatatype));
auto comparator = detail::makeSymmetricComparator(&ValueId::getDatatype);
// In a sorted input, `VocabIndex` and `LocalVocabIndex` IDs might be
// interleaved because they logically both store strings. We therefore
// need the range where any of those Datatypes match.

// Binary search on the `Datatype` can only work in the string case, if the
// involved datatypes are directly adjacent.
static_assert(static_cast<int>(Datatype::LocalVocabIndex) ==
static_cast<int>(Datatype::VocabIndex) + 1);
if (ad_utility::contains(
std::array{Datatype::LocalVocabIndex, Datatype::VocabIndex},
datatype)) {
auto lower_bound =
std::lower_bound(begin, end, Datatype::VocabIndex, comparator);
auto upper_bound =
std::upper_bound(begin, end, Datatype::LocalVocabIndex, comparator);
return {lower_bound, upper_bound};
}
return std::equal_range(begin, end, datatype, comparator);
}

namespace detail {
Expand Down Expand Up @@ -410,7 +426,11 @@ inline std::vector<std::pair<RandomIt, RandomIt>> getRangesForEqualIds(
AD_CONTRACT_CHECK(valueIdBegin <= valueIdEnd);
// This lambda enforces the invariants `non-empty` and `sorted` and also
// merges directly adjacent ranges.
AD_CONTRACT_CHECK(valueIdBegin.getDatatype() == valueIdEnd.getDatatype());
auto typeA = valueIdBegin.getDatatype();
auto typeB = valueIdEnd.getDatatype();
AD_CONTRACT_CHECK(typeA == typeB ||
(ad_utility::contains(Id::stringTypes_, typeA) &&
ad_utility::contains(Id::stringTypes_, typeB)));
switch (valueIdBegin.getDatatype()) {
case Datatype::Double:
case Datatype::Int:
Expand Down Expand Up @@ -438,10 +458,15 @@ inline bool areTypesCompatible(Datatype typeA, Datatype typeB) {
auto isNumeric = [](Datatype type) {
return type == Datatype::Double || type == Datatype::Int;
};
// TODO<joka921> Make this work for the WordIndex also.
auto isString = [](Datatype type) {
return type == Datatype::VocabIndex || type == Datatype::LocalVocabIndex;
};
auto isUndefined = [](Datatype type) { return type == Datatype::Undefined; };
// Note: Undefined values cannot be compared to other undefined values.
return (!isUndefined(typeA)) &&
((typeA == typeB) || (isNumeric(typeA) && isNumeric(typeB)));
((typeA == typeB) || (isNumeric(typeA) && isNumeric(typeB)) ||
(isString(typeA) && isString(typeB)));
}

// This function is part of the implementation of `compareIds` (see below).
Expand All @@ -461,14 +486,19 @@ ComparisonResult compareIdsImpl(ValueId a, ValueId b, auto comparator) {
}
}

// If any of the entries is a `LocalVocabIndex`, then the ordinary comparison
// on ValueIds already does the right thing.
if (a.getDatatype() == Datatype::LocalVocabIndex ||
b.getDatatype() == Datatype::LocalVocabIndex) {
return fromBool(std::invoke(comparator, a, b));
}

auto visitor = [comparator]<typename A, typename B>(
const A& aValue, const B& bValue) -> ComparisonResult {
if constexpr (std::is_same_v<A, LocalVocabIndex> &&
std::is_same_v<B, LocalVocabIndex>) {
// TODO<joka921> This is one of the places that has to be changed once
// we want to implement correct comparisons for the local vocab that use
// ICU collation.
return fromBool(std::invoke(comparator, *aValue, *bValue));
// We have handled this case outside the visitor.
AD_FAIL();
} else if constexpr (requires() {
std::invoke(comparator, aValue, bValue);
}) {
Expand Down
5 changes: 4 additions & 1 deletion src/index/IndexImpl.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -41,7 +41,9 @@ static constexpr size_t NUM_EXTERNAL_SORTERS_AT_SAME_TIME = 2u;

// _____________________________________________________________________________
IndexImpl::IndexImpl(ad_utility::AllocatorWithLimit<Id> allocator)
: allocator_{std::move(allocator)} {};
: allocator_{std::move(allocator)} {
globalSingletonIndex_ = this;
};

// _____________________________________________________________________________
IndexBuilderDataAsFirstPermutationSorter IndexImpl::createIdTriplesAndVocab(
Expand Down Expand Up @@ -710,6 +712,7 @@ void IndexImpl::createFromOnDiskIndex(const string& onDiskBase) {
readConfiguration();
vocab_.readFromFile(onDiskBase_ + INTERNAL_VOCAB_SUFFIX,
onDiskBase_ + EXTERNAL_VOCAB_SUFFIX);
globalSingletonComparator_ = &vocab_.getCaseComparator();

totalVocabularySize_ = vocab_.size() + vocab_.getExternalVocab().size();
LOG(DEBUG) << "Number of words in internal and external vocabulary: "
Expand Down
Loading
Loading