Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lazily export JSON results #1438

Merged
merged 16 commits into from
Aug 27, 2024
Merged

Conversation

UNEXENU
Copy link
Contributor

@UNEXENU UNEXENU commented Aug 10, 2024

JSON results (in the official SPARQL JSON format as well as QLever's custom JSON format that is used by qlever-ui) are now also created on the fly and sent out in chunks without materializing the (possibly huge) complete JSON string in RAM. With this change, all query results are exported in this lazy manner, as the support for all other result types (TSV, CSV, Turtle, XML, Binary) has already been implemented previously.

Copy link

codecov bot commented Aug 10, 2024

Codecov Report

Attention: Patch coverage is 99.36709% with 1 line in your changes missing coverage. Please review.

Project coverage is 89.41%. Comparing base (f986808) to head (349db27).
Report is 8 commits behind head on master.

Files Patch % Lines
src/engine/ExportQueryExecutionTrees.cpp 99.36% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1438      +/-   ##
==========================================
- Coverage   89.44%   89.41%   -0.03%     
==========================================
  Files         346      346              
  Lines       25262    25347      +85     
  Branches     3371     3381      +10     
==========================================
+ Hits        22595    22665      +70     
+ Misses       1507     1495      -12     
- Partials     1160     1187      +27     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general this already looks very nice.
There's some small things to do.
You can also write a similar exporter for the
qleverJSON , then we can get rid of a lot of if(json) then bla statements in the server.

Comment on lines 624 to 625
co_yield absl::StrCat("{\"head\":{\"vars\":[\"", absl::StrJoin(vars, "\",\""),
"\"]},\"results\":{\"bindings\":[");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be a little bit safer here,
create an nlohmann::json with the variables, and then use its dump() to fill the head.
(Maybe we have odd variable names etc, we want the exporter to be robust).

Comment on lines 638 to 639
const auto& optionalValue = idToStringAndType(
qet.getQec()->getIndex(), currentId, result->localVocab());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like an unused duplicate.

Comment on lines 648 to 657
nlohmann::ordered_json b;
if (!xsdType) {
// No xsdType, this means that `stringValue` is a plain string literal
// or entity.
b = stringToBinding(stringValue);
} else {
b["value"] = stringValue;
b["type"] = "literal";
b["datatype"] = xsdType;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part can also be a separate function
(stringAndTypeToJsonBinding) or something like that.

@@ -1,5 +1,5 @@
add_subdirectory(ConfigManager)
add_subdirectory(MemorySize)
add_subdirectory(http)
add_library(util GeoSparqlHelpers.cpp antlr/ANTLRErrorHandling.cpp ParseException.cpp Conversions.cpp Date.cpp DateYearDuration.cpp Duration.cpp antlr/GenerateAntlrExceptionMetadata.cpp CancellationHandle.cpp StringUtils.cpp)
add_library(util GeoSparqlHelpers.cpp antlr/ANTLRErrorHandling.cpp ParseException.cpp Conversions.cpp Date.cpp DateYearDuration.cpp Duration.cpp antlr/GenerateAntlrExceptionMetadata.cpp CancellationHandle.cpp StringUtils.cpp LazyJsonParser.cpp)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can relatively easy delete all theLazyJsonParser files and changes from this PR (git rm` etc and delete the single lines here).
That is in this case probably the easiest way to split this up.

test/ExportQueryExecutionTreesTest.cpp Show resolved Hide resolved
Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is already really nice.
I think for simplicity you can throw out the old (non-lazy) exporters,
then we have less code duplications.

@@ -720,23 +726,12 @@ boost::asio::awaitable<void> Server::processQuery(
case octetStream:
case sparqlXml:
case turtle:
case sparqlJson:
case qleverJson:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can get rid of several lines here:

static constexpr std::array supportedTypes{MediaType::csv, MediaType::tsv, ...}
AD_CORRECTNESS_CHECK(ad_utility::contains(supportedTypes, mediaType.value());
co_await sendStreamableResponse(....)

(Always clean up code as you go).

Comment on lines 662 to 666
TEST(ExportQueryExecutionTrees, LiteralCornercases) {
auto runTestCaseWithLiteral = [](std::string literal) {
const std::string innerValue = literal.substr(1, literal.rfind("\"") - 1);
const std::string suffixValue = literal.substr(literal.rfind("\"") + 1);
std::string kg = absl::StrCat("<s> <p> ", literal);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this code by you, or is this shown because you haven't merged in the current master?

Comment on lines 867 to 869
// _____________________________________________________________________________
cppcoro::generator<std::string>
ExportQueryExecutionTrees::computeResultAsQLeverJSONStream(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For efficiency reasons
(The result has to be packed in reasonably large batches)
the inner generators (like for example this one here)
should not be generator<string> but stream_generator (see the other functions).
So the outermost generator of each type
(in particular this one here) should be a stream_generator,
And then best you do your switch for the QLeverJSON also in the single
computeResultAsStream function, which then transforms the stream_generator
back into a generator<string> which then contains large chunks.
If you want to know more details, this is best explained in a meeting.

src/engine/ExportQueryExecutionTrees.cpp Show resolved Hide resolved
Comment on lines +664 to +670
template <>
ad_utility::streams::stream_generator ExportQueryExecutionTrees::
selectQueryResultToStream<ad_utility::MediaType::sparqlJson>(
const QueryExecutionTree& qet,
const parsedQuery::SelectClause& selectClause,
LimitOffsetClause limitAndOffset,
CancellationHandle cancellationHandle) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After putting some thought into this,
Maybe the cleaner way is to give
all those functions the [[maybe_unused]] request timer,
s.t. the qleverJson export can just be a normal template specialization here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's one more thing needed for the qleverJson spezialization methods, they both require ParsedQuery::_originalString which can't be accessed by e.g. ParsedQuery::SelectClause which is available in the selectQueryResultToStream template method.
So this cleanup would come at the price of passing the ParsedQuery query to each selectquery-/constructqueryToStream method and manually retrieving the required specific object there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine now as it is.

Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently all the tests fail,
But I have identified the problem and proposed a fix.
The remaining things are just very small renamings etc. s.t. we get this clean.
This is almost done, great job!

Comment on lines 345 to 346
} else if (quotePos < entitystr.size() - 2) {
AD_CONTRACT_CHECK(entitystr[quotePos + 1] == '^');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it seems this doesn' work because of our strange fulltext index , that's why the E2E tests fail.
Restore the old behavior, and then add a TODO<joka921> This can be aAD_CONTRACT_CHECK` once the fulltext index vocabulary is stored in a consistent format.

Comment on lines 717 to 722
static constexpr std::array supportedTypes{
MediaType::csv, MediaType::tsv, MediaType::octetStream,
MediaType::sparqlXml, MediaType::turtle, MediaType::sparqlJson,
MediaType::qleverJson};
AD_CORRECTNESS_CHECK(
ad_utility::contains(supportedTypes, mediaType.value()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this check even needed?
Or can it be further inside (inside sendStreamableResponse or even better inside computeResultAsStream, s.t. we further eliminate code from the Server.cpp .

test/ExportQueryExecutionTreesTest.cpp Show resolved Hide resolved
CancellationHandle cancellationHandle);

// ___________________________________________________________________________
static nlohmann::json constructQueryResultBindingsToQLeverJSON(
static cppcoro::generator<std::string>
constructQueryResultBindingsToQLeverJSONStream(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to call it ...Stream as you have deleted the old version. So the original name suffices.

@@ -133,14 +120,15 @@ class ExportQueryExecutionTrees {
* then the query result will be obtained via `qet->getResult()`.
* @return a 2D-Json array corresponding to the IdTable given the arguments
*/
static nlohmann::json idTableToQLeverJSONArray(
static cppcoro::generator<std::string> idTableToQLeverJSONBindingsStream(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idTableToQLeverJSONBindings suffices (the Stream is not needed anymore, see below).

Comment on lines 95 to 96
// TODO<unexenu>
static ad_utility::streams::stream_generator computeResultAsQLeverJSONStream(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for theStream in the name, as you have deleted the other one.

And a good comment is simply "Return the result of the query in the QLeverJSON format."
(There still is a TODO there.

@@ -37,7 +37,8 @@ class ExportQueryExecutionTrees {
// lazily computes the serialized result in large chunks of bytes.
static cppcoro::generator<std::string> computeResultAsStream(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this is the only function left, we can simply call it computeResult (no AsStream needed, as everything is lazy.

Comment on lines +664 to +670
template <>
ad_utility::streams::stream_generator ExportQueryExecutionTrees::
selectQueryResultToStream<ad_utility::MediaType::sparqlJson>(
const QueryExecutionTree& qet,
const parsedQuery::SelectClause& selectClause,
LimitOffsetClause limitAndOffset,
CancellationHandle cancellationHandle) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine now as it is.

Copy link

sonarcloud bot commented Aug 27, 2024

Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much.
This is a milestone for reducing QLever's RAM consumption and increasing its ability to compute and also export huge query results!

@joka921 joka921 changed the title Stream export for JSON Lazily export JSON results Aug 27, 2024
@joka921 joka921 merged commit ac257c7 into ad-freiburg:master Aug 27, 2024
19 of 20 checks passed
joka921 added a commit that referenced this pull request Aug 28, 2024
* This looks right to me.
#1438 accidentally disabled the lazy computation of query results in JSON format. This commit not only reinstates this lazy computation, but also prefers lazy computation for all other export formats.
joka921 pushed a commit that referenced this pull request Sep 5, 2024
This fixes a regression introduced in #1438 that I didn't notice wasn't fixed as part of #1470. The runtime information was wrong for lazy operations (wrong number of rows and wrong times).
This is now fixed + a unit tests is added that would have prevented this regression in the first place.
hannahbast pushed a commit that referenced this pull request Sep 5, 2024
So far, when an error occurred after the sending of the result began, the HTTP status code was OK but the result was not. This resulted in confusing behavior in the QLever UI. In particular, if no result was produced at all, the QLever UI would wait for the result until the timeout and then show some generic error message. Since #1438, this affected any query with an error message, with the effect that no more specific error messages were shown in the QLever UI. This is now fixed now as follows:

1.  When the error occurs *before* the sending of the result begins (the typical case), a non-200 HTTP code is sent just like before (this was broken by #1438).

2.  When the error occurs *after* the sending of the result has begun, a 200 HTTP code is sent (HTTP provides no way to change this code after sending has begun), but when the error occurs a message is sent that begins with a highly unusual character sequence (currently `!!!!>>#`). This message could then be detected by the QLever UI or other tools. It is important to understand that this is not a problem specific to QLever, but a problem of the simplistic HTTP 1.1 protocol.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants