Lazily export JSON results #1438
Conversation
Codecov Report
Attention: Patch coverage is

Additional details and impacted files:

@@            Coverage Diff             @@
##           master    #1438      +/-   ##
==========================================
- Coverage   89.44%   89.41%   -0.03%
==========================================
  Files         346      346
  Lines       25262    25347      +85
  Branches     3371     3381      +10
==========================================
+ Hits        22595    22665      +70
+ Misses       1507     1495      -12
- Partials     1160     1187      +27
This reverts commit 8c70415.
In general this already looks very nice. There are some small things to do. You can also write a similar exporter for the qleverJSON, then we can get rid of a lot of `if (json)` statements in the server.
co_yield absl::StrCat("{\"head\":{\"vars\":[\"", absl::StrJoin(vars, "\",\""),
                      "\"]},\"results\":{\"bindings\":[");
To be a little bit safer here, create an nlohmann::json with the variables, and then use its dump() to fill the head. (Maybe we have odd variable names etc.; we want the exporter to be robust.)
const auto& optionalValue = idToStringAndType(
    qet.getQec()->getIndex(), currentId, result->localVocab());
This seems like an unused duplicate.
nlohmann::ordered_json b;
if (!xsdType) {
  // No xsdType, this means that `stringValue` is a plain string literal
  // or entity.
  b = stringToBinding(stringValue);
} else {
  b["value"] = stringValue;
  b["type"] = "literal";
  b["datatype"] = xsdType;
}
This part can also be a separate function (stringAndTypeToJsonBinding or something like that).
src/util/CMakeLists.txt
Outdated
@@ -1,5 +1,5 @@
 add_subdirectory(ConfigManager)
 add_subdirectory(MemorySize)
 add_subdirectory(http)
-add_library(util GeoSparqlHelpers.cpp antlr/ANTLRErrorHandling.cpp ParseException.cpp Conversions.cpp Date.cpp DateYearDuration.cpp Duration.cpp antlr/GenerateAntlrExceptionMetadata.cpp CancellationHandle.cpp StringUtils.cpp)
+add_library(util GeoSparqlHelpers.cpp antlr/ANTLRErrorHandling.cpp ParseException.cpp Conversions.cpp Date.cpp DateYearDuration.cpp Duration.cpp antlr/GenerateAntlrExceptionMetadata.cpp CancellationHandle.cpp StringUtils.cpp LazyJsonParser.cpp)
You can relatively easily delete all the LazyJsonParser files and changes from this PR (`git rm` etc. and delete the single lines here). That is in this case probably the easiest way to split this up.
This is already really nice. I think for simplicity you can throw out the old (non-lazy) exporters, then we have less code duplication.
src/engine/Server.cpp
Outdated
@@ -720,23 +726,12 @@ boost::asio::awaitable<void> Server::processQuery(
      case octetStream:
      case sparqlXml:
      case turtle:
      case sparqlJson:
      case qleverJson:
You can get rid of several lines here:

static constexpr std::array supportedTypes{MediaType::csv, MediaType::tsv, ...};
AD_CORRECTNESS_CHECK(ad_utility::contains(supportedTypes, mediaType.value()));
co_await sendStreamableResponse(...);

(Always clean up code as you go.)
TEST(ExportQueryExecutionTrees, LiteralCornercases) {
  auto runTestCaseWithLiteral = [](std::string literal) {
    const std::string innerValue = literal.substr(1, literal.rfind("\"") - 1);
    const std::string suffixValue = literal.substr(literal.rfind("\"") + 1);
    std::string kg = absl::StrCat("<s> <p> ", literal);
Is this code by you, or is this shown because you haven't merged in the current master?
// _____________________________________________________________________________
cppcoro::generator<std::string>
ExportQueryExecutionTrees::computeResultAsQLeverJSONStream(
For efficiency reasons (the result has to be packed in reasonably large batches), the inner generators (like, for example, this one here) should not be generator<string> but stream_generator (see the other functions). So the outermost generator of each type (in particular this one here) should be a stream_generator. And then it is best if you do your switch for the QLeverJSON also in the single computeResultAsStream function, which then transforms the stream_generator back into a generator<string> that contains large chunks. If you want to know more details, this is best explained in a meeting.
template <>
ad_utility::streams::stream_generator ExportQueryExecutionTrees::
    selectQueryResultToStream<ad_utility::MediaType::sparqlJson>(
        const QueryExecutionTree& qet,
        const parsedQuery::SelectClause& selectClause,
        LimitOffsetClause limitAndOffset,
        CancellationHandle cancellationHandle) {
After putting some thought into this, maybe the cleaner way is to give all those functions the [[maybe_unused]] request timer, s.t. the qleverJson export can just be a normal template specialization here.
There's one more thing needed for the qleverJson specialization methods: they both require ParsedQuery::_originalString, which can't be accessed via e.g. the ParsedQuery::SelectClause that is available in the selectQueryResultToStream template method. So this cleanup would come at the price of passing the ParsedQuery query to each selectQuery-/constructQueryToStream method and manually retrieving the required specific object there.
This is fine now as it is.
Currently all the tests fail, but I have identified the problem and proposed a fix. The remaining things are just very small renamings etc., s.t. we get this clean. This is almost done, great job!
} else if (quotePos < entitystr.size() - 2) {
  AD_CONTRACT_CHECK(entitystr[quotePos + 1] == '^');
As it seems this doesn't work because of our strange fulltext index; that's why the E2E tests fail. Restore the old behavior, and then add a TODO<joka921>: this can be an AD_CONTRACT_CHECK once the fulltext index vocabulary is stored in a consistent format.
src/engine/Server.cpp
Outdated
static constexpr std::array supportedTypes{
    MediaType::csv, MediaType::tsv, MediaType::octetStream,
    MediaType::sparqlXml, MediaType::turtle, MediaType::sparqlJson,
    MediaType::qleverJson};
AD_CORRECTNESS_CHECK(
    ad_utility::contains(supportedTypes, mediaType.value()));
Is this check even needed? Or can it be further inside (inside sendStreamableResponse, or even better inside computeResultAsStream), s.t. we further eliminate code from Server.cpp?
      CancellationHandle cancellationHandle);

  // ___________________________________________________________________________
- static nlohmann::json constructQueryResultBindingsToQLeverJSON(
+ static cppcoro::generator<std::string>
+ constructQueryResultBindingsToQLeverJSONStream(
No need to call it ...Stream, as you have deleted the old version, so the original name suffices.
@@ -133,14 +120,15 @@ class ExportQueryExecutionTrees {
   * then the query result will be obtained via `qet->getResult()`.
   * @return a 2D-Json array corresponding to the IdTable given the arguments
   */
- static nlohmann::json idTableToQLeverJSONArray(
+ static cppcoro::generator<std::string> idTableToQLeverJSONBindingsStream(
idTableToQLeverJSONBindings suffices (the Stream is not needed anymore, see below).
// TODO<unexenu>
static ad_utility::streams::stream_generator computeResultAsQLeverJSONStream(
No need for the Stream in the name, as you have deleted the other one. And a good comment is simply "Return the result of the query in the QLeverJSON format." (There still is a TODO there.)
@@ -37,7 +37,8 @@ class ExportQueryExecutionTrees {
  // lazily computes the serialized result in large chunks of bytes.
  static cppcoro::generator<std::string> computeResultAsStream(
As this is the only function left, we can simply call it computeResult (no AsStream needed, as everything is lazy).
template <>
ad_utility::streams::stream_generator ExportQueryExecutionTrees::
    selectQueryResultToStream<ad_utility::MediaType::sparqlJson>(
        const QueryExecutionTree& qet,
        const parsedQuery::SelectClause& selectClause,
        LimitOffsetClause limitAndOffset,
        CancellationHandle cancellationHandle) {
This is fine now as it is.
Quality Gate passed
Thank you very much.
This is a milestone for reducing QLever's RAM consumption and increasing its ability to compute and also export huge query results!
* This looks right to me. #1438 accidentally disabled the lazy computation of query results in JSON format. This commit not only reinstates this lazy computation, but also prefers lazy computation for all other export formats.
So far, when an error occurred after the sending of the result began, the HTTP status code was OK but the result was not. This resulted in confusing behavior in the QLever UI. In particular, if no result was produced at all, the QLever UI would wait for the result until the timeout and then show some generic error message. Since #1438, this affected any query with an error message, with the effect that no more specific error messages were shown in the QLever UI. This is now fixed as follows: 1. When the error occurs *before* the sending of the result begins (the typical case), a non-200 HTTP code is sent just like before (this was broken by #1438). 2. When the error occurs *after* the sending of the result has begun, a 200 HTTP code is sent (HTTP provides no way to change this code after sending has begun), but when the error occurs a message is sent that begins with a highly unusual character sequence (currently `!!!!>>#`). This message could then be detected by the QLever UI or other tools. It is important to understand that this is not a problem specific to QLever, but a problem of the simplistic HTTP 1.1 protocol.
JSON results (in the official SPARQL JSON format as well as QLever's custom JSON format that is used by qlever-ui) are now also created on the fly and sent out in chunks, without materializing the (possibly huge) complete JSON string in RAM. With this change, all query results are exported in this lazy manner, as support for all other result types (TSV, CSV, Turtle, XML, Binary) had already been implemented previously.