feat(performance): support for Map as dataset #93

dios-david · 2023-01-30T20:58:46Z

As the High performance GROQ mentions, resolving references can be really slow when using a large dataset.
I ran into an issue where querying a list of documents can result in ~3200ms execution time. This query just queries documents and resolves one reference in each document using the -> operator, like:

*[_type == "book"] {
  title,
  "author": author->name
}

I made some debugging and I noticed that the majority of the time is spent in the Deref function when resolving references:

  async Deref({base}, scope, execute) {
    // ...

    const id = value.data._ref

    if (typeof id !== 'string') {
      return NULL_VALUE
    }

    // ⬇️ HERE ⬇️
    for await (const doc of scope.source) {
      if (doc.type === 'object' && id === doc.data._id) {
        return doc
    }

    return NULL_VALUE
  },

As the current implementation iterates over the entire dataset when resolving references, this will be exponentially slower when increasing the number of documents in the dataset.

I was thinking of simply indexing the documents using their _ids with a simple Map (Map<ID, Document>), so resolving references can be way faster than iterating over the whole dataset.

I'm not 100% sure what is the best way to add this feature, I was thinking of:
A) supporting a Map<ID, Document> as the value of the dataset parameter
B) adding an extra option like indexById: Boolean to evaluate which would transform the current dataset: Array<Document> to dataset: Map<ID, Document> before actually doing the evaluation

I ended up doing A) in this PR, and tested that slow query which I mentioned at the beginning with these changes, which got ~25x faster - from ~3200ms to ~125ms.

Let me know what you think!

stipsan

Good work on the investigation and resulting fix! ✨ Appreciate it, looks good from my end. I'm from the Ecosystem team though and I'm happy to merge and release this, and new versions of @sanity/groq-store, @sanity/preview-kit and next-sanity, as soon as we get an approval from someone at team Content Lake 🙇

src/values/utils.ts

atombender · 2023-02-01T22:52:43Z

Thanks so much for the PR!

Intuitively, the idea that a Map can be a dataset feels a bit wrong to me. A map is conceptually the same as an object, and I think it would be useful to support maps as objects inside the pipeline, not just plain JS objects, whereas your PR introduces (if I understood it correctly) the idea that a Map value is always a dataset that maps IDs to documents.

I think being explicit would be better here. We could introduce something like a Dataset interface:

interface Dataset {
  allDocuments(): [Object]
  getDocument(id: string): Object
}

Then we can have concrete implementations that wrap arrays, Map index by ID, etc. This would be useful to groq-store, too.

dios-david · 2023-02-02T09:58:34Z

Hey @atombender,

I agree with your point. I'm actually using groq-store and trying to make it a bit faster on my end. Using Map looked to be the simplest change for me as I'm not that familiar with the codebase, but having the Dataset interface you mentioned above would be the most flexible solution in long-term.

Shall I try to spend some time on changing this implementation to accommodate that, or would you prefer to take this over?

atombender · 2023-02-02T13:36:59Z

I think it would be great if you could amend this PR!

atombender · 2023-02-02T13:37:26Z

@judofyr Any thoughts about what this should look like?

judofyr · 2023-02-02T15:13:39Z

We've been talking about making the whole dereference operator configurable:

evaluate(tree, {
  dataset,
  dereference(obj) { // todo: bikeshed the name
    // this is the default behavior:
    for (const doc of dataset) {
      if (obj._ref === doc._id) return doc
    }
    return null
})

In groq-store this could be useful when we want to load partial data and give an error when trying to dereference a (strong) reference to something which wasn't loaded.

And for the case described here you could just do evaluate(tree, { dataset, dereference: datasetAsMap.get }) and dereferences would be instantly fast.

Another extension would be to support dereference: 'loop' (for today's behavior) and dereference: 'map' (which builds the Map for you and uses it) and then default to dereference: 'map' for better performance (over some more memory usage).

dios-david · 2023-02-02T15:30:17Z

dereference: 'map' (which builds the Map for you and uses it) and then default to dereference: 'map' for better performance

I had a similar idea (the "B" option I mentioned in the PR description), my reason of not doing that was:

groq.evaluate is a one-off operation, so the array->map transformation would happen each time you run a query even when using the same dataset multiple times
the transformation might not be needed at all, e.g. if you don't have reference resolution used in the query - I guess technically you can just do the transformation at the first time the dereference operator is used in the query, or do some pre-parsing etc

evaluate(tree, { dataset: datasetAsArray, dereference: datasetAsMap.get })

This makes sense as well for me, I'm happy to amend this PR in any direction you think would be the best :)

judofyr · 2023-02-13T09:13:54Z

evaluate(tree, { dataset: datasetAsArray, dereference: datasetAsMap.get })
This makes sense as well for me, I'm happy to amend this PR in any direction you think would be the best :)

Yes, I'm still leaning towards this solution. With the customized dereference-function it's also trivial to implement "on-demand Map creation" so that sounds like a good solution.

For consistency the signature should probably be dereference: (obj: unknown) => PromiseLike<unknown>.

dios-david · 2023-02-14T10:11:26Z

For consistency the signature should probably be dereference: (obj: unknown) => PromiseLike<unknown>.

If I get it right the deref function would be called when an object is a reference ({ _ref: 'some-id-of-another-document' }). Would you pass this to the dereference function, or would you just pass in the some-id-of-another-document?

The options I can think of are:

(obj: { _ref: string }) => PromiseLike<Document | null>
(id: string) => PromiseLike<Document | null>

judofyr · 2023-02-15T14:36:45Z

Two thoughts:

I think it's valuable to get the full reference object. E.g. you want want to do something different if weak: true or if you have added additional fields.
It's technically possible to run a query like *[_type == "author"]._id-> where you will invoke the dereference operation on a non-object. This is all about how much we want to restrict the usage of dereference. We can say that the evaluator will validate that it actually conforms to {_ref: string} (and otherwise immediately return null without invoking the custom dereference function). That means that there's less validation you typically need in your dereference operator at the expense of making it possible to implement weird dereferencing (dereferencing a number??)

I think I'm leaning towards having the function have (obj: { _ref: string }) => PromiseLike<Document | null> as signature and we'll validate that the object actually has string _ref.

dios-david · 2023-02-15T18:07:22Z

It's technically possible to run a query like *[_type == "author"]._id-> where you will invoke the dereference operation on a non-object.

dereferencing a number??

Oh yeah, that would be weird :) The spec doesn't allow that either, although I could imagine -> being an identity function when applied to non-reference objects - but this is way out of scope for this discussion, and not compliant with the spec.

Great, thanks for the discussion @judofyr! I think I will open this as a separate PR as I think there is nothing I could reuse from this branch.

judofyr · 2023-02-16T08:55:16Z

Oh yeah, that would be weird :) The spec doesn't allow that either, although I could imagine -> being an identity function when applied to non-reference objects - but this is way out of scope for this discussion, and not compliant with the spec.

Well, once we allow the dereference operator to be overridden in any way then it's possible to make it not compliant with the spec. I think it's fine for groq-js to provide additional functionality as long as the defaults are compliant.

Great, thanks for the discussion @judofyr! I think I will open this as a separate PR as I think there is nothing I could reuse from this branch.

Nice!!

dios-david and others added 3 commits January 30, 2023 17:52

adding support for Map as dataset

da5b71e

adding tests for count function and make it work with map datasets

648017d

chore: apply prettier

21f06fb

stipsan approved these changes Feb 1, 2023

View reviewed changes

src/values/utils.ts Show resolved Hide resolved

stipsan requested a review from a team February 1, 2023 22:02

stipsan changed the title ~~Adding support for Map as dataset~~ feat(performance): support for Map as dataset Feb 1, 2023

dios-david mentioned this pull request Feb 20, 2023

Adding option to provide a custom dereference function #98

Merged

vquigley requested review from judofyr and removed request for a team September 14, 2023 09:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(performance): support for Map as dataset #93

feat(performance): support for Map as dataset #93

dios-david commented Jan 30, 2023 •

edited

Loading

stipsan left a comment

atombender commented Feb 1, 2023

dios-david commented Feb 2, 2023

atombender commented Feb 2, 2023

atombender commented Feb 2, 2023

judofyr commented Feb 2, 2023

dios-david commented Feb 2, 2023

judofyr commented Feb 13, 2023

dios-david commented Feb 14, 2023

judofyr commented Feb 15, 2023

dios-david commented Feb 15, 2023 •

edited

Loading

judofyr commented Feb 16, 2023

feat(performance): support for Map as dataset #93

Are you sure you want to change the base?

feat(performance): support for Map as dataset #93

Conversation

dios-david commented Jan 30, 2023 • edited Loading

stipsan left a comment

Choose a reason for hiding this comment

atombender commented Feb 1, 2023

dios-david commented Feb 2, 2023

atombender commented Feb 2, 2023

atombender commented Feb 2, 2023

judofyr commented Feb 2, 2023

dios-david commented Feb 2, 2023

judofyr commented Feb 13, 2023

dios-david commented Feb 14, 2023

judofyr commented Feb 15, 2023

dios-david commented Feb 15, 2023 • edited Loading

judofyr commented Feb 16, 2023

dios-david commented Jan 30, 2023 •

edited

Loading

dios-david commented Feb 15, 2023 •

edited

Loading