Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(performance): support for Map as dataset #93

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

dios-david
Copy link
Contributor

@dios-david dios-david commented Jan 30, 2023

As the High performance GROQ mentions, resolving references can be really slow when using a large dataset.
I ran into an issue where querying a list of documents can result in ~3200ms execution time. This query just queries documents and resolves one reference in each document using the -> operator, like:

*[_type == "book"] {
  title,
  "author": author->name
}

I made some debugging and I noticed that the majority of the time is spent in the Deref function when resolving references:

  async Deref({base}, scope, execute) {
    // ...

    const id = value.data._ref

    if (typeof id !== 'string') {
      return NULL_VALUE
    }

    // ⬇️ HERE ⬇️
    for await (const doc of scope.source) {
      if (doc.type === 'object' && id === doc.data._id) {
        return doc
    }

    return NULL_VALUE
  },

As the current implementation iterates over the entire dataset when resolving references, this will be exponentially slower when increasing the number of documents in the dataset.

I was thinking of simply indexing the documents using their _ids with a simple Map (Map<ID, Document>), so resolving references can be way faster than iterating over the whole dataset.

I'm not 100% sure what is the best way to add this feature, I was thinking of:
A) supporting a Map<ID, Document> as the value of the dataset parameter
B) adding an extra option like indexById: Boolean to evaluate which would transform the current dataset: Array<Document> to dataset: Map<ID, Document> before actually doing the evaluation

I ended up doing A) in this PR, and tested that slow query which I mentioned at the beginning with these changes, which got ~25x faster - from ~3200ms to ~125ms.

Let me know what you think!

Copy link
Member

@stipsan stipsan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work on the investigation and resulting fix! ✨ Appreciate it, looks good from my end. I'm from the Ecosystem team though and I'm happy to merge and release this, and new versions of @sanity/groq-store, @sanity/preview-kit and next-sanity, as soon as we get an approval from someone at team Content Lake 🙇

src/values/utils.ts Show resolved Hide resolved
@stipsan stipsan requested a review from a team February 1, 2023 22:02
@stipsan stipsan changed the title Adding support for Map as dataset feat(performance): support for Map as dataset Feb 1, 2023
@atombender
Copy link
Member

Thanks so much for the PR!

Intuitively, the idea that a Map can be a dataset feels a bit wrong to me. A map is conceptually the same as an object, and I think it would be useful to support maps as objects inside the pipeline, not just plain JS objects, whereas your PR introduces (if I understood it correctly) the idea that a Map value is always a dataset that maps IDs to documents.

I think being explicit would be better here. We could introduce something like a Dataset interface:

interface Dataset {
  allDocuments(): [Object]
  getDocument(id: string): Object
}

Then we can have concrete implementations that wrap arrays, Map index by ID, etc. This would be useful to groq-store, too.

@dios-david
Copy link
Contributor Author

Hey @atombender,

I agree with your point. I'm actually using groq-store and trying to make it a bit faster on my end. Using Map looked to be the simplest change for me as I'm not that familiar with the codebase, but having the Dataset interface you mentioned above would be the most flexible solution in long-term.

Shall I try to spend some time on changing this implementation to accommodate that, or would you prefer to take this over?

@atombender
Copy link
Member

I think it would be great if you could amend this PR!

@atombender
Copy link
Member

@judofyr Any thoughts about what this should look like?

@judofyr
Copy link
Collaborator

judofyr commented Feb 2, 2023

We've been talking about making the whole dereference operator configurable:

evaluate(tree, {
  dataset,
  dereference(obj) { // todo: bikeshed the name
    // this is the default behavior:
    for (const doc of dataset) {
      if (obj._ref === doc._id) return doc
    }
    return null
})

In groq-store this could be useful when we want to load partial data and give an error when trying to dereference a (strong) reference to something which wasn't loaded.

And for the case described here you could just do evaluate(tree, { dataset, dereference: datasetAsMap.get }) and dereferences would be instantly fast.

Another extension would be to support dereference: 'loop' (for today's behavior) and dereference: 'map' (which builds the Map for you and uses it) and then default to dereference: 'map' for better performance (over some more memory usage).

@dios-david
Copy link
Contributor Author

dereference: 'map' (which builds the Map for you and uses it) and then default to dereference: 'map' for better performance

I had a similar idea (the "B" option I mentioned in the PR description), my reason of not doing that was:

  • groq.evaluate is a one-off operation, so the array->map transformation would happen each time you run a query even when using the same dataset multiple times
  • the transformation might not be needed at all, e.g. if you don't have reference resolution used in the query - I guess technically you can just do the transformation at the first time the dereference operator is used in the query, or do some pre-parsing etc
evaluate(tree, { dataset: datasetAsArray, dereference: datasetAsMap.get })

This makes sense as well for me, I'm happy to amend this PR in any direction you think would be the best :)

@judofyr
Copy link
Collaborator

judofyr commented Feb 13, 2023

evaluate(tree, { dataset: datasetAsArray, dereference: datasetAsMap.get })

This makes sense as well for me, I'm happy to amend this PR in any direction you think would be the best :)

Yes, I'm still leaning towards this solution. With the customized dereference-function it's also trivial to implement "on-demand Map creation" so that sounds like a good solution.

For consistency the signature should probably be dereference: (obj: unknown) => PromiseLike<unknown>.

@dios-david
Copy link
Contributor Author

For consistency the signature should probably be dereference: (obj: unknown) => PromiseLike<unknown>.

If I get it right the deref function would be called when an object is a reference ({ _ref: 'some-id-of-another-document' }). Would you pass this to the dereference function, or would you just pass in the some-id-of-another-document?

The options I can think of are:

  • (obj: { _ref: string }) => PromiseLike<Document | null>
  • (id: string) => PromiseLike<Document | null>

@judofyr
Copy link
Collaborator

judofyr commented Feb 15, 2023

Two thoughts:

  1. I think it's valuable to get the full reference object. E.g. you want want to do something different if weak: true or if you have added additional fields.
  2. It's technically possible to run a query like *[_type == "author"]._id-> where you will invoke the dereference operation on a non-object. This is all about how much we want to restrict the usage of dereference. We can say that the evaluator will validate that it actually conforms to {_ref: string} (and otherwise immediately return null without invoking the custom dereference function). That means that there's less validation you typically need in your dereference operator at the expense of making it possible to implement weird dereferencing (dereferencing a number??)

I think I'm leaning towards having the function have (obj: { _ref: string }) => PromiseLike<Document | null> as signature and we'll validate that the object actually has string _ref.

@dios-david
Copy link
Contributor Author

dios-david commented Feb 15, 2023

It's technically possible to run a query like *[_type == "author"]._id-> where you will invoke the dereference operation on a non-object.

dereferencing a number??

Oh yeah, that would be weird :) The spec doesn't allow that either, although I could imagine -> being an identity function when applied to non-reference objects - but this is way out of scope for this discussion, and not compliant with the spec.

Great, thanks for the discussion @judofyr! I think I will open this as a separate PR as I think there is nothing I could reuse from this branch.

@judofyr
Copy link
Collaborator

judofyr commented Feb 16, 2023

Oh yeah, that would be weird :) The spec doesn't allow that either, although I could imagine -> being an identity function when applied to non-reference objects - but this is way out of scope for this discussion, and not compliant with the spec.

Well, once we allow the dereference operator to be overridden in any way then it's possible to make it not compliant with the spec. I think it's fine for groq-js to provide additional functionality as long as the defaults are compliant.

Great, thanks for the discussion @judofyr! I think I will open this as a separate PR as I think there is nothing I could reuse from this branch.

Nice!!

@vquigley vquigley requested review from judofyr and removed request for a team September 14, 2023 09:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants