Calculating signatures is slow #13

Open
thijslemmens opened this issue Aug 24, 2021 · 3 comments

@thijslemmens

I'm trying out the PG extension to figure out whether I can use the signature strategy as an alternative to GROUP BY for insight into result sets. My aim is to have facets on very big result sets within a second. I'm talking about 5M rows to begin with, but some of the cases we want to tackle might be a lot larger.
From my experience so far, the "&" operator and the facet.count() function work reasonably fast, but calculating a signature (the facet.signature aggregate) takes too much time. I understand that the aggregate has to visit every row, but it is also slow compared to other aggregates over the same result set.
Do you have an idea what the reason could be? I'm looking at the sig_set function, but I'm not yet familiar with C code, so it takes some time. I suspect the memcpy is copying data for every row, and that might take most of the time.
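
To illustrate the pattern I suspect, a transition function that copies its state on every call would look roughly like this (my own reconstruction for illustration; sig_set_copying and the bit-array state layout are made up, not the extension's actual code):

```c
#include "postgres.h"
#include "fmgr.h"

PG_FUNCTION_INFO_V1(sig_set_copying);

/*
 * Illustration only: a transition function that copies the whole
 * signature-so-far on every row.  Over N rows with an S-byte state
 * this is O(N * S) of memcpy, which can dominate the scan itself.
 */
Datum
sig_set_copying(PG_FUNCTION_ARGS)
{
	bytea	   *old_state = PG_GETARG_BYTEA_P(0);
	int32		pos = PG_GETARG_INT32(1);
	Size		needed = VARHDRSZ + pos / 8 + 1;
	Size		len = Max(VARSIZE(old_state), needed);
	bytea	   *new_state = (bytea *) palloc0(len);

	/* the per-row copy: every byte accumulated so far moves again */
	memcpy(new_state, old_state, VARSIZE(old_state));
	SET_VARSIZE(new_state, len);

	/* set the bit for this row's position in the fresh copy */
	((unsigned char *) VARDATA(new_state))[pos / 8] |= 1 << (pos % 8);

	PG_RETURN_BYTEA_P(new_state);
}
```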

@fendt
Member

fendt commented Sep 1, 2021 via email

@forestofarden
Contributor

Dear thijslemmens,

Thanks for your message. Based on the sig_set source code, your suspicion that memcpy per row is slowing things down may be correct.

Does an aggregate that uses fixed space but still touches every row run much faster? For example, if memcpy is the limiting factor rather than the table scan itself, then select sum(id::real) from foo; should execute quickly by comparison.

One option would be to write a custom version of sig_set that only allocates new memory when the signature grows, and otherwise alters its state in place. (This would need to be a new function, since the existing sig_set is used elsewhere and must not modify its arguments.)
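
As a rough sketch of what I mean, assuming the signature state is a plain varlena bit array (the name sig_set_inplace and the state layout are assumptions on my part, not the current code):

```c
#include "postgres.h"
#include "fmgr.h"

PG_FUNCTION_INFO_V1(sig_set_inplace);

/*
 * Hypothetical in-place variant: only reallocates when the signature
 * must grow, and otherwise flips the bit directly in the existing
 * state.  Safe only inside an aggregate, where nothing else holds a
 * reference to the transition state.
 */
Datum
sig_set_inplace(PG_FUNCTION_ARGS)
{
	MemoryContext aggctx;
	bytea	   *state;
	int32		pos = PG_GETARG_INT32(1);
	Size		needed = VARHDRSZ + pos / 8 + 1;

	/* refuse to scribble on the argument outside an aggregate call */
	if (!AggCheckCallContext(fcinfo, &aggctx))
		elog(ERROR, "sig_set_inplace called in non-aggregate context");

	if (PG_ARGISNULL(0))
	{
		/* first row: allocate the state in the aggregate's context */
		MemoryContext oldctx = MemoryContextSwitchTo(aggctx);

		state = (bytea *) palloc0(needed);
		SET_VARSIZE(state, needed);
		MemoryContextSwitchTo(oldctx);
	}
	else
	{
		/* raw pointer fetch: no detoasted copy of the state */
		state = (bytea *) PG_GETARG_POINTER(0);

		if (VARSIZE(state) < needed)
		{
			/* grow geometrically so reallocation is amortized; for */
			/* simplicity the stored length here tracks the allocation */
			Size		oldsize = VARSIZE(state);
			Size		newsize = Max(needed, oldsize * 2);

			state = (bytea *) repalloc(state, newsize);
			memset((char *) state + oldsize, 0, newsize - oldsize);
			SET_VARSIZE(state, newsize);
		}
	}

	/* flip the bit in place instead of copying the whole signature */
	((unsigned char *) VARDATA(state))[pos / 8] |= 1 << (pos % 8);

	PG_RETURN_POINTER(state);
}
```

Registered as the transition function of a separate aggregate, this would leave the existing sig_set untouched for its other callers.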

If you wish to open a PR that does this, I would be happy to review and potentially merge it.

Best, Christopher

@thijslemmens
Author

Hello

We've been working with a partner to further explore faceting for PostgreSQL. They have published a first version of an extension on GitHub:
https://github.com/cybertec-postgresql/pgfaceting
