Calculating signatures is slow #13

Open
thijslemmens opened this issue Aug 24, 2021 · 3 comments

@thijslemmens

I'm trying out the PG extension to figure out whether I can use the signature strategy as an alternative to GROUP BY for insight into result sets. My aim is to have facets on very big result sets within a second. I'm talking about 5M rows to begin with, but some of the cases we want to tackle might be a lot larger.
From my experience so far, the "&" operator and the facet.count() function work reasonably fast, but calculating a signature (the facet.signature aggregate) takes too much time. I understand that the aggregate has to visit every row, but it is also slow compared to other aggregates over the same result set.
Do you have an idea what the reason could be? I'm looking at the sig_set function, but I'm not yet familiar with C code, so it takes some time. I suspect the memcpy is copying data for every row, and that might take most of the time.
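
To illustrate the pattern I suspect, a transition function that copies its state on every call would look roughly like this (my own reconstruction for illustration; sig_set_copying and the bit-array state layout are made up, not the extension's actual code):

```c
#include "postgres.h"
#include "fmgr.h"

PG_FUNCTION_INFO_V1(sig_set_copying);

/*
 * Illustration only: a transition function that copies the whole
 * signature-so-far on every row.  Over N rows with an S-byte state
 * this is O(N * S) of memcpy, which can dominate the scan itself.
 */
Datum
sig_set_copying(PG_FUNCTION_ARGS)
{
	bytea	   *old_state = PG_GETARG_BYTEA_P(0);
	int32		pos = PG_GETARG_INT32(1);
	Size		needed = VARHDRSZ + pos / 8 + 1;
	Size		len = Max(VARSIZE(old_state), needed);
	bytea	   *new_state = (bytea *) palloc0(len);

	/* the per-row copy: every byte accumulated so far moves again */
	memcpy(new_state, old_state, VARSIZE(old_state));
	SET_VARSIZE(new_state, len);

	/* set the bit for this row's position in the fresh copy */
	((unsigned char *) VARDATA(new_state))[pos / 8] |= 1 << (pos % 8);

	PG_RETURN_BYTEA_P(new_state);
}
```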

@fendt
Member

fendt commented Sep 1, 2021 via email

@forestofarden
Contributor

Dear thijslemmens,

Thanks for your message. Based on the sig_set source code, your suspicion that memcpy per row is slowing things down may be correct.

Does an aggregate that uses fixed space but still touches every row run much faster? For example, if memcpy is the limiting factor rather than the table scan itself, then select sum(id::real) from foo; should execute quickly by comparison.

One option would be to write a custom version of sig_set that only allocates new memory when the signature grows, and otherwise alters its state in place. (This would need to be a new function, since the existing sig_set is used elsewhere and must not modify its arguments.)
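
As a rough sketch of what I mean, assuming the signature state is a plain varlena bit array (the name sig_set_inplace and the state layout are assumptions on my part, not the current code):

```c
#include "postgres.h"
#include "fmgr.h"

PG_FUNCTION_INFO_V1(sig_set_inplace);

/*
 * Hypothetical in-place variant: only reallocates when the signature
 * must grow, and otherwise flips the bit directly in the existing
 * state.  Safe only inside an aggregate, where nothing else holds a
 * reference to the transition state.
 */
Datum
sig_set_inplace(PG_FUNCTION_ARGS)
{
	MemoryContext aggctx;
	bytea	   *state;
	int32		pos = PG_GETARG_INT32(1);
	Size		needed = VARHDRSZ + pos / 8 + 1;

	/* refuse to scribble on the argument outside an aggregate call */
	if (!AggCheckCallContext(fcinfo, &aggctx))
		elog(ERROR, "sig_set_inplace called in non-aggregate context");

	if (PG_ARGISNULL(0))
	{
		/* first row: allocate the state in the aggregate's context */
		MemoryContext oldctx = MemoryContextSwitchTo(aggctx);

		state = (bytea *) palloc0(needed);
		SET_VARSIZE(state, needed);
		MemoryContextSwitchTo(oldctx);
	}
	else
	{
		/* raw pointer fetch: no detoasted copy of the state */
		state = (bytea *) PG_GETARG_POINTER(0);

		if (VARSIZE(state) < needed)
		{
			/* grow geometrically so reallocation is amortized; for */
			/* simplicity the stored length here tracks the allocation */
			Size		oldsize = VARSIZE(state);
			Size		newsize = Max(needed, oldsize * 2);

			state = (bytea *) repalloc(state, newsize);
			memset((char *) state + oldsize, 0, newsize - oldsize);
			SET_VARSIZE(state, newsize);
		}
	}

	/* flip the bit in place instead of copying the whole signature */
	((unsigned char *) VARDATA(state))[pos / 8] |= 1 << (pos % 8);

	PG_RETURN_POINTER(state);
}
```

Registered as the transition function of a separate aggregate, this would leave the existing sig_set untouched for its other callers.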

If you wish to open a PR that does this, I would be happy to review and potentially merge it.

Best, Christopher

@thijslemmens
Author

Hello

We've been working with a partner to further explore faceting for PostgreSQL. They have published a first version of an extension on GitHub:
https://github.com/cybertec-postgresql/pgfaceting
