Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-123894: Simplify ascii decode using unaligned loads #123895

Closed
wants to merge 1 commit into from

Conversation

rhpvorderman
Copy link
Contributor

@rhpvorderman rhpvorderman commented Sep 10, 2024

When working on #120212 I noticed that the code for ascii_decode could be worded much simpler.

The current code prevents unaligned loads, but on x86-64 unaligned loads are not necessarily slower. By using memcpy the compiler is allowed to force aligned loads on platforms that require it, while allowing unaligned loads on platforms that support it.
It simplifies the code and allows for easier manual unrolling if that is desired at a later point.

EDIT: I feel that this does not warrant a NEWS entry. So I haven't written one.

Copy link
Contributor

@picnixz picnixz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While it's easier to read, I'm wondering whether the memcpy() calls could induce some overhead. We are copying many small parts and check for the ASCII_CHAR_MASK mask instead of one single large memcpy() and some pointer arithmetic behind.

break;
*q++ = *p++;
Py_UCS1 *q = dest;
const char *size_t_end = end - (SIZEOF_SIZE_T - 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if we already discussed it but would _Py_ALIGN_DOWN(end, SIZEOF_SIZE_T) work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably. Is using a macro here important?

const size_t *restrict _p = (const size_t *)p;
size_t *restrict _q = (size_t *)q;
size_t value;
memcpy(&value, _p, SIZEOF_SIZE_T);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are performances affected by those multiple memcpy() calls?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Memcpy isn't called. Memcpy with a given size gets optimized to a single load.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@picnixz picnixz Sep 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see the ASM yes. I'd suggest reformulating the comment by saying that "When supported, compilers optimize memcpy() into a single LOAD instruction if the number of bytes to copy is a literal value".

I confirmed it on GCC, but is it the case for MSVC & clang by the way? (NVM, I posted before seeing your reply)

@ZeroIntensity
Copy link
Member

I wasn't aware of any precedent for using size_t in the codebase. I do see that it was there beforehand, but maybe that was a mistake -- we should be using Py_ssize_t, shouldn't we?

@rhpvorderman
Copy link
Contributor Author

This can be closed as the CPython maintainers prefer aligned loads: #120212 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants