GH-123894: Simplify ascii decode using unaligned loads #123895
Conversation
While it's easier to read, I'm wondering whether the memcpy() calls could introduce some overhead. We are copying many small parts and checking ASCII_CHAR_MASK on each of them, instead of doing one single large memcpy() with some pointer arithmetic behind it.
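As a minimal sketch of the pattern being discussed (my own simplified illustration, not the actual diff; the ASCII_CHAR_MASK value is assumed to be the usual "high bit of every byte" mask, written here for a 64-bit size_t, whereas CPython selects it via SIZEOF_SIZE_T):

    #include <stddef.h>
    #include <string.h>

    /* Assumption: 64-bit size_t; CPython derives this mask from SIZEOF_SIZE_T. */
    #define ASCII_CHAR_MASK ((size_t)0x8080808080808080ULL)

    /* Copy ASCII bytes from [start, end) into dest one size_t word at a time,
     * stopping at the first non-ASCII byte.  Returns the number of bytes
     * copied.  memcpy() with a constant size is expected to compile to a
     * single (possibly unaligned) load/store, not a function call. */
    static size_t ascii_decode_sketch(const char *start, const char *end,
                                      unsigned char *dest)
    {
        const char *p = start;
        unsigned char *q = dest;

        while ((size_t)(end - p) >= sizeof(size_t)) {
            size_t value;
            memcpy(&value, p, sizeof(size_t));   /* one word-sized load */
            if (value & ASCII_CHAR_MASK)
                break;                           /* a byte >= 0x80 in this word */
            memcpy(q, &value, sizeof(size_t));   /* one word-sized store */
            p += sizeof(size_t);
            q += sizeof(size_t);
        }
        /* Tail: handle the remaining bytes (and the word containing the
         * first non-ASCII byte, if any) one byte at a time. */
        while (p < end && (unsigned char)*p < 0x80)
            *q++ = (unsigned char)*p++;
        return (size_t)(p - start);
    }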
break;
*q++ = *p++;
Py_UCS1 *q = dest;
const char *size_t_end = end - (SIZEOF_SIZE_T - 1);
Not sure if we already discussed it, but would _Py_ALIGN_DOWN(end, SIZEOF_SIZE_T) work?
Probably. Is using a macro here important?
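For context, a hedged sketch of what the two bounds compute (my own illustration, not from the diff; MY_ALIGN_DOWN mirrors what _Py_ALIGN_DOWN in Include/pymacro.h does, namely round a pointer down to an aligned address):

    #include <stdint.h>
    #include <stddef.h>

    /* Mirrors _Py_ALIGN_DOWN(p, a): greatest a-aligned address <= p. */
    #define MY_ALIGN_DOWN(p, a) ((void *)((uintptr_t)(p) & ~(uintptr_t)((a) - 1)))

    static void compare_bounds(const char *end)
    {
        /* Bound used in the patch: a sizeof(size_t)-byte load at p stays
         * inside the buffer exactly when p < word_end. */
        const char *word_end = end - (sizeof(size_t) - 1);

        /* Macro-based bound: greatest aligned address <= end.  With an
         * unaligned p this does not by itself keep the load in range, so
         * the two are only interchangeable if the loop keeps p aligned. */
        const char *aligned_end = MY_ALIGN_DOWN(end, sizeof(size_t));

        (void)word_end;
        (void)aligned_end;
    }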
const size_t *restrict _p = (const size_t *)p;
size_t *restrict _q = (size_t *)q;
size_t value;
memcpy(&value, _p, SIZEOF_SIZE_T);
How is performance affected by those multiple memcpy() calls?
memcpy() isn't actually called: a memcpy() with a fixed, known size gets optimized to a single load.
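A standalone illustration of that point (not from the PR): with optimizations enabled, GCC, Clang, and MSVC typically lower this to a single word-sized mov rather than a library call.

    #include <string.h>
    #include <stdint.h>

    /* The compile-time-constant size lets the compiler replace the memcpy()
     * call with one (possibly unaligned) 8-byte load. */
    static uint64_t load_word(const char *p)
    {
        uint64_t value;
        memcpy(&value, p, sizeof value);
        return value;
    }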
Ah, I see the ASM, yes. I'd suggest reformulating the comment by saying that "When supported, compilers optimize memcpy() into a single LOAD instruction if the number of bytes to copy is a literal value".
I confirmed it on GCC, but is it also the case for MSVC and Clang? (NVM, I posted this before seeing your reply.)
I wasn't aware of any precedent for using
This can be closed as the CPython maintainers prefer aligned loads: #120212 (comment)
When working on #120212 I noticed that the code for ascii_decode could be written much more simply.
The current code avoids unaligned loads, but on x86-64 unaligned loads are not necessarily slower. Using memcpy() lets the compiler emit aligned loads on platforms that require them, while still using unaligned loads on platforms that handle them well.
It simplifies the code and allows for easier manual unrolling if that is desired at a later point.
EDIT: I don't feel this warrants a NEWS entry, so I haven't written one.
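A rough before/after sketch of the load itself, paraphrasing the idea above (my own schematic, not the actual diff):

    #include <string.h>
    #include <stddef.h>

    /* Before (schematic): reading through a size_t pointer, which obliges
     * the surrounding loop to keep p size_t-aligned. */
    static size_t load_aligned(const char *p)
    {
        return *(const size_t *)p;
    }

    /* After (schematic): memcpy() with a constant size lets the compiler
     * emit an unaligned load where that is cheap (e.g. x86-64) and a safe
     * sequence where the platform requires alignment. */
    static size_t load_any(const char *p)
    {
        size_t value;
        memcpy(&value, p, sizeof value);
        return value;
    }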