STM32 Hash Accelerator #2528
Note that the hash_v1 implementation differs from the hash_v2 and hash_v3 implementations in that it does not integrate DMA. While the peripheral supports DMA, it does not support multiple DMA transfers (MDMAT). This makes it impractical for DMA to be used in a context-switching scenario. Thus, DMA is only implemented for hash_v2 and hash_v3.
Very cool! Can you add on-hardware tests for HASH here? This is a prime candidate for testing since it needs no connections to external hardware.
Will do!
the |
50% of the 8 supported boards failed on the first HIL round :( Two of the boards failed due to an apparent error in the quantity of context registers that needs to be corrected in the stm32-data repo. I will open a PR for that. The other two failures were the two STM32U5 boards. I don't currently have this hardware on hand and the log is a little vague:
The affected line in source is:
I assume it has something to do with the DMA configuration. I can work on getting hold of an STM32U5 Nucleo board to debug. In the meantime, if anyone has any insight I would appreciate it :)
the "WARN - UNWIND: Error while checking for exception context" is a red herring: it's an error from trying to print the stack trace, and by then the test has already failed. The "real" error is "WARN - Deadline exceeded!". Tests have a timeout of 10 seconds by default, so "deadline exceeded" means the test took longer than that (i.e. it's hanging).
if you want, I can give you a teleprobe token so you can locally |
I was able to debug the HIL build issue on a NUCLEO-U5A5ZJ and uncovered a very frustrating issue. There is in fact a 4th variant of this peripheral that is used on the STM32WBA, STM32H5, and STM32U5 series. From the STM32 reference manual, there is virtually no difference in this peripheral from the v2 variant besides the memory mapping of ALGO bits. (I suspect the real difference is under the hood to support GPDMA on these chips.) The HIL test is timing out on these chips because when multiple DMA transfers are enabled (MDMAT=1), the data input interrupt status bit (DINIS) is never asserted after a DMA transfer. This prevents context switching because the CSRx registers are never loaded. Note that the peripheral otherwise works exactly as expected as long as you don't attempt to context switch between DMA transfers. I wonder if this is a silicon issue and I'm just the first person to try this particular feature on these chips... My current plan is to add a v4 variant of this peripheral without DMA support. Not ideal, but still considerably faster than a software implementation.
the L0 ADC failures should be fixed in git main, could you rebase?
embassy-stm32/src/hash/v1v3v4.rs
/// Computes a digest for the given context. A slice of the provided digest buffer is returned.
/// The length of the returned slice is dependent on the digest length of the selected algorithm.
pub async fn finish<'a>(&mut self, mut ctx: Context, digest: &'a mut [u8; MAX_DIGEST_SIZE]) -> &'a [u8] {
this signature could be improved. We shouldn't require the user to allocate MAX_DIGEST_SIZE bytes if they're using an algorithm with a smaller digest.
Also, returning a subslice of the passed-in slice sometimes runs into trouble with the borrow checker. In the rest of Embassy we follow the other convention, which is returning the written length as usize.
so I suggest changing it to:
pub async fn finish(&mut self, mut ctx: Context, digest: &mut [u8]) -> usize {
it could either panic or return an error if the buffer is too short.
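A minimal sketch of that convention, outside of any driver: the caller passes a buffer of whatever size it likes, and the function reports how many bytes it wrote or returns an error if the buffer is too short. The names `write_digest` and `DigestError` are hypothetical, and the slice of `u32` words stands in for whatever the driver reads back from the peripheral; the big-endian byte order is an assumption of this sketch, not something stated in the thread.

```rust
/// Hypothetical error type for this sketch (not part of embassy-stm32).
#[derive(Debug, PartialEq)]
pub enum DigestError {
    /// The caller's buffer cannot hold the whole digest.
    BufferTooSmall,
}

/// Copies `digest_words` into `out` and returns the number of bytes written.
///
/// `digest_words` is a stand-in for the words a driver would read from the
/// digest registers. Byte order is assumed big-endian for illustration.
pub fn write_digest(digest_words: &[u32], out: &mut [u8]) -> Result<usize, DigestError> {
    let len = digest_words.len() * 4;
    if out.len() < len {
        return Err(DigestError::BufferTooSmall);
    }
    // Fill only the first `len` bytes; any extra space in `out` is untouched.
    for (chunk, word) in out[..len].chunks_exact_mut(4).zip(digest_words) {
        chunk.copy_from_slice(&word.to_be_bytes());
    }
    Ok(len)
}
```

A caller that wants the digest as a slice can still get one with `&buf[..len]`, without the returned borrow being tied to the call itself.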
embassy-stm32/src/hash/v2.rs
/// Instantiates, resets, and enables the HASH peripheral.
pub fn new(
    peripheral: impl Peripheral<P = T> + 'd,
    dma: impl Peripheral<P = D> + 'd,
I'm somewhat worried about code duplication here, and also about the fact that the Hash::new() signature changes between chip families, which is something we try to avoid, since a stable signature helps with writing portable code.
I think it'd be better to have 2 constructors: new with DMA and new_blocking without (which sets the D generic param to NoDma). So the same driver can be used both with and without DMA. Then you can cfg the DMA one to only v2.
This is what other drivers like UART, SPI etc. already do; you might want to look at their code to see how it works.
The advantage is that you're not forced to use DMA on v2, so you can write code that works on all families as long as you don't use DMA.
It looks like other STM32 drivers are only implementing new and not new_blocking. Could you clarify how this should look? For instance this doesn't compile:
pub struct Hash<'d, T: Instance, D = NoDma> {
    _peripheral: PeripheralRef<'d, T>,
    dma: PeripheralRef<'d, D>,
}
impl<'d, T: Instance, D> Hash<'d, T, D> {
    pub fn new_blocking(
        peripheral: impl Peripheral<P = T> + 'd,
        irq: impl interrupt::typelevel::Binding<T::Interrupt, InterruptHandler<T>> + 'd,
    ) -> Self {
        HASH::enable_and_reset();
        let no_dma = NoDma;
        into_ref!(peripheral, no_dma);
        let instance = Self {
            _peripheral: peripheral,
            dma: no_dma, // ERROR: expected struct `PeripheralRef<'d, D>', found struct `PeripheralRef<'_, NoDma>`
        };
        T::Interrupt::unpend();
        unsafe { T::Interrupt::enable() };
        instance
    }
}
Forgive me if this is a dumb question, I'm still new to Rust.
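The error above comes from the impl block rather than the struct: inside `impl<'d, T: Instance, D> Hash<'d, T, D>`, `Self` means "a Hash with whatever `D` the caller picks", so a body that always builds a `NoDma` field cannot satisfy it. One way to express the two-constructor idea is to put the blocking constructor in a separate impl block where `D` is fixed. The sketch below uses stand-in types only (`NoDma`, `DmaChannel`, and this `Hash` are hypothetical simplifications, with no `PeripheralRef`, lifetimes, or interrupts) and is not the code that was merged.

```rust
use core::marker::PhantomData;

/// Stand-in marker meaning "no DMA channel" (hypothetical, for this sketch).
pub struct NoDma;

/// Stand-in for a real DMA channel type (hypothetical, for this sketch).
pub struct DmaChannel;

/// Simplified driver struct; `D` defaults to `NoDma` when not spelled out.
pub struct Hash<T, D = NoDma> {
    _peripheral: PhantomData<T>,
    dma: D,
}

// Generic over `D`: the caller supplies the DMA channel, so `Self` and the
// value being built agree on `D`.
impl<T, D> Hash<T, D> {
    pub fn new(dma: D) -> Self {
        Self { _peripheral: PhantomData, dma }
    }
}

// `D` is fixed to `NoDma` here, so `Self` is exactly `Hash<T, NoDma>` and a
// `NoDma` value has the right type for the `dma` field.
impl<T> Hash<T, NoDma> {
    pub fn new_blocking() -> Self {
        Self { _peripheral: PhantomData, dma: NoDma }
    }
}
```

With this shape, `Hash::<P>::new_blocking()` is available for every peripheral type `P`, while `new` could be gated with `cfg` to the variants that actually support DMA.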
embassy-stm32/src/hash/v1v3v4.rs
T::regs().str().write(|w| w.set_dcal(true));

// Wait for completion.
poll_fn(|cx| {
IMO these should just do a blocking busy-loop wait. The manual says these computations take ~60 cycles. The cost of an async context switch is ~100 cycles in the very best case, likely more. So using async waits here is likely to make things slower for no benefit (same for the poll_fn in store_context).
Another benefit of doing blocking waits is now the entire hash computation can be blocking (when not using DMA), which is required in some cases, for example when implementing the RustCrypto traits.
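As a rough illustration of the blocking alternative, the sketch below spins on a "done" check instead of registering a waker and yielding. The `busy_wait` helper, the closure standing in for a status-register read, and the `max_spins` bound are all assumptions made for this example; a real driver would poll the peripheral's status flag directly and may not bound the loop at all.

```rust
/// Polls `is_done` in a tight loop until it returns true.
///
/// Returns `true` if the condition was observed, or `false` if `max_spins`
/// polls elapsed first. `is_done` stands in for a status-register read.
pub fn busy_wait(mut is_done: impl FnMut() -> bool, max_spins: u32) -> bool {
    for _ in 0..max_spins {
        if is_done() {
            return true;
        }
        // Hint to the CPU that this is a spin loop; no task switch happens.
        core::hint::spin_loop();
    }
    false
}
```

Because nothing here is `async`, a caller can use it from ordinary blocking code, which is the property the comment above points to for trait implementations that cannot await.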
That makes sense. I will make that adjustment.
I was actually looking at the RustCrypto traits today. If implementing a trait such as Digest is desired or appropriate to include within this project I can certainly contribute towards that.
I removed the hash test from the STM32U5A5ZJ. The application won't fit due to incorrect memory size for this chip noted here: embassy-rs/stm32-data#301
very nice, thank you! :)
f103 failure is unrelated (I bricked it refactoring RCC 😭 )
Adds STM32 Hash accelerator HAL to embassy-stm32. This includes support for the MD5, SHA1, SHA224, SHA256, SHA-512/224, SHA-512/256, and SHA-512 algorithms (depending on device). HMAC support is the only feature of this peripheral that is not yet implemented.
The API mimics the start/update/finish of other popular crypto libraries, so integration is theoretically straightforward. This includes use of the context switching feature where one digest calculation may be paused for another, then later resumed. This is particularly useful when digests are computed in a concurrent TX/RX scenario.
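To make the start/update/finish shape and the context-switching idea concrete, here is a toy sketch. It is not the driver: `ToyHasher` is hypothetical, and the 32-bit FNV-1a checksum is swapped in purely as a stand-in for the hardware algorithms so the example runs anywhere. The point is only that each `Context` carries its own saved state, so two digests can be interleaved on one hasher and each still produces the same result as an uninterrupted run.

```rust
/// Saved state of one in-progress digest. In hardware this would hold the
/// saved context registers; in this toy it is just the running FNV-1a value.
#[derive(Clone, Copy)]
pub struct Context {
    state: u32,
}

/// Hypothetical stand-in for the hash driver.
pub struct ToyHasher;

impl ToyHasher {
    /// Begins a new digest and returns its context.
    pub fn start(&mut self) -> Context {
        Context { state: 0x811c_9dc5 } // FNV-1a 32-bit offset basis
    }

    /// Feeds more data into the digest identified by `ctx`.
    pub fn update(&mut self, ctx: &mut Context, data: &[u8]) {
        for &byte in data {
            ctx.state ^= byte as u32;
            ctx.state = ctx.state.wrapping_mul(0x0100_0193); // FNV prime
        }
    }

    /// Writes the 4-byte result into `digest` and returns the length written.
    /// Panics if `digest` is shorter than 4 bytes.
    pub fn finish(&mut self, ctx: Context, digest: &mut [u8]) -> usize {
        digest[..4].copy_from_slice(&ctx.state.to_be_bytes());
        4
    }
}
```

In use, a caller might `start` two contexts, alternate `update` calls between them as TX and RX data arrive, and `finish` each one independently.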
An example for the STM32F7 series is included that demonstrates use and compares execution time of the SHA256 algorithm with the sha2 crate. The hardware calculation runs >3x faster than the sha2 crate's implementation in my tests.