-
AXI Stream

The AXI Stream loopback example available in VUnit's repository (https://github.com/VUnit/vunit/tree/master/examples/vhdl/array_axis_vcs/src) is an AXI Stream slave connected to an AXI Stream master through a FIFO. That can be used as a foundation for any streaming DSP processing, by placing custom logic before, after or in-between the FIFO; say, for instance, a CORDIC component (either pipelined or iterative, since the I/O are async/FIFO). A minimal sketch of that structure is shown below.

I believe that example is interesting because I've used it for didactic/demo purposes in other open source project documentation sites and in academia/research. For instance, in https://ghdl.github.io/ghdl-cosim/vhpidirect/examples/arrays.html#array-and-axi4-stream-verification-components it is modified to use direct cosimulation as an alternative to CSV files for sharing data with foreign functions/tools. For didactic purposes too, VHDL's fixed_generic_pkg is used. That is further related to dbhi/vboard. The same example was used in "DBHI: towards decoupled functional hardware-software co-design on SoCs", 28th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA 2020). It's the same architecture, but a Dynamic Binary Modification tool is used to replace a function call in a binary application (without access to the sources).

Yet, as you can see, all of that is related to simulation/cosimulation and to testing the accelerator itself through Verification Components. Synthesis is not covered/documented in any open source repo (yet).

When using MicroBlaze, having AXI Stream accelerators is quite nice. It used to have Fast Simplex Link (FSL) interfaces (https://www.xilinx.com/support/documentation/sw_manuals/mb_ref_guide.pdf), which were later replaced with Stream Link Interfaces (https://www.xilinx.com/support/documentation/sw_manuals/xilinx2018_2/ug984-vivado-microblaze-ref.pdf). The software framework is the same in both cases: specific ASM instructions for writing/reading to/from some specific registers, which are mapped to the hardware "Links", i.e. AXI Stream ports.

On Zynq, it is slightly uglier. There is no built-in AXI Stream interface between the hard ARM cores and the Programmable Logic (PL). Therefore, a DMA or an AXI-Lite to Stream bridge needs to be used. Not a huge deal, but a whole source of potential bugs and configuration issues when one just wants to test some software and some accelerator together.

NEORV32 has a Wishbone component, which is labeled as an "External bus of type Wishbone b4 or AXI4-Lite". I assume it is a single master port to be connected to some interconnect, in case multiple external peripherals are added. Therefore, the integration would need to be similar to the Zynq. However, since RISC-V is supposed to be easy to extend and open, and because NEORV32 is also open, I wonder if we can do better.

@stnolting, what do you think? Can we provide a configurable number of "Links" which 1) can be of type out only, in only, or "instantiate" one of each; 2) are mapped in the CPU memory; and 3) are usable through specific instructions? I guess it is a two-stage question: the first one is adding a configurable number of links; the second one is whether it's worth having specific instructions for that.

I must say I know nothing specific about RISC-V, and CPU architecture is not my field. Please excuse me if I'm making some stupid assumptions. For now, I'm not concerned about performance, but about having a solution which is as simple as technically possible.

It is meant for students to understand the whole system: CPU, interface and Core. That's why AXI-Lite to Stream bridges or DMAs are not desirable as the first use case. Those should be addressed after they are familiar with the most simple Stream/FIFO interface (which is otherwise pretty common in low-level RTL).
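For illustration, here is a minimal, untested sketch of the loopback idea with made-up entity/port names (the actual VUnit example uses a proper multi-entry FIFO): an AXI4-Stream slave port feeding an AXI4-Stream master port through a one-entry buffer. Custom processing, such as a CORDIC, would surround or replace the buffer. Note a depth-1 buffer only sustains half rate; a real FIFO keeps full throughput.

library ieee;
use ieee.std_logic_1164.all;

entity axis_loopback is
  generic (DATA_W : positive := 32);
  port (
    clk      : in  std_ulogic;
    rstn     : in  std_ulogic;
    -- slave (input) stream
    s_tdata  : in  std_ulogic_vector(DATA_W-1 downto 0);
    s_tvalid : in  std_ulogic;
    s_tready : out std_ulogic;
    -- master (output) stream
    m_tdata  : out std_ulogic_vector(DATA_W-1 downto 0);
    m_tvalid : out std_ulogic;
    m_tready : in  std_ulogic
  );
end entity;

architecture rtl of axis_loopback is
  signal full : std_ulogic;
  signal buf  : std_ulogic_vector(DATA_W-1 downto 0);
begin
  s_tready <= not full; -- accept only while the buffer is empty
  m_tvalid <= full;
  m_tdata  <= buf;

  process (clk)
  begin
    if rising_edge(clk) then
      if rstn = '0' then
        full <= '0';
      else
        if (s_tvalid and not full) = '1' then -- input handshake: capture one word
          buf  <= s_tdata;
          full <= '1';
        elsif m_tready = '1' then             -- output handshake: drain the word
          full <= '0';
        end if;
      end if;
    end if;
  end process;
end architecture;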
-
AXI-Lite

AXI-Lite is an obvious candidate for any non-complex addressable register/memory. In umarcor/SIEAV, there is some content I use for teaching cosimulation and testing/verification with VHDL and open source tooling (GHDL, VUnit, Octave...). The system we use as a reference is a typical closed-loop control system with a controller, a plant, and drivers/actuators and capture/holds in-between. The design is complemented with an AXI Slave component, in order to modify the setpoint and/or the constants/parameters of the controller at runtime. Similarly to the AXI Stream example above, we use a Verification Component and cosimulation for testing the software-hardware interaction. A minimal sketch of such a register slave is shown below.

The motivation is to abstract away the specific implementation of the CPU and/or any other peripheral in the SoC, and focus on the "logic" of our application only, both the software and the Core. The actual implementation (synthesis) of the whole system is out of the scope of the course (for now). Hence, I didn't advance further.

For future courses (and for enhancing the learning resources about using VHDL with open source tooling), I would like to add a working example with a minimal synthesisable setup. However, I was missing some free and open source CPU written in VHDL, with a trivial build procedure (for someone used to hardware design in VHDL) and with a responsive maintainer 😄. As you might guess, NEORV32 is a nice candidate for showcasing how to go ahead with the system integration, synthesis and implementation.

I would like to prototype this on a Fomu, using GHDL, Yosys and nextpnr. As commented in some other issues, @tmeissner did already contribute a setup for UPduino v3.0 and I added the CI plumbing for it. The next step is to reconcile the structure with https://github.com/im-tomu/fomu-workshop/tree/master/hdl and/or https://github.com/dbhi/vboard/tree/main/vga. That is, having some common structure.

Moreover, for some reason I was not watching this repo, so I did not see the PRs that @LarsAsplund opened these last days. Since I'm already using VUnit in umarcor/SIEAV, I think I might add NEORV32 as a submodule and execute the VUnit tests in CI. I believe there is no explicit public example of that yet: two repos (maintained by different people), each submoduling another repo, both of them using VUnit.

@stnolting, where do you suggest me to start reading? What is the closest to "How to add an external AXI4-Lite peripheral to the Wishbone bus in NEORV32"?
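As a reference for the setpoint register mentioned above, here is a minimal, untested sketch of a single-register AXI4-Lite slave. All names are made up; WSTRB, address decoding and the BRESP/RRESP ports are omitted for brevity (a compliant peripheral must at least drive the responses with OKAY).

library ieee;
use ieee.std_logic_1164.all;

entity axil_setpoint is
  port (
    clk     : in  std_ulogic;
    rstn    : in  std_ulogic;
    -- write address/data/response channels
    awvalid : in  std_ulogic;
    awready : out std_ulogic;
    wvalid  : in  std_ulogic;
    wready  : out std_ulogic;
    wdata   : in  std_ulogic_vector(31 downto 0);
    bvalid  : out std_ulogic;
    bready  : in  std_ulogic;
    -- read address/data channels
    arvalid : in  std_ulogic;
    arready : out std_ulogic;
    rvalid  : out std_ulogic;
    rready  : in  std_ulogic;
    rdata   : out std_ulogic_vector(31 downto 0);
    -- register value seen by the controller
    setpoint_o : out std_ulogic_vector(31 downto 0)
  );
end entity;

architecture rtl of axil_setpoint is
  signal setpoint     : std_ulogic_vector(31 downto 0) := (others => '0');
  signal bpend, rpend : std_ulogic; -- pending write/read responses
begin
  -- accept a write only when address and data are presented together and
  -- no response is pending (simplest legal AXI4-Lite behaviour)
  awready <= awvalid and wvalid and not bpend;
  wready  <= awvalid and wvalid and not bpend;
  bvalid  <= bpend;
  arready <= not rpend;
  rvalid  <= rpend;
  rdata   <= setpoint;

  process (clk)
  begin
    if rising_edge(clk) then
      if rstn = '0' then
        bpend <= '0';
        rpend <= '0';
      else
        -- write: single register, so the address is ignored here
        if (awvalid and wvalid and not bpend) = '1' then
          setpoint <= wdata;
          bpend    <= '1';
        elsif (bpend and bready) = '1' then
          bpend <= '0';
        end if;
        -- read
        if (arvalid and not rpend) = '1' then
          rpend <= '1';
        elsif (rpend and rready) = '1' then
          rpend <= '0';
        end if;
      end if;
    end if;
  end process;

  setpoint_o <= setpoint;
end architecture;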
-
@umarcor
-
I am currently working on a "Stream Link Interface" that is compatible with the AXI4-Stream base protocol. The interface will support up to 8 independent RX and TX links - each link provides a configurable internal FIFO. This is what the top entity might look like:

-- Stream link interface --
SLINK_NUM_TX  : natural := 0; -- number of TX links (0..8)
SLINK_NUM_RX  : natural := 0; -- number of RX links (0..8)
SLINK_TX_FIFO : natural := 1; -- TX FIFO depth, has to be a power of two
SLINK_RX_FIFO : natural := 1; -- RX FIFO depth, has to be a power of two

-- TX stream interfaces (available if SLINK_NUM_TX > 0) --
slink_tx_dat_o : out sdata_8x32_t; -- output data
slink_tx_val_o : out std_ulogic_vector(7 downto 0); -- valid output
slink_tx_rdy_i : in  std_ulogic_vector(7 downto 0) := (others => '0'); -- ready to send

-- RX stream interfaces (available if SLINK_NUM_RX > 0) --
slink_rx_dat_i : in  sdata_8x32_t := (others => (others => '0')); -- input data
slink_rx_val_i : in  std_ulogic_vector(7 downto 0) := (others => '0'); -- valid input
slink_rx_rdy_o : out std_ulogic_vector(7 downto 0); -- ready to receive

The top signals always implement all 8 links, even if fewer links are configured by the generics (the remaining links are terminated internally, so no extra logic). Of course, one could constrain the simple […]

I am not sure about additional "tag" signals (like the optional AXI4-Stream TID/TDEST/TUSER sidebands).

I know a stream link is basically a simple FIFO interface that should not be too hard to verify even with a simple testbench. However, I would like to do some "stress tests" someday (like randomized traffic). I am looking through VUnit's streaming verification components (https://vunit.github.io/verification_components/vci.html#stream-master-vci) but I couldn't find any example setups so far. @LarsAsplund @umarcor do you have any hints? 😉
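For what it's worth, a minimal sketch of such a VUnit setup could look as follows. This is untested and the names (tb_slink_stream) are made up; it assumes a recent VUnit with the axi_stream master/slave verification components and the stream VCI (push_stream/check_stream). The master VC is wired straight into the slave VC here; the DUT (e.g. one SLINK TX/RX pair looped back) would sit in-between.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

library vunit_lib;
context vunit_lib.vunit_context;
context vunit_lib.com_context;
context vunit_lib.vc_context;

entity tb_slink_stream is
  generic (runner_cfg : string);
end entity;

architecture tb of tb_slink_stream is
  constant m_axis : axi_stream_master_t := new_axi_stream_master(data_length => 32);
  constant s_axis : axi_stream_slave_t  := new_axi_stream_slave(data_length => 32);

  signal clk    : std_logic := '0';
  signal tdata  : std_logic_vector(31 downto 0);
  signal tvalid : std_logic;
  signal tready : std_logic;
begin
  clk <= not clk after 5 ns;

  main : process
  begin
    test_runner_setup(runner, runner_cfg);
    -- queue 100 words on the master and check they arrive unmodified;
    -- push_stream is non-blocking, check_stream blocks until data arrives
    for i in 0 to 99 loop
      push_stream(net, as_stream(m_axis), std_logic_vector(to_unsigned(i, 32)));
      check_stream(net, as_stream(s_axis), std_logic_vector(to_unsigned(i, 32)));
    end loop;
    test_runner_cleanup(runner);
  end process;

  -- AXI4-Stream master VC: drives tdata/tvalid, observes tready
  vc_master : entity vunit_lib.axi_stream_master
    generic map (master => m_axis)
    port map (
      aclk   => clk,
      tvalid => tvalid,
      tready => tready,
      tdata  => tdata);

  -- AXI4-Stream slave VC: consumes the stream; the DUT would sit in-between
  vc_slave : entity vunit_lib.axi_stream_slave
    generic map (slave => s_axis)
    port map (
      aclk   => clk,
      tvalid => tvalid,
      tready => tready,
      tdata  => tdata);
end architecture;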
-
thank you
(original email from stnolting, Re: [stnolting/neorv32] How to add AXI-Lite and AXI Stream peripherals (Discussion #52))

How to realize AXI4-Full to AXI4-Stream conversion?
If you are using AMD then I can highly recommend the AXI Streaming FIFO IP module: https://www.xilinx.com/products/intellectual-property/axi_fifo.html#overview
For an open source version, this looks promising - but I haven't tested it myself.
Btw, English is the default language here. So please use deepl or any other translator. 😉
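To illustrate the core of such a conversion, independent of any vendor IP: a memory-mapped register on one side and an AXI4-Stream handshake on the other. Below is an untested sketch with made-up names (mm2stream, reg_we, ...); it assumes the AXI4 write channel has already been decoded into a plain register write strobe, and a real design would add a FIFO plus status/interrupt handling.

library ieee;
use ieee.std_logic_1164.all;

entity mm2stream is
  port (
    clk       : in  std_ulogic;
    rstn      : in  std_ulogic;
    -- decoded memory-mapped write access (one TX data register)
    reg_we    : in  std_ulogic;
    reg_wdata : in  std_ulogic_vector(31 downto 0);
    reg_busy  : out std_ulogic; -- software must poll this before writing again
    -- AXI4-Stream master side
    m_tdata   : out std_ulogic_vector(31 downto 0);
    m_tvalid  : out std_ulogic;
    m_tready  : in  std_ulogic
  );
end entity;

architecture rtl of mm2stream is
  signal valid : std_ulogic;
  signal data  : std_ulogic_vector(31 downto 0);
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if rstn = '0' then
        valid <= '0';
      else
        if reg_we = '1' then      -- register write starts a stream transfer
          data  <= reg_wdata;
          valid <= '1';
        elsif m_tready = '1' then -- handshake completes the transfer
          valid <= '0';
        end if;
      end if;
    end if;
  end process;

  m_tdata  <= data;
  m_tvalid <= valid;
  reg_busy <= valid;
end architecture;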
-
After getting familiar with the structure of the project, it's time to look into some practical use case. 🚀
My background is developing ad-hoc accelerators in VHDL for non-trivial DSP, such as machine-learning/image-processing or algebraic kernels. During testing and verification, those need to be complemented with some CPU, in order to move data and results between a workstation/laptop and the accelerator. I have used MicroBlaze and Zynq (ARM A9) for this. The interfaces are either AXI-Lite or AXI Stream, despite some cores using Wishbone internally. However, those CPUs are overkill for orchestration purposes alone, and the toolchains/frameworks provided by the vendor are not the most comfortable out there.
I'm willing to learn how to use the NEORV32 SoC. Although not all my designs are open source, I believe there are a few examples which can be useful enough for didactic purposes.