CPU is now plus another 0.5 MIPS at 14.5 MIPS
This commit significantly improves the timing behavior compared to the last commit (0.206 ns slack, 9.268 total delay) at identical MIPS. It also includes the documentation, the adjusted emulator, and the adjusted Q-TRIS.
sy2002 committed Nov 8, 2020
1 parent 65da481 commit 83aed0d
Showing 7 changed files with 79 additions and 48 deletions.
20 changes: 10 additions & 10 deletions demos/q-tris.asm
@@ -402,34 +402,34 @@ GAME_WON .EQU 10 ; game is won, when "Level 10" is reached
; speed is defined by wasted cycles, both numbers are multiplied
; second number is also used for blinking frequency, so adjust carefully
; (preferably only adjust the first number)
Level_Speed .DW 946, 290 ; Level 1 (was 800, at ISA V1.21,
Level_Speed .DW 946, 301 ; Level 1 (was 800, at ISA V1.21,
; was 946, 251 until ISA V1.5
; was 946, 269 until ISA V1.6)
.DW 827, 290 ; Level 2 (was 700 at ISA V1.21
.DW 827, 301 ; Level 2 (was 700 at ISA V1.21
; was 827, 251 until ISA V1.5
; was 827, 269 until ISA V1.6)
.DW 709, 290 ; Level 3 (was 600 at ISA V1.21
.DW 709, 301 ; Level 3 (was 600 at ISA V1.21
; was 709, 251 until ISA V1.5
; was 709, 269 until ISA V1.6)
.DW 591, 290 ; Level 4 (was 500 at ISA V1.21
.DW 591, 301 ; Level 4 (was 500 at ISA V1.21
; was 591, 251 until ISA V1.5
; was 591, 269 until ISA V1.6)
.DW 532, 290 ; Level 5 (was 450 at ISA V1.21
.DW 532, 301 ; Level 5 (was 450 at ISA V1.21
; was 532, 251 until ISA V1.5
; was 532, 269 until ISA V1.6)
.DW 473, 290 ; Level 6 (was 400 at ISA V1.21
.DW 473, 301 ; Level 6 (was 400 at ISA V1.21
; was 473, 251 until ISA V1.5
; was 473, 269 until ISA V1.6)
.DW 414, 290 ; Level 7 (was 350 at ISA V1.21
.DW 414, 301 ; Level 7 (was 350 at ISA V1.21
; was 414, 251 until ISA V1.5
; was 414, 269 until ISA V1.6)
.DW 355, 290 ; Level 8 (was 300 at ISA V1.21
.DW 355, 301 ; Level 8 (was 300 at ISA V1.21
; was 355, 251 until ISA V1.5
; was 355, 269 until ISA V1.6)
.DW 296, 290 ; Level 9 (was 250 at ISA V1.21
.DW 296, 301 ; Level 9 (was 250 at ISA V1.21
; was 296, 251 until ISA V1.5
; was 296, 269 until ISA V1.6)
.DW 296, 290 ; non existing "Level 10" => Game Won
.DW 296, 301 ; non existing "Level 10" => Game Won
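
The second number of every `Level_Speed` pair moves from 290 to 301 in this commit. The commit does not state how 301 was derived. One plausible reading, sketched here in Python purely as an assumption, is that the busy-wait count is scaled by the CPU speed-up (14.0 to 14.5 MIPS) and rounded up, so that each level keeps roughly the same wall-clock speed. The helper name `rescale_delay` is hypothetical and does not exist in the repository.

```python
import math


def rescale_delay(old_count: int, old_mips: float, new_mips: float) -> int:
    """Scale a busy-wait count so the delay takes about the same wall-clock time.

    Assumption: the wait loop gets faster in proportion to the MIPS figure.
    """
    return math.ceil(old_count * new_mips / old_mips)


# Second number of the Level_Speed pairs (also drives the blinking frequency).
new_second = rescale_delay(290, old_mips=14.0, new_mips=14.5)

# The source comment says both numbers are multiplied, so Level 1 wastes
# first * second loop iterations per step.
level1_wasted = 946 * new_second

print(new_second, level1_wasted)
```

Under this assumption the first number of each pair can stay untouched, which matches the advice in the source comment to preferably adjust only one of the two numbers at a time.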

; ****************************************************************************
; HANDLE_END
38 changes: 33 additions & 5 deletions doc/MIPS.md
@@ -15,21 +15,49 @@ QNICE-FPGA Performance Characteristics
workload that is being executed.

* Peak performance has been measured in a math-heavy demo such as the
mandelbrot demo: 14.84 MIPS. Q-TRIS runs at 13.97 MIPS.
mandelbrot demo: 15.13 MIPS. Q-TRIS runs at 14.47 MIPS.

* For the sake of the VGA and WASM emulator, the average QNICE performance
is defined as **14.0 MIPS**.
is defined as **14.5 MIPS**.

MIPS measurements on November, 7 2020 using CPU V1.7
MIPS measurements on November, 8 2020 using CPU V1.7
----------------------------------------------------

This CPU version implements INCRB and DECRB in one CPU cycle.

### Mandelbrot: 15.13 MIPS

```
0000 007E 6F71 = 8,286,065 cycles => 0.1657 sec
0000 0026 411E = 2,507,038 instructions => 3.31 cycles / instruction
=> 15.13 MIPS
```

* Compared with V1.6, this is an 11% speed-up.
* Compared with the version from November, 7 this is a 2% speed-up.

### Q-TRIS: 14.47 MIPS

```
0009 7F52 956C = 40,790,824,300 cycles => 815.82 sec => 13:36 min
0002 BF97 A906 = 11,804,322,054 instructions => 3.46 cycles / instruction
=> 14.47 MIPS
=> 14.50 MIPS for Emulator
```

* Compared with V1.6, this is a 12% speed-up.
* Compared with the version from November, 7 this is a 4% speed-up.
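
The figures in the two measurement blocks above can be reproduced from the raw hexadecimal counters. The following Python sketch assumes the 50 MHz clock mentioned elsewhere in this commit; the function names `parse_counter` and `measure` are illustrative and not part of the QNICE tooling.

```python
CLOCK_HZ = 50_000_000  # QNICE-FPGA clock: 50 MHz


def parse_counter(words: str) -> int:
    """Convert a counter printed as 16-bit hex words ('0000 007E 6F71') to an int."""
    return int(words.replace(" ", ""), 16)


def measure(cycles_hex: str, instructions_hex: str) -> dict:
    """Derive run time, cycles per instruction and MIPS from the two counters."""
    cycles = parse_counter(cycles_hex)
    instructions = parse_counter(instructions_hex)
    seconds = cycles / CLOCK_HZ
    return {
        "cycles": cycles,
        "instructions": instructions,
        "seconds": seconds,
        "cpi": cycles / instructions,
        "mips": instructions / seconds / 1e6,
    }


mandelbrot = measure("0000 007E 6F71", "0000 0026 411E")
qtris = measure("0009 7F52 956C", "0002 BF97 A906")

for name, m in (("Mandelbrot", mandelbrot), ("Q-TRIS", qtris)):
    print(f"{name}: {m['seconds']:.4f} s, {m['cpi']:.2f} cycles/instr, {m['mips']:.2f} MIPS")
```

The speed-up percentages follow the same way, for example `15.13 / 14.84` for Mandelbrot and `14.47 / 13.97` for Q-TRIS relative to the November 7 measurements.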

MIPS measurements on November, 7 2020 using CPU in commit #XxXxXxX
------------------------------------------------------------------

The refactored CPU, which also implements the new ISA V1.7, has an optimization:
branches that are not taken need only 1 CPU cycle.

### Mandelbrot: 14.84 MIPS

```
0000 0080 E54F = 8,447,311 cycles => 0,1689 sec
0000 0080 E54F = 8,447,311 cycles => 0.1689 sec
0000 0026 411E = 2,507,038 instructions => 3.37 cycles / instruction
=> 14.84 MIPS
```
@@ -41,7 +69,7 @@ that branches, that are not taken only need 1 CPU cycle.

```
0007 3717 1ACB = 30,989,032,139 cycles => 619.78 sec => 10:20 min
= 8,649,514,605 instructions => 3.58 cycles / instruction
0002 038D 1E6D = 8,649,514,605 instructions => 3.58 cycles / instruction
=> 13.97 MIPS
=> 14.00 MIPS for Emulator
```
6 changes: 3 additions & 3 deletions emulator/README.md
@@ -118,7 +118,7 @@ Getting Started
`R` and then `qbin/q-tris.out` to run Q-TRIS.

* Go to the graphical window, press `SPACE` and start playing. The emulator
automatically attempts to regulate the speed to `14.0 MIPS`, which is
automatically attempts to regulate the speed to `14.5 MIPS`, which is
the speed of the original QNICE-FPGA hardware running at 50 MHz. Toggle
the MIPS and FPS display using `ALT+F`.

@@ -214,7 +214,7 @@ WebAssembly/WebGL in a Web Browser: Emulation of the VGA Screen and the PS/2 Key

* In contrast to the native qnice-vga version of the emulator, the
WebAssembly/WebGL version is not capable of automatically regulating the
speed to match the `14.0 MIPS` of the FPGA hardware that runs at 50 MHz.
speed to match the `14.5 MIPS` of the FPGA hardware that runs at 50 MHz.
Probably your computer will be faster, so you will need to slow down
the emulation speed.

@@ -326,7 +326,7 @@ Emulator Architecture
`int vga_timebase_thread(...)` in `vga.c` is updating the global variable
`gbl$sdl_ticks` every millisecond.

* The original QNICE-FPGA hardware performs `14.0 MIPS` while running at
* The original QNICE-FPGA hardware performs `14.5 MIPS` while running at
50 MHz. Most modern systems will emulate QNICE-FPGA much faster. The
speed regulation mechanism is implemented in the function `void run()`
by calculating how many QNICE instructions per 10 milliseconds need to
10 changes: 5 additions & 5 deletions emulator/qnice.c
@@ -215,20 +215,20 @@ unsigned long gbl$instructions_per_iteration = gbl$ipi_default;
#endif

/* According to ../doc/MIPS.md, the current QNICE hardware,
which runs at 50 MHz performs at 14.0 MIPS.
which runs at 50 MHz performs at 14.5 MIPS.
The speed regulation occurs every 10 milliseconds, which is why we introduce
a "target instructions per 10-milliseconds" measurement using gbl$target_iptms.
Since there is system jitter, the gbl$target_iptms needs to be adjusted,
which is done by sampling the actual MIPS every 3 seconds and calculating
the gbl$target_iptms_adjustment_factor in the thread "mips_adjustment_thread"
Linux gcc does not allow gbl$qnice_mips to be used within gbl$target_mips and
gbl$target_iptms, therefore the value 13.21 is repeated.
gbl$target_iptms, therefore the value 14.5 is repeated.
*/
const float gbl$qnice_mips = 14.0;
const float gbl$qnice_mips = 14.5;
const float gbl$max_mips = INFINITY;
float gbl$target_mips = 14.0;
unsigned long gbl$target_iptms = ((14.0 * 1e6) / 1e3) * 10;
float gbl$target_mips = 14.5;
unsigned long gbl$target_iptms = ((14.5 * 1e6) / 1e3) * 10;
float gbl$target_iptms_adjustment_factor = 1.0;
const unsigned long gbl$target_sampling_s = 3;
bool mips_adjustment_thread_running = false;
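
The constants above turn a MIPS figure into an instruction budget per 10-millisecond regulation slice. A small Python sketch of that arithmetic follows. The function `target_iptms` is a hypothetical helper named after `gbl$target_iptms`; the actual regulation loop in `run()` and the jitter adjustment factor are not part of this diff and are not modeled here.

```python
def target_iptms(mips: float, slice_ms: int = 10) -> int:
    """Instructions to execute per regulation slice.

    Mirrors the C initializer ((mips * 1e6) / 1e3) * 10:
    instructions per second -> per millisecond -> per 10 ms slice.
    """
    return round(mips * 1e6 / 1e3 * slice_ms)


old_budget = target_iptms(14.0)  # before this commit
new_budget = target_iptms(14.5)  # after this commit

print(old_budget, new_budget)
```

Because the literal has to be repeated in three places in the C source, it is easy for a comment or an initializer to fall out of sync with `gbl$qnice_mips` when the value changes.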
1 change: 0 additions & 1 deletion vhdl/block_rom.vhd
@@ -64,7 +64,6 @@ signal brom : brom_t := read_romfile(FILE_NAME);

begin

-- process for read and write operation on the rising clock edge
rom_read : process (clk)
begin
if falling_edge(clk) then
9 changes: 4 additions & 5 deletions vhdl/block_rom_ise.vhd
@@ -33,7 +33,7 @@ signal output : std_logic_vector(ROM_WIDTH - 1 downto 0);
constant C_LINES : natural := 8*1024;

type brom_t is array (0 to C_LINES - 1) of bit_vector(ROM_WIDTH - 1 downto 0);


impure function read_romfile(rom_file_name : in string) return brom_t is
file rom_file : text is in rom_file_name;
variable line_v : line;
@@ -46,13 +46,12 @@ begin
end if;
end loop;
return rom_v;
end function;

signal brom : brom_t := read_romfile(FILE_NAME);
end function;

signal brom : brom_t := read_romfile(FILE_NAME);

begin

-- process for read and write operation on the rising clock edge
rom_read : process (clk)
begin
if falling_edge(clk) then
43 changes: 24 additions & 19 deletions vhdl/qnice_cpu.vhd
@@ -94,6 +94,7 @@ signal reg_shadow_spr : std_logic; --combinatorial signal to control the sh
-- direct access to the special registers within the register bank
signal SP : std_logic_vector(15 downto 0); -- stack pointer (R13)
signal SR : std_logic_vector(15 downto 0); -- status register (R14)
signal SR_tbw : std_logic_vector(15 downto 0); -- for being able to use the SR already in cs_fetch
signal PC : std_logic_vector(15 downto 0); -- program counter (R15)
signal PC_Org : std_logic_vector(15 downto 0); -- shadow register copy of PC

@@ -244,6 +245,13 @@ begin
ADDR <= ADDR_Bus;
HALT <= '1' when cpu_state = cs_halt else '0';

-- For being able to use the SR already in cs_fetch, where the SR-write-back might still be in progress:
-- In case of a MOVE to SR, the new value has not arrived in SR, yet, so we need to take it from
-- reg_write_data. And if it was a MOVE 0, SR, we need to make sure, that bit 0 is always 1, otherwise
-- for example a "RSUB XYZ, 1" directly after a "MOVE 0, SR" would fail.
SR_tbw <= SR(15 downto 1) & "1" when cpu_state /= cs_fetch or reg_write_en = '0' or reg_write_addr /= regSR
else reg_write_data(15 downto 1) & "1";

-- Fastpath: When we have a direct register to register operation or a branch, where the address is in a register
-- In cs_fetch, the Instruction register is not set, yet, so we need to listen to the data bus
FastPath <= true when (cpu_state = cs_fetch and diOpcode /= opcCTRL and (
@@ -255,7 +263,7 @@ begin
(Opcode /= opcBRA and Src_Mode = amDirect and Dst_Mode = amDirect) or
(Opcode = opcBRA and Src_Mode = amDirect))
)
else false;
else false;
Src_Value_Fast <= reg_read_data1 when FastPath and cpu_state = cs_execute else Src_Value;
Dst_Value_Fast <= reg_read_data2 when FastPath and cpu_state = cs_execute else Dst_Value;

@@ -375,7 +383,7 @@ begin
end if;
end process;

fsm_output_decode : process (cpu_state, ADDR_Bus, SP, SR, PC, PC_org,
fsm_output_decode : process (cpu_state, ADDR_Bus, SP, SR, SR_tbw, PC, PC_org,
DATA_IN, DATA_To_Bus, WAIT_FOR_DATA, INT_N, Int_Active,
Instruction, Opcode, diOpcode, Ctrl_Cmd, diCtrl_Cmd, FastPath,
Src_RegNo, diSrc_RegNo, Src_Mode, diSrc_Mode, Src_Value,
@@ -391,7 +399,6 @@ begin
variable var_C : std_logic;
variable var_V : std_logic;
variable var_X : std_logic;
variable var_SR_tbw : std_logic_vector(15 downto 0);

procedure writeReg(signal dstreg : in std_logic_vector(3 downto 0);
signal value : in std_logic_vector(15 downto 0);
@@ -475,21 +482,12 @@ begin
fsmPC <= PC + 1;
fsm_reg_read_addr1 <= diSrc_RegNo; -- read Src register number
fsm_reg_read_addr2 <= diDst_RegNo; -- read Dst register number

-- In case of a MOVE to SR, the new value has not arrived in SR, yet, so we need to take it from
-- reg_write_data. And if it was a MOVE 0, SR, we need to make sure, that bit 0 is always 1, otherwise
-- for example a "RSUB XYZ, 1" directly after a "MOVE 0, SR" would fail.
if reg_write_en = '1' and reg_write_addr = regSR then
var_SR_tbw := reg_write_data(15 downto 1) & "1";
else
var_SR_tbw := SR(15 downto 1) & "1";
end if;


-- if a branch is meant to be executed but the branch will *not* be taken, then return directly to cs_fetch
-- this is on the one hand an optimization (speed increase) and on the other hand this implements
-- the new ISA of V1.7 where predec and postinc of registers inside branches are only performed,
-- if the branching condition holds (i.e. the branch would have been taken)
if diOpcode = opcBRA and var_SR_tbw(conv_integer(diBra_Condition)) = diBra_Neg then
if diOpcode = opcBRA and SR_tbw(conv_integer(diBra_Condition)) = diBra_Neg then
-- special treatment for postincremented R15/PC: the postincrement is always executed, even if
-- the branch is not taken; this is an exception due to constant addresses being implemented
-- as something like ABRA @R15++, Z
@@ -507,17 +505,24 @@ begin
elsif FastPath then
fsmNextCpuState <= cs_execute;

-- INCRB and DECRB can be done in one cycle
-- INCRB and DECRB can be done in one cycle
-- We are using the more indirect fsm_reg_* way of storing SR, because these registers
-- are sitting physically closer inside the CPU in contrast to SR (via fsmSR). This leads
-- to significantly better timing behavior
elsif diOpcode = opcCTRL and (diCtrl_Cmd = ctrlINCRB or diCtrl_Cmd = ctrlDECRB) then
fsmNextCpuState <= cs_fetch;
fsmCPUAddr <= PC + 1;

if diCtrl_Cmd = ctrlINCRB then
-- increment the register bank address by one and leave the SR alone while doing so
fsmSR(15 downto 8) <= var_SR_tbw(15 downto 8) + 1;
-- increment the register bank address by one and leave the SR alone while doing so
fsm_reg_write_addr <= regSR;
fsm_reg_write_data <= (SR_tbw(15 downto 8) + 1) & SR_tbw(7 downto 0);
fsm_reg_write_en <= '1';
else
-- decrement the register bank address by one and leave the SR alone while doing so
fsmSR(15 downto 8) <= var_SR_tbw(15 downto 8) - 1;
-- decrement the register bank address by one and leave the SR alone while doing so
fsm_reg_write_addr <= regSR;
fsm_reg_write_data <= (SR_tbw(15 downto 8) - 1) & SR_tbw(7 downto 0);
fsm_reg_write_en <= '1';
end if;
end if;
end if;
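
The `SR_tbw` forwarding and the single-cycle INCRB/DECRB handling in this file can be modeled in software to check the bit manipulation. The Python sketch below is a behavioral approximation under stated assumptions: the function names are hypothetical, signals are plain integers, and 8-bit wraparound of the register bank byte is assumed to match the VHDL vector addition. It is not a replacement for simulating the actual design.

```python
REG_SR = 14  # the status register is R14


def sr_tbw(sr, in_fetch, reg_write_en, reg_write_addr, reg_write_data):
    """SR value usable in cs_fetch while an SR write-back may still be pending.

    If a write to SR is in flight during fetch, take the value from the write
    data instead of SR. Bit 0 is forced to 1 in either case.
    """
    pending = in_fetch and reg_write_en and reg_write_addr == REG_SR
    source = reg_write_data if pending else sr
    return (source & 0xFFFE) | 0x0001


def step_register_bank(sr_value, delta):
    """INCRB (delta=+1) / DECRB (delta=-1): change the upper byte, keep the lower byte."""
    bank = ((sr_value >> 8) + delta) & 0xFF  # assumed 8-bit wraparound
    return (bank << 8) | (sr_value & 0x00FF)


def branch_not_taken(sr_value, condition_bit, negate):
    """A branch is skipped when the selected SR bit equals the negate flag."""
    return ((sr_value >> condition_bit) & 1) == negate


# "MOVE 0, SR" is still being written back while the next instruction is fetched.
forwarded = sr_tbw(sr=0x0305, in_fetch=True, reg_write_en=True,
                   reg_write_addr=REG_SR, reg_write_data=0x0000)
print(hex(forwarded), hex(step_register_bank(forwarded, +1)),
      hex(step_register_bank(forwarded, -1)))
```

Forcing bit 0 to 1 is what keeps an unconditional branch (condition bit 0, not negated) working directly after a `MOVE 0, SR`, which is the case the VHDL comment calls out.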