Reliability issues when doing memtests #39

jeanthom · 2020-07-23T13:38:20Z

We are currently running into a reliability issue with the memtests:

1. The memtest fails whenever we introduce a delay between the write and the read (this looks like a refresh issue, which would be consistent with "Fix bank activation failure" #32).
2. The memtest fails when we read/write too much data => same as 1? Or an address slicer issue (or similar)?
3. Even when we avoid 1 and 2, we sometimes struggle to get a successful memtest.

jeanthom · 2020-07-24T13:58:12Z

ce72afb slightly improves the first problem listed above (failure after a delay between the write and the read).

jeanthom · 2020-07-24T17:01:37Z

Is this a UARTBridge reliability issue?

jeanthom · 2020-07-24T17:34:29Z

There seems to be a pattern:

jeanthom · 2020-07-28T11:17:20Z

Looks like a PHY error. That "desynchronization" is reproducible with the simulation testbench. What bothers me is that I can get good memtests on real hardware (not all the time), and I can't figure out what the root cause is D:

jeanthom · 2020-07-29T16:12:42Z

I ran some tests on a complete SoC (minimal Minerva system in soc.py). I get similar behavior to what I had with UARTBridge: https://gist.github.com/jeanthom/85c00ffc5402df95fcf4967ea806fe49

jeanthom · 2020-07-29T16:34:43Z

~~The glitches in the gist above are related to the "-retime" option. Without "-retime" I still get memtest failure, but without the odd bitflips.~~ Glitches also appear when the retime option isn't enabled.

jeanthom · 2020-07-31T09:49:53Z

TODO:

- Ensure pads are correctly assigned
  - (LiteDRAM) DQ were set to 75 Ohm termination
- Make sure TimingSettings are correct

jeanthom · 2020-07-31T11:53:01Z

Fixing #38 seems to improve the situation; however, in doing so I'm forced to use "-retime", which might introduce bugs.

jeanthom · 2020-08-03T11:09:44Z

Synthesis without retiming isn't that much better.

I noticed that on a normal memtest we get those values for rdly:

Firmware launched...
DRAM init... done
Auto calibrating... done
Auto calibration profile:p0 rdly:00000002 p1 rdly:00000002
DRAM test... 
done

When the test fails, we get different values for rdly:

Firmware launched...
DRAM init... done
Auto calibrating... done
Auto calibration profile:p0 rdly:00000003 p1 rdly:00000004
DRAM test... 
fail : *(0x1000000C) = DEFF00FF
fail : *(0x1000001C) = DEFF00FF
fail : *(0x1000002C) = DEFF00FF
fail : *(0x1000003C) = DEFF00FF
fail : *(0x1000004C) = DEFF00FF
fail : *(0x1000005C) = DEFF00FF
fail : *(0x1000006C) = DEFF00FF
fail : *(0x1000007C) = DEFF00FF
fail : *(0x1000008C) = DEFF00EF
fail : *(0x1000009C) = DEFF00FF
fail : *(0x100000AC) = DEFF00EF
Test canceled (more than 10 errors)
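
To compare many boot logs quickly, the rdly values can be pulled out of the calibration line with a small script. This is only a sketch: the regular expression is written against the log format shown above, the function names are made up for illustration, and the default of 2 is simply the value seen on the passing run. A deviation is a hint worth looking at, not proof that the memtest will fail.

```python
import re

# Matches the "pN rdly:XXXXXXXX" fields as printed in the logs above.
RDLY_FIELD = re.compile(r"p(\d+)\s+rdly:([0-9A-Fa-f]{8})")

def parse_rdly(log_text):
    """Return {phase: rdly} for every rdly field found in log_text."""
    return {int(p): int(v, 16) for p, v in RDLY_FIELD.findall(log_text)}

def unexpected_rdly(log_text, expected=2):
    """Return the phases whose rdly differs from `expected`.

    `expected` defaults to 2 only because that is what the passing
    run above reported; it is an assumption, not a spec value.
    """
    return {p: v for p, v in parse_rdly(log_text).items() if v != expected}

good = "Auto calibration profile:p0 rdly:00000002 p1 rdly:00000002"
bad = "Auto calibration profile:p0 rdly:00000003 p1 rdly:00000004"

print(unexpected_rdly(good))  # {}
print(unexpected_rdly(bad))   # {0: 3, 1: 4}
```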

jeanthom · 2020-08-03T11:54:04Z

Actually, we can also get errors with rdly=2 (the normal value for the ECPIX-5):

Firmware launched...
DRAM init... done
Auto calibrating... done
Auto calibration profile:p0 rdly:00000002 p1 rdly:00000002
DRAM test... 
fail : *(0x10000000) = DEAF000C
fail : *(0x10000004) = DEAF0000
fail : *(0x10000008) = DEAF0004
fail : *(0x1000000C) = DEAF0008
fail : *(0x10000010) = DEAF001C
fail : *(0x10000014) = DEAF0010
fail : *(0x10000018) = DEAF0014
fail : *(0x1000001C) = DEAF0018
fail : *(0x10000020) = DEAF002C
fail : *(0x10000024) = DEAF0020
fail : *(0x10000028) = DEAF0024
Test canceled (more than 10 errors)
done
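
The values in this log look structured rather than random. A quick way to check is the sketch below, which assumes (this is inferred from the log, not confirmed from the firmware source) that the test writes 0xDEAF0000 plus the byte offset at each address, and that words are grouped four at a time. Under those assumptions it computes, for each failing read, which offset the returned word would have belonged to.

```python
BASE = 0x10000000     # start of the tested region, as printed in the log
PATTERN = 0xDEAF0000  # assumption: expected word = PATTERN + byte offset
GROUP_BYTES = 16      # assumption: four 32-bit words per group

# (address, value read back) pairs copied from the failing log above.
failures = [
    (0x10000000, 0xDEAF000C), (0x10000004, 0xDEAF0000),
    (0x10000008, 0xDEAF0004), (0x1000000C, 0xDEAF0008),
    (0x10000010, 0xDEAF001C), (0x10000014, 0xDEAF0010),
    (0x10000018, 0xDEAF0014), (0x1000001C, 0xDEAF0018),
    (0x10000020, 0xDEAF002C), (0x10000024, 0xDEAF0020),
    (0x10000028, 0xDEAF0024),
]

shifts = set()
same_group = True
for addr, value in failures:
    read_off = addr - BASE     # offset that was read
    src_off = value - PATTERN  # offset whose expected word came back
    # Do both offsets fall inside the same 16-byte group?
    same_group &= (read_off // GROUP_BYTES) == (src_off // GROUP_BYTES)
    # Displacement within the group, in 32-bit words, with wrap-around.
    shifts.add(((read_off - src_off) % GROUP_BYTES) // 4)

print(shifts, same_group)  # {1} True
```

With these assumptions every failing read returns the word that belongs one position earlier in the same four-word group, wrapping around at the group boundary. In other words the data appears intact but rotated by one word.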

jeanthom · 2020-08-06T15:35:47Z

Here's where it gets a bit funky: I do all my testing on two 85F ECPIX-5 dev boards. One is R01, the other is R02, but the RAM routing hasn't really changed between the two revisions so I don't expect different behaviour between the two.

On the R02, I can't get a single test to pass, and it always fails like this:

fail : *(0x10000000) = DEAF000C
fail : *(0x10000004) = DEAF0000
fail : *(0x10000008) = DEAF0004
fail : *(0x1000000C) = DEAF0008

On the R01, I can get it to work 50-60% of the time, but when it fails, it fails like this:

fail : *(0x1000000C) = DEFF00FF
fail : *(0x1000001C) = DEFF00FF
fail : *(0x1000002C) = DEFF00FF
fail : *(0x1000003C) = DEFF00FF

jeanthom · 2020-08-06T17:59:27Z

In a sane memtest:

Readclksel: 0 1 2 3 4 5 6 7
Burstdet:   0 1 1 1 0 0 1 1

In a buggy memtest:

Readclksel: 0 1 2 3 4 5 6 7
Burstdet:   0 0 0 0 0 1 1 1
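
To make the difference between the two scans easier to see, the runs of Readclksel values for which Burstdet is set can be extracted. The helper below is a minimal sketch (the name `windows` is made up for illustration), and it only restates the two tables above in a different form.

```python
def windows(bits):
    """Return (first, last) index pairs for each run of consecutive 1s."""
    runs, start = [], None
    for i, b in enumerate(bits):
        if b and start is None:
            start = i                    # a run begins here
        elif not b and start is not None:
            runs.append((start, i - 1))  # the run ended on the previous index
            start = None
    if start is not None:                # a run reaches the end of the scan
        runs.append((start, len(bits) - 1))
    return runs

# Burstdet for Readclksel 0..7, copied from the two tables above.
sane = [0, 1, 1, 1, 0, 0, 1, 1]
buggy = [0, 0, 0, 0, 0, 1, 1, 1]

print(windows(sane))   # [(1, 3), (6, 7)]
print(windows(buggy))  # [(5, 7)]
```

In the sane scan the first window opens at Readclksel 1, while in the buggy scan nothing is detected before Readclksel 5.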

jeanthom · 2020-08-06T18:05:15Z

Looks like we are one clock cycle desynchronized... Why?

jeanthom · 2020-08-07T09:52:55Z

Taking a look at both p0 and p1 rdly:

Sane:

p0 rdly: 01110011
p1 rdly: 01110000

Non-functional: (results from p1 are garbage)

p0 rdly: 00000111
p1 rdly: 00000111

jeanthom added the "bug", "help wanted", "phy" and "core" labels on Jul 23, 2020
