Reliability issues when doing memtests #39

Open
jeanthom opened this issue Jul 23, 2020 · 14 comments
Labels: bug, core, help wanted, phy

Comments

@jeanthom
Owner

We are currently running into a reliability issue with the memtests:

  1. The memtest fails whenever we introduce a delay between the write and the read (this looks like a refresh issue, which would be consistent with "Fix bank activation failure" #32)
  2. The memtest fails when we read/write too much data. Is this the same as 1, or an address slicer issue (or similar)?
  3. Even when we avoid 1 and 2, the memtest still does not pass reliably
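To make case 1 easier to reproduce, here is a minimal host-side sketch of a memtest with a configurable pause between the write pass and the read pass. Everything in it is an assumption made for illustration: the `write`/`read` callables stand in for whatever bus access is available (UARTBridge, firmware, simulation), and the `0xDEAF0000 | offset` pattern is only a guess at an address-derived test pattern, not necessarily what the real memtest writes.

```python
import time

def pattern(index):
    """Hypothetical address-derived test word: 0xDEAF in the top half, byte offset below."""
    return 0xDEAF0000 | ((index * 4) & 0xFFFF)

def memtest(write, read, base, words, delay_s=0.0):
    """Write `words` 32-bit words, wait `delay_s` seconds, then read them back.

    Returns a list of (address, expected, got) tuples for every mismatch.
    """
    for i in range(words):
        write(base + 4 * i, pattern(i))
    # A long enough pause here should expose missing or broken refresh.
    time.sleep(delay_s)
    errors = []
    for i in range(words):
        addr = base + 4 * i
        got = read(addr)
        if got != pattern(i):
            errors.append((addr, pattern(i), got))
    return errors

# Dict-backed stand-in for the real bus, so the sketch runs without hardware.
fake_ram = {}
errors = memtest(fake_ram.__setitem__, fake_ram.__getitem__,
                 base=0x10000000, words=64, delay_s=0.01)
```

Sweeping `delay_s` and `words` separately should help tell whether cases 1 and 2 are really the same failure.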
@jeanthom added the bug, help wanted, phy and core labels on Jul 23, 2020
@jeanthom
Owner Author

ce72afb improves item 1 above by a bit.

@jeanthom
Owner Author

Is this a UARTBridge reliability issue?

@jeanthom
Owner Author

jeanthom commented Jul 24, 2020

There seems to be a pattern:
[Image: screenshot from 2020-07-24 19-31-01]

@jeanthom
Owner Author

Looks like a PHY error. That "desynchronization" is reproducible with the simulation testbench. What bothers me is that I can get good memtests on real hardware (though not every time), and I can't figure out what the root cause is D:

@jeanthom
Owner Author

I ran some tests on a complete SoC (minimal Minerva system in soc.py). I get similar behavior to what I had with UARTBridge: https://gist.github.com/jeanthom/85c00ffc5402df95fcf4967ea806fe49

@jeanthom
Owner Author

jeanthom commented Jul 29, 2020

The glitches in the gist above are related to the "-retime" option. Without "-retime" I still get memtest failures, but without the odd bitflips. Glitches also appear when the retime option isn't enabled.

@jeanthom
Owner Author

jeanthom commented Jul 31, 2020

TODO:

  • Ensure pads are correctly assigned
    • (LiteDRAM) DQ pads were set to 75 Ohm termination
  • Make sure TimingSettings are correct

@jeanthom
Owner Author

Fixing #38 seems to improve the situation; however, in doing so I'm forced to use "-retime", which might introduce bugs.

@jeanthom
Owner Author

jeanthom commented Aug 3, 2020

Synthesis without retiming isn't that much better.

I noticed that on a normal memtest we get these values for rdly:

Firmware launched...
DRAM init... done
Auto calibrating... done
Auto calibration profile:p0 rdly:00000002 p1 rdly:00000002
DRAM test... 
done

When the test fails, we get different values for rdly:

Firmware launched...
DRAM init... done
Auto calibrating... done
Auto calibration profile:p0 rdly:00000003 p1 rdly:00000004
DRAM test... 
fail : *(0x1000000C) = DEFF00FF
fail : *(0x1000001C) = DEFF00FF
fail : *(0x1000002C) = DEFF00FF
fail : *(0x1000003C) = DEFF00FF
fail : *(0x1000004C) = DEFF00FF
fail : *(0x1000005C) = DEFF00FF
fail : *(0x1000006C) = DEFF00FF
fail : *(0x1000007C) = DEFF00FF
fail : *(0x1000008C) = DEFF00EF
fail : *(0x1000009C) = DEFF00FF
fail : *(0x100000AC) = DEFF00EF
Test canceled (more than 10 errors)
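To compare runs without eyeballing them, a small parser can pull the rdly values and the failing addresses out of the firmware output. This is a rough sketch: the regular expressions are written against the log lines shown above only, the rdly field is assumed to be hexadecimal (for the values seen here that makes no difference), and grouping by `address % 16` assumes a 4-word, 16-byte burst.

```python
import re
from collections import Counter

# Patterns derived from the log lines quoted in this thread.
RDLY_RE = re.compile(r"p(\d+) rdly:([0-9A-Fa-f]{8})")
FAIL_RE = re.compile(r"fail : \*\(0x([0-9A-Fa-f]{8})\) = ([0-9A-Fa-f]{8})")

def parse_log(text):
    """Return ({phase: rdly}, [(address, value_read), ...]) from a firmware log."""
    rdly = {int(phase): int(value, 16) for phase, value in RDLY_RE.findall(text)}
    fails = [(int(addr, 16), int(value, 16)) for addr, value in FAIL_RE.findall(text)]
    return rdly, fails

# Excerpt of the failing run above (only the first three failures).
LOG = """\
Auto calibration profile:p0 rdly:00000003 p1 rdly:00000004
DRAM test...
fail : *(0x1000000C) = DEFF00FF
fail : *(0x1000001C) = DEFF00FF
fail : *(0x1000002C) = DEFF00FF
"""

rdly, fails = parse_log(LOG)
# Byte offset of each failing word inside its assumed 16-byte burst.
offsets = Counter(addr % 16 for addr, _ in fails)
print(rdly, dict(offsets))
```

In this excerpt every failing address sits at offset 0xC, i.e. the last word of each assumed burst, which is the kind of regularity that is easy to miss in the raw dump.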

@jeanthom
Owner Author

jeanthom commented Aug 3, 2020

Actually we can also get errors with rdly=2 (the normal value for the ECPIX5):

Firmware launched...
DRAM init... done
Auto calibrating... done
Auto calibration profile:p0 rdly:00000002 p1 rdly:00000002
DRAM test... 
fail : *(0x10000000) = DEAF000C
fail : *(0x10000004) = DEAF0000
fail : *(0x10000008) = DEAF0004
fail : *(0x1000000C) = DEAF0008
fail : *(0x10000010) = DEAF001C
fail : *(0x10000014) = DEAF0010
fail : *(0x10000018) = DEAF0014
fail : *(0x1000001C) = DEAF0018
fail : *(0x10000020) = DEAF002C
fail : *(0x10000024) = DEAF0020
fail : *(0x10000028) = DEAF0024
Test canceled (more than 10 errors)
done
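The values in this log look like the right data in the wrong place. Assuming the test writes `0xDEAF0000 | byte_offset` at each address (an assumption, the pattern is inferred from the dump rather than taken from the firmware source) and that a burst is 4 words, the following sketch computes how far each word has moved inside its burst.

```python
BASE = 0x10000000
BURST_BYTES = 16   # assumed burst size: 4 words of 32 bits
WORD_BYTES = 4

# (address, value read back), copied from the log above.
FAILS = [
    (0x10000000, 0xDEAF000C), (0x10000004, 0xDEAF0000),
    (0x10000008, 0xDEAF0004), (0x1000000C, 0xDEAF0008),
    (0x10000010, 0xDEAF001C), (0x10000014, 0xDEAF0010),
    (0x10000018, 0xDEAF0014), (0x1000001C, 0xDEAF0018),
    (0x10000020, 0xDEAF002C), (0x10000024, 0xDEAF0020),
    (0x10000028, 0xDEAF0024),
]

def word_shift(addr, value):
    """How many words later (modulo the burst) the data shows up.

    Returns None if the value does not come from the same burst as `addr`.
    """
    read_off = addr - BASE
    source_off = value & 0xFFFF   # offset the word was presumably written at
    if read_off // BURST_BYTES != source_off // BURST_BYTES:
        return None
    return ((read_off - source_off) % BURST_BYTES) // WORD_BYTES

shifts = {word_shift(addr, value) for addr, value in FAILS}
print(shifts)
```

Under those assumptions every entry has the same shift of one word: each position returns the data written one word earlier, wrapping around inside the burst. That would point at read data being captured one beat late rather than at corrupted cells.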

@jeanthom
Owner Author

jeanthom commented Aug 6, 2020

Here's where it gets a bit funky: I do all my testing on two 85F ECPIX-5 dev boards. One is R01, the other is R02, but the RAM routing hasn't really changed between the two revisions so I don't expect different behaviour between the two.

On the R02, I can't get a single test to pass, and it always fails like this:

fail : *(0x10000000) = DEAF000C
fail : *(0x10000004) = DEAF0000
fail : *(0x10000008) = DEAF0004
fail : *(0x1000000C) = DEAF0008

On the R01, I can get it to work 50-60% of the time, but when it fails, it fails like this:

fail : *(0x1000000C) = DEFF00FF
fail : *(0x1000001C) = DEFF00FF
fail : *(0x1000002C) = DEFF00FF
fail : *(0x1000003C) = DEFF00FF

@jeanthom
Owner Author

jeanthom commented Aug 6, 2020

In a sane memtest:

Readclksel: 0 1 2 3 4 5 6 7
Burstdet:   0 1 1 1 0 0 1 1

In a buggy memtest:

Readclksel: 0 1 2 3 4 5 6 7
Burstdet:   0 0 0 0 0 1 1 1
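One way to compare the two scans is to look at where the widest window of valid burst detections sits. The sketch below picks the middle of the longest run of 1s; it is a generic window-centering heuristic written for illustration, not a description of what the calibration firmware actually does, and the function names are made up.

```python
def longest_window(burstdet):
    """Return (start, length) of the longest run of 1s, or (None, 0) if there is none."""
    best_start, best_len = None, 0
    start = None
    # The trailing 0 is a sentinel so a run ending at the last index is closed too.
    for i, bit in enumerate(list(burstdet) + [0]):
        if bit and start is None:
            start = i
        elif not bit and start is not None:
            if i - start > best_len:
                best_start, best_len = start, i - start
            start = None
    return best_start, best_len

def pick_readclksel(burstdet):
    """Pick the center of the longest valid window (lower middle on even lengths)."""
    start, length = longest_window(burstdet)
    if start is None:
        return None
    return start + (length - 1) // 2

SANE = [0, 1, 1, 1, 0, 0, 1, 1]
BUGGY = [0, 0, 0, 0, 0, 1, 1, 1]

print(pick_readclksel(SANE), pick_readclksel(BUGGY))
```

With this heuristic the sane scan centers on setting 2 and the buggy scan on setting 6, so the valid window has moved by several steps between the two runs.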

@jeanthom
Owner Author

jeanthom commented Aug 6, 2020

Looks like we are one clock cycle desynchronized... Why?

@jeanthom
Owner Author

jeanthom commented Aug 7, 2020

Taking a look at both p0 and p1 rdly:

Sane:

Rdly p0: 01110011
Rdly p1: 01110000

Non-functional: (results from p1 are garbage)

Rdly p0: 00000111
Rdly p1: 00000111
