
thread failed with: thread failed with: small_delta >= 256 #979

Open
ALardu opened this issue Dec 7, 2021 · 6 comments
@ALardu commented Dec 7, 2021

Debian 11 without GUI + Mad Max Chia Plotter without GUI, failing with: what(): thread failed with: thread failed with: small_delta

cd chia-plotter
sudo ./build/chia_plot -n 1 -r 30 -u 128 -t /mnt/nvme/ -d /mnt/ssd/share/ -c xch1xxxxx -f 8bebxxxxx
NVMe size: 943 GB
SSD size: 235 GB
RAM: 64 GB DDR4
CPU: 2× Xeon E5-2667 v3

Problem:
During P1-P2, Mad Max fills the entire 943 GB of /mnt/nvme/, and at the start of phase 3 (P3) the run is aborted with an error:

Wrote plot header with 252 bytes
[P3-1] Table 2 took 112.406 sec, wrote 3429424212 right entries
terminate called after throwing an instance of 'std::runtime_error'
what(): thread failed with: thread failed with: small_delta >= 256 (34292629513)
Aborted
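
For reference, a quick sanity check (a sketch, not part of the original report; the paths are taken from the command above, and the ~220 GiB figure is the temp-space requirement the Mad Max README states for a k32 plot):

# check free space on the temp and destination drives;
# a k32 plot needs roughly 220 GiB of -t space with Mad Max,
# so a 943 GB temp disk filling up completely is not normal usage
df -h /mnt/nvme /mnt/ssd/share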

RAM memtest: PERFECT
NVMe health: PERFECT

Help me please!!!

@ALardu commented Dec 7, 2021

(screenshot)

@ALardu commented Dec 7, 2021

(screenshot)

@ALardu commented Dec 8, 2021

(screenshot)

@bladeuserpi commented:

The same error message occurred here:

Multi-threaded pipelined Chia k34 plotter - ecec17d
(Sponsored by Flexpool.io - Check them out if you're looking for a secure and scalable Chia pool)

Network Port: 8444 [chia]
Final Directory: /farm
Number of Plots: 40
Crafting plot 1 out of 40 (2022/04/22 12:38:43)
Process ID: 10272
Number of Threads: 36
Number of Buckets P1:    2^8 (256)
Number of Buckets P3+P4: 2^8 (256)
Pool Puzzle Hash:  ...
Farmer Public Key: ...
Working Directory:   /plot4/
Working Directory 2: /plot4/
Plot Name: plot-k34-2022-04-22-12-38-...
[P1] Table 1 took 80.5456 sec
[P1] Table 2 took 421.855 sec, found 17179895526 matches
[P1] Table 3 took 692.764 sec, found 17180041892 matches
[P1] Table 4 took 849.356 sec, found 17180079257 matches
[P1] Table 5 took 850.095 sec, found 17180234707 matches
[P1] Table 6 took 808.098 sec, found 17180657540 matches
[P1] Table 7 took 632.536 sec, found 17181473914 matches
Phase 1 took 4335.28 sec
[P2] max_table_size = 17181473914
[P2] Table 7 scan took 64.9589 sec
[P2] Table 7 rewrite took 350.204 sec, dropped 0 entries (0 %)
[P2] Table 6 scan took 116.656 sec
[P2] Table 6 rewrite took 173.673 sec, dropped 2324880433 entries (13.532 %)
[P2] Table 5 scan took 111.72 sec
[P2] Table 5 rewrite took 168.228 sec, dropped 3047672881 entries (17.7394 %)
[P2] Table 4 scan took 108.128 sec
[P2] Table 4 rewrite took 164.688 sec, dropped 3315222793 entries (19.2969 %)
[P2] Table 3 scan took 107.269 sec
[P2] Table 3 rewrite took 165.895 sec, dropped 3420069244 entries (19.9072 %)
[P2] Table 2 scan took 108.445 sec
[P2] Table 2 rewrite took 164.671 sec, dropped 3462035936 entries (20.1517 %)
Phase 2 took 1865.99 sec
Wrote plot header with 252 bytes
[P3-1] Table 2 took 257.428 sec, wrote 13717859590 right entries
[P3-2] Table 2 took 214.897 sec, wrote 13717859590 left entries, 13717859590 final
[P3-1] Table 3 took 263.115 sec, wrote 13759972648 right entries
[P3-2] Table 3 took 213.227 sec, wrote 13759972648 left entries, 13759972648 final
[P3-1] Table 4 took 267.454 sec, wrote 13864856464 right entries
[P3-2] Table 4 took 412.839 sec, wrote 13864856464 left entries, 13864856464 final
[P3-1] Table 5 took 288.155 sec, wrote 14132561826 right entries
[P3-2] Table 5 took 401.173 sec, wrote 14132561826 left entries, 14132561826 final
[P3-1] Table 6 took 290.407 sec, wrote 14855777107 right entries
[P3-2] Table 6 took 385.246 sec, wrote 14855777107 left entries, 14855777107 final
[P3-1] Table 7 took 305.812 sec, wrote 17181473914 right entries
terminate called after throwing an instance of 'std::runtime_error'
  what():  thread failed with: thread failed with: small_delta >= 256 (407)
Command terminated by signal 6
198313.73user 15600.11system 2:43:47elapsed 2176%CPU (0avgtext+0avgdata 125113576maxresident)k

@bladeuserpi commented Apr 25, 2022

In my case I root-caused this to:
- I recently enabled the XFS "discard" mount option; the default for RHEL 8.5 is no discard (see the check below)
- I am also using RAID0 with 3× NVMe, plus one RAID0 with 2 SATA SSDs
- Initially I suspected that a P3700 engineering sample with older firmware was
  a contributing factor, but the error also reproduced after replacing it with another SSD
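
A minimal way to confirm whether online discard is in effect (a sketch; /plot4 is the working directory from the log above, the rest is generic):

# list the mount options in effect for the working directory;
# a "discard" entry means online (mount-time) discard is enabled
findmnt -no OPTIONS /plot4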

There are also other error messages when running multiple times:
[P3-1] Table 7 took 344.503 sec, wrote 17183710592 right entries
terminate called after throwing an instance of 'std::runtime_error'
  what():  thread failed with: thread failed with: small_delta >= 256 (950)

[P3-1] Table 7 took 343.122 sec, wrote 17182007329 right entries
free(): invalid next size (normal)
Command terminated by signal 6

[P1] Table 2 took 790.174 sec, found 17179640565 matches
terminate called after throwing an instance of 'std::runtime_error'
  what():  thread failed with: input not sorted

[P1] Table 2 took 711.174 sec, found 17180065469 matches
Command terminated by signal 11

These were sometimes logged (but not for all crashes) with the P3700 ES,
so initially I suspected it was a contributing factor:
blk_update_request: critical target error, dev nvme2n1, sector 1558722560 op 0x3:(DISCARD) flags 0x0 phys_seg 1 prio class 0
blk_update_request: critical target error, dev nvme2n1, sector 1560120320 op 0x3:(DISCARD) flags 0x0 phys_seg 1 prio class 0
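
One way to correlate crashes with these block-layer errors (a sketch, not from the original report) is to follow the kernel log while the plotter runs:

# follow the kernel log, filtering for block-layer / discard errors
dmesg --follow | grep -E 'blk_update_request|DISCARD'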

Conclusion:
- When mounting without discard, the problem goes away; madmax runs without errors.
- When mounting with discard, the problem comes back for my stacked mdraid configuration.
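
A minimal sketch of the workaround, assuming XFS on /dev/md0 mounted at /plot4 (device and mount point are illustrative, not confirmed by the report):

# remount the working directory without online discard
umount /plot4
mount -o nodiscard /dev/md0 /plot4

# TRIM can still be issued out-of-band on a schedule instead
systemctl enable --now fstrim.timer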

@bladeuserpi commented May 2, 2022

A closer look shows this might be related to mdadm RAID0 with different-size SSDs:
- 1.2 TB
- 750 GB
- 1.6 TB

I expected mdadm RAID0 to use only the smallest member size, i.e. 3 × 750 GB,
but it actually uses the full 1.2 + 0.75 + 1.6 TB.
From my understanding, it stripes across all 3 disks until it reaches
the capacity of the smallest disk, then stripes over the remaining disks, and so on.

My current assumption is that "mdadm RAID0 + discard" has a data corruption bug when combining
different-size disks (at least for the RHEL 8.5 kernel in my testing; I have not yet tested a newer/upstream kernel).
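
To verify the layout and what discard parameters each member advertises, something like the following can help (device names are examples, not taken from the report):

# show the RAID0 geometry, chunk size and member devices
mdadm --detail /dev/md0

# show the discard (TRIM) granularity and limits each member reports;
# mismatched values across members would support this theory
lsblk --discard /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1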
