Relion 5 Class3D MPI_ABORT was invoked on rank 1 #1238

Open
PierrePaillard opened this issue Feb 13, 2025 · 1 comment
Comments

@PierrePaillard

Describe your problem

I am doing cryo-ET subtomogram averaging (STA) on a coated protein from purified bacterial vesicles, using the RELION 5.0.0 stable release. I run into an MPI error during 3D classification.

The first iteration works, but in all cases, whatever I change (box size, GPU, ...), iteration 2 crashes.

Environment:

  • RELION version: RELION-5.0.0-commit-e62d3b
  • Running on a SLURM cluster; each node has at least 16 CPUs, 4 GPUs,
    and 256 GB of system memory.
  • The GPUs range from GTX 1080 and GTX 1080 Ti to RTX 2080 Ti,
    RTX A5000, RTX A6000 and A40.

Dataset:

  • Box size: 60 px
  • Pixel size: 2.14 Å/px
  • Number of particles: >41,000
  • Description: a decameric protein of about 100 kDa in total

Job options:

  • Type of job: 3D classification
  • Number of MPI processes: 5
  • Number of threads: 3
  • Full command:

`which relion_refine_mpi` --o Class3D/job076/run --ios Extract/job049/optimisation_set.star --ref Reconstruct/job050/merged.mrc --firstiter_cc --trust_ref_size --ini_high 90 --dont_combine_weights_via_disc --pool 3 --pad 2 --ctf --ctf_intact_first_peak --iter 10 --tau2_fudge 1 --particle_diameter 200 --K 5 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 2 --sigma_ang 1.66667 --offset_range 5 --offset_step 2 --allow_coarser_sampling --sym C1 --norm --scale --j 3 --gpu "" --pipeline_control Class3D/job076/-


**Error message:**
ERROR: 
**No orientation was found as better than any other.**

A particle image was compared to the reference and resulted in all-zero
weights (for all orientations). This should not happen, unless your data
has very special characteristics. This has historically happened for some 
lower-precision calculations, but multiple fallbacks have since been 
implemented. Please report this error to the relion developers at 
github.com/3dem/relion/issues  

[hippo-20.cryst.bbk.ac.uk:47088] 3 more processes have sent help message help-mpi-api.txt / mpi-abort
[hippo-20.cryst.bbk.ac.uk:47088] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
==================


RUN.OUT MESSAGE:

Auto-refine: Estimated accuracy angles= 0.05 degrees; offsets= 1 Angstroms
Coarser-sampling: Angular step= 7.5 degrees.
Coarser-sampling: Offset search range= 49.9999 Angstroms; offset step= 9.99998 Angstroms
CurrentResolution= 85.7141 Angstroms, which requires orientationSampling of at least 45 degrees for a particle of diameter 200 Angstroms
Oversampling= 0 NrHiddenVariableSamplingPoints= 405
OrientationalSampling= 15 NrOrientations= 1
TranslationalSampling= 20 NrTranslations= 81
=============================
Oversampling= 1 NrHiddenVariableSamplingPoints= 25920
OrientationalSampling= 7.5 NrOrientations= 8
TranslationalSampling= 9.99998 NrTranslations= 648
=============================
Expectation iteration 2 of 10
000/??? sec ~~(,_,">                                                          [oo]
RELION version: 5.0.0-commit-e62d3b
exiting with an error ...

RELION version: 5.0.0-commit-e62d3b
exiting with an error ...

RELION version: 5.0.0-commit-e62d3b
exiting with an error ...

RELION version: 5.0.0-commit-e62d3b
exiting with an error ...
--------------------------------------------------------------------------
**MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD 
with errorcode 1.**

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
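Since the error text mentions lower-precision calculations, one possible isolation step would be a short CPU-only rerun of the same command: all flags as above, with `--gpu ""` removed and `--iter` lowered to 2 so the test reaches the failing iteration quickly. This is only a sketch; the output directory Class3D/test_cpu is a placeholder, and the plain `mpirun -n 5` stands in for however the SLURM submission normally launches the 5 MPI processes:

```
# CPU-only test: same options as the failing job, but --gpu removed and --iter 2
mpirun -n 5 `which relion_refine_mpi` --o Class3D/test_cpu/run \
  --ios Extract/job049/optimisation_set.star --ref Reconstruct/job050/merged.mrc \
  --firstiter_cc --trust_ref_size --ini_high 90 --dont_combine_weights_via_disc \
  --pool 3 --pad 2 --ctf --ctf_intact_first_peak --iter 2 --tau2_fudge 1 \
  --particle_diameter 200 --K 5 --flatten_solvent --zero_mask --oversampling 1 \
  --healpix_order 2 --sigma_ang 1.66667 --offset_range 5 --offset_step 2 \
  --allow_coarser_sampling --sym C1 --norm --scale --j 3
```

If such a run gets past iteration 2, the problem is more likely on the GPU side; if it crashes the same way, the data or the star file would be the next suspect.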

See below a screenshot of the first iteration, which looks good. I thought the presence of the membrane could help the 3D classification, but I cannot get past classification iteration 2...
[Image: screenshot of the first iteration]

What could be the problem?

Thanks a lot for any help ;).
Cheers,
Pierre

@PierrePaillard
Author

Update / additional information regarding the pipeline:

  • Pre-processing up to tomogram reconstruction using the Relion 5 stable release
  • Denoising outside Relion using IsonetV2
  • Membrane segmentation outside Relion using MemBrain-seg, for manual picking
  • Coordinate extraction using matlab and relion2dynamo scripts to obtain a table.star
  • Subtomogram extraction in Relion to obtain a particles.star for the subsequent particle reconstruction / 3D classification in Relion 5 (see the quick file check sketched after this list)
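
Since the extraction path crosses several tools, one quick sanity check is that every 2D stack referenced by the particles actually exists and is non-empty. A minimal sketch (assuming the star file fed to refinement is Extract/job049/particles.star; adjust the path as needed), reading the _rlnImageName column (#8 in the table below):

```
# Flag any referenced subtomogram stack that is missing or zero-sized.
awk '/_stack2d.mrcs/ { print $8 }' Extract/job049/particles.star | \
while read stack; do
  [ -s "$stack" ] || echo "missing or empty: $stack"
done
```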

Below is the header of the particles.star that I used for Reconstruct particle and 3D classification:

# version 50001

data_general

_rlnTomoSubTomosAre2DStacks 1

# version 50001

data_optics

loop_
_rlnVoltage #1
_rlnSphericalAberration #2
_rlnAmplitudeContrast #3
_rlnTomoTiltSeriesPixelSize #4
_rlnOpticsGroup #5
_rlnOpticsGroupName #6
_rlnCtfDataAreCtfPremultiplied #7
_rlnImageDimensionality #8
_rlnTomoSubtomogramBinning #9
_rlnImagePixelSize #10
_rlnImageSize #11
300.000000 2.700000 0.100000 2.140000 1 Position_1_2 1 2 4.672897 10.000000 60
300.000000 2.700000 0.100000 2.140000 2 Position_1 1 2 4.672897 10.000000 60
300.000000 2.700000 0.100000 2.140000 3 Position_2_2 1 2 4.672897 10.000000 60
300.000000 2.700000 0.100000 2.140000 4 Position_2_3 1 2 4.672897 10.000000 60
300.000000 2.700000 0.100000 2.140000 5 Position_2 1 2 4.672897 10.000000 60
300.000000 2.700000 0.100000 2.140000 6 Position_3_2 1 2 4.672897 10.000000 60
300.000000 2.700000 0.100000 2.140000 7 Position_3 1 2 4.672897 10.000000 60

# version 50001

data_particles

loop_
_rlnAngleRot #1
_rlnAngleTilt #2
_rlnAnglePsi #3
_rlnTomoName #4
_rlnOpticsGroup #5
_rlnTomoParticleName #6
_rlnTomoVisibleFrames #7
_rlnImageName #8
_rlnOriginXAngst #9
_rlnOriginYAngst #10
_rlnOriginZAngst #11
_rlnCenteredCoordinateXAngst #12
_rlnCenteredCoordinateYAngst #13
_rlnCenteredCoordinateZAngst #14
167.910000 161.070000 -52.69600 Position_1 2 Position_1/1 [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1] Extract/job049/Subtomograms/Position_1/1_stack2d.mrcs 0.000000 0.000000 0.000000 -4543.19860 4171.502000 -669.99120
-153.74000 146.660000 -104.67600 Position_1 2 Position_1/2 [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1] Extract/job049/Subtomograms/Position_1/2_stack2d.mrcs 0.000000 0.000000 0.000000 -4493.20820 4171.502000 -669.99120
36.068000 148.840000 67.380000 Position_1 2 Position_1/3 [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1] Extract/job049/Subtomograms/Position_1/3_stack2d.mrcs 0.000000 0.000000 0.000000 -4523.18960 4221.578000 -669.99120
-160.78000 153.920000 -150.64200 Position_1 2 Position_1/4 [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1] Extract/job049/Subtomograms/Position_1/4_stack2d.mrcs 0.000000 0.000000 0.000000 -4443.19640 4181.560000 -659.99740
43.555000 149.560000 118.140000 Position_1 2 Position_1/5 [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1] Extract/job049/Subtomograms/Position_1/5_stack2d.mrcs 0.000000 0.000000 0.000000 -4473.19920 4221.578000 -669.99120
109.460000 154.360000 -57.65300 Position_1 2 Position_1/6 [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1] Extract/job049/Subtomograms/Position_1/6_stack2d.mrcs 0.000000 0.000000 0.000000 -4583.19520 4141.542000 -650.00360
130.890000 139.550000 26.970000 Position_1 2 Position_1/7 [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1] Extract/job049/Subtomograms/Position_1/7_stack2d.mrcs 0.000000 0.000000 0.000000 -4593.21040 4191.618000 -650.00360
-95.40000 154.400000 -155.69500 Position_1 2 Position_1/8 [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1] Extract/job049/Subtomograms/Position_1/8_stack2d.mrcs 0.000000 0.000000 0.000000 -4403.19980 4161.658000 -640.00980

I wonder if the issue comes from this star file content, which might not fit with Relion 5?
Does anybody have an idea why the 3D classification fails?
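
For what it is worth, the data_optics block above looks internally consistent: the subtomogram pixel size should be the tilt-series pixel size times the subtomogram binning, and 2.14 Å/px × 4.672897 ≈ 10.0 Å/px, which matches _rlnImagePixelSize. A rough sketch of the same check (assuming the seven optics rows above are saved as plain whitespace-separated lines in a file called optics_rows.txt):

```
# For each optics group, compare tilt-series pixel size (#4) x binning (#9)
# against the stored subtomogram pixel size (#10).
awk '{ printf "group %s: %.6f x %.6f = %.6f (star file says %s)\n", $5, $4, $9, $4 * $9, $10 }' optics_rows.txt
```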

run.out ends with the same error message as quoted above: "No orientation was found as better than any other.", followed by "MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD with errorcode 1."

Thanks a lot for any help with this issue.
PierreP
