
Meeting 2016 02 Minutes

  • Vendor/Distributions should ALL use --with-ident-string.

    • Went around the room and asked each vendor to do this if they are not.
    • Change the code in git to be "Unknown distribution", and then at release time change it to "Open MPI Community"
      • ACTION: Jeff Squyres
  • Do we really need --enable-mpi-thread-multiple any more?

    • Only ~40 nanoseconds difference on a 0-byte shared-memory ping-pong.
    • Not many thread-multiple tests - just a handful of old Sun thread-multiple tests.
    • MUCH discussion about this.
    • ISSUE - the OPAL layer might want to take advantage of threading, but can't if the MPI layer doesn't support it.
      • Can it be separated? OPAL used libevent because it couldn't use OPAL threads.
    • ACTION: Nathan will create a Pull Request so we can investigate - this would make this config option useless.
      • How many lock checks? Request, Comm, free list, ____, probably a dozen places?
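For context, a minimal sketch (not from the meeting) of what requesting thread-multiple looks like from the application side; whether "provided" ever comes back as MPI_THREAD_MULTIPLE is what the --enable-mpi-thread-multiple build option currently governs:

```c
/* Minimal sketch: request MPI_THREAD_MULTIPLE and verify what the library
 * actually provides.  If Open MPI was built without thread-multiple support,
 * "provided" comes back lower than requested. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        printf("Requested MPI_THREAD_MULTIPLE, got level %d\n", provided);
    }
    MPI_Finalize();
    return 0;
}
```
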
  • Git etiquette: please have the first line of Git commit messages be: <area>: <= 50 characters summarizing the commit (e.g., "btl/tcp: fix endpoint cleanup").

  • ACTION: Howard - add a GitHub issue template saying "DON'T submit issues here; use ompi/issues instead" to the ompi-release repo.

  • Discuss ways to help with creating pull requests for cherry-picking master changes to the other release branches.

    • Either use labels like Howard does with the fork of libfabric for Cray, or have it auto-create an issue.
  • SLURM binding issue - srun myapp - if it sees there's no constraint on how it can be run, then if np <= 2 bind to core, np > 2 bind to socket.

    • If this app is a hybrid Open MP + MPI, this can mess them up.
    • If direct launching, SHOULD Open MPI be doing ANY binding?
  • Ralph - Suggests splitting up OPAL and ORTE and OMPI configure.ac stuff.

    • Whenever someone touches these... there are still some places referring to ompi in the opal layer.
    • Do we want to split these up, so that this is cleaner? Or continue where it's all mushed?
    • Generally sounds good. Resource issue, but we could do this gradually.
  • Discussion about configure - users not configuring what they think they have.

    • ompi_info - prints out what was configured (ompi_info -a doesn't print the blurb).
    • ACTION: fix ompi_info -a to print out components.
    • TWO OPTIONS:
      • Add a summary blurb at the end of configure, and have ompi_info print the frameworks and the list of components.
      • ACTION: Nathan will create a summary list of
      • SLURM -YES (maybe param)
      • TCP BTL - YES ...
    • ACTION: Geoff - come up with a script to grep through the m4 files and generate a platform file of all of the possible options, hopefully ordering them and adding a few comments.
      • Version this platform file to work for a particular release.
    • The Linux kernel does this the right way: yes, no, auto, no-auto.
      • Like a Platform File. Have a tool that exports a complete Platform File for ALL options.
      • Can't use Linux system because it's all GPL, but perhaps other systems out there.
    • configure --help is built from the M4 system.
  • --host and --hostfile behavior (Ralph+Jeff)

    • PR 1353 seems to have gone too far.
    • Can we avoid a flip-flop between 1.10.x and 2.x?
    • NEW BEHAVIOR: --host when running under a resource manager and user didn't say -np X
      • No longer auto-compute number of procs, it's an ERROR.
      • Add a "fill up" option that says to run as many procs as needed to fill up all slots on each node.
      • One way or another, you must specify the number of processes.
      • If you don't specify the number of processes using any of these flags, it will be an error.
      • -novm - implies a homogeneous cluster, so mpirun can infer based on its own node.
      • SLURM has a regular expression to communicate what's on each node, so it can compute this.
      • Otherwise, we need to start up daemons that each detect and communicate back, just to compute this.
    • Some use cases (Fault Tolerant MPI) where users want to specify MORE hosts, but run on fewer hosts to start.
    • Check whether the command-line request REQUIRES deeper node topology info; if so, ask PMIx for the info; if PMIx can't give us the info, then we'll launch DVM daemons to query it.
    • Jeff documented -host / --hostfile behavior in PR 1353.
    • OKAY with IBM just adding the Platform MPI --hostlist syntax to specify both the number of procs and the hostfile. And document it.
      • ACTION: Geoff will make this happen and file PR.
  • 1.10.3 -host -np 2 will error saying OVERSUBSCRIBED.

    • Went in, reviewed, went through the cycle.
    • We hate this, but no one is going to fix it.
  • SLURM direct launch auto-binding.

    • Resolved to leave the srun myapp binding behavior as it is.
    • Resolved to fix srun --cpu_bind=none (can detect, and not bind)
    • Already works when users specify srun binding, or mpi bindings.
  • MPI / OpenMP hybrid, especially nested OpenMP (parallel do loops).

    • Howard presented some info about KNL: 4 hyperthreads per core, with MANY cores.
    • Easy way for the app writer to specify placement in OpenMP: OMP_PLACES={0,1,2,3} <- the spec is very wishy-washy.
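As a quick illustration of the hybrid placement concern, a small sketch that reports where each rank's OpenMP threads actually land; sched_getcpu() is a GNU extension used here only to make a bad binding visible:

```c
/* Sketch: report where each rank's OpenMP threads run, to make bad bindings
 * visible.  Build with something like: mpicc -fopenmp placement.c */
#define _GNU_SOURCE
#include <mpi.h>
#include <omp.h>
#include <sched.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        printf("rank %d thread %d of %d on cpu %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads(),
               sched_getcpu());
    }

    MPI_Finalize();
    return 0;
}
```
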
  • PR 1308 - MPI_Comm_set_info / MPI_Comm_get_info - This is a mess, and if we implement it according to the standard, the GET is fairly useless. George and Dave will write up the issues and what Open MPI suggests the MPI Forum discuss and clarify. THEN "we" will implement THAT.
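For reference, a small sketch of the set/get round trip in question; "my_hint" is a made-up key, and the point is that the standard lets MPI_Comm_get_info return only the hints the implementation understood and recorded, so the key may simply not come back:

```c
/* Sketch of the set/get round trip discussed above.  "my_hint" is a
 * hypothetical key: an implementation that ignores it is allowed to drop it,
 * so the get may report the key as not found. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Info set_info, get_info;
    char value[MPI_MAX_INFO_VAL];
    int flag;

    MPI_Init(&argc, &argv);

    MPI_Info_create(&set_info);
    MPI_Info_set(set_info, "my_hint", "true");   /* hypothetical hint key */
    MPI_Comm_set_info(MPI_COMM_WORLD, set_info);
    MPI_Info_free(&set_info);

    MPI_Comm_get_info(MPI_COMM_WORLD, &get_info);
    MPI_Info_get(get_info, "my_hint", MPI_MAX_INFO_VAL - 1, value, &flag);
    printf("my_hint %s\n", flag ? "was preserved" : "was dropped");
    MPI_Info_free(&get_info);

    MPI_Finalize();
    return 0;
}
```
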

  • Features for 2.1.0 -

    • OpenMP/Open MPI interop -
    • Discussed the --entry feature from Platform-MPI for loading and running multiple PMPI_ profiling libraries (see the PMPI wrapper sketch after this list).
      • Jeff thinks Gilles changed Fortran to call PMPI_ for Open MPI 2.0 - because there are a small number of places where you have to know whether you were called from C or Fortran.
      • Open MPI uses -runpath (LD_LIBRARY_PATH overwrites -rpath in some/most cases).
      • ACTION - Mark will create a RFC with his write up and some example code to discuss further.
    • Discussed --aff from Platform-MPI. Mostly just syntactic sugar on top of existing Open MPI commands.
      • Displayed some -aff and -prot output.
    • Discussed --prot. Platform-MPI prints in MPI_Init, but we don't know much until after the connections are demanded.
      • Had some old discussion about 3 levels of printing: modex(?), what we WANT to use, establish connections & print.
      • Nathan - could do some setting up of BTLs to ask "Is this peer reachable", with some new BTL functionality.
        • BTL send array in endpoint. OB1 rotates messages over endpoints.
      • Specifying the network interface using. Multi-rail would be REALLY useful to see which and how many are in use.
      • Nathan suggested adding this as downcall into PMLs, and have them report back up.
      • Jeff - Would like to see number of rails, and number of ports, and list of network interfaces, maybe addressing.
      • Suggestion: TCP2 - 2 TCP rails.
      • Suggestion: it has to be optional, because of launch / teardown cost. George points out that much of the info is in the modex.
      • Suggestion: gather some info about what was actually used in Finalize.
      • Suggestion: Name, compare all names for BTLs - if there is an EXCEPTION print that out LOUDLY.
      • could print just * for each "node" in NxN graph.
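Referring back to the --entry / PMPI_ item above, a minimal sketch of the standard profiling-interface pattern such a feature would load: a wrapper library intercepts the MPI_ symbol and forwards to the PMPI_ entry point. Loading several of these at once, and knowing whether the call arrived from C or Fortran, is what makes the multi-library case tricky:

```c
/* Sketch of a PMPI profiling wrapper of the kind the --entry discussion is
 * about: intercept MPI_Send, do the profiling work, forward to PMPI_Send. */
#include <mpi.h>
#include <stdio.h>

int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    printf("profiling wrapper: MPI_Send of %d elements to rank %d\n",
           count, dest);
    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}
```

Built as a shared library and placed ahead of libmpi at link time (or preloaded), a wrapper like this sees the call before the MPI library does.
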
  • Mellanox/LANL - customers don't want multiple different Open MPI installations in their environment.

    • Intent - Vendor would supply platform.path, and the customer would put it into a special directory. Got the mechanism, but didn't take the specific patches.
    • --with-platform=FILE - Mellanox can publish these patches, and customers can build with it.
    • Something shows up in ompi_info when they do --with-platform=FILE.

--- Wednesday ----

  • What do we want / need in v2.1.0
  • PMIx
    • Two barriers to eliminate. One at the beginning of Finalize - can't get rid of that one now.
    • One at the end of MPI_Init.
    • BUT (only for fabrics that can pre-compute endpoints).
    • Optional flag to say whether we want the barrier or not. Today, PML selection is a local decision.
  • PMIx Fault Response APIs
    • MPI has some error registration capability.
    • PMIx offers the capability for the application to describe what the response should be.
    • Reg Err (callback fnc) - MPI has a way to register a callback.
    • PMIx is adding an error response.
    • One option would be if the app gets the callback, then the app can call one of PMIx Error handling functions.
    • Or can come up with some MPI wrappers to eventually push into standard.
    • Question: Does it have to be an MPI function? - No, but for Open MPI it's coming to the application through MPI API.
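A minimal sketch of the existing MPI-level registration referred to above (MPI_Comm_create_errhandler / MPI_Comm_set_errhandler); the idea under discussion is that such a callback could in turn call one of the PMIx error-handling functions (omitted here):

```c
/* Sketch: register an error callback on MPI_COMM_WORLD.  Inside the callback
 * an application could, per the discussion, call into PMIx error handling. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

static void my_error_cb(MPI_Comm *comm, int *errcode, ...)
{
    char msg[MPI_MAX_ERROR_STRING];
    int len;

    MPI_Error_string(*errcode, msg, &len);
    fprintf(stderr, "MPI error reported: %s\n", msg);
    /* application-chosen response goes here */
    abort();
}

int main(int argc, char **argv)
{
    MPI_Errhandler eh;

    MPI_Init(&argc, &argv);
    MPI_Comm_create_errhandler(my_error_cb, &eh);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, eh);
    /* ... application ... */
    MPI_Errhandler_free(&eh);
    MPI_Finalize();
    return 0;
}
```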
  • Status update on new MTT database / Schema
  • UCX - Collaborative
    • Would be nice to have UCX source be included in OMPI sources.
    • Precedent for this: libfabric, PMIx, hwloc, others.
      • Do it as a framework.
      • Must support internal / external. Have to do some configury work.
      • Same License.
  • Question - for subprojects, at a given level of OMPI say 2.1, can the subcomponent rev with new features?
    • There is precedent: we brought in a new version of hwloc for some bug fixes in 1.10.x, but that brought in some more features.
  • Names of high speed networks are very Open MPI specific: vader, OB1, CM, etc.
    • In addition, there are multiple paths to a particular type of network through Open MPI.
    • Have talked about --[enable|disable] NETWORK_TYPE:QUALIFIER.
    • Now Tools want this same information.
    • What Platform-MPI does for Network types: MPI_IC_ORDER xxx,yyy,zzz,tcp (TCP is always last since it's slowest).
      • On command line can take lowercase (hint) or Uppercase (Demand) - -xxx, -XXX.
      • When doing "hybrid" approaches, like RoCE or IPoIB, the command line looks like the protocol, plus some additional parameters to supply additional information when needed.
      • Separate -intra=nic option to specify not using shared memory for ranks on the same node.
    • Open MPI is similar, but inclusive rather than exclusive.
    • https://www.open-mpi.org/community/lists/devel/2015/10/18154.php
    • Have the dilemma of choice.
    • Do Providers have preferences of paths? Ex: MxM BTL, MxM MTL, UCX, openib
      • There is no convenient way to disable multiple paths to the same thing.
    • Probably going to get into multiple conflicting use-cases.
    • Each component COULD register multiple short names, and disabling one short name would disable everything that registered that short name (see the sketch after this list).
    • If you ask to include something that's not possible, it's an abort error.
    • If you exclude something that isn't possible continue running.
    • If user specifies "MxM" autodetection will determine if it uses PML or MTL based on priorities.
    • IDEA: Why not just assume shared memory and Self?
    • Direct Launch? Yes this will apply to both.
    • --net
    • Components can register multiple names.
    • ^ for disable?? can disable a "group" of protocols.
    • The tricky one is libfabric, which can do
    • QUALIFIER - could specify path through Open MPI.
    • Conflicts should abort.
    • How do we override if a system admin put a variable in the default params file?
      • No way to zero out an MCA parameter set in a file.
      • Maybe we need to address THAT!
      • perhaps an 'unset' option.
    • Can add an explicit call for MCA base framework - components decide what grouping they are in.
    • Right now, for an MCA parameter, you can't negate specific items in the list... only the entire list.
    • The MCA base can resolve this across the multiple frameworks.
      • When components register themselves.
      • Call these "option-sets".
    • Do we need an option that separates the runtime (ORTE) from the MPI layer?
    • Would be nice if this was syntactic sugar on top of what we have today, with also new names.
    • The UD component in the OOB will register as InfiniBand. The user turns this off since they're thinking of MPI.
      • Should this apply ONLY to MPI layer? What about OPAL layer?
      • RTE might need it.
      • As we move to point where runtimes / startup is using the fabric, we might want it to apply to both?
      • If you cut TCP - makes sense for MPI, but need it for IO, and other RTE stuff.
    • TCP over IB is complex too - looks like TCP, but runs over InfiniBand.
    • Let's not let this paralyze us. There's still some stuff we can do.
    • Let's take the first step. Don't get screwed up by TCPoIB.
    • Just apply it to MPI one-sided, pt2pt, and coll traffic today!
    • In register: components register short names for BTLs, MTLs, OSC, COLLs, PMLs; the MCA base will filter, R2 will have to filter, OSC will filter. Give it a component structure and a filter type, and it returns a boolean for whether you can use it.
    • Any reason to specify self or shared memory? (cross memory attach, etc).
    • Okay with always imply self. but why is shared memory different?
    • How can the users figure out what the options are? If we use registration system, can use ompi_info.
    • Sticky because the head node doesn't necessarily.
    • libfabric - fi_info shows the options - lists everything that's built in. UCX? We don't want to have to initialize the library.
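A rough, hypothetical sketch of the short-name idea referenced above (none of these names or functions exist in Open MPI today): each component registers the network short names it drives, and the MCA base filters include/exclude lists against those names:

```c
/* Hypothetical sketch of the "option-set" / short-name idea; nothing here is
 * an existing Open MPI API. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

struct component_names {
    const char *component;      /* e.g. "btl/openib" */
    const char *short_names[4]; /* networks it can drive */
};

/* Components would register something like this: */
static const struct component_names registry[] = {
    { "btl/openib", { "ib",  NULL } },
    { "mtl/mxm",    { "mxm", "ib", NULL } },
    { "btl/tcp",    { "tcp", NULL } },
};

/* MCA-base-style filter: disabling a short name disables every component
 * that registered it. */
static bool component_allowed(const char *component, const char *disabled_name)
{
    for (size_t i = 0; i < sizeof(registry) / sizeof(registry[0]); ++i) {
        if (0 != strcmp(registry[i].component, component)) continue;
        for (const char *const *n = registry[i].short_names; NULL != *n; ++n) {
            if (0 == strcmp(*n, disabled_name)) return false;
        }
    }
    return true;
}

int main(void)
{
    printf("btl/openib with \"ib\" disabled: %s\n",
           component_allowed("btl/openib", "ib") ? "allowed" : "filtered out");
    return 0;
}
```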

--- Wednesday after Lunch --

Takahiro Kawashima - Fujitsu presents. Fujitsu roadmap - K computer, FX10, and FX100 in operation now.

  • Flagship 2020 project is post K computer.

  • 2016-2017 - Fujitsu MPI is OMPI 1.8.4 based. (Skipping 1.10, but backporting many bugfixes)

  • 2018 sometime Fujitsu MPI will update to OMPI 2.0.x

  • late 2019 Fujitsu MPI will move to OMPI 3.0.x - Post K computer (mid 2020). Post K computer - true use of several million processes. targeting < 1GB memory usage per process.

  • Open MPI 2.0 add-procs will be a big win here.

  • SPARC64 IXfx - two ccNUMA nodes.

  • TOFU2 - 6D TORUS/MESH; 4 RDMA engines per node (put, get, atomic); Global barrier with reduction operation.

  • tofu BTL, tofu LLP

  • tofu specific collectives mtofu COLL.

  • tofu OSC. Progress thread on an assistant core (OPAL runtime) - developed for tofu purposes (not the same as the Open MPI progress thread).

  • Tofu - process management (no orted).

  • FEFS (Fujitsu's filesystem) ad_fefs in ROMIO.

  • statistical information for tuning.... and others. Challenges:

  • Reducing memory consumption for exa-scale (10M)

  • Reducing MPI_Init/Finalize time for exa-scale

  • Collective communication algorithms tuned for the new interconnect.

  • Supporting the next MPI standard such as MPI-4.0?

  • many-core processing, power saving, etc. Collaboration:

  • Submitted PRs and patches for bug fixes.

  • Plan to merge some enhancements.

  • Develop new features and performance improvements

  • run MTT on our machines

  • Provide test programs.

  • Some source code contribution plan target for OMPI 2.1 or 3.0?

    • Statistical tuning.
  • Plan to reimplement statistical information based on MPI_T performance variable and contribute back.

  • Feedback: PAPI has a system to expose software counters; PAPI will go through MPI_T.
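A minimal sketch of the consumer side of the MPI_T performance-variable interface that the statistics above would be exposed through: enumerate the pvars the MPI library exports and print their names (reading a value additionally requires an MPI_T_pvar_session and handle):

```c
/* Sketch: list the MPI_T performance variables exported by the MPI library.
 * This only enumerates names; reading values needs a session and a handle. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, num, i;

    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_Init(&argc, &argv);

    MPI_T_pvar_get_num(&num);
    for (i = 0; i < num; ++i) {
        char name[256], desc[256];
        int name_len = sizeof(name), desc_len = sizeof(desc);
        int verbosity, var_class, binding, readonly, continuous, atomic;
        MPI_Datatype dt;
        MPI_T_enum enumtype;

        MPI_T_pvar_get_info(i, name, &name_len, &verbosity, &var_class,
                            &dt, &enumtype, desc, &desc_len, &binding,
                            &readonly, &continuous, &atomic);
        printf("pvar %d: %s\n", i, name);
    }

    MPI_Finalize();
    MPI_T_finalize();
    return 0;
}
```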

  • Fujitsu MPI has timeout-based deadlock detection feature. (~ 1 minute)

    • --mca mpi_deadlock_timeout
    • Not intelligent, but sufficient for most deadlock bugs; it times the requests that WAIT is called on.
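The classic bug such a timeout catches, as a sketch (run with 2 ranks): both ranks post a blocking receive before either sends, so nothing ever completes and the timeout fires:

```c
/* Sketch of the kind of deadlock a timeout-based detector flags: both ranks
 * block in MPI_Recv before anyone sends.  Run with 2 ranks. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, peer, dummy = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    peer = (rank == 0) ? 1 : 0;

    /* Deadlock: the matching sends below are never reached. */
    MPI_Recv(&dummy, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Send(&dummy, 1, MPI_INT, peer, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```

With a timeout-based detector such as the mpi_deadlock_timeout parameter mentioned above, this is reported after the timeout instead of hanging forever.
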
  • MPI Function hook (1/3) like PMPI but at configure time instead of link or runtime.

    • Procedures are implemented as components of MFH framework.
    • can wrap ALL functions at MPI layer via CPP magic by defining
    • OMPI_MFH_FOO_DEFINE() in ompi/mca/mfh/foo/mfh_foo_call.h
    • which functions to hook are configured by #if.
    • Can hook C, Fortran mpif.h, or both; doesn't support F08 yet.
    • Originally designed for exclusion control between an application thread and a progress thread.
    • Fujitsu MPI is NOT thread safe, so needs exclusion control.
  • Shows the code for cpp magic. Massive file that includes lots of info (args, name, etc) for each MPI routine.
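The MFH framework itself is Fujitsu-internal, so the following is only a rough, hypothetical illustration of the CPP pattern described (shown here in PMPI-wrapper form; the real hooks live inside the library and are selected at configure time). A macro expands to a wrapper that brackets the real call with the exclusion control mentioned above:

```c
/* Hypothetical illustration of configure-time CPP hooking in the style
 * described above; the OMPI_MFH_* and mfh_* names are made up. */
#include <mpi.h>

#define OMPI_MFH_HOOK_BARRIER 1   /* "which functions to hook are configured by #if" */

/* Hypothetical exclusion-control hooks (the stated original use case). */
static void mfh_enter(void) { /* take the app/progress-thread lock */ }
static void mfh_leave(void) { /* release it */ }

#if OMPI_MFH_HOOK_BARRIER
#define OMPI_MFH_BARRIER_DEFINE()                                      \
    int MPI_Barrier(MPI_Comm comm)                                     \
    {                                                                  \
        int rc;                                                        \
        mfh_enter();                                                   \
        rc = PMPI_Barrier(comm);                                       \
        mfh_leave();                                                   \
        return rc;                                                     \
    }

OMPI_MFH_BARRIER_DEFINE()
#endif
```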

  • Other use cases. Statistical info; Handle errors; Log function calls for debug; check of MPI_THREAD_SERIALIZED.

  • Can install multiple hooks, defined at configure time.

  • Memory consumption MPI_Init / Finalize time. Big challenges for exa-scale.

  • Fujitsu will evaluate the add_procs improvement soon - it's great!

  • MPI-4: would like to see candidate features:

    • ULFM and endpoints - in Open MPI 2.1 or 3.0 and post K computer 2020.
    • ULFM: fault-tolerance.org - George is hoping to pull it back in after PMIx integration.
    • endpoints: thought it would be accepted into the MPI standard by now.
      • We don't see endpoints in the MPI Forum next week; Open MPI is watching, it seems plausible, but no one has tried it yet.
  • PMIx - considering using the PMIx API instead of the Fujitsu proprietary API.

    • may collaborate to design exa-scale and fault tolerant capable API.
  • Fujitsu interests: memory footprint; launch scaling; ULFM; power control/management; common predefined info keys; network topology information, etc.

  • Displayed a graph of parallelizing MPI_Pack; can get good parallelization IF there are large blocks, many blocks, and extra cores to do the work.

  • Fujitsu MPI - OMPIO or ROMIO?

    • FEFS is based on Lustre - which to use?
    • OMPIO datatype engine is much better than ROMIO. Today Edgar says ROMIO.
  • Non-blocking collectives - libNBC? want to investigate.

    • No - just have it for conformance.
  • Planning to move to a new TOFU interface, and if/when they do that, then they can contribute BTL to community.

  • Fujitsu is not currently using hwloc, though hwloc WORKS on the K computer.

  • Run MTT on the SPARC64 cluster - maybe weekly or nightly.

  • Still need to negotiate with the owner. Cannot guarantee.

  • SPARC64 based CPUs - 32core (FX100) or 16 core (FX10) per node.

  • ~ 100 nodes

  • Would Fujitsu be allowed to publish test results? - Yes they want to publish.

  • How can Fujitsu publish MTT if not directly connected to Internet? - Some Relay mechanism needed.

  • Currently only support MPI_THREAD_SERIALIZED with progress thread. Not MPI_THREAD_MULTIPLE.

  • Edgar reported some OMPIO vs ROMIO performance numbers in various cases.

    • Bottom line: OMPIO performance is on par, or better in many cases, due to a more efficient datatype representation and progression engine hooks (only for BTLs???)
  • Framework rename: from mca_framework_component to project_framework_component in the package libdir. Shooting for Open MPI 2.1.

    • A couple of ISVs that do this.
  • Multiple Libraries issue:

    • App links against 3 libs. orte, orcm, mpi.
    • ACTION: agreed to slurp 3 libraries back to 1 library, and hopefully only expose MPI symbols in libmpi.
  • Static Framework for licensing.

    • Name it more generically: ompi/mca_hooks - MPI init/fini hooks.
    • opal_init - both orted and ranks - no communicators available - can use PMIx to do communication.
      • opal_init_util - don't have to get the key here.
    • The framework should live in opal, but you can open the framework wherever.
    • orte_init - has communication available.
    • Use a string instead of an enum to plug in to the framework to get hooks.
      • many of many - allow multiple callbacks.
    • opal_show_help something nice and abort - if we want to abort and don't want mpirun to show another error.
    • Don't worry about threading.

--- Thursday ---

  • Single Outstanding Request performance enhancement.

    • Nathan will look at C11 thread stuff for the 'super fast path' - should detect when a debugger is attached - GCC 5, icc 16 probably.
    • Mark and Nysal will look at the super fast path as another set of eyes.
    • George will look at converter initialization for the contiguous case.
      • possibly only call sendi on contiguous data if contiguous converter
  • Pro-tip: if you do a git clone, configure will see the .git dir and turn on --enable-debug and --enable-mem-debug.

    • If you clone, you need to configure it properly.
  • ACTION: Jeff is pretty sure it turns on 3 things, but .git is detected in just one place. Update HACKING and README.

    • Geoff will do this - the HACKING file is intended for developers hacking on Open MPI.
      • README - just check whether we mention it there, and edit it if we do.
      • configure --help does not mention that this happens.
      • The code is in opal_configure_options.m4 (in config/): if the .git test succeeds, opal_devel=1.
  • MPOOL discussion:

  • MPOOL users: grdma, gpusm,

    • When you create an MPOOL, despite the name, they are not BASE resources; they create resources in the component.
      • Abstraction VIOLATION!
    • Memory Pool: alloc, realloc, free,
    • Registration cache: register, deregister, find, release_address_range.
  • RCACHE - vma - two users: mpool/grdma and btl/vader.

    • tree_add, invalidate... literally just exposes a VM tree.
  • Nathan's proposal

    1. break the abstraction violation
    2. break apart the mpool and registration.
    • Now a few allocators: huge_page, default, gpusm, rgpusm.
    • rcaches: grdma, udreg (udreg == Cray/MPICH) - the base handles the mca memory system framework.
    • The free list now takes additional arguments: you can pass in an allocator and a pointer to a registration cache.
    • Now there is a memkind component - can now tell openib to use memkind, and
      • can expose this to the user; the user now picks what they want.
      • Since we don't have opal_info_t key=value, the mpool lookup takes a string: "huge_page" would get 2M pools.
        • Because info isn't there, it takes a comma-separated value, e.g. "huge_page,page_size=8M" (see the hint-string sketch after this list).
        • Inside the code we would set this hint, but components COULD expose controls.
      • May have a couple of different allocators. Cray supports multiple huge page sizes.
      • Consistent string format for what we could get from a user or a BTL. Want to keep it a string.
      • A pull request for this is open on Nathan's branch - openib can take a hint for the mpool.
    • Does it work for shared memory? No - different allocator.
    • GPU buffer registration DOES go through this new RCACHE.
      • Untested until nVidia gets
    • The original mpool was old code, and Nathan cleaned it up.
    • Can be put into Open MPI 2.1, since this is all internal, but it will require components to be rebuilt.
    • Discussion to Pull to Master sooner than later to get some testing.
    • Where is MEMKIND these days? - still under active development.
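Since there is no opal_info_t at that level, the hint is just a comma-separated string; a small, hypothetical sketch of splitting a "huge_page,page_size=8M" style hint into a pool name plus key=value options:

```c
/* Hypothetical sketch: split an mpool hint string of the form
 * "huge_page,page_size=8M" into the pool name and key=value options. */
#include <stdio.h>
#include <string.h>

static void parse_mpool_hint(const char *hint)
{
    char buf[128];
    char *token, *save = NULL;

    strncpy(buf, hint, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';

    token = strtok_r(buf, ",", &save);
    printf("pool name: %s\n", token ? token : "(none)");

    while (NULL != (token = strtok_r(NULL, ",", &save))) {
        char *eq = strchr(token, '=');
        if (eq) {
            *eq = '\0';
            printf("  option %s = %s\n", token, eq + 1);
        } else {
            printf("  flag %s\n", token);
        }
    }
}

int main(void)
{
    parse_mpool_hint("huge_page,page_size=8M");
    return 0;
}
```
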
  • Status of Comm_info_ code.

    • Split it into opal_info_t and ompi_info_t (which wraps the opal_info_t).
  • Intel's desire to use some of the frameworks directly rather than calling

    • There are beginning to be more projects that want to use OMPI, ORTE, or OPAL frameworks.
    • They don't want everything, but you have to pull the whole thing in, because there are init functions for each layer with a magical sequence of opening.
      • Not described anywhere; hand-coded in the init functions.
      • Others who just want one piece have to manually figure this out.
      • Wouldn't it be great if the frameworks KNEW what their dependencies were?
    • Would simplify opening just a few frameworks.
    • Nathan wanted to do this when frameworks were first introduced, but was shot down.
    • Someone will have to encode and check if it's already been opened.
    • Frameworks don't really depend on frameworks; components depend on frameworks.
    • Could reduce future bugs potentially.
    • It's a lot of book keeping. Moving the book keeping from orte_ess to every component.
    • Directly or indirectly? consensus seems Directly.
    • Alternatively do framework open in each component. Each component knows what it depends on due to header files.
      • At THAT time it does the open.
    • Can you do this with CPP magic based on what headers are #included?
      • No, need to look at all .c in component
    • Common messes this up because different shared code depends on different components.
      • May require the common components to explicitly open the frameworks each one needs.
    • MCA_Depend_On(...) - I can do this because I can open these, or they're already open, or detect a cycle (see the sketch after this list).
      • Have to do this in every component.
      • But just need to explicitly declare what YOU need.
    • Consensus we all like this, but no one wants to do it NOW.
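A rough sketch of the MCA_Depend_On(...) idea as discussed (nothing here is an existing Open MPI API; the helper below is a stub standing in for whatever the MCA base would provide, including the already-open and cycle checks):

```c
/* Hypothetical sketch of the MCA_Depend_On(...) idea from the discussion. */
#include <stdio.h>

/* Stub standing in for whatever the MCA base would provide: it would check
 * "already open?" and detect cycles before actually opening. */
static int mca_open_framework_by_name(const char *name)
{
    printf("opening framework %s\n", name);
    return 0;
}

/* A component declares which frameworks it depends on ... */
static const char *my_component_deps[] = { "rcache", "allocator", NULL };

/* ... and its open function walks that list before its own setup. */
static int my_component_open(void)
{
    for (const char **dep = my_component_deps; NULL != *dep; ++dep) {
        if (0 != mca_open_framework_by_name(*dep)) {
            fprintf(stderr, "cannot open framework %s\n", *dep);
            return -1;
        }
    }
    return 0;
}

int main(void)
{
    return my_component_open();
}
```
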
  • ACTION: Geoff will create an 'nm' test against the libMPI.so to detect bad exported symbols.

  • Still have issues with no test to catch component calling another component directly.

  • Edgar has an IO issue: OMPIO-specific symbols are used by fcoll components and fbtl components.

    • Has a structure of function pointers, so it's not directly linking, but calling through function pointers.
    • Either replicate the code.
    • Or add it to IO base, or add it to structure.
    • Or add an OMPIO section to common framework.
    • This was never an issue before because FCOLL and FBTL are only loaded when OMPIO is used.
    • Resolution - create an OMPIO-specific subcomponent in common, and link it directly to all of the fcoll components.
      • Then the fcoll / fbtl components can directly call the common/ompio stuff; they don't need to go through function pointers.
    • This is okay to go into 2.0.1 or 2.1 since it won't affect users.
  • make check discussion continuation from last night.

    • Easiest to have make check do a make install to somewhere under the build dir, and then run the check tests against that.
    • Jeff and George will talk. Last night Jeff opened an issue about singletons.
  • George has a change that improves performance of multi-thread!

    • George will send Nathan a patch.
    • And George will file a PR - we'll see if it can get into 2.0.0.
  • Status of MPI_THREAD_MULTIPLE?

    • Mellanox is running threaded test but it's running bind to core. Can catch some issues...
    • Should also test bound to socket, so more chance of threads stomping on each other.
    • IBM has some more actual THREAD_MULTIPLE tests. Trying to see if can get in.
    • Cisco compiles with enable-thread-multiple and runs 2 ways - normal MPI_INIT, and also a magical env var you can set for MPI_THREAD_MULTIPLE - so it runs the single-threaded tests with MPI_THREAD_MULTIPLE.
    • Sun tests pass now, Cisco will turn on sun tests for community testing.
      • May be a test in vader - George is seeing vader hang. Pass all tests, but has a sync collective app. Passes over SM, and OpenIB, but hangs in vader.
      • Nathan would like to fix for 2.0.0 - usually a deadlock / request never completes.
    • What do we say in the README - THREAD_MULTIPLE included, but only
      • Note: only TCP, SM, SELF are considered thread-multiple.
      • There is a deadlock in TCP BTL - George knows what this is.
    • Nathan hit OSC pt2pt hard with multiple threads.
      • Will commit new 1sided threaded tests.
    • Is MT performance better than in 1.10?
      • Yes - better, and with George's performance work, much better.
  • openib has a progress thread - but it gets bound to same core.

    • It would be nice to unbind it to either the socket or the NUMA domain.
    • Nathan does see UDCM retransmits just because the thread doesn't wake up often enough.
    • Nathan will run a test: go into opal_prog_thread, unbind progress threads to the NUMA domain, and test.
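A minimal sketch (Linux/GNU-specific, via pthread_setaffinity_np) of what widening a progress thread's binding looks like; a real fix would derive the socket or NUMA cpuset from hwloc rather than the placeholder core range used here:

```c
/* Sketch: widen a progress thread's affinity from a single core to a set of
 * cores.  The 0..7 range is a placeholder for the NUMA/socket cpuset that
 * hwloc would supply in real code.  Build with -pthread. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static void widen_thread_binding(pthread_t thread)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    for (int cpu = 0; cpu < 8; ++cpu) {   /* placeholder core range */
        CPU_SET(cpu, &set);
    }
    pthread_setaffinity_np(thread, sizeof(set), &set);
}

int main(void)
{
    widen_thread_binding(pthread_self());
    return 0;
}
```
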
  • OMPI-tests licensing might be a problem - private redistribution rights - all the tests in there came from the Internet.

    • OMPI is not redistributing this. Just using.
    • Downloaded - Stuff from internet, or wrote ourselves.
    • How does MPICH do this? - The MPICH developers just download stuff directly.
  • Async progress?

    • Howard added it to uGNI. Doesn't care anymore since it's all being done in the OFI MTL - libfabric.
      • You will need to set OPAL_USING_THREADS.
    • For networks that don't have MATCHING, there's no point in having a matching layer in OMPI.
    • George - if we turn on threading, then every BTL or MTL is free to do what it wants.
    • Have the progress thread out, plus registration so others can register hooks into the progress thread.
  • Nonblocking collectives?

    • Only if lower levels make callbacks into OMPI.
    • ACTION: Components ALL have to make the upcall back to OMPI to notify OMPI layer when
      • Standard tags upcall.