
Meeting Minutes 2016 08

Geoff Paulsen edited this page Aug 18, 2016 · 83 revisions

Attendees

  • All 3 days:
  1. Jeff Squyres, Cisco
  2. Howard Pritchard, LANL
  3. Geoffrey Paulsen, IBM
  4. Ralph Castain, Intel
  5. Josh Hursey, IBM
  6. Takahiro Kawashima, Fujitsu
  7. Shinji Sumimoto, Fujitsu
  8. Brian Barrett, Amazon Web Services
  9. Nathan Hjelm, LANL
  10. Mark Allen, IBM
  11. Edgar Gabriel, UHouston
  • Wednesday and Thursday Only:
  1. George Bosilca, UTK
  2. Sameh Sharkawi, IBM
  • Wednesday Only:
  1. Josh Ladd, Mellanox

Tuesday 9am

  • Annual git committer audit

    • Google spreadsheet
      • Asked everyone to update Wed afternoon.
    • Also update your Schedule A - "people who are approved by your organization to commit". We can provide the current schedule A if needed.
      • Discussed Wednesday Afternoon.
      • In shared google doc, Jeff, Howard, and Ralph all have access.
  • Discussion of Open MPI 2.0.1

    • Open Items -
      • PR 1318 - if ssh agent hasn't been given, then check for qrsh and friends.
        • OB1 striping across BTLs.
        • Omnipath - want to stripe across all BTLs, faster than vader.
      • PR 1321 - v2.x osc/pt2pt fixes
        • Was posting request from callback, which was going heavily recursive under load.
      • Issue 1966 - OSHMEM fails without pml/yalla.
        • Need to have members add OSHMEM testing to MTT. Ralph has it in his .ini, but he turns it off.
        • Looks like dynamic add_procs is broken in yalla?
      • Issue 1875 - Pepsc testing MPI_REQUEST_FREE issue.
      • PR 1306 - improve Linux CMA detection
        • Why was this pushed to 2.1.0? Adding detection. Agreed to have this in 2.0, but didn't get PRed.
        • Change in behavior. If glibc doesn't expose the call, we have to add the syscall numbers ourselves.
    • These fixes if they go in this week, on track for shipping next Tuesday.
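The CMA detection issue above (PR 1306) can be sketched roughly as follows. This is an illustrative sketch only, not the actual PR code; `cma_available` is a made-up name, and the fallback comment shows where hand-added syscall numbers would come in.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

/* Probe for Linux CMA (process_vm_readv) by copying a small buffer within
 * our own address space via a raw syscall.  On glibc versions that don't
 * define the SYS_ constant either, the per-architecture syscall number
 * would have to be added by hand -- the "add syscall #s" point above. */
static int cma_available(void)
{
#ifdef SYS_process_vm_readv
    char src[8] = "cmatest", dst[8] = {0};
    struct iovec local  = { .iov_base = dst, .iov_len = sizeof(dst) };
    struct iovec remote = { .iov_base = src, .iov_len = sizeof(src) };
    ssize_t rc = syscall(SYS_process_vm_readv, getpid(),
                         &local, 1, &remote, 1, 0UL);
    if (rc == (ssize_t)sizeof(src) && 0 == memcmp(src, dst, sizeof(src))) {
        return 1;   /* kernel supports CMA */
    }
    return 0;       /* ENOSYS (old kernel) or blocked by security policy */
#else
    return 0;       /* no SYS_ constant: would need hand-added syscall numbers */
#endif
}
```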
  • AFTER 2.0.1 - Merging ompi and ompi-release into a single repo

    • What to do with 2.1.0 PRs? - They are mostly stale.
    • currently 23 open PRs on ompi-release. Would replicate these in ompi repo. Leave ompi-release repo around for referencing.
  • Branching for 2.0.1, and 2.1.0 -

    • Right now we just have the v2.x branch.
    • Talking about 2.0.1. Also there almost certainly will be a 2.0.2 on v2.0.x branch.
    • Will create a v2.0.x branch (from 2.0.1 tag)
  • What to do about Master and v2.x getting out of sync?

    • Two issues - What do we call fork off of master? 2.2, or 3.0?
      • To answer this we need to UNDERSTAND differences on master.
    • If Master is meeting our compatibility requirement, we can do merge.
    • Will want PMIx 2.0 in 2.1.0 - for PMIx shared memory.
      • PMIX - just decided to put JUST shared memory in PMIX 2.0.0, put the rest in future PMIX versions.
      • PMIX 3.0 will need to be off OMPI master.
    • No one really wants to do the merge... so probably won't.
    • Another idea - do 'faster releases'
    • TIMELINE
      1. 2.0.1
      2. merge repos
      3. branch for 2.0.2
      4. 2.1.0
      5. fork or merge master for v2.2.x or v3.0.0? - Need to UNDERSTAND diff of master.
        • Are there things in master needed for Coral?
        • Debugger vendors held a meeting; going with PMIx for next-gen debuggers.
    • Timing - if we had predictable 6 month branches from master, could keep master from diverging too much.
      • More feature branches so other community members could get access to this.
    • Would we as a community like to get to place where can branch from master any day?
      • Problem with this is that the signal-to-noise ratio of Jenkins testing makes it difficult.
        • Anyone can add testing. Much of this is coming out of "allowing anyone to test for us".
      • MTT report - a lot of failures are cluster issues, not source issues.
      • Carrot and stick - Vendors have an interest in certain things working, but feel no pain if their cluster setup is failing.
      • Idea - Server side turn off vendor's failing jenkins due to cluster issues?
      • The further master diverges from the released branches, the more extra stress it causes:
        • results of testing are unclear.
        • when you submit a change, there is no feedback signal that you caused someone problems.
        • We don't test command-line options. Many bugs are combinations of command-line flags that are broken.
        • Automating portable tests is difficult.
    • Setting aside the logistics, do we want the ability to release from master at any time?
      • A lot less churn if we say "A commit to master is expected NOT to cause a break".
      • breaking isn't bad, but breaking without a plan is bad - Brian
    • Josh is willing to lead the discussion about future testing effort.
      • good to talk about what we want, and how do we get there.
      • everyone wants the large matrix tested, but each member doesn't have to cover it all.
    • Can Jenkins trigger post-commit instead of pre-commit?
      • Yes but then it causes others to have to go figure out who caused the issue, and ping them.
  • Migration to new cloud services update for website, database, etc.

    • DONE:
      • DNS:
        • All 6 domains transferred to Jeff's GoDaddy account
      • Web site:
        • Migrate www.open-mpi.org to HostGator
        • Install initial LetsEncrypt SSL certificates on www.open-mpi.org ($10 each, and they last 3 months)
        • Submit CSR to U Michigan for 3-year SSL certificates on www.open-mpi.org (thank you, U. Michigan!)
        • rsync web mirroring method shut down. Do we need website mirroring anymore?
          • May want mirroring of tarballs. Amazon might be able to help, move them to S3.
          • Stuttgart still runs their own mirrors, and they bought the .de domains, etc.
      • Mailing lists:
        • Migrate mailman lists to New Mexico Consortium (NMC). Also will put them behind SSL.
        • Freeze old mailing list archives, add to ompi-www git
        • Add old mailing list archives to mail-archive.com
        • Setup new mails to archive to mail-archive.com
      • Email
        • Setup 2 email legacy addresses: rhc@ and jjhursey@
      • Infrastructure
        • Nightly snapshot tarballs being created on RHC's machine and SCPed to www.open-mpi.org
      • Github push notification emails (i.e., "gitdub")
        • Converted Ruby gitdub to PHP
        • Works for all repos... except ompi-www (due to memory constraints)
      • Contribution agreements
    • Still to-do:
      • Spend time migrating Jenkins: IU -> Ralph's server. Jenkins is a Java process; it can run at HostGator.
      • Spend time migrating MTT: IU -> Ralph's server.
        • Amazon could help run a server to run this on here.
        • Amazon has 1 year of funding, for support of a 2x2 'small'.
      • What do we want to do after Ralph's server for Jenkins and MTT?
        • MTT: new server / CherryPy - doesn't work on HostGator (processes can run no longer than 90 sec)
        • Jenkins: Java
        • Amazon's T2 (never sleep the VM, don't oversubscribe many resources, just network and cpu).
        • needs a [email protected] - not archived for username / password.
      • Revamp / consolidate: ompi master:contrib/ -- there's currently 3 subdirs that should really be disambiguated and overlap removed. Perhaps name subdirs by the DNS name where they reside / operate?
        • infrastructure
        • build server
        • nightly
      • Spend time documenting where everything is / how it is setup - wiki page or google doc, etc.
      • Fix OMPI timeline page: https://www.open-mpi.org/software/ompi/versions/timeline.php
      • Possible umbrella non-profit organization
        • RHC is looking into how to do it.
        • Not too hard to do under US law. Problem is that it has to file quarterly; not hard, but criminal liability if it fails to.
          • Looked at framework organizations that serve open source projects.
          • SPI (http://spi-inc.org/projects/) contacted RHC back.
          • They do this at no cost, or meeting costs.
          • Ask if they'll consider, they have a meeting, and decide if we fit their interests.
          • then they invite, and we have to accept.
          • We would name a lead rep to their non-profit company, attend their board, but not a member.
          • Certain things come up that they need help with.
            • Could transfer domains to them, then they "own" them.
            • Soft assets like that (domain registrations) they will take, but they prefer NO hard assets.
            • Collect money for us; credit card fees are taken out, plus 5% overhead for them.
            • They could own hostgator account or SPI account.
          • They don't want source code or licensing, etc.
          • Let's decide in 1 month tops - send email to devel-core.
      • Update Open MPI contrib agreements
        • Possibly depending on whether we join an umbrella org
        • Probably should create a new contributions@ email address
        • Get rid of snail mail address.

Tuesday After Lunch

  • Travis - taking 6 hours to run for each PR. Is it worth it?
    • We could check pricing to see if we could get better turnaround, but we should take a hard look and see if it's valuable to keep doing. Perhaps it's not; Travis is TOO expensive price-wise.
    • Not clear we're getting any value out of it. Many Devs not even waiting on it.
      • We get Apple coverage, but perhaps can get that elsewhere?
  • MCA support as a separate package?
    • Now that we have multiple projects (PMIx) and others using MCA plugins, does it make sense to create a separate repo/package for MCA itself? Integrating MCA into these projects was modestly painful (e.g., identifying what other infrastructure - such as argv.h/c - needs to be included) - perhaps a more packaged solution will make it simpler.
    • Need to "tag" the component libraries with their project name as library confusion is becoming more prevalent as OMPI begins to utilize MCA-based packages such as PMIx.
    • What if there are different libraries? What if at different versions?
      • Need to apply GNU library versioning to them.
    • Interest in separating MCA out, as a separate library, since PMIx has renamed variables.
      • One thing that would be nice, is MCA naming scheme doesn't include project name.
      • PMIx is now using MCA, had to rename MCA symbols to pull it in. But still some work to complete fix of MCA project naming.
    • If you have 2 frameworks that are the same name, and two components of the same name, and the same variable.
      • Right now, if we implement this, it would provide 2 versions of the same value (PMIx and OPAL).
    • MCA parameters, scoping issues, setting locally or globally.
      • IBM wrote some code to propagate env vars. orthogonality to MCA_base_envlist and -x. There's no -gx.
        • This only works for ssh, not for SLURM or LSF, because those are system settings.
          • No way to propagate env vars to orteds, but after those are up, then mpirun can send these env vars to remote orteds (launch message, not PMIx).
        • if read env from file, it will read in MPI_Init.
        • srun --export
      • Would there be any value in having mpirun and the orteds be PMIx clients (they are), and using PMIx_Set/Get for the launch messages and env?
  • Plans for folding ompi-release Github repo back into ompi Github repo
  • (Done) Remove atomics from OBJ_RETAIN/OBJ_RELEASE in the THREAD_SINGLE case.
    • See https://github.com/open-mpi/ompi/issues/1902.
      • The part that Nysal is still looking at is if obj retain/release on intrinsics is worth doing for thread single, since atomic is needed for thread multiple.
    • In Jan last year, decided for 1.10 we didn't want this, but left it open, since we hadn't decided how to handle threads in 2.0. Nathan did the work to remove it for 1.10, and then disappeared for 2 months.
    • So cherry picked from 1.10 to master, and then back to 2.x.
    • Current situation is this is done in 1.10, master and 2.x
  • Continue --net mpirun CLI option discussion from Feb 2016 meeting
    • Originally an IBM proposal.
    • Tied to issues of "I just want to use network X" user intent, without needing to educate users on the complexities of PML, MTL, BTL, COLL, ...etc.
    • We didn't come to any firm conclusions in Feb.
    • Get rid of BTL/sm, but create an alias to vader.
    • BTL self should not be possible to compile out; it must be statically linked in. MCA component required.
      • removes flexibility in btl code. PR on master. coll/self, coll/basic is necessary in most cases.
      • IBM using it.
    • Refer to #20 in Feb minutes: https://github.com/open-mpi/ompi/wiki/Meeting-2016-02
      • MPI_T folks want to know what protocol each rank is using.
    • Some consensus: Do this in iterations.
      • Do this in MCA Base Framework.
      • Each MCA component provides a char ** of "tags"
      • Each MCA component provides a char ** of qualifier tags also, but all NULL at first iteration.
      • Each above are const strings using C99 initializers
      • Example tags discussed: infiniband, mxm, verbs, libfabric, shared-memory, xpmem, knem, cma, cico, mach, tcp, usnic, ethernet, tofu, gni, portals, ugni.
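The tag scheme sketched in the bullets above might look something like this in C. The struct and field names here are hypothetical illustrations, not the actual MCA base API.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical shape of the per-component tag lists: NULL-terminated
 * arrays of const strings, set up with C99 designated initializers. */
typedef struct {
    const char  *name;
    const char **tags;            /* e.g. transport names */
    const char **qualifier_tags;  /* all NULL in the first iteration */
} mca_tagged_component_t;

static const char *vader_tags[] = { "shared-memory", "xpmem", "knem", "cma", NULL };

static const mca_tagged_component_t vader_component = {
    .name           = "vader",
    .tags           = vader_tags,
    .qualifier_tags = NULL,
};

/* Return 1 if a component advertises the given tag (e.g. for --net). */
static int component_has_tag(const mca_tagged_component_t *c, const char *tag)
{
    for (const char **t = c->tags; NULL != t && NULL != *t; ++t) {
        if (0 == strcmp(*t, tag)) {
            return 1;
        }
    }
    return 0;
}
```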
  • MPI_Reduce_local - any reason not in coll framework?
    • No reason. IBM will PR that transition, and update the man pages also. Symbol versioning
  • Per request from Debian: https://github.com/open-mpi/ompi/pull/1955
    1. Symbol visibility. Per his PR, it looks like we're leaking a lot of symbols that do not need to be public. Also .so version numbering for MPI and OSHMEM libraries. We have made promises about this; we just need to honor those promises.
      • Yes we should clean up our visible symbols to be properly prefixed.
    2. Symbol versioning can fix the visibility problem, but that's really a side effect. We should just fix the visibility issue with proper use of DECLSPEC.
      • Per the MPI spec, we don't need to version the MPI API calls (because the MPI Forum will not break APIs)
      • Will it help to version things like MPI_Comm? (e.g., if we grow the size of the communicator struct)
      • We decided NO for now, but will go back to Fedora and other re-packagers and see if symbol versioning would HELP them a lot, and reconsider if they say yes.
    3. Combining libmpi, liborte, libopal:
      • With dynamic linking you just want to link against libmpi; with static linking you MUST have all libs on the link line.
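A minimal sketch of the DECLSPEC approach to the visibility problem: compile with -fvisibility=hidden and mark only the public API as exported. The macro and function names here are illustrative, not OMPI's actual declarations.

```c
/* With -fvisibility=hidden, everything defaults to hidden; only symbols
 * marked with the DECLSPEC-style macro are exported from the library. */
#if defined(__GNUC__)
#define EXAMPLE_DECLSPEC __attribute__((visibility("default")))
#else
#define EXAMPLE_DECLSPEC
#endif

/* Not marked: stays library-internal when built with -fvisibility=hidden. */
static int example_internal_helper(void)
{
    return 42;
}

/* Exported: part of the public API. */
EXAMPLE_DECLSPEC int MPI_Example_call(void)
{
    return example_internal_helper();
}
```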
  • Slack for developer communication?
    • Josh used it, and it's kinda nice.
    • George used it for a while, then fell off.
    • Community decided No; just use the devel email list.

Wednesday Afternoon

  • PMIx roadmap discussions
    • Roadmap changed in last few weeks.
    • 1.1.3 - 1/2016
    • 1.1.4 - 6/2016
    • 1.1.5 - 8/2016
    • 2.0.0 - 10/2016 - Shared memory data-store enabled - footprint benefit, and speed in high PPN.
      • Normally when you put data into PMIx and then do a PMIx_Fence, data gets pushed out to all Daemons.
      • If a process wants data from processes on other nodes, that's a message from each local process for every remote process.
      • Instead, processes will now check the shared memory store directly.
      • Portals, TCP with static, no modex.
      • Is the fence needed in Open MPI, now that we have dynamic add procs?
        • Today it's there by default, but MCA param to disable.
        • with sparse connectivity, this is a benefit, but with full connectivity, it's horrible.
      • If you set Direct Modex, we don't do a fence at all, just happens in the background.
      • More work to be done. Now we have the tools, but need to figure out how to use them.
    • 3.0.0 - Tool support, debugger support, data logging, cross-version support.
    • Launch scaling
      • Input requested, where are we now at scale?
      • Open MPI still needs work.
        • Pushing all info regardless - need to add logic to only push if needed.
        • Investigate need for RTE barriers - MPI_Init, MPI_Finalize.
        • Work with Networks to eliminate ModeX.
    • Advanced tool support
      • Debugger connect and initial spawn
        • Support for co-located daemon spawn/wireup.
      • Query proctable and stat info
        • any supported info can be requested.
      • Added partners
        • MRNet - TotalView depends on it.
      • MPI Debugger WG.
    • Why not standardize the key? Like hostname. Maybe.
    • Any info on how much info we exchange?
      • 2 processes, just TCP - 45KB. Wow!
    • PMIx - adopted a standards like protocols, changes come through RFCs, etc.
    • Is PMIx an Open MPI project? - No.
      • They have their own repos, but don't have a website. Only the Open MPI website for nightly tarballs.
      • Might start using the "releases" tab of PMIx github,
      • or GitHub Large File Storage.
    • Data logging
      • Stderr/stdout/stddiag/syslog
      • Datastores
    • Passes data to the Resource Manager (RM)
      • RM can log it.
      • RM can pass to indicated PMIx server
    • PMIx internal storage - currently a simple list; framework to support dstore libs.
    • Fault Tolerance
      • Notification
        • App can register for event notifications, incipient faults. RM-app negotiate to determine response
    • What comes after December?
      • Flexible Allocation Support (PMIx_Allocate)
        • Request additional resources (Grow or Shrink)
          • Compute, memory, network, NVM, etc.
      • I/O support
        • Anticipatory data fetch, staging
        • Advise time to complete
        • Notify upon available
      • Storage policy requests
        • Hot/warm/cold data movement
        • Desired locations, and striping/replication patterns.
        • Persistence of files, shared memory regions across jobs, sessions.
        • ACL to generated data across jobs, sessions.
      • Spawn support - PMIx_Spawn
        • Staging support
      • Network Integration
        • Query what my capabilities are, and request changes to that, based on what the resource manager allows.
      • Provide backward compatibility with PowerAPI.
      • Just dropped in the MCA subsystem, to allow folks to create frameworks and components.
  • How to help alleviate "drowning in CI data" syndrome?
    • One example: https://github.com/open-mpi/ompi/pull/1801
    • One suggestion: should we actively market for testers in the community to help wrangle this stuff?
    • If Jenkins detects an error, can we get Jenkins to retry the tests without the PR changes, and then compare the results to see if the PR itself is introducing a new error?
    • How do we stabilize Jenkins to alleviate all these false positives?
  • IBM gave demo on Spectrum MPI v10.1
  • Fujitsu gave status presentation.
    • Memory consumption evaluation
      • Ralph: occurred to me that solving the memory problem can be done in a much less intrusive way than previously thought. PMIx only stores pmix_kval_t objects - the pmix_info_t structure is only used for interfacing to the outside. Thus, we don't have to impact any of the RMs or applications that have implemented their integration. This is a good thing, as that would have been highly disruptive. Instead, we can just convert the pmix_kval_t object to use flexible string lengths.
    • MTT status
    • PMIx status

Friday 9am

  • Continue Fujitsu Presentation
    • PMIx usage in Fujitsu MPI with Tofu RM.
    • Discussion - which keys are mandatory, and which are optional?
      • PMIx meetings is the forum to discuss now.
      • pmix_common.h - many are optional, some are mandatory
      • Pmix wiki page
      • Now that next gen debuggers will use PMIx for setting up trees, it will need to read network specific keys for specific environments.
        • Going to start up a working group to discuss keys.
    • Discussion - naming conventions for RM/network specific keys?
      • PMIx asks that you don't start RM- or network-specific keys with PMIX_, since everyone else is required to understand those and possibly return an error if they don't handle them. Want "out of environment" PMIx applications to ignore keys they don't understand or need.
    • MTT running status. Target is running MTT on SPARC and publishing results. Needs fujitsu job scheduler to run jobs.
    1. run on x86 without fujitsu scheduler - complete.
    2. run on x86 cluster with the fujitsu job scheduler - complete.
    3. run on sparc cluster with the fujitsu job scheduler.
      • working now.
      • Cross-compiling of both the MPI library and MPI programs is a pain. Complete this week.
    4. Upload the test results - next.
      • may need help with relay for indirect internet access.
    • build and make check on all RC on SPARC - no issues in recent RCs.
    • Support of ARM and SPARC - CPU of post-K computer will be based on ARM v8 (aarch64)
      • can work with ARM specific stuff with community, but don't have enough machines for MTT at this time.
  • Revive btl/openib memalign hooks?
    • Unaligned memory is 30% worse (but glibc gets things reasonably right for hosts; exotic hardware has exotic alignment requirements).
    • Many offload RDMA performs best with 64byte aligned buffers.
    • glibc returns 16 byte aligned memory.
    • Users SHOULD align their own memory, but they almost never do.
    • Alignment adds memory overhead.
    • Platform-MPI and others only did this (returned aligned buffers) for buffers > some threshold (as eager buffers were preallocated as aligned).
    • Platform-MPI also did 'protocol' alignment for large pipelined buffers, so first chunk (2MB?) would be unaligned, but all future chunks WOULD be aligned.
      • Wouldn't be too much work to make the first unaligned chunk send via eager.
    • Nathan will look to do receiver side memory alignment protocol (sending the first unaligned part via eager). Looking at 128 Byte (hardcode).
    • Also would be nice to have a malloc wrapper framework.
      • it'd be nice to have a generalized malloc hook framework not just in OB1.
      • glibc now has a nice way to intercept malloc - could make a framework around that.
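The threshold behavior described above for Platform-MPI could be sketched like this. The 4 KB threshold and the function name are made up for illustration; the real cutoff would presumably track the eager limit.

```c
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>

#define ALIGN_BYTES     64      /* RDMA-friendly alignment */
#define ALIGN_THRESHOLD 4096    /* below this, plain malloc is fine */

/* Align only large allocations, mirroring the "only above a threshold"
 * behavior described above; small (eager-sized) buffers keep the
 * allocator's default 16-byte alignment. */
static void *mpi_aligned_malloc(size_t size)
{
    if (size < ALIGN_THRESHOLD) {
        return malloc(size);
    }
    void *p = NULL;
    if (0 != posix_memalign(&p, ALIGN_BYTES, size)) {
        return NULL;
    }
    return p;
}
```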
  • Request completion and callback Multithreaded safety.
    • Already in Master PR1325
    • If we return from the callback to the PML, then the next step is to set the request to complete.
    • Fix: if the callback to the PML returns 1, then do NOT modify the request.
      • If the callback sets it as complete, then it's complete.
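A rough sketch of the completion rule discussed above, with hypothetical names (this is not the actual PR 1325 code): if the PML callback returns 1, it has taken ownership of completion and the progress side must not touch the request again.

```c
#include <stddef.h>

/* Hypothetical request carrying a PML completion callback.  A nonzero
 * return from the callback means the callback took ownership of
 * completion, so the caller must NOT modify the request afterwards. */
typedef struct example_request {
    int complete;
    int (*pml_cb)(struct example_request *req);
} example_request_t;

static void example_progress_complete(example_request_t *req)
{
    if (NULL != req->pml_cb && req->pml_cb(req)) {
        return;             /* callback returned 1: hands off the request */
    }
    req->complete = 1;      /* otherwise mark completion here */
}
```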
  • Thread-safety design
    • Need some good multi-threaded performance tests (per Nathan and Artem discussion)
      • Do we need to write them ourselves?
    • In ompi-tests, there are some MT tests from Sun.
    • Edgar started to develop MT file-IO tests.
    • Mark Allen keeps hoping to get some committed back.
      • Mark will commit a couple of tests.
    • George can push what they developed for testing.
      • can bash different or same tests.
    • Review/define the path forward
      • Nysal - was working on some MPI_Request handling performance improvements.
      • Need OpenMP / MPI Hybrid applications.
        • Edgar said that he might know of some apps out there.
        • George said he's not sure.
          • Also can get his students to write some hybrid apps.
        • George is working on some research / papers.
      • How is endpoints going to affect us?
      • OpenMP Thread affinity - Need to play nicely
        • PMIx can be arbitrator for MPI and OpenMP binding, and can negotiate with RM.
      • non-OMP thread affinity
      • In general if we have different runtimes (like OMP), then there is collision.
        • Use PMIx for binding collision
      • Some ideas:
        • intercept pthread_create and assign binding then
        • OMPI extension to do thread binding specified on the command line.
        • Using PMIx to coordinate will take some time, to get buy-in from OpenMP vendors and into their runtimes.
      • 3 usecases:
        1. OpenMP Hybrid (all OpenMP on node) all MPI off node.
        2. pthreads threading + MPI communication.
        3. Users have their own threads, but they pause them all when they make calls into a math library (assuming the math library uses threads). So they always oversubscribe.
        4. A big mess.
      • Would be nice to begin working on problem.
      • Ralph will begin working on PMIx to coordinate threading between the resource manager, the OpenMP runtime, and the MPI runtime.
      • Open MPI could detect binding conflicts and throw errors.
      • Binding is hard, because we only do binding for the duration of the process, but threads can come and go or be oversubscribed.
  • Discuss appropriate default settings for openib BTL
    • Email thread on performance conflicts between RMA/openib and SM/Vader
    • This is striping across all available BTLs. Sometimes you want this, but don't want to do this on Mellanox.
    • This has been done for 2.0.1
  • What to do about MPI_Info PR from IBM / MPI Forum gyrations about MPI_Info?
    • Still can't do anything because this is stuck at the Forum.
    • Code is flexible enough that we can change it later.
    • some discussion if we should make an inconsistent
    • Should we return "hints" that the user gave us, or all "hints" being used by the system?
    • 2 parts of the PR - create an opal key/value, and then implement the MPI_ calls later after MPI_Forum has decided.
      • Decided to split the PR to just do internal plumbing NOW, and leave the MPI_ changes until later.
  • Performance Regression tracking
    • What do we want to track, and how are we going to do that.
    • https://github.com/open-mpi/ompi/issues/1831#issuecomment-229520276
    • https://github.com/open-mpi/mtt/issues/445
    • What to track?
      • Shared memory bandwidth and latency and message injection rate with both MPI_Init, and MPI_Thread_Init(MPI_THREAD_MULTIPLE)
        • These same tests with Network X
      • Want to know which day a commit ruined someone else's performance.
      • The only comparison we can do is Cluster A to Cluster A; can't compare Cluster A to Cluster B.
      • Would like a way to inject configuration metadata strings.
        • Like "today I installed firmware drop X"
      • Josh proposes a tag that cluster owner submits, and he will compare across only those same tags, and MAYBE compare to last tag.
        • might be nice for a human to be able to say "Human looked at this, and it's Good|Bad".
      • Just stick with pt2pt for now!
      • Each test outputs a line (message size on the X axis, and time on the Y axis).
      • Need to tag the benchmark as well; md5sum the benchmark.
  • Discussed Nathans new --net infrastructure piping PR
    • mca_component_enable|disable
      • internal only.
    • mca_filter_include - MCA gives a list of "tags", "infiniband"
    • Disable is easy, but enable is HARD because not all components provide a filter.
      • this is HIGH level, and nothing knows anything about the type.
      • Could fix this by having a filter TYPE.
      • Right now orteds don't use TCP BTLs, but this might break that.... need to do filtering carefully.
    • Problem with this approach is that it's removed before the component has any ability to object.
      • Right now it disables it, but could set it aside instead.
    • different clients want different components (for different reasons, MPI vs ORTEDs)
      • need to be careful that we don't box ourselves out of certain situations.
    • Top down approach where user gives possible flags
    • Some desire to not do too much logic later in the component, but want to capture the intent of why we're opening it.
    • Might want some desire to close and reopen a component later, so need to take that into consideration.

  • Talk about OSHMEM
  • Ralph offers to give presentation on "Flash Provisioning of Clusters", if folks are interested