
Conversation

@wujingyue (Collaborator) commented Oct 2, 2025

For #5289

github-actions bot commented Oct 2, 2025

Review updated until commit 89f3173

Description

  • Added stream-parallelized matmul support via loop sharding

  • Introduced shardByStream helper for tensor stream sharding

  • Implemented findStreamIterDomain to detect stream-parallelized domains

  • Refactored ForLoop creation with static createFromIterDomain


Changes walkthrough 📝

Relevant files

Enhancement

  • csrc/host_ir/host_ir.cpp: Add shardByStream and update ForLoop creation (+17/-1)
      • Added shardByStream function to create stream-sharded tensor views
      • Implemented conservative allocation domain setup using self-replay
      • Converted createForLoopFromIterDomain to static method of ForLoop

  • csrc/host_ir/lowering.cpp: Support stream-parallelized lowering with for-loops (+115/-5)
      • Added findStreamIterDomain to detect stream-parallelized loop domains
      • Enhanced lowerSegment to handle stream-parallelized expressions via for-loops
      • Implemented cloning and sharding of expressions in stream-parallel context
      • Added allocation and replacement logic for stream-sharded tensors

  • csrc/host_ir/host_ir.h: Declare shardByStream and ForLoop factory method (+6/-2)
      • Declared shardByStream helper function
      • Added static createFromIterDomain method to ForLoop
      • Maintained existing class structures and clone declarations

Tests

  • tests/cpp/test_stream.cpp: Add stream-parallelized matmul test (+54/-3)
      • Added StreamTest fixture with Host IR lowering enabled
      • Added Matmul test for stream-parallelized matmul correctness
      • Updated test structure to use fixture-based setup

Miscellaneous

  • csrc/host_ir/container.h: Suggest performance improvement for insertion (+3/-0)
      • Added comment suggesting linked list for faster insertion
      • Kept existing top-level expressions and executor structures

Dependencies

  • csrc/multidevice/multidevice.h: Include vector header (+2/-0)
      • Added vector include for future container use
      • No functional changes to multidevice declarations

  • csrc/transform_replay.h: Update includes in transform_replay.h (+2/-5)
      • Added unordered_map include
      • Removed unused algorithm, unordered_set, and vector includes

Bug fix

  • csrc/runtime/fusion_executor_cache.h: Fix warning typo in comment (+1/-1)
      • Fixed typo: "WARING" to "WARNING"
      • No other changes to fusion executor cache interface

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    🧪 PR contains tests
    ⚡ Recommended focus areas for review

    Possible Issue

    The function findStreamIterDomain(TensorView*) checks only the front of the loop domain for a stream IterDomain, based on the assumption that FinalizeMultideviceDomains places it there. However, this assumption may not always hold, leading to missed stream IterDomains if their position is not guaranteed. The code should validate this positioning assumption or search the entire loop domain.

    IterDomain* findStreamIterDomain(TensorView* tv) {
      const std::vector<IterDomain*>& loop = tv->getLoopDomain();
      // FinalizeMultideviceDomains pass puts the stream IterDomain to the
      // front.
      if (!loop.empty() && loop.front()->isStream()) {
        return loop.front();
      }
      return nullptr;
    }
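
    A minimal sketch of the whole-domain search suggested above, using the same getLoopDomain() and isStream() calls as the quoted code; the name and the function itself are illustrative, not code from this PR:

    // Hypothetical variant: scan the entire loop domain instead of relying on
    // FinalizeMultideviceDomains placing the stream IterDomain at the front.
    IterDomain* findStreamIterDomainAnywhere(TensorView* tv) {
      for (IterDomain* id : tv->getLoopDomain()) {
        if (id->isStream()) {
          return id;
        }
      }
      return nullptr;
    }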
    Performance

    The comment suggests that setting the allocation domain via setAllocationDomain(out->getLoopDomain(), false) in shardByStream is conservative and suboptimal. This may lead to inefficient memory layouts. Consider implementing the suggested algorithm from alias_analysis.cpp to make a more informed decision on contiguity.

    Maintainability

    There are two similar functions, findStreamIterDomain and getShardedIterDomain, searching for stream IterDomains in different domains (loop vs. allocation). This duplication increases maintenance cost. Consider unifying them into a single function with a domain type parameter as suggested in the comment.

    // Finds the stream-parallelized IterDomain in the loop domain of a TensorView,
    // or nullptr if not found.  This is different from `getShardedIterDomain(tv,
    // ParallelType::Stream)`, which searches the allocation domain.  Consider
    // unifying them into one function with an extra DomainType parameter.
    IterDomain* findStreamIterDomain(TensorView* tv) {
      const std::vector<IterDomain*>& loop = tv->getLoopDomain();
      // FinalizeMultideviceDomains pass puts the stream IterDomain to the
      // front.
      if (!loop.empty() && loop.front()->isStream()) {
        return loop.front();
      }
      return nullptr;
    }
    
    // Finds the stream IterDomain in the outputs of a segment.
    IterDomain* findStreamIterDomain(const std::vector<Val*>& outs) {
      for (auto* out : ir_utils::filterByType<TensorView>(outs)) {
        if (auto* stream_id = findStreamIterDomain(out)) {
          return stream_id;
        }
      }
      return nullptr;
    }
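
    As a rough sketch of the unification suggested in that comment, a single helper could take the domain to search as a parameter. The DomainType enum and the use of getMaybeAllocationDomain() below are assumptions for illustration only:

    enum class DomainType { Loop, Allocation };

    // Sketch: one lookup parameterized by which domain to search and which
    // parallel type to look for.
    IterDomain* getShardedIterDomain(
        TensorView* tv,
        ParallelType parallel_type,
        DomainType domain_type) {
      const std::vector<IterDomain*> domain = domain_type == DomainType::Loop
          ? tv->getLoopDomain()
          : tv->getMaybeAllocationDomain();
      for (IterDomain* id : domain) {
        if (id->getParallelType() == parallel_type) {
          return id;
        }
      }
      return nullptr;
    }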

    @wujingyue (Collaborator Author):

    !test


    NVFUSER_DECLARE_CLONE_AND_CREATE

    static ForLoop* createFromIterDomain(Val* index, IterDomain* iter_domain);
    Collaborator Author:

    I'm on the fence about this. The method is coupled with the ForLoop class so I moved here to save some typing. The downside is less access control because createFromIterDomain could access private fields/methods of ForLoop.
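
    For illustration, the free-function alternative discussed here would amount to keeping the original declaration outside the class; this sketch mirrors the signature above and is not code from the PR:

    // Sketch: a free function cannot reach ForLoop's private members, at the
    // cost of a longer name and an extra declaration to maintain.
    ForLoop* createForLoopFromIterDomain(Val* index, IterDomain* iter_domain);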

    @wujingyue (Collaborator Author):

    !test

    @wujingyue (Collaborator Author):

    !test

    wujingyue marked this pull request as ready for review October 8, 2025 01:02
    @nsarka (Member) left a comment:

    LGTM, I just had a minor question

    }

    std::vector<Val*> cloned_outs = ir_cloner.clone(group.outputs());
    // All expressions in the group are expected to be stream parallelized in
    Member:

    Do we enforce this constraint? If so is there an assertion somewhere?

    Collaborator Author:

    We don't, but we should. I'm waiting for an isResharding-like method to do that easily.

    @Priya2698 (Collaborator) left a comment:

    Apologies for the delayed review; it fell off my radar.
    I have left some initial questions.
    I am working on #5309, and hope to have a PR soon, which should unblock this PR.


    // Finds the stream IterDomain in the outputs of a segment.
    IterDomain* findStreamIterDomain(const std::vector<Val*>& outs) {
    for (auto* out : ir_utils::filterByType<TensorView>(outs)) {
    Collaborator:

    So we are finding the stream ID in any of the outputs of a segment? Why not use the above variation directly with any one of the segment outputs, since they must all have mapped stream IDs?

    Collaborator Author:

    Because I'm not sure about CPU-scalar TensorViews from composite ops. But I should probably harden the check to enforce that every TensorView has a Stream IterDomain. Wdyt?
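
    A rough sketch of that hardened check, assuming the NVF_ERROR macro and the helpers quoted earlier; illustrative only, not part of this PR:

    // Sketch: require every output TensorView to expose a stream IterDomain
    // instead of returning the first one found.
    IterDomain* findStreamIterDomain(const std::vector<Val*>& outs) {
      IterDomain* stream_id = nullptr;
      for (auto* out : ir_utils::filterByType<TensorView>(outs)) {
        IterDomain* id = findStreamIterDomain(out);
        NVF_ERROR(
            id != nullptr,
            "Expected a stream-parallelized IterDomain in ",
            out->toString());
        stream_id = id;
      }
      return stream_id;
    }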

    Collaborator:

    In their blackbox state, it does not look like we can currently support SDPA ops, for example. So adding an assert makes sense to signal that something is wrong. I guess this is something I need to fix in PropagateShardingsPass also.

    Collaborator Author:

    > In their blackbox state, it does not look like we can currently support SDPA ops, for example.

    Why not? At least, batch and/or head can be easily parallelized on stream without changing the implementation of the SDPA op, assuming ShardByStreams are added properly of course.
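
    For illustration only, the kind of scheduling this refers to could be as small as the following, where attn_out is a hypothetical SDPA output with a [batch, head, seq, head_dim] loop domain:

    // Hypothetical: stream-parallelize the batch axis of an SDPA output
    // without touching the SDPA implementation itself.
    attn_out->axis(0)->parallelize(ParallelType::Stream);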

    auto* out = ops::newValLike(in, *in->getDataType())->as<TensorView>();

    TransformReplay::selfReplay(in->domain(), out->domain());
    // This is conservative and suboptimal. Consider reusing the algorithm in
    Collaborator:

    Will this be resolved by #5316?

    Collaborator Author:

    No. It's one of the cases where out's contiguity ought to differ from in's due to the slicing effect.

    Collaborator:

    Oh okay, got it!
    So in such cases the replay may in fact overwrite a correct contiguity, since most users of selfReplay create the new TensorDomain via the ops API, which sets the contiguity correctly. This is something we should consider for #5316.

    @wujingyue (Collaborator Author):

    !test

    wujingyue merged commit 3260d70 into main Oct 19, 2025
    65 of 67 checks passed
    wujingyue deleted the wjy/matmul branch October 19, 2025 23:09
    tbqh pushed a commit that referenced this pull request Nov 12, 2025