
Conversation

@samnordmann
Collaborator

Improve printing of HostIrContainer by printing the index computations which are not explicitly part of the topLevelExprs.
Example from #5259

%HostIrContainer { (T0_g___bfloat[ideviceIdx.x0{8}, iS1{128}, iS2{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), T1_g___bfloat[iS3{1024}, iS4{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), T2_g___bfloat[iS5{1024}] (DeviceMesh{0 1 2 3 4 5 6 7})) -> (T3_g___bfloat[istreamIdx6{8}, iS7{128}, iS8{1024}, rS9{1024}] (DeviceMesh{0 1 2 3 4 5 6 7})) :
  T4_g___bfloat[istreamIdx10{8}, iS11{128}, iS12{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}) = ALLOCATE(buffer=T4_g___bfloat[istreamIdx10{8}, iS11{128}, iS12{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), mem_type=global, size=1048576, zero_init=false, resets_to_zero=false)
  T3_g___bfloat[istreamIdx6{8}, iS7{128}, iS8{1024}, rS9{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}) = ALLOCATE(buffer=T3_g___bfloat[istreamIdx6{8}, iS7{128}, iS8{1024}, rS9{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), mem_type=global, size=1048576, zero_init=false, resets_to_zero=false)
  GetCurrentStream into Stream 0
  FOR streamIdx in istreamIdx10{8}:
    SetCurrentStream to Stream ( streamIdx % numberOfStreams )
    Synchronize Stream 0
  FOR streamIdx in istreamIdx10{8}:
    SetCurrentStream to Stream ( streamIdx % numberOfStreams )
    T6_l___bfloat[iS15{128}, iS16{1024}] (DeviceMesh{0 1 2 3 4 5 6 7})
       = HirAliasSelect( T4_g___bfloat[istreamIdx10{8}, iS11{128}, iS12{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), axis = istreamIdx10{8}, index = i84 )
    IF Manual ( ( ( 8 + ( rank - streamIdx ) ) % 8 ) == rank ):
      T5_l___bfloat[iS13{128}, iS14{1024}] (DeviceMesh{0 1 2 3 4 5 6 7})
         = HirAliasSelect( T0_g___bfloat[ideviceIdx.x0{8}, iS1{128}, iS2{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), axis = ideviceIdx.x0{8}, index = 0 )
      T6_l___bfloat[iS15{128}, iS16{1024}] (DeviceMesh{0 1 2 3 4 5 6 7})
         = Set( T5_l___bfloat[iS13{128}, iS14{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), cache_op=Streaming )
    ELSE:
      ShareMemHandles(P2PCommunication 37 (type=recv, buffer=T6_l___bfloat[iS15{128}, iS16{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), peer=i84, backend=CUDA), P2PCommunication 38 (type=send, buffer=T0_g___bfloat[ideviceIdx.x0{8}, iS1{128}, iS2{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), peer=i90, backend=CUDA),
      P2PCommunication 38 (type=send, buffer=T0_g___bfloat[ideviceIdx.x0{8}, iS1{128}, iS2{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), peer=i90, backend=CUDA)
      P2PCommunication 37 (type=recv, buffer=T6_l___bfloat[iS15{128}, iS16{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), peer=i84, backend=CUDA)
      Wait Communication 38
      Wait Communication 37
    T7_l___bfloat[iS17{128}, iS18{1024}, rS19{1024}] (DeviceMesh{0 1 2 3 4 5 6 7})
       = HirAliasSelect( T3_g___bfloat[istreamIdx6{8}, iS7{128}, iS8{1024}, rS9{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), axis = istreamIdx6{8}, index = i84 )
    T7_l___bfloat[iS17{128}, iS18{1024}, rS19{1024}] (DeviceMesh{0 1 2 3 4 5 6 7})
       = linear(T6_l___bfloat[iS15{128}, iS16{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}),
                T1_g___bfloat[iS3{1024}, iS4{1024}] (DeviceMesh{0 1 2 3 4 5 6 7})      ,
          T2_g___bfloat[iS5{1024}] (DeviceMesh{0 1 2 3 4 5 6 7})      )
    SetCurrentStream to Stream 0
    Synchronize Stream ( streamIdx % numberOfStreams )
} // %HostIrContainer

Index definitions:
  i111 = streamIdx % numberOfStreams;
  i90 = i88 % 8;
  i32 = i30 * 1024;
  i30 = 8 * 128;
  i86 = rank - streamIdx;
  i82 = rank + streamIdx;
  i74 = 8 * 128;
  i76 = i74 * 1024;
  i84 = i82 % 8;
  i88 = 8 + i86;
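
To make the printed "Index definitions" block easier to read, here is how the chain resolves for the peers referenced by the P2P communications above (peer=i84 on the recv, peer=i90 on the send). This is a standalone illustrative sketch, not code from the PR; the function names are made up:

  #include <cstdint>

  // Closed form of i82 = rank + streamIdx; i84 = i82 % 8
  // (the recv peer in P2PCommunication 37), assuming 8 devices.
  int64_t recvPeer(int64_t rank, int64_t streamIdx) {
    return (rank + streamIdx) % 8;
  }

  // Closed form of i86 = rank - streamIdx; i88 = 8 + i86; i90 = i88 % 8
  // (the send peer in P2PCommunication 38).
  int64_t sendPeer(int64_t rank, int64_t streamIdx) {
    return (8 + (rank - streamIdx)) % 8;
  }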

@github-actions

github-actions bot commented Oct 6, 2025

Review updated until commit 8f4be43

Description

  • Print index definitions in HostIrContainer print output

  • Add conditional debug printing for index values

  • Improve debug visibility of scalar index computations

  • Enhance IR debugging with structured index info


Changes walkthrough 📝

Relevant files

Enhancement: printer.cpp (csrc/ir/printer.cpp)
Print index definitions in HostIrContainer

  • Added printing of index definitions in HostIrContainer
  • Only prints when debug option 'indices' is enabled
  • Filters for scalar index-type values with definitions
  • Increases indentation for better output structure
  • +14/-0

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    🧪 No relevant tests
    ⚡ Recommended focus areas for review

    Debug Output Control

    The debug print logic for index definitions is gated by a debug dump argument, but there is no clear indication of how this affects existing logging behavior or whether it could produce excessive output in certain configurations.

    // Print the definitions of the indices that are used in the host_ir_container
    if (hasDebugDumpArgument(DebugDumpOption::HostIr, "indices")) {
      os() << "Index definitions:\n";
      indent_size_++;
      for (Val* val : host_ir_container->vals()) {
        if (val->isScalar() && val->definition() != nullptr &&
            val->dtype() == DataType::Index) {
          os() << val->definition()->toString(indent_size_);
        }
      }
      indent_size_--;
      os() << "\n";
    }
    Val Filtering Logic

    The filtering of Vals to print index definitions relies on type checks and definition presence, but does not verify if the values are actually used in the host IR container, potentially leading to irrelevant or redundant output.

    for (Val* val : host_ir_container->vals()) {
      if (val->isScalar() && val->definition() != nullptr &&
          val->dtype() == DataType::Index) {
        os() << val->definition()->toString(indent_size_);
      }
    }
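
    One way to address this observation, sketched here with toy types rather than the nvFuser IR (Def, the worklist, and the value names below are all invented for illustration): walk back from the index values the expressions actually reference and print only the definitions reachable from them.

    #include <iostream>
    #include <string>
    #include <unordered_map>
    #include <unordered_set>
    #include <vector>

    // Toy stand-in for an IR value's definition: printable text plus the
    // values it reads.
    struct Def {
      std::string text;
      std::vector<std::string> inputs;
    };

    int main() {
      std::unordered_map<std::string, Def> defs = {
          {"i82", {"i82 = rank + streamIdx", {"rank", "streamIdx"}}},
          {"i84", {"i84 = i82 % 8", {"i82"}}},
          {"i30", {"i30 = 8 * 128", {}}},  // defined but never referenced below
      };
      // Index values referenced by the (toy) expressions, e.g. peer=i84.
      std::vector<std::string> worklist = {"i84"};

      // Transitive walk: collects i84 and i82, but not the unused i30.
      std::unordered_set<std::string> reachable;
      while (!worklist.empty()) {
        std::string v = worklist.back();
        worklist.pop_back();
        if (!reachable.insert(v).second) {
          continue;
        }
        if (auto it = defs.find(v); it != defs.end()) {
          for (const std::string& in : it->second.inputs) {
            worklist.push_back(in);
          }
        }
      }

      for (const auto& [name, def] : defs) {
        if (reachable.count(name)) {
          std::cout << def.text << "\n";
        }
      }
      return 0;
    }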

    @samnordmann
    Collaborator Author

    !test

    @wujingyue
    Collaborator

    which are not explicitly part of the topLevelExprs

    Can you remind me why they aren't part of topLevelExprs? Analogously, imagine you write a C++ loop

    for (int i = 0; i < 10; i++) {
      int j = i * 2;
      a[j] = ...
    }
    

    j is in the same loop scope as a[j] = ... even though it's merely a scalar operation.

    @samnordmann
    Collaborator Author

    which are not explicitly part of the topLevelExprs

    Can you remind me why they aren't part of topLevelExprs? Analogously, imagine you write a C++ loop

    for (int i = 0; i < 10; i++) {
      int j = i * 2;
      a[j] = ...
    }
    

    j is in the same loop scope as a[j] = ... even though it's merely a scalar operation.

    I am not sure I understand your comment correctly. Index computations are not part of topLevelExprs simply because they don't need to be. When we call ExpressionEvaluator::evaluate, the evaluation might involve some computation, which is done by the ExpressionEvaluator at runtime, but that computation does not explicitly appear in the HostIrContainer's top_level_exprs_.

    The example you wrote looks good to me, but I am not sure what you are suggesting with it.

    The example I provided in the PR description illustrates the use case for the present patch well.
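
    To illustrate the point about ExpressionEvaluator::evaluate with a self-contained toy (none of this is nvFuser code; Evaluator, bound, and defs are invented names): the index is produced on demand by recursively evaluating its definition, so the arithmetic never has to appear in a top-level expression list.

    #include <functional>
    #include <iostream>
    #include <string>
    #include <unordered_map>

    // Toy evaluator: a value is either bound directly (rank, streamIdx) or
    // computed from its definition when someone asks for it.
    struct Evaluator {
      std::unordered_map<std::string, int> bound;
      std::unordered_map<std::string, std::function<int(Evaluator&)>> defs;

      int evaluate(const std::string& name) {
        if (auto it = bound.find(name); it != bound.end()) {
          return it->second;
        }
        return defs.at(name)(*this);  // recursively evaluates dependencies
      }
    };

    int main() {
      Evaluator ev;
      ev.bound["rank"] = 3;
      ev.bound["streamIdx"] = 6;
      // Mirrors the printed chain: i82 = rank + streamIdx; i84 = i82 % 8.
      ev.defs["i82"] = [](Evaluator& e) {
        return e.evaluate("rank") + e.evaluate("streamIdx");
      };
      ev.defs["i84"] = [](Evaluator& e) { return e.evaluate("i82") % 8; };
      std::cout << "i84 = " << ev.evaluate("i84") << "\n";  // prints i84 = 1
      return 0;
    }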

    @wujingyue
    Collaborator

    just because they don't need to

    That's fair enough and LGTM. I recall that for MultiDeviceExecutor we also have to find, and ExpressionEvaluator::invalidate, the index calculations that depend on the loop index so they can get different values in different iterations.

    When Hanlin worked on host IR JIT, we realized that finding what indices to invalidate at "run" time creates problems for host latency. So, for the FusionExecutorCache integration, I let host IR lowering find these index calculations at "compile" time and put them in the scope of the for loop. This is done on a separate code path, so it doesn't affect MultiDeviceExecutor. I didn't get a chance to check with you -- hence my question earlier.
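
    For readers following along, a toy contrast of the two strategies being discussed (plain C++, not nvFuser code; kNumStreams and cached_peer are invented names): strategy A memoizes an index and must invalidate it on every iteration because it depends on the loop variable, while strategy B emits the computation inside the loop scope so nothing needs to be invalidated.

    #include <cstdio>
    #include <optional>

    int main() {
      constexpr int kNumStreams = 3;

      // Strategy A ("run" time): a memoized, loop-dependent index has to be
      // invalidated before each iteration so it gets recomputed.
      std::optional<int> cached_peer;
      for (int streamIdx = 0; streamIdx < 8; ++streamIdx) {
        cached_peer.reset();  // the invalidate step
        if (!cached_peer.has_value()) {
          cached_peer = streamIdx % kNumStreams;
        }
        std::printf("A: iteration %d uses %d\n", streamIdx, *cached_peer);
      }

      // Strategy B ("compile" time): the index computation lives in the loop
      // scope, so every iteration recomputes it and no invalidation is needed.
      for (int streamIdx = 0; streamIdx < 8; ++streamIdx) {
        const int peer = streamIdx % kNumStreams;
        std::printf("B: iteration %d uses %d\n", streamIdx, peer);
      }
      return 0;
    }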

    @samnordmann
    Collaborator Author

    just because they don't need to

    That's fair enough and LGTM. I recall that for MultiDeviceExecutor we also have to find, and ExpressionEvaluator::invalidate, the index calculations that depend on the loop index so they can get different values in different iterations.

    When Hanlin worked on host IR JIT, we realized that finding what indices to invalidate at "run" time creates problems for host latency. So, for the FusionExecutorCache integration, I let host IR lowering find these index calculations at "compile" time and put them in the scope of the for loop. This is done on a separate code path, so it doesn't affect MultiDeviceExecutor. I didn't get a chance to check with you -- hence my question earlier.

    Ok, now I understand. I like the idea of what was done for host IR JIT. Do you have a pointer to the PR? It would make sense to do the same in MultiDeviceExecutor.

    @samnordmann
    Collaborator Author

    !test

    @samnordmann samnordmann merged commit ef5a717 into main Oct 8, 2025
    64 of 65 checks passed
    @samnordmann samnordmann deleted the host_ir_print_index branch October 8, 2025 14:12