GCC ICE: Segmentation Fault when building PyTorch/XLA #9589

@ysiraichi

🐛 Bug

GCC fails to compile PyTorch/XLA (commit 163193e on the master branch) and stops with an internal compiler error (ICE), apparently caused by a segmentation fault.

ERROR: external/xla/xla/service/spmd/shardy/stablehlo_round_trip/BUILD:44:11: Compiling xla/service/spmd/shardy/stablehlo_round_trip/export_ops.cc failed: (Exit 1): gcc failed: error executing CppCompile command (from target @@xla//xla/service/spmd/shardy/stablehlo_round_trip:export_ops) /usr/bin/gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG -ffunction-sections ... (remaining 128 arguments skipped)
  In file included from <command-line>:
  /usr/include/stdc-predef.h: In substitution of 'template<class _Functor, class, class> std::function<std::unique_ptr<mlir::Pass>()>::function(_Functor) [with _Functor = <missing>; <template-parameter-1-2> = <missing>; <template-parameter-1-3> = <missing>]':
  external/xla/xla/service/spmd/shardy/stablehlo_round_trip/export_ops.cc:249:53:   required from here
  /usr/include/stdc-predef.h:32:70: internal compiler error: Segmentation fault
     32 |    whether the overall intent is to support these features; otherwise,
        |                                                                      ^
  0x7e2852d42ddf ???
        ./signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
  0x7e2852d2dd79 __libc_start_main
        ../csu/libc-start.c:308
  Please submit a full bug report,
  with preprocessed source if appropriate.
  Please include the complete backtrace with any bug report.
  See <file:///usr/share/doc/gcc-10/README.Bugs> for instructions.
  [17,006 / 19,133] Compiling xla/mlir_hlo/mhlo/IR/hlo_ops.cc; 99s local ... (24 actions, 23 running)
  Target //:_XLAC.so failed to build
  Use --verbose_failures to see the command lines of failed build steps.
  INFO: Elapsed time: 697.676s, Critical Path: 175.00s
  INFO: 17030 processes: 10181 internal, 6849 local.
  ERROR: Build did NOT complete successfully
  error: command '/usr/local/bin/bazel' failed with exit code 1
  error: subprocess-exited-with-error
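
For triage, the two locations that matter in a log like the one above can be pulled out with a short script. This is only a sketch: the function name `find_ice` and the regular expressions are illustrative assumptions based on the line shapes visible in this log, not part of GCC, Bazel, or PyTorch/XLA. Note that the location GCC reports (`/usr/include/stdc-predef.h:32:70`) appears to be a comment line, so the "required from here" entry is likely the more useful pointer.

```python
import re

# Assumed shape: "<file>:<line>:<col>: internal compiler error: <message>"
ICE_RE = re.compile(
    r"^\s*(?P<file>[^\s:]+):(?P<line>\d+):(?P<col>\d+): "
    r"internal compiler error: (?P<message>.+)$"
)
# Assumed shape: "<file>:<line>:<col>:   required from here"
REQUIRED_RE = re.compile(
    r"^\s*(?P<file>[^\s:]+):(?P<line>\d+):(?P<col>\d+):\s+required from here$"
)


def find_ice(log_text):
    """Return details of the first ICE in log_text, or None if there is none.

    'required_from' holds the most recent "required from here" location seen
    before the ICE line (or None if no such line preceded it).
    """
    required_from = None
    for raw in log_text.splitlines():
        m = REQUIRED_RE.match(raw)
        if m:
            required_from = (m["file"], int(m["line"]), int(m["col"]))
            continue
        m = ICE_RE.match(raw)
        if m:
            return {
                "reported_at": (m["file"], int(m["line"]), int(m["col"])),
                "message": m["message"].strip(),
                "required_from": required_from,
            }
    return None


# Two lines copied from the log in this issue.
SAMPLE = """\
  external/xla/xla/service/spmd/shardy/stablehlo_round_trip/export_ops.cc:249:53:   required from here
  /usr/include/stdc-predef.h:32:70: internal compiler error: Segmentation fault
"""

if __name__ == "__main__":
    print(find_ice(SAMPLE))
```

Run against the full build output, this should point at `export_ops.cc:249:53`, where the `std::function<std::unique_ptr<mlir::Pass>()>` constructor is being instantiated according to the log.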

While I'm not sure exactly what this is, it could be related to this gcc-10 bug.

Setup

Same image used in this CI run: https://github.com/pytorch/xla/actions/runs/17246611174/job/48937880278?pr=9588
Docker image: us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/development:3.12_tpuvm
Docker image sha256: c194788bb5ea6d76806371a80c59c08f33c5ed0186a88e4e65a54245cf0a9014
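
To reproduce against exactly this image rather than whatever the `3.12_tpuvm` tag points to later, the repository and digest above can be combined into a digest-pinned reference. A minimal sketch, assuming the sha256 listed here is the registry manifest digest (if it is a local image ID instead, this reference will not resolve); the helper name `pinned_image_ref` is hypothetical.

```python
import re


def pinned_image_ref(repository, digest_hex):
    """Build a digest-pinned reference of the form '<repository>@sha256:<hex>'."""
    if not re.fullmatch(r"[0-9a-f]{64}", digest_hex):
        raise ValueError("expected a 64-character lowercase hex sha256 digest")
    return f"{repository}@sha256:{digest_hex}"


# Values copied from the Setup section of this issue.
REPOSITORY = "us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/development"
DIGEST = "c194788bb5ea6d76806371a80c59c08f33c5ed0186a88e4e65a54245cf0a9014"

if __name__ == "__main__":
    # The printed reference can be passed to `docker pull`.
    print(pinned_image_ref(REPOSITORY, DIGEST))
```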

Additional Context

Metadata

Labels

bug, build