Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Defining C++ repos with module 'use_repo_rule' impacts windows (msvc/clang-cl) compile speed by up to 5x - tildes? #22865

Open
peakschris opened this issue Jun 23, 2024 · 18 comments
Labels
team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. team-Rules-CPP Issues for C++ rules type: bug untriaged

Comments

@peakschris
Copy link

Description of the bug:

We have had a curious regression in compile speed for the last 3 weeks. Whilst previously our compiles loaded 40 local cpus to 100%, after the regression cpus were only working at ~25% and compile times increased by 5x (3h to 15h)

I have spent ages trying to figure this out. Tried multiple windows machines, msvc and clang-cl, profiled the build, tested the filesystem, tried a ramdisk, multiple bazel versions...

I've finally narrowed it down to a reproducible example. It happened when I converted all our C++ external repos from workspace to module via use_repo_rule. I think it may be due to the tilde's (~) in the include paths. Perhaps it's an MSVC bug, but it also impacts clang-cl builds which is curious. When I run the Maybe it is a windows bug?

I can't share the code, but can share what we changed and the performance characteristics. This is a tiny example of low level code so only shows a 40% degradation. With 50 repos in lower level code the impact is much larger.

With workspace repo (30s)
image
The flat peak is the time when the compiles are all happening and cpus are loaded 100%

With one repo defined with use_repo_rule in MODULE.bazel (44s):
image
The CPUs never load at 100%

Build file, same in before and after:

cc_import(
    name = "OurTools",
    interface_library = "ourtools.lib",
    shared_library = "ourtools.dll",
    deps = [
        ":headers",
    ],
)

cc_library(
    name = "headers",
    hdrs = glob([
            "include/**/*.h", 
            "include/**/*.hpp", 
            "include/**/*.c"]),
    includes = [
            "include",
            "include/sub1/common",
            "include/sub2/include"],
)

Before: OurTools declared with a rule that calls download_and_extract in WORKSPACE

load("@our_toolchain//:tools/artifact_rules.bzl", "our_artifact_with_plat")
def define_OurTools_repo():
    our_artifact_with_plat(
        name = "OurTools",
        version = "2.1.0",
        build_file = "//src/repo/bazel/toolbox/build_files/wntx64:OurTools_BUILD.bazel",
        sha256 = "021fb717579ec6c3cc2f9eb8bacd8285801ac116e827ca7f430c0960fd0a2378",
    )

After: this is in MODULE.bazel:

our_artifact_with_plat = use_repo_rule("@our_toolchain//:tools/artifact_rules.bzl", "our_artifact_with_plat")
our_artifact_with_plat(
    name = "OurTools",
    version = "2.1.0",
    build_file = "//src/repo/bazel/toolbox/build_files/wntx64:OurTools_BUILD.bazel",
    sha256 = "021fb717579ec6c3cc2f9eb8bacd8285801ac116e827ca7f430c0960fd0a2378",
)

The reason this may cause a compile time difference is the include path. We use params files, the relevant differences are:

Workspace

/Iexternal/OurTools
/Ibazel-out/x64_windows-opt/bin/external/OurTools
/Iexternal/OurTools/include
/Ibazel-out/x64_windows-opt/bin/external/OurTools/include
/Iexternal/OurTools/include/icu/common
/Ibazel-out/x64_windows-opt/bin/external/OurTools/include/sub1/common
/Iexternal/OurTools/include/xalanc/include
/Ibazel-out/x64_windows-opt/bin/external/OurTools/include/sub2/include

Module & use_repo_rule:

/Iexternal/_main~_repo_rules~OurTools
/Ibazel-out/x64_windows-opt/bin/external/_main~_repo_rules~OurTools
/Iexternal/_main~_repo_rules~OurTools/include
/Ibazel-out/x64_windows-opt/bin/external/_main~_repo_rules~OurTools/include
/Iexternal/_main~_repo_rules~OurTools/include/icu/common
/Ibazel-out/x64_windows-opt/bin/external/_main~_repo_rules~OurTools/include/sub1/common
/Iexternal/_main~_repo_rules~OurTools/include/xalanc/include
/Ibazel-out/x64_windows-opt/bin/external/_main~_repo_rules~OurTools/include/sub2/include

Which category does this issue belong to?

C++ Rules, External Dependency

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

I can't share my reproducer as it is proprietary.

Which operating system are you running Bazel on?

Windows 10

What is the output of bazel info release?

7.2.0

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse HEAD ?

No response

If this is a regression, please try to identify the Bazel commit where the bug was introduced with bazelisk --bisect.

This is not a regression

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

@github-actions github-actions bot added team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. team-Rules-CPP Issues for C++ rules labels Jun 23, 2024
@fmeum
Copy link
Collaborator

fmeum commented Jun 23, 2024

Could you share a Starlark profile of the builds? If you are worried about leaking target names, sharing a redacted screenshot of each would already be very helpful.

@peakschris
Copy link
Author

peakschris commented Jun 23, 2024

I was wondering if it is due to the tildes ~ which are a special character in windows 8.3. filenames. I've disabled 8.3 filenames using fsutil 8dot3name set 3 but we have no way of knowing if any special code still executes in the windows kernel when it comes across a request for a file containing ~

@fmeum happy to provide a redacted profile, but starlark_cpu_profile is disabled on windows (#13748). Is there another way I can provide some data?

Another thought; would it be hard to make me a trial bazel build using some other character that is not ~ - e.g a x? I assume the character is not hardcoded in many places?

@peakschris
Copy link
Author

We have some new_local_repositories defined inside module extensions that appear to end up with a simple name on the filesystem and don't trigger this issue. Are there other ways to get simple names besides new_local_repository?

@peakschris
Copy link
Author

Here is some analysis from Windows Performance Analyzer, I used it to capture each build. I think this shows that the extra time is spent in the compiler or OS, and not bazel. I think it is bazel's directory names that are the issue though ;-)

Workspace:
image

Module:
image

You can see that "parsing" time with module include path increases from 7s to 20s

You can also see the difference reflected in C1DLL stage of the compiler:
Workspace
image

Module
image

@fmeum
Copy link
Collaborator

fmeum commented Jun 24, 2024

@peakschris I created this branch based on 7.2.1rc2 that uses . as a separator instead of ~. Can you build Bazel from that branch and test it? I don't have access to a Windows machine to compile a binary for you right now.

cc @meteorcloudy

@peakschris
Copy link
Author

Thank you! This isn't working:

D:\udu\units\cb_0523_bazel2>bazel-mod-dot build //src/server/soa_framework/common/cpp/...
INFO: repository @@our_toolchain.' used the following cache hits instead of downloading the corresponding file.
 * Hash 'b5d45176a23aaa3173d2b6560df57283490e6756253ed4fb11b3855c1178eae1' for <internal-url-redacted>
If the definition of 'repository @@our_toolchain.' was updated, verify that the hashes were also updated.
ERROR: <builtin>: fetching http_archive rule //:our_toolchain.: java.io.IOException: http_archive rule //:our_toolchain. must create a directory
ERROR: Error computing the main repository mapping: error during computation of main repo mapping: http_archive rule //:our_toolchain. must create a directory

I wonder if windows has some oddities around trailing dots in directory names:

D:\workdir>mkdir "our_toolchain."

D:\workdir>dir our_toolchain*
 Volume in drive D is Data
 Volume Serial Number is 982B-CBB1

 Directory of D:\workdir

06/24/2024  06:18 AM    <DIR>          our_toolchain

Dots in the middle seem fine, but dots at the end cause issues:

D:\workdir>mkdir "_main.our_toolchain"
D:\workdir>dir _main*
 Volume in drive D is Data
 Volume Serial Number is 982B-CBB1

 Directory of D:\workdir

06/24/2024  06:20 AM    <DIR>          _main.our_toolchain
               0 File(s)              0 bytes
               1 Dir(s)  161,841,446,912 bytes free

D:\workdir>mkdir "_main.our_toolchain."
A subdirectory or file _main.our_toolchain. already exists.

@peakschris
Copy link
Author

"+" seems to work on windows anywhere in filename without quoting in shell:

D:\workdir>mkdir _main+our_toolchain+

D:\workdir>dir _main+our_toolchain+
 Volume in drive D is Data
 Volume Serial Number is 982B-CBB1

 Directory of D:\workdir\_main+our_toolchain+

06/24/2024  06:31 AM    <DIR>          .
06/24/2024  06:31 AM    <DIR>          ..
               0 File(s)              0 bytes
               2 Dir(s)  161,841,442,816 bytes free

D:\workdir>dir _main*
 Volume in drive D is Data
 Volume Serial Number is 982B-CBB1

 Directory of D:\workdir

06/24/2024  06:31 AM    <DIR>          _main+our_toolchain+

@peakschris
Copy link
Author

I will make a local change to + and retry, now I know where the code changes should go

@fmeum
Copy link
Collaborator

fmeum commented Jun 24, 2024

+ looks promising. @Wyverald Do you happen to remember whether we explicitly decided against + at some point?

@peakschris
Copy link
Author

I tested:

It fixes the issue.

@fmeum
Copy link
Collaborator

fmeum commented Jun 24, 2024

@bazel-io flag

@bazel-io bazel-io added the potential release blocker Flagged by community members using "@bazel-io flag". Should be added to a release blocker milestone label Jun 24, 2024
@fmeum
Copy link
Collaborator

fmeum commented Jun 24, 2024

@peakschris Thanks for testing this. It looks like we need to switch to a different scheme.

@peakschris
Copy link
Author

@fmeum you're welcome, glad to get to the bottom of this. Would this be a breaking change, behind a flag?

As an aside, when I'm building bazel should I expect Executing genrule //src:package-zip_jdk_minimal to take 1800s?

@fmeum
Copy link
Collaborator

fmeum commented Jun 24, 2024

Technically it wouldn't be as the docs clearly state that the particular naming scheme is an implementation detail, but due to limitations of the API users started to rely on it. Since the regression is so severe and silent, I do think that we need to fix it even in 7.2.

I've noticed this target being very slow to build on Windows, but not to that extent (a few minutes on my regular laptop).

fmeum added a commit to bazelbuild/bazel-gazelle that referenced this issue Jun 24, 2024
Instead of relying on a particular separator char such as `~`, we instead only rely on the guaranteed fact that the apparent name of an extension repo is the last component of its canonical name.

Prepares for bazelbuild/bazel#22865
fmeum added a commit to bazelbuild/bazel-gazelle that referenced this issue Jun 24, 2024
Instead of relying on a particular separator char such as `~`, we instead only rely on the guaranteed fact that the apparent name of an extension repo is the last component of its canonical name.

Prepares for bazelbuild/bazel#22865
@iancha1992
Copy link
Member

@bazel-io fork 7.3.0

@bazel-io bazel-io removed the potential release blocker Flagged by community members using "@bazel-io flag". Should be added to a release blocker milestone label Jun 24, 2024
@peakschris
Copy link
Author

There are two further lines that need changing in src\main\java\com\google\devtools\build\lib\bazel\bzlmod\ModuleExtensionId.java

@meteorcloudy
Copy link
Member

@peakschris Thank you so much for catching this and digging into the root cause.

I've disabled 8.3 filenames using fsutil 8dot3name set 3 but we have no way of knowing if any special code still executes in the windows kernel when it comes across a request for a file containing ~

So we can confirm that disabling 8.3 filenames on the volume doesn't fix this bug, right?

@peakschris
Copy link
Author

@meteorcloudy that's correct, disabling 8.3 filenames does not fix this bug

fmeum added a commit to bazelbuild/bazel-gazelle that referenced this issue Jun 25, 2024
* Remove reliance on specific canonical repo name scheme

Instead of relying on a particular separator char such as `~`, we instead only rely on the guaranteed fact that the apparent name of an extension repo is the last component of its canonical name.

Prepares for bazelbuild/bazel#22865

* Update go_repository_config.bzl
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. team-Rules-CPP Issues for C++ rules type: bug untriaged
Projects
None yet
Development

No branches or pull requests

7 participants