Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add performance report #1

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

wesley-weiming
Copy link
Collaborator

  1. We submitted a performance report
  2. Supports automatically generating project-impact-graph.yaml for a given number of projects and dependencies, and then you can test it yourself
  3. In the performance test, we found that using "minimatch" for glob matching would lead to performance degradation (For example, in the case of 2,000 projects and 10,000 dependencies, the time to calculate 100 paths reaches minutes). So we need to change the path matching method of the algorithm and use string matching instead of glob matching. This means that fields such as includeGlobs and excludeGlobs in the file schema must also discard the representation of glob.

@CLAassistant
Copy link

CLAassistant commented Dec 6, 2023

CLA assistant check
All committers have signed the CLA.

projects:
A:
includedGlobs:
- projects/folder_A/**
- projects/folder_A/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you elaborate the syntax used here?

Copy link
Collaborator Author

@wesley-weiming wesley-weiming Dec 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

projects:
    A:
        includedGlobs:
            - projects/folder_A/
        excludedGlobs:
            - projects/folder_A/README.md
        dependentProjects:
            - G

This semantics describes a project named 'A', and 'includedGlobs' is used to specify the files that should be included in project 'A'. 'excludedGlobs' is a subset of 'includedGlobs', used to specify which paths in project 'A' need to be filtered. 'dependentProjects' indicates which projects directly depend on 'A'

projects/folder_A/, Because we replaced glob match with startsWith, this folder path is used here to represent project A.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels like it is no longer included/excludedGlobs. It's included/excluedPrefix now.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, it is no longer appropriate to continue using 'glob'

@@ -0,0 +1,148 @@
import path from 'path';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a REAMD to teach us how to run the performance?

My guess now is run this file with node. Is it true?

Even better, you can put the performance result with the running environment info.

Copy link
Collaborator Author

@wesley-weiming wesley-weiming Dec 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course, let me add a new commit

@@ -2,11 +2,11 @@ globalExcludedGlobs:
- OWNERS
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is these file names still working? The implementation has been changed to match with startsWith

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OWNERS
build.sh
bootstrap.sh

These represent public configuration files in the root directory, OWNERS means repoRootDir/OWNERS, which is still available for startsWith

@@ -4,7 +4,6 @@
import fs from 'fs';
import yaml from 'yaml';
import _ from 'lodash';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is lodash used in any inner loops? Some time ago a perf investigation showed that Lodash's algorithms are often extremely inefficient due to so many layers of abstractions in its code base.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not used in loops, lodash is only used twice in a complete calculation process (using its cloneDeep API to clone the graph structure)

| 3000 | 100000 | 1 | 1 | 9.46s |
| 3000 | 100000 | 10 | 10 | 10.533s |
| 3000 | 100000 | 100 | 100 | 11.029s |
| 3000 | 100000 | 1000 | 1000 | 11.984s |
Copy link
Member

@octogonz octogonz Dec 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wesley-weiming

In https://github.com/tiktok/project-impact-graph/pull/3/files I've added some instrumentation to count the number of times each part of the loop is executed.

Here's one of the test cases:

[
  {
    nodeCount: 2000,
    edgeCount: 10000,
    pathCountA: 1000,
    pathCountB: 1000,
    hasImpactIntersection: true,
    executeTime: '1.513s'
  }
]
{
  _integrateExcludedGlobs: 1,
  _integrateExcludedGlobs2: 2000,
  _validatePaths: 2,
  _validatePaths2: 4008000,
  lookUpProjectNamesByPathList: 2,
  lookUpProjectNamesByPathList2: 4000000,
  lookUpProjectNamesByPathList3: 4000000,
  lookUpProjectNamesByPathList4: 5780,
  getProjectImpactByProjectNames: 2,
  getProjectImpactByProjectNames2: 4000,
  getProjectImpactByProjectNames3: 24000,
  hasImpactIntersection: 1
}

The nodeCount is the number of projects, and pathCountA and pathCountB are the "before" and "after" lists of paths from the diff. The bottleneck of this algorithm seems to be 4,000,000 which is O(numberOfPaths * numberOfProjects).

But notice that our project paths have a well-behaved structure, for example:

  • INCLUDE apps/my-app/**
  • EXCLUDE apps/my-app/README.md
  • INCLUDE apps/my-app2/**
  • EXCLUDE apps/my-app2/README.md
  • INCLUDE libraries/my-lib3/**
  • EXCLUDE libraries/my-lib3/README.md
  • INCLUDE libraries/my-lib4/**
  • EXCLUDE libraries/my-lib4/README.md
  • INCLUDE libraries/my-lib4/bad-nested-project/**
  • EXCLUDE libraries/my-lib4/bad-nested-project/README.md

Even if we permit projects to be nested under other project folders, the prefixes of these globs still form a tree. For an example input path libraries/my-lib3/src/index.ts, imagine an O(n*log(n)) algorithm that would cheaply walk down libraries -> my-lib3, and then need to test only 2 globs ** and README.md.

This idea is similar to rush-lib/src/logic/LookupByPath.ts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants