@abitrolly

No description provided.

@abitrolly
Author

Counting lines of code takes 3 seconds and calculating file extensions takes 3 minutes.

It appears that the `git ls-files | xargs -n 1 basename > 2` command is very slow: 3m14s, compared to `git ls-files > 1`, which takes only 2.2s.

@abitrolly
Author

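The Firefox repo is huge, and `xargs -n 1` has to run an external `basename` process once for every file. Looks like a major bottleneck for all shell pipelines. Note that `wc -c` counts bytes, not files (`wc -l` would count files), so the number below means the file listing alone is almost 22 million bytes of paths.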

✗ git ls-files | wc -c
21857117
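
One way to avoid the per-file process cost is to strip the directory part of every path inside a single process. The following is a minimal sketch, not the script from this PR: `list_basenames` and `count_extensions` are hypothetical helper names, and it assumes paths contain no embedded newlines.

```shell
# Strip everything up to the last "/" in one sed process,
# instead of starting one basename process per path.
list_basenames() {
    sed 's|.*/||'
}

# Hypothetical helper: count files per extension, most common first.
# Names without a dot are skipped; a dotfile such as ".gitignore"
# is counted under "gitignore".
count_extensions() {
    list_basenames | awk -F. 'NF > 1 { print $NF }' | sort | uniq -c | sort -rn
}
```

Inside a repository this would be used as `git ls-files | list_basenames > 2` or `git ls-files | count_extensions`. With GNU coreutils, `git ls-files | xargs basename -a` is another option, since it passes many paths to each `basename` call.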

branches:
- main
pull_request:
workflow_dispatch:
Owner

Why do we need it?

Author

For testing the workflow in branches other than main.

- name: Gather commits by day and file extension statistics
run: |
./01stats.sh gecko-dev build
Owner

What do we get out of this build step? Also, I see that it takes a pretty long time to complete.

Author

The goal is to move stats collection commands out of GitHub Actions YAML, so that they could be run standalone.

As I continued experiments in my main branch after opening this PR, more things started to creep in. The command that takes the most time is `git fetch --unshallow`, added in the last commit to start playing with historical data.

@4e6
Owner

4e6 commented Jun 2, 2022

Hi! Thanks for your contribution. Just curious, what's the motivation behind these changes?

@abitrolly
Author

@4e6 well, this PR is far from being finished. The final goal was to get the dataset for diagrams of Firefox Oxidization over time (like in #10). Because I didn't know the codebase, I started with the code that seemed the easiest to get running, like counting file extensions over time.

Because a full git checkout alone takes a whole 15 minutes, going commit by commit probably won't be feasible in one CI run, so the plan is to collect the data month by month over multiple CI runs.
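
The month-by-month idea could be sketched roughly like this. It is an illustration under assumptions, not code from this PR: `month_starts`, `monthly_commits` and `files_at` are hypothetical names, and the git commands only give meaningful results once the full history is available locally.

```shell
# Print the first day of COUNT consecutive months, starting at START (YYYY-MM).
month_starts() {
    ms_year=${1%-*}
    ms_month=${1#*-}
    ms_month=${ms_month#0}      # drop a leading zero so "08" is not read as octal
    ms_left=$2
    while [ "$ms_left" -gt 0 ]; do
        printf '%04d-%02d-01\n' "$ms_year" "$ms_month"
        ms_month=$((ms_month + 1))
        if [ "$ms_month" -gt 12 ]; then
            ms_month=1
            ms_year=$((ms_year + 1))
        fi
        ms_left=$((ms_left - 1))
    done
}

# For each month start, print "DATE COMMIT" for the newest commit on HEAD
# made before that date. Months with no earlier commit are skipped.
monthly_commits() {
    month_starts "$1" "$2" | while read -r mc_date; do
        mc_commit=$(git rev-list -1 --before="$mc_date" HEAD)
        if [ -n "$mc_commit" ]; then
            printf '%s %s\n' "$mc_date" "$mc_commit"
        fi
    done
}

# List the files of one commit straight from its tree, without a checkout.
files_at() {
    git ls-tree -r --name-only "$1"
}
```

Reading file names with `git ls-tree` avoids checking out each sampled commit, so one CI run could process a batch of months, for example `monthly_commits 2021-01 12`, and a later run could continue from the next month.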

Maybe it will be faster than restoring `gecko-dev` history with
`git fetch --unshallow`, which takes about 15 minutes.
@abitrolly
Author

I did a complete gecko-dev checkout through the action, and it took 10 minutes, whereas the stats script took only 12s. That's still too slow. Maybe the initial checkout can be cached.

https://github.com/abitrolly/firefox-lang-stats/runs/6713143940

@abitrolly
Author

abitrolly commented Jun 2, 2022

Opened an issue, actions/checkout#818, to track possible solutions.
