
git operations take a lot of time #7358

Open
andinus opened this issue Jan 4, 2023 · 8 comments

andinus commented Jan 4, 2023

There are currently over 70,000 files in this repository, and we add hundreds more every week (each week a directory is created for every user and the previous "README" is copied).

I started participating with challenge-076. According to my records I've submitted solutions for 25 challenges, so there are ~100 useless directories with my name and a README file. With around 300 users, I believe this adds up.

My primary machine is not very fast and it takes 70 seconds to run git status.

andinus@ ~//perlweeklychallenge-club > time git status
On branch master
Your branch is up to date with 'origin/master'.

It took 54.97 seconds to enumerate untracked files. 'status -uno'
may speed it up, but you have to be careful not to forget to add
new files yourself (see 'git help status').
nothing to commit, working tree clean

________________________________________________________
Executed in   71.65 secs

andinus@ ~//perlweeklychallenge-club > time git status -uno
On branch master
Your branch is up to date with 'origin/master'.

nothing to commit (use -u to show untracked files)

________________________________________________________
Executed in   16.89 secs


ealvar3z commented Mar 13, 2023

@andinus I assume you've already done a shallow clone?

If you have and you're still seeing performance issues, I can submit a patch (PR) for this issue. The following is what I have in mind:

a simple script that runs:

git repack && git prune-packed && git reflog expire --expire=1.month.ago && git gc --aggressive

Add it to a GH workflow that crons it every week.
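
Spelled out with comments, as a sketch (the commands and flags are exactly those above):

#!/bin/bash
git repack                              # consolidate loose objects into packfiles
git prune-packed                        # drop loose objects already present in packs
git reflog expire --expire=1.month.ago  # expire reflog entries older than one month
git gc --aggressive                     # aggressive garbage collection and repack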

Thoughts, @manwar?

P.S.: @andinus, if upstream does not want the proposed PR, note that you can still do this to your local clone.
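
P.P.S.: one more knob worth trying locally (an aside, assuming a filesystem that reports directory mtimes reliably) is git's untracked cache, which targets exactly the "enumerate untracked files" step from the timings above:

git config core.untrackedCache true   # allow git to cache untracked-file enumeration
git update-index --untracked-cache    # enable and prime the cache for this index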

@ealvar3z

Update: I've just seen that the scripts directory has already attempted this, so the solution may not be upstream.


rcmlz commented Sep 14, 2023

I am also in favour of doing some housekeeping. I use zsh with some git integration, and the by now ~90k files slow down the shell. Could the "historic" commits perhaps be squashed automatically, so that we have only a single commit per week on master?
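
(For comparison, a shallow single-branch clone approximates a squashed history on the client side without rewriting anything upstream; the URL below is an assumption:)

git clone --depth 1 --single-branch https://github.com/manwar/perlweeklychallenge-club.git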

@ealvar3z

@andinus I think your recommendation is the best and quickest approach (i.e. deleting stale dirs with only README files). I ran a test locally and this is what I got:

Before I ran script/cleanup_readme_only:

╔ eax@nix:test_perlweeklychallenge-club(issue/7358)
╚ λ time gs
Refresh index: 100% (88731/88731), done.
On branch issue/7358
Untracked files:
  (use "git add <file>..." to include in what will be committed)
        script/cleanup_readme_only

nothing added to commit but untracked files present (use "git add" to track)

real    0m3.350s
user    0m1.562s
sys     0m2.123s

Running script/cleanup_readme_only

This is how long the shell script took to run. This may be a one-time cost, though, since this first pass swept the backlog from the repo's entire history.

╔ eax@nix:test_perlweeklychallenge-club(issue/7358)
╚ λ time bash -c script/cleanup_readme_only

real    2m1.530s
user    4m33.764s
sys     3m12.803s

It got rid of 39k files (see below), but we could do better.

╔ eax@nix:test_perlweeklychallenge-club(issue/7358)
╚ λ git diff --name-only HEAD~ | wc -l
39066

Running git status after script/cleanup_readme_only:

Untracked files:
  (use "git add <file>..." to include in what will be committed)
        script/cleanup_readme_only

no changes added to commit (use "git add" and/or "git commit -a")

real    0m1.082s
user    0m0.602s
sys     0m0.817s

A great improvement, but the script is too slow (even with xargs). So I rewrote it in Go! See the speed improvement below.

╔ eax@nix:test_perlweeklychallenge-club(issue/7358)
╚ λ time bin/cleanup

real    0m2.658s
user    0m2.780s
sys     0m5.675s

Night and day!!!

@manwar: let me know if this is a desirable action, and I'll submit the PR (all the code and local tests are complete). The GH Actions workflow is below:

name: Cleanup Readmes From Repository

on:
  schedule:
    - cron:  '0 0 * * 0'  # Run at midnight every Sunday

jobs:
  cleanup:
    runs-on: ubuntu-latest

    steps:
    - name: Checkout code
      uses: actions/checkout@v2

    - name: Setup Go
      uses: actions/setup-go@v2
      with:
        go-version: 1.17

    - name: Build Go Script
      run: go build -o bin/cleanup bin/main.go 

    - name: Execute Cleanup
      run: ./bin/cleanup
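
Note: as sketched, the workflow deletes files on the runner but never commits them back; a final step along these lines (a hypothetical addition, not part of the tested code) would be needed for the removals to actually land in the repository:

    - name: Commit removals
      run: |
        git config user.name "github-actions[bot]"
        git config user.email "github-actions[bot]@users.noreply.github.com"
        git add -A
        git commit -m "Remove README-only directories" || echo "nothing to remove"
        git push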


andinus commented Sep 16, 2023

I assume you've already done a shallow clone?

git repack && git prune-packed && git reflog expire --expire=1.month.ago && git gc --aggressive

IIRC it was slow even after a shallow clone and after running this ^. @ealvar3z Can you share the script? I'll try running it and report back.


ealvar3z commented Sep 16, 2023

@andinus

Please be advised that I ran this on a separate repo: cp -r perlweeklychallenge-club/ test_perlweeklychallenge-club

Here's main.go:

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"runtime"
	"sync"
)

// isReadmeOnly reports whether dir contains exactly one entry,
// a README or README.md.
func isReadmeOnly(dir string) bool {
	files, err := os.ReadDir(dir)
	if err != nil {
		return false
	}
	return len(files) == 1 && (files[0].Name() == "README" || files[0].Name() == "README.md")
}

// cleanupReadmeOnly is a worker: it removes every README-only
// directory received on pathChan.
func cleanupReadmeOnly(wg *sync.WaitGroup, pathChan <-chan string) {
	defer wg.Done()
	for path := range pathChan {
		if isReadmeOnly(path) {
			os.RemoveAll(path)
		}
	}
}

func main() {
	var wg sync.WaitGroup
	ncores := runtime.NumCPU()
	pathChan := make(chan string)

	// Fan out one worker per core.
	for i := 0; i < ncores; i++ {
		wg.Add(1)
		go cleanupReadmeOnly(&wg, pathChan)
	}

	err := filepath.WalkDir(".", func(path string, d os.DirEntry, err error) error {
		if err != nil {
			// The path may already have been removed by a worker; skip it.
			return nil
		}
		if d.IsDir() {
			if d.Name() == ".git" {
				return filepath.SkipDir // never descend into .git
			}
			pathChan <- path
		}
		return nil
	})

	if err != nil {
		fmt.Println("Error:", err)
	}

	close(pathChan)
	wg.Wait()
}
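
The design is a simple fan-out: filepath.WalkDir feeds directory paths into an unbuffered channel while one worker per core drains it, so the walk and the deletions overlap. To build and run it locally the same way the workflow does:

go build -o bin/cleanup bin/main.go
./bin/cleanup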

And the bash script:

#!/bin/bash

# Remove every directory whose only content is a README or README.md.
cleanup_readme_only() {
  num_cores=$(nproc)
  # Pass each directory as a positional argument ("$1") instead of
  # splicing {} into the script text, which breaks on paths with spaces.
  find . -name .git -prune -o -type d -print0 |
    xargs -0 -P "$num_cores" -I {} bash -c '
      contents=$(ls -A "$1")
      if [ "$contents" = "README" ] || [ "$contents" = "README.md" ]; then
        rm -rf "$1"
      fi' _ {}
}

cleanup_readme_only
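
Since rm -rf is unforgiving, a cautious first run might swap rm -rf "$1" for echo "$1" to preview which directories would be removed.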


andinus commented Sep 26, 2023

It does improve performance: these previously took 71 and 16 seconds; they take about 8 and 4 seconds now.

andinus@~/d/o/C/perlweeklychallenge-club (master)> time git status > /dev/null
Refresh index: 100% (93480/93480), done.

________________________________________________________
Executed in    8.44 secs    fish           external
   usr time    1.65 secs    0.00 micros    1.65 secs
   sys time   14.08 secs    0.00 micros   14.08 secs

andinus@~/d/o/C/perlweeklychallenge-club (master)> time git status -uno > /dev/null
Refresh index: 100% (93480/93480), done.

________________________________________________________
Executed in    4.34 secs    fish           external
   usr time    1.01 secs    0.00 micros    1.01 secs
   sys time   10.64 secs    0.00 micros   10.64 secs


jo-37 commented May 24, 2024

Maybe this issue depends on the workflow in use. In my setup I don't experience such performance issues.

I'm operating on three branches in my fork of perlweeklychallenge-club:

  • master is pull-only and for safety I use --ff-only for pull.
  • contrib is push-only; I merge with --ff-only to synchronize it with my own master mirror.
  • ch-xxx is a local working branch created from contrib.

Synchronize master and contrib from upstream, then create a new branch ch-xxx from contrib, build the solution there, merge ch-xxx into contrib, push to GitHub, and open a pull request from the contrib branch (a sketch of the round trip follows below).
Delete ch-xxx after it has been merged into master (and is finalized).

Updates are always fast-forward / incremental this way.
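
A sketch of that round trip (branch names as above; ch-xxx stands for the week's working branch, and the remote names are assumptions):

git checkout master && git pull --ff-only           # update the pull-only mirror
git checkout contrib && git merge --ff-only master  # fast-forward contrib
git checkout -b ch-xxx contrib                      # this week's working branch
# ...build the solution, commit...
git checkout contrib && git merge ch-xxx            # fold the work back in
git push origin contrib                             # then open the PR from contrib
git branch -d ch-xxx                                # once merged and finalized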
