
List of projects that are good candidates for study #1

Open
vsoch opened this issue Mar 25, 2020 · 13 comments

Comments

@vsoch

vsoch commented Mar 25, 2020

From issue urlstechie/urlchecker-python#13:

But here is an idea - what about some kind of fun project where we involve the research community, do our own small analysis, and then that serves as a writeup that we can share on social media to encourage folks to use the tool? What I'd want to do is assemble a list of documentation served from research-oriented repositories / groups, and then programmatically run the checker for all of them to calculate the total number of links, the number of broken links, etc. Actually, if we add the ability to specify an output file, we could even put the entire results into a data repository and include the scripts for running. And heck, if we can make an argument for a research tool for documentation (one that has shown purpose, with some hypotheses / conclusions about links) we might even have enough to at least submit a paper to JoSS (and then to arXiv if it's rejected). What do you think?

So my thinking for moving forward - after results can be saved to file:

  • We can make a list of repos that we want to run it for (repos that we would eventually want to suggest / PR to add the check to) - this issue here.
  • Once we have that, I'll start our little data analysis to check repos! Would we want to do something that not only runs across repos, but also across time? Would it be interesting to set it up as an automated CI job that can collect metrics over a longer period of time?
  • We can write it up, and perhaps submit it first or just share it.
  • And then, for all repos we test, we can offer to help fix the broken links by way of a PR to use the action (a minimal per-repository run is sketched below).

This sounds like fun! I'm totally willing to take on the bulk of work stated above, I haven't done a little fun project like this in a while. Let me know your thoughts!
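A minimal sketch of the kind of per-repository run described in the plan above - assuming the urlchecker CLI is installed and gains the "save results to a file" option mentioned there (the --save flag name and the repository url below are placeholders, not a confirmed interface):

```python
# Sketch: clone one repository and run the url checker over it, saving
# results to a csv. The --save flag is an assumption based on the
# "results can be saved to file" feature discussed above.
import subprocess
import tempfile

repo = "https://github.com/urlstechie/urlchecker-python"  # placeholder repository

with tempfile.TemporaryDirectory() as clone_dir:
    subprocess.run(["git", "clone", "--depth", "1", repo, clone_dir], check=True)
    subprocess.run(
        ["urlchecker", "check", clone_dir, "--save", "results.csv"],
        check=True,
    )
```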


@SuperKogito
Member

  • I kind of forgot about this one. It is actually a great plan. I will put together a list of projects I find interesting for us, and from there we can maybe test locally and provide some PRs. I think with urlchecker-python this is simple to do, and in the PRs we can maybe mention urlchecker-action.

  • This was actually one of the small hacks I thought of to improve my contribution record on GitHub, but I figured it might bother some maintainers if, every now and then, some random developer shows up with a bunch of broken urls.

  • I am not sure I understand your idea of an automated CI job to test over a longer period of time? Isn't that the job of the GitHub action?

  • I will provide a list as soon as I can, which shouldn't be complicated because every project with links or documentation urls is of interest to us. However, I think the list should depend more on who is open to the feedback and who might adopt our tool.

  • One thing I would love to explore more in the upcoming days is the badge. There is one in the README here https://github.com/urlstechie/urlstechie.github.io but it is static (I think). If we manage to make it dynamic and dependent on the last build (something like Travis CI badges), that might propel things for the project, because badges are trending these days and they wrap the results up beautifully. A rough sketch of one way to do that is below.
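One possible way to make it dynamic - assuming the scheduled job publishes a small JSON file that a shields.io "endpoint" badge can read (the counts, label, and file name below are placeholders, nothing is wired up yet):

```python
# Sketch: write a shields.io "endpoint" badge description from the last
# check results. shields.io renders it via
#   https://img.shields.io/endpoint?url=<public url of badge.json>
# The counts are placeholders; they would come from the saved results.
import json

total_urls = 120  # placeholder: urls checked in the last run
broken_urls = 3   # placeholder: urls that failed in the last run

badge = {
    "schemaVersion": 1,
    "label": "urlchecker",
    "message": f"{total_urls - broken_urls}/{total_urls} urls ok",
    "color": "brightgreen" if broken_urls == 0 else "orange",
}

with open("badge.json", "w") as handle:
    json.dump(badge, handle)
```

The badge in the README would then point at wherever badge.json is published, via https://img.shields.io/endpoint?url=<public url of badge.json>.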

@vsoch
Author

vsoch commented Apr 5, 2020

I am not sure I understand your idea of an automated CI job to test over a longer period of time? Isn't that the job of the GitHub action?

@SuperKogito let's say that we have a list of repos - we would have some repository, let's call it "urlchecker-analysis", that uses the GitHub action to:

  • run on a nightly / weekly basis
  • run one check per repository url that we've defined
  • save a file of results, again one per repository

So you can imagine we would have a results structure something like this:

# urlchecker-analysis
results/
    repo-checked-1     # this might be the research meeting list repo, for example
        results-<date-1>.csv
        results-<date-2>.csv
        ...
    repo-checked-2
    ...
    repo-checked-n
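A rough sketch of the collection script that such a scheduled job could call to fill in this structure - one dated CSV per repository per run. The repository list and the --save flag are placeholders / assumptions, as in the earlier sketch:

```python
# Sketch: loop over the repositories we want to check and drop one dated
# results file per repository into results/<repo-name>/, matching the
# layout above.
import subprocess
import tempfile
from datetime import date
from pathlib import Path

REPOS = [
    "https://github.com/urlstechie/urlchecker-python",  # placeholder entry
]

today = date.today().isoformat()

for repo in REPOS:
    name = repo.rstrip("/").split("/")[-1]
    outdir = Path("results") / name
    outdir.mkdir(parents=True, exist_ok=True)
    outfile = outdir / f"results-{today}.csv"

    with tempfile.TemporaryDirectory() as clone_dir:
        subprocess.run(["git", "clone", "--depth", "1", repo, clone_dir], check=True)
        # --save is assumed to write the per-url results to a csv
        subprocess.run(
            ["urlchecker", "check", clone_dir, "--save", str(outfile)],
            check=True,
        )
```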

And then you can imagine having an analysis script that can be run over any specific checked repository and report things like "the percentage of urls broken on average is ..." or "the change from week to week is ...". More importantly, if we get enough repos, we might even be able to say things in a larger sense, like "repos associated with this domain, or repos that were updated only this many times, had significantly more broken links." Of course that requires having metadata about the repos, which is something else we can get from the GitHub API, etc. But that's a later step; we can focus first on:

  1. collecting a list of repository urls
  2. creating the urlchecker-analysis repository with a GitHub workflow that uses urlchecker-action, once per repository, to do the checks
  3. and then automating it to run weekly and save results

And then we can play around with developing the analysis bit when there is a tiny bit of data. I suspect that most repos won't have huge changes day to day, which is why I'm thinking a monthly rate might be a good start.
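A sketch of what that analysis bit could look like for a single checked repository, assuming each saved CSV has one row per url with some pass/fail column (the "result" column name and "passed" value are assumptions about the output format, not confirmed):

```python
# Sketch: for one checked repository, read its dated results files and
# report the average fraction of broken urls plus the change between
# consecutive runs.
import csv
from pathlib import Path


def broken_fraction(csv_path):
    with open(csv_path, newline="") as handle:
        rows = list(csv.DictReader(handle))
    if not rows:
        return 0.0
    # assumed column name "result" with value "passed" for working urls
    broken = sum(1 for row in rows if row.get("result", "").lower() != "passed")
    return broken / len(rows)


repo_dir = Path("results") / "repo-checked-1"  # placeholder, as in the tree above
history = sorted(repo_dir.glob("results-*.csv"))
fractions = [broken_fraction(path) for path in history]

if fractions:
    print(f"average broken: {sum(fractions) / len(fractions):.1%}")
    for i in range(1, len(history)):
        change = fractions[i] - fractions[i - 1]
        print(f"{history[i - 1].name} -> {history[i].name}: {change:+.1%}")
```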

And then once we have this analysis, we can write it up, make pretty plots, and give good reason to do the checks in the first place!

@vsoch
Author

vsoch commented Apr 5, 2020

For the badges - definitely give it a go! Please again open feature branches for review first. I've made custom badges before (I think with shields.io?). Here are a few purple ones I designed for the needs-love project :) https://github.com/rseng/needs-love

@SuperKogito
Member

urlchecker-analysis, I love it. The whole concept, that's a genius idea <3 I will see which repository urls we can use :)

@vsoch
Author

vsoch commented Apr 5, 2020

Awesome! If you want to put together a first shot at a list, I can put together the skeleton of the repo (I've already thought about it a bit).

@SuperKogito
Member

Go ahead with the repo and I will add a list to it? Or maybe it's better to put it here? I will try to put it together by tomorrow at the latest.

@vsoch
Author

vsoch commented Apr 5, 2020

Just put it here since we have the nice issue :)

@vsoch
Author

vsoch commented Apr 5, 2020

Actually even better - I can make the repo and transfer the issue! <3

@vsoch vsoch transferred this issue from urlstechie/urlchecker-python Apr 5, 2020
@vsoch
Author

vsoch commented Apr 5, 2020

Done!

@SuperKogito
Member

So after searching a bit and checking some projects, I came up with the list below. The projects were not chosen according to any specific criteria; I just tried to diversify the repositories (Python, JS, HTML), though the list is still missing others (C, C++, etc.). I also tried to include projects that are currently maintained and contain many links.

Python projects

JS projects and websites

This is a list of various active projects of interest with many links.

Curated-list repositories are very interesting for us.

This will help us test urlchecker with .md files.

Academic projects and courses

HTML documentation

@SuperKogito
Member

Let me know what you think of it and which ones we should add ;)

@vsoch
Author

vsoch commented Apr 6, 2020

These are great! I don't see why we shouldn't add all of them? It's a very nice range of types of repos.

@vsoch
Author

vsoch commented Apr 6, 2020

I need to finish up working on an API, but after that I should be able to put some time into this! If not today, definitely this week.
