[crawler] Capture project details #26

themightychris · 2019-09-24T16:56:57Z

The prototype crawler from #19 only captures a few basic project details. Code for Kenya/HURUmap provides a good example of a record with all fields filled in well, and its history already shows some natural changes to them.

What are some more details a v2 should capture? Any thoughts on how we should organize it? (TOML has great support for grouping things any number of levels deep)

I don't think we want to capture any details in the index that routinely change day-by-day in the life of a project (e.g. number of open issues, number of contributors), BUT maybe we do capture things like that as binary or tiered buckets (e.g. has-issue=true or contributors=5-10)
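To make the tiered-bucket idea concrete, here is a minimal sketch. The tier boundaries and the `summarize` helper are assumptions for illustration; only `has-issue=true` and `contributors=5-10` come from the examples above.

```python
# Hypothetical tier boundaries; the thread only offers "5-10" as an example.
CONTRIBUTOR_TIERS = [(1, 1), (2, 4), (5, 10), (11, 50)]


def contributors_bucket(count: int) -> str:
    """Map an exact contributor count to a coarse, slow-changing label."""
    if count <= 0:
        return "0"
    for low, high in CONTRIBUTOR_TIERS:
        if low <= count <= high:
            return str(low) if low == high else f"{low}-{high}"
    # Anything above the last tier collapses into one open-ended bucket.
    return f"{CONTRIBUTOR_TIERS[-1][1] + 1}+"


def summarize(open_issues: int, contributors: int) -> dict:
    """Reduce day-to-day counters to values that rarely change."""
    return {
        "has-issue": open_issues > 0,
        "contributors": contributors_bucket(contributors),
    }
```

With tiers like these, a project going from 6 to 8 contributors produces no change in the index at all; only crossing a boundary would.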

I think we should pull in the GitHub description and/or opening paragraph of the README directly, and then for other big wordy things record their presence, link to them, and maybe measure their health or summarize them if there's a valuable way to do so. (e.g. we can record which license is used and link to the license, we can record which of GitHub's standard community health files are present and link to them)
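As a rough sketch of "record their presence and link to them" for community health files: given a list of file paths in a repo, report where each file was found. The file list is a partial subset of what GitHub recognizes, and the search order, the case-insensitive matching, and the function name are assumptions of this sketch rather than crawler behavior.

```python
from typing import Dict, Iterable, Optional

# Partial list of GitHub's community health files (Markdown ones only).
HEALTH_FILES = [
    "CODE_OF_CONDUCT.md",
    "CONTRIBUTING.md",
    "GOVERNANCE.md",
    "SECURITY.md",
    "SUPPORT.md",
]
# Locations to look in; the order here is an arbitrary choice.
SEARCH_DIRS = [".github/", "", "docs/"]


def find_health_files(repo_paths: Iterable[str]) -> Dict[str, Optional[str]]:
    """Return, for each health file, the first path where it exists (or None)."""
    # Compare case-insensitively but keep the original spelling for linking.
    by_lower = {path.lower(): path for path in repo_paths}
    found: Dict[str, Optional[str]] = {}
    for name in HEALTH_FILES:
        found[name] = None
        for prefix in SEARCH_DIRS:
            match = by_lower.get((prefix + name).lower())
            if match is not None:
                found[name] = match
                break
    return found
```

The returned path can be joined onto the repo URL to produce the link, and `None` values double as a simple completeness signal.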

We should also record the presence of any civic.json or publiccode.yml file, and pull some or all of their contents into a normalized form.

themightychris · 2019-09-24T17:08:53Z

I just found that publiccode.yml got formalized and submitted to the OSI to be incubated as a stewarded project. It's been through a ton more iteration than civic.json and has first-class support for describing that a project isBasedOn another project, something our network sorely needs and that is noticeably missing from civic.json.

So it seems to me we should hew as closely as we can to publiccode.yml's schema and essentially just do our best to progressively fill out a massive database of project records that are autofilled as best we can

nikolajbaer · 2019-09-27T17:26:07Z

Sorry, itchy trigger finger.

I have a couple of notes on this from the statusboard perspective. These are debatable, as they mostly help with reading through the index without a secondary "processing" task.

It would be nice to have a "manifest" at the organization folder so we can see what organizations are there. This might also help with routing renamed groups.
An explicit "name" in the .toml files for both orgs and projects would be nice, rather than the implicit folder / filename convention.
Likewise a manifest in the brigade's projects folder, maybe even indicating which ones are active vs. archived, would be helpful in processing.

As for civic.json and publiccode.yml, I am looking forward to seeing that data in the .toml files!

gregboyer · 2019-09-29T18:26:05Z

Two questions I'm wondering about, and maybe they don't belong at this step but rather when presenting info:

How do we know which brigades to include? (Maybe scrape everything but only display opt-in brigades?)
Ditto, but more important for projects

My reasoning is that we need to have a threshold that allows us to present accurate and helpful information, but not add noise such as old or poorly documented projects

gregboyer · 2019-09-29T18:47:49Z

I do also want to add that I love this middle layer for data standardization and how it'll allow us to make more flexible decisions in the future regarding our sources of information without impacting anything downstream.

themightychris · 2019-09-29T19:12:40Z

@gregboyer for which brigades to include, we're pulling from the already-curated dataset maintained for cfapi: https://github.com/codeforamerica/brigade-information

My thinking is that we scrape everything into the index repo so people can build different sorts of tools/analysis on top of it. We can then layer whatever quality/completion metrics we can into the data for easy filtering by tools, like the ones aspiring contributors might search through.

tdooner · 2020-01-09T23:42:21Z

I'd like to add to the conversation a desire to have some information about project activity, for purposes of more easily determining what a Brigade is actively working on.

For example, imagine being able to sort OpenOakland's projects by last commit timestamp (i.e. Github default sort).

When combined with other relevance signals, this gets pretty close to making a fully useful search engine for brigade projects.

> I don't think we want to capture any details in the index that routinely change day-by-day in the life of a project (e.g. number of open issues, number of contributors), BUT maybe we do capture things like that as binary or tiered buckets (e.g. has-issue=true or contributors=5-10)

I assume that this preference is a consequence of the fact that we're storing the entire revision history in Git and don't want it to be too big with daily changes to this stuff. I think we should give it a try, though, given how valuable this kind of thing could be. Perhaps if we change the Git commit pattern to be one commit per day instead of one commit per day per organization then the history will be a bit more scannable.

themightychris · 2020-01-10T02:07:20Z

Git could handle the volume, but we want to distribute the dataset in a fan-out to many applications, and keeping track of granularity down to "time of last commit" is just needlessly noisy for all involved.

That would provide a much cleaner signal over time about how active a project is. Think evolving classifications rather than time-series datapoints. We'll have the URL to the git repo in there too if anyone wants to go analyze the timestamps of every commit for a project

nikolajbaer · 2020-01-16T21:58:40Z

Both are great thoughts! I agree on the signal/noise concerns, but also the value of a metric of activity.

What if we checked the most recent commit on a trigger, and then only updated the index data when it slips from one "bucket" to another, i.e. when it transitions between the "last_commit_within" buckets (although maybe start at "quarter", as that might reduce the "noise" a bit)? The bottom rung could be "active" for anything that has commit activity within the last 3 months. The challenge will be getting the threshold right so we don't have a lot of projects flip-flopping between "quarter" and "active".

Edit: realizing this approach was already somewhat mentioned in the original and follow-up comments, so sorry for the duplication, but 👍 for the tiered idea.
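A minimal sketch of the "only update when the bucket changes" idea, in case it helps the discussion. The bucket names, the cutoffs, and both function names are assumptions; the thread has not settled on thresholds.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# (label, maximum age) pairs, checked in order. Names and cutoffs are assumptions.
BUCKETS = [
    ("week", timedelta(days=7)),
    ("month", timedelta(days=30)),
    ("year", timedelta(days=365)),
]
OLDEST = "over_a_year"


def last_commit_within(last_commit: datetime, now: datetime) -> str:
    """Classify how recently a project received a commit."""
    age = now - last_commit
    for label, max_age in BUCKETS:
        if age <= max_age:
            return label
    return OLDEST


def updated_bucket(
    previous: Optional[str], last_commit: datetime, now: datetime
) -> Optional[str]:
    """Return the new bucket only if it differs from what the index holds."""
    current = last_commit_within(last_commit, now)
    return current if current != previous else None


if __name__ == "__main__":
    now = datetime(2020, 1, 16, tzinfo=timezone.utc)
    # The index says "week" but the last commit is 20 days old: write "month".
    print(updated_bucket("week", now - timedelta(days=20), now))
```

When `updated_bucket` returns `None` the crawler would skip the write, so a project only shows up in the index history when its classification actually moves.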

For multiple types of our users (CfA Staff, Brigade Leader, Project Leaders), being able to tell which projects are still active is a crucial aspect of the index. In hackforla#26 we discuss using a bucketed approach so as to not create unnecessary noise by committing the timestamp for every update. This commit implements a coarse timestamp: for projects updated within the last week, month, year, or over a year ago.

giosce · 2020-12-10T00:31:36Z

Can we retrieve the programming languages?

giosce · 2021-09-16T00:59:01Z

It could be interesting to save the last_commit on the default branch (but it is an API call per repo).
We could also save the number of GitHub open_issues and the language.

themightychris · 2021-09-16T11:49:51Z

right now all we capture from github is what's included in the GitHub Repo API object

once we start capturing content from the git repo though we'll be fetching the latest commit on the default branch and yeah could record details about it. That could present churn issues though

themightychris · 2021-09-16T11:51:23Z

I wonder if we could get a list of committers from github without having to fetch complete history ourselves

k3KAW8Pnf7mkmdSMPHz27 · 2021-09-16T12:44:47Z

List of committers or contributors? Are we getting https://docs.github.com/en/rest/reference/repos#list-repository-contributors ?

giosce · 2021-10-19T19:35:36Z

Let's decide what we would like to add.
Also, @themightychris, how do you get the current info from GH? Which API? I was assuming organization/repos but wonder how you get readme/description and topics.

I'd like the day of last push on default branch, number of contributors, languages. Anything else?
And don't bucket last push and open issues

If we are making this call https://api.github.com/repos/codeforamerica/brigade-project-index
we should be getting this response

For more details we need additional calls, for projects
https://api.github.com/repos/codeforamerica/brigade-project-index/languages
https://api.github.com/repos/codeforamerica/brigade-project-index/contributors
https://api.github.com/repos/codeforamerica/brigade-project-index/branches/{default_branch returned by first call}
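The call sequence above could be sketched like this. It assumes the repo object from the first call carries `url` and `default_branch` fields; the helper names are made up for illustration, and this is not how the crawler currently fetches data.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com/repos"


def repo_url(owner: str, repo: str) -> str:
    """URL for the first call, which returns the repo object."""
    return f"{API_ROOT}/{owner}/{repo}"


def detail_urls(repo_object: dict) -> dict:
    """Build the follow-up URLs from the first call's response."""
    base = repo_object["url"]
    return {
        "languages": f"{base}/languages",
        "contributors": f"{base}/contributors",
        "default_branch": f"{base}/branches/{repo_object['default_branch']}",
    }


def fetch_json(url: str):
    """GET a URL and decode the JSON body (unauthenticated, so rate limits apply)."""
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Usage would be `repo = fetch_json(repo_url("codeforamerica", "brigade-project-index"))` followed by one `fetch_json` per entry in `detail_urls(repo)`, which makes the cost explicit: three extra requests per project.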

themightychris · 2021-10-20T22:24:35Z

Shared by @ExperimentsInHonesty , this page provides great practical examples for projects to publish community health files, which we should capture: https://100automations.org/guides/community-support-for-automations.html

themightychris · 2021-10-25T22:39:22Z

I think this issue can be closed by splitting off a few more specific tickets:

enable crawler to load content from inside Git repository default branch [Extract content from git repository #62]
capture any present publiccode.yml into the snapshot [Read PublicCode.yml #55]
capture any present civic.json into the snapshot
capture any present README.md into the snapshot (ideally using some sort of library to parse the Markdown document into structured object reflecting the document's TOC so we have visibility into what headers it contains and its structure)
capture all the standard community health Markdown files (and ideally parse them by TOC too)
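For the "parse the Markdown into a structured object reflecting the TOC" items, a rough standard-library sketch might look like the following. It only recognizes ATX headings (`#` through `######`) and skips fenced code blocks; a real Markdown parser would be more robust, and the function name and output shape are assumptions.

```python
import re
from typing import List

HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")
# Fence markers are built by repetition to keep this example easy to embed.
FENCE_MARKERS = ("`" * 3, "~" * 3)


def parse_toc(markdown: str) -> List[dict]:
    """Return nested {"title", "level", "children"} dicts for ATX headings."""
    root = {"level": 0, "children": []}
    stack = [root]
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE_MARKERS):
            in_fence = not in_fence
            continue
        if in_fence:
            continue  # a "# comment" inside a code block is not a heading
        match = HEADING.match(line)
        if not match:
            continue
        node = {"title": match.group(2), "level": len(match.group(1)), "children": []}
        # Climb back up until the top of the stack is a shallower heading.
        while stack[-1]["level"] >= node["level"]:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root["children"]
```

The same function would work for README.md and for each community health file, so the snapshot could store one TOC tree per captured document.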

themightychris created this issue from a note in Brigade Project Index - Planning / Discovery (Prioritized Backlog) Sep 24, 2019

themightychris mentioned this issue Sep 24, 2019

Pull request bot #27

Open

nikolajbaer closed this as completed Sep 27, 2019

Brigade Project Index - Planning / Discovery automation moved this from Prioritized Backlog to Done Sep 27, 2019

nikolajbaer reopened this Sep 27, 2019

themightychris mentioned this issue Nov 3, 2019

Improve Brigade Project Meta-data representation on StatusBoard #30

Open

2 tasks

themightychris mentioned this issue Dec 19, 2019

Review tag import DemocracyLab/CivicTechExchange#261

Closed

tdooner mentioned this issue Sep 15, 2020

Add last_pushed_within field to project index #40

Merged

giosce added this to To do in Congress2021 Release via automation Sep 16, 2021

giosce added this to To do in 2021 Q4 Release Oct 19, 2021

giosce removed this from To do in Congress2021 Release Oct 19, 2021
