crates.io top 20k crates verification process #11
Comments
While analyzing crates I caught an error and got stuck trying to fetch submodules from https://github.com/quickwit-oss/tantivy/tree/dff022b30aff6bcd4df7e908f6fa2f86e551204b because it used git over SSH. I guess |
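One possible workaround for submodules declared with SSH URLs is to rewrite them to HTTPS before fetching. The sketch below is only an illustration, not what cargo-goggles does: `ssh_to_https` is a hypothetical helper, and it assumes the host serves the same repository over HTTPS at the same path (true for GitHub, not guaranteed elsewhere). It does not handle SSH URLs that carry an explicit port.

```rust
/// Rewrites common SSH-style Git URLs to HTTPS so they can be fetched
/// without SSH credentials. Returns `None` for anything it does not recognize.
/// Hypothetical helper: assumes the host exposes the same path over HTTPS.
fn ssh_to_https(url: &str) -> Option<String> {
    // scp-like form: git@github.com:owner/repo.git
    if let Some(rest) = url.strip_prefix("git@") {
        let (host, path) = rest.split_once(':')?;
        return Some(format!("https://{host}/{path}"));
    }
    // URL form: ssh://git@github.com/owner/repo.git
    if let Some(rest) = url.strip_prefix("ssh://") {
        // Drop an optional `user@` prefix before the host.
        let rest = rest.split_once('@').map_or(rest, |(_, r)| r);
        return Some(format!("https://{rest}"));
    }
    None
}

fn main() {
    for url in [
        "git@github.com:owner/repo.git",
        "ssh://git@github.com/owner/repo.git",
    ] {
        println!("{url} -> {:?}", ssh_to_https(url));
    }
}
```

Git itself can perform a similar rewrite through its `url.<base>.insteadOf` configuration, which may be simpler than rewriting URLs in code.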
I've opened #12 to stop missing repositories from continuously blocking the clone process. This patch is already present on the machine I'm doing the scanning from. |
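As a rough illustration of the idea (not the actual patch in #12), a clone loop can record a failed repository and move on instead of retrying it indefinitely. `clone_all` and the `try_clone` closure are hypothetical names; the closure stands in for whatever really performs the clone.

```rust
/// Tries each repository once and records failures instead of retrying forever.
/// `try_clone` is a stand-in for the real clone operation (an assumption here).
fn clone_all<'a, F>(repos: &[&'a str], mut try_clone: F) -> Vec<(&'a str, String)>
where
    F: FnMut(&str) -> Result<(), String>,
{
    let mut skipped = Vec::new();
    for &repo in repos {
        if let Err(err) = try_clone(repo) {
            // Log the failure and keep going with the next repository.
            eprintln!("skipping {repo}: {err}");
            skipped.push((repo, err));
        }
    }
    skipped
}

fn main() {
    let repos = [
        "https://example.com/exists.git",
        "https://example.com/missing.git",
    ];
    // Fake clone: pretend any URL containing "missing" no longer exists.
    let skipped = clone_all(&repos, |url| {
        if url.contains("missing") {
            Err("repository not found".to_string())
        } else {
            Ok(())
        }
    });
    println!("{} repositories skipped", skipped.len());
}
```

Returning the skipped list keeps the failures visible, so they can be reviewed after the run rather than silently dropped.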
The clone process was interrupted by ntex-rs/ntex#333 😅. I've applied a patch locally for now and I'll see how to fix it permanently. Turns out with a large enough pool of crates almost every |
It just happened again with the |
I'm starting to analyze the logs, which I'll publish once we finish analyzing all crates. I've already encountered something, which I've reported at rust-db/refinery#323 |
I've also opened TimelyDataflow/timely-dataflow#559 |
Not sure if it is in top 20k crates, but here is a |
Processing got stuck at ~18.5k crates. Here's the log: output.log.gz WARNING: I've already verified that there are a lot of false positives. I'll merge #18, #19 and #20 locally and have it re-run on all crates |
Don't worry about the 20k limit, it's just a number I've picked for doing the "official" scrape after having done a very rough 5k one in the previous days 😃 |
Issue for brotli, have not tested if it is reproducible: dropbox/rust-brotli#178 |
Here are the results from the second run: output.log.gz |
There are too many crates without a repository field. I'd like to start opening issues on crates that have recently released new versions, since those are the ones more likely to respond. I wrote this very rough scraper to find the last-updated date of each crate in the list:

[package]
name = "cargo-recent-crates"
version = "0.1.0"
edition = "2021"

[dependencies]
reqwest = { version = "0.12", default-features = false, features = ["rustls-tls", "json", "blocking"] }
chrono = { version = "0.4", features = ["serde"] }
serde = { version = "1", features = ["derive"] }

use std::{thread, time::Duration};

use chrono::{DateTime, Utc};

fn main() {
    let client = reqwest::blocking::Client::builder()
        .user_agent("https://github.com/M4SS-Code/cargo-goggles/issues/11 scraping crates with no repository field")
        .build()
        .unwrap();

    // The explicit type lets this compile even while the list is empty.
    let crates: &[&str] = &[
        // put crates here
    ];

    // Only the fields we need from the crates.io API response.
    #[derive(Debug, serde::Deserialize)]
    struct C {
        #[serde(rename = "crate")]
        c: Cr,
    }

    #[derive(Debug, serde::Deserialize)]
    struct Cr {
        updated_at: DateTime<Utc>,
    }

    for k in crates {
        // Try each crate up to 3 times, then give up and move on.
        for _ in 0..3 {
            let j = match client
                .get(format!("https://crates.io/api/v1/crates/{k}"))
                .send()
            {
                Ok(r) => match r.json::<C>() {
                    Ok(j) => j,
                    Err(err) => {
                        eprintln!("{err:?}");
                        thread::sleep(Duration::from_secs(5));
                        continue;
                    }
                },
                Err(err) => {
                    eprintln!("{err:?}");
                    thread::sleep(Duration::from_secs(5));
                    continue;
                }
            };

            // One tab-separated line per crate: name, last update time.
            println!("{k}\t{}", j.c.updated_at.to_rfc3339());
            break;
        }

        // Pause between crates to avoid hammering the API.
        thread::sleep(Duration::from_secs(2));
    }
}
|
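To act on the scraper's output, the `name<TAB>timestamp` lines could be filtered down to recently updated crates and sorted newest first. This is a minimal sketch using only the standard library; `recent_first` and the cutoff value are hypothetical, and it assumes every timestamp is in UTC in the RFC 3339 form the scraper prints, so comparing the `YYYY-MM-DDTHH:MM:SS` prefix as text matches chronological order (to the second).

```rust
/// Keeps rows of `name<TAB>timestamp` updated at or after `cutoff`, newest first.
/// Assumes all timestamps share the same UTC RFC 3339 layout (see lead-in).
fn recent_first<'a>(tsv: &'a str, cutoff: &str) -> Vec<(&'a str, &'a str)> {
    // Compare only the `YYYY-MM-DDTHH:MM:SS` prefix, ignoring fractional
    // seconds and the offset suffix.
    fn key(ts: &str) -> &str {
        ts.get(..19).unwrap_or(ts)
    }

    let mut rows: Vec<(&str, &str)> = tsv
        .lines()
        // Lines without a tab are silently ignored.
        .filter_map(|line| line.split_once('\t'))
        .filter(|(_, ts)| key(ts) >= key(cutoff))
        .collect();

    // Reverse comparison puts the most recent update first.
    rows.sort_by(|a, b| key(b.1).cmp(key(a.1)));
    rows
}

fn main() {
    let tsv = "serde\t2024-05-01T10:00:00+00:00\n\
               old-crate\t2019-01-01T00:00:00+00:00\n\
               tokio\t2024-06-15T08:30:00.123456+00:00";
    for (name, ts) in recent_first(tsv, "2024-01-01T00:00:00") {
        println!("{name}\t{ts}");
    }
}
```

Parsing the timestamps with `chrono` would be more robust if the input ever mixes offsets; the string comparison is used here only to keep the sketch dependency-free.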
Maybe also make a post on Mastodon with |
Sounds like a good idea. In the meantime here's the list (it's actually .tsv but GitHub didn't like it): crates.csv |
I haven't posted it on Twitter or Mastodon yet, or seen if I'm not sure opening issues this way is doable at this point; for one, we're still just 3 people playing with our toys, figuring out what to do with them. I think I'll dedicate more time to the development side to get something much more usable than the current version, and see if this can also help others, be it in CLI or library form.
|
I've scraped (I'm lazy, I should have used the database dumps) the top 20k crates by recent download count. I've published the list and the script at https://gist.github.com/paolobarbolini/b5101b3ad378bcb6bc5c282349edfd4c.
I'll soon be getting a server from Hetzner with 320 GB of disk, and I'll see if I can go through the entire list without running out of disk space. I'll also use the list as a way of fixing some of the shortcomings that have been reported in other issues.
Before you open issues in the projects you think are affected, investigate the reports thoroughly. This software is still v0.0.1 for a very good reason.