Skip to content

Conversation

dholth
Copy link
Contributor

@dholth dholth commented Aug 15, 2025

Description

Working on adding sharded repodata support, mostly in the Python code of conda-libmamba-solver

Checklist - did you ...

  • Add a file to the news directory (using the template) for the next release's release notes?
  • Add / update necessary tests?
  • Add / update outdated documentation?

@dholth dholth marked this pull request as draft August 15, 2025 14:29
@github-project-automation github-project-automation bot moved this to 🆕 New in 🔎 Review Aug 15, 2025
@conda-bot conda-bot added the cla-signed [bot] added once the contributor has signed the CLA label Aug 15, 2025
@dholth dholth mentioned this pull request Sep 3, 2025
3 tasks
@dholth dholth force-pushed the sharded-experiments branch from adb6d9a to 546cd84 Compare September 11, 2025 19:44
@jaimergp jaimergp linked an issue Sep 14, 2025 that may be closed by this pull request
2 tasks
found = fetch_shards(subdir_data)
if not found:
repodata_json, _ = subdir_data.repo_fetch.fetch_latest_parsed()
found = ShardLike(repodata_json, channel_url) # type: ignore
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was reviewing #697, and I wondered why do we need a ShardLike interface? Then saw this code. Are we going to subset all repodata going forward? That will remove any benefits from the .solv cache in Unix systems that just memory map the file and introduce maybe unneeded overhead? Also in general, reading the full JSON just to subset it, and then write it back, only to be read again by libmamba... I guess it's faster to just pass it through.

So... what if we do not use ShardLike for non-sharded channels and just keep the original repodata.json?

Copy link
Contributor Author

@dholth dholth Sep 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather parse them into mamba PackageRecord and skip json, that's possible right?
I have read blog posts saying memory mapping isn't that big a win on a lot of modern systems.
Using the original cached json path instead of a fresh subset-of-monolithic-repodata would be a simple choice when loading LibMambaIndexHelper; even though to subset the real sharded repodata we also need to compute the subset of fake sharded repodata.
You'd be surprised at how fast Python json.loads() runs, what we must avoid is creating PackageRecord for all the unused dependencies.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would still need to initialize a bunch of Python objects to pass them to libmamba. My point is that if (in the case of non sharded repodata) we already have a JSON on disk ready to pass to the libmamba constructors, why bother doing all these extra steps.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For sharded repodata, we could avoid the JSON writing overhead and construct the Python objects for libmamba straight from the raw dict. And then measure if that's faster. I guess there's a sweet number somewhere we can use as a threshold if the benefits are not clear in all cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope you noticed the part where we don't do any of this if we have no sharded channels.

urls_to_channel = encoded_urls_to_channel

urls_to_json_path_and_state = self._fetch_repodata_jsons(tuple(urls_to_channel.keys()))
if self.in_state:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vs. out of state

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Includes a pickled posix Path object that can't be deserialized on Windows

@dholth dholth force-pushed the sharded-experiments branch from 4d20a84 to c1eafe8 Compare September 23, 2025 23:09
Copy link

@ryanskeith ryanskeith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking over this, it looks reasonable to me. Being overgenerous at first is a good first step. It will be interesting to see how much this will ultimately download.

There is a little clean up in the form of removing print statements and the like but this is still in progress.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed [bot] added once the contributor has signed the CLA
Projects
Status: 🆕 New
Development

Successfully merging this pull request may close these issues.

Modify LibMambaIndexHelper for Sharded Repodata Gathering
5 participants