Sharded experiments #686

dholth · 2025-08-15T14:29:36Z

Description

Working on adding sharded repodata support, mostly in the Python code of conda-libmamba-solver.

Checklist - did you ...

Add a file to the news directory (using the template) for the next release's release notes?
Add / update necessary tests?
Add / update outdated documentation?

jaimergp · 2025-09-17T16:51:39Z

conda_libmamba_solver/shards_subset.py

+            found = fetch_shards(subdir_data)
+            if not found:
+                repodata_json, _ = subdir_data.repo_fetch.fetch_latest_parsed()
+                found = ShardLike(repodata_json, channel_url)  # type: ignore


While reviewing #697, I wondered why we need a ShardLike interface, and then I saw this code. Are we going to subset all repodata going forward? That would remove any benefit from the .solv cache on Unix systems, which just memory-map the file, and may introduce unneeded overhead. More generally, we would be reading the full JSON just to subset it and write it back, only for libmamba to read it again. I guess it's faster to just pass it through.

So... what if we do not use ShardLike for non-sharded channels and just keep the original repodata.json?
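To make the question concrete, here is a minimal sketch of what an adapter in the spirit of ShardLike might do: regroup a monolithic repodata.json by package name so that each name can be served as if it were its own shard. The class name `MonolithicAsShards` and its `fetch_shard` method are hypothetical and are not the PR's implementation; only the `packages` and `packages.conda` keys come from the repodata format.

```python
class MonolithicAsShards:
    """Hypothetical adapter: serve per-package 'shards' from one repodata dict."""

    def __init__(self, repodata: dict, url: str):
        self.url = url
        by_name: dict = {}
        for group in ("packages", "packages.conda"):
            for filename, record in repodata.get(group, {}).items():
                # One pseudo-shard per package name, keeping both package groups.
                shard = by_name.setdefault(
                    record["name"], {"packages": {}, "packages.conda": {}}
                )
                shard[group][filename] = record
        self._by_name = by_name

    def __contains__(self, name: str) -> bool:
        return name in self._by_name

    def fetch_shard(self, name: str) -> dict:
        # Raises KeyError when the channel does not provide this package name.
        return self._by_name[name]
```

The cost raised in the comment is visible in the constructor: the whole document has to be parsed and regrouped before the first shard can be served.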

dholth · 2025-09-17

I would rather parse them into mamba PackageRecord and skip JSON; that's possible, right?
I have read blog posts saying memory mapping isn't that big a win on a lot of modern systems.
Using the original cached JSON path instead of a fresh subset of the monolithic repodata would be a simple choice when loading LibMambaIndexHelper, even though we also need to compute the subset of the fake-sharded repodata in order to subset the real sharded repodata.
You'd be surprised at how fast Python's json.loads() runs; what we must avoid is creating a PackageRecord for every unused dependency.
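The distinction between parsing and object construction can be sketched as follows. `Record` is a hypothetical stand-in for the solver's package record type, and `needed` is assumed to come from a separate dependency traversal; neither name is taken from the PR.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical stand-in for a solver-side package record."""

    filename: str
    name: str
    version: str
    depends: tuple


def records_for(raw_json: str, needed: set) -> list:
    """Parse the whole document, but build objects only for needed names."""
    repodata = json.loads(raw_json)  # yields plain dicts, lists and strings
    records = []
    for group in ("packages", "packages.conda"):
        for filename, info in repodata.get(group, {}).items():
            if info["name"] not in needed:
                continue  # skip unused packages before any object is built
            records.append(
                Record(
                    filename,
                    info["name"],
                    info["version"],
                    tuple(info.get("depends", ())),
                )
            )
    return records
```

Whether this actually beats handing the cached file to libmamba unchanged would still have to be measured, as the thread goes on to discuss.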

jaimergp · 2025-09-17

We would still need to initialize a bunch of Python objects to pass them to libmamba. My point is that if (in the case of non-sharded repodata) we already have a JSON file on disk ready to pass to the libmamba constructors, why bother with all these extra steps?

jaimergp · 2025-09-17

For sharded repodata, we could avoid the JSON writing overhead and construct the Python objects for libmamba straight from the raw dict, and then measure whether that's faster. I guess there's a sweet spot somewhere we can use as a threshold if the benefits are not clear in all cases.

dholth · 2025-09-17

I hope you noticed the part where we don't do any of this if we have no sharded channels.
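The policy being discussed in this thread can be sketched as a small planning function. This illustrates the proposal, not what the PR does: `ChannelInfo`, `plan_loading`, and the `"passthrough"` / `"subset"` labels are all hypothetical names, and the threshold idea is left out because no number has been measured.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChannelInfo:
    """Hypothetical summary of one channel subdir."""

    url: str
    sharded: bool
    cached_json: Optional[str]  # path to a cached repodata.json, if present


def plan_loading(channels: list) -> dict:
    """Map each channel URL to 'passthrough' or 'subset'."""
    if not any(c.sharded for c in channels):
        # No sharded channels at all: keep the existing loading path everywhere.
        return {c.url: "passthrough" for c in channels}
    plan = {}
    for c in channels:
        if not c.sharded and c.cached_json is not None:
            # Proposal: hand the cached JSON file to libmamba unchanged.
            plan[c.url] = "passthrough"
        else:
            # Sharded, or nothing usable on disk: build a repodata subset.
            plan[c.url] = "subset"
    return plan
```

Note that a pass-through channel may still need to be read during dependency traversal, since its packages can be dependencies of packages in the sharded channels.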

dholth · 2025-09-22T16:11:50Z

conda_libmamba_solver/index.py

        urls_to_channel = encoded_urls_to_channel

-        urls_to_json_path_and_state = self._fetch_repodata_jsons(tuple(urls_to_channel.keys()))
+        if self.in_state:


vs. out of state

dholth · 2025-09-23T14:15:06Z

tests/data/in_state.pickle

Includes a pickled PosixPath object, which can't be deserialized on Windows.
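Unpickling fails there because pathlib refuses to instantiate a PosixPath on Windows. One portable alternative, sketched below under the assumption that the fixture is essentially a mapping from URL to cache path, is to store the paths as POSIX-style strings in JSON and rebuild native Path objects on load. The function names `dump_state` and `load_state` are hypothetical.

```python
import json
from pathlib import Path


def dump_state(state: dict) -> str:
    """Serialize a {url: path} mapping with paths stored as POSIX-style strings."""
    return json.dumps({url: Path(p).as_posix() for url, p in state.items()})


def load_state(text: str) -> dict:
    """Rebuild the mapping using the current platform's Path class."""
    return {url: Path(p) for url, p in json.loads(text).items()}
```

This works best for relative paths; absolute paths with drive letters or POSIX roots remain platform-specific whatever the serialization format.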

ryanskeith

Looking over this, it looks reasonable to me. Being overgenerous at first is a good first step. It will be interesting to see how much this will ultimately download.

There is a little cleanup left, such as removing print statements and the like, but this is still in progress.
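A by-name traversal is one way such a subset can end up overgenerous: it follows every dependency name mentioned by any build of a package, without evaluating version constraints. The sketch below illustrates that idea under stated assumptions; `fetch_shard` is a hypothetical callable that returns a shard dict (or `None` if the name is unknown), dependency specs are assumed to start with the package name followed by whitespace, and this is not the PR's algorithm as written.

```python
from collections import deque


def reachable_names(roots, fetch_shard) -> set:
    """Breadth-first walk over package names via 'depends' entries."""
    roots = list(roots)
    seen = set(roots)
    queue = deque(roots)
    while queue:
        name = queue.popleft()
        shard = fetch_shard(name)
        if shard is None:  # name not provided by this channel
            continue
        for group in ("packages", "packages.conda"):
            for record in shard.get(group, {}).values():
                for spec in record.get("depends", ()):
                    dep = spec.split()[0]  # "python >=3.9" -> "python"
                    if dep not in seen:
                        seen.add(dep)
                        queue.append(dep)
    return seen
```

Seeding `roots` with both the requested and the installed package names matches the commit "traverse requested as well as installed packages". The result is a superset of what the solver needs, which trades some extra downloading for a simpler algorithm.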

dholth added 5 commits August 1, 2025 17:18

store in_state on LibMambaIndexHelper

be79742

shards test; make index repos lazy

8138cca

example shards fetchers

41c5a67

download shards index

ddbb80c

use load(binary=True) to check cache against .msgpack.zst file

98cd30b

dholth marked this pull request as draft August 15, 2025 14:29

conda-bot added this to 🔎 Review Aug 15, 2025

github-project-automation bot moved this to 🆕 New in 🔎 Review Aug 15, 2025

conda-bot added the cla-signed label Aug 15, 2025

dholth added 17 commits August 18, 2025 18:11

improve cache handling

7d710f0

typed shards class

36df223

sqlite cache for individual shards

256f117

download and cache 16 random shards

67e0422

disable lazy repos due to bug; find all dependencies mentioned by each shard

750deb7

add ShardLike class to present repodata.json as shards

c52b860

load heterogenous channels in test

b16f879

dependency traversal algorithm

d132bb1

begin parallel "fetch multiple" code

5e51dc7

method to build repodata subset

a80f935

adjust algorithm

76a8d79

move some functionality out of tests and into main package

4bfb88b

exclude type checking block from coverage

7481c0a

parallelize shard traversal algorithm; add bugs

0f3ac53

update unit tests

c08a844

additional test

87a1f66

sqlite3-based shard cache

53770e0

dholth mentioned this pull request Sep 3, 2025

sqlite3-based shard cache #695

Merged

3 tasks

dholth added 3 commits September 3, 2025 10:16

add news item

4f910b6

pre-commit changes

ed734bc

Update conda_libmamba_solver/shard_cache.py

3def95e

dholth added 3 commits September 10, 2025 10:46

Merge branch 'shards-model' into sharded-experiments

5402089

consolidate __contains__ method

ef60c02

100% code coverage on shards

58f5e11

This was referenced Sep 10, 2025

Modify LibMambaIndexHelper for Sharded Repodata Gathering #684

Open

Dependency-Based Repodata Subsetting Algorithm #707

Open

dholth added 2 commits September 10, 2025 15:56

don't use pickle

337e7eb

move dependency traversal into shards_subset module

546cd84

dholth force-pushed the sharded-experiments branch from adb6d9a to 546cd84 Compare September 11, 2025 19:44

dholth added 5 commits September 11, 2025 15:59

call build_repodata_subset in LibMambaIndexHelper

a2d7e80

Merge remote-tracking branch 'origin/main' into sharded-experiments

ada9ac4

pass repodata subset to solver

5c0ed58

get package names correctly

c274efe

traverse requested as well as installed packages

a00abe2

jaimergp linked an issue Sep 14, 2025 that may be closed by this pull request

Modify LibMambaIndexHelper for Sharded Repodata Gathering #684

Open

2 tasks

jaimergp reviewed Sep 17, 2025

jaimergp reviewed Sep 17, 2025

dholth and others added 4 commits September 17, 2025 14:14

type checking change

db43fd9

Merge remote-tracking branch 'origin/main' into sharded-experiments

2b00db2

Update conda_libmamba_solver/shards_subset.py

9ca17cc

Co-authored-by: jaimergp <[email protected]>

explain shards algorithm

9ccb696

dholth commented Sep 22, 2025

additional shard fetch error tests

4b741cc

dholth commented Sep 23, 2025

begin bulk fetch from cache

c1eafe8

dholth force-pushed the sharded-experiments branch from 4d20a84 to c1eafe8 Compare September 23, 2025 23:09

ryanskeith reviewed Sep 24, 2025

dholth added 4 commits September 25, 2025 10:00

more efficient retrieve-from-sqlite function

906136b

refactor names; store cache headers in loop

f656cdc

environment variable to enable sharded

630caa5

move typeddict declarations into own module

9dd6158
