Sharded experiments #686
Conversation
```python
found = fetch_shards(subdir_data)
if not found:
    repodata_json, _ = subdir_data.repo_fetch.fetch_latest_parsed()
    found = ShardLike(repodata_json, channel_url)  # type: ignore
```
I was reviewing #697 and wondered why we need a ShardLike interface; then I saw this code. Are we going to subset all repodata going forward? That would remove any benefit from the .solv cache on Unix systems, which just memory-maps the file, and introduce possibly unneeded overhead. Also, in general: reading the full JSON just to subset it, writing it back out, and then having libmamba read it again... I suspect it's faster to just pass it through.
So... what if we do not use ShardLike for non-sharded channels and just keep the original repodata.json?
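A minimal sketch of that proposal, assuming a hypothetical dispatch helper (`repodata_source` and its parameters are illustrative names, not the PR's actual API):

```python
from pathlib import Path

# Sketch: for a non-sharded channel, hand the solver the cached
# repodata.json path unchanged instead of wrapping it in ShardLike.
# `repodata_source` and `sharded_subset` are hypothetical names.
def repodata_source(cache_path, sharded_subset):
    if sharded_subset is None:      # non-sharded channel
        return cache_path           # libmamba parses the file directly
    return sharded_subset           # sharded: pass the computed subset

print(repodata_source(Path("noarch/repodata.json"), None))
```

The idea is that the branch costs nothing for sharded channels, while non-sharded channels skip the parse/subset/re-serialize round trip entirely.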
I would rather parse them into mamba PackageRecord objects and skip JSON entirely; that's possible, right? I have read blog posts saying memory mapping isn't that big a win on many modern systems.
Using the original cached JSON path instead of a fresh subset-of-monolithic-repodata would be a simple choice when loading LibMambaIndexHelper, even though to subset the real sharded repodata we also need to compute the subset of the fake sharded repodata.
You'd be surprised how fast Python's json.loads() runs; what we must avoid is creating a PackageRecord for every unused dependency.
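The "only materialize what's reachable" idea can be sketched as a dependency-closure walk over raw dicts; nothing becomes a record object until it's known to be needed. The data and `subset` helper below are simplified stand-ins, not the real conda/libmamba structures:

```python
import json

# Toy repodata: raw dicts, as json.loads() would produce them.
repodata = json.loads(json.dumps({
    "packages": {
        "numpy-1.0.tar.bz2": {"name": "numpy", "depends": []},
        "scipy-1.0.tar.bz2": {"name": "scipy", "depends": ["numpy"]},
        "unrelated-1.0.tar.bz2": {"name": "unrelated", "depends": []},
    }
}))

def subset(repodata, roots):
    """Return only entries whose name is reachable from `roots`."""
    wanted, out = set(roots), {}
    while wanted:
        name = wanted.pop()
        for fn, rec in repodata["packages"].items():
            if rec["name"] == name and fn not in out:
                out[fn] = rec
                # First token of a conda dep spec is the package name.
                wanted.update(d.split()[0] for d in rec["depends"])
    return out

print(sorted(subset(repodata, ["scipy"])))
# → ['numpy-1.0.tar.bz2', 'scipy-1.0.tar.bz2']
```

Only the two reachable entries would ever be turned into PackageRecord objects; "unrelated" stays an untouched dict.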
We would still need to initialize a bunch of Python objects to pass to libmamba. My point is that if (in the non-sharded case) we already have a JSON file on disk ready to pass to the libmamba constructors, why bother with all these extra steps?
For sharded repodata, we could avoid the JSON-writing overhead and construct the Python objects for libmamba straight from the raw dict, and then measure whether that's actually faster. I guess there's a sweet spot somewhere we could use as a threshold if the benefits aren't clear in all cases.
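A rough sketch of the measurement being suggested, comparing a JSON round trip against working from the raw dict directly. The data and iteration counts are illustrative, not from the PR:

```python
import json
import timeit

# Toy monolithic repodata with 1000 small package entries.
raw = {"packages": {f"pkg-{i}.tar.bz2": {"name": f"pkg{i}", "depends": []}
                    for i in range(1000)}}

def via_json():
    # Serialize and re-parse, as the write-then-read path would.
    return json.loads(json.dumps(raw))["packages"]

def via_dict():
    # Work straight from the raw dict (shallow copy stands in for
    # building objects directly).
    return dict(raw["packages"])

t_json = timeit.timeit(via_json, number=50)
t_dict = timeit.timeit(via_dict, number=50)
print(t_dict < t_json)  # → True
```

Repeating this at several repodata sizes would show where (or whether) a size threshold for the shortcut is worthwhile.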
I hope you noticed the part where we don't do any of this if we have no sharded channels.
```python
urls_to_channel = encoded_urls_to_channel
urls_to_json_path_and_state = self._fetch_repodata_jsons(tuple(urls_to_channel.keys()))
if self.in_state:
```
vs. out of state
This includes a pickled PosixPath object, which can't be deserialized on Windows.
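A concrete sketch of the portability issue and one common workaround: pathlib's concrete `PosixPath` can only be instantiated on POSIX, so a pickle containing one fails to load on Windows. Storing a plain string in the cached state and rebuilding the `Path` on the reading side keeps the blob portable (the `state` dict is illustrative, not the PR's actual cache format):

```python
import pickle
from pathlib import Path

# Pickling a concrete Path embeds the platform class (PosixPath or
# WindowsPath); unpickling on the other platform raises. Storing a
# string sidesteps that.
state = {"json_path": str(Path("/tmp/cache/repodata.json"))}
blob = pickle.dumps(state)

restored = pickle.loads(blob)
# Reconstruct a Path lazily, on whatever platform reads the cache.
path = Path(restored["json_path"])
print(path.name)  # → repodata.json
```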
Looking over this, it seems reasonable to me. Being over-generous at first is a good first step. It will be interesting to see how much this ultimately downloads.
There is a little cleanup left, in the form of removing print statements and the like, but this is still in progress.
Description
Working on adding sharded repodata support, mostly in the Python code of conda-libmamba-solver.
Checklist - did you ...
- Add a file to the `news` directory (using the template) for the next release's release notes?