[feat] caching on demand #1653

Open
wants to merge 6 commits into
base: master


@AlbertDeFusco
Contributor

AlbertDeFusco commented Jul 24, 2024

This PR addresses a use case where I want filecache or simplecache to download a file for me and then give me the local path. I may also be working with extremely large files, so I want to avoid calling .read(), which loads the whole file into memory after it has been cached.

Todo:

  • Tests for the caching method
  • Docs

A new method, .cache_path(path), has been added to invoke caching and return the local filename (I'm open to a better name).

Here's an example with simplecache; the same works for filecache.

In [1]: import fsspec

In [2]: url = 'https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q5_K_M.gguf?download=True'

In [3]: fs, path = fsspec.url_to_fs(f"simplecache::{url}")

In [4]: from fsspec.callbacks import TqdmCallback

In [5]: fs.cache_path(path, callback=TqdmCallback(), force=True)
100%|██████████████████████████████████████████████████████████████████| 5131409056/5131409056 [02:56<00:00, 29001242.99it/s]
Out[5]: '/var/folders/cg/wyhz9cvx06vdt65k8tr31trc0000gp/T/tmp0ty9rygr/b6d84e10fdf36fa50acdf6e717b1e6ae2814efa70869edefe2d5d4a7182a11fb'

In [6]: fs.cache_path(path, callback=TqdmCallback())
Out[6]: '/var/folders/cg/wyhz9cvx06vdt65k8tr31trc0000gp/T/tmp0ty9rygr/b6d84e10fdf36fa50acdf6e717b1e6ae2814efa70869edefe2d5d4a7182a11fb'
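
For readers without the PR branch, the behaviour in the session above (download on the first call, return the existing path on the second unless force=True) can be sketched with the standard library alone. This is only an illustration under stated assumptions, not the implementation in this PR: the standalone cache_path function, its source and cache_dir parameters, and the use of a local file as the "remote" are all hypothetical, and naming the cached copy by a SHA-256 hash of the source path is an assumption based on the 64-character hex names in the output.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def cache_path(source: Path, cache_dir: Path, force: bool = False) -> Path:
    """Copy `source` into `cache_dir` if needed and return the cached path.

    Hypothetical stand-in for the proposed method: the "remote" here is
    just another local file.
    """
    # Assumption: name the cached copy by a hash of the source path.
    target = cache_dir / hashlib.sha256(str(source).encode()).hexdigest()
    if force or not target.exists():
        # Stream in 1 MiB chunks so a large file is never held in memory.
        with source.open("rb") as src, target.open("wb") as dst:
            shutil.copyfileobj(src, dst, length=1024 * 1024)
    return target


# Set up a fake "remote" file and an empty cache directory.
tmp = Path(tempfile.mkdtemp())
remote = tmp / "remote.bin"
remote.write_bytes(b"x" * 10)
cache_dir = tmp / "cache"
cache_dir.mkdir()

first = cache_path(remote, cache_dir, force=True)  # always copies
second = cache_path(remote, cache_dir)             # reuses the cached copy
print(first == second, first.stat().st_size)
```

The point of returning a path rather than a file object is that the caller can hand it to any tool that expects a filename, without the contents ever passing through .read().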


@martindurant
Member

Where does the cached file go? Is there a way to get the equivalent cached filesystem, so you can interact with the local cached files?

I wonder what the connection with fsspec.open_local should be.

@martindurant
Member

(A separate issue, which I think was mentioned elsewhere, is whether there should be an Intake "cache" reader that acts on any filetype in exactly this way and returns a datatype object of the same type as the original but with the appropriate local path.)

@AlbertDeFusco
Contributor Author

Oh, I see that, from the outside, open_local does what I'm looking for. It ends up calling open_many and then downloads the file to the cache. I was unaware of this function.

In [2]: import fsspec

In [3]: import fsspec.callbacks

In [4]: fsspec.callbacks.DEFAULT_CALLBACK = fsspec.callbacks.TqdmCallback()

In [5]: url = "github://albertdefusco:datasets@main/auto-mpg.csv"

In [6]: fn = fsspec.open_local(f"simplecache::{url}")
100%|███████████████████████████████████████████████████████████████████████████| 21021/21021 [00:00<00:00, 107260905.58it/s]

In [7]: fn
Out[7]: '/var/folders/cg/wyhz9cvx06vdt65k8tr31trc0000gp/T/tmpu9_vcbpt/5a6fcc477034509ca56b58f0c0db6a17baa534ecf3c31fde04ddda8ee9a0f7e8'
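
The same call can be tried offline, and it also shows where the cached file goes. The sketch below is a minimal illustration, assuming fsspec's in-memory filesystem as a stand-in for the remote source; the /demo/auto-mpg.csv path and its two-line payload are made up for the example. Passing cache_storage through the simplecache options pins the cache directory instead of leaving it in a fresh temporary folder.

```python
import tempfile

import fsspec

# Stand-in for a remote file: a few bytes on fsspec's in-memory filesystem.
payload = b"mpg,cylinders\n18.0,8\n"
fsspec.filesystem("memory").pipe_file("/demo/auto-mpg.csv", payload)

# Pin the cache directory rather than relying on the default temporary one.
cache_dir = tempfile.mkdtemp()

# open_local copies the file into the cache and returns the local filename.
fn = fsspec.open_local(
    "simplecache::memory://demo/auto-mpg.csv",
    simplecache={"cache_storage": cache_dir},
)

# The cached copy is an ordinary file, so it can be streamed with open().
with open(fn, "rb") as f:
    first_line = f.readline()
print(fn, first_line)
```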

@AlbertDeFusco
Contributor Author

You're right that there is something missing when working with more than one file. I see where having a local fs for the cached items would be useful. I'll give it some thought.

In [1]: import fsspec

In [2]: import fsspec.callbacks

In [3]: fsspec.callbacks.DEFAULT_CALLBACK = fsspec.callbacks.TqdmCallback()

In [4]: url = "github://albertdefusco:datasets@main/weather/*.csv"

In [5]: paths = fsspec.open_local(f"filecache::{url}")
100%|███████████████████████████████████████████████████████████████████████████| 39576/39576 [00:00<00:00, 137868583.97it/s]
79382it [00:00, 291807397.13it/s]                                                                                            
119207it [00:00, 344820963.40it/s]                                                                                           
159723it [00:00, 368902432.70it/s]                                                                                           
198669it [00:00, 410481862.75it/s]                                                                                           

In [6]: paths
Out[6]: 
['/var/folders/cg/wyhz9cvx06vdt65k8tr31trc0000gp/T/tmpgseqn860/5955f07f6293ad9a465f38ed02edec0131ce19985be9ad2c443ca8184b2b1065',
 '/var/folders/cg/wyhz9cvx06vdt65k8tr31trc0000gp/T/tmpgseqn860/0c501daeded3e12aa5c2e473a8dc79893dc38cdeecf84a7581fe44513d11e747',
 '/var/folders/cg/wyhz9cvx06vdt65k8tr31trc0000gp/T/tmpgseqn860/2e40e4c868226583370492e940167bdff8f309c46bb16f1794126ccbd0f13f38',
 '/var/folders/cg/wyhz9cvx06vdt65k8tr31trc0000gp/T/tmpgseqn860/58c6f6669cd55443cca8722320a0f0cb13340be88e1049b0c3f8ba8fa722113b',
 '/var/folders/cg/wyhz9cvx06vdt65k8tr31trc0000gp/T/tmpgseqn860/d7f5750dd42f0200576c8b039a995c981c1d603f2f6dbd6532f9b959d2dd5540']
