Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support opening from URLs #66

Open
benjeffery opened this issue Jun 7, 2022 · 11 comments
Open

Support opening from URLs #66

benjeffery opened this issue Jun 7, 2022 · 11 comments

Comments

@benjeffery
Copy link
Member

Discussion at tskit-dev/tskit#1566 (comment)

@hyanwong
Copy link
Member

hyanwong commented Jun 8, 2022

If this is implemented, perhaps we should put a note in the tskit docs for tskit.load and ts.dump to say that these methods are primarily intended for use on local files, and if your intention is to make a tree sequence file available for download on the internet, or to download one remotely, you are recommended to use tszip to (de)compress and load from URLs?

@jeromekelleher
Copy link
Member

jeromekelleher commented Jun 8, 2022

To implement this we'd need to

  1. Update the load_zarr to read from a path or a file. If it'sa file, we'd have to copy first to a local file and then feed the path to zarr.ZipStore (as this is all it supports).
  2. Maybe use fsspec to do URL loading for us (as this is already a dependency of zarr)

It'll be fiddly, unfortunately.

@hyanwong
Copy link
Member

To increase the fiddlyness, it would be really helpful, I think, to be able to show progress when downloading, if at all possible. Even if we don't know the file size beforehand, something that tells the user that the session hasn't just stalled is pretty useful for teaching purposes.

@jeromekelleher
Copy link
Member

That's surely feature creep - why not put in a bash cell that does the download to a local file using curl?

@hyanwong
Copy link
Member

ISWYM about feature creep. But how many tskit users (not devs) know about curl and bash? And do we even want them to know about that before they get started? We provide progress bars for tsinfer to give feedback too.

I guess this could be a Zarr thing anyway. Presumably remote access to data, and feedback about time to complete is on their agenda?

@jeromekelleher
Copy link
Member

ISWYM about feature creep. But how many tskit users (not devs) know about curl and bash?

They don't need to for your use case though right, either way it's just a cell in the notebook that they execute which leads to you having a TreeSequence object loaded.

@hyanwong
Copy link
Member

hyanwong commented Jun 15, 2022

Yep, at the moment I'm just doing this in a cell:

import urllib.request
from tqdm import tqdm
import tszip

class DownloadProgressBar(tqdm):
    def update_to(self, b=1, bsize=1, tsize=None):
        if tsize is not None:
            self.total = tsize
        self.update(b * bsize - self.n)

url = "https://zenodo.org/record/5512994/files/hgdp_tgp_sgdp_high_cov_ancients_chr2_q.dated.trees.tsz"
with DownloadProgressBar(unit='B', unit_scale=True,
        miniters=1, desc=url.split('/')[-1]) as t:
    temporary_filename, _ = urllib.request.urlretrieve(url, reporthook=t.update_to)
ts_2q = tszip.decompress(temporary_filename)
urllib.request.urlcleanup() # remove temporary_filename

But it would be much cleaner to wrap that somehow

ts_2q = tszip.decompress(url=url)

@jeromekelleher
Copy link
Member

I just tried out by bash magic idea and it'd didn't work because there's no "live" update from the cell, and so you only get the download progress at the very end. So you would have to do this via a python package of some sort.

@hyanwong
Copy link
Member

hyanwong commented Jun 15, 2022

The tqdm code above works a treat. But it's still a bit verbose, and users might baulk at having to understand it. It's not that satisfying to say "just paste this code and ignore how it works". So anything that would help wrap this into a more terse and comprehensible syntax would be good, I think. Perhaps @benjeffery has a good suggestion (he usually does!). Personally I don't think it's too bad to have tqdm as a tszip dependency. You could imagine, for instance, defining something like the DownloadProgressBar class as a tszip helper:

url = "https://zenodo.org/record/5512994/files/hgdp_tgp_sgdp_high_cov_ancients_chr2_q.dated.trees.tsz"
with tszip.progressbar() as pbar:
    tmpname, _ = urllib.request.urlretrieve(url, reporthook=pbar.update)
    ts = tszip.decompress(tmpname)
    urllib.request.urlcleanup()

is already a lot cleaner IMO. But maybe there is an even terser way to do it?

@jeromekelleher
Copy link
Member

It makes no sense to add a general progress bar UI to a package that's for compressing tskit tree sequences. What you're looking for is a python package that does a download with an integrated progress bar (which I agree would be very useful):

import yanspackage

url = "https://zenodo.org/record/5512994/files/hgdp_tgp_sgdp_high_cov_ancients_chr2_q.dated.trees.tsz"
filename = yanspackage.download(url, progress="notebook")

@hyanwong
Copy link
Member

hyanwong commented Jun 16, 2022

OK, but I'm thinking of tszip as not just a compression package, but "a package for compressing and decompressing tree sequences, including from remote sites". Maybe that's feature creep, but as I say, it would be useful for teaching (and probably research too).

(FWIW for learning stuff, I rather dislike having to download files separately, before doing analysis, then fiddling around with coding where the files are stored, etc. I would much prefer it to appear as if I have streamed the download directly into the variables in my python session, and not have to think about clearing up disk space afterwards, or dealing with tmp directories. Perhaps I'm unusual like that, though?)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants