Support opening from URLs #66

benjeffery · 2022-06-07T09:28:35Z

Discussion at tskit-dev/tskit#1566 (comment)

hyanwong · 2022-06-08T11:29:52Z

If this is implemented, perhaps we should put a note in the tskit docs for tskit.load and ts.dump to say that these methods are primarily intended for use on local files, and if your intention is to make a tree sequence file available for download on the internet, or to download one remotely, you are recommended to use tszip to (de)compress and load from URLs?

jeromekelleher · 2022-06-08T12:54:38Z

To implement this we'd need to:

Update load_zarr to read from a path or a file. If it's a file, we'd have to copy it to a local file first and then feed the path to zarr.ZipStore (as this is all it supports).
Maybe use fsspec to do URL loading for us (as this is already a dependency of zarr).

It'll be fiddly, unfortunately.
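
A rough sketch of the copy-to-a-local-file step described above, assuming the loader should accept either a path or an open binary file. The helper name as_local_path is hypothetical and not part of tszip; the path it yields is what would then be handed to zarr.ZipStore.

```python
import contextlib
import os
import shutil
import tempfile


@contextlib.contextmanager
def as_local_path(source):
    """Yield a filesystem path for `source` (hypothetical helper).

    `source` may be a path (str or os.PathLike) or a binary file-like
    object. File-like objects are copied to a temporary file, which is
    removed again when the context exits.
    """
    if isinstance(source, (str, os.PathLike)):
        # Already a path: nothing to copy and nothing to clean up.
        yield os.fspath(source)
        return
    fd, tmp_path = tempfile.mkstemp(suffix=".tsz")
    try:
        with os.fdopen(fd, "wb") as tmp:
            # Stream the contents to disk without reading it all into memory.
            shutil.copyfileobj(source, tmp)
        yield tmp_path
    finally:
        os.remove(tmp_path)
```

Using a context manager keeps the cleanup in one place, so the caller never has to track the temporary file itself.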

hyanwong · 2022-06-14T20:34:58Z

To increase the fiddlyness, it would be really helpful, I think, to be able to show progress when downloading, if at all possible. Even if we don't know the file size beforehand, something that tells the user that the session hasn't just stalled is pretty useful for teaching purposes.
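
As a minimal sketch of what "progress without knowing the size" could look like, assuming the download is exposed as a readable binary stream: the function name copy_with_progress and its parameters are illustrative only. When no total is known it just reports the running byte count, which is enough to show the session hasn't stalled.

```python
import sys


def copy_with_progress(src, dst, total=None, chunk_size=64 * 1024, out=sys.stderr):
    """Copy `src` to `dst` in chunks, reporting the bytes copied so far.

    `total` is the expected size in bytes, or None if it is unknown.
    Returns the number of bytes copied.
    """
    copied = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        copied += len(chunk)
        if total:
            msg = f"\r{copied}/{total} bytes ({100 * copied / total:.0f}%)"
        else:
            # Size unknown: a running count still shows that data is arriving.
            msg = f"\r{copied} bytes"
        out.write(msg)
        out.flush()
    out.write("\n")
    return copied
```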

jeromekelleher · 2022-06-15T08:32:30Z

That's surely feature creep - why not put in a bash cell that does the download to a local file using curl?

hyanwong · 2022-06-15T08:41:13Z

ISWYM about feature creep. But how many tskit users (not devs) know about curl and bash? And do we even want them to know about that before they get started? We provide progress bars for tsinfer to give feedback too.

I guess this could be a Zarr thing anyway. Presumably remote access to data, and feedback about time to complete is on their agenda?

jeromekelleher · 2022-06-15T11:24:13Z

> ISWYM about feature creep. But how many tskit users (not devs) know about curl and bash?

They don't need to for your use case though, right? Either way, it's just a cell in the notebook that they execute, which leads to you having a TreeSequence object loaded.

hyanwong · 2022-06-15T12:28:09Z

Yep, at the moment I'm just doing this in a cell:

import urllib.request
from tqdm import tqdm
import tszip

class DownloadProgressBar(tqdm):
    # Adapter so tqdm can be driven by urlretrieve's
    # reporthook(block_count, block_size, total_size) callback
    def update_to(self, b=1, bsize=1, tsize=None):
        if tsize is not None:
            self.total = tsize
        # update() takes an increment, so subtract what is already counted
        self.update(b * bsize - self.n)

url = "https://zenodo.org/record/5512994/files/hgdp_tgp_sgdp_high_cov_ancients_chr2_q.dated.trees.tsz"
with DownloadProgressBar(unit='B', unit_scale=True,
        miniters=1, desc=url.split('/')[-1]) as t:
    # With no filename given, urlretrieve downloads to a temporary file
    temporary_filename, _ = urllib.request.urlretrieve(url, reporthook=t.update_to)
ts_2q = tszip.decompress(temporary_filename)
urllib.request.urlcleanup() # remove temporary_filename

But it would be much cleaner to wrap that somehow:

ts_2q = tszip.decompress(url=url)
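
One way such a wrapper might look, as a sketch only: load_from_url is a hypothetical helper, not an existing tszip function. It takes the loader as an argument, so nothing here depends on tszip or tqdm directly, and it assumes the loader reads everything it needs before returning, because the temporary file is deleted straight afterwards.

```python
import os
import tempfile
import urllib.request


def load_from_url(url, loader, reporthook=None):
    """Download `url` to a temporary file and return `loader(path)`.

    `reporthook` is passed straight to urllib.request.urlretrieve, so a
    tqdm-style update_to(block_count, block_size, total_size) works.
    The temporary file is removed even if the download or loader fails.
    """
    fd, tmp_path = tempfile.mkstemp()
    os.close(fd)  # urlretrieve reopens the file by name
    try:
        urllib.request.urlretrieve(url, filename=tmp_path, reporthook=reporthook)
        return loader(tmp_path)
    finally:
        os.remove(tmp_path)
```

With the progress bar class from the cell above, the call would then be something like ts_2q = load_from_url(url, tszip.decompress, reporthook=t.update_to).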

jeromekelleher · 2022-06-15T13:09:23Z

I just tried out my bash magic idea and it didn't work, because there's no "live" update from the cell, and so you only get the download progress at the very end. So you would have to do this via a Python package of some sort.

hyanwong · 2022-06-15T22:06:49Z

The tqdm code above works a treat. But it's still a bit verbose, and users might baulk at having to understand it. It's not that satisfying to say "just paste this code and ignore how it works". So anything that would help wrap this into a more terse and comprehensible syntax would be good, I think. Perhaps @benjeffery has a good suggestion (he usually does!). Personally I don't think it's too bad to have tqdm as a tszip dependency. You could imagine, for instance, defining something like the DownloadProgressBar class as a tszip helper:

url = "https://zenodo.org/record/5512994/files/hgdp_tgp_sgdp_high_cov_ancients_chr2_q.dated.trees.tsz"
with tszip.progressbar() as pbar:
    tmpname, _ = urllib.request.urlretrieve(url, reporthook=pbar.update)
    ts = tszip.decompress(tmpname)
    urllib.request.urlcleanup()

is already a lot cleaner IMO. But maybe there is an even terser way to do it?

jeromekelleher · 2022-06-16T08:09:41Z

It makes no sense to add a general progress bar UI to a package that's for compressing tskit tree sequences. What you're looking for is a python package that does a download with an integrated progress bar (which I agree would be very useful):

import yanspackage

url = "https://zenodo.org/record/5512994/files/hgdp_tgp_sgdp_high_cov_ancients_chr2_q.dated.trees.tsz"
filename = yanspackage.download(url, progress="notebook")

hyanwong · 2022-06-16T08:45:56Z

OK, but I'm thinking of tszip as not just a compression package, but "a package for compressing and decompressing tree sequences, including from remote sites". Maybe that's feature creep, but as I say, it would be useful for teaching (and probably research too).

(FWIW for learning stuff, I rather dislike having to download files separately, before doing analysis, then fiddling around with coding where the files are stored, etc. I would much prefer it to appear as if I have streamed the download directly into the variables in my python session, and not have to think about clearing up disk space afterwards, or dealing with tmp directories. Perhaps I'm unusual like that, though?)

benjeffery mentioned this issue Jun 7, 2022

Support loading from URLs in Python tskit-dev/tskit#1566

Open
