Expose files in /data to the internet #116

Closed
hyanwong opened this issue Jul 10, 2021 · 15 comments

Comments

@hyanwong
Member

In conjunction with tskit-dev/tskit#1566, it would be useful for the tree sequence files stored in /data to be available at https://tskit.dev/tutorials/data/my_file.trees. We could use the GitHub raw version as a URL instead (e.g. https://github.com/tskit-dev/tutorials/blob/main/data/basics.trees?raw=true), but I think a strange GitHub URL might be confusing to a new user.

@hyanwong
Member Author

hyanwong commented Jul 10, 2021

Once we've done this, I think we should swap tutorial content of the form

tskit.load("data/basics.trees")

for

tskit.load(url="https://tskit.dev/tutorials/data/basics.trees")

This will allow users to run the notebook code exactly as it appears in the tutorial. I don't know if this will have an impact on CI testing the tutorial content, but I think it should be OK to rely on a file existing on the internet for compiling the tutorials.

@jeromekelleher
Member

jeromekelleher commented Jul 10, 2021

I'm not particularly keen on this idea of making tskit.load accept URLs - it's a lot of work for very marginal benefit.

@jeromekelleher
Member

Seems perfectly reasonable to copy the data files to the webserver, though, so they can be downloaded.

@hyanwong
Member Author

hyanwong commented Jul 10, 2021

Easy enough to do this for the time being, anyway:

from urllib.request import urlopen
import tskit
ts = tskit.load(urlopen("https://tskit.dev/tutorials/data/basics.trees"))

I'm not sure if that feels like cluttering up the top of the tutorials with an extra complicated import, but making the tutorial material easily runnable by e.g. a student seems like a big win to me.

@jeromekelleher
Member

You could just put in a note somewhere saying "if you want to follow along, use this pattern". I don't think we should put all that urllib stuff into every call; that'll really confuse people (they'll think you have to do this).

@hyanwong
Member Author

hyanwong commented Jul 11, 2021

> You could just put in a note somewhere saying "if you want to follow along, use this pattern". I don't think we should put all that urllib stuff into every call; that'll really confuse people (they'll think you have to do this).

True. But I'd really like it for people to be able to execute the notebook (or the code in it) themselves, without having a lot of sorting out to do to get it to work. I guess we could say "download the data directory to the place from where you are running the jupyter notebook", but even that's a bit technical for the average student.

Just specifying a URL instead would be so much easier, which is why I'm so keen on wrapping the urlopen call within the tskit.load function, although I know you're not keen (tskit-dev/tskit#1566).

But perhaps you have other suggestions for how to make it easy for students to run the tutorial material themselves, now that we've taken away the binder possibility? I do think we want the code to run as-is rather than requiring the user to modify it, though. I think one of the main points of a tutorial is that someone can follow it along on their own machine, rather than just reading what we've written, so we do need to make it easy to do this.

@jeromekelleher
Member

IMO---after investing a fair bit of time into this for the msprime docs---we can either have something that links densely up with other tutorials and documentation and has a nice narrative structure, with unimportant details hidden OR have something that can be easily run in isolation as a notebook. They're just not compatible.

Let's do one thing well.

@hyanwong
Member Author

Fair enough - but I do think the tutorials are a different use-case to the docs. The tutorials are meant to be executed and experimented with by the user, surely? Otherwise they are just additional documentation, not true tutorials?

So for the tutorials I suggest that we should provide a separate (derived?) notebook that "just works", and which can be downloaded and run in isolation by a user - I guess it only needs the code cells, and would simply swap any tskit.load commands with ones that get the correct files from a source URL rather than from a local path. I can open a separate issue about this?
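As a rough illustration of what "swapping the tskit.load commands" in a derived notebook could look like, here is a minimal sketch that rewrites the source of a single code cell. The function name `use_remote_data`, the `BASE_URL` constant and the regular expression are all hypothetical; nothing like this exists in the repository, and it only handles the simple `tskit.load("data/....trees")` form used in this thread.

```python
import re

# Assumed base URL, taken from the proposal at the top of this issue.
BASE_URL = "https://tskit.dev/tutorials/"

# Matches tskit.load("data/<name>.trees") with single or double quotes.
LOCAL_LOAD = re.compile(r"""tskit\.load\(\s*(["'])(data/[^"']+\.trees)\1\s*\)""")


def use_remote_data(cell_source: str, base_url: str = BASE_URL) -> str:
    """Rewrite local tskit.load calls in one code cell to fetch from base_url."""

    def swap(match: re.Match) -> str:
        # group(2) is the relative path, e.g. data/basics.trees
        return f'tskit.load(urlopen("{base_url}{match.group(2)}"))'

    rewritten, count = LOCAL_LOAD.subn(swap, cell_source)
    # Only add the import if something was rewritten and it is not already there.
    if count and "urlopen" not in cell_source:
        rewritten = "from urllib.request import urlopen\n" + rewritten
    return rewritten


cell = 'ts = tskit.load("data/basics.trees")'
print(use_remote_data(cell))
```

Cells that contain no local load are returned unchanged, so the same function could in principle be mapped over every code cell of a notebook.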

@jeromekelleher
Member

This sounds like an awful lot of work to me - I'm happy that people can copy the code cells into their own notebook, if they like (there's a nice copy button). I agree it would be nice if people could run the code themselves, but the technology doesn't really work right now, so (a) it's more important that the version you read works well (I'm assuming more people will read these than try to execute them); and (b) we don't try to maintain two separate versions.

@hyanwong
Member Author

Yeah, I'd be happy with just copying too, but it would help an awful lot if we could use the URL form for loading a tree sequence in this case (or, as you say, at the top of every tutorial, repeat the mantra that they will need to do from urllib.request import urlopen; tskit.load(urlopen("https://tskit.dev/some/random/path.trees")) for each trees file that we use if they want to run the tutorial themselves)

@hyanwong
Member Author

hyanwong commented Jul 16, 2021

OK, here's another workaround. In the intro page we can say that to follow along with the tutorials in your own notebook, you can download the files in data/ from GitHub to the directory in which Jupyter is running. We can even give instructions to do this from within the running Jupyter notebook:

import json
import os
import urllib.request

# Make a "data" directory within the current folder (no error if it already exists)
os.makedirs("data", exist_ok=True)
# Get the list of all files in the repository from the GitHub API
api_url = "https://api.github.com/repos/tskit-dev/tutorials/git/trees/main?recursive=1"
with urllib.request.urlopen(api_url) as info:
    charset = info.info().get_content_charset("utf-8")
    files = json.loads(info.read().decode(charset))["tree"]
# Save each .trees file found under data/ to the local data directory
for path in (file["path"] for file in files):
    if path.startswith("data/") and path.endswith(".trees"):
        urllib.request.urlretrieve("https://raw.githubusercontent.com/tskit-dev/tutorials/main/" + path, path)

For the moment, I'm strongly minded to add this code snippet to the tutorials/intro.html page, perhaps as a Note or under a section "Running tutorial code". The snippet above will presumably become less complicated when the files can be downloaded from https://tskit.dev/tutorials/data/xxx rather than GitHub, especially if we can provide a file list somewhere.
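If it helps when reviewing the snippet, the filtering step can be pulled out into a small pure function and checked without any network access. This is only a sketch: `trees_paths` is a hypothetical name, and the `sample` dictionary is made-up data that mimics the shape of the GitHub git-trees API response (a "tree" list whose entries carry "path" and "type").

```python
def trees_paths(listing: dict, prefix: str = "data/") -> list:
    """Return the sorted paths of .trees files under prefix in a git-trees listing."""
    return sorted(
        entry["path"]
        for entry in listing.get("tree", [])
        if entry.get("type") == "blob"  # skip directories ("tree" entries)
        and entry["path"].startswith(prefix)
        and entry["path"].endswith(".trees")
    )


# Made-up listing for illustration only; the real one comes from the API call.
sample = {
    "tree": [
        {"path": "data", "type": "tree"},
        {"path": "data/basics.trees", "type": "blob"},
        {"path": "data/notes.txt", "type": "blob"},
        {"path": "intro.md", "type": "blob"},
    ]
}
print(trees_paths(sample))
```

Keeping the selection separate from the download also means the same function would still work if the file list were later served from the website instead of the GitHub API, provided it had the same shape.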

We don't actually need them to type all that code in anyway. We can simply use the IPython %load magic, so all they need in their notebook is e.g.

%load https://tskit.dev/tutorials/_static/download_data.py

@benjeffery
Member

benjeffery commented Jul 18, 2021

My thoughts on this:

A large fraction of users coming to the tskit tutorials will already have a .trees file, produced by some other tool, that they are looking to work with; they are interested in copy-pasting code to look at and analyse their own file.

For those that need an example file, mentioning the %load magic above in a side note of the intro seems to be the least work to get these files to the user in an easy way.

EDIT: Just seen #120 which does this - will check that out!

@hyanwong
Member Author

Thanks for the comments @benjeffery. I still think it may be worth sorting the web site so that these data files can be downloaded from https://tskit.dev/tutorials/data/XXX.trees, but I don't know if this will make the site construction untidy etc. It's not vital, although it's at least something that Jerome and I both think is a reasonable idea.

@benjeffery
Member

It's pretty simple I think - just adding a copy to the end of the build.
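For what it's worth, a copy step at the end of the build might look something like the sketch below. This is not the change that was actually made; the function name `copy_data_files` and the directory names in the usage note are assumptions for illustration.

```python
import shutil
from pathlib import Path


def copy_data_files(src_dir, out_dir, pattern="*.trees"):
    """Copy files matching pattern from src_dir into out_dir; return the copied names."""
    src, out = Path(src_dir), Path(out_dir)
    # Create the output directory (and any parents) if it is not there yet.
    out.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(src.glob(pattern)):
        shutil.copy2(path, out / path.name)
        copied.append(path.name)
    return copied
```

Assuming the built site ends up in `_build/html`, calling `copy_data_files("data", "_build/html/data")` after the build would make the files available under the `data/` path of the published site.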

@benjeffery
Member

Closed in #122
