Expose files in /data to the internet #116

hyanwong · 2021-07-10T16:20:23Z

In conjunction with tskit-dev/tskit#1566, it would be useful for the tree sequence files stored in /data to be available at https://tskit.dev/tutorials/data/my_file.trees. We could use the GitHub raw version as a URL instead (e.g. https://github.com/tskit-dev/tutorials/blob/main/data/basics.trees?raw=true), but I think a strange-looking GitHub URL might confuse a new user.


hyanwong · 2021-07-10T16:24:00Z

Once we've done this, I think we should swap tutorial content of the form

tskit.load("data/basics.trees")

for

tskit.load(url="https://tskit.dev/tutorials/data/basics.trees")

This will allow users to run the notebook code exactly as it appears in the tutorial. I don't know whether this will affect CI testing of the tutorial content, but I think it should be OK to rely on a file existing on the internet when building the tutorials.

jeromekelleher · 2021-07-10T16:28:44Z

I'm not particularly keen on this idea of making tskit.load accept URLs - it's a lot of work for very marginal benefit.

jeromekelleher · 2021-07-10T16:35:07Z

Seems perfectly reasonable to copy the data files to the webserver, though, so they can be downloaded.

hyanwong · 2021-07-10T16:51:53Z

Easy enough to do this for the time being, anyway:

from urllib.request import urlopen
ts = tskit.load(urlopen("https://tskit.dev/tutorials/data/basics.trees"))

I'm not sure if that feels like cluttering up the top of the tutorials with an extra complicated import, but making the tutorial material easily runnable by e.g. a student seems like a big win to me.
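One way to keep the extra import out of every call would be to wrap the download in a small caching helper. The following is only a minimal sketch: `fetch_data_file` and its `data_dir` argument are hypothetical names for illustration, not part of tskit, and it assumes the files really are served at the URL proposed above.

```python
import os
import shutil
from urllib.parse import urlparse
from urllib.request import urlopen


def fetch_data_file(url, data_dir="data"):
    """Download `url` into `data_dir` unless it is already there.

    Hypothetical helper for illustration only; it is not part of tskit.
    Returns the local path of the (possibly cached) file.
    """
    os.makedirs(data_dir, exist_ok=True)
    # Use the last component of the URL path as the local file name
    filename = os.path.basename(urlparse(url).path)
    local_path = os.path.join(data_dir, filename)
    if not os.path.exists(local_path):
        # Stream the response straight to disk
        with urlopen(url) as response, open(local_path, "wb") as out:
            shutil.copyfileobj(response, out)
    return local_path
```

A tutorial cell could then read `ts = tskit.load(fetch_data_file("https://tskit.dev/tutorials/data/basics.trees"))`, and later runs would reuse the local copy rather than downloading it again.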

jeromekelleher · 2021-07-11T09:03:55Z

You could just put in a note somewhere saying "if you want to follow along, use this pattern". I don't think we should put all that urllib stuff into every call; that'll really confuse people (they'll think you have to do this).

hyanwong · 2021-07-11T09:46:29Z

> You could just put in a note somewhere saying "if you want to follow along, use this pattern". I don't think we should put all that urllib stuff into every call; that'll really confuse people (they'll think you have to do this).

True. But I'd really like people to be able to execute the notebook (or the code in it) themselves, without having a lot of sorting out to do to get it to work. I guess we could say "download the data directory to the place from where you are running the Jupyter notebook", but even that's a bit technical for the average student.

Just specifying a URL instead would be so much easier, which is why I'm so keen on wrapping the urlopen call within the tskit.load function, although I know you're not keen (tskit-dev/tskit#1566).

But perhaps you have other suggestions for how to make it easy for students to run the tutorial material themselves, now that we've taken away the binder possibility? I do think we want the code to run as-is rather than requiring the user to modify it, though. I think one of the main points of a tutorial is that someone can follow along on their own machine, rather than just reading what we've written, so we do need to make it easy to do this.

jeromekelleher · 2021-07-11T10:04:35Z

IMO---after investing a fair bit of time into this for the msprime docs---we can either have something that links densely up with other tutorials and documentation and has a nice narrative structure, with unimportant details hidden OR have something that can be easily run in isolation as a notebook. They're just not compatible.

Let's do one thing well.

hyanwong · 2021-07-11T10:10:54Z

Fair enough - but I do think the tutorials are a different use-case to the docs. The tutorials are meant to be executed and experimented with by the user, surely? Otherwise they are just additional documentation, not true tutorials?

So for the tutorials I suggest that we should provide a separate (derived?) notebook that "just works", and which can be downloaded and run in isolation by a user - I guess it only needs the code cells, and would simply swap any tskit.load commands with ones that get the correct files from a source URL rather than from a local path. I can open a separate issue about this?
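To make the idea of a derived notebook concrete, here is a rough sketch of the swap for a single code cell. The function name `use_remote_data` and the `BASE_URL` constant are hypothetical, the base URL assumes the data files end up served under https://tskit.dev/tutorials/, and the regular expression only handles the simple `tskit.load("data/xxx.trees")` form used in the examples above.

```python
import re

# Assumed location of the served data files (not confirmed in this thread)
BASE_URL = "https://tskit.dev/tutorials/"

# Matches tskit.load("data/<name>.trees") with single or double quotes
LOAD_PATTERN = re.compile(r"""tskit\.load\(\s*(["'])(data/[^"']+\.trees)\1\s*\)""")


def use_remote_data(source, base_url=BASE_URL):
    """Rewrite local tskit.load calls in a code cell to read from a URL."""
    return LOAD_PATTERN.sub(
        lambda m: f'tskit.load(urlopen("{base_url}{m.group(2)}"))', source
    )


cell = 'ts = tskit.load("data/basics.trees")'
print(use_remote_data(cell))
```

The rewritten cell would also need `from urllib.request import urlopen` somewhere above it, so a real conversion step would have to add that import once per notebook.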

jeromekelleher · 2021-07-12T06:05:19Z

This sounds like an awful lot of work to me - I'm happy that people can copy the code cells into their own notebook, if they like (there's a nice copy button). I agree it would be nice if people could run the code themselves, but the technology doesn't really work right now, so (a) it's more important that the version you read works well (I'm assuming more people will read these than try to execute them); and (b) we shouldn't try to maintain two separate versions.

hyanwong · 2021-07-12T07:50:32Z

Yeah, I'd be happy with just copying too, but it would help an awful lot if we could use the URL form for loading a tree sequence in this case. Otherwise, as you say, at the top of every tutorial we would have to repeat the mantra that, to run the tutorial themselves, they will need to do from urllib.request import urlopen; tskit.load(urlopen("https://tskit.dev/some/random/path.trees")) for each trees file that we use.

hyanwong · 2021-07-16T11:44:57Z

OK, here's another workaround. In the intro page we can say that to follow along with the tutorials in your own notebook, you can download the files in data/ from GitHub to the directory in which Jupyter is running. We can even give instructions to do this from within the running Python Jupyter notebook:

import json
import os
import urllib.request

# Make a "data" directory within the current folder, if it does not already exist
os.makedirs("data", exist_ok=True)
# Get the list of all files in the tutorials repository from the GitHub API
api_url = "https://api.github.com/repos/tskit-dev/tutorials/git/trees/main?recursive=1"
with urllib.request.urlopen(api_url) as info:
    charset = info.info().get_content_charset("utf-8")
    files = json.loads(info.read().decode(charset))["tree"]
# Save each .trees file under data/ to the local data directory
for path in (file["path"] for file in files):
    if path.startswith("data/") and path.endswith(".trees"):
        urllib.request.urlretrieve("https://raw.githubusercontent.com/tskit-dev/tutorials/main/" + path, path)

For the moment, I'm strongly minded to add this code snippet to the tutorials/intro.html page, perhaps as a Note or under a section "Running tutorial code". The snippet above will presumably become less complicated when the files can be downloaded from https://tskit.dev/tutorials/data/xxx rather than GitHub, especially if we can provide a file list somewhere.
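If a file list were published alongside the data, the download could shrink to something like the sketch below. Everything about the listing is an assumption: `files.txt` (one file name per line) is a hypothetical file that does not currently exist, and `download_listed_files` is just an illustrative name.

```python
import os
from urllib.request import urlopen


def download_listed_files(base_url, listing="files.txt", data_dir="data"):
    """Fetch every file named in a plain-text listing found at `base_url`.

    `files.txt` is a hypothetical listing with one file name per line;
    nothing like it is known to be published. Returns the names read.
    """
    os.makedirs(data_dir, exist_ok=True)
    with urlopen(base_url + listing) as response:
        names = response.read().decode("utf-8").split()
    for name in names:
        # basename() keeps downloads inside data_dir even for odd entries
        local_path = os.path.join(data_dir, os.path.basename(name))
        with urlopen(base_url + name) as response:
            with open(local_path, "wb") as out:
                out.write(response.read())
    return names
```

With the files served as proposed, the call would look like `download_listed_files("https://tskit.dev/tutorials/data/")`. Splitting on whitespace keeps the sketch short but would break on file names that contain spaces.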

We don't actually need them to type all that code in anyway. We can simply use the IPython %load magic, so all they need in their notebook is e.g.

%load https://tskit.dev/tutorials/_static/download_data.py

benjeffery · 2021-07-18T23:58:33Z

My thoughts on this:

A large fraction of users coming to the tskit tutorials will already have a .trees file, produced by some other tool, that they are looking to work with. They are interested in copy-pasting code to look at and analyse their own file.

For those that need an example file, mentioning the %load magic above in a side note of the intro seems to be the least work to get these files to the user in an easy way.

EDIT: Just seen #120 which does this - will check that out!

hyanwong · 2021-07-19T08:27:54Z

Thanks for the comments @benjeffery. I still think it may be worth sorting the web site so that these data files can be downloaded from https://tskit.dev/tutorials/data/XXX.trees, but I don't know if this will make the site construction untidy etc. It's not vital, although it's at least something that Jerome and I both think is a reasonable idea.

benjeffery · 2021-07-19T08:39:26Z

It's pretty simple I think - just adding a copy to the end of the build.
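As a rough illustration of such a copy step (not necessarily what the eventual fix does), a post-build script could look like the sketch below. The function name `copy_data_files` and the `_build/html` output directory are assumptions about the build layout rather than details taken from this thread.

```python
import glob
import os
import shutil


def copy_data_files(src_dir="data", build_dir="_build/html"):
    """Copy every .trees file from `src_dir` into `<build_dir>/data`.

    Both directory names are assumed defaults, not confirmed settings.
    Returns the list of destination paths.
    """
    dest_dir = os.path.join(build_dir, "data")
    os.makedirs(dest_dir, exist_ok=True)
    copied = []
    for path in sorted(glob.glob(os.path.join(src_dir, "*.trees"))):
        # shutil.copy returns the path of the newly created file
        copied.append(shutil.copy(path, dest_dir))
    return copied
```

Running this after the HTML build would place the files next to the generated pages, so they would be served from the same site without any change to the tutorial sources.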

benjeffery · 2021-07-19T10:31:27Z

Closed in #122

hyanwong mentioned this issue Jul 10, 2021

Better formatting for notebooks that "Open in binder" #114

Closed

hyanwong mentioned this issue Jul 13, 2021

Add instructions for actually running the tutorials to intro.html #119

Closed

hyanwong mentioned this issue Jul 16, 2021

Add instructions for running the tutes #120

Merged

benjeffery closed this as completed Jul 19, 2021
