Commit 0c24fe3

Merge pull request #276 from coderefinery/swi_prep
Smallish edits to lessons
2 parents 8fc7377 + 0862aee commit 0c24fe3

7 files changed

+113
-69
lines changed

content/dependencies.md

Lines changed: 28 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,11 @@
11
# Recording dependencies
22

3+
```{objectives}
4+
- Understand what dependency management tools can be useful for
5+
- Discuss environment/requirements files in the context of reusability and
6+
reproducibility
7+
```
8+
39
```{questions}
410
- How can we communicate different versions of software dependencies?
511
```
@@ -44,18 +50,18 @@ When we create recipes, we often use tools created by others (libraries) [Midjou
4450

4551
---
4652

47-
## Tools and what problems they try to solve
53+
## Dependency and environment management
4854

4955
**Conda, Anaconda, pip, virtualenv, Pipenv, pyenv, Poetry, requirements.txt,
5056
environment.yml, renv**, ..., these tools try to solve the following problems:
5157

52-
- **Defining a specific set of dependencies**, possibly with well defined versions
58+
- **Defining a specific set of dependencies**
5359
- **Installing those dependencies** mostly automatically
5460
- **Recording the versions** for all dependencies
5561
- **Isolate environments**
5662
- On your computer for projects so they can use different software
5763
- Isolate environments on computers with many users (and allow self-installations)
58-
- Using **different Python/R versions** per project
64+
- Using **different package versions** per project (also e.g. Python/R versions)
5965
- Provide tools and services to **share packages**
6066

6167
Isolated environments are also useful because they help you make sure
@@ -273,7 +279,7 @@ information?
273279
We start from an existing conda environment. Try this either with your own project or inside the "coderefinery" conda
274280
environment. For demonstration purposes, you can also create an environment with:
275281
276-
```console
282+
```console
277283
$ conda env create -f myenv.yml
278284
```
279285
Where the file `myenv.yml` could have some python libraries with unspecified versions:
@@ -303,22 +309,37 @@ information?
303309
```
304310
305311
Have a look at the generated file and discuss what you see.
312+
313+
```{solution} Some things to note
314+
- Can you find all packages you installed directly? Which versions were installed?
315+
- What other packages were installed? -> Dependencies of dependencies
316+
- Besides the version you can also see the build channel
317+
- Sometimes the build includes an operating system or an architecture
318+
- Using this environment file might therefore not work, or not result in an identical setup, on other computers
319+
```
306320
307321
In the future — or on a different computer — we can re-create this environment with:
308322
309323
```console
310324
$ conda env create -f environment.yml
311325
```
326+
You may use `conda` or `mamba` interchangeably for this step; mamba may solve the dependencies a bit faster.
312327
313328
What happens instead when you run the following command?
314329
315330
```console
316331
$ conda env export --from-history > environment_fromhistory.yml
317332
```
318333
319-
More information: <https://docs.conda.io/en/latest/>
334+
```{solution} Some things to note
335+
- Everything is listed as you installed it; with or without specified versions
336+
- Using this environment file a few days/weeks later will likely not result in the same environment
337+
- This can be a good starting point for a reproducible environment as you may add your current version numbers to it (check for example with `conda list | grep "packagename"`)
338+
```
339+
340+
In daily use you may not always use an environment.yml file to create the full environment, but create a base environment and then add new packages with `conda install packagename` as you go. Those packages will also be listed in the environment files created with either of the approaches above.
320341
321-
See also: <https://github.com/mamba-org/mamba>
342+
More information: <https://docs.conda.io/en/latest/> and <https://github.com/mamba-org/mamba>
322343
````
323344
324345
````{group-tab} Python virtualenv
@@ -355,5 +376,5 @@ information?
355376

356377
```{keypoints}
357378
- Recording dependencies with versions can make it easier for the next person to execute your code
358-
- There are many tools to record dependencies
379+
- There are many tools to record dependencies and separate environments
359380
```
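
The idea of adding current version numbers to an environment file created with `--from-history` can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper `pin_dependencies` and the package versions below are hypothetical, and in practice the installed versions would come from something like `conda list`.

```python
def pin_dependencies(requested, installed):
    """Return dependency specs with known versions filled in.

    requested: specs as written by hand, e.g. ["numpy", "python=3.11"]
    installed: mapping of package name -> installed version
    """
    pinned = []
    for spec in requested:
        name = spec.split("=")[0].strip()
        if "=" in spec or name not in installed:
            # Already constrained, or version unknown: keep as written.
            pinned.append(spec)
        else:
            pinned.append(f"{name}={installed[name]}")
    return pinned


# Hypothetical example values, not taken from a real environment.
requested = ["python=3.11", "numpy", "matplotlib"]
installed = {"numpy": "1.26.4", "matplotlib": "3.8.3"}
pinned_specs = pin_dependencies(requested, installed)
print(pinned_specs)
```

The resulting specs could then be copied into the `dependencies:` list of an environment file, so that re-creating the environment later is more likely to give the same package versions.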

content/environments.md

Lines changed: 1 addition & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,8 +1,7 @@
11
# Recording environments
22

33
```{objectives}
4-
- Understand what containers are
5-
- Understand good and less good usecases for containers
4+
- Understand what containers are and what they are useful for
65
- Discuss container definitions files in the context of reusability and
76
reproducibility
87
```

content/intro.md

Lines changed: 18 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -16,27 +16,34 @@
1616

1717
## This workshop is all about reproducibility - from a computational perspective
1818

19+
This section connects the steps above to the CodeRefinery workshop lessons.
20+
1921
**"Here is my code"**
2022

21-
-> **Version control with git** with focus on collaboration
22-
-> **Social coding**: What can you do to get credit for your code and to allow reuse
23-
-> **Documentation**: How to let others or future you know about your thoughts and how to use your code
24-
-> **Jupyter Notebooks**: A tool to write and share executable notebooks and data visualization
25-
-> **Automated testing**: Preventing yourself and others from breaking your functioning code
26-
-> **Modular code development**: Making reusing parts of your code easier
23+
- **Version control with git** with focus on collaboration
24+
- **Social coding**: What can you do to get credit for your code and to allow reuse
25+
- **Documentation**: How to let others or future you know about your thoughts and how to use your code
26+
- **Jupyter Notebooks**: A tool to write and share executable notebooks and data visualization
27+
- **Automated testing**: Preventing yourself and others from breaking your functioning code
28+
- **Modular code development**: Making reusing parts of your code easier
2729

28-
**"Here are my tools"**
30+
**"Here are my tools"**
2931

30-
-> This lesson on general **Reproducibility**: Preparing code to be usable by you and others in the future
32+
This lesson on general **Reproducibility**: Preparing code to be usable by you and others in the future
3133

3234
This includes organizing your projects on your own computer and recording your computational steps, dependencies and computing environment.
3335

34-
We will also mention a few tools and platforms for sharing data (**"Here is my data"**) and research outputs(**"Here are my results"**), but they are not the focus of this workshop.
36+
We will also mention a few tools and platforms for sharing data (**"Here is my data"**) and research outputs (**"Here are my results"**) in the **social coding** lesson, but they are not the focus of this workshop.
3537

3638
## Small steps towards reproducible research
3739

3840
If this is all new to you, it may feel quite overwhelming.
39-
Our recommendation: Focus on "good enough" instead of perfect: To start, pick one topic that seems reasonable to implement for your current project. Something that helps YOU right now. Some things you may have to implement due to requirements from your funders or the journal where you want to publish your research. Use their requirements as a checklist and find tools that feel comfortable for you.
40-
A great way to see what are the really important things to implement, meet with a colleague, exchange codes and try to run each others code. Every question your colleague has to ask from you about your code gives a hint on where you may need to improve your documentation.
41+
42+
**Our recommendation:** Don't worry! Focus on "good enough" instead of perfect.
43+
44+
To start, pick one topic that seems reasonable to implement for your current project. Something that helps YOU right now. This may be something you have to implement due to requirements from your funders or the journal where you want to publish your research. Use their requirements as a checklist and find tools that feel comfortable for you.
45+
46+
A great way to find out which things are really important to implement is to meet with a colleague, exchange code and try to run each other's code. Every question your colleague has to ask you about your code gives a hint on where you may need to improve.
47+
4148
Keeping a "log book" while working on your own code also serves as a great basis for making your code more reproducible. Can you use any of the tools and techniques learned in this workshop to share parts of your log book with others to help them run your code?
4249

content/motivation.md

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,9 @@
11
# Motivation
22

3+
```{objectives}
4+
- Understand why we are talking about reproducibility in this workshop
5+
```
6+
37
```{instructor-note}
48
- 10 min teaching/discussion
59
```

content/organizing-projects.md

Lines changed: 51 additions & 46 deletions
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,8 @@
11
# Organizing your projects
22

33
```{objectives}
4-
- Get an overview on how to organize research projects
4+
- Understand how to organize research projects
5+
- Get an overview of tools for collaborative and version controlled manuscripts
56
```
67

78
```{instructor-note}
@@ -14,22 +15,24 @@ Let's go over some of the basic things which people have found to work (and not
1415

1516
## Directory structure for projects
1617

17-
- Project files in a **single folder**
18-
- **Different projects** should have **separate folders**
18+
- Project files in a **single directory**
19+
- **Different projects** should have **separate directories**
1920
- Use **consistent and informative directory structure**
20-
- Avoid spaces in directory and file names – it is uglier for humans but handy for computers.
21-
- If you need to separate public/private, you can put them in public and private Git repos
22-
- If you need to separate public/secret, use `.gitignore` or a separate folder that's not in Git
21+
- Avoid spaces in directory and file names – use `-`, `_` or CamelCase instead (nicer for computers to handle).
22+
- If you need to separate public/private directories,
23+
- put them separately in public and private Git repositories, or
24+
- use `.gitignore` to exclude the private information from being tracked
2325
- Add a **README file** to describe the project and instructions on reproducing the results
24-
- If a software is reused in several projects it can make sense to put them in own repo
26+
- If you want to use the **same code in multiple projects**, host it on GitHub (or similar) and clone it into each of your project directories.
2527

2628
A project directory can look something like this:
29+
2730
```shell
2831
project_name/
2932
├── README.md # overview of the project
3033
├── data/ # data files used in the project
3134
│ ├── README.md # describes where data came from
32-
│ └── sub-folder/ # may contain subdirectories
35+
│ └── sub-directory/ # may contain subdirectories
3336
├── processed_data/ # intermediate files from the analysis
3437
├── manuscript/ # manuscript describing the results
3538
├── results/ # results of the analysis (data, tables, figures)
@@ -41,31 +44,44 @@ project_name/
4144
├── index.rst
4245
└── ...
4346
```
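
A directory layout like the one above could also be created with a small script. The following is a minimal sketch, not part of the lesson: the function name `create_project_skeleton` is hypothetical, and the `src` directory is an assumption based on the recommendation to keep code in `src/` or `source/`.

```python
from pathlib import Path
import tempfile

# Directory names follow the example layout; "src" is an assumption.
SUBDIRECTORIES = ["data", "processed_data", "manuscript", "results", "src"]


def create_project_skeleton(root):
    """Create an empty project directory structure below `root`."""
    root = Path(root)
    for name in SUBDIRECTORIES:
        (root / name).mkdir(parents=True, exist_ok=True)
    # Empty README files to be filled in by hand.
    for readme in (root / "README.md", root / "data" / "README.md"):
        readme.touch(exist_ok=True)
    return root


# Demonstration in a temporary directory, so nothing is overwritten.
demo_root = create_project_skeleton(Path(tempfile.mkdtemp()) / "project_name")
created = sorted(p.relative_to(demo_root).as_posix() for p in demo_root.rglob("*"))
print(created)
```

Starting every project from the same skeleton makes it easier to keep a consistent and informative structure across projects.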
47+
4448
---
4549

4650
## Tracking source code, data, and results
4751

4852
- All code is version controlled and goes in the `src/` or `source/` directory
4953
- Include appropriate LICENSE file and information on software requirements
5054
- You can also version control data files or input files under `data/`
51-
- If data files are too large (or sensitive) to track, untrack them using `.gitignore`
55+
- If data files are too large (or sensitive) to track, untrack them using `.gitignore`
5256
- Intermediate files from the analysis are kept in `processed_data/`
5357
- Consider using Git tags to mark specific versions of results (version
5458
submitted to a journal, dissertation version, poster version, etc.):
5559
```console
5660
$ git tag -a thesis-submitted -m "this is the submitted version of my thesis"
5761
```
58-
* Check the [Git-intro lesson](https://coderefinery.github.io/git-intro/) for a reminder.
62+
63+
Check the [Git-intro lesson](https://coderefinery.github.io/git-intro/) for a reminder.
64+
65+
66+
## Some tools and templates
67+
68+
- [R devtools](https://devtools.r-lib.org/)
69+
- [Python cookiecutter template](https://github.com/Materials-Data-Science-and-Informatics/fair-python-cookiecutter)
70+
- [Reproducible research template](https://github.com/the-turing-way/reproducible-project-template) by the Turing Way
71+
72+
More tools and templates in [Heidi Seibold's blog](https://heidiseibold.ck.page/posts/setting-up-a-fair-and-reproducible-project).
73+
5974

6075
---
6176

62-
## Discussion on reproducibility
77+
## Excursion: Reproducible publications
78+
79+
### Discussion on collaborative writing of academic papers
6380

6481
````{discussion} Discuss in the collaborative document:
6582
66-
**How do you collaborate on writing academic papers?**
6783
```
68-
- Are you using version control for academic papers?
84+
- How do you collaborate on writing academic papers?
6985
- ...
7086
- ...
7187
- (share your experience)
@@ -75,46 +91,35 @@ project_name/
7591
- ...
7692
- (share your experience)
7793
```
78-
> Please write or discuss your ideas before opening solution!
79-
80-
```{solution} Take away messages
81-
- Consider using version control for manuscripts as well. It may help you when keeping track of edits + if you sync it online then you don't have to worry about losing your work.
8294
83-
- Collaboration can be done efficiently by
84-
- real time collaboration tools like HackMD/HedgeDoc where conflicts are resolved on the fly
85-
- version control where conflicts are detected and shown – and solved manually
86-
```
8795
````
8896

89-
## Some tools and templates
97+
-> Consider using **version control for manuscripts** as well. It can help you keep track of edits, and if you sync it online you don't have to worry about losing your work.
9098

91-
- [R devtools](https://devtools.r-lib.org/)
92-
- [Python cookiecutter template](https://github.com/Materials-Data-Science-and-Informatics/fair-python-cookiecutter)
93-
- [Reproducible research template](https://github.com/the-turing-way/reproducible-project-template) by the Turing Way
99+
Version control does not have to mean git, but could also mean using "tracking changes" in tools like Word, Google Docs, or Overleaf (find links below).
94100

95-
More tools and templates in [Heidi Seibolds blog](https://heidiseibold.ck.page/posts/setting-up-a-fair-and-reproducible-project).
101+
### Tools for collaborative writing and version control of manuscripts
96102

97-
## Reproducible publications
98-
99-
- Git can be used to collaborate on manuscripts written in, e.g., LaTeX and other text-based formats but other tools exist, some with git integration:
100-
- [Overleaf](https://www.overleaf.com) or [Typst](https://typst.app/): online, collaborative LaTeX editor
101-
- [Authorea](https://www.authorea.com): collaborative platform for preprints
102-
- [HackMD](https://hackmd.io/) or [HedgeDoc](https://hedgedoc.org/): online collaborative Markdown editors
103-
- [Manuscripts.io](https://www.manuscripts.io/): a collaborative authoring tool that support scientific content and reproducibility.
104-
- Google Docs can be a good alternative
105-
106-
- Many tools exist to assist in making scholarly output reproducible:
107-
- [rrtools](https://github.com/benmarwick/rrtools): instructions, templates, and functions for writing a reproducible article or report with R.
108-
- [Jupyter Notebooks](https://jupyter.org): web-based computational environment for creating code and text based notebooks that can be used as, see also our [Jupyter lesson](https://coderefinery.github.io/jupyter/) later in this workshop.
109-
supplementary material for articles.
110-
- [Binder](https://mybinder.org): makes a repository with Jupyter notebooks available in an executable environment (discussed later in the [Jupyter lesson](https://coderefinery.github.io/jupyter/)).
111-
- ["Research compendia"](http://inundata.org/talks/rstd19/#/): a set of good practices for
112-
reproducible data analysis in R, but much is transferable to other languages.
113-
114-
```{seealso}
115-
Do you want to practice your reproducibility skills and get inspired by working with other people's code/data? Join a [ReproHack event](https://www.reprohack.org/event/)!
116-
```
103+
Git **can** be used to collaborate on manuscripts written in, e.g., LaTeX and other text-based formats. However, it might not always be the most convenient option. Other tools exist to make the process more enjoyable:
104+
105+
You can **collaboratively gather notes** using self-hosted or public instances of tools like [HedgeDoc](https://hedgedoc.org/) and [Etherpad](https://etherpad.org) or use online options like [HackMD](https://hackmd.io/), [Google Docs](https://docs.google.com) or the Microsoft online tools for easy and efficient collaboration.
106+
107+
To format your notes into a manuscript, you can use Word-like online editors or tools like [Overleaf](https://www.overleaf.com) (LaTeX) or [Typst](https://typst.app/) (its own markup language). Most of the tools in this section even provide a git integration.
108+
109+
[Manubot](https://github.com/manubot/rootstock) offers another way to turn your written word into a fully rendered manuscript using GitHub.
110+
111+
### Executable manuscripts
112+
113+
You may also want to consider writing an executable manuscript using tools like [Jupyter Notebooks](https://jupyter.org) hosted on [Binder](https://mybinder.org), [Quarto](https://quarto.org/), [Authorea](https://www.authorea.com) or [Observable](https://observablehq.com/), to name a few.
114+
115+
### Resources on research compendia
116+
117+
- [About research compendia at the Turing Way](https://book.the-turing-way.org/reproducible-research/compendia)
118+
- ["Research compendia"](http://inundata.org/talks/rstd19/#/): a set of good practices for reproducible data analysis in R, but much is transferable to other languages.
119+
- [rrtools](https://github.com/benmarwick/rrtools): instructions, templates, and functions for writing a reproducible article or report with R.
120+
- ...
117121

118122
```{keypoints}
119123
- An organized project directory structure helps with reproducibility.
124+
- Also think about version control for writing your academic manuscripts.
120125
```

content/where-to-go.md

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -47,6 +47,10 @@ However, you will not always need all of them. As with so many things, it again
4747
- [Reproducible research policies and software/data management in scientific computing journals: a survey, discussion, and perspectives](https://doi.org/10.3389/fcomp.2024.1491823)
4848
- ...
4949

50+
```{seealso}
51+
Do you want to practice your reproducibility skills and get inspired by working with other people's code/data? Join a [ReproHack event](https://www.reprohack.org/event/)!
52+
```
53+
5054
```{keypoints}
5155
- Not everything in this lesson might be useful right now, but it is good to know that these things exist if you ever get in a situation that would require such a solution.
5256
- Caring about reproducibility makes work easier for the next person working on the project - and that might be you in a few years!

content/workflow-management.md

Lines changed: 7 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1,11 +1,14 @@
11
# Recording computational steps
22

3+
```{objectives}
4+
- Understand why and when a workflow management tool can be useful
5+
```
6+
37
```{questions}
48
- You have some steps that need to be run to do your work. How do you
59
actually run them? Does it rely on your own memory and work, or is it
610
reproducible? **How do you communicate the steps** for future you and others?
711
- How can we create a reproducible workflow?
8-
- When to use scientific workflow management systems.
912
```
1013

1114
```{instructor-note}
@@ -78,7 +81,7 @@ steps in precisely this order, as we would run them manually, one after another.
7881

7982
## Workflow tools
8083

81-
Sometimes it may be helpful to go from imperative to declarative style. Rather than saying "do this and then that" we describe dependencies but we let the tool figure out the series of steps to produce results.
84+
Sometimes it may be helpful to go from imperative to declarative style. Rather than saying "do this and then that" we describe dependencies between steps, but we let the tool figure out the order of steps to produce results.
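
To illustrate the declarative idea in isolation, here is a minimal sketch using Python's standard-library `graphlib` module (Python 3.9 or newer). The step names are hypothetical and this is not how Snakemake is implemented; it only shows that an execution order can be derived from stated dependencies.

```python
from graphlib import TopologicalSorter

# Hypothetical steps: each step maps to the set of steps it depends on.
dependencies = {
    "plot": {"count_words"},
    "count_words": {"download_data"},
    "download_data": set(),
}

# We never state the order ourselves; it follows from the dependencies.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Workflow tools build on the same principle and additionally track input and output files, so that only the steps whose results are missing or outdated are run again.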
8285

8386
### Example workflow tool: [Snakemake](https://snakemake.readthedocs.io/en/stable/index.html)
8487

@@ -205,6 +208,7 @@ which can be installed by `conda install graphviz`.
205208
```console
206209
$ snakemake -j 1 --dag | dot -Tpng > dag.png
207210
```
211+
208212
Rules that have yet to be completed are indicated with solid outlines, while already completed rules are indicated with dashed outlines.
209213

210214
```{figure} img/snakemake_dag.png
@@ -238,5 +242,5 @@ Tools like Snakemake help us with **reproducibility** by supporting us with **au
238242

239243
```{keypoints}
240244
- Computational steps can be recorded in many ways
241-
- Workflow tools can help, if there are many steps to be executed
245+
- Workflow tools can help, if there are many steps to be executed and/or many datasets to be processed
242246
```
