Commit

Merge pull request #74 from 3mmaRand/feature/core-w6-2024
Feature/core w6 2024
3mmaRand authored Nov 4, 2024
2 parents dad40ee + 8cb9cba commit c55073d
Showing 10 changed files with 51 additions and 260 deletions.
10 changes: 10 additions & 0 deletions .renvignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
.Rproj.user
.Rhistory
.Rdata
.httr-oauth
.DS_Store

/.quarto/

transcriptomics/week-3/data-raw/

2 changes: 1 addition & 1 deletion core/week-2/study_before_workshop.qmd
@@ -9,7 +9,7 @@ toc-location: right

2. 📖 Read [Workflow in RStudio](https://3mmarand.github.io/comp4biosci/workflow_rstudio.html). You may find it helpful to remind yourself about RStudio Projects. In previous years, you have submitted an "RStudio Project" as part of your BABS work. In this module, you will submit "Supporting Information" for your Project Report. The Supporting Information is a documented and organised collection of all the digital parts of your research project. This includes data (or instructions for accessing data), code and/or non-coded processing, instructions for use, computational requirements and outputs. The Supporting Information could be a single RStudio Project (like you have done previously but with better documentation) or a folder that includes an RStudio Project and other material/scripts.

3.💻 Set up the Virtual Desktop. I very strongly recommend working on
3. Set up the Virtual Desktop. I very strongly recommend working on
the University computers for this work. You will be using more specialised R
packages than you might be used to. This is especially important if you often
have difficulty updating and/or installing software on your own machine,
Binary file removed core/week-6/Y12345678.zip
Binary file not shown.
Binary file added core/week-6/images/mentimeter_qr_code.png
25 changes: 17 additions & 8 deletions core/week-6/overview.qmd
@@ -5,25 +5,34 @@ toc: true
toc-location: right
---

This week you will revise some essential concepts for scientific computing: file system organisation, file types, working directories and paths. The workshop will cover a rationale for working reproducibly, project oriented workflow, naming things and documenting your work. We will also examine some file types and the concept of tidy data.
We considered how to organise reproducible data analyses in
[Core: Supporting Information 1](../week-2/overview.qmd). This week we will
consider how to document and curate reproducible data analyses. You will
add a README to your project and record all the software you are using in R.
The workshop will also include a question and answer session.


### Learning objectives

The successful student will be able to:

- explain the organisation of files and directories in a file system, including root, home and working directories
- explain absolute and relative file paths
- explain why working reproducibly is important
- use a project-oriented workflow to organise work
- give files human- and machine-readable names
- outline some common biological data file formats
- Describe the purpose of a README file

- List the key components of a README file

- Use `sessioninfo` to document the software used in an R project

- Write a README file for a project
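
As a minimal sketch of the `sessioninfo` objective above, the code below records the R version, platform and loaded packages in a text file. The file name `session_info.txt` is an assumption made for illustration, not a module requirement, and the fallback to base R's `sessionInfo()` is included only so the sketch still runs where the `sessioninfo` package is not installed.

```r
# Record the software used in this R session in a plain-text file.
# "session_info.txt" is a hypothetical file name chosen for this sketch.
out_file <- "session_info.txt"

if (requireNamespace("sessioninfo", quietly = TRUE)) {
  # session_info() reports the platform and the versions of loaded packages
  info <- capture.output(print(sessioninfo::session_info()))
} else {
  # base R fallback: less detailed, but needs no extra packages
  info <- capture.output(print(sessionInfo()))
}

# Write the captured report, one line of output per line of the file
writeLines(info, out_file)
```

Running this at the end of an analysis, from the top level of the project, keeps the record next to the code it describes.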




### Instructions

1. [Prepare](study_before_workshop.qmd)

i. 📖 Read Understanding file systems
i. Revise [Core: Supporting Information 1](../week-2/overview.qmd)
and make a note of queries you have

2. [Workshop](workshop.qmd)

10 changes: 1 addition & 9 deletions core/week-6/study_after_workshop.qmd
@@ -1,6 +1,6 @@
---
title: "Independent Study to consolidate this week"
subtitle: "Core 1"
subtitle: "Core: Supporting Information 2"
toc: true
toc-location: right
format:
@@ -9,12 +9,4 @@ format:
code-summary: "Answer - don't look until you have tried!"
---

These are suggestions

## BIO00088H Group Research Project students

1. Revise previous Data Analysis materials. You can find the version you took on the VLE site for 17C / 08C. However, my latest versions (in development) are here: [Data Analysis in R](https://3mmarand.github.io/R4BABS/). The Becoming a Bioscientist (BABS) modules replace the Laboratory and Professional Skills modules. BABS1 and BABS2 are stage one, and I've tried to improve them over 17C / 08C. The site is also searchable (icon top right)

## MSc Bioinformatics students doing BIO00070M

1. Make sure you carry out the [preparatory work for week 2 of 52M](https://3mmarand.github.io/R4BABS/pgt52m/week-2/overview.html)
21 changes: 18 additions & 3 deletions core/week-6/study_before_workshop.qmd
@@ -1,10 +1,25 @@
---
title: "Independent Study to prepare for workshop"
subtitle: "Core 1"
subtitle: "Core: Supporting Information 2"
toc: true
toc-location: right
---

1. 📖 Read [Understanding file systems](https://3mmarand.github.io/comp4biosci/file_systems.html). This is an approximately 15 - 20 minute read revising file types and filesystems. It covers concepts of working directories and paths. We learned these ideas in stage 1 and you may feel completely confident with them but many students will benefit from a refresher. For BIO00070M students, this is part of the work you will also be asked to complete for BIO00052M Data Analysis in R.
1. Revise [Core: Supporting Information 1 Organising Reproducible Data
Analyses](../week-2/overview.qmd).

2. In previous years you have submitted an RStudio Project as part of your BABS work. In this module you will develop this by submitting a Research Compendium. A Research Compendium is a documented collection of all the digital parts of the research project including data (or access to data), code and outputs. The Compendium might be a single Quarto/RStudio Project (like you have done previously but with better documentation) or it might be a folder including a Quarto/RStudio Project and other material/scripts, including the description of unscripted processing. You might want to remind yourself of the example RStudio Project, [Y12345678.zip](Y12345678.zip), used in BABS 2.
i. Do you know that your Supporting Information will most likely be a
structured folder which either *is* an RStudio Project or
contains an RStudio Project?
ii. Are you following best practices for code formatting and style?
If not, go through your scripts and edit them.
iii. Do you have numbers hard coded where they could be variables?
iv. Are you using a sensible naming convention for files and
variables? Have you written it down?
v. Make a note of queries you have. Take some time to formulate and
write down your questions. The more specific and clear your
question is, the better answer I will be able to provide.
vi. Post your questions here:
[Menti](https://www.menti.com/m86rqcbb88) Code: 3306 3222. QR:
![mentimeter qr
code](images/mentimeter_qr_code.png){width="400"}
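
To illustrate the questions above about hard-coded numbers and naming conventions, here is a small sketch. The data frame, the threshold `alpha` and the file-name pattern are all invented for illustration; your own project will have its own values and its own written-down convention.

```r
# Toy data invented for this sketch
results <- data.frame(
  gene  = c("gene_a", "gene_b", "gene_c"),
  p_adj = c(0.01, 0.20, 0.04)
)

# Hard coded: the reader cannot tell what 0.05 means, and changing it
# means hunting for every occurrence in the script
# sig <- results[results$p_adj < 0.05, ]

# As a variable: defined once, named for what it represents
alpha <- 0.05
sig <- results[results$p_adj < alpha, ]

# The same idea applies to file names built from a naming convention:
# ISO 8601 date, then snake_case description
sample_date <- as.Date("2022-03-21")
donor <- 1L
file_name <- sprintf("%s_donor_%d.csv", format(sample_date, "%Y-%m-%d"), donor)
```

Defining such values near the top of a script means a change is made in one place and the name documents the intent.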
16 changes: 0 additions & 16 deletions core/week-6/styles.css

This file was deleted.

225 changes: 3 additions & 222 deletions core/week-6/workshop.qmd
@@ -1,6 +1,6 @@
---
title: "Workshop"
subtitle: "Organising Reproducible Data Analyses"
subtitle: "Supporting Information 2 Documenting and curating Reproducible Data Analyses"
author: "Emma Rand"
toc: true
toc-depth: 4
@@ -19,158 +19,9 @@ editor:

## Session overview

In this workshop we will discuss why reproducibility matters and how to
organise your work to make it reproducible. We will cover:
In this workshop

# Reproducibility

## What is reproducibility?

- **Reproducible: Same data + same analysis = identical results**.
*"... obtaining consistent results using the same input data;
computational steps, methods, and code; and conditions of analysis.
This definition is synonymous with "computational reproducibility"*
[@nationalacademiesofsciences2019]

- Replicable: Different data + same analysis = qualitatively similar
results. The work is not dependent on the specificities of the data.

- Robust: Same data + different analysis = qualitatively similar or
identical results. The work is not dependent on the specificities of
the analysis.

- Generalisable: Different data + different analysis = qualitatively
similar results and same conclusions. The findings can be
generalised

[![The Turing Way\'s definitions of reproducible research
](images/reproducible-matrix.jpg){fig-alt="Two by Two cell matrix. Columns are Data, either same or different. Rows are Analysis either same or different. Each of cells contain one of the definitions for reproducibility"}](https://the-turing-way.netlify.app/reproducible-research/overview/overview-definitions#rr-overview-definitions)

## Why does it matter?

![futureself, CC-BY-NC, by Julen
Colomb](images/future_you.png){fig-alt="Person working at a computer with an offstage person asking 'How is the analysis going?' The person at the computer replies 'Can't understand the date...and the data collector does not answer my emails or calls' Person offstage: 'That's terrible! So cruel! Who did collect the data? I will sack them!' Person at the computer: 'um...I did, 3 years ago.'"
width="400"}

- Five selfish reasons to work reproducibly [@markowetz2015].
Alternatively, see the very entertaining
[talk](https://youtu.be/yVT07Sukv9Q)

- Many high profile cases of work which did not reproduce e.g. Anil
Potti unravelled by @baggerly2009

- **Will** become standard in science and publishing, e.g., OECD Global
Science Forum, Building digital workforce capacity and skills for
data-intensive science [@oecdglobalscienceforum2020]

## How to achieve reproducibility

- Scripting

- Organisation: Project-oriented workflows with file and folder
structure, naming things

- Documentation: Readme files, code comments, metadata, version
control

# Scripting

## Rationale for scripting?

- Science is the generation of ideas, designing work to test them and
reporting the results.

- We ensure laboratory and field work is replicable, robust and
generalisable by planning and recording in lab books and using
standard protocols. Repeating results is still hard.

- Workflows for computational projects, and the data analysis and
reporting of other work can, and should, be 100% reproducible!

- Scripting is the way to achieve this.

# Organisation

## Project-oriented workflow

- use folders to organise your work

- you are aiming for structured, systematic and repeatable.

- inputs and outputs should be clearly identifiable from structure
and/or naming

Examples

```
-- liver_transcriptome/
|__data
|__raw/
|__processed/
|__images/
|__code/
|__reports/
|__figures/
```

## Naming things

![documents, CC-BY-NC,
https://xkcd.com/1459/](images/xkcd-comic-file-names.png){fig-alt="A comic figure is looking over the shoulder of another and is shocked by a list of files with names like 'Untitled 138 copy.docx' and 'Untitled 243.doc'. Caption: 'Protip: Never look in someone else's documents folder'"}

Guiding principle - Have a convention! Good file names are:

- machine readable

- human readable

- play nicely with sorting

I suggest

- no spaces in names

- use snake_case or kebab-case rather than CamelCase or dot.case

- use all lower case except very occasionally where convention is
otherwise, e.g., README, LICENSE

- ordering: use left-padded numbers e.g., 01, 02....99 or 001,
002....999

- dates [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) format:
2020-10-16

- write down your conventions

```
-- liver_transcriptome/
|__data
|__raw/
|__2022-03-21_donor_1.csv
|__2022-03-21_donor_2.csv
|__2022-03-21_donor_3.csv
|__2022-05-14_donor_1.csv
|__2022-05-14_donor_2.csv
|__2022-05-14_donor_3.csv
|__processed/
|__images/
|__code/
|__functions/
|__summarise.R
|__normalise.R
|__theme_volcano.R
|__01_data_processing.py
|__02_exploratory.R
|__03_modelling.R
|__04_figures.R
|__reports/
|__01_report.qmd
|__02_supplementary.qmd
|__figures/
|__01_volcano_donor_1_vs_donor_2.eps
|__02_volcano_donor_1_vs_donor_3.eps
```

# Documentation

@@ -211,35 +62,7 @@ Python:

- Ideally, a summary of changes with the date

```
-- liver_transcriptome/
|__data
|__raw/
|__2022-03-21_donor_1.csv
|__2022-03-21_donor_2.csv
|__2022-03-21_donor_3.csv
|__2022-05-14_donor_1.csv
|__2022-05-14_donor_2.csv
|__2022-05-14_donor_3.csv
|__processed/
|__images/
|__code/
|__functions/
|__summarise.R
|__normalise.R
|__theme_volcano.R
|__01_data_processing.py
|__02_exploratory.R
|__03_modelling.R
|__04_figures.R
|__README.md
|__reports/
|__01_report.qmd
|__02_supplementary.qmd
|__figures/
|__01_volcano_donor_1_vs_donor_2.eps
|__02_volcano_donor_1_vs_donor_3.eps
```


## Code comments

@@ -248,49 +71,7 @@ Python:
explain what the code is doing and why. They are also used to
temporarily remove code from execution.

# GitHub Copilot demo

# Quarto demo

# Useful exercises

- Want GitHub Copilot?

🎬 Create a [GitHub account](https://github.com/)

🎬 Apply for [student
benefits](https://education.github.com/discount_requests/application)

- Update R and RStudio

🎬 [Update R]()

🎬 [Update RStudio](https://posit.co/download/rstudio-desktop/). You
will need the prerelease [Desert
Sunflower](https://dailies.rstudio.com/rstudio/desert-sunflower/)
for GitHub Copilot integration

- Install package building tools

🎬 Windows: install
[Rtools](https://cran.r-project.org/bin/windows/Rtools/rtools43/rtools.html)

🎬 Mac: install [Xcode from the Mac App
Store](https://apps.apple.com/ca/app/xcode/id497799835?mt=12)

- Update packages:

🎬 devtools, tidyverse, BiocManager, readxl

- Install Quarto

🎬 [Install Quarto](https://quarto.org)

- Install Zotero

🎬 Install [Zotero](https://www.zotero.org/)

🎬 [Sign up for an account](https://www.zotero.org/user/register)

You're finished!

2 changes: 1 addition & 1 deletion update-notes.txt
@@ -86,7 +86,7 @@ Curate and reorganise your code
restart R to try. exchange projects with a friend. Do they understand?

Readme
- how to make
- how to make: create a new text file in the top level of your project
- what goes in
- software including versions
- session info
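
A minimal sketch of the "how to make" and "what goes in" notes above, assuming the code is run from the top level of the project. The headings, the file name `README.md` and the package names in `pkgs` are placeholders for illustration, not a prescribed template.

```r
# Sketch: create a README as a plain-text file at the top level of the project.
# Headings and package names are placeholders.
pkgs <- c("stats", "utils")  # replace with the packages your analysis uses

# Look up the installed version of each named package
pkg_versions <- vapply(
  pkgs,
  function(p) as.character(utils::packageVersion(p)),
  character(1)
)

readme_lines <- c(
  "# Project title",
  "",
  "## Software",
  "",
  paste("-", R.version.string),
  paste0("- ", pkgs, " ", pkg_versions),
  "",
  "## Session info",
  "",
  # indent by four spaces so Markdown shows the output verbatim
  paste0("    ", capture.output(print(sessionInfo())))
)

writeLines(readme_lines, "README.md")
```

Generating the software section from code, rather than typing versions by hand, keeps the README consistent with the session that actually produced the results.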
