diff --git a/core/core.html b/core/core.html index 55aff88..b44c689 100644 --- a/core/core.html +++ b/core/core.html @@ -8,7 +8,7 @@ -Core Data Analysis – Data Analysis for Group Project +Core: Supporting Information – Data Analysis for Group Project - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
-
- - -
- -
- - -
- - - -
- -
-
-

Overview

-

Core 3: Research Compendia and Reproducible Reporting

-
- - - -
- - -
-
Published
-
-

3 September, 2024

-
-
- - -
- - - -
- - -

This week we will cover the “Research Compendium” and reproducible reporting, both of which are part of the assessment. A Research Compendium is a documented collection of all the digital parts of the research project including data (or access to data), code and outputs. The Compendium might be a single Quarto/RStudio Project, or it might be a folder including a Quarto/RStudio Project and some additional materials including the description of unscripted processing. The collection is organised and documented in such a way that reproducing all the results is straightforward for another individual. Reproducible reporting means using literate programming to weave code and text together in a single document. Quarto is a multi-language literate programming tool (very like R Markdown).

-
-

Learning objectives

-

The successful student will be able to:

-
    -
  • explain what a research compendium is and describe its components
  • -
  • relate the content and concepts in Core 1 and Core 2 to the research compendium
  • -
  • create a Quarto document and: -
      -
    • appreciate the role of the YAML header
    • -
    • format text as bold, italics, headings etc
    • -
    • add citations and a bibliography
    • -
    • create automatically numbered figures and tables with cross references in text
    • -
    • set default code chunk behaviour and those for individual chunks
    • -
    • use inline code to report results
    • -
    • insert special characters and mathematical expressions with LaTeX
    • -
  • -
-
-
-

Instructions

-
    -
  1. Prepare

  2. -
  3. Workshop

  4. -
  5. Consolidate by working on your project and research compendium

  6. -
- - -
- -
- -
- - - - - - \ No newline at end of file diff --git a/core/week-11/study_before_workshop.html b/core/week-11/study_before_workshop.html deleted file mode 100644 index bae0961..0000000 --- a/core/week-11/study_before_workshop.html +++ /dev/null @@ -1,918 +0,0 @@ - - - - - - - - - - - - - - - - Data Analysis for Group Project – Independent Study to prepare for workshop - - - - - - - - - - - - - - - - - - - - - - - -
-
- -
-

Independent Study to prepare for workshop

-

Core 3 Research Compendia and Reproducible Reporting

- -
-
-
-Emma Rand -
-
-
- -

3 September, 2024

-
-
-

Module assessment

-

This module is assessed by:

-
    -
  • Oral presentation 30%

  • -
  • Project Report and Research Compendium 70% of which

    -
      -
    • 50% report
    • -
    • 20% compendium
    • -
  • -
-

These slides are a guide to the Research Compendium.

-
-
-

What is a Research Compendium?

-

Overview of assessment

-
-

Stage 3 Integrated Masters students are expected to submit a Research Compendium that is a documented collection of all the digital parts of the research project including data (or access to data), code and outputs. The Compendium might be a single Quarto/RStudio Project, or it might be a folder including a Quarto/RStudio Project and some additional materials including the description of unscripted processing. The collection is organised and documented in such a way that reproducing all the results is straightforward for another individual.

-

Students will be assessed on the technical complexity, completeness and organisation of their compendium and the completeness, reproducibility and clarity of their documentation at the project and the code/process level. Marking will focus on the reproducibility of the results and the clarity of the decision-making processes rather than the interpretation of the results, which is covered in the report. There is no word or size limit for any part of the compendium but its contents should be concise and minimal. Extraneous text, code or files will be penalised.

-
-
-
-

What is a Research Compendium?

-
-
-
    -
  • Zipped folder containing all data, code and text associated with a research project organised and documented clearly. Any unscripted processing should be described.

  • -
  • Everything needed to understand what the project is and reproduce the results, and no more. The compendium should not be a dumping ground for data files and scripts. It needs to be curated. You may generate files that are not needed to reproduce your work and these should be removed.

  • -
  • Your compendium might be a single Quarto/RStudio Project, or it might be a folder including an RStudio Project and some additional materials including the description of unscripted processing.

  • -
  • Ideally uses literate programming to create the submitted report

  • -
-
-
-
-
-

Use guidelines from Core 1 and 2

-
    -
  • follow the guidance in Core 1 on organisation, naming things and documentation

  • -
  • follow the guidance in Core 2 on well-formatted code, consistency, modularisation and documentation

  • -
-
-
-

Project level documentation

-
-
    -
  • as concise as possible, bullet points are good

  • -
  • primarily in the README file but some details may be in scripts

  • -
  • title, concise description of the work, author exam number, date, overview of compendium contents

  • -
  • all the software information including versions

  • -
  • instructions needed to reproduce the work, order of workflow, settings/parameter values for software

  • -
-
-
-
-

Project level documentation - cont

-
-
    -
  • description, format and provenance of the data

  • -
  • style conventions used in the code

  • -
  • any other information needed to understand the project and reproduce the results

  • -
-
-
-
-

Script level documentation

-

Shorthand for documentation at the script and/or code chunk level and/or process level where unscripted processing is used.

-
-
    -
  • overview of the script/chunk/process and its purpose

  • -
  • code comments

  • -
-
-
-
-

What is a Research Compendium?

-
-
    -
  • A research compendium is something you develop throughout your research project. It is not something you create at the end.

  • -
  • You update and reorganise as you go.

  • -
  • When you plan your research include the planning of recording, organising, and documenting your data and its analysis.

  • -
  • Think ahead to how and where you will be recording your data and how you will be analysing it.

  • -
-
-
-
-

Further Reading

- -
-
-

References

- - -
-
-The Turing Way Community. 2022. The Turing Way: A Handbook for Reproducible, Ethical and Collaborative Research. Zenodo. https://doi.org/10.5281/ZENODO.3233853. -
-
-Marwick, Ben, Carl Boettiger, and Lincoln Mullen. 2018. “Packaging Data Analytical Work Reproducibly Using R (and Friends).” The American Statistician 72 (1): 80–88. https://doi.org/10.1080/00031305.2017.1375986. -
-
-Rule, Adam, Amanda Birmingham, Cristal Zuniga, Ilkay Altintas, Shih-Cheng Huang, Rob Knight, Niema Moshiri, et al. 2019. “Ten Simple Rules for Writing and Sharing Computational Analyses in Jupyter Notebooks.” Edited by Fran Lewitter. PLOS Computational Biology 15 (7): e1007007. https://doi.org/10.1371/journal.pcbi.1007007. -
-
-Sandve, Geir Kjetil, Anton Nekrutenko, James Taylor, and Eivind Hovig. 2013. “Ten Simple Rules for Reproducible Computational Research.” PLoS Comput. Biol. 9 (10): e1003285. https://doi.org/10.1371/journal.pcbi.1003285. -
-
-
-
-
- - - - - - - - - - - - - - - - - - - - - - - \ No newline at end of file diff --git a/core/week-2/data/1cq2.pdb b/core/week-2-old/data/1cq2.pdb similarity index 100% rename from core/week-2/data/1cq2.pdb rename to core/week-2-old/data/1cq2.pdb diff --git a/core/week-2/data/Y101_Y102_Y201_Y202_Y101-5.csv b/core/week-2-old/data/Y101_Y102_Y201_Y202_Y101-5.csv similarity index 100% rename from core/week-2/data/Y101_Y102_Y201_Y202_Y101-5.csv rename to core/week-2-old/data/Y101_Y102_Y201_Y202_Y101-5.csv diff --git a/core/week-2/data/control_merged.tif b/core/week-2-old/data/control_merged.tif similarity index 100% rename from core/week-2/data/control_merged.tif rename to core/week-2-old/data/control_merged.tif diff --git a/core/week-2/data/tf_lthsc.csv b/core/week-2-old/data/tf_lthsc.csv similarity index 100% rename from core/week-2/data/tf_lthsc.csv rename to core/week-2-old/data/tf_lthsc.csv diff --git a/core/week-2/data/xlaevis_counts_S30.csv b/core/week-2-old/data/xlaevis_counts_S30.csv similarity index 100% rename from core/week-2/data/xlaevis_counts_S30.csv rename to core/week-2-old/data/xlaevis_counts_S30.csv diff --git a/core/week-2-old/images/JennyBryan.jpg b/core/week-2-old/images/JennyBryan.jpg new file mode 100644 index 0000000..f233139 Binary files /dev/null and b/core/week-2-old/images/JennyBryan.jpg differ diff --git a/core/week-2-old/overview.html b/core/week-2-old/overview.html new file mode 100644 index 0000000..b6f9429 --- /dev/null +++ b/core/week-2-old/overview.html @@ -0,0 +1,677 @@ + + + + + + + + + + +Overview – Data Analysis for Group Project + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+
+ +
+ +
+ + + + +
+ +
+
+

Overview

+

Core 2: File types, workflow tips and other tools

+
+ + + +
+ + +
+
Published
+
+

13 September, 2024

+
+
+ + +
+ + + +
+ + +

This week we will consider file types, workflow tips and other tools. The independent study reiterates the value of RStudio Projects and shows you how to create them with usethis. You will also learn how to recognise and write cool 😎 code, not 😩 ugly code, and how to code algorithmically. In the workshop we will examine some common biological data formats and discover some awesome shortcuts to help you write cool 😎 code. You will also get a brief introduction to the command line and Google Colab.

+
+

Learning objectives

+

The successful student will be able to:

+
    +
  • explain why RStudio Projects are useful/essential and be able to use the usethis package
  • +
  • write cool 😎 code not 😩 ugly code
  • +
  • explain the value of code which expresses the structure of the problem/solution.
  • +
  • describe some common file types for biological data
  • +
  • use some useful shortcuts to help write cool 😎 code
  • +
  • know what the command line is and how to use it for simple tasks
  • +
  • use Google Colab to run code
  • +
  • recognise some of the differences between R and Python
  • +
+
+
+

Instructions

+
    +
  1. Prepare 20 mins reading on RStudio Projects revisited, formatting code and coding algorithmically

  2. +
  3. Workshop

    +
      +
    1. 💬 Types of biological data files

    2. +
    3. 🪄 Workflow tips and shortcuts

    4. +
    5. 💻 The command line

    6. +
    7. 💻 Google Colab

    8. +
    9. 💻 Python

    10. +
  4. +
  5. Consolidate

    +
      +
    1. 💻 not sure yet :)
    2. +
  6. +
+ + +
+ +
+ +
+ + + + + + \ No newline at end of file diff --git a/core/week-2-old/study_after_workshop.html b/core/week-2-old/study_after_workshop.html new file mode 100644 index 0000000..7b720ef --- /dev/null +++ b/core/week-2-old/study_after_workshop.html @@ -0,0 +1,647 @@ + + + + + + + + + + +Independent Study to consolidate this week – Data Analysis for Group Project + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+
+ +
+ +
+ + + + +
+ +
+
+

Independent Study to consolidate this week

+

Core 2

+
+ + + +
+ + +
+
Published
+
+

13 September, 2024

+
+
+ + +
+ + + +
+ + +
+

bbbb

+ + +
+ +
+ +
+ + + + + + \ No newline at end of file diff --git a/core/week-2-old/study_before_workshop.html b/core/week-2-old/study_before_workshop.html new file mode 100644 index 0000000..5920783 --- /dev/null +++ b/core/week-2-old/study_before_workshop.html @@ -0,0 +1,1274 @@ + + + + + + + + + + + + + + + +Data Analysis for Group Project – Independent Study to prepare for workshop + + + + + + + + + + + + + + + + + + + +
+
+ +

Independent Study to prepare for workshop

+

Core 2: File types, workflow tips and other tools

+ +
+
+
+Emma Rand +
+
+
+ +

13 September, 2024

+

Overview

+
    +
  • RStudio Projects revisited +
      +
    • using usethis package
    • +
    • Adding a README
    • +
    +
  • +
  • Formatting code
  • +
  • Code algorithmically / algebraically.
  • +

Reproducibility is a continuum

+

Some is better than none!

+
    +
  • Organise your project
    +
  • +
  • Script everything.
    +
  • +
  • Format code and follow a consistent style.
    +
  • +
  • Code algorithmically
  • +
  • Modularise your code: organise into sections and scripts
  • +
  • Document your project - commenting, READMEs
  • +
  • Use literate programming e.g., R Markdown or Quarto
  • +
+
+
    +
  • More advanced: Version control, continuous integration, environments, containers
  • +
+
+

RStudio Projects revisited

+ +

RStudio Projects

+
+
    +
  • We used RStudio Projects in stage one but they are so useful, it is worth covering them again in case you are not yet using them.

  • +
  • We will also cover the usethis workflow to create an RStudio Project.

  • +
  • RStudio Projects make it easy to manage working directories and paths because they set the working directory to the RStudio Project's directory automatically.

  • +
+
+

RStudio Projects

+
+
+
+
-- stem_cell_rna
+   |__stem_cell_rna.Rproj   
+   |__raw_data/
+      |__2019-03-21_donor_1.csv
+   |__README.md
+   |__R/
+      |__01_data_processing.R
+      |__02_exploratory.R
+      |__functions/
+         |__theme_volcano.R
+         |__normalise.R
+
+
+

The project directory is the folder at the top

+
+

RStudio Projects

+
+
+
+
-- stem_cell_rna
+   |__stem_cell_rna.Rproj   
+   |__raw_data/
+      |__2019-03-21_donor_1.csv
+   |__README.md
+   |__R/
+      |__01_data_processing.R
+      |__02_exploratory.R
+      |__functions/
+         |__theme_volcano.R
+         |__normalise.R
+
+
+

The .Rproj file is directly under the project folder. Its presence is what makes the folder an RStudio Project.

+
+

RStudio Projects

+
+
    +
  • When you open an RStudio Project, the working directory is set to the Project directory (i.e., the location of the .Rproj file).

  • +
  • When you use an RStudio Project you do not need to use setwd()

  • +
  • When someone, including future you, opens the project on another machine, all the paths just work.

  • +
+
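The point about paths can be illustrated with a minimal sketch. It assumes the project layout shown in the example tree earlier (the folder and file names are taken from that tree) and that the code is run from inside the RStudio Project:

```r
# Build a path relative to the project directory; no setwd() is needed
# because the Project sets the working directory for us.
donor_file <- file.path("raw_data", "2019-03-21_donor_1.csv")

# Check where R is working and whether the file can be found from there
getwd()
file.exists(donor_file)
```

Because the path is relative to the project directory, the same line should work unchanged when the project folder is copied to another machine.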
+

RStudio Projects

+ +

Jenny Bryan

In the words of Jenny Bryan:

+
+

“If the first line of your R script is setwd("C:/Users/jenny/path/that/only/I/have") I will come into your office and SET YOUR COMPUTER ON FIRE”

+
+

Creating an RStudio Project

+

There are two ways to create an RStudio Project.

+
    +
  1. Using one of the two menus

  2. +
  3. Using the usethis package

  4. +

Using a menu

+

There are two menus:

+
    +
  1. Top left, File menu

  2. +
  3. Top right, drop-down indicated by the .Rproj icon

  4. +
+

They both do the same thing.

+

In both cases you choose: New Project | New Directory | New Project

+
+

Make sure you “Browse” to the folder in which you want to create the project.

+
+

Using the usethis package

+ +

Using the usethis package

+

I occasionally use the menu but I mostly use the usethis package.

+
+

🎬 Go to RStudio and check your working directory:

+
+ +
+

"C:/Users/er13/Desktop"

+
+
+

❔ Is your working directory a good place to create a Project folder?

+
+

Using the usethis package

+

If this is a good place to create a Project directory then…

+

🎬 Create a project with:

+
+
usethis::create_project("bananas")
+
+

Using the usethis package

+

Otherwise

+

If you want the project directory elsewhere, you will need to give the relative path, e.g.

+
+
usethis::create_project("../Documents/bananas")
+
+

Using the usethis package

+

The output will look like this and a new RStudio session will start.

+
> usethis::create_project("bananas")
+√ Creating 'bananas/'
+√ Setting active project to 'C:/Users/er13/Desktop/bananas'
+√ Creating 'R/'
+√ Writing 'bananas.Rproj'
+√ Adding '.Rproj.user' to '.gitignore'
+√ Opening 'C:/Users/er13/Desktop/bananas/' in new RStudio session
+√ Setting active project to '<no active project>'
+

Using the usethis package

+

When you create a new RStudio Project with usethis:

+
+
    +
  • A folder called bananas/ is created
  • +
  • RStudio starts a new session in bananas/ i.e., your working directory is now bananas/ +
  • +
  • A folder called R/ is created
  • +
  • A file called bananas.Rproj is created
  • +
  • A file called .gitignore is created
  • +
  • A hidden directory called .Rproj.user is created
  • +
+
+

Using the usethis package

+
+
    +
  • the .Rproj file is what makes the directory an RStudio Project

  • +
  • the .Rproj.user directory is where project-specific temporary files are stored. You don’t need to mess with it.

  • +
  • the .gitignore is used for version controlled projects. If not using git, you can ignore it.

  • +
+
+

Opening and closing

+

You can close an RStudio Project with ONE of:

+
    +
  1. File | Close Project
  2. +
  3. Using the drop-down option on the far right of the tool bar where you see the Project name
  4. +
+
+

You can open an RStudio Project with ONE of:

+
    +
  1. File | Open Project or File | Recent Projects
    +
  2. +
  3. Using the drop-down option on the far right of the tool bar where you see the Project name
    +
  4. +
  5. Double-clicking an .Rproj file from your file explorer/finder
  6. +
+

When you open a project, a new R session starts.

+
+

Adding a README

+ +

Using the usethis package

+

Once the RStudio project has been created, usethis helps you follow good practice.

+
+

🎬 We can add a README with:

+
+
usethis::use_readme_md()
+
+
+
+

This creates a file called README.md, with a little default text, in the Project directory and opens it for editing.

+
+
+

md stands for markdown. It is an extremely widely used text formatting language which is readable as plain text. If you have ever used asterisks to make text bold or italic, you have used markdown.

+
+

Code formatting and style

+ +

Code formatting and style

+
+

“Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread.”

+
+

The tidyverse style guide

+

Code formatting and style

+

We have all written code which is hard to read!

+

We all improve over time.

+
+
+ + +
+
+

Code formatting and style

+

Some key points:

+
    +
  • be consistent, emulate experienced coders
    +
  • +
  • use snake_case for variable names (not CamelCase, dot.case)
    +
  • +
  • use <- not = for assignment
    +
  • +
  • use spacing around most operators and after commas
    +
  • +
  • use indentation
    +
  • +
  • avoid long lines, break up code blocks with new lines
    +
  • +
  • use " for quoting text (not ') unless the text contains double quotes
  • +

😩 Ugly code 😩

+
+
data<-read_csv('../data-raw/Y101_Y102_Y201_Y202_Y101-5.csv',skip=2)
+library(janitor);sol<-clean_names(data)
+data=data|>filter(str_detect(description,"OS=Homo sapiens"))|>filter(x1pep=='x')
+data=data|>
+mutate(g=str_extract(description,
+"GN=[^\\s]+")|>str_replace("GN=",''))
+data<-data|>mutate(id=str_extract(accession,"1::[^;]+")|>str_replace("1::",""))
+
+

😩 Ugly code 😩

+
    +
  • no spacing or indentation
  • +
  • inconsistent splitting of code blocks over lines
  • +
  • inconsistent use of quote characters
  • +
  • no comments
  • +
  • variable names convey no meaning
  • +
  • use of = for assignment and inconsistently
  • +
  • multiple commands on a line
  • +
  • library statement in the middle of the analysis
  • +

😎 Cool code 😎

+
+
# Packages ----------------------------------------------------------------
+library(tidyverse)
+library(janitor)
+
+# Import ------------------------------------------------------------------
+
+# define file name
+file <- "../data-raw/Y101_Y102_Y201_Y202_Y101-5.csv"
+
+# import: column headers and data are from row 3
+solu_protein <- read_csv(file, skip = 2) |>
+  janitor::clean_names()
+
+# Tidy data ----------------------------------------------------------------
+
+# filter out the bovine proteins and those proteins 
+# identified from fewer than 2 peptides
+solu_protein <- solu_protein |>
+  filter(str_detect(description, "OS=Homo sapiens")) |>
+  filter(x1pep == "x")
+
+# Extract the genename from description column to a column
+# of its own
+solu_protein <- solu_protein |>
+  mutate(genename = str_extract(description, "GN=[^\\s]+") |>
+           str_replace("GN=", ""))
+
+# Extract the top protein identifier from accession column (first
+# Uniprot ID after "1::") to a column of its own
+solu_protein <- solu_protein |>
+  mutate(protid = str_extract(accession, "1::[^;]+") |>
+           str_replace("1::", ""))
+
+

😎 Cool code 😎

+
    +
  • library() calls collected

  • +
  • Uses code sections to make it easier to navigate

  • +
  • Uses white space and proper indentation

  • +
  • Commented

  • +
  • Uses more informative name for the dataframe

  • +

Code ‘algorithmically’

+ +

Code ‘algorithmically’

+
+
    +
  • Write code which expresses the structure of the problem/solution.

  • +
  • Avoid hard coding numbers if at all possible - declare variables instead

  • +
  • Declare frequently used values as variables at the start e.g., colour schemes, figure saving settings

  • +
+
+
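As a minimal sketch of the last point, frequently used values can be declared once near the top of a script. All names and values below are hypothetical and only illustrate the idea:

```r
# Values used repeatedly are declared once, near the top of the script.
# All names and values here are hypothetical.
treatment_colours <- c(control = "#1b9e77", treated = "#d95f02")
fig_width <- 6    # inches
fig_height <- 4   # inches
fig_dpi <- 300    # dots per inch

# Later code refers to the variables rather than repeating the numbers,
# e.g. to work out figure dimensions in pixels
fig_width_px <- fig_width * fig_dpi
fig_height_px <- fig_height * fig_dpi
```

If the figure size or colour scheme needs to change, only the declarations at the top need editing.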

😩 Hard coding numbers.

+
+
    +
  • Suppose we want to calculate the sum of squares, \(SS(x)\), for the number of eggs in five nests.

  • +
  • The formula is given by: \(\sum (x_i- \bar{x})^2\)

  • +
  • We could calculate the mean and copy it, and the individual numbers, into the formula

  • +
+
+

😩 Hard coding numbers.

+
+
# mean number of eggs per nest
+sum(3, 5, 6, 7, 8) / 5
+
+
[1] 5.8
+
+
# ss(x) of number of eggs
+(3 - 5.8)^2 + (5 - 5.8)^2 + (6 - 5.8)^2 + (7 - 5.8)^2 + (8 - 5.8)^2
+
+
[1] 14.8
+
+
+

I am coding the calculation of the mean rather than using the mean() function only to explain what ‘coding algorithmically’ means using a simple example.

+

😩 Hard coding numbers

+
+
    +
  • if any of the sample numbers must be altered, all the code needs changing

  • +
  • it is hard to tell that the output of the first line is a mean

  • +
  • it is hard to recognise that the numbers in the mean calculation correspond to those in the next calculation

  • +
  • it is hard to tell that 5 is just the number of nests

  • +
  • there is no way of knowing whether numbers are the same by coincidence or because they refer to the same thing

  • +
+
+

😎 Better

+
+
# eggs in each nest
+eggs <- c(3, 5, 6, 7, 8)
+
+# mean eggs per nest
+mean_eggs <- sum(eggs) / length(eggs)
+
+# ss(x) of number of eggs
+sum((eggs - mean_eggs)^2)
+
+
[1] 14.8
+
+
+

😎 Better

+
+
    +
  • the commenting is similar but it is easier to follow

  • +
  • if any of the sample numbers must be altered, only that number needs changing

  • +
  • assigning a value you will later use to a variable with a meaningful name allows us to understand the first and second calculations

  • +
  • makes use of R’s elementwise calculation which resembles the formula (i.e., is expressed as the general rule)

  • +
+
+
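One possible next step, sketched here as an assumption rather than as part of the original slides, is to wrap the general rule in a small function so it can be reused for any sample. The function name `sum_of_squares` is hypothetical:

```r
# Hypothetical helper: sum of squared deviations from the mean
sum_of_squares <- function(x) {
  sum((x - mean(x))^2)
}

# eggs in each nest (same values as the example above)
eggs <- c(3, 5, 6, 7, 8)

# ss(x) of number of eggs
ss_eggs <- sum_of_squares(eggs)
```

The function body is the same elementwise expression as before; only the sample passed in changes.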

Summary

+
+
    +
  • Use an RStudio project for any R work (you can also incorporate other languages)

  • +
  • Write Cool code not Ugly code: space, consistency, indentation, comments, meaningful variable names

  • +
  • Write code which expresses the structure of the problem/solution.

  • +
  • Avoid hard coding numbers if at all possible - declare variables instead

  • +
+
+

Reading

+

Recommended if you still need convincing to use RStudio Projects

+ +

Completely optional suggestions for further reading

+
    +
  • Ten simple rules for reproducible computational research (Sandve et al. 2013) +
  • +
  • Good enough practices in scientific computing (Wilson et al. 2017) +
  • +
  • Excuse Me, Do You Have a Moment to Talk About Version Control? (Bryan 2018) +
  • +

References

+ + +
+
+Bryan, Jennifer. 2018. “Excuse Me, Do You Have a Moment to Talk about Version Control?” Am. Stat. 72 (1): 20–27. https://doi.org/10.1080/00031305.2017.1399928. +
+
+Bryan, Jennifer, Jim Hester, Shannon Pileggi, and E. David Aja. n.d. What They Forgot to Teach You about R. https://rstats.wtf/. +
+
+Sandve, Geir Kjetil, Anton Nekrutenko, James Taylor, and Eivind Hovig. 2013. “Ten Simple Rules for Reproducible Computational Research.” PLoS Comput. Biol. 9 (10): e1003285. https://doi.org/10.1371/journal.pcbi.1003285. +
+
+Wilson, Greg, Jennifer Bryan, Karen Cranston, Justin Kitzes, Lex Nederbragt, and Tracy K Teal. 2017. “Good Enough Practices in Scientific Computing.” PLoS Comput. Biol. 13 (6): e1005510. https://doi.org/10.1371/journal.pcbi.1005510. +
+
+
+
+
+ + + + + \ No newline at end of file diff --git a/core/week-11/workshop.html b/core/week-2-old/workshop.html similarity index 55% rename from core/week-11/workshop.html rename to core/week-2-old/workshop.html index 641c6dd..fce3096 100644 --- a/core/week-11/workshop.html +++ b/core/week-2-old/workshop.html @@ -140,7 +140,7 @@ }); - +
+
-
-
+

Workshop

-

Research Compendia and Reproducible Reporting

+

File types, workflow tips and other tools

@@ -382,7 +250,7 @@

Workshop

Published
-

3 September, 2024

+

13 September, 2024

@@ -392,34 +260,210 @@

Workshop

Introduction

-

Literate Programming

+

Session overview

+

In this workshop you will

+

File types

+

Omics

    -
  • Literate programming is a way of writing code and text together in a single document

  • -
  • The document is then processed to produce a report

  • -
  • Quarto (recommended) or R Markdown

  • -

Session overview

-

In this workshop we will go through an example quarto document. You will learn:

+
  • gene/transcript/protein/metabolite expression

  • +
  • transcriptomics 1

  • +
  • transcriptomics 2

  • +
  • proteomics

  • +

    Images

    +

    control_merged.tif

    +
    library(ijtiff)
    +img <- read_tif("data/control_merged.tif")
    +
    img
    +
      +
    • an image is at least one, and usually more, matrices of numbers representing the intensity of light at each pixel in the image

    • +
    • the number of matrices depends on the number of ‘channels’ in the image

    • +
    • a channel is a colour in the image

    • +
    • a frame is a single image in a series of images

    • +
    • we might normally call this a multi-dimensional array: x and y coordinates of the pixels are 2 dimensions, the channel is the third dimension and time is the fourth dimension

    • +
    +
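The idea of an image as a multi-dimensional array can be sketched with base R alone. This is a toy array, not the real image, and the dimension order used (y, x, channel, frame) is an assumption made for illustration:

```r
# A toy "image": 4 pixels high, 3 pixels wide, 2 channels, 5 frames,
# filled with zeros. The dimension order (y, x, channel, frame) is an
# assumption for illustration.
toy_img <- array(0, dim = c(4, 3, 2, 5))
dim(toy_img)

# One channel of one frame is an ordinary matrix of intensities
first_matrix <- toy_img[, , 1, 1]
dim(first_matrix)
```

Indexing with two fixed positions (channel and frame) drops those dimensions, leaving a single matrix with one value per pixel.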
    display(img)
    +

    Structure

    +

    1cq2.pdb

    +

    Workflow tips

    +

    🎬 Start by making a new RStudio project (use the workflow from the independent study). Add some files.

    +
      +
    • multiple cursors

    • +
    • open a file/function or find a variable CONTROL+.

    • +
    • the command palette CONTROL+SHIFT+P

    • +
    • segmenting code CONTROL+SHIFT+R

    • +
    • to correct indentation CONTROL+i

    • +
    • to reformat code CONTROL+SHIFT+A Not perfect but corrects spacing, indentation, multiple commands on lines and assignment with =

    • +
    • to comment and uncomment lines CONTROL+SHIFT+C

    • +
    • Tools | Global options | Code | Display | Show margin

    • +
    • Tools | Global options | Code | Diagnostic | Provide R style diagnostics

    • +
    • GitHub Copilot in RStudio, it’s finally here!

    • +

    Other tools

    +

    The command line

    +

    The command line - or shell - is a text interface for your computer. It’s a program that takes in commands, which it passes on to the computer’s operating system to run.

    +
      +
    • Windows PowerShell is a command line in Windows. It uses bash-like commands, unlike the Command Prompt, which uses DOS commands (a sort of Windows-only language). You can open it by going to Start | Windows PowerShell or by searching for it in the search bar.

    • +
    • Terminal is the command line in Mac OS X. You can open it by going to Applications | Utilities | Terminal or by searching for it in the Spotlight search bar.

    • +
    • git bash. I used the bash shell that comes with Git

    • +

    RStudio terminal

    +

    The RStudio terminal is a convenient interface to the shell without leaving RStudio. It is useful for running commands that are not available in R. For example, you can use it to run other programs like fastqc, git, ftp, ssh.

    +

    Make a directory

    +
    mkdir mynewdir
    +

    Quarto notebooks

    +

    Demo

      -
    • what the YAML header is
    • -
    • formatting (bold, italics, headings)
    • -
    • to control default and individual chunk options
    • -
    • how to add citations
    • -
    • figures and tables with cross referencing and automatic numbering
    • -
    • how to use inline coding to report results
    • -
    • how to insert special characters and equations
    • -

    Exercise

    -

    🎬 The example RStudio project containing this code is here: chaffinch. You can download the project as a zip file from there but there is some code that will do that automatically for you. Since this is an RStudio Project, do not run the code from inside a project. You may want to navigate to a particular directory or edit the destdir:

    -
    usethis::use_course(url = "3mmaRand/chaffinch", destdir = ".")
    -

    You can agree to deleting the zip. You should find RStudio restarts and you have a new project called chaffinch-xxxxxx. The xxxxxx is a commit reference - you do not need to worry about that, it is just a way to tell you which version of the repo you downloaded. You can now run the code in the project.

    -

    🎬 Make an outline of your compendium. This could be a sketch on paper or slide or from the mindmap software you usually use. Or it could be a skeleton of folders and files on your computer.

    -

    🎬 Make a start on a quarto doc.

    +
  • Text and executable cells
  • +
  • Formatting
  • +
  • Markdown
  • +
  • More in Week 6
  • +

    Google Colaboratory

    +

    Google Colab allows you to write and execute Python code in your browser.

    +

    Demo

    +

    Python

    +

    Differences between R and Python

    +

    Demo

    You’re finished!

    -

    🥳 Well Done! 🎉

    +

    🥳 Well Done! 🎉

    Independent study following the workshop

    -

    Consolidate

    -

    The Code file

    -

    These contain all the code needed in the workshop even where it is not visible on the webpage.

    -

    The workshop.qmd file is the file I use to compile the practical. Qmd stands for Quarto markdown. It allows code and ordinary text to be interleaved to produce well-formatted reports including webpages. Right-click on the link and choose Save-As to download. You will be able to open the Qmd file in RStudio. Alternatively, View in Browser. Coding and thinking answers are marked with #---CODING ANSWER--- and #---THINKING ANSWER---

    +

    Consolidate

    Pages made with R (R Core Team 2024), Quarto (allaire2022?), knitr (knitr?), kableExtra (Zhu 2021)

    diff --git a/core/week-2/workshop_files/figure-html/unnamed-chunk-3-1.png b/core/week-2-old/workshop_files/figure-html/unnamed-chunk-3-1.png similarity index 100% rename from core/week-2/workshop_files/figure-html/unnamed-chunk-3-1.png rename to core/week-2-old/workshop_files/figure-html/unnamed-chunk-3-1.png diff --git a/core/week-2/images/reproducible-matrix.jpg b/core/week-2/images/reproducible-matrix.jpg new file mode 100644 index 0000000..01287fc Binary files /dev/null and b/core/week-2/images/reproducible-matrix.jpg differ diff --git a/core/week-2/images/xkcd-comic-file-names.png b/core/week-2/images/xkcd-comic-file-names.png new file mode 100644 index 0000000..885acdc Binary files /dev/null and b/core/week-2/images/xkcd-comic-file-names.png differ diff --git a/core/week-2/overview.html b/core/week-2/overview.html index 7943c31..70f4062 100644 --- a/core/week-2/overview.html +++ b/core/week-2/overview.html @@ -101,7 +101,7 @@ - +
    @@ -124,7 +124,7 @@
    @@ -166,13 +166,13 @@ -
    @@ -166,13 +166,13 @@ -
    + +
    + + + +
    -
    - +
    +
    +

    Independent Study to prepare for workshop

    +

    Core: Supporting Information 1

    -
    -
    -Bryan, Jennifer. 2018. “Excuse Me, Do You Have a Moment to Talk about Version Control?” Am. Stat. 72 (1): 20–27. https://doi.org/10.1080/00031305.2017.1399928. -
    -
    -Bryan, Jennifer, Jim Hester, Shannon Pileggi, and E. David Aja. n.d. What They Forgot to Teach You about r. https://rstats.wtf/. -
    -
    -Sandve, Geir Kjetil, Anton Nekrutenko, James Taylor, and Eivind Hovig. 2013. “Ten Simple Rules for Reproducible Computational Research.” PLoS Comput. Biol. 9 (10): e1003285. https://doi.org/10.1371/journal.pcbi.1003285. -
    -
    -Wilson, Greg, Jennifer Bryan, Karen Cranston, Justin Kitzes, Lex Nederbragt, and Tracy K Teal. 2017. “Good Enough Practices in Scientific Computing.” PLoS Comput. Biol. 13 (6): e1005510. https://doi.org/10.1371/journal.pcbi.1005510. -
    -
    - -
    - - - + const findCites = (el) => { + const parentEl = el.parentElement; + if (parentEl) { + const cites = parentEl.dataset.cites; + if (cites) { + return { + el, + cites: cites.split(' ') + }; + } else { + return findCites(el.parentElement) + } + } else { + return undefined; + } + }; + var bibliorefs = window.document.querySelectorAll('a[role="doc-biblioref"]'); + for (var i=0; i + + + + \ No newline at end of file diff --git a/core/week-2/styles.css b/core/week-2/styles.css new file mode 100644 index 0000000..2ff0570 --- /dev/null +++ b/core/week-2/styles.css @@ -0,0 +1,16 @@ +/* css styles */ + + +@import url('https://fonts.googleapis.com/css2?family=Open+Sans&family=Source+Code+Pro&display=swap'); + + +// fonts + +$font-family-monospace: "Source Code Pro"; + +/*-- scss:rules --*/ + +code.sourceCode { + font-size: 1.3em; +} + diff --git a/core/week-2/workshop.html b/core/week-2/workshop.html index 9ac0e6d..053e70b 100644 --- a/core/week-2/workshop.html +++ b/core/week-2/workshop.html @@ -1,450 +1,849 @@ - - - - - - -Workshop – Data Analysis for Group Project - - - - - - - - - + - + + - - - - - + + + + + + + + + - - -
    -
    - -
    - - - -
    -

    Workshop

    -

    File types, workflow tips and other tools

    +

    😩 Ugly code 😩

    +
      +
    • no spacing or indentation
    • +
    • inconsistent splitting of code blocks over lines
    • +
    • inconsistent use of quote characters
    • +
    • no comments
    • +
    • variable names convey no meaning
    • +
    • inconsistent use of = for assignment
    • +
    • multiple commands on a line
    • +
    • library statement in the middle of the analysis
    • +

    😎 Cool code 😎

    +
    +
    +
    # Packages ----------------------------------------------------------------
    +library(tidyverse)
    +library(janitor)
    +
    +# Import ------------------------------------------------------------------
    +
    +# define file name
    +file <- "../data-raw/Y101_Y102_Y201_Y202_Y101-5.csv"
    +
    +# import: column headers and data are from row 3
    +solu_protein <- read_csv(file, skip = 2) |>
    +  janitor::clean_names()
    +
    +# Tidy data ----------------------------------------------------------------
    +
    +# filter out the bovine proteins and those proteins 
    +# identified from fewer than 2 peptides
    +solu_protein <- solu_protein |>
    +  filter(str_detect(description, "OS=Homo sapiens")) |>
    +  filter(x1pep == "x")
    +
    +# Extract the genename from description column to a column
    +# of its own
    +solu_protein <- solu_protein |>
    +  mutate(genename =  str_extract(description,"GN=[^\\s]+") |>
    +           str_replace("GN=", ""))
    +
    +# Extract the top protein identifier from accession column (first
    +# Uniprot ID after "1::") to a column of its own
    +solu_protein <- solu_protein |>
    +  mutate(protid =  str_extract(accession, "1::[^;]+") |>
    +           str_replace("1::", ""))
    +
    +
    +

    😎 Cool code 😎

    +
      +
    • library() calls collected

    • +
    • Uses code sections to make it easier to navigate

    • +
    • Uses white space and proper indentation

    • +
    • Commented

    • +
    • Uses more informative name for the dataframe

    • +

    Code ‘algorithmically’

    - - -
    - -
    -
    Author
    -
    -

    Emma Rand

    -
    -
    - -
    -
    Published
    -
    -

    3 September, 2024

    -
    -
    - - -
    - - - -

    Introduction

    -

    Session overview

    -

    In this workshop you will

    -

    File types

    -

    Omics

    +

    Code ‘algorithmically’

    +
    +
      +
    • Write code which expresses the structure of the problem/solution.

    • +
    • Avoid hard coding numbers if at all possible - declare variables instead

    • +
    • Declare frequently used values as variables at the start e.g., colour schemes, figure saving settings

    • +
    +
    +
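    The last point above can be sketched as follows. This is a minimal illustration only: the variable names, the figure dimensions and the hex colour codes are hypothetical and would be replaced by whatever your project needs.

```r
# Hypothetical project-wide settings, declared once at the top of a script
fig_width  <- 4      # inches
fig_height <- 3      # inches
fig_units  <- "in"

# Colour scheme (illustrative names and hex codes)
pal <- c(control = "#1b9e77", treated = "#d95f02")

# Collect the figure-saving settings so every plot can reuse them
fig_settings <- list(width = fig_width, height = fig_height, units = fig_units)

# Later code refers to the names rather than repeating the numbers
aspect_ratio <- fig_settings$width / fig_settings$height
```

    If the figure size has to change, only the two values at the top need editing.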

    😩 Hard coding numbers.

    +
    +
      +
    • Suppose we want to calculate the sum of squares, \(SS(x)\), for the number of eggs in five nests.

    • +
    • The formula is given by: \(\sum (x_i- \bar{x})^2\)

    • +
    • We could calculate the mean and copy it, and the individual numbers into the formula

    • +
    +
    +

    😩 Hard coding numbers.

    +
    +
    # mean number of eggs per nest
    +sum(3, 5, 6, 7, 8) / 5
    +
    +
    [1] 5.8
    +
    +
    # ss(x) of number of eggs
    +(3 - 5.8)^2 + (5 - 5.8)^2 + (6 - 5.8)^2 + (7 - 5.8)^2 + (8 - 5.8)^2
    +
    +
    [1] 14.8
    +
    +
    +

    I am coding the calculation of the mean, rather than using the mean() function, only to explain what ‘coding algorithmically’ means using a simple example.

    +

    😩 Hard coding numbers

    +

    Images

    -

    control_merged.tif

    -
    library(ijtiff)
    -img <- read_tif("data/control_merged.tif")
    -
    img
    +
  • if any of the sample numbers must be altered, all the code needs changing

  • +
  • it is hard to tell that the output of the first line is a mean

  • +
  • it is hard to recognise that the numbers in the mean calculation correspond to those in the next calculation

  • +
  • it is hard to tell that 5 is just the number of nests

  • +
  • there is no way of knowing whether numbers are the same by coincidence or because they refer to the same thing

  • + +
    +

    😎 Better

    +
    +
    # eggs each nest
    +eggs <- c(3, 5, 6, 7, 8)
    +
    +# mean eggs per nest
    +mean_eggs <- sum(eggs) / length(eggs)
    +
    +# ss(x) of number of eggs
    +sum((eggs - mean_eggs)^2)
    +
    +
    [1] 14.8
    +
    +
    +

    😎 Better

    +
      -
    • an image is at least one, and usually more, matrices of numbers representing the intensity of light at each pixel in the image

    • -
    • the number of matrices depends on the number of ‘channels’ in the image

    • -
    • a channel is a colour in the image

    • -
    • a frame is a single image in a series of images

    • -
    • we might normally call this a multi-dimensional array: the x and y coordinates of the pixels are two dimensions, the channel is the third dimension and time is the fourth dimension

    • +
    • the commenting is similar but it is easier to follow

    • +
    • if any of the sample numbers must be altered, only that number needs changing

    • +
    • assigning a value you will later use to a variable with a meaningful name allows us to understand the first and second calculations

    • +
    • makes use of R’s elementwise calculation which resembles the formula (i.e., is expressed as the general rule)
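    Taking the idea one step further, the general rule can be wrapped in a function so that it works for any numeric vector. This is a sketch rather than part of the original slides; the function name sum_sq is an assumption made for illustration.

```r
# Sum of squares written as a reusable function of any numeric vector
sum_sq <- function(x) {
  sum((x - mean(x))^2)
}

# eggs in each nest (same values as in the slides)
eggs <- c(3, 5, 6, 7, 8)
sum_sq(eggs)

# Cross-check: the sample variance is SS(x) / (n - 1)
all.equal(sum_sq(eggs), var(eggs) * (length(eggs) - 1))
```

    Nothing in the function refers to the number of nests or to the mean of this particular sample, so it does not need editing when the data change.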

    -
    display(img)
    -

    Structure

    -

    1cq2.pdb

    -

    Workflow tips

    -

    🎬 Start by making a new RStudio project (use the workflow from the independent study). Add some files

    + +

    Naming things

    + +A comic figure is looking over the shoulder of another and is shocked by a list of files with names like 'Untitled 138 copy.docx' and 'Untitled 243.doc'. Caption: 'Protip: Never look in someone else's documents folder'

    documents, CC-BY-NC, https://xkcd.com/1459/

    Guiding principle - Have a convention! Good file names are:

    +
      +
    • machine readable

    • +
    • human readable

    • +
    • play nicely with sorting

    • +

    Naming suggestions

    +
      +
    • no spaces in names

    • +
    • use snake_case or kebab-case rather than CamelCase or dot.case

    • +
    • use all lower case except very occasionally where convention is otherwise, e.g., README, LICENSE

    • +
    • ordering: use left-padded numbers e.g., 01, 02….99 or 001, 002….999

    • +
    • dates ISO 8601 format: 2020-10-16

    • +
    • write down your conventions

    • +
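    The padding and date suggestions above can be applied in code when file names are generated. The sketch below assumes a hypothetical convention of number, date and description, with underscores between fields and hyphens within them; the description egg-counts is made up.

```r
# Left-padded sequence numbers sort correctly as text
idx <- sprintf("%02d", c(1, 2, 10))

# ISO 8601 date (YYYY-MM-DD)
run_date <- format(as.Date("2020-10-16"), "%Y-%m-%d")

# Hypothetical naming convention: number_date_description.csv
file_names <- paste0(idx, "_", run_date, "_egg-counts.csv")
file_names

# Compare how unpadded and padded numbers sort as text
sort(c("1", "2", "10"))
sort(idx)
```

    Because the fields are separated consistently, the names can later be split back into number, date and description by a program.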

    Workflow tips

    +
    • multiple cursors

    • open a file/function or find a variable CONTROL+.

    • @@ -456,644 +855,483 @@

      Workshop

    • Tools | Global options | Code | Display | Show margin

    • Tools | Global options | Code | Diagnostic | Provide R style diagnostics

    • GitHub Copilot in RStudio, it’s finally here!

    • -

    Other tools

    -

    The command line

    -

    The command line - or shell - is a text interface for your computer. It’s a program that takes in commands, which it passes on to the computer’s operating system to run.

    + + +

    Summary

    +
      -
    • Windows PowerShell is a command line in Windows. It uses bash-like commands, unlike the Command Prompt which uses DOS commands (a sort of Windows-only language). You can open it by going to Start | Windows PowerShell or by searching for it in the search bar.

    • -
    • Terminal is the command line in Mac OS X. You can open it by going to Applications | Utilities | Terminal or by searching for it in the Spotlight search bar.

    • -
    • git bash. I used the bash shell that comes with Git

    • -

    RStudio terminal

    -

    The RStudio terminal is a convenient interface to the shell without leaving RStudio. It is useful for running commands that are not available in R. For example, you can use it to run other programs like fastqc, git, ftp and ssh.

    -

    Reading

    +

    Completely optional suggestions for further reading

    + +

    Pages made with R (R Core Team 2024) and Quarto (allaire2022?)

    +

    References

    + +
    + -

    You can find out what you can see with ls which stands for “list”.

    -
    -
    ls
    -
    -
    data
    -images
    -overview.html
    -overview.qmd
    -study_after_workshop.qmd
    -study_before_workshop.html
    -study_before_workshop.ipynb
    -study_before_workshop.qmd
    -workshop.html
    -workshop.qmd
    -workshop.rmarkdown
    -workshop_files
    +
    +
    +Baggerly, Keith A, and Kevin R Coombes. 2009. “DERIVING CHEMOSENSITIVITY FROM CELL LINES: FORENSIC BIOINFORMATICS AND REPRODUCIBLE RESEARCH IN HIGH-THROUGHPUT BIOLOGY.” Ann. Appl. Stat. 3 (4): 1309–34. https://doi.org/10.2307/27801549.
    -

    You might have noticed that unlike R, the commands do not have brackets after them. Instead, options (or switches) are given after the command. For example, we can modify the ls command to give us more information with the -l option, which stands for “long”.

    -
    -
    ls -l
    -
    -
    total 236
    -drwxr-xr-x 2 runner docker  4096 Sep  3 16:11 data
    -drwxr-xr-x 2 runner docker  4096 Sep  3 16:11 images
    --rw-r--r-- 1 runner docker 34020 Sep  3 16:15 overview.html
    --rw-r--r-- 1 runner docker  1597 Sep  3 16:11 overview.qmd
    --rw-r--r-- 1 runner docker   184 Sep  3 16:11 study_after_workshop.qmd
    --rw-r--r-- 1 runner docker 72936 Sep  3 16:15 study_before_workshop.html
    --rw-r--r-- 1 runner docker  4807 Sep  3 16:11 study_before_workshop.ipynb
    --rw-r--r-- 1 runner docker 13029 Sep  3 16:11 study_before_workshop.qmd
    --rw-r--r-- 1 runner docker 58063 Sep  3 16:11 workshop.html
    --rw-r--r-- 1 runner docker  8550 Sep  3 16:11 workshop.qmd
    --rw-r--r-- 1 runner docker  8590 Sep  3 16:15 workshop.rmarkdown
    -drwxr-xr-x 3 runner docker  4096 Sep  3 16:11 workshop_files
    +
    +Bryan, Jennifer. 2018. “Excuse Me, Do You Have a Moment to Talk about Version Control?” Am. Stat. 72 (1): 20–27. https://doi.org/10.1080/00031305.2017.1399928.
    +
    +Bryan, Jennifer, Jim Hester, Shannon Pileggi, and E. David Aja. n.d. What They Forgot to Teach You about r. https://rstats.wtf/.
    -

    You can use more than one option at once. The -h option stands for “human readable” and makes the file sizes easier to understand for humans:

    -
    -
    ls -hl
    -
    -
    total 236K
    -drwxr-xr-x 2 runner docker 4.0K Sep  3 16:11 data
    -drwxr-xr-x 2 runner docker 4.0K Sep  3 16:11 images
    --rw-r--r-- 1 runner docker  34K Sep  3 16:15 overview.html
    --rw-r--r-- 1 runner docker 1.6K Sep  3 16:11 overview.qmd
    --rw-r--r-- 1 runner docker  184 Sep  3 16:11 study_after_workshop.qmd
    --rw-r--r-- 1 runner docker  72K Sep  3 16:15 study_before_workshop.html
    --rw-r--r-- 1 runner docker 4.7K Sep  3 16:11 study_before_workshop.ipynb
    --rw-r--r-- 1 runner docker  13K Sep  3 16:11 study_before_workshop.qmd
    --rw-r--r-- 1 runner docker  57K Sep  3 16:11 workshop.html
    --rw-r--r-- 1 runner docker 8.4K Sep  3 16:11 workshop.qmd
    --rw-r--r-- 1 runner docker 8.4K Sep  3 16:15 workshop.rmarkdown
    -drwxr-xr-x 3 runner docker 4.0K Sep  3 16:11 workshop_files
    +
    +Markowetz, Florian. 2015. “Five Selfish Reasons to Work Reproducibly.” Genome Biol. 16 (December): 274. https://doi.org/10.1186/s13059-015-0850-7.
    +
    +National Academies of Sciences, Engineering, Medicine, Policy, Global Affairs, Engineering, Medicine Committee on Science, Public Policy, Board on Research Data, et al. 2019. Understanding Reproducibility and Replicability. National Academies Press (US). https://www.ncbi.nlm.nih.gov/books/NBK547546/.
    -

    The -a option stands for “all” and shows us all the files, including hidden files.

    -
    -
    ls -alh
    -
    -
    total 244K
    -drwxr-xr-x 5 runner docker 4.0K Sep  3 16:15 .
    -drwxr-xr-x 6 runner docker 4.0K Sep  3 16:11 ..
    -drwxr-xr-x 2 runner docker 4.0K Sep  3 16:11 data
    -drwxr-xr-x 2 runner docker 4.0K Sep  3 16:11 images
    --rw-r--r-- 1 runner docker  34K Sep  3 16:15 overview.html
    --rw-r--r-- 1 runner docker 1.6K Sep  3 16:11 overview.qmd
    --rw-r--r-- 1 runner docker  184 Sep  3 16:11 study_after_workshop.qmd
    --rw-r--r-- 1 runner docker  72K Sep  3 16:15 study_before_workshop.html
    --rw-r--r-- 1 runner docker 4.7K Sep  3 16:11 study_before_workshop.ipynb
    --rw-r--r-- 1 runner docker  13K Sep  3 16:11 study_before_workshop.qmd
    --rw-r--r-- 1 runner docker  57K Sep  3 16:11 workshop.html
    --rw-r--r-- 1 runner docker 8.4K Sep  3 16:11 workshop.qmd
    --rw-r--r-- 1 runner docker 8.4K Sep  3 16:15 workshop.rmarkdown
    -drwxr-xr-x 3 runner docker 4.0K Sep  3 16:11 workshop_files
    +
    +OECD Global Science Forum. 2020. “Building Digital Workforce Capacity and Skills for Data-Intensive Science.” http://www.oecd.org/officialdocuments/publicdisplaydocumentpdf/?cote=DSTI/STP/GSF(2020)6/FINAL&docLanguage=En.
    +
    +R Core Team. 2024. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
    -

    You can move about with the cd command, which stands for “change directory”. You can use it to move into a directory by specifying the path to the directory:

    -
    -
    cd data
    -pwd
    -cd ..
    -pwd
    -cd data
    -pwd
    -
    -
    /home/runner/work/BIO00088H-data/BIO00088H-data/core/week-2/data
    -/home/runner/work/BIO00088H-data/BIO00088H-data/core/week-2
    -/home/runner/work/BIO00088H-data/BIO00088H-data/core/week-2/data
    +
    +Sandve, Geir Kjetil, Anton Nekrutenko, James Taylor, and Eivind Hovig. 2013. “Ten Simple Rules for Reproducible Computational Research.” PLoS Comput. Biol. 9 (10): e1003285. https://doi.org/10.1371/journal.pcbi.1003285.
    +
    +Wilson, Greg, Jennifer Bryan, Karen Cranston, Justin Kitzes, Lex Nederbragt, and Tracy K Teal. 2017. “Good Enough Practices in Scientific Computing.” PLoS Comput. Biol. 13 (6): e1005510. https://doi.org/10.1371/journal.pcbi.1005510.
    -
    head 1cq2.pdb
    -
    HEADER    OXYGEN STORAGE/TRANSPORT                04-AUG-99   1CQ2              
    -TITLE     NEUTRON STRUCTURE OF FULLY DEUTERATED SPERM WHALE MYOGLOBIN AT 2.0    
    -TITLE    2 ANGSTROM                                                             
    -COMPND    MOL_ID: 1;                                                            
    -COMPND   2 MOLECULE: MYOGLOBIN;                                                 
    -COMPND   3 CHAIN: A;                                                            
    -COMPND   4 ENGINEERED: YES;                                                     
    -COMPND   5 OTHER_DETAILS: PROTEIN IS FULLY DEUTERATED                           
    -SOURCE    MOL_ID: 1;                                                            
    -SOURCE   2 ORGANISM_SCIENTIFIC: PHYSETER CATODON;      
    -
    head -20 data/1cq2.pdb
    -
    HEADER    OXYGEN STORAGE/TRANSPORT                04-AUG-99   1CQ2              
    -TITLE     NEUTRON STRUCTURE OF FULLY DEUTERATED SPERM WHALE MYOGLOBIN AT 2.0    
    -TITLE    2 ANGSTROM                                                             
    -COMPND    MOL_ID: 1;                                                            
    -COMPND   2 MOLECULE: MYOGLOBIN;                                                 
    -COMPND   3 CHAIN: A;                                                            
    -COMPND   4 ENGINEERED: YES;                                                     
    -COMPND   5 OTHER_DETAILS: PROTEIN IS FULLY DEUTERATED                           
    -SOURCE    MOL_ID: 1;                                                            
    -SOURCE   2 ORGANISM_SCIENTIFIC: PHYSETER CATODON;                               
    -SOURCE   3 ORGANISM_COMMON: SPERM WHALE;                                        
    -SOURCE   4 ORGANISM_TAXID: 9755;                                                
    -SOURCE   5 EXPRESSION_SYSTEM: ESCHERICHIA COLI;                                 
    -SOURCE   6 EXPRESSION_SYSTEM_TAXID: 562;                                        
    -SOURCE   7 EXPRESSION_SYSTEM_VECTOR_TYPE: PLASMID;                              
    -SOURCE   8 EXPRESSION_SYSTEM_PLASMID: PET15A                                    
    -KEYWDS    HELICAL, GLOBULAR, ALL-HYDROGEN CONTAINING STRUCTURE, OXYGEN STORAGE- 
    -KEYWDS   2 TRANSPORT COMPLEX                                                    
    -EXPDTA    NEUTRON DIFFRACTION                                                   
    -AUTHOR    F.SHU,V.RAMAKRISHNAN,B.P.SCHOENBORN   
    -
    less 1cq2.pdb
    -

    less is a program that displays the contents of a file, one page at a time. It is useful for viewing large files because it does not load the whole file into memory before displaying it. Instead, it reads and displays a few lines at a time. You can navigate forward through the file with the spacebar, and backwards with the b key. Press q to quit.

    -

    A wildcard is a character that can be used as a substitute for any of a class of characters in a search. The most common wildcard characters are the asterisk (*) and the question mark (?).

    -
    ls *.csv
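    The same wildcards can be used from within R with Sys.glob(). The sketch below works in a temporary directory with made-up file names so that it can be run anywhere without touching real files.

```r
# Work in a temporary directory so nothing real is touched
tmp <- tempfile("wildcard-demo-")
dir.create(tmp)
file.create(file.path(tmp, c("a.csv", "b.csv", "notes.txt")))

# * matches any run of characters, as in `ls *.csv`
csv_files <- basename(Sys.glob(file.path(tmp, "*.csv")))
csv_files

# ? matches exactly one character
one_char <- basename(Sys.glob(file.path(tmp, "?.csv")))
one_char

# Tidy up the temporary directory
unlink(tmp, recursive = TRUE)
```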
    -

    cp stands for “copy”. You can copy a file from one directory to another by giving cp the path to the file you want to copy and the path to the destination directory.

    -
    cp 1cq2.pdb copy_of_1cq2.pdb
    -
    cp 1cq2.pdb ../copy_of_1cq2.pdb
    -
    cp 1cq2.pdb ../bob.txt
    -

    To delete a file use the rm command, which stands for “remove”.

    -
    rm ../bob.txt
    -

    but be careful because the file will be gone forever. There is no “are you sure?” or undo.

    -

    To move a file from one directory to another, use the mv command. mv works like cp except that it also deletes the original file.

    -
    mv ../copy_of_1cq2.pdb .
    -

    Make a directory

    -
    mkdir mynewdir
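    Base R has functions that do the same jobs as mkdir, cp, mv, rm and ls, which can be useful when file handling needs to be part of a script. This is a sketch using a temporary directory and hypothetical file names; as with rm, file.remove() has no undo.

```r
# Make a scratch directory (like mkdir)
tmp <- tempfile("file-ops-")
dir.create(tmp)

# Create a small file to work with
src <- file.path(tmp, "original.txt")
writeLines("example", src)

# Copy (like cp), then move/rename the copy (like mv)
file.copy(src, file.path(tmp, "copy.txt"))
file.rename(file.path(tmp, "copy.txt"), file.path(tmp, "moved.txt"))

# Delete the original (like rm): there is no undo
file.remove(src)

# List what is left (like ls)
remaining <- list.files(tmp)
remaining

# Tidy up the scratch directory
unlink(tmp, recursive = TRUE)
```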
    -

    Quarto notebooks

    -

    Demo

    -
      -
    • Text and executable cells
    • -
    • Formatting
    • -
    • Markdown
    • -
    • More in Week 6
    • -

    Google Colaboratory

    -

    Google Colab allows you to write and execute python code in your browser.

    -

    Demo

    -

    Python

    -

    Differences between R and python

    -

    Demo

    -

    You’re finished!

    -

    🥳 Well Done! 🎉

    -

    Independent study following the workshop

    -

    Consolidate

    -

    Pages made with R (R Core Team 2024), Quarto (allaire2022?), knitr (knitr?), kableExtra (Zhu 2021)

    -
    - - - -

    References

    -
    -R Core Team. 2024. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
    -
    -Zhu, Hao. 2021. “kableExtra: Construct Complex Table with ’Kable’ and Pipe Syntax.” https://CRAN.R-project.org/package=kableExtra. +
    -
    - - + } else { + return undefined; + } + }; + var bibliorefs = window.document.querySelectorAll('a[role="doc-biblioref"]'); + for (var i=0; i \ No newline at end of file diff --git a/core/week-6-old/overview.html b/core/week-6-old/overview.html new file mode 100644 index 0000000..ca8a6fa --- /dev/null +++ b/core/week-6-old/overview.html @@ -0,0 +1,671 @@ + + + + + + + + + + +Overview – Data Analysis for Group Project + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    +
    + +
    + +
    + + + + +
    + +
    +
    +

    Overview

    +

    Core Drop-in

    +
    + + + +
    + + +
    +
    Published
    +
    +

    13 September, 2024

    +
    +
    + + +
    + + + +
    + + +

    This week’s session is a drop-in and introduces no new material. Instead, it is an opportunity to ask questions about the content from Core 1 and 2 and to revise skills from stage 1 and 2 as needed.

    +
    +

    Instructions

    +
      +
    1. Prepare

      +
        +
      1. 📖 Review content from Core 1 and 2
      2. +
    2. +
    3. Workshop

      +
        +
      1. 💻 Ask questions about the content from Core 1 and 2 as needed

      2. +
      3. 💻 Revise skills from stage 1 and 2 (88H students) or 52M (70M students) as needed

      4. +
    4. +
    5. Consolidate

      +
        +
      1. There is no consolidation work for this drop-in
      2. +
    6. +
    + + +
    + +
    + +
    + + + + + + \ No newline at end of file diff --git a/core/week-11/study_after_workshop.html b/core/week-6-old/study_after_workshop.html similarity index 72% rename from core/week-11/study_after_workshop.html rename to core/week-6-old/study_after_workshop.html index 5b750b3..8ae592c 100644 --- a/core/week-11/study_after_workshop.html +++ b/core/week-6-old/study_after_workshop.html @@ -101,7 +101,7 @@ - +
    @@ -123,8 +123,8 @@ Welcome!
    - -
    @@ -166,13 +166,13 @@ -
    -
    +

    Overview

    -

    Core Drop-in

    +

    Core: Supporting Information 2

    @@ -340,7 +271,7 @@

    Overview

    Published
    -

    3 September, 2024

    +

    13 September, 2024

    @@ -352,23 +283,28 @@

    Overview

    -

    This week’s session is a drop-in and introduces no new material. Instead, it is an opportunity to ask questions about the content from Core 1 and 2 and to revise skills from stage 1 and 2 as needed.

    +

    This week you will revise some essential concepts for scientific computing: file system organisation, file types, working directories and paths. The workshop will cover a rationale for working reproducibly, project oriented workflow, naming things and documenting your work. We will also examine some file types and the concept of tidy data.

    +
    +

    Learning objectives

    +

    The successful student will be able to:

    +
      +
    • explain the organisation of files and directories in a file system including root, home and working directories
    • +
    • explain absolute and relative file paths
    • +
    • explain why working reproducibly is important
    • +
    • know how to use a project-oriented workflow to organise work
    • +
    • be able to give files human- and machine-readable names
    • +
    • outline some common biological data file formats
    • +
    +

    Instructions

    1. Prepare

        -
      1. 📖 Review content from Core 1 and 2
      2. -
    2. -
    3. Workshop

      -
        -
      1. 💻 Ask questions about the content from Core 1 and 2 as needed

      2. -
      3. 💻 Revise skills from stage 1 and 2 (88H students) or 52M (70M students) as needed

      4. -
    4. -
    5. Consolidate

      -
        -
      1. There is no consolidation work for this drop-in
      2. +
      3. 📖 Read Understanding file systems
    6. +
    7. Workshop

    8. +
    9. Consolidate

    diff --git a/core/week-6/study_after_workshop.html b/core/week-6/study_after_workshop.html index cefeeca..1793690 100644 --- a/core/week-6/study_after_workshop.html +++ b/core/week-6/study_after_workshop.html @@ -124,7 +124,7 @@ -

    Write the significant genes to file

    We will create a dataframe of the significant genes and write them to file. These are the files you want to examine in more detail along with the visualisations to select your genes of interest.

    🎬 Create a dataframe of the genes significant at the 0.01 level:

    -
    prog_hspc_results_sig0.01 <- prog_hspc_results |> 
    +
    prog_hspc_results_sig0.01 <- prog_hspc_results |> 
       filter(FDR <= 0.01)

    🎬 Write the dataframe to file

    @@ -896,7 +926,7 @@

    Workshop

    Our data have genes in rows and samples in columns which is a common organisation for gene expression data. However, PCA expects cells in rows and genes, the variables, in columns. We can transpose the data to get it in the correct format.

    🎬 Transpose the log2 transformed normalised counts:

    -
    prog_hspc_trans <- prog_hspc_results |> 
    +
    prog_hspc_trans <- prog_hspc_results |> 
       dplyr::select(starts_with(c("Prog_", "HSPC_"))) |>
       t() |> 
       data.frame()
    @@ -904,16 +934,16 @@

    Workshop

    We have used the select() function to select all the columns that start with Prog_ or HSPC_. We then use the t() function to transpose the dataframe. We then convert the resulting matrix to a dataframe using data.frame(). If you view that dataframe you’ll see it has default column names, which we can fix by using colnames() to set the column names to the gene ids.

    🎬 Set the column names to the gene ids:

    -
    colnames(prog_hspc_trans) <- prog_hspc_results$ensembl_gene_id
    +
    colnames(prog_hspc_trans) <- prog_hspc_results$ensembl_gene_id
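    To see what the select, transpose and rename steps do, it can help to try them on a tiny example. The sketch below uses base R and a made-up dataframe (toy, with invented gene and cell names) rather than the workshop data and dplyr.

```r
# Toy data: genes in rows, cells in columns (all names are made up)
toy <- data.frame(gene_id  = c("g1", "g2", "g3"),
                  Prog_001 = c(1, 2, 3),
                  HSPC_001 = c(4, 5, 6))

# Keep only the cell columns, transpose, and convert back to a dataframe
cell_cols <- grepl("^(Prog_|HSPC_)", names(toy))
toy_trans <- data.frame(t(toy[, cell_cols]))

# The gene ids were not carried through t(), so set them as column names
colnames(toy_trans) <- toy$gene_id
toy_trans
```

    The result has one row per cell and one column per gene, which is the layout PCA expects.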

    perform PCA using standard functions

    -
    pca <- prog_hspc_trans |>
    +
    pca <- prog_hspc_trans |>
       prcomp(rank. = 15) 

    The rank. argument tells prcomp() to only calculate the first 15 principal components. This is useful for visualisation as we can only plot in 2 or 3 dimensions. We can see the results of the PCA by viewing the summary() of the pca object.

    -
    summary(pca)
    +
    summary(pca)
    Importance of first k=15 (out of 280) components:
                                PC1     PC2     PC3     PC4     PC5     PC6     PC7
    @@ -933,13 +963,13 @@ 

    Workshop

    The Proportion of Variance tells us how much of the variance is explained by each component. We can see that the first component explains 0.1099 of the variance, the second 0.04874, and the third 0.02498. Together the first three components explain 18% of the total variance in the data. Plotting PC1 against PC2 will capture about 16% of the variance. This is not that high but it is likely better than we would get plotting any two genes against each other. To plot PC1 against PC2 we will need to extract the PC1 and PC2 scores from the pca object and add labels for the cells.
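    The proportions reported by summary() can also be calculated directly from the standard deviations stored in the prcomp object. The sketch below uses a small random matrix (toy) rather than the workshop data, so the numbers themselves carry no meaning.

```r
set.seed(42)

# Toy matrix: 20 observations (rows) of 5 variables (columns)
toy <- matrix(rnorm(100), nrow = 20, ncol = 5)

# Ask for scores on the first two components only
pca_toy <- prcomp(toy, rank. = 2)

# Scores are returned only for the requested components
dim(pca_toy$x)

# Proportion of variance explained by each component
prop_var <- pca_toy$sdev^2 / sum(pca_toy$sdev^2)

# Cumulative proportion: the first two values added together give the
# variance captured by a plot of PC1 against PC2
cum_var <- cumsum(prop_var)
round(cum_var, 3)
```

    Note that sdev holds a value for every component even when rank. limits the number of score columns, which is why summary() can report the proportions relative to the total variance.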

    🎬 Create a dataframe of the PC1 and PC2 scores which are in pca$x and add the cell ids:

    -
    pca_labelled <- data.frame(pca$x,
    +
    pca_labelled <- data.frame(pca$x,
                                cell_id = row.names(prog_hspc_trans))

    It will be helpful to add a column for the cell type so we can label points. One way to do this is to extract the information in the cell_id column into two columns.

    🎬 Extract the cell type and cell number from the cell_id column (keeping the cell_id column):

    -
    pca_labelled <- pca_labelled |> 
    +
    pca_labelled <- pca_labelled |> 
       extract(cell_id, 
               remove = FALSE,
               c("cell_type", "cell_number"),
    @@ -949,7 +979,7 @@ 

    Workshop

    We can now plot the PC1 and PC2 scores.

    🎬 Plot PC1 against PC2 and colour the points by cell type:

    -
    pca <- pca_labelled |> 
    +
    pca <- pca_labelled |> 
       ggplot(aes(x = PC1, y = PC2, 
                  colour = cell_type)) +
       geom_point(alpha = 0.4) +
    @@ -967,7 +997,7 @@ 

    Workshop

    Fairly good separation of cell types but plenty of overlap

    🎬 Save the plot to file:

    -
    ggsave("figures/prog_hspc-pca.png",
    +
    ggsave("figures/prog_hspc-pca.png",
            plot = pca,
            height = 3, 
            width = 4,
    @@ -979,28 +1009,28 @@ 

    Workshop

    We are going to create an interactive heatmap with the heatmaply (Galili et al. 2017) package. heatmaply takes a matrix as input so we need to convert a dataframe of the log2 values to a matrix. We will also set the rownames to the gene names.

    🎬 Convert a dataframe of the log2 values to a matrix. I have used sample() to select 70 random columns so the heatmap is generated quickly:

    -
    mat <- prog_hspc_results_sig0.01 |> 
    +
    mat <- prog_hspc_results_sig0.01 |> 
       dplyr::select(starts_with(c("Prog", "HSPC"))) |>
       dplyr::select(sample(1:1499, size = 70)) |>
       as.matrix()

    🎬 Set the row names to the gene names:

    -
    rownames(mat) <- prog_hspc_results_sig0.01$external_gene_name
    +
    rownames(mat) <- prog_hspc_results_sig0.01$external_gene_name

    You might want to view the matrix by clicking on it in the environment pane.

    🎬 Load the heatmaply package:

    We need to tell the clustering algorithm how many clusters to create. We will set the number of clusters for the cell types to be 2 and the number of clusters for the genes to be the same since it makes sense to see what clusters of genes correlate with the cell types.

    -
    n_cell_clusters <- 2
    +
    n_cell_clusters <- 2
     n_gene_clusters <- 2
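To get a feel for what choosing a number of clusters means, here is a minimal sketch in base R that cuts a hierarchical clustering into two groups. It assumes a hypothetical toy matrix with two well-separated sets of cells; it illustrates the general idea only and is not a description of what heatmaply does internally.

```r
# Toy matrix: 4 "genes" (rows) x 6 "cells" (columns); hypothetical values
set.seed(42)
toy <- cbind(
  matrix(rnorm(12, mean = 0), nrow = 4),   # cells 1 to 3
  matrix(rnorm(12, mean = 10), nrow = 4)   # cells 4 to 6
)
colnames(toy) <- paste0("cell_", 1:6)

# dist() measures distances between rows, so transpose to cluster the cells
cell_tree <- hclust(dist(t(toy)))

# Cut the tree into the requested number of clusters
n_cell_clusters <- 2
cell_groups <- cutree(cell_tree, k = n_cell_clusters)

table(cell_groups)
```

Running the same steps without `t()` would cluster the genes (rows) instead of the cells.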

    🎬 Create the heatmap:

    -
    heatmaply(mat, 
    +
    heatmaply(mat, 
               scale = "row",
               k_col = n_cell_clusters,
               k_row = n_gene_clusters,
    @@ -1009,8 +1039,8 @@ 

    Workshop

               labRow = rownames(mat),
               heatmap_layers = theme(axis.line = element_blank()))
    -
    - +
    +

    It will take a minute to run and display. On the vertical axis are genes which are differentially expressed at the 0.01 level. On the horizontal axis are cells. We can see that cells of the same type don’t cluster together that well. We can also see two clusters of genes, but the pattern of gene expression is not as clear as it was for the frogs and the correspondence with the cell clusters is not as strong.

    @@ -1019,16 +1049,16 @@

    Workshop

    Visualise all the results with a volcano plot

    Colour the points by whether FDR < 0.05 and whether the absolute summary.logFC is at least 2 (the thresholds used in the code):

    -
    prog_hspc_results <- prog_hspc_results |> 
    +
    prog_hspc_results <- prog_hspc_results |> 
       mutate(log10_FDR = -log10(FDR),
              sig = FDR < 0.05,
              bigfc = abs(summary.logFC) >= 2) 
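The effect of the two flags and of `interaction()` is easier to see on a few rows. This is a minimal sketch in base R, assuming a hypothetical dataframe `toy` with made-up FDR and log fold change values; the column names follow the workshop code, but the numbers are invented.

```r
# Toy results (hypothetical values, not the workshop data)
toy <- data.frame(
  FDR = c(0.001, 0.2, 0.03, 0.0001),
  summary.logFC = c(2.5, 3.0, 0.4, -2.2)
)

# Same three derived columns as in the workshop code
toy$log10_FDR <- -log10(toy$FDR)
toy$sig <- toy$FDR < 0.05
toy$bigfc <- abs(toy$summary.logFC) >= 2

# interaction() combines the two logical flags into a single factor,
# so each point falls into one of four colour groups
toy$group <- interaction(toy$sig, toy$bigfc)

as.character(toy$group)
```

Each label has the form `sig.bigfc`, so `TRUE.TRUE` marks genes that pass both thresholds, and the negative fold change in the last row is still flagged because the absolute value is used.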
    -
    vol <- prog_hspc_results |> 
    +
    vol <- prog_hspc_results |> 
       ggplot(aes(x = summary.logFC, 
                  y = log10_FDR, 
                  colour = interaction(sig, bigfc))) +
    @@ -1052,28 +1082,20 @@ 

    Workshop

       theme_classic() +
       theme(legend.position = "none")

    vol
    -
    -
    Error in `geom_text_repel()`:
    -! Problem while computing aesthetics.
    -ℹ Error occurred in the 5th layer.
    -Caused by error:
    -! object 'external_gene_name' not found
    +
    +
    +

    +
    +
    -
    ggsave("figures/prog-hspc-volcano.png",
    +
    ggsave("figures/prog-hspc-volcano.png",
            plot = vol,
            height = 4.5, 
            width = 4.5,
            units = "in",
            device = "png")
    -
    -
    Error in `geom_text_repel()`:
    -! Problem while computing aesthetics.
    -ℹ Error occurred in the 5th layer.
    -Caused by error:
    -! object 'external_gene_name' not found
    -
    diff --git a/transcriptomics/week-5/workshop_files/figure-html/unnamed-chunk-33-1.png b/transcriptomics/week-5/workshop_files/figure-html/unnamed-chunk-33-1.png
    index 4d37e2b..70ab90a 100644
    Binary files a/transcriptomics/week-5/workshop_files/figure-html/unnamed-chunk-33-1.png and b/transcriptomics/week-5/workshop_files/figure-html/unnamed-chunk-33-1.png differ
    diff --git a/transcriptomics/week-5/workshop_files/figure-html/unnamed-chunk-65-1.png b/transcriptomics/week-5/workshop_files/figure-html/unnamed-chunk-65-1.png
    new file mode 100644
    index 0000000..7ec54a6
    Binary files /dev/null and b/transcriptomics/week-5/workshop_files/figure-html/unnamed-chunk-65-1.png differ