Response to Lex Nederbragt review (#109)
* Summary and clarifications in 02-data_management.md

* Clarifications in 04-collaboration.md

* Simplify recap in 08-what_next.md

* changelog explanation and minor edits to 06-track_changes.md

* Clarify 05-project_organization.md

* Links, clarifications, in 05-project_organization.md

* Attribution to GEP paper in 02-data_management.md

* Attribute GEP paper in 03-software.md

* Attribute GEP paper in 04-collaboration.md

* Attribute GEP paper in 06-track_changes.md

* Attribute GEP paper in 07-manuscripts.md

* Add links in 03-software.md

* More links in 03-software.md

* Intro discussion notes in instructor guide

* Added links to resources on file naming

Addresses #42

* Clarified use of multiple tables

- Record different data types in individual tables as appropriate (e.g. sample metadata may be kept separately from sequencing experiment metadata)
- Use unique identifiers for every record in a table, allowing linkages between tables (e.g. sample identifiers are recorded in the sequencing experiment metadata)

Addresses #44

* Modifications to pseudocode exercise

Simplified exercise by giving learners the function and asking them to call the function with different parameters, then use the function within a for loop.

* Link fix for Azure

* Section heading change for multiple tables

* Expanded glossary

---------

Co-authored-by: ameynert <[email protected]>
ewallace and ameynert committed Jun 12, 2023
1 parent 2b54ce8 commit 157988c
Showing 9 changed files with 150 additions and 112 deletions.
72 changes: 34 additions & 38 deletions _episodes/02-data_management.md
@@ -72,6 +72,8 @@ Source: PHD Comics. ["Four stages of data loss"](http://phdcomics.com/comics/arc
> {: .solution}
{: .challenge}

Backing up your data is essential; otherwise it is a question of when (not if) you will lose it.
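
A minimal sketch of one simple approach, a dated copy of the raw data folder, is below; both paths are invented placeholders, and managed or incremental backup services (discussed below) are often a better fit:

~~~
import shutil
from datetime import date

# Copy the raw data folder to a dated backup location.
# Both paths are hypothetical; adjust to your own storage.
source = "data/raw"
destination = f"/backup/my_project/raw-{date.today().isoformat()}"
shutil.copytree(source, destination)
~~~
{: .language-python}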

Where possible, save data as originally generated (i.e. by an
instrument or from a survey). It is tempting to overwrite raw data
files with cleaned-up versions, but faithful retention is essential
@@ -93,7 +95,7 @@ often have their own data storage solutions, so it is worthwhile to
consult with your local Information Technology (IT) group or
library. Alternatively cloud computing resources, like
Amazon Simple Storage Service (Amazon S3), Google Cloud
Storage or https://azure.microsoft.com/en-us/services/storage/ are
Storage or [Azure](https://azure.microsoft.com/en-us/services/storage/) are
reasonably priced and reliable. For large data sets, where storage
and transfer can be expensive and time-consuming, you may need to
use incremental backup or specialized storage systems, and people in
@@ -102,13 +104,14 @@ assistance on options at your university or organization as well.

## Working with sensitive data

It is important to identify whether your project will work with sensitive data - by which we might mean:
Identify whether your project will work with sensitive data - by which we might mean:

* Research data including personal data or identifiers (this might include names and addresses, or potentially identifiable genetic data or health information, or confidential information)
* Commercially sensitive data or information (this might include intellectual property, or data generated or used within a restrictive commercial research funding agreement)
* Data which may cause harm or adverse effects if released or made public (for example, data relating to rare or endangered species which could encourage poaching or fuel illegal trading)

It is important to understand the restrictions which may apply when working with sensitive data, and also ensure that your project complies with any applicable laws relating to storage, use and sharing of sensitive data (for example, laws like the General Data Protection Regulation, known as the GDPR).These laws vary between countries and may affect whether you can share information between collaborators in different countries.
It is important to understand the restrictions which may apply when working with sensitive data, and also ensure that your project complies with any applicable laws relating to storage, use and sharing of sensitive data (for example, laws like the General Data Protection Regulation, known as the GDPR).
These laws vary between countries and may affect whether you can share information between collaborators in different countries.

## Create the data you wish to see in the world

@@ -120,7 +123,7 @@ It is important to understand the restrictions which may apply when working with
filename itself, while keeping the filename regular enough for easy
pattern matching. For example, a filename like
`2016-05-alaska-b.csv` makes it easy for both people and programs to
select by year or by location.
select by year or by location. Common file naming conventions are discussed in the [Turing Way](https://the-turing-way.netlify.app/reproducible-research/rdm/rdm-storage.html) and in the [Project Organization](https://carpentries-incubator.github.io/good-enough-practices/05-project_organization) episode of this lesson.
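
For example, a short Python sketch (assuming the files live in a hypothetical `data/` folder) shows how a regular naming pattern makes selection easy:

~~~
from glob import glob

# The regular pattern YYYY-MM-location-replicate.csv lets us
# select files by year or by location with simple wildcards.
files_2016 = sorted(glob("data/2016-*.csv"))
alaska_files = sorted(glob("data/*-alaska-*.csv"))
print(files_2016, alaska_files)
~~~
{: .language-python}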

*Variable names*: Replace inscrutable variable names and artificial
data codes with self-explaining alternatives, e.g., rename variables
@@ -145,7 +148,8 @@ transformations that we recommend at the beginning of analysis:

- Create analysis-friendly data
- Record all the steps used to process data
- Anticipate the need to use multiple tables, and use a unique identifier for every record
- Record different data types in individual tables as appropriate (e.g. sample metadata may be kept separately from sequencing experiment metadata)
- Use unique identifiers for every record in a table, allowing linkages between tables (e.g. sample identifiers are recorded in the sequencing experiment metadata, as sketched below)
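
As a minimal sketch of the last two points, assuming the [pandas](https://pandas.pydata.org) library and invented example values, sample metadata and sequencing experiment metadata can live in separate tables linked by shared sample identifiers:

~~~
import pandas as pd

# One table per data type, each record with a unique identifier.
samples = pd.DataFrame({
    "sample_id": ["S001", "S002"],
    "species": ["E. coli", "E. coli"],
    "collected": ["2016-05-01", "2016-05-03"],
})
sequencing_runs = pd.DataFrame({
    "run_id": ["R01", "R02", "R03"],
    "sample_id": ["S001", "S001", "S002"],  # links each run to a sample
    "instrument": ["MiSeq", "MiSeq", "NovaSeq"],
})

# The shared sample_id column lets us combine the tables when needed.
combined = sequencing_runs.merge(samples, on="sample_id", how="left")
print(combined)
~~~
{: .language-python}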

## Create analysis-friendly data

@@ -203,7 +207,7 @@ chosen as a set of boundary coordinates.
{: .callout}


## Anticipate the need to use multiple tables, and use a unique identifier for every record
## Use multiple tables as necessary, and use a unique identifier for every record

Raw data, even if tidy,
is not necessarily complete. For example, the primary data table
Expand Down Expand Up @@ -237,19 +241,12 @@ when variables in two datasets refer to the same thing.
> {: .solution}
{: .challenge}

Your data is as much a
product of your research as the papers you write, and just as likely
to be useful to others (if not more so). Sites such as
Dryad and Zenodo allow others to find your
work, use it, and cite it; we discuss licensing in
Section [sec:collaboration] below. Follow your research community's
standards for how to provide metadata. Note that there are two types
of metadata: metadata about the dataset as a whole and metadata
about the content within the dataset. If the audience is humans,
write the metadata (the README file) for humans. If the audience
includes automatic metadata harvesters, fill out the formal metadata
and write a good README file for the humans
[[wickes2015](https://github.com/swcarpentry/good-enough-practices-in-scientific-computing/issues/3#issuecomment-157410442)].
Your data is as much a product of your research as the papers you write, and just as likely to be useful to others (if not more so).
Sites such as [Dryad](http://datadryad.org) and [Zenodo](https://zenodo.org) allow others to find your work, use it, and cite it; we discuss licensing in the episode on collaboration [04-collaboration].
Follow your research community's standards for how to provide metadata.
Note that there are two types of metadata: metadata about the dataset as a whole and metadata about the content within the dataset.
If the audience is humans, write the metadata (the README file) for humans.
If the audience includes automatic metadata harvesters, fill out the formal metadata and write a good README file for the humans [[wickes2015](https://github.com/swcarpentry/good-enough-practices-in-scientific-computing/issues/3#issuecomment-157410442)].
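
As a minimal sketch, formal metadata might be a machine-readable file kept alongside the README; the fields below loosely imitate DataCite-style metadata, and all values are invented:

~~~
import json

# A small machine-readable metadata record for harvesters,
# complementing (not replacing) the human-readable README.
metadata = {
    "title": "Alaska moss survey 2016",
    "creators": [{"name": "A. Researcher"}],
    "publicationYear": "2016",
    "resourceType": "Dataset",
    "description": "Monthly moss growth measurements from Alaskan field sites.",
}
with open("metadata.json", "w") as handle:
    json.dump(metadata, handle, indent=2)
~~~
{: .language-python}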

> ## What is a DOI?
> - A digital object identifier is a persistent identifier or handle used to identify objects uniquely.
@@ -260,12 +257,10 @@ and write a good README file for the humans

> ## Places to share Data, with DOIs
>
> - UoE DataShare (https://datashare.is.ed.ac.uk/) local open-access repository
> - UoE DataVault (https://datavault.ed.ac.uk) local long-term retention.
> - Dataverse (http://thedata.org): A repository for research data that takes care of long-term preservation and good archival practices, while researchers can share, keep control of, and get recognition for their data.
> - FigShare (http://figshare.com): A repository where users can make all of their research outputs available in a citable, shareable, and discoverable manner. Note that figshare is commercial.
> - Zenodo (http://zenodo.org): A repository service that enables researchers, scientists, projects, and institutions to share and showcase multidisciplinary research results (data and publications)
> - Dryad (http://datadryad.org): A repository that aims to make data archiving as simple and as rewarding as possible through a suite of services not necessarily provided by publishers or institutional websites.
> - Dataverse (http://thedata.org): A repository for research data that takes care of long-term preservation and good archival practices, while researchers can share, keep control of, and get recognition for their data.
>
{: .callout}

@@ -300,28 +295,29 @@ Often research institutions provide support for DMPs, e.g. through library servi

More resources on data management plans are available at [DMP online](https://dmponline.dcc.ac.uk).

> ## What's your next step in data management?
>
> - Which recommendations above are most helpful for your current project? What could you try this week?
> - Does your next project have a data management plan? Could you draft one?
>
{: .discussion}

## Summary

Taken in order, the recommendations above will produce intermediate data
files with increasing levels of cleanliness and task-specificity. An
alternative approach to data management would be to fold all data
management tasks into a monolithic procedure for data analysis, so that
intermediate data products are created "on the fly" and stored only in
memory, not saved as distinct files.
Taken in order, the recommendations above will make it easier to keep track of your data and to work with it.
Saving the raw data along with clear metadata, backed up, is your insurance policy.
Creating analysis-friendly data, and recording all the steps used to process data, means that you and others can reproduce your analysis.
Sharing your data via a DOI-issuing repository allows others to access and cite it, which they will find easier if your data are analysis-friendly, clearly named, and well-documented.

These recommendations include explicitly creating and retaining intermediate data files at each step of the analysis, with increasing levels of cleanliness and task-specificity.
Saving intermediate files makes it easy to re-run *parts* of a data analysis pipeline, which in turn makes it less onerous to revisit and improve specific data processing tasks.
Breaking a lengthy analysis workflow into modular parts makes it easier to understand, share, describe, and modify.
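
A minimal sketch of this pattern, with invented file names and assuming the pandas library:

~~~
import pandas as pd

# Step 1: clean the raw data and save an explicit intermediate file.
raw = pd.read_csv("data/raw/survey.csv")
cleaned = raw.dropna(subset=["site", "count"])
cleaned.to_csv("data/cleaned/survey_cleaned.csv", index=False)

# Step 2: summarise from the intermediate file. Because the cleaned
# data is saved to disk, this step can be re-run on its own.
cleaned = pd.read_csv("data/cleaned/survey_cleaned.csv")
site_totals = cleaned.groupby("site")["count"].sum()
site_totals.to_csv("results/site_totals.csv")
~~~
{: .language-python}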

While the latter approach may be appropriate for projects in which very
little data cleaning or processing is needed, we recommend the explicit
creation and retention of intermediate products. Saving intermediate
files makes it easy to re-run *parts* of a data analysis pipeline, which
in turn makes it less onerous to revisit and improve specific data
processing tasks. Breaking a lengthy workflow into pieces makes it
easier to understand, share, describe, and modify. This is particularly
true when working with large data sets, where storage and transfer of
the entire data set is not trivial or inexpensive.
Modifying and sharing your data analysis is only possible if you still have the raw data: **back up your data!!!**


> ## Attribution
> Content of this episode was adopted after Wilson et al.
> This episode was adapted from and includes material from Wilson et al.
> [Good Enough Practices for Scientific Computing](https://github.com/swcarpentry/good-enough-practices-in-scientific-computing).
{: .callout}

60 changes: 34 additions & 26 deletions _episodes/03-software.md
@@ -39,7 +39,7 @@ There are many different shapes and sizes of research software:

- Any code that runs in order to process your research data.
- A record of all the steps used to process your data (scripts and workflows for data analysis are software).
- R, python, MATLAB, shell, openrefine, imageJ, etc. are all scriptable. So are Excel macros.
- [R](https://en.wikipedia.org/wiki/R_(programming_language)), [Python](https://www.python.org), [MATLAB](https://www.mathworks.com/products/matlab.html), [unix shell](https://en.wikipedia.org/wiki/Unix_shell), [OpenRefine](https://openrefine.org), [ImageJ](https://imagej.net/software/imagej/), etc. are all scriptable. So are [Microsoft Excel macros](https://support.microsoft.com/en-au/office/quick-start-create-a-macro-741130ca-080d-49f5-9471-1e5fb3d581a8).
- Standalone programs or scripts that do particular research tasks are also research software.

There are extended discussions about research software at the [Software Sustainability Institute](https://www.software.ac.uk/about).
@@ -176,7 +176,7 @@ chunks. Putting code into functions also makes it easier to test and
troubleshoot when things go wrong.

[Pseudocode](https://en.wikipedia.org/wiki/Pseudocode) is a plain language description of code or analysis steps.
Writing pseudocode can be useful to think through the logic of your analysis, and how to decompose it in to functions.
Writing pseudocode can be useful to think through the logic of your analysis, and how to decompose it into functions.

The "make a cup of tea" example above might look like this:

@@ -199,35 +207,43 @@ The "make a cup of tea" example above might look like this:
return cup

> ## Decompose this pseudocode statement into functions.
> ## Using pseudocode
>
> ~~~
> coconuts = 0
> for each tree on my island
> coconuts = coconuts plus coconuts on tree
>
> cherries = 0
> for each tree on my island
> cherries = cherries plus cherries on tree
> In this scenario, you're managing fruit production on a set of islands. You have written a pseudocode function
> that counts how much fruit of a particular type is available to harvest on a given island.
>
> peaches = 0
> for each tree on Sam's island
> peaches = peaches plus peaches on tree
> ~~~
> count_fruit_on_island = function(fruit type, island)
> total fruit = 0
> for every tree of fruit type on the island
> total fruit = total fruit + number of fruit on tree
> end for loop
> return total fruit
> ~~~
> {: .source}
>
> Write the commands to call this function to count how many coconuts there are on Sam's island, how many cherries
> there are on Sam's island, and how many cherries there are on Charlie's island.
>
> Write a pseudocode for loop like the one above that uses this function to count all the cherries on every island.
>
>> ## Solution
>>
>> ~~~
>> count_fruit_on_island = function(fruit, island)
>> fruit = 0
>> for each tree on island
>> fruit = fruit + fruit of this type on tree
>> return fruit
>> sams coconuts = count_fruit_on_island(coconuts, Sam's island)
>> sams cherries = count_fruit_on_island(cherries, Sam's island)
>> charlies cherries = count_fruit_on_island(cherries, Charlie's island)
>> ~~~
>> {: .source}
>>
>> count_fruit_on_island(coconuts, my island)
>> count_fruit_on_island(cherries, my island)
>> count_fruit_on_island(peaches, Sam's island)
>> To count all the cherries on every island:
>>
>> ~~~
>> total cherries = 0
>> for every island
>> total cherries = total cherries + count_fruit_on_island(cherries, island)
>> end for loop
>> print "There are " + total cherries + " cherries on all the islands"
>> ~~~
>> {: .source}
> {: .solution}
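
Pseudocode like this translates readily into a real language. A minimal Python sketch, where representing an island as a list of per-tree fruit counts is an invented choice:

~~~
# Each island is a list of trees; each tree is a dict of fruit counts.
def count_fruit_on_island(fruit_type, island):
    total_fruit = 0
    for tree in island:
        total_fruit += tree.get(fruit_type, 0)
    return total_fruit

sams_island = [{"coconuts": 3}, {"cherries": 12}, {"coconuts": 5}]
charlies_island = [{"cherries": 7}]

total_cherries = 0
for island in [sams_island, charlies_island]:
    total_cherries += count_fruit_on_island("cherries", island)
print(f"There are {total_cherries} cherries on all the islands")
~~~
{: .language-python}
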
Expand Down Expand Up @@ -295,7 +303,7 @@ data structures in a program should *not* have one-letter names.
>> 2. input - incorrect, too vague
>> 3. **numericSequence - correct, short and includes information about the type of input**
>> 4. S - incorrect, too vague
> {: .solution}
>> {: .solution}
{: .challenge}

> ## Language style guides
@@ -349,11 +357,11 @@ Your code is like your data and also needs to be managed, backed up, and shared.
Your software is as much a product of your research as your papers,
and should be as easy for people to credit.
Submit code to a reputable DOI-issuing repository, just as you do with data.
DOIs for software are provided by Figshare and Zenodo, for example.
Zenodo integrates directly with GitHub.
DOIs for software are provided by [Figshare](https://figshare.com) and [Zenodo](http://zenodo.org), for example.
Both Figshare and Zenodo integrate directly with GitHub.
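
One lightweight way to make your code citable is a `CITATION.cff` file in the repository, a citation metadata format that GitHub displays and some archiving services can read. A minimal sketch, with invented project details:

~~~
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
title: "Alaska moss survey analysis code"
authors:
  - family-names: "Researcher"
    given-names: "A."
version: "1.0.0"
date-released: "2016-09-01"
~~~
{: .source}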

> ## Attribution
> Content of this episode was adopted after Wilson et al.
> This episode was adapted from and includes material from Wilson et al.
> [Good Enough Practices for Scientific Computing](https://github.com/swcarpentry/good-enough-practices-in-scientific-computing).
{: .callout}

18 changes: 10 additions & 8 deletions _episodes/04-collaboration.md
@@ -71,7 +71,7 @@ Future you will forget things, and your collaborators will not know them in the
An overview document can collect the most important information about your project,
and act as a signpost.
The overview is usually the first thing people read about your project, so it is often called a "README".
The README has two jobs, what is inside and how it relates to the outside.
The README has two jobs: describing the contents of the project, and explaining how to interact with the project.

Create a short file in the
project's home directory that explains the purpose of the project.
@@ -82,7 +82,7 @@ similar) should contain:
- A brief description
- Up-to-date contact information
- An example or two of how to run the most important tasks
- Broad overview of folder structure
- Overview of folder structure (see the sketch below)
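
A minimal sketch of such a README, using an invented project and invented names throughout:

~~~
# Alaska moss survey 2016

Analysis of monthly moss growth measurements from Alaskan field sites.

Contact: A. Researcher <a.researcher@example.org>

## Usage

Reproduce the full analysis with: python scripts/run_analysis.py

## Folder structure

- data/     raw and cleaned data files
- scripts/  analysis code
- results/  generated tables and figures
~~~
{: .source}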


## Describe how to contribute to the project
@@ -101,24 +104,26 @@ contribute to it:
- Tests that can be run to ensure that software has been installed correctly
- Guidelines or checklists that your project adheres to.

A `CONTRIBUTING` file like this can be very helpful in reminding you details of your project that may be forgotten over time.
This information is very helpful and will be forgotten over time unless it's documented inside the project.
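
As a minimal sketch, a `CONTRIBUTING` file for the invented project above might record:

~~~
# Contributing

## Requirements

Python 3 with pandas; install dependencies with: pip install -r requirements.txt

## Checking your installation

Run pytest from the project root; all tests should pass.

## Guidelines

- Update the README when the folder structure changes
- Add a test for each new analysis step
~~~
{: .source}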


> ## Comparing README files
> Here is a README file for a data project and one for a software project.
> What do you think is good and what can be improved about each one.
> What useful and important information is present, and what is missing?
> [Data Project README](https://github.com/ewallace/pseudonuclease_evolution_2020)
> [Software Project README](https://github.com/DualSPHysics/DualSPHysics)
>
>> ## Solution
>> A Data Project README:
>> This Data Project README:
>> - Contains a DOI
>> - Describes the purpose of the code and link to a related paper
>> - Describes the purpose of the code and links to a related paper
>> - Describes the project structure
>> - Includes a license
>> - DOES NOT contain requirements
>> - DOES NOT include a working example
>> - DOES NOT include an explicit list of authors (though they can be inferred from the paper)
>>
>> A Software Project README:
>> This Software Project README:
>> - Describes the purpose of the code
>> - Describes the requirements
>> - Includes instructions for various types of users
Expand Down Expand Up @@ -221,7 +223,7 @@ a more detailed `CITATION` file, see the one for the
- [The Turing Way Guide for Collaboration](https://the-turing-way.netlify.app/collaboration/collaboration.html)

> ## Attribution
> Content of this episode was adopted after Wilson et al.
> This episode was adapted from and includes material from Wilson et al.
> [Good Enough Practices for Scientific Computing](https://github.com/swcarpentry/good-enough-practices-in-scientific-computing).
{: .callout}

