INSTRUCTIONS.md

README First :)

The goal of this repository is to help you create a project that is as reproducible as you can make it. This is the template that we start with for projects in the Schloss lab. Ideally, you will post this directory to GitHub under your own lab's account for others to see when you publish your research. Even if you don't meet that ideal, using this project template will help you to make your research more reproducible. It's likely that there will be at least one person who will need to reproduce your analysis - YOU!!! If you would like to learn more about the concepts embedded in this template, you should check out the Riffomonas website, where Pat has created a Reproducible Research Tutorial Series. This is the training that everyone who joins Pat's lab starts with in their first week in the lab.

If you have any questions, please feel free to leave a comment at https://github.com/SchlossLab/new_project/issues

Guiding themes

Transparency is a huge part of reproducibility. Even if there aren't any scripts in your project, your project's organization and documentation will go a long way toward helping future you or someone else see what you've done. One step toward enabling this is to post the project as a repository on GitHub.

All co-authors should be able to see the project develop and its history. They should also be able to leave comments in the issue tracker and file pull requests to add/improve analysis and text. Once submitted, the repository will serve as an open notebook to readers that are interested in better understanding methods so that they can critique our work and hopefully use our analysis in their own work. Whether this repository is public prior to submission is left to the first author; however, once submitted, the repository will be made public.

As each project is developed, the lead author should approach the project from the perspective of a future user seeking to reproduce their work. At the basic level, the goal is for anyone to be able to clone the repository and run make write.paper to reproduce the analysis. There are many reasons why this may be hard or even impossible, but this should always be the goal. To achieve this, it is important to keep several factors in mind:

  • Organization: All analysis should be run from the project's root directory without using cd. To achieve this, relative and not absolute paths should be used.

  • Raw data should stay raw: All raw data should live in data/raw/. This includes metadata received from collaborators and fastq data taken from the sequencer. Data should be manipulated only with code, and the results written to a separate folder. Most of our sequence processing will be done using mothur, and the intermediate files should be stored in data/mothur/.

  • Create figures in scripts: The code for figures should be stored in code/ and the output should be in the size and specifications set by the target journal.

  • Software: Document all external software, links, citations, and version numbers in the README document. Also document where your Makefile expects this software to be installed.

  • All code lives in the code folder: Any batch or R scripts should be in the code/ folder. We do this to create a partition between code and data. Tests should be kept in the code/tests folder.

  • Automation: All analysis should be automated and executed using something like GNU make or Snakemake. This will make sure that all file dependencies are maintained.

  • Keep code DRY: Code should make use of functions to eliminate repetition.

  • Automate tests as much as possible: testthat is a powerful R package for performing unit tests to make sure your functions perform as expected.

  • Set the random number generator seed: Use set.seed in mothur and R to set the random number generator so that your analysis is reproducible.

  • The heavy computational lifting should be in scripts: Long computations should generate an output file that has a rule in the Makefile. The output of these computations can be used in building figures and tables and in the manuscript's Rmd file.

  • No hardcoding numbers in text: All numbers from your study that appear in the final manuscript should have been generated using R code in the Rmd file. Use the R stopifnot function to make sure you are generating the expected outputs from your Rmd file (e.g. number of significant OTUs).

  • Create tables in separate .Rmd files: The code and captions for tables should be stored in results/tables/ and should be rendered to PDFs, keeping the TeX-formatted intermediate files.

  • Manuscript documents are stored in the submission folder: Manuscripts should use BibTeX and csl formatting. They should not use EndNote. The preferred output format is a PDF. Even if a journal requests a Word-formatted file for final acceptance, they will likely accept TeX instead if asked.

  • Maintain as much of the exploratory phase as possible: There is an exploratory folder for experimentation and maintaining notebooks for situations where it is not possible to automate an analysis. Everything in the scratch directory can be deleted at any time without negative impact.

  • Keep all analysis under version control: Commit your code after any significant changes to your project. Push the code to the GitHub repository at least daily.
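
The "Organization" and "Raw data should stay raw" themes above can be sketched as a small shell function. This is only an illustration, not part of the template: the function name process_raw, the output folder data/process/, and the example manipulation (dropping blank lines) are all assumptions made for the sketch.

```shell
# Hypothetical helper (not part of the template): clean one raw file without
# modifying it. Run it from the project's root directory.
# Assumption: data/process/ is the output folder; the text above only names
# data/raw/ and data/mothur/.
process_raw() {
    raw_file="$1"            # relative path, e.g. data/raw/metadata.tsv
    out_dir="data/process"   # assumed output folder

    # Reject absolute paths so the same command works in any clone.
    case "$raw_file" in
        /*) echo "use a path relative to the project root: $raw_file" >&2
            return 1 ;;
    esac

    mkdir -p "$out_dir"
    # Example manipulation: drop blank lines. The raw file is only read.
    grep -v '^[[:space:]]*$' "$raw_file" > "$out_dir/$(basename "$raw_file")"
}
```

Called from the project root as process_raw data/raw/metadata.tsv (a hypothetical file name), this would leave the file in data/raw/ untouched and write the cleaned copy to the separate folder. In a real project the call would normally sit behind a rule in the Makefile so that the dependency is tracked.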

See Noble 2009 for a general description of, and argument for, the principal template structure and for some of the concepts and goals that guided this work.

Resources

Starting

  • Download the latest release to the directory and decompress

  • Rename the decompressed/untarred folder to the name of your project, using the format LastName_BriefDescription_Journal_Year. For example, "Kozich_MiSeqSOP_AEM_2013" will work. Move into the folder.

  • Open the README.md document in an editor. Change the first line to reflect the title of your research study, and revise the content from this section to the end. You can, but are not obligated to, keep the Acknowledgements section. You should keep the directory tree.

  • In the directory tree of the README.md file, customize the comments to fit your project. Update this tree as the project proceeds.

  • At the terminal, do the following to replace the license for the template's repository with the license for your project:

    mv newproject_LICENSE.md LICENSE.md
    
  • At the prompt type:

    git init ./
    git add .
    git commit -m "Initial commit"
    
  • Go to the SchlossLab organization on GitHub and create a new repository with the same name as your folder. Click the "Create Repository" button.

  • Follow the instructions for pushing an existing repository from the command line. For example (be sure to replace LastName_BriefDescription_Journal_Year with the appropriate text):

    git remote add origin https://github.com/SchlossLab/LastName_BriefDescription_Journal_Year.git
    git push -u origin master
    
  • You will want to update the README file as you go through your work. It is likely that you'll need to provide a link to the published paper, the title, the abstract of the study, and instructions for how to generate the paper.

  • Once you understand all of this, you can run

    git rm INSTRUCTIONS.md
    git commit -m "Remove instructions document"
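
The folder naming convention from the renaming step above (LastName_BriefDescription_Journal_Year) can be checked with a short shell function before you create the GitHub repository. This is a minimal sketch: the function name check_project_name is hypothetical, and the exact character rules in the pattern (letters in the name and journal fields, letters or digits in the description, a four-digit year) are an assumption based on the "Kozich_MiSeqSOP_AEM_2013" example.

```shell
# Hypothetical helper: succeed only if the name has four underscore-separated
# fields ending in a four-digit year. The character classes are an assumption.
check_project_name() {
    printf '%s\n' "$1" |
        grep -Eq '^[A-Za-z]+_[A-Za-z0-9]+_[A-Za-z]+_[0-9]{4}$'
}
```

From inside the project folder, something like check_project_name "$(basename "$PWD")" || echo "folder name does not match the convention" >&2 would flag a name that needs fixing.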
    

Citations

You should obviously think about where you want your work published. Various formatting files are available as csl files that will play nicely with rmarkdown, Word, and LaTeX. You can find these files at the CSL GitHub repository. As an example, we commonly use style guides for the ASM Journals. We have provided mbio.csl in the submission directory.

Acknowledgements

The initial file and directory structure of this project was developed by a group of participants in the Reproducible Science Curriculum Workshop, held at NESCent in December 2014 (rr-init repository). The structure is based on, and closely follows, the one proposed by Noble 2009, with a few small modifications. All copyright and related and neighboring rights to the original template were dedicated to the public domain worldwide under the CC0 Public Domain Dedication. The template and its derivatives are distributed without any warranty. It has been further modified by Pat Schloss to fit the needs of his research group.