From a104298664f3eaba67bd21259417c3af008d5254 Mon Sep 17 00:00:00 2001
From: Forest Gregg
Date: Mon, 1 Mar 2021 12:06:53 -0500
Subject: [PATCH] Adopt rMarkdown as data analysis framework (#111)

* comparisons for rmarkdown

* discussion of editor support

* recommendation of adoption

* Update rmarkdown/research/comparisons-with-existing-tools.md

Co-authored-by: Jean Cochrane

* Update rmarkdown/research/comparisons-with-existing-tools.md

Co-authored-by: Jean Cochrane

* Update rmarkdown/research/comparisons-with-existing-tools.md

Co-authored-by: Jean Cochrane

* add section recommending people start with rstudio

* added readme doc, and otherwise follow hannah's reccs

Co-authored-by: Jean Cochrane
---
 data-analysis/README.md | 30 +++++++++++
 .../comparisons-with-existing-tools.md | 54 +++++++++++++++++++
 .../research/recommendation-of-adoption.md | 35 ++++++++++++
 3 files changed, 119 insertions(+)
 create mode 100644 data-analysis/README.md
 create mode 100644 data-analysis/research/comparisons-with-existing-tools.md
 create mode 100644 data-analysis/research/recommendation-of-adoption.md

diff --git a/data-analysis/README.md b/data-analysis/README.md
new file mode 100644
index 0000000..bf54a6b
--- /dev/null
+++ b/data-analysis/README.md
@@ -0,0 +1,30 @@
+# Literate Analysis and RMarkdown
+
+This directory records best practices for writing literate analysis reports and using
+[RMarkdown](https://rmarkdown.rstudio.com/authoring_quick_tour.html) to do it.
+
+Literate analysis is a style of writing documents that includes the text and the code for analysis in one document. It is a major help in keeping your numbers and figures
+aligned with your text, consolidating your work sanely, and self-documenting
+your analysis code. See [Hannah's write-up for some more depth](https://source.opennews.org/articles/black-box-be-gone-tools-human-optimized-data-analy/).
+
+## Contents
+
+- README
+- [Research](./research/)
+  - [Comparisons with existing tools](./research/comparisons-with-existing-tools.md)
+  - [Recommendation of adoption](./research/recommendation-of-adoption.md)
+
+## When to use Literate Analysis
+
+When you have to write code to generate figures, charts, or graphics to include in
+a research report, you should write a literate analysis document.
+
+## How to use RMarkdown for Literate Analysis
+
+Look to the [Courts Transparency cookiecutter](https://github.com/datamade/cookiecutter-court-transparency) for inspiration in getting started.
+
+If this is your first project, we strongly recommend using [RStudio](https://rstudio.com/), which has fabulous support for RMarkdown.
+
+## Resources for learning
+
+* https://rmarkdown.rstudio.com/lesson-1.html
\ No newline at end of file
diff --git a/data-analysis/research/comparisons-with-existing-tools.md b/data-analysis/research/comparisons-with-existing-tools.md
new file mode 100644
index 0000000..111a841
--- /dev/null
+++ b/data-analysis/research/comparisons-with-existing-tools.md
@@ -0,0 +1,54 @@
+# Comparing RMarkdown with existing tools
+
+How does RMarkdown compare with existing tools in DataMade's stack or possible alternatives?
+
+## Pweave
+
+Like RMarkdown, [Pweave](http://mpastell.com/pweave/) is an implementation of [noweb](https://en.wikipedia.org/wiki/Noweb), but one that primarily targets Python instead of R.
+
+The main advantage of Pweave is that it is Python.
+
+While RMarkdown does allow for Python code chunks, there is typically some setup code that does need to be written in R. With Pweave, it's all Python.
+
+That is really the only advantage.
+
+Like RMarkdown, Pweave requires an additional runtime beyond standard Python. RMarkdown requires R and Pweave requires
+[IPython](https://ipython.org/).
+
+Pweave is not actively maintained, and has not been updated
+in three years.
+
+RMarkdown has better editor support than Pweave. For the following editors, RMarkdown support is as good as, and usually better
+than, support for Pweave, if any Pweave support exists at all.
+
+* [sublime](https://packagecontrol.io/packages/knitr)
+* [emacs](https://ess.r-project.org/)
+* [atom](http://www.goring.org/resources/atom_and_r.html)
+* [vscode](https://marketplace.visualstudio.com/items?itemName=Ikuyadeu.r)
+
+RMarkdown also has its own IDE, [RStudio](https://rstudio.com/).
+
+Beyond active development and editor support, Pweave is missing many features compared to RMarkdown. Of greatest consequence are (1) chunk-specific caching and (2) support for multiple languages, particularly SQL.
+
+Chunk-specific caching can dramatically reduce build times, which is critical to speed of development.
+
+Our past experience suggests that SQL will be a language we commonly use in literate reports, and first-class
+support is very nice.
+
+## Jupyter Notebook
+
+Jupyter Notebooks overlap in functionality with RMarkdown. The main difference is that Notebooks are intended to be
+an interactive exploration tool and RMarkdown is intended to be a documentation and document creation tool.
+
+I have not used Notebooks extensively, but three attributes
+make them less attractive.
+
+1. While possible, it is more difficult to generate attractive documents from Notebooks.
+2. The file format of Notebooks is JSON with embedded outputs rather than plain prose, and it does not diff cleanly on GitHub or GitLab, which makes PRs difficult to review.
+3. While possible, Notebooks are not primarily intended to
+be scripted rather than run interactively, which makes them a bit of a mismatch with our ETL philosophy.
+
+## Manual integration
+
+We can, and do, generate statistics and graphs in one tool and then copy the data or graphics into Google Docs or a Markdown file. Sometimes this is the appropriate approach, as described in
+the recommendation document.
diff --git a/data-analysis/research/recommendation-of-adoption.md b/data-analysis/research/recommendation-of-adoption.md
new file mode 100644
index 0000000..6ef5cb5
--- /dev/null
+++ b/data-analysis/research/recommendation-of-adoption.md
@@ -0,0 +1,35 @@
+# Recommendation of Adoption
+
+We recommend RMarkdown for authoring literate research reports when the following conditions apply:
+
+1. The report is for a client.
+2. The report contains graphs or statistics.
+3. We use code to generate the graphs or statistics. If we are doing a quick analysis in Excel, because that is what a client needs, then a literate research report would not be a useful approach.
+
+RMarkdown should be used even if the report seems like it will be quick and lightweight. Experience tells us that it is not easy to predict when an analysis will grow in complexity or when a client may return months later to ask about a detail in a quick analysis.
+
+## Proof of concept and pilot
+
+RMarkdown has been the tool of choice for authoring reports in the Courts project. DataMade staff familiar with Pweave have picked it up quickly, and journalists without a deep background in programming have also been able to use it successfully (within the RStudio environment).
+
+## Prerequisite Skills
+
+RMarkdown's interleaving of text and code adds another layer to interacting with code. As such, we advise that staff not be introduced to RMarkdown until they are familiar with the programming language they will be using in the report. If the report will depend on SQL code, the developer should be familiar with how to write and debug SQL code in the terminal or by writing SQL scripts.
+
+If something is not working within an RMarkdown file, it's very useful to be able to work on the code in a familiar environment in order to narrow the possibilities while debugging.
+
+Experience with the R programming language is not a prerequisite, unless that's the language that most of the analysis will be done in.
+
+## Maintenance outlook
+
+It is already DataMade's experience that literate research reports are more maintainable than alternative report authoring workflows.
+
+As for RMarkdown in particular, the long-term outlook for this tool is excellent.
+
+1. RMarkdown is maintained by RStudio, the major commercial player in R.
+2. The R community has settled on RMarkdown (and RStudio) not just as a report authoring tool, but as their notebooking tool. Any possible successor to RMarkdown will face significant pressure to be backwards compatible.
+3. RMarkdown, as a file format, is very lightweight and convertible.
+
+## Editors
+
+[RStudio](https://rstudio.com/) is an excellent IDE for RMarkdown. We recommend that people new to RMarkdown start by using RStudio.
\ No newline at end of file