From c3921dc5864bafaf347ba251e73d54e2fb770e18 Mon Sep 17 00:00:00 2001
From: mvanrongen
Date: Thu, 14 Nov 2024 07:52:44 +0000
Subject: [PATCH] 1411 tweaks

---
 .../materials/01-intro-software/execute-results/html.json | 4 ++--
 .../execute-results/html.json | 4 ++--
 materials/01-intro-software.qmd | 2 +-
 materials/02-basic-objects-and-data-types.qmd | 7 ++++++-
 4 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/_freeze/materials/01-intro-software/execute-results/html.json b/_freeze/materials/01-intro-software/execute-results/html.json
index 9b54572..cf98b17 100644
--- a/_freeze/materials/01-intro-software/execute-results/html.json
+++ b/_freeze/materials/01-intro-software/execute-results/html.json
@@ -1,8 +1,8 @@
 {
-  "hash": "8daeb2997868ac2a52f1c2660284c2c1",
+  "hash": "e6c5e0341644568c71276b206c379092",
   "result": {
     "engine": "knitr",
-    "markdown": "---\ntitle: Getting started\n---\n\n\n\n\n::: {.callout-tip}\n#### Learning objectives\n\n- Get familiar with R\n- Get to know RStudio\n:::\n\n\n## Context\n\n### What is R? \n\nR is a statistical programming language. It is very popular in the data science field, including Bioinformatics. The term \"`R`\" is used to refer to both the programming language and the software that interprets the scripts written using it.\n\n\n### Why learn R?\n\n**R does not involve lots of pointing and clicking**\n\nThe learning curve might be steeper than with other software, but with R, the\nresults of your analysis do not rely on remembering a succession of pointing\nand clicking, but instead on a series of written commands, and that's a good\nthing! 
So, if you want to redo your analysis because you collected more data,\nyou don't have to remember which button you clicked in which order to obtain\nyour results; you just have to run your script again.\n\nWorking with scripts makes the steps you used in your analysis clear, and the\ncode you write can be inspected by someone else who can give you feedback and\nspot mistakes.\n\nWorking with scripts forces you to have a deeper understanding of what you are\ndoing, and facilitates your learning and comprehension of the methods you use.\n\n**R code is great for reproducibility**\n\nReproducibility is when someone else (including your future self) can obtain the\nsame results from the same data set when using the same analysis.\n\nR integrates with other tools to generate reports from your code. If you\ncollect more data, or fix a mistake in your dataset, the figures and the\nstatistical tests in your manuscript are updated automatically after running the code again.\n\nAn increasing number of journals and funding agencies expect analyses to be\nreproducible, so knowing R will give you an edge with these requirements.\n\n\n**R is interdisciplinary and extendable**\n\nWith 10,000+ packages that can be installed to extend its capabilities, R\nprovides a framework that allows you to combine statistical approaches from many\nscientific disciplines to best suit the analytical framework you need to analyze your\ndata. For instance, R has packages for image analysis, GIS, time series, population\ngenetics, and a lot more.\n\n**R works on data of different sizes**\n\nThe skills you learn with R scale easily with the size of your dataset. Whether\nyour dataset has hundreds or millions of lines, it won't make much difference to\nyou.\n\nR is designed for data analysis. 
It comes with special data structures and data\ntypes that make handling of missing data and statistical factors convenient.\n\nR can connect to spreadsheets, databases, and many other data formats, on your\ncomputer or on the web.\n\n**R produces high-quality graphics**\n\nThe plotting functionality in R is endless, and allow you to adjust any\naspect of your graph to convey most effectively the message from your data.\n\n**R has great support**\n\nThousands of people use R daily. Many of them are willing to help you through\nmailing lists and websites such as [Stack Overflow](https://stackoverflow.com/), or on the [Posit community](https://forum.posit.co/).\n\n**R is free, open-source and cross-platform**\n\nAnyone can inspect the source code to see how R works. Because of this\ntransparency, there is less chance for mistakes, and if you (or someone else)\nfind some, you can report and fix bugs.\n\n### What is RStudio?\n[RStudio](https://posit.co) is currently a very popular Integrated Development Environment (IDE) for working with R. An IDE is an application used by software developers that facilitates programming by offering source code editing, building and debugging tools all integrated into one application. To function correctly, RStudio needs R and therefore both need to be installed on your computer.\n\nThe RStudio Desktop open-source product is free under the\n[Affero General Public License (AGPL) v3](https://www.gnu.org/licenses/agpl-3.0.en.html). [Other versions of RStudio](https://posit.co/download/rstudio-desktop/) are also available.\n\nWe will use RStudio IDE to write code, navigate the files on our computer,\ninspect the variables we are going to create, and visualize the plots we will\ngenerate. RStudio can also be used for other things (*e.g.,* version control,\ndeveloping packages, writing Shiny apps) that we will not cover during the\ncourse\n\n![RStudio interface screenshot. 
Clockwise from top left: Source, Environment/History, Files/Plots/Packages/Help/Viewer, Console.](images/rstudio-screenshot.png)\n\nRStudio is divided into 4 \"Panes\": the **Source** for your scripts and documents\n(top-left, in the default layout), your **Environment/History** (top-right),\nyour **Files/Plots/Packages/Help/Viewer** (bottom-right), and \nthe R **Console** (bottom-left). The placement of these\npanes and their content can be customized (see menu, Tools -> Global Options ->\nPane Layout). \n\nOne of the advantages of using RStudio is that all the information\nyou need to write code is available in a single window. Additionally, with many\nshortcuts, auto-completion, and highlighting for the major file types you use\nwhile developing in R, RStudio will make typing easier and less error-prone.\n\n::: {.callout-note}\nRStudio's default preferences generally work well, but saving a work space to\n`.RData` can be cumbersome, especially if you are working with larger data sets as this would save all the data that is loaded into R into the `.RData` file. \nTo turn that off, go to `Tools` --> `Global Options` and select the 'Never' option\nfor `Save workspace to .RData' on exit.`\n\n![Set 'Save workspace to .RData on exit' to 'Never'](images/rstudio-preferences.png)\n:::\n\n## Working directory\n\nMaking it easy for yourself *and* R to find all your data, it's helpful to use the concept of a **working directory**. This is a folder that R uses as a starting point where it expects to find all of your data and scripts.\n\nAll of the scripts within this folder can then use **relative paths** to files in the working directory that indicate where inside the project a file is located (as opposed to **absolute paths**, which\npoint to where a file is on a specific computer). 
Working this way makes it\na lot easier to move your project around on your computer and share it with\nothers without worrying about whether or not the underlying scripts will still work.\n\n::: {.callout-important}\n## Relative versus absolute paths\n\nRelative paths are relative to a certain location on your computer. Absolute paths start from the absolute start of your hard drive. This is easiest illustrated with an example:\n\n**Relative path**: `data/data_01.csv`\n\n**Absolute path**: `C:/Users/User1/Documents/R/data-analysis-r/data/data_01.csv`\n\n:::\n\n### Keeping it all together\n\nWhenever we are working on a project, it is good practice to keep a set of related data, analyses, and scripts contained in a single folder.\n\nUsing a consistent folder structure across your projects will help keep things\norganized, and will also make it easy to find things in the future. This\ncan be especially helpful when you have multiple projects. In general, you may\ncreate directories (folders) for **scripts**, **data**, and **documents**.\n\n - **`data/`** Use this folder to store your raw data. For the sake\n of transparency and [provenance](https://en.wikipedia.org/wiki/Provenance),\n you should *always* keep a copy of your raw data accessible and do as much\n of your data cleanup and pre-processing programmatically (*i.e.,* with scripts,\n rather than manually). Separating raw data from processed data\n is also a good idea. 
For example, you could have files\n `data/raw/survey.plot1.txt` and `data/raw/survey.plot2.txt` kept separate from\n a `data_output/survey.csv` file generated by the\n `scripts/01.preprocess.survey.R` script.\n - **`documents/`** This would be a place to keep documentation and other text documents\n - **`scripts/`** This would be the location to keep your R scripts for\n different analyses or plotting.\n\nYou may want additional directories or sub directories depending on your project\nneeds, but these should form the backbone of your working directory.\n\n![Example of a working directory structure.](images/working-directory-structure.png)\n\n### Creating a working directory\n\nBefore starting to write code in RStudio, we need to create an R Project. The idea behind an R-project is to have a space where you can keep all the files and settings associated with the project together. That way, next time you open the R Project it would be easier to resume work. An R-project basically creates a folder with a shortcut in it (ending in `.RProj`). When you double-click on the shortcut, it opens RStudio and sets the working directory to that particular folder. \n\nTo create an \"R Project\":\n\n1. Start RStudio.\n2. Under the `File` menu, click on `New Project`. Choose `New Directory`, then\n `New Project`.\n3. Enter a name for this new folder (or \"directory\"), and choose a convenient\n location for it. This will be your **working directory** for the rest of the\n day (*e.g.,* `~/data-analysis-r`).\n4. Click on `Create Project`.\n5. (Optional) Open in new session\n\nR will show you your current working directory in the `Files` pane. Alternatively, you can get it by typing in and running the `getwd()` command.\n\n::: {.callout-important}\nComplete @ex-createwd before proceeding.\n:::\n\n## Working with R\n\nThe basis of programming is that we write down instructions for the computer to\nfollow, and then we tell the computer to follow those instructions. 
We write, or\n*code*, instructions in R because it is a common language that both the computer\nand we can understand. We call the instructions *commands* and we tell the\ncomputer to follow the instructions by *executing* (also called *running*) those\ncommands.\n\n### Scripts versus console\n\nThere are two main ways of interacting with R: by using the console or by using\nscript files (plain text files that contain your code). The console pane (in\nRStudio, the bottom left panel) is the place where commands written in the R\nlanguage can be typed and executed immediately by the computer. It is also where\nthe results will be shown for commands that have been executed. You can type\ncommands directly into the console and press `Enter` to execute those commands,\nbut they will be forgotten when you close the session.\n\nBecause we want our code and workflow to be reproducible, it is better to type\nthe commands we want in the script editor, and save the script. This way, there\nis a complete record of what we did, and anyone (including our future selves!)\ncan easily replicate the results on their computer.\n\nRStudio allows you to execute commands directly from the script editor by using\nthe {{< kbd Control >}} + {{< kbd Enter >}} shortcut (on Macs, {{< kbd mac=Command >}} +\n{{< kbd mac=Return >}} will work, too). The command on the current line in the\nscript (indicated by the cursor) or all of the commands in the currently\nselected text will be sent to the console and executed when you press\n{{< kbd Control >}} + {{< kbd Enter >}}. You can find other keyboard shortcuts in this [RStudio cheatsheet about the RStudio IDE (PDF)](https://rstudio.github.io/cheatsheets/rstudio-ide.pdf).\n\n::: {.callout-warning}\n## The R prompt\n\nIf R is ready to accept commands, the R console shows a `>` prompt. 
If it\nreceives a command (by typing, copy-pasting or sent from the script editor using\n{{< kbd Control >}} + {{< kbd Enter >}}), R will try to execute it, and when\nready, will show the results and come back with a new `>` prompt to wait for new\ncommands.\n\nIf R is still waiting for you to enter more data because it isn't complete yet,\nthe console will show a `+` prompt. It means that you haven't finished entering\na complete command. This is because you have not 'closed' a parenthesis or\nquotation, i.e. you don't have the same number of left-parentheses as\nright-parentheses, or the same number of opening and closing quotation marks.\nWhen this happens, and you thought you finished typing your command, click\ninside the console window and press {{< kbd Escape >}}. This will cancel the incomplete\ncommand and return you to the `>` prompt.\n:::\n\n### Comments in code\n\nIt's always a good idea to add explanations to your code. We can do that with the hash tag `#` symbol, for example:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# This code calculates the sum of two numbers\n1 + 9\n```\n:::\n\n\n\n\n\nIt's always a good idea to add lots of comments to your code. What makes sense to you in that moment, might not a week later. Similarly, when sharing code with colleagues and collaborators, it's always good to be as clear as possible.\n\n### Splitting code\n\nAs you increase your code, your script can become quite long. Often we want to split analyses into multiple scripts, for example:\n\n* `01_preprocessing` may contain data cleaning steps\n* `02_exploration` may contain exploratory plots of your data\n* `03_analysis` could contain (statistical) analyses of your data\n* `04_figures` could contain code for figures, ready for publication\n\nEach of these files could be hundreds of lines long. So, keeping track of your code makes sense. We can do that with **code headings**, which use the `# heading ----` syntax. 
You can even add different heading levels, by increasing the number of `#` at the start.\n\nThis creates a little table of contents in the bottom-left corner of the script pane:\n\n![Code headings](images/rstudio-codeheadings.png)\n\n## Running code {#running-code}\n\nThe simplest way of using a programming language is to use it interactively. We can do this by typing directly into the console / terminal.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nFor example, you can use R as a glorified calculator:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n3 + 5\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 8\n```\n\n\n:::\n\n```{.r .cell-code}\n12 / 7\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 1.714286\n```\n\n\n:::\n:::\n\n\n\n\n## Python\n\nFor example, you can use Python as a glorified calculator:\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\n3 + 5\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n8\n```\n\n\n:::\n\n```{.python .cell-code}\n12 / 7\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n1.7142857142857142\n```\n\n\n:::\n:::\n\n\n\n\n:::\n\nRunning code like this directly in the console is generally not a good idea, because then we can't keep track of what we are doing. So, we first need to create a script to save our code in. Then, we can then play around.\n\n::: {.callout-important}\n## Complete before proceeding\nPlease complete @ex-createscript and @ex-runningcode.\n:::\n\n## Functions and their arguments\n\nFunctions are \"canned scripts\" that automate more complicated sets of commands\nincluding operations assignments, etc. Many functions are predefined, or can be\nmade available by importing *packages* (more on that later). A function\nusually takes one or more inputs called *arguments*. Functions often (but not\nalways) return a *value*. A typical example would be the function `sqrt()`. 
The\ninput (the argument) must be a number, and the return value (in fact, the\noutput) is the square root of that number.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsqrt(9)\n```\n:::\n\n\n\n\n## Python\n\nThe `sqrt()` function is not available by default, but is stored in the `math` module. Before we can use it, we need to load this module:\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport math\n```\n:::\n\n\n\n\nNext, we can use the `sqrt()` function, specifying that it comes from the `math`module. We separate the two with a full-stop (`.`):\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nmath.sqrt(9)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n3.0\n```\n\n\n:::\n:::\n\n\n\n\n:::\n\nHere, the value `9` is given to the `sqrt()` function. This function\ncalculates the square root, and returns the value. This function is very simple, because it takes just one argument.\n\nThe return 'value' of a function need not be numerical (like that of `sqrt()`),\nand it also does not need to be a single item: it can be a set of things, or\neven a data set. We'll see that when we read data files.\n\n\n### Arguments\n\nArguments allow you to control the behaviour of a function. They can be anything, not only numbers or file names. Exactly what each argument means differs per function and can be looked up in the documentation. Some functions take arguments which may either be specified by the user, or, if left out, take on a *default* value: these are called *options*.\n\nOptions are typically used to alter the way the\nfunction operates, such as if it should ignore missing values, or what symbol to\nuse in a plot. 
However, if you want something specific, you can specify a value\nof your choice which will be used instead of the default.\n\nLet's try a function that can take multiple arguments: `round()`.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nround(3.14159)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 3\n```\n\n\n:::\n:::\n\n\n\n\n## Python\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nround(3.14159)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n3\n```\n\n\n:::\n:::\n\n\n\n\n:::\n\nHere, we've called `round()` with just one argument, `3.14159`, and it has\nreturned the value `3`. That's because the default is to round to the nearest\nwhole number. If we want more digits we can see how to do that by getting\ninformation about the `round()` function. \n\n::: {.panel-tabset group=\"language\"}\n## R\n\nWe can use `args(round)` to find what arguments it takes, or look at the help for this function using `?round`.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nargs(round)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nfunction (x, digits = 0, ...) \nNULL\n```\n\n\n:::\n:::\n\n\n\n\nWe see that if we want a different number of digits, we can\ntype `digits = 2` or however many we want. For example:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nround(x = 3.14159, digits = 2)\n```\n:::\n\n\n\n\nIf you provide the arguments in the exact same order as they are defined you\ndon't have to name them:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nround(3.14159, 2)\n```\n:::\n\n\n\n\nAnd if you do name the arguments, you can switch their order:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nround(digits = 2, x = 3.14159)\n```\n:::\n\n\n\n\n## Python\nWe can use `help(round)` to find what arguments it takes.\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nhelp(round)\n```\n:::\n\n\n\n\nWe see that if we want a different number of digits, we can\ntype `ndigits = 2` or however many we want. 
For example:\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nround(3.14159, ndigits = 2)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n3.14\n```\n\n\n:::\n:::\n\n\n\n\nIf you provide the arguments in the exact same order as they are defined you\ndon't have to name them:\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nround(3.14159, 2)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n3.14\n```\n\n\n:::\n:::\n\n\n\n\nPython still expects the arguments in the correct order, so this gives an error:\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nround(ndigits = 2, 3.14159)\n```\n:::\n\n\n\n:::\n\nIt's good practice be explicit about the names of the arguments. That way you can avoid confusion later on when looking back at your code or when sharing your code.\n\n\n## Adding functionality using packages\nLO: adding functionality (installing + loading packages)\nLO: For Python: requires `numpy` for next section\n\n\n## Exercises\n\n:::{.callout-exercise #ex-createwd}\n#### Creating a working directory\n\n\n\n{{< level 1 >}}\n\n\n\n\n\n\nCreate a working directory called `data-analysis`. When you've done this, add the following sub folders:\n\n* `data`\n* `scripts`\n* `images`\n\n**Note**: programming languages are case-sensitive, so `data` is not treated the same way as `data`.\n:::\n\n:::{.callout-exercise #ex-createscript}\n#### Creating a script\n\n\n\n{{< level 1 >}}\n\n\n\n\n\n\nCreate a script and save it as `session_01` in the `scripts` folder within your working directory.\n\n:::{.callout-hint}\nRemember, you will need to add an extension to the file. 
This is `.R` for R scripts or `.py` for Python ones.\n:::\n:::\n\n:::{.callout-exercise #ex-runningcode}\n#### Running code\n\n\n\n{{< level 1 >}}\n\n\n\n\n\n\nIn your new script `session_01`, run some mathematical operations, such as:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n8 * 4\n6 - 9\n```\n:::\n\n\n\n\n:::{.callout-hint}\nRemember, you run the code using Ctrl + Enter (or Command + Enter on Mac).\n:::\n\n:::\n\n## Summary\n\n::: {.callout-tip}\n#### Key points\n\n- \n:::\n",
+    "markdown": "---\ntitle: Getting started\n---\n\n\n\n\n::: {.callout-tip}\n#### Learning objectives\n\n- Get familiar with R\n- Get to know RStudio\n:::\n\n\n## Context\n\n### What is R? \n\nR is a statistical programming language. It is very popular in the data science field, including Bioinformatics. The term \"`R`\" is used to refer to both the programming language and the software that interprets the scripts written using it.\n\n\n### Why learn R?\n\n**R does not involve lots of pointing and clicking**\n\nThe learning curve might be steeper than with other software, but with R, the\nresults of your analysis do not rely on remembering a succession of pointing\nand clicking, but instead on a series of written commands, and that's a good\nthing! 
So, if you want to redo your analysis because you collected more data,\nyou don't have to remember which button you clicked in which order to obtain\nyour results; you just have to run your script again.\n\nWorking with scripts makes the steps you used in your analysis clear, and the\ncode you write can be inspected by someone else who can give you feedback and\nspot mistakes.\n\nWorking with scripts forces you to have a deeper understanding of what you are\ndoing, and facilitates your learning and comprehension of the methods you use.\n\n**R code is great for reproducibility**\n\nReproducibility is when someone else (including your future self) can obtain the\nsame results from the same data set when using the same analysis.\n\nR integrates with other tools to generate reports from your code. If you\ncollect more data, or fix a mistake in your dataset, the figures and the\nstatistical tests in your manuscript are updated automatically after running the code again.\n\nAn increasing number of journals and funding agencies expect analyses to be\nreproducible, so knowing R will give you an edge with these requirements.\n\n\n**R is interdisciplinary and extendable**\n\nWith 10,000+ packages that can be installed to extend its capabilities, R\nprovides a framework that allows you to combine statistical approaches from many\nscientific disciplines to best suit the analytical framework you need to analyze your\ndata. For instance, R has packages for image analysis, GIS, time series, population\ngenetics, and a lot more.\n\n**R works on data of different sizes**\n\nThe skills you learn with R scale easily with the size of your dataset. Whether\nyour dataset has hundreds or millions of lines, it won't make much difference to\nyou.\n\nR is designed for data analysis. 
It comes with special data structures and data\ntypes that make handling of missing data and statistical factors convenient.\n\nR can connect to spreadsheets, databases, and many other data formats, on your\ncomputer or on the web.\n\n**R produces high-quality graphics**\n\nThe plotting functionality in R is endless, and allow you to adjust any\naspect of your graph to convey most effectively the message from your data.\n\n**R has great support**\n\nThousands of people use R daily. Many of them are willing to help you through\nmailing lists and websites such as [Stack Overflow](https://stackoverflow.com/), or on the [Posit community](https://forum.posit.co/).\n\n**R is free, open-source and cross-platform**\n\nAnyone can inspect the source code to see how R works. Because of this\ntransparency, there is less chance for mistakes, and if you (or someone else)\nfind some, you can report and fix bugs.\n\n### What is RStudio?\n[RStudio](https://posit.co) is currently a very popular Integrated Development Environment (IDE) for working with R. An IDE is an application used by software developers that facilitates programming by offering source code editing, building and debugging tools all integrated into one application. To function correctly, RStudio needs R and therefore both need to be installed on your computer.\n\nThe RStudio Desktop open-source product is free under the\n[Affero General Public License (AGPL) v3](https://www.gnu.org/licenses/agpl-3.0.en.html). [Other versions of RStudio](https://posit.co/download/rstudio-desktop/) are also available.\n\nWe will use RStudio IDE to write code, navigate the files on our computer,\ninspect the variables we are going to create, and visualize the plots we will\ngenerate. RStudio can also be used for other things (*e.g.,* version control,\ndeveloping packages, writing Shiny apps) that we will not cover during the\ncourse\n\n![RStudio interface screenshot. 
Clockwise from top left: Source, Environment/History, Files/Plots/Packages/Help/Viewer, Console.](images/rstudio-screenshot.png)\n\nRStudio is divided into 4 \"Panes\": the **Source** for your scripts and documents\n(top-left, in the default layout), your **Environment/History** (top-right),\nyour **Files/Plots/Packages/Help/Viewer** (bottom-right), and \nthe R **Console** (bottom-left). The placement of these\npanes and their content can be customized (see menu, Tools -> Global Options ->\nPane Layout). \n\nOne of the advantages of using RStudio is that all the information\nyou need to write code is available in a single window. Additionally, with many\nshortcuts, auto-completion, and highlighting for the major file types you use\nwhile developing in R, RStudio will make typing easier and less error-prone.\n\n::: {.callout-note}\nRStudio's default preferences generally work well, but saving a work space to\n`.RData` can be cumbersome, especially if you are working with larger data sets as this would save all the data that is loaded into R into the `.RData` file. \nTo turn that off, go to `Tools` --> `Global Options` and select the 'Never' option\nfor `Save workspace to .RData' on exit.`\n\n![Set 'Save workspace to .RData on exit' to 'Never'](images/rstudio-preferences.png)\n:::\n\n## Working directory\n\nMaking it easy for yourself *and* R to find all your data, it's helpful to use the concept of a **working directory**. This is a folder that R uses as a starting point where it expects to find all of your data and scripts.\n\nAll of the scripts within this folder can then use **relative paths** to files in the working directory that indicate where inside the project a file is located (as opposed to **absolute paths**, which\npoint to where a file is on a specific computer). 
Working this way makes it\na lot easier to move your project around on your computer and share it with\nothers without worrying about whether or not the underlying scripts will still work.\n\n::: {.callout-important}\n## Relative versus absolute paths\n\nRelative paths are relative to a certain location on your computer. Absolute paths start from the absolute start of your hard drive. This is easiest illustrated with an example:\n\n**Relative path**: `data/data_01.csv`\n\n**Absolute path**: `C:/Users/User1/Documents/R/data-analysis-r/data/data_01.csv`\n\n:::\n\n### Keeping it all together\n\nWhenever we are working on a project, it is good practice to keep a set of related data, analyses, and scripts contained in a single folder.\n\nUsing a consistent folder structure across your projects will help keep things\norganized, and will also make it easy to find things in the future. This\ncan be especially helpful when you have multiple projects. In general, you may\ncreate directories (folders) for **scripts**, **data**, and **documents**.\n\n - **`data/`** Use this folder to store your raw data. For the sake\n of transparency and [provenance](https://en.wikipedia.org/wiki/Provenance),\n you should *always* keep a copy of your raw data accessible and do as much\n of your data cleanup and pre-processing programmatically (*i.e.,* with scripts,\n rather than manually). Separating raw data from processed data\n is also a good idea. 
For example, you could have files\n `data/raw/survey.plot1.txt` and `data/raw/survey.plot2.txt` kept separate from\n a `data_output/survey.csv` file generated by the\n `scripts/01.preprocess.survey.R` script.\n - **`documents/`** This would be a place to keep documentation and other text documents\n - **`scripts/`** This would be the location to keep your R scripts for\n different analyses or plotting.\n\nYou may want additional directories or sub directories depending on your project\nneeds, but these should form the backbone of your working directory.\n\n![Example of a working directory structure.](images/working-directory-structure.png)\n\n### Creating a working directory\n\nBefore starting to write code in RStudio, we need to create an R Project. The idea behind an R-project is to have a space where you can keep all the files and settings associated with the project together. That way, next time you open the R Project it would be easier to resume work. An R-project basically creates a folder with a shortcut in it (ending in `.RProj`). When you double-click on the shortcut, it opens RStudio and sets the working directory to that particular folder. \n\nTo create an \"R Project\":\n\n1. Start RStudio.\n2. Under the `File` menu, click on `New Project`. Choose `New Directory`, then\n `New Project`.\n3. Enter a name for this new folder (or \"directory\"), and choose a convenient\n location for it. This will be your **working directory** for the rest of the\n day (*e.g.,* `~/data-analysis-r`).\n4. Click on `Create Project`.\n5. (Optional) Open in new session\n\nR will show you your current working directory in the `Files` pane. Alternatively, you can get it by typing in and running the `getwd()` command.\n\n::: {.callout-important}\nComplete @ex-createwd before proceeding.\n:::\n\n## Working with R\n\nThe basis of programming is that we write down instructions for the computer to\nfollow, and then we tell the computer to follow those instructions. 
We write, or\n*code*, instructions in R because it is a common language that both the computer\nand we can understand. We call the instructions *commands* and we tell the\ncomputer to follow the instructions by *executing* (also called *running*) those\ncommands.\n\n### Scripts versus console\n\nThere are two main ways of interacting with R: by using the console or by using\nscript files (plain text files that contain your code). The console pane (in\nRStudio, the bottom left panel) is the place where commands written in the R\nlanguage can be typed and executed immediately by the computer. It is also where\nthe results will be shown for commands that have been executed. You can type\ncommands directly into the console and press `Enter` to execute those commands,\nbut they will be forgotten when you close the session.\n\nBecause we want our code and workflow to be reproducible, it is better to type\nthe commands we want in the script editor, and save the script. This way, there\nis a complete record of what we did, and anyone (including our future selves!)\ncan easily replicate the results on their computer.\n\nRStudio allows you to execute commands directly from the script editor by using\nthe {{< kbd Control >}} + {{< kbd Enter >}} shortcut (on Macs, {{< kbd mac=Command >}} +\n{{< kbd mac=Return >}} will work, too). The command on the current line in the\nscript (indicated by the cursor) or all of the commands in the currently\nselected text will be sent to the console and executed when you press\n{{< kbd Control >}} + {{< kbd Enter >}}. You can find other keyboard shortcuts in this [RStudio cheatsheet about the RStudio IDE (PDF)](https://rstudio.github.io/cheatsheets/rstudio-ide.pdf).\n\n::: {.callout-warning}\n## The R prompt\n\nIf R is ready to accept commands, the R console shows a `>` prompt. 
If it\nreceives a command (by typing, copy-pasting or sent from the script editor using\n{{< kbd Control >}} + {{< kbd Enter >}}), R will try to execute it, and when\nready, will show the results and come back with a new `>` prompt to wait for new\ncommands.\n\nIf R is still waiting for you to enter more data because it isn't complete yet,\nthe console will show a `+` prompt. It means that you haven't finished entering\na complete command. This is because you have not 'closed' a parenthesis or\nquotation, i.e. you don't have the same number of left-parentheses as\nright-parentheses, or the same number of opening and closing quotation marks.\nWhen this happens, and you thought you finished typing your command, click\ninside the console window and press {{< kbd Escape >}}. This will cancel the incomplete\ncommand and return you to the `>` prompt.\n:::\n\n### Comments in code\n\nIt's always a good idea to add explanations to your code. We can do that with the hash tag `#` symbol, for example:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# This code calculates the sum of two numbers\n1 + 9\n```\n:::\n\n\n\n\n\nIt's always a good idea to add lots of comments to your code. What makes sense to you in that moment, might not a week later. Similarly, when sharing code with colleagues and collaborators, it's always good to be as clear as possible.\n\n### Splitting code\n\nAs you increase your code, your script can become quite long. Often we want to split analyses into multiple scripts, for example:\n\n* `01_preprocessing` may contain data cleaning steps\n* `02_exploration` may contain exploratory plots of your data\n* `03_analysis` could contain (statistical) analyses of your data\n* `04_figures` could contain code for figures, ready for publication\n\nEach of these files could be hundreds of lines long. So, keeping track of your code makes sense. We can do that with **code headings**, which use the `# heading ----` syntax. 
You can even add different heading levels, by increasing the number of `#` at the start.\n\nThis creates a little table of contents in the bottom-left corner of the script pane:\n\n![Code headings](images/rstudio-codeheadings.png)\n\n## Running code {#running-code}\n\nThe simplest way of using a programming language is to use it interactively. We can do this by typing directly into the console / terminal.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nFor example, you can use R as a glorified calculator:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n3 + 5\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 8\n```\n\n\n:::\n\n```{.r .cell-code}\n12 / 7\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 1.714286\n```\n\n\n:::\n:::\n\n\n\n\n## Python\n\nFor example, you can use Python as a glorified calculator:\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\n3 + 5\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n8\n```\n\n\n:::\n\n```{.python .cell-code}\n12 / 7\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n1.7142857142857142\n```\n\n\n:::\n:::\n\n\n\n\n:::\n\nRunning code like this directly in the console is generally not a good idea, because then we can't keep track of what we are doing. So, we first need to create a script to save our code in. Then we can play around.\n\n::: {.callout-important}\n## Complete before proceeding\nPlease complete @ex-createscript and @ex-runningcode.\n:::\n\n## Functions and their arguments\n\nFunctions are \"canned scripts\" that automate more complicated sets of commands\nincluding operations, assignments, etc. Many functions are predefined, or can be\nmade available by importing *packages* (more on that later). A function\nusually takes one or more inputs called *arguments*. Functions often (but not\nalways) return a *value*. A typical example would be the function `sqrt()`. 
The\ninput (the argument) must be a number, and the return value (in fact, the\noutput) is the square root of that number.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsqrt(9)\n```\n:::\n\n\n\n\n## Python\n\nThe `sqrt()` function is not available by default, but is stored in the `math` module. Before we can use it, we need to load this module:\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport math\n```\n:::\n\n\n\n\nNext, we can use the `sqrt()` function, specifying that it comes from the `math` module. We separate the two with a full-stop (`.`):\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nmath.sqrt(9)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n3.0\n```\n\n\n:::\n:::\n\n\n\n\n:::\n\nHere, the value `9` is given to the `sqrt()` function. This function\ncalculates the square root, and returns the value. This function is very simple, because it takes just one argument.\n\nThe return 'value' of a function need not be numerical (like that of `sqrt()`),\nand it also does not need to be a single item: it can be a set of things, or\neven a data set. We'll see that when we read data files.\n\n\n### Arguments\n\nArguments allow you to control the behaviour of a function. They can be anything, not only numbers or file names. Exactly what each argument means differs per function and can be looked up in the documentation. Some functions take arguments which may either be specified by the user, or, if left out, take on a *default* value: these are called *options*.\n\nOptions are typically used to alter the way the\nfunction operates, such as if it should ignore missing values, or what symbol to\nuse in a plot. 
However, if you want something specific, you can specify a value\nof your choice which will be used instead of the default.\n\nLet's try a function that can take multiple arguments: `round()`.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nround(3.14159)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 3\n```\n\n\n:::\n:::\n\n\n\n\n## Python\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nround(3.14159)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n3\n```\n\n\n:::\n:::\n\n\n\n\n:::\n\nHere, we've called `round()` with just one argument, `3.14159`, and it has\nreturned the value `3`. That's because the default is to round to the nearest\nwhole number. If we want more digits we can see how to do that by getting\ninformation about the `round()` function. \n\n::: {.panel-tabset group=\"language\"}\n## R\n\nWe can use `args(round)` to find what arguments it takes, or look at the help for this function using `?round`.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nargs(round)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nfunction (x, digits = 0, ...) \nNULL\n```\n\n\n:::\n:::\n\n\n\n\nWe see that if we want a different number of digits, we can\ntype `digits = 2` or however many we want. For example:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nround(x = 3.14159, digits = 2)\n```\n:::\n\n\n\n\nIf you provide the arguments in the exact same order as they are defined you\ndon't have to name them:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nround(3.14159, 2)\n```\n:::\n\n\n\n\nAnd if you do name the arguments, you can switch their order:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nround(digits = 2, x = 3.14159)\n```\n:::\n\n\n\n\n## Python\nWe can use `help(round)` to find what arguments it takes.\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nhelp(round)\n```\n:::\n\n\n\n\nWe see that if we want a different number of digits, we can\ntype `ndigits = 2` or however many we want. 
For example:\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nround(3.14159, ndigits = 2)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n3.14\n```\n\n\n:::\n:::\n\n\n\n\nIf you provide the arguments in the exact same order as they are defined you\ndon't have to name them:\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nround(3.14159, 2)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n3.14\n```\n\n\n:::\n:::\n\n\n\n\nPython does not allow an unnamed (positional) argument to come after a named one, so this gives an error:\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nround(ndigits = 2, 3.14159)\n```\n:::\n\n\n\n:::\n\nIt's good practice to be explicit about the names of the arguments. That way you can avoid confusion later on when looking back at your code or when sharing your code.\n\n\n## Adding functionality using packages\nLO: adding functionality (installing + loading packages)\nLO: For Python: requires `numpy` for next section\n\n\n## Exercises\n\n:::{.callout-exercise #ex-createwd}\n#### Creating a working directory\n\n\n\n{{< level 1 >}}\n\n\n\n\n\n\nCreate a working directory called `data-analysis`. When you've done this, add the following sub folders:\n\n* `data`\n* `scripts`\n* `images`\n\n**Note**: programming languages are case-sensitive, so `data` is not treated the same way as `Data`.\n:::\n\n:::{.callout-exercise #ex-createscript}\n#### Creating a script\n\n\n\n{{< level 1 >}}\n\n\n\n\n\n\nCreate a script and save it as `session_01` in the `scripts` folder within your working directory.\n\n:::{.callout-hint}\nRemember, you will need to add an extension to the file. 
This is `.R` for R scripts or `.py` for Python ones.\n:::\n:::\n\n:::{.callout-exercise #ex-runningcode}\n#### Running code\n\n\n\n{{< level 1 >}}\n\n\n\n\n\n\nIn your new script `session_01`, run some mathematical operations, such as:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n8 * 4\n6 - 9\n```\n:::\n\n\n\n\n:::{.callout-hint}\nRemember, you run the code using Ctrl + Enter (or Command + Enter on Mac).\n:::\n\n:::\n\n## Summary\n\n::: {.callout-tip}\n#### Key points\n\n- \n:::\n", "supporting": [ "01-intro-software_files" ], diff --git a/_freeze/materials/02-basic-objects-and-data-types/execute-results/html.json b/_freeze/materials/02-basic-objects-and-data-types/execute-results/html.json index db51fc6..7254ae1 100644 --- a/_freeze/materials/02-basic-objects-and-data-types/execute-results/html.json +++ b/_freeze/materials/02-basic-objects-and-data-types/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "aafad77a4f42833e6521f2cde7b37694", + "hash": "5d0ab2af681a0e7b1e4d5a9e499c771f", "result": { "engine": "knitr", - "markdown": "---\ntitle: Data types & structures\n---\n\n\n\n\n::: {.callout-tip}\n#### Learning objectives\n\n- \n:::\n\n\n## Context\n\nWe’ve seen examples where we entered data directly into a function. Most of the time we have data from elsewhere, such as a spreadsheet. In the previous section we created single objects. We’ll build up from this and introduce vectors and tabular data. We'll also briefly mention other data types, such as matrices, arrays.\n\n## Explained: Data types & structures\n\n### Data types\n\nProgramming languages are able to deal with different data types - and they need to. For example, it makes little sense to perform mathematical operations on text! 
To ensure that your data is viewed in the appropriate way, you need to be aware of some of the different **data types**.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nR has the following main data types:\n\n| Data type | Description|\n|-----------|--------------------------------------------------------------|\n| numeric | Represents numbers; can be whole (integers) or decimals \\\n(e.g., `19`or `2.73`).|\n| integer | Specific type of numeric data; can only be an integer \\\n(e.g., `7L` where `L` indicates an integer). |\n| character | Also called *text* or *string* \\\n(e.g., `\"Rabbits are great!\"`).|\n| logical | Also called *boolean values*; takes either `TRUE` or `FALSE`.|\n| factor | A type of categorical data that can have inherent ordering \\\n(e.g., `low`, `medium`, `high`).|\n\n\n## Python\n\nPython has the following main data types:\n\n| Data type | Description|\n|-----------|--------------------------------------------------------------|\n| int | Specific type of numeric data; can only be an integer \\\n(e.g., `7` or `56`).|\n| float | Decimal numbers \\\n(e.g., `3.92` or `9.824`).|\n| str | *Text* or *string* data \\\n(e.g., `\"Rabbits are great!\"`).|\n| bool | *Logical* or *boolean* values; takes either `True` or `False`.|\n\n:::\n\n### Data structures\n\nIn the section on [running code](#running-code) we saw how we can run code interactively. However, we frequently need to save values so we can work with them. We've just seen that we can have different *types* of data. We can save these into different *data structures*. Which data structure you need is often determined by the type of data and the complexity.\n\nIn the following sections we look at simple data structures.\n\n## Objects\n\nWe can store values into *objects*. To do this, we *assign* values to them. 
An object acts as a container for that value.\n\nTo create an object, we need to give it a name followed by the\nassignment operator and the value we want to give it, for example:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntemperature <- 23\n```\n:::\n\n\n\n\nWe can read the code as: the value `23` is assigned (`<-`) to the object `temperature`. Note that when you run this line of code the object you just created appears on your environment tab (top-right panel).\n\nWhen assigning a value to an object, R does not print anything on the console. You can print the value by typing the object name on the console or within your script and running that line of code.\n\n## Python\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\ntemperature = 23\n```\n:::\n\n\n\n\nWe can read the code as: the value `23` is assigned (`=`) to the object `temperature`.\n\nWhen assigning a value to an object, Python does not print anything on the console. You can print the value by typing the object name on the console or within your script and running that line of code.\n\n:::\n\n::: {.callout-important}\n## The assignment operator\n\nWe use an assignment operator to assign values on the right to objects on the left.\n\n::: {.panel-tabset group=\"language\"}\n## R\nIn R we use `<-` as the assignment operator.\n\nIn RStudio, typing Alt + - (push Alt at the same time as the - key) will write ` <- ` in a single keystroke on a PC, while typing Option + - (push Option at the same time as the - key) does the same on a Mac.

\n\n## Python\nIn Python we use `=` as the assignment operator.

\n\n:::\n\\\n:::\n\nObjects can be given almost any name such as `x`, `current_temperature`, or\n`subject_id`. You want the object names to be explicit and short. There are some exceptions / considerations (see below).\n\n::: {.callout-warning}\n## Restrictions on object names\n\nObject names can contain letters, numbers, underscores and periods. They *cannot start with a number nor contain spaces*. Different people use different conventions for long variable names, two common ones being:\n\nUnderscore: my_long_named_object\n\nCamel case: myLongNamedObject\n\nWhat you use is up to you, but be consistent. Programming languages are **case-sensitive** so `temperature` is different from `Temperature.`\n\n* Some names are reserved words or keywords, because they are the names of fundamental functions (e.g., `if`, `else`, `for`, see [R](https://stat.ethz.ch/R-manual/R-devel/library/base/html/Reserved.html) or [Python](https://docs.python.org/3/reference/lexical_analysis.html#keywords) for a complete list).\n* Avoid using function names (e.g., `c`, `T`, `mean`, `data`, `df`, `weights`), even if allowed. If in doubt, check the help to see if the name is already in use.\n* Avoid full-stops (`.`) within an object name as in `my.data`. Full-stops often have meaning in programming languages, so it's best to avoid them.\n* Use consistent styling. In R, popular style guides are:\n * [R's tidyverse's](http://style.tidyverse.org/).\n * [Google's](https://google.github.io/styleguide/Rguide.xml)\n\n**Whatever style you use, be consistent!**\n:::\n\n### Using objects\n\nNow that we have the `temperature` in memory, we can use it to perform operations. 
For example, this might be the temperature in Celsius and we might want to convert it to Kelvin.\n\nTo do this, we need to add `273.15`:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntemperature + 273.15\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 296.15\n```\n\n\n:::\n:::\n\n\n\n\n## Python\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\ntemperature + 273.15\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n296.15\n```\n\n\n:::\n:::\n\n\n\n:::\n\nWe can change an object's value by assigning a new one:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntemperature <- 36\ntemperature + 273.15\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 309.15\n```\n\n\n:::\n:::\n\n\n\n\n## Python\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\ntemperature = 36\ntemperature + 273.15\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n309.15\n```\n\n\n:::\n:::\n\n\n\n:::\n\nFinally, assigning a value to one object does not change the values of other objects. For example, let’s store the outcome in Kelvin into a new object `temp_K`:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntemp_K <- temperature + 273.15\n```\n:::\n\n\n\n\n## Python\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\ntemp_K = temperature + 273.15\n```\n:::\n\n\n\n:::\n\nChanging the value of `temperature` does not change the value of `temp_K`.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntemperature <- 14\ntemp_K\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 309.15\n```\n\n\n:::\n:::\n\n\n\n\n## Python\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\ntemperature = 14\ntemp_K\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n309.15\n```\n\n\n:::\n:::\n\n\n\n:::\n\n## Collections of data\n\nIn the examples above we have stored single values into an object. 
Of course we often have to deal with more than that. Generally speaking, we can create **collections** of data. This enables us to organise our data, for example by creating a collection of numbers or text values.\n\n### Creating collections\n\nCreating a collection of data is pretty straightforward, particularly if you are manually doing it.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nThe simplest collection of data in R is called a **vector**. This really is the workhorse of R.\n\nA vector is composed of a series of values, which can be numbers, text or any of the data types described.\n\nWe can assign a series of values to a vector using the `c()` function. For example, we can create a vector of temperatures and assign it to a new object `temp_c`:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntemp_c <- c(23, 24, 31, 27, 18, 21)\n\ntemp_c\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 23 24 31 27 18 21\n```\n\n\n:::\n:::\n\n\n\n\nA vector can also contain text. For example, let's create a vector that contains weather descriptions:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nweather <- c(\"sunny\", \"cloudy\", \"partial_cloud\", \"cloudy\", \"sunny\", \"rainy\")\n\nweather\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"sunny\" \"cloudy\" \"partial_cloud\" \"cloudy\" \n[5] \"sunny\" \"rainy\" \n```\n\n\n:::\n:::\n\n\n\n\n## Python\n\nThe simplest collection of data in Python is either a **list** or a **tuple**. Both can hold items of the same or different types. Whereas a tuple *cannot* be changed after it's created, a *list* can.\n\nWe can assign a collection of numbers to a list:\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\ntemp_c = [23, 24, 31, 27, 18, 21]\n\ntemp_c\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[23, 24, 31, 27, 18, 21]\n```\n\n\n:::\n:::\n\n\n\n\n\nA list can also contain text. 
For example, let's create a list that contains weather descriptions:\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nweather = [\"sunny\", \"cloudy\", \"partial_cloud\", \"cloudy\", \"sunny\", \"rainy\"]\n\nweather\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n['sunny', 'cloudy', 'partial_cloud', 'cloudy', 'sunny', 'rainy']\n```\n\n\n:::\n:::\n\n\n\n\nWe can also create a *tuple*. Remember, this is like a list, but it cannot be altered after creating it. Note the difference in the type of brackets, where we use `( )` round brackets instead of `[ ]` square brackets:\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\ntemp_c_tuple = (23, 24, 31, 27, 18, 21)\n```\n:::\n\n\n\n\n:::\n\nNote that when we define text (e.g. `\"cloudy\"` or `\"sunny\"`), we need to use quotes.\n\nWhen we deal with numbers - whole or decimal (e.g. `23`, `18.5`) - we do not use quotes.\n\n\n::: {.callout-important}\n## Having a type\n\nDifferent data types result in slightly different types of objects. It can be quite useful to check how your data is viewed by the computer.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nWe can use the `class()` function to find out how R views our data. This function also works for more complex data structures.\n\nLet's do this for our examples:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nclass(temp_c)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"numeric\"\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nclass(weather)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"character\"\n```\n\n\n:::\n:::\n\n\n\n\n## Python\n\nWe can use the `type()` function to find out how Python views our data. 
This function also works for more complex data structures.\n\nLet's do this for our examples:\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\ntype(temp_c)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.python .cell-code}\ntype(weather)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.python .cell-code}\ntype(temp_c_tuple)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\n```\n\n\n:::\n:::\n\n\n\n\n\n:::\n:::\n\n### Making changes\n\nQuite often we would want to make some changes to a collection of data. There are different ways we can do this.\n\nLet's say we gathered some new temperature data and wanted to add this to the original `temp_c` data.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nWe'd use the `c()` function to combine the new data:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nc(temp_c, 22, 34)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 23 24 31 27 18 21 22 34\n```\n\n\n:::\n:::\n\n\n\n\n\n## Python\n\nWe take the original `temp_c` list and add the new values:\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\ntemp_c + [22, 34]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[23, 24, 31, 27, 18, 21, 22, 34]\n```\n\n\n:::\n:::\n\n\n\n\n:::\n\nLet's consider another scenario. 
Again, we went out to gather some new temperature data, but this time we stored the measurements into an object called `temp_new` and wanted to add these to the original `temp_c` data.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntemp_new <- c(5, 16, 8, 12)\n```\n:::\n\n\n\n\nNext, we wanted to combine these new data with the original data, which we stored in `temp_c`.\n\nAgain, we can use the `c()` function:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nc(temp_c, temp_new)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [1] 23 24 31 27 18 21 5 16 8 12\n```\n\n\n:::\n:::\n\n\n\n\n## Python\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\ntemp_new = [5, 16, 8, 12]\n```\n:::\n\n\n\n\nWe can use the `+` operator to add the two lists together:\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\ntemp_c + temp_new\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[23, 24, 31, 27, 18, 21, 5, 16, 8, 12]\n```\n\n\n:::\n:::\n\n\n\n\n:::\n\n### Number sequences\n\nWe often need to create sequences of numbers when analysing data. There are some useful shortcuts available to do this, which can be used in different situations. Run the following code to see the output.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n1:10 # integers from 1 to 10\n10:1 # integers from 10 to 1\nseq(1, 10, by = 2) # from 1 to 10 by steps of 2\nseq(10, 1, by = -0.5) # from 10 to 1 by steps of -0.5\nseq(1, 10, length.out = 20) # 20 equally spaced values from 1 to 10\n```\n:::\n\n\n\n\n\n## Python\n\nPython has some built-in functionality to deal with number sequences, but the `numpy` library is particularly helpful. 
We installed and loaded it previously, but if needed, re-run the following:\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport numpy as np\n```\n:::\n\n\n\n\nNext, we can create several different number sequences:\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nlist(range(1, 11)) # integers from 1 to 10\nlist(range(10, 0, -1)) # integers from 10 to 1\nlist(range(1, 11, 2)) # from 1 to 10 by steps of 2\nlist(np.arange(10, 1, -0.5)) # from 10 to 1 by steps of -0.5\nlist(np.linspace(1, 10, num = 20)) # 20 equally spaced values from 1 to 10\n```\n:::\n\n\n\n\n\n:::\n\n### Subsetting\n\nSometimes we want to extract one or more values from a collection of data. We will go into more detail later, but for now we'll see how to do this on the simple data structures we've covered so far.\n\n::: {.callout-warning collapse=\"true\"}\n## Technical: Differences in indexing between R and Python\n\nIn the course materials we keep R and Python separate in most cases. However, if you end up using both languages at some point then it's important to be aware about some key differences. One of them is **indexing**.\n\nEach item in a collection of data has a number, called an *index*. Now, it would be great if this was consistent across all programming languages, but it's not.\n\nR uses **1-based indexing** whereas Python uses **zero-based indexing**. What does this mean? Compare the following:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplants <- c(\"tree\", \"shrub\", \"grass\") # the index of \"tree\" is 1, \"shrub\" is 2 etc.\n```\n:::\n\n::: {.cell}\n\n```{.python .cell-code}\nplants = [\"tree\", \"shrub\", \"grass\"] # the index of \"tree\" is 0, \"shrub\" is 1 etc. \n```\n:::\n\n\n\n\n\nBehind the scenes of any programming language there is a lot of counting going on. So, it matters if you count starting at zero or one. So, if I'd ask:\n\n\"Hey, R - give me the items with index 1 and 2 in `plants`\" then I'd get `tree` and `shrub`. 
\n\nIf I'd ask that question in Python, then I'd get `shrub` and `grass`. Fun times.\n:::\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nIn R we can use square brackets `[ ]` to extract values. Let's explore this using our `weather` object.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nweather # remind ourselves of the data\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"sunny\" \"cloudy\" \"partial_cloud\" \"cloudy\" \n[5] \"sunny\" \"rainy\" \n```\n\n\n:::\n\n```{.r .cell-code}\nweather[2] # extract the second value\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"cloudy\"\n```\n\n\n:::\n\n```{.r .cell-code}\nweather[2:4] # extract the second to fourth value\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"cloudy\" \"partial_cloud\" \"cloudy\" \n```\n\n\n:::\n\n```{.r .cell-code}\nweather[c(3, 1)] # extract the third and first value\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"partial_cloud\" \"sunny\" \n```\n\n\n:::\n\n```{.r .cell-code}\nweather[-1] # extract all apart from the first value\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"cloudy\" \"partial_cloud\" \"cloudy\" \"sunny\" \n[5] \"rainy\" \n```\n\n\n:::\n:::\n\n\n\n\n## Python\n\nLet's explore this using our `weather` object.\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nweather # remind ourselves of the data\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n['sunny', 'cloudy', 'partial_cloud', 'cloudy', 'sunny', 'rainy']\n```\n\n\n:::\n\n```{.python .cell-code}\nweather[1] # extract the second value\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n'cloudy'\n```\n\n\n:::\n\n```{.python .cell-code}\nweather[1:4] # extract the second to fourth value (end index is exclusive)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n['cloudy', 'partial_cloud', 'cloudy']\n```\n\n\n:::\n\n```{.python .cell-code}\nweather[2], weather[0] # extract the third and first value\n```\n\n::: {.cell-output 
.cell-output-stdout}\n\n```\n('partial_cloud', 'sunny')\n```\n\n\n:::\n\n```{.python .cell-code}\nweather[1:] # extract all apart from the first value\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n['cloudy', 'partial_cloud', 'cloudy', 'sunny', 'rainy']\n```\n\n\n:::\n:::\n\n\n\n\n:::\n\n\n\n## Dealing with missing data\n\n* LO: why is missing data important?\n* LO: good practices of dealing with missing data\n\n## Summary\n\n::: {.callout-tip}\n#### Key points\n\n- \n:::\n", + "markdown": "---\ntitle: Data types & structures\n---\n\n\n\n\n::: {.callout-tip}\n#### Learning objectives\n\n- \n:::\n\n\n## Context\n\nWe’ve seen examples where we entered data directly into a function. Most of the time we have data from elsewhere, such as a spreadsheet. In the previous section we created single objects. We’ll build up from this and introduce vectors and tabular data. We'll also briefly mention other data types, such as matrices, arrays.\n\n## Explained: Data types & structures\n\n### Data types\n\nProgramming languages are able to deal with different data types - and they need to. For example, it makes little sense to perform mathematical operations on text! To ensure that your data is viewed in the appropriate way, you need to be aware of some of the different **data types**.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nR has the following main data types:\n\n| Data type | Description|\n|-----------|--------------------------------------------------------------|\n| numeric | Represents numbers; can be whole (integers) or decimals \\\n(e.g., `19`or `2.73`).|\n| integer | Specific type of numeric data; can only be an integer \\\n(e.g., `7L` where `L` indicates an integer). 
|\n| character | Also called *text* or *string* \\\n(e.g., `\"Rabbits are great!\"`).|\n| logical | Also called *boolean values*; takes either `TRUE` or `FALSE`.|\n| factor | A type of categorical data that can have inherent ordering \\\n(e.g., `low`, `medium`, `high`).|\n\n\n## Python\n\nPython has the following main data types:\n\n| Data type | Description|\n|-----------|--------------------------------------------------------------|\n| int | Specific type of numeric data; can only be an integer \\\n(e.g., `7` or `56`).|\n| float | Decimal numbers \\\n(e.g., `3.92` or `9.824`).|\n| str | *Text* or *string* data \\\n(e.g., `\"Rabbits are great!\"`).|\n| bool | *Logical* or *boolean* values; takes either `True` or `False`.|\n\n:::\n\n### Data structures\n\nIn the section on [running code](#running-code) we saw how we can run code interactively. However, we frequently need to save values so we can work with them. We've just seen that we can have different *types* of data. We can save these into different *data structures*. Which data structure you need is often determined by the type of data and the complexity.\n\nIn the following sections we look at simple data structures.\n\n## Objects\n\nWe can store values into *objects*. To do this, we *assign* values to them. An object acts as a container for that value.\n\nTo create an object, we need to give it a name followed by the\nassignment operator and the value we want to give it, for example:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntemperature <- 23\n```\n:::\n\n\n\n\nWe can read the code as: the value `23` is assigned (`<-`) to the object `temperature`. Note that when you run this line of code the object you just created appears on your environment tab (top-right panel).\n\nWhen assigning a value to an object, R does not print anything on the console. 
You can print the value by typing the object name on the console or within your script and running that line of code.\n\n## Python\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\ntemperature = 23\n```\n:::\n\n\n\n\nWe can read the code as: the value `23` is assigned (`=`) to the object `temperature`.\n\nWhen assigning a value to an object, Python does not print anything on the console. You can print the value by typing the object name on the console or within your script and running that line of code.\n\n:::\n\n::: {.callout-important}\n## The assignment operator\n\nWe use an assignment operator to assign values on the right to objects on the left.\n\n::: {.panel-tabset group=\"language\"}\n## R\nIn R we use `<-` as the assignment operator.\n\nIn RStudio, typing Alt + - (push Alt at the same time as the - key) will write ` <- ` in a single keystroke on a PC, while typing Option + - (push Option at the same time as the - key) does the same on a Mac.

\n\n## Python\nIn Python we use `=` as the assignment operator.

\n\n:::\n\\\n:::\n\nObjects can be given almost any name such as `x`, `current_temperature`, or\n`subject_id`. Object names should be explicit and short. There are some exceptions / considerations (see below).\n\n::: {.callout-warning}\n## Restrictions on object names\n\nObject names can contain letters, numbers, underscores and periods. They *cannot start with a number or contain spaces*. Different people use different conventions for long variable names, two common ones being:\n\nUnderscore: my_long_named_object\n\nCamel case: myLongNamedObject\n\nWhat you use is up to you, but be consistent. Programming languages are **case-sensitive** so `temperature` is different from `Temperature`.\n\n* Some names are reserved words or keywords and cannot be used as object names, because they are fundamental to the language (e.g., `if`, `else`, `for`, see [R](https://stat.ethz.ch/R-manual/R-devel/library/base/html/Reserved.html) or [Python](https://docs.python.org/3/reference/lexical_analysis.html#keywords) for a complete list).\n* Avoid using function names (e.g., `c`, `T`, `mean`, `data`, `df`, `weights`), even if allowed. If in doubt, check the help to see if the name is already in use.\n* Avoid full-stops (`.`) within an object name as in `my.data`. Full-stops often have meaning in programming languages, so it's best to avoid them.\n* Use consistent styling. In R, popular style guides are:\n * [The tidyverse's](http://style.tidyverse.org/).\n * [Google's](https://google.github.io/styleguide/Rguide.xml)\n\n**Whatever style you use, be consistent!**\n:::\n\n### Using objects\n\nNow that we have the `temperature` in memory, we can use it to perform operations.
For example, this might be the temperature in Celsius and we might want to convert it to Kelvin.\n\nTo do this, we need to add `273.15`:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntemperature + 273.15\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 296.15\n```\n\n\n:::\n:::\n\n\n\n\n## Python\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\ntemperature + 273.15\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n296.15\n```\n\n\n:::\n:::\n\n\n\n:::\n\nWe can change an object's value by assigning a new one:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntemperature <- 36\ntemperature + 273.15\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 309.15\n```\n\n\n:::\n:::\n\n\n\n\n## Python\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\ntemperature = 36\ntemperature + 273.15\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n309.15\n```\n\n\n:::\n:::\n\n\n\n:::\n\nFinally, assigning a value to one object does not change the values of other objects.
For example, let’s store the outcome in Kelvin in a new object `temp_K`:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntemp_K <- temperature + 273.15\n```\n:::\n\n\n\n\n## Python\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\ntemp_K = temperature + 273.15\n```\n:::\n\n\n\n:::\n\nChanging the value of `temperature` does not change the value of `temp_K`.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntemperature <- 14\ntemp_K\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 309.15\n```\n\n\n:::\n:::\n\n\n\n\n## Python\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\ntemperature = 14\ntemp_K\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n309.15\n```\n\n\n:::\n:::\n\n\n\n:::\n\n### Updating objects\n\n> LO: update objects in R\n> LO: update objects in Python & demonstrate lack of updates in tuples\n\n## Collections of data\n\nIn the examples above we have stored single values into an object. Of course we often have to deal with more than that. Generally speaking, we can create **collections** of data. This enables us to organise our data, for example by creating a collection of numbers or text values.\n\n### Creating collections\n\nCreating a collection of data is pretty straightforward, particularly if you are doing it manually.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nThe simplest collection of data in R is called a **vector**. This really is the workhorse of R.\n\nA vector is composed of a series of values, which can be numbers, text or any of the data types described.\n\nWe can assign a series of values to a vector using the `c()` function.
For example, we can create a vector of temperatures and assign it to a new object `temp_c`:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntemp_c <- c(23, 24, 31, 27, 18, 21)\n\ntemp_c\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 23 24 31 27 18 21\n```\n\n\n:::\n:::\n\n\n\n\nA vector can also contain text. For example, let's create a vector that contains weather descriptions:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nweather <- c(\"sunny\", \"cloudy\", \"partial_cloud\", \"cloudy\", \"sunny\", \"rainy\")\n\nweather\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"sunny\" \"cloudy\" \"partial_cloud\" \"cloudy\" \n[5] \"sunny\" \"rainy\" \n```\n\n\n:::\n:::\n\n\n\n\n## Python\n\nThe simplest collection of data in Python is either a **list** or a **tuple**. Both can hold items of the same or different types. Whereas a tuple *cannot* be changed after it's created, a *list* can.\n\nWe can assign a collection of numbers to a list:\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\ntemp_c = [23, 24, 31, 27, 18, 21]\n\ntemp_c\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[23, 24, 31, 27, 18, 21]\n```\n\n\n:::\n:::\n\n\n\n\n\nA list can also contain text. For example, let's create a list that contains weather descriptions:\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nweather = [\"sunny\", \"cloudy\", \"partial_cloud\", \"cloudy\", \"sunny\", \"rainy\"]\n\nweather\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n['sunny', 'cloudy', 'partial_cloud', 'cloudy', 'sunny', 'rainy']\n```\n\n\n:::\n:::\n\n\n\n\nWe can also create a *tuple*. Remember, this is like a list, but it cannot be altered after creating it. Note the difference in the type of brackets, where we use `( )` round brackets instead of `[ ]` square brackets:\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\ntemp_c_tuple = (23, 24, 31, 27, 18, 21)\n```\n:::\n\n\n\n\n:::\n\nNote that when we define text (e.g.
`\"cloudy\"` or `\"sunny\"`), we need to use quotes.\n\nWhen we deal with numbers - whole or decimal (e.g. `23`, `18.5`) - we do not use quotes.\n\n\n::: {.callout-important}\n## Having a type\n\nDifferent data types result in slightly different types of objects. It can be quite useful to check how your data is viewed by the computer.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nWe can use the `class()` function to find out how R views our data. This function also works for more complex data structures.\n\nLet's do this for our examples:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nclass(temp_c)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"numeric\"\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nclass(weather)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"character\"\n```\n\n\n:::\n:::\n\n\n\n\n## Python\n\nWe can use the `type()` function to find out how Python views our data. This function also works for more complex data structures.\n\nLet's do this for our examples:\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\ntype(temp_c)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n<class 'list'>\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.python .cell-code}\ntype(weather)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n<class 'list'>\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.python .cell-code}\ntype(temp_c_tuple)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n<class 'tuple'>\n```\n\n\n:::\n:::\n\n\n\n\n\n:::\n:::\n\n### Making changes\n\nQuite often we would want to make some changes to a collection of data.
There are different ways we can do this.\n\nLet's say we gathered some new temperature data and wanted to add this to the original `temp_c` data.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nWe'd use the `c()` function to combine the new data:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nc(temp_c, 22, 34)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 23 24 31 27 18 21 22 34\n```\n\n\n:::\n:::\n\n\n\n\n\n## Python\n\nWe take the original `temp_c` list and add the new values:\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\ntemp_c + [22, 34]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[23, 24, 31, 27, 18, 21, 22, 34]\n```\n\n\n:::\n:::\n\n\n\n\n:::\n\nLet's consider another scenario. Again, we went out to gather some new temperature data, but this time we stored the measurements into an object called `temp_new` and wanted to add these to the original `temp_c` data.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntemp_new <- c(5, 16, 8, 12)\n```\n:::\n\n\n\n\nNext, we wanted to combine these new data with the original data, which we stored in `temp_c`.\n\nAgain, we can use the `c()` function:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nc(temp_c, temp_new)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [1] 23 24 31 27 18 21 5 16 8 12\n```\n\n\n:::\n:::\n\n\n\n\n## Python\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\ntemp_new = [5, 16, 8, 12]\n```\n:::\n\n\n\n\nWe can use the `+` operator to add the two lists together:\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\ntemp_c + temp_new\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[23, 24, 31, 27, 18, 21, 5, 16, 8, 12]\n```\n\n\n:::\n:::\n\n\n\n\n:::\n\n### Number sequences\n\nWe often need to create sequences of numbers when analysing data. There are some useful shortcuts available to do this, which can be used in different situations. 
Run the following code to see the output.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n1:10 # integers from 1 to 10\n10:1 # integers from 10 to 1\nseq(1, 10, by = 2) # from 1 to 10 by steps of 2\nseq(10, 1, by = -0.5) # from 10 to 1 by steps of -0.5\nseq(1, 10, length.out = 20) # 20 equally spaced values from 1 to 10\n```\n:::\n\n\n\n\n\n## Python\n\nPython has some built-in functionality to deal with number sequences, but the `numpy` library is particularly helpful. We installed and loaded it previously, but if needed, re-run the following:\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport numpy as np\n```\n:::\n\n\n\n\nNext, we can create several different number sequences:\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nlist(range(1, 11)) # integers from 1 to 10\nlist(range(10, 0, -1)) # integers from 10 to 1\nlist(range(1, 11, 2)) # from 1 to 10 by steps of 2\nlist(np.arange(10, 0.5, -0.5)) # from 10 to 1 by steps of -0.5 (the stop value is excluded)\nlist(np.linspace(1, 10, num = 20)) # 20 equally spaced values from 1 to 10\n```\n:::\n\n\n\n\n\n:::\n\n### Subsetting\n\nSometimes we want to extract one or more values from a collection of data. We will go into more detail later, but for now we'll see how to do this on the simple data structures we've covered so far.\n\n::: {.callout-warning collapse=\"true\"}\n## Technical: Differences in indexing between R and Python\n\nIn the course materials we keep R and Python separate in most cases. However, if you end up using both languages at some point then it's important to be aware of some key differences. One of them is **indexing**.\n\nEach item in a collection of data has a number, called an *index*. Now, it would be great if this was consistent across all programming languages, but it's not.\n\nR uses **1-based indexing** whereas Python uses **zero-based indexing**. What does this mean?
Compare the following:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplants <- c(\"tree\", \"shrub\", \"grass\") # the index of \"tree\" is 1, \"shrub\" is 2 etc.\n```\n:::\n\n::: {.cell}\n\n```{.python .cell-code}\nplants = [\"tree\", \"shrub\", \"grass\"] # the index of \"tree\" is 0, \"shrub\" is 1 etc. \n```\n:::\n\n\n\n\n\nBehind the scenes of any programming language there is a lot of counting going on. So, it matters if you count starting at zero or one. So, if I'd ask:\n\n\"Hey, R - give me the items with index 1 and 2 in `plants`\" then I'd get `tree` and `shrub`. \n\nIf I'd ask that question in Python, then I'd get `shrub` and `grass`. Fun times.\n:::\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nIn R we can use square brackets `[ ]` to extract values. Let's explore this using our `weather` object.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nweather # remind ourselves of the data\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"sunny\" \"cloudy\" \"partial_cloud\" \"cloudy\" \n[5] \"sunny\" \"rainy\" \n```\n\n\n:::\n\n```{.r .cell-code}\nweather[2] # extract the second value\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"cloudy\"\n```\n\n\n:::\n\n```{.r .cell-code}\nweather[2:4] # extract the second to fourth value\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"cloudy\" \"partial_cloud\" \"cloudy\" \n```\n\n\n:::\n\n```{.r .cell-code}\nweather[c(3, 1)] # extract the third and first value\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"partial_cloud\" \"sunny\" \n```\n\n\n:::\n\n```{.r .cell-code}\nweather[-1] # extract all apart from the first value\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"cloudy\" \"partial_cloud\" \"cloudy\" \"sunny\" \n[5] \"rainy\" \n```\n\n\n:::\n:::\n\n\n\n\n## Python\n\nLet's explore this using our `weather` object.\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nweather # remind ourselves of the data\n```\n\n::: {.cell-output 
.cell-output-stdout}\n\n```\n['sunny', 'cloudy', 'partial_cloud', 'cloudy', 'sunny', 'rainy']\n```\n\n\n:::\n\n```{.python .cell-code}\nweather[1] # extract the second value\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n'cloudy'\n```\n\n\n:::\n\n```{.python .cell-code}\nweather[1:4] # extract the second to fourth value (end index is exclusive)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n['cloudy', 'partial_cloud', 'cloudy']\n```\n\n\n:::\n\n```{.python .cell-code}\nweather[2], weather[0] # extract the third and first value\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n('partial_cloud', 'sunny')\n```\n\n\n:::\n\n```{.python .cell-code}\nweather[1:] # extract all apart from the first value\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n['cloudy', 'partial_cloud', 'cloudy', 'sunny', 'rainy']\n```\n\n\n:::\n:::\n\n\n\n\n:::\n\n\n\n## Dealing with missing data\n\n* LO: why is missing data important?\n* LO: good practices of dealing with missing data\n\n## Summary\n\n::: {.callout-tip}\n#### Key points\n\n- \n:::\n", "supporting": [ "02-basic-objects-and-data-types_files" ], diff --git a/materials/01-intro-software.qmd b/materials/01-intro-software.qmd index ba5fa17..6b7ff80 100644 --- a/materials/01-intro-software.qmd +++ b/materials/01-intro-software.qmd @@ -453,7 +453,7 @@ Create a working directory called `data-analysis`. When you've done this, add th * `scripts` * `images` -**Note**: programming languages are case-sensitive, so `data` is not treated the same way as `data`. +**Note**: programming languages are case-sensitive, so `data` is not treated the same way as `Data`. 
::: :::{.callout-exercise #ex-createscript} diff --git a/materials/02-basic-objects-and-data-types.qmd b/materials/02-basic-objects-and-data-types.qmd index 2752199..0f0538c 100644 --- a/materials/02-basic-objects-and-data-types.qmd +++ b/materials/02-basic-objects-and-data-types.qmd @@ -203,13 +203,18 @@ temp_K ``` ::: +### Updating objects + +> LO: update objects in R +> LO: update objects in Python & demonstrate lack of updates in tuples + ## Collections of data In the examples above we have stored single values into an object. Of course we often have to deal with more than tat. Generally speaking, we can create **collections** of data. This enables us to organise our data, for example by creating a collection of numbers or text values. ### Creating collections -Creating a collection of data is pretty straightforward, particularly if you are manually doing it. +Creating a collection of data is pretty straightforward, particularly if you are doing it manually. ::: {.panel-tabset group="language"} ## R