Commit 1d6696f

committed
adding ecology pages
1 parent eaca8ef commit 1d6696f

65 files changed

+5588
-1377
lines changed

EcologyLesson/.DS_Store

8 KB
Binary file not shown.

EcologyLesson/Intro/index.html

Lines changed: 0 additions & 412 deletions
This file was deleted.

EcologyLesson/Intro/index.qmd

Lines changed: 89 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,89 @@
1+
---
2+
title: "Introduction"
3+
subtitle: "Data Organization in Spreadsheets for Ecologists"
4+
---
5+
6+
## Overview
7+
* **Teaching:** 15 min
8+
* **Exercises:** 3 min
9+
* **Questions**
10+
* What are basic principles for using spreadsheets for good data organization?
11+
* **Objectives**
12+
* Describe best practices for organizing data so computers can make the best use of data sets.
13+
14+
Good data organization is the foundation of your research project. Most researchers have data or do data entry in spreadsheets. Spreadsheet programs are very useful graphical interfaces for designing data tables and handling very basic data quality control functions.
15+
16+
## Spreadsheet outline
17+
18+
After this lesson, you will be able to:
19+
20+
* Implement best practices in data table formatting
21+
* Identify and address common formatting mistakes
22+
* Understand approaches for handling dates in spreadsheets
23+
* Utilize basic quality control features and data manipulation practices
24+
* Effectively export data from spreadsheet programs
25+
Overall good data practices
26+
27+
Spreadsheets are good for data entry. Therefore we have a lot of data in spreadsheets. Much of your time as a researcher will be spent in this ‘data wrangling’ stage. It's not the most fun, but it's necessary. We'll teach you how to think about data organization and some practices for more effective data wrangling.
28+
29+
## What this lesson will not teach you
30+
31+
* How to do statistics in a spreadsheet
32+
* How to do plotting in a spreadsheet
33+
* How to write code in spreadsheet programs
34+
35+
If you're looking to do this, a good reference is [Head First Excel](https://www.amazon.com/Head-First-Excel-learners-spreadsheets/dp/0596807694/), published by O’Reilly.
36+
37+
## Why aren't we teaching data analysis in spreadsheets
38+
39+
* Data analysis in spreadsheets usually requires a lot of manual work. If you want to change a parameter or run an analysis with a new data set, you usually have to redo everything by hand. (We do know that you can create macros, but see the next point.)
40+
41+
* It is also difficult to track or reproduce statistical or plotting analyses done in spreadsheet programs when you want to go back to your work or someone asks for details of your analysis.
42+
43+
44+
## Spreadsheet programs
45+
46+
Many spreadsheet programs are available. Since most participants utilize Excel as their primary spreadsheet program, this lesson will make use of Excel examples.
47+
48+
Free alternatives that can also be used include LibreOffice Calc and Google Sheets.
49+
50+
Commands may differ a bit between programs, but the general idea is the same.
51+
52+
Spreadsheets encompass a lot of the things we need to be able to do as researchers. We can use them for:
53+
54+
* Data entry
55+
* Organizing data
56+
* Subsetting and sorting data
57+
* Statistics
58+
* Plotting
59+
60+
We do a lot of different operations in spreadsheets. What kind of operations do you do in spreadsheets? Which ones do you think spreadsheets are good for?
61+
62+
## Problems with Spreadsheets
63+
64+
Spreadsheets are good for data entry, but in reality we tend to use spreadsheet programs for much more than data entry. We use them to create data tables for publications, to generate summary statistics, and make figures.
65+
66+
Generating tables for publications in a spreadsheet is not optimal: when formatting a data table for publication, we often report key summary statistics in a way that is not really meant to be read as data, and that often involves special formatting (merging cells, creating borders, making it pretty). Cutting and pasting from a spreadsheet into a word processor (like Word) can have unpredictable results. We advise you to create such tables within the word processor itself, using its own table editing tools.
67+
68+
The latter two applications, generating statistics and figures, should be used with caution: because of the graphical, drag and drop nature of spreadsheet programs, it can be very difficult, if not impossible, to replicate your steps (much less retrace anyone else's), particularly if your stats or figures require you to do more complex calculations. Furthermore, in doing calculations in a spreadsheet, it's easy to accidentally apply a slightly different formula to multiple adjacent cells. When using a command-line based statistics program like R or SAS, it's practically impossible to apply a calculation to one observation in your data set but not another unless you're doing it on purpose.
69+
70+
## Using Spreadsheets for Data Entry and Cleaning
71+
72+
However, there are circumstances where you might want to use a spreadsheet program to produce “quick and dirty” calculations or figures, and data cleaning will help you use some of these features. Data cleaning also puts your data in a better format prior to importation into a statistical analysis program. We will show you how to use some features of spreadsheet programs to check your data quality along the way and produce preliminary summary statistics.
73+
74+
In this lesson, we will assume that you are most likely using Excel as your primary spreadsheet program - there are others (gnumeric, Calc from OpenOffice), and their functionality is similar, but Excel seems to be the program most used by biologists and ecologists.
75+
76+
In this lesson we're going to talk about:
77+
78+
1. [Formatting data tables in spreadsheets](../formattingtables/index.qmd)
79+
2. [Formatting problems](../formattingproblems/index.qmd)
80+
3. [Dates as data](../datesasdata/index.qmd)
81+
4. [Quality control](../qualitycontrol/index.qmd)
82+
5. [Exporting data](../exporting/index.qmd)
83+
84+
## Key Points
85+
* Good data organization is the foundation of any research project.
86+
87+
Licensed under [CC-BY 4.0 2018–2022](https://datacarpentry.org/spreadsheet-ecology-lesson/00-intro/index.html) by [The Carpentries](https://carpentries.org/)
88+
89+
Licensed under [CC-BY 4.0 2016–2018](https://datacarpentry.org/spreadsheet-ecology-lesson/00-intro/index.html) by [Data Carpentry](http://datacarpentry.org/)
Lines changed: 141 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,141 @@
1+
---
2+
title: "Dates as data"
3+
subtitle: "Data Organization in Spreadsheets for Ecologists"
4+
---
5+
6+
## Overview
7+
* **Teaching:** 10 min
8+
* **Exercises:** 3 min
9+
* **Questions**
10+
* What are good approaches for handling dates in spreadsheets?
11+
* **Objectives**
12+
* Describe how dates are stored and formatted in spreadsheets.
13+
* Describe the advantages of alternative date formatting in spreadsheets.
14+
* Demonstrate best practices for entering dates in spreadsheets.
15+
16+
Dates in spreadsheets can be a problem. For one thing, dates are stored in a single column. While this seems the most natural way to record dates, it actually is not best practice. A spreadsheet application will display the dates in a seemingly correct way (to a human observer) but how it actually handles and stores the dates may be problematic.
17+
18+
In particular, please remember that DATE functions that are valid for a given spreadsheet program (be it LibreOffice Calc, Microsoft Excel, OpenOffice, Gnumeric, etc.) are usually guaranteed to be compatible only within the same family of products. Most of the images of spreadsheets in this lesson come from Microsoft Excel, run on a Mac or on Windows. Regardless of your spreadsheet program, if you will later need to export the data and preserve the timestamps, you are better off handling them using one of the solutions discussed below.
19+
20+
One of the big problems with Excel is that it can [turn things that aren’t dates into dates](https://nsaunders.wordpress.com/2012/10/22/gene-name-errors-and-excel-lessons-not-learned/). For example, gene/protein names or identifiers like MAR1, DEC1 and OCT4 will be changed to dates, and you cannot retrieve the original name or identifier (except manually). So if you avoid the date format altogether, it’s easier to work with these types of data. When you must work with dates, here is how to do it efficiently.
21+
22+
## Exercise
23+
Challenge: pulling month, day and year out of dates
24+
25+
* Let’s create a tab called `dates` in our data spreadsheet and copy the ‘plot 3’ table from the `2014` tab (that contains the problematic dates).
26+
* Let’s extract month, day and year from the dates in the `Date collected` column into new columns. For this we can use the following built-in Excel functions:
27+
28+
```YEAR()```
29+
30+
```MONTH()```
31+
32+
```DAY()```
33+
34+
(Make sure the new columns are formatted as a number and not as a date.)
35+
36+
You can see that even though we expected the year to be 2014, the year is actually 2015. What happened here is that the field assistant who collected the data for year 2014 initially forgot to include their data for ‘plot 3’ in this dataset. They came back in 2015 to add the missing data into the dataset and entered the dates for ‘plot 3’ without the year. Excel automatically interpreted the year as 2015 - the year the data was entered into the spreadsheet and not the year the data was collected. Thereby, the spreadsheet program introduced an error in the dataset without the field assistant realising.
37+
38+
## Exercise
39+
40+
Challenge: pulling hour, minute and second out of the current time
41+
42+
Current time and date are best retrieved using the functions `NOW()`, which returns the current date and time, and `TODAY()`, which returns the current date. The results will be formatted according to your computer’s settings.
43+
44+
1. Extract the year, month and day from the current date and time string returned by the NOW() function.
45+
2. Calculate the current time using NOW()-TODAY().
46+
3. Extract the hour, minute and second from the current time using functions HOUR(), MINUTE() and SECOND().
47+
4. Press F9 to force the spreadsheet to recalculate the NOW() function, and check that it has been updated.
48+
49+
## Preferred date format
50+
51+
It is much safer to store dates with [YEAR, MONTH, DAY](https://datacarpentry.org/spreadsheet-ecology-lesson/03-dates-as-data/index.html#day) in separate columns or as [YEAR and DAY-OF-YEAR](https://datacarpentry.org/spreadsheet-ecology-lesson/03-dates-as-data/index.html#doy) in separate columns.
52+
53+
Note: Excel is unable to parse dates from before 1899-12-31, and will thus leave these untouched. If you’re mixing historic data from before and after this date, Excel will translate only the post-1900 dates into its internal format, thus resulting in mixed data. If you’re working with historic data, be extremely careful with your dates!
54+
55+
Excel also entertains a second date system, the 1904 date system, as the default in Excel for Macintosh. This system will assign a different serial number than the [1900 date system](https://support.microsoft.com/en-us/help/214330/differences-between-the-1900-and-the-1904-date-system-in-excel). Because of this, [dates must be checked for accuracy when exporting data from Excel](http://uc3.cdlib.org/2014/04/09/abandon-all-hope-ye-who-enter-dates-in-excel/) (look for dates that are ~4 years off).
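To get a feel for the size of that offset, the sketch below interprets one serial number under both systems in Python. The epochs used here are assumptions for illustration: 30 December 1899 for the 1900 system (a value that reproduces Excel's serial numbers for dates from March 1900 onward) and 1 January 1904 for the 1904 system. The function names are hypothetical and not part of the lesson's materials.

```python
from datetime import date, timedelta

# Assumed "day 0" of each Excel date system (see the caveats above).
EPOCH_1900 = date(1899, 12, 30)
EPOCH_1904 = date(1904, 1, 1)


def from_serial_1900(serial):
    """Interpret a whole-day serial number under the 1900 date system."""
    return EPOCH_1900 + timedelta(days=serial)


def from_serial_1904(serial):
    """Interpret a whole-day serial number under the 1904 date system."""
    return EPOCH_1904 + timedelta(days=serial)


serial = 41822
print(from_serial_1900(serial))  # 2014-07-02
print(from_serial_1904(serial))  # 2018-07-03

# The two readings of the same number differ by a constant number of days.
print((from_serial_1904(serial) - from_serial_1900(serial)).days)  # 1462
```

A constant gap of 1462 days (four years and one day) between an expected and an observed date is therefore a strong hint that a file was written under one system and read under the other.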
56+
57+
## Date formats in spreadsheets
58+
59+
Spreadsheet programs have numerous “useful features” which allow them to handle dates in a variety of ways.
60+
61+
![Many formats, many ambiguities](index4_files/5_excel_dates_1.jpg)
62+
63+
But these “features” often allow ambiguity to creep into your data. Ideally, data should be as unambiguous as possible.
64+
65+
## Dates stored as integers
66+
67+
The first thing you need to know is that Excel stores dates as numbers - see the last column in the above figure. Essentially, it counts the days from a default of December 31, 1899, and thus stores July 2, 2014 as the serial number 41822.
68+
69+
(But wait. That’s the default on my version of Excel. We’ll get into how this can introduce problems down the line later in this lesson.)
70+
71+
This serial number thing can actually be useful in some circumstances. By using the above functions we can easily add days, months or years to a given date. Say you had a sampling plan where you needed to sample every thirty-seven days. In another cell, you could type:
72+
73+
```=B2+37```
74+
75+
And it would return
76+
77+
```8-Aug```
78+
79+
because it understands the date as a number `41822`, and `41822 + 37 = 41859` which Excel interprets as August 8, 2014. It retains the format (for the most part) of the cell that is being operated upon, (unless you did some sort of formatting to the cell before, and then all bets are off). Month and year rollovers are internally tracked and applied.
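The same arithmetic can be checked outside Excel. This is a minimal Python sketch, assuming the 1900 date system and an epoch of 30 December 1899 (a value that reproduces Excel's serial numbers for modern dates). `serial_to_date` and `date_to_serial` are hypothetical helper names.

```python
from datetime import date, timedelta

# Assumed epoch for the 1900 date system; valid for dates after February 1900.
EXCEL_EPOCH = date(1899, 12, 30)


def serial_to_date(serial):
    """Convert a whole-day Excel serial number to a calendar date."""
    return EXCEL_EPOCH + timedelta(days=serial)


def date_to_serial(d):
    """Convert a calendar date back to its Excel serial number."""
    return (d - EXCEL_EPOCH).days


start = serial_to_date(41822)
next_sample = serial_to_date(41822 + 37)
print(start)        # 2014-07-02
print(next_sample)  # 2014-08-08
print(date_to_serial(next_sample))  # 41859
```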
80+
81+
**Note** Adding years and months and days is slightly trickier because we need to make sure that we are adding the amount to the correct entity.
82+
83+
* First we extract the single entities (day, month or year)
84+
* We then add the value to the entity we want to change
85+
* Finally the complete date string is reconstructed using the DATE() function.
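The three steps above can be sketched in Python for the case of adding months. `add_months` is a hypothetical helper, and clamping the day to the end of a shorter month is a design choice made for this sketch. Excel's `DATE()` instead rolls surplus days over into the following month, so the two approaches can disagree near month ends.

```python
import calendar
from datetime import date


def add_months(d, months):
    """Add a number of months to a date by rebuilding it from its parts."""
    # Step 1: extract the single entities.
    year, month, day = d.year, d.month, d.day

    # Step 2: add the value to the month, carrying any overflow into the year.
    total = (month - 1) + months
    year += total // 12
    month = total % 12 + 1

    # Step 3: reconstruct the date. The day is clamped to the last day of the
    # target month (an assumption of this sketch, not Excel's behaviour).
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(day, last_day))


print(add_months(date(2014, 7, 2), 3))    # 2014-10-02
print(add_months(date(2014, 11, 30), 3))  # 2015-02-28
```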
86+
87+
Times are handled in a similar way to dates: they are stored as fractions of a day, so to add hours, minutes or seconds we need to make sure that we are adding the quantities to the correct entities.
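As a sketch of what this means in practice: Excel stores the time of day as the fractional part of the serial number, so one hour is 1/24 of a day and one second is 1/86400 of a day. The Python below assumes the 1900 date system with an epoch of 30 December 1899. `serial_to_datetime` is a hypothetical helper name.

```python
from datetime import datetime, timedelta

# Assumed epoch for the 1900 date system; valid for dates after February 1900.
EXCEL_EPOCH = datetime(1899, 12, 30)
SECONDS_PER_DAY = 86400


def serial_to_datetime(serial):
    """Convert a fractional Excel serial number to a date and time."""
    # Round to whole seconds to absorb floating-point error in the fraction.
    return EXCEL_EPOCH + timedelta(seconds=round(serial * SECONDS_PER_DAY))


noon = 41822.5  # half a day after midnight on serial day 41822
print(serial_to_datetime(noon))                         # 2014-07-02 12:00:00
print(serial_to_datetime(noon + 1 / 24))                # 2014-07-02 13:00:00
print(serial_to_datetime(noon + 90 / SECONDS_PER_DAY))  # 2014-07-02 12:01:30
```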
88+
89+
This brings us to the many different ways in which Excel can display dates. If you refer to the figure above, you’ll see that ambiguity can creep into your data depending on the format you choose when you enter it. If you’re not fully aware of which format you’re using, you can end up entering your data in a way that Excel will badly misinterpret, leaving errors in your data that are extremely difficult to track down and troubleshoot.
90+
91+
### Exercise
92+
What happens to the `dates` in the dates tab of our workbook if we save this sheet in Excel (in `csv` format) and then open the file in a plain text editor (like TextEdit or Notepad)? What happens to the dates if we then open the `csv` file in Excel?
93+
94+
#### Note
95+
You will notice that when exporting into a text-based format (such as CSV), Excel will export its internal date integer instead of a useful value (that is, the dates will be represented as integer numbers). This can potentially lead to problems if you use other software to manipulate the file.
96+
97+
## Advantages of Alternative Date Formatting
98+
99+
## Storing dates as YEAR, MONTH, DAY
100+
101+
Storing dates in YEAR, MONTH, DAY format helps remove this ambiguity. Let’s look at this issue a bit closer.
102+
103+
For instance this is a spreadsheet representing insect counts that were taken every few days over the summer, and things went something like this:
104+
105+
![So, so ambiguous, it's even confusing Excel](index4_files/6_excel_dates_2.jpg)
106+
107+
If Excel were to be believed, this person had been collecting bugs **in the future**. Now, we have no doubt this person is highly capable, but we believe time travel was beyond even their grasp.
108+
109+
Entering dates in one cell is convenient, but because spreadsheet programs may interpret and save the data in different ways (somewhat behind the scenes), there is a better practice.
110+
111+
In dealing with dates in spreadsheets, separate date data into separate fields (day, month, year), which will eliminate any chance of ambiguity.
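For data that already exists as proper dates, the split can also be done programmatically. The Python sketch below writes year, month and day as separate CSV columns. The rows, the `count` column and the `split_date` helper are invented for illustration and are not taken from the lesson's dataset.

```python
import csv
import io
from datetime import date


def split_date(d):
    """Break a date into the three separate fields recommended above."""
    return {"year": d.year, "month": d.month, "day": d.day}


# Hypothetical observations, used only to demonstrate the column layout.
rows = [
    {"date": date(2014, 7, 2), "count": 12},
    {"date": date(2014, 7, 5), "count": 9},
]

# Write to an in-memory buffer; a real script would open a file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["year", "month", "day", "count"])
writer.writeheader()
for row in rows:
    writer.writerow({**split_date(row["date"]), "count": row["count"]})

print(buffer.getvalue())
```

Because each field is a plain integer, no program reading this file has to guess whether `7/2` means 2 July or 7 February.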
112+
113+
## Storing dates as YEAR, DAY-OF-YEAR
114+
115+
There is another option: you can store dates as year and day of year (DOY). Why? Because depending on your question, this might be what’s useful to you, and there is practically no possibility of ambiguity creeping in.
116+
117+
Statistical models often incorporate year as a factor, or a categorical variable, rather than a numeric variable, to account for year-to-year variation, and DOY can be used to measure the passage of time within a year.
118+
119+
So, can you convert all your dates into DOY format? Well, in Excel, here’s a useful guide:
120+
121+
![Kill that ambiguity before it bites you!](index4_files/7_excel_dates_3.jpg)
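Outside of Excel, the same conversion can be sketched in Python with the standard library. `to_year_doy` and `from_year_doy` are hypothetical helper names. Day of year is counted from 1, so 1 January is day 1.

```python
from datetime import date, timedelta


def to_year_doy(d):
    """Return (year, day-of-year) for a date, with 1 January as day 1."""
    return d.year, d.timetuple().tm_yday


def from_year_doy(year, doy):
    """Rebuild the calendar date from a year and a day-of-year."""
    return date(year, 1, 1) + timedelta(days=doy - 1)


print(to_year_doy(date(2014, 7, 2)))  # (2014, 183)
print(from_year_doy(2014, 183))       # 2014-07-02
```

Keep the year alongside the DOY: the same day number falls on a different calendar date after February in a leap year.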
122+
123+
## Storing dates as a single string
124+
125+
Another alternative could be to convert the date string into a single string using the `YYYYMMDD` format. For example the date `March 24, 2015` would become `20150324`. This option also works for datetimes using the `YYYYMMDDhhmmss` format. So the datetime `March 24, 2015 17:25:35` would become `20150324172535`, where:
126+
127+
* `YYYY`: the full year, i.e. 2015
128+
* `MM`: the month, i.e. 03
129+
* `DD`: the day of month, i.e. 24
130+
* `hh`: hour of day, i.e. 17
131+
* `mm`: minutes, i.e. 25
132+
* `ss`: seconds, i.e. 35
133+
134+
Such strings will be correctly sorted in ascending or descending order, and by knowing the format they can then be correctly processed by the receiving software.
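A minimal Python sketch of this format, using the standard `strftime` and `strptime` format codes. The helper names `to_compact` and `from_compact` are hypothetical.

```python
from datetime import datetime

# YYYYMMDDhhmmss expressed with the standard library's format codes.
COMPACT_FORMAT = "%Y%m%d%H%M%S"


def to_compact(dt):
    """Format a datetime as a fixed-width YYYYMMDDhhmmss string."""
    return dt.strftime(COMPACT_FORMAT)


def from_compact(text):
    """Parse a YYYYMMDDhhmmss string back into a datetime."""
    return datetime.strptime(text, COMPACT_FORMAT)


moments = [
    datetime(2015, 3, 24, 17, 25, 35),
    datetime(2014, 12, 1, 8, 0, 0),
    datetime(2015, 3, 24, 9, 5, 0),
]
stamps = [to_compact(m) for m in moments]

print(to_compact(moments[0]))  # 20150324172535

# Plain text sorting gives the same order as sorting the datetimes themselves.
print(sorted(stamps) == [to_compact(m) for m in sorted(moments)])  # True
```

The sort order works because every field is zero-padded to a fixed width and the fields run from the largest unit (year) to the smallest (second).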
135+
136+
## Key Points
137+
* Treating dates as multiple pieces of data rather than one makes them easier to handle.
138+
139+
Licensed under [CC-BY 4.0 2018–2022](https://datacarpentry.org/spreadsheet-ecology-lesson/00-intro/index.html) by [The Carpentries](https://carpentries.org/)
140+
141+
Licensed under [CC-BY 4.0 2016–2018](https://datacarpentry.org/spreadsheet-ecology-lesson/00-intro/index.html) by [Data Carpentry](http://datacarpentry.org/)
