Excel hell
Spreadsheets offer a simple and effective way to collect data. However, their ease of use often leads to a sloppy approach to data entry. Good data discipline is an essential foundation for data analysis, and a small investment at this stage can prevent you running into trouble when you come to analyse and publish your study. This tutorial will take you through the essential dos and don'ts of managing your data in a spreadsheet.
As a word of caution, don't forget that any data collection should be in line with the Data Protection Act. Please ensure you manage your data appropriately.
There are many different spreadsheet programs to choose from. Almost everyone will be familiar with Excel, although it's worth considering whether this is actually the right choice for you. If you are planning a major collaboration where many people need to enter data simultaneously, then an Excel spreadsheet is probably not for you; have a look at Google Sheets as an alternative. As your requirements grow, you may want to explore a relational database. But for now, we'll assume you are using Excel.
In computing terms, not all data is equal. It is important to tell Excel what kind of data you are storing in each cell. Whilst Excel will attempt to guess what kind of data you are entering, it isn't particularly good at it. The following are some of the main data types, and why they matter.
Integers are whole numbers. Decimals include any number with a decimal point. Computers apply maths differently to these two numerical data types, so it's best practice to be consistent.
A string is any sequence of characters. Literally anything you can type can be represented as a string. This is the default data type in Excel. This becomes really important when analysing your data. Take the example of recording the dose of analgesia given for postoperative pain relief. If we write “180 mg”, Excel will store this value as a string. It has no choice, since the “mg” adds data that cannot possibly be interpreted numerically. When dealing with data that is numerical in nature, it's best practice to keep all units elsewhere and store the data as an integer or decimal.
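To see why this matters, here is a minimal Python sketch (the values are invented for illustration). Strings cannot be summed numerically; the units must be stripped out first, a step that is easy to get wrong:

```python
# Doses entered as strings, because units were typed into each cell
doses_as_strings = ["180 mg", "90 mg", "120 mg"]

# sum(doses_as_strings) would fail: strings cannot be added numerically.
# The units must first be parsed away, an error-prone extra step.
doses_mg = [int(d.split()[0]) for d in doses_as_strings]
print(sum(doses_mg))  # total dose, now that the values are integers
```

Had the column been named dose_mg and the cells held plain integers, no parsing step would have been needed at all.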
Dates are essentially a special case of the decimal (sometimes also referred to as a double). Most systems store a date as a number of days counted from a reference point in time (1 January 1970 on Unix systems; Excel counts from 1 January 1900). Time is given as a decimal fraction of a single day. When storing date and time information, it is often easier to separate the two, i.e. store the date in one column and the time in another.
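The "date as a count of days" idea can be made explicit with Python's standard datetime module (the admission date below is hypothetical):

```python
from datetime import date

# A date is just a whole number of days from a reference point (the "epoch")
epoch = date(1970, 1, 1)
admission = date(2020, 3, 15)  # hypothetical admission date

serial = (admission - epoch).days
print(serial)  # the kind of whole number a spreadsheet stores behind the scenes
```

A time of day would then be a fraction added to this number: midday on that date is serial + 0.5.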
Booleans are True or False statements, usually represented as 1 or 0. False is ALWAYS 0 or NULL (which means no data is present). Any non-zero value could be interpreted as True in this context, but the convention is to use 1. When storing any true or false data, we recommend using 1 and 0.
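One advantage of the 1/0 convention is that counts and proportions become simple arithmetic. A quick sketch with invented audit data:

```python
# standard_met recorded as 1 (True) or 0 (False) for ten audit episodes
standard_met = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]

count_met = sum(standard_met)               # adding 1s and 0s counts the Trues
proportion = count_met / len(standard_met)  # fraction of episodes meeting the standard
print(count_met, proportion)
```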
Spend some time to develop your spreadsheet before you start collecting data. Here are some simple rules to follow to help get it right:
- Define your unit of analysis (the row). What is the most basic event you are reviewing? A common mistake is to use each row for a specific patient, extending into more columns as you acquire more data. Each row should represent an “episode” in your data, which is not necessarily a patient. Common examples of an episode include:
- A patient
 - A day
 - An observation
 - A ward
 - A hospital
 - A bed
 - A procedure/operation
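For example, if the episode is an observation rather than a patient, repeated measurements become new rows, not new columns. A sketch with hypothetical column names:

```python
# One row per observation: a second measurement on the same patient
# adds a row, not a new column
observations = [
    {"patient_number": 1, "obs_date": "2020-03-15", "bps": 120},
    {"patient_number": 1, "obs_date": "2020-03-16", "bps": 115},
    {"patient_number": 2, "obs_date": "2020-03-15", "bps": 135},
]

# Patient-level views are easily recovered from episode-level rows
patient_1_bps = [row["bps"] for row in observations if row["patient_number"] == 1]
print(patient_1_bps)
```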
 
 - Define the characteristics of that unit (the column). This is the specific data you are collecting on each episode. For example, if your basic episode is a procedure, your column headings might be:
- Patient first name
 - Patient last name
 - Patient DOB
 - Procedure Type
 - Pain Score
 
 - Name your columns whilst avoiding any white space. “Blood Pressure” may seem intuitive and easy to read, but it can come unstuck later in the analysis, as computers don't tend to handle spaces in names well. The accepted standards are:
- Camel case: BloodPressure, PatientNumber, HospitalNumber
 - Snake case: blood_pressure, patient_number, hospital_number
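Snake-case names load cleanly as identifiers. A quick check using Python's standard csv module (the data is invented for illustration):

```python
import csv
import io

# A two-column sheet saved as CSV, with snake_case headers
raw = "patient_number,blood_pressure\n1,120\n2,135\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Each value is addressed by an unambiguous, space-free column name
first_bp = int(rows[0]["blood_pressure"])
print(first_bp)
```

Tools like R behave similarly: a header such as “Blood Pressure” would be mangled on import, whereas blood_pressure survives untouched.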
 
 - Keep each of your column names unique. If you start recording more of the same metric (eg blood_pressure_1, blood_pressure_2 etc.) have you considered whether or not this should be an entirely new row?
 - Reserve your top row for column names. Use the top row, and only the top row, for column naming. Don't create a naming hierarchy using multiple rows.
 - Don’t use formatting to store data. Computers are colour blind, so any formatting you apply serves no function other than to highlight things to you. This is fine if it helps you interpret your data, but if you are storing information by turning certain cells red, you need to rethink your structure.
 - Develop a naming system (eg bps, bpd, map might be easier to search/organise than systolicbp, diastolicbp, meanbp)
 - Keep to a single system for describing booleans. In many audits, the data you collect has a binary outcome, something is true or not true, a standard it met or not. We would advocate using 1 and 0 to represent True and False respectively.
 - Use data validation. Data validation restricts the options available when entering data into a cell. If, when typing into a "procedure" column, you enter "C-Section", "Caesarian Section" and "c-section", these will all be treated as different procedures, even though we know they are all the same. Data validation keeps data entry consistent and avoids this issue. There are many simple tutorials on YouTube that demonstrate how to use data validation in Excel.
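Excel's built-in data validation is the right tool here, but the underlying idea can be sketched in Python: reject any entry that is not on an agreed list (the procedure names below are hypothetical):

```python
# The agreed vocabulary for the "procedure" column (hypothetical names)
allowed_procedures = {"caesarean_section", "forceps_delivery", "ventouse_delivery"}

def validate(entry):
    """Return the entry if it is on the agreed list, otherwise raise an error."""
    if entry not in allowed_procedures:
        raise ValueError(f"'{entry}' is not an allowed procedure name")
    return entry

print(validate("caesarean_section"))  # accepted
# validate("C-Section")               # would raise ValueError
```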
 - Keep units in your column titles (eg dose_mg, rather than writing "180 mg" in every cell). Any units typed into a cell will automatically convert a numerical value into a string, and also create a lot of unnecessary duplication in typing.
 - Don't start using Excel for your data analysis. Keep it purely for gathering your raw data. If you start adding rows for summary counts, averages or basic statistics, it will be harder to unpick the raw data from the statistics when importing into R. Your spreadsheet is for your raw data, and nothing else.
 - Watch your case. Remember that "Hello", "hello" and "HeLlo" are all different to a computer. Using data validation will help keep everything consistent and prevent a tiny miskey damaging your data analysis.
 - Don’t merge cells under any circumstances. Merged cells are programmatically a nightmare.
 - Be clear about missing values and zeroes. A zero means a measurement was taken and the value was nought; a blank (NULL) means no measurement was recorded. Don't use one to mean the other.
 - Don't enter multiple entries in one cell. If a cell needs to hold two values (eg two analgesics given), that is usually a sign you need another column or another row.
 
A good test of whether your data will be interpretable by R is whether you can save it to GitHub as a CSV file. If at this stage your data still makes sense, you have probably kept to good data discipline.
Comma-separated values, or .CSV, files are a very basic type of file that stores information in a universally accessible way. .xls files are proprietary to Microsoft and can only be opened correctly by particular versions of Excel. Being able to convert your file into a .CSV file is a vital part of working with your data in a different application.
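Because the format is so simple, reading and writing .CSV is trivial in any language. A minimal round trip using Python's standard csv module (the sheet contents are invented, and an in-memory buffer stands in for a real file):

```python
import csv
import io

# A tiny sheet: header row first, then one row per episode
rows = [["patient_number", "pain_score"], ["1", "3"], ["2", "0"]]

buffer = io.StringIO()  # stands in for a file on disk
csv.writer(buffer).writerows(rows)

buffer.seek(0)
recovered = list(csv.reader(buffer))
print(recovered == rows)  # the data survives the round trip intact
```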
We'll use an example of a poorly designed spreadsheet and unpick some of the more classic data entry mistakes, to transform into a well designed spreadsheet.
To do:
- Be more prescriptive in our use of 1 and 0 as bools
- Summary characteristics
- Take care of writing formulas when exporting to CSV
- Don't start your data analysis early - i.e. no totals etc.
- Construct course feedback excel sheet - handover to Ahmed
- Index / Key
- Many examples of "rows" as entities
- Relational data ?TidyR
Please contact Steve Harris if you have any questions.