---
layout: lesson
root: .
title: Databases using SQL
minutes: 60
---
Note: this should have been done by participants before the start of the workshop.
- Install Firefox
- Install the SQLite Manager add-on: Menu (the three horizontal lines near the top right corner of Firefox) -> Add-ons -> Search -> SQLite Manager -> Install -> Restart now
- Add SQLite Manager to the menu: Menu -> Customize, then drag the SQLite Manager icon to one of the empty menu squares on the right, Exit Customize
- Open SQLite Manager: Menu -> SQLite Manager
To start, let's orient ourselves in our project workflow. Previously, we used Excel and OpenRefine to go from messy, human-created data to cleaned, computer-readable data. Now we're going to move to the next piece of the data workflow, using the computer to read in our data, and then use it for analysis and visualization.
The data we will be using is a time-series for a small mammal community in southern Arizona. This is part of a project studying the effects of rodents and ants on the plant community that has been running for almost 40 years. The rodents are sampled on a series of 24 plots, with different experimental manipulations controlling which rodents are allowed to access which plots.
This is a real dataset that has been used in over 100 publications. We've simplified it just a little bit for the workshop, but you can download the full dataset and work with it using exactly the same tools we'll learn about today.
First, let's download and look at some of the cleaned spreadsheets from the Portal Project dataset.
We'll need the following three files:
- surveys.csv
- species.csv
- plots.csv
Open each of these CSV files and explore them. What information is contained in each file? Specifically, if I had the following research questions:
- How has the hindfoot length and weight of Dipodomys species changed over time?
- What is the average weight of each species, per year?
- What information can I learn about Dipodomys species in the 2000s, over time?
What would I need to answer these questions? Which files have the data I need? What operations would I need to perform if I were doing these analyses by hand?
In order to answer the questions described above, we'll need to do the following basic data operations:
- select subsets of the data (rows and columns)
- group subsets of data
- do math and other calculations
- combine data across spreadsheets
In addition, we don't want to do this manually! Instead of searching for the right pieces of data ourselves, or clicking between spreadsheets, or manually sorting columns, we want to make the computer do the work.
In particular, we want to use a tool where it's easy to repeat our analysis in case our data changes. We also want to do all this searching without actually modifying our source data.
Putting our data into a database and using SQL will help us achieve these goals.
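As a preview, here's a sketch of the kind of query we'll learn to write. It selects a subset of the data and computes a per-group average; the table and column names follow the CSV headers we just explored:

```sql
-- Preview: select a subset of rows and columns, then compute
-- the average weight of each species in each year since 2000.
SELECT species_id, year, AVG(weight)
FROM surveys
WHERE year >= 2000
GROUP BY species_id, year;
```

Don't worry about the syntax yet; we'll build up to queries like this step by step.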
Using a relational database serves several purposes.
- It keeps your data separate from your analysis.
  - This means there's no risk of accidentally changing data when you analyze it.
  - If we get new data we can just rerun the query.
- It's fast, even for large amounts of data.
- It improves quality control of data entry (type constraints and use of forms in Access, Filemaker, etc.)
- The concepts of relational database querying are core to understanding how to do similar things using programming languages such as R or Python.
There are a number of different database management systems for working with relational data. We're going to use SQLite today, but basically everything we teach you will apply to the other database systems as well (e.g., MySQL, PostgreSQL, MS Access, Filemaker Pro). The only things that will differ are the details of exactly how to import and export data and the details of data types.
Let's download and look at a pre-existing database, the portal_mammals.sqlite file. Clicking on the "open file" icon and then selecting that file will open the database.
You can see the tables in the database by looking at the left-hand side of the
screen under Tables, where each table corresponds to one of the CSV files
we were exploring earlier. To see the contents of any table, click on it, and
then go to the "Browse and Search" tab in the middle of the screen. This will
give us a view that we're used to - just a copy of the table. Hopefully this
helps to show that a
database is, in some sense, just a collection of tables, where there's some value
in the tables that allows them to be connected to each other (the "related" part
of "relational database").
The leftmost tab, "Structure", provides some metadata about each table. It
describes the columns, often called fields. (The rows of a database table
are called records.) If you scroll down in the Structure view, you'll
see a list of fields, their labels, and their data types. Each field contains
one variety or type of data, often numbers or text. You can see in the
surveys table that most fields contain numbers (integers) while the species
table is nearly all text.
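The metadata shown in the Structure tab corresponds to the SQL statement that defined the table. For the surveys table, that definition looks roughly like the sketch below (field names follow the CSV headers; the exact declaration stored in portal_mammals.sqlite may differ):

```sql
CREATE TABLE surveys (
    record_id       INTEGER,  -- unique identifier for each capture record
    month           INTEGER,
    day             INTEGER,
    year            INTEGER,
    plot_id         INTEGER,  -- which experimental plot the animal was caught on
    species_id      TEXT,     -- short species code, shared with the species table
    sex             TEXT,
    hindfoot_length INTEGER,
    weight          INTEGER
);
```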
The "Execute SQL" tab is blank now - this is where we'll be typing our queries to retrieve information from the database tables.
To summarize:
- Relational databases store data in tables with fields (columns) and records (rows)
- Data in tables has types, and all values in a field have the same type (see the list of data types below)
- Queries let us look up data or make calculations based on columns
- Every row-column combination contains a single atomic value, i.e., not containing parts we might want to work with separately.
- One field per type of information
- No redundant information
  - Split into separate tables with one table per class of information
  - Needs an identifier in common between tables (a shared column) to reconnect them (foreign key)
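For instance, species_id appears in both the surveys and species tables, and that shared column is what lets a query reconnect them. Here is a sketch of the join syntax we'll meet later in the lesson:

```sql
-- species_id is the shared column (foreign key) linking the two tables.
SELECT surveys.record_id, surveys.species_id, species.genus, species.species
FROM surveys
JOIN species ON surveys.species_id = species.species_id;
```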
Before we get started with writing our own queries, we'll create our own
database. We'll be creating this database from the three CSV files
we downloaded earlier. Close the currently open database and then
follow these instructions:
- Start a new database: Database -> New Database
- Start the import: Database -> Import
- Select the file to import
- Give the table a name that matches the file name (surveys, species, plots), or use the default
- If the first row has column headings, check the appropriate box
- Make sure the delimiter and quotation options are appropriate for the CSV files. Ensure 'Ignore trailing Separator/Delimiter' is left unchecked.
- Press OK
- When asked if you want to modify the table, click OK
- Set the data types for each field: choose TEXT for fields with text (species_id, genus, sex, etc.) and INT for fields with numbers (day, month, year, weight, etc.)
Now import the plots and species tables in the same way.
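Once all three tables are imported, a quick sanity check is to count the records in each from the Execute SQL tab (a sketch, assuming the default table names above; the counts should match the number of rows in each CSV file):

```sql
-- One count per table; compare against the CSV row counts.
SELECT COUNT(*) FROM surveys;
SELECT COUNT(*) FROM species;
SELECT COUNT(*) FROM plots;
```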
You can also use this same approach to append new data to an existing table (an SQL equivalent is sketched below):
- Browse & Search -> Add, to enter a single record by hand
- Enter the new data into a CSV file and import it into the existing table
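Equivalently, a single record can be appended with an SQL INSERT statement. This is a sketch with hypothetical values; the column list must match your table:

```sql
-- Append one (made-up) record to the surveys table.
INSERT INTO surveys (record_id, month, day, year, plot_id,
                     species_id, sex, hindfoot_length, weight)
VALUES (35550, 12, 31, 2002, 15, 'DM', 'M', 36, 43);
```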
Data type | Description |
---|---|
CHARACTER(n) | Character string. Fixed-length n |
VARCHAR(n) or CHARACTER VARYING(n) | Character string. Variable length. Maximum length n |
BINARY(n) | Binary string. Fixed-length n |
BOOLEAN | Stores TRUE or FALSE values |
VARBINARY(n) or BINARY VARYING(n) | Binary string. Variable length. Maximum length n |
INTEGER(p) | Integer numerical (no decimal). Precision p |
SMALLINT | Integer numerical (no decimal). Precision 5 |
INTEGER | Integer numerical (no decimal). Precision 10 |
BIGINT | Integer numerical (no decimal). Precision 19 |
DECIMAL(p,s) | Exact numerical, precision p, scale s. |
NUMERIC(p,s) | Exact numerical, precision p, scale s. (Same as DECIMAL) |
FLOAT(p) | Approximate numerical, mantissa precision p. A floating number in base 10 exponential notation. |
REAL | Approximate numerical |
FLOAT | Approximate numerical |
DOUBLE PRECISION | Approximate numerical |
DATE | Stores year, month, and day values |
TIME | Stores hour, minute, and second values |
TIMESTAMP | Stores year, month, day, hour, minute, and second values |
INTERVAL | Composed of a number of integer fields, representing a period of time, depending on the type of interval |
ARRAY | A set-length and ordered collection of elements |
MULTISET | A variable-length and unordered collection of elements |
XML | Stores XML data |
Different databases offer different choices for the data type definition.
The following table shows some of the common names of data types between the various database platforms:
Data type | Access | SQLServer | Oracle | MySQL | PostgreSQL |
---|---|---|---|---|---|
boolean | Yes/No | Bit | Byte | N/A | Boolean |
integer | Number (integer) | Int | Number | Int / Integer | Int / Integer |
float | Number (single) | Float / Real | Number | Float | Numeric |
currency | Currency | Money | N/A | N/A | Money |
string (fixed) | N/A | Char | Char | Char | Char |
string (variable) | Text (<256) / Memo (65k+) | Varchar | Varchar / Varchar2 | Varchar | Varchar |
binary object | OLE Object / Memo | Binary (fixed up to 8K) / Varbinary (<8K) / Image (<2GB) | Long Raw | Blob / Text | Binary / Varbinary |
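Note that SQLite, the system we're using today, is more flexible than most: whatever type name you declare is mapped onto one of a handful of storage classes ("type affinities"). A sketch:

```sql
-- SQLite maps declared types onto affinities: names containing
-- CHAR -> TEXT, INT -> INTEGER, DOUB/FLOA/REAL -> REAL.
CREATE TABLE affinity_demo (
    name  VARCHAR(30),      -- TEXT affinity
    n     SMALLINT,         -- INTEGER affinity
    ratio DOUBLE PRECISION  -- REAL affinity
);
```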