- Module Introduction
- Working with Text Strings
- Working with Date and Time Strings
- Date and Time Strings Examples
- Case Statements
- Views
- Data Governance and Profiling
- Using SQL for Data Science, Part 1
- Using SQL for Data Science, Part 2
- concatenating
- trimming
- changing case
- substring
- date and time strings
SQL's version of IF
, THEN
, and ELSE
statements
Manipulating strings on the server reduces work that needs to be done by the client.
String functions:
- concatenate
- substring
- trim
- upper
- lower
In SQLite one uses ||
to concatenate strings. SQL Server allows for +
to
concatenate strings.
SELECT
company_name
,contact_name
,compaany_name || ' (' || contact_name || ')'
FROM Customers
TRIM
RTRIM
LTRIM
SELECT TRIM(" You the best ") AS trimmed_string;
The SUBSTR
command is 1-indexed. It accepts the string on which to operate,
the starting index, and the number of characters to return from that index.
-- SUBSTR([string name], [start pos], [num return chars])
SELECT first_name, SUBSTR(first_name, 2, 3)
FROM Employees
WHERE department_id = 69;
-- make string uppercase
SELECT UPPER(first_name) FROM Customers
-- or
SELECT UCASE(first_name) FROM Customers
-- make string lowercase
SELECT LOWER(first_name) FROM Customers
Dates are stored as date
or datetime
datatypes. The format of the date is
dependent on the RDMS you are working with.
If only the date portion of an entry is being queried, queries are generally quite easy. As soon as the time portion is included in the query complexity increases.
When inserting a date into a table, care must be taken to ensure that it is done so in the correct format.
DATE
: YYYY-MM-DD
DATETIME
: YYYY-MM-DD HH:MM:SS
TIMESTAMP
: YYYY-MM-DD HH:MM:SS
-- querying a DATETIME field with:
WHERE purchase_date = '2016-12-12'
-- will return no results because it's missing the _time_ portion of the data
SQLite supports 5 date and time functions:
DATE(timestring, [modifier, ...])
TIME(timestring, [modifier, ...])
DATETIME(timestring, [modifier, ...])
JULIANDAY(timestring, [modifier, ...])
STRFTIME(format, timestring, [modifier, ...])
A time string can be in one of the following formats:
YYYY-MM-DD
YYYY-MM-DD HH:MM
YYYY-MM-DD HH:MM:SS
YYYY-MM-DD HH:MM:SS.SSS
# T here is a literal T delimiting the date from the time
YYYY-MM-DDTHH:MM
YYYY-MM-DDTHH:MM:SS
YYYY-MM-DDTHH:MM:SS.SSS
HH:MM
HH:MM:SS
HH:MM:SS.SSS
Modifiers alter the date, and multiple modifiers can be applied to a date. Modifiers are applied in the order they are provided, from left to right.
The following modifiers add a specified amount to the date:
NNN days
NNN hours
NNN minutes
NNN.NNN seconds
NNN months
NNN years
The following modifiers return a date "shifted back" from the provided date:
start of month
start of year
start of day
weekday N
shifts the date forward to the 0-indexed day in the future, unless
the provided date is that index. Sunday is 0.
localtime
modifies UTC dates by shifting the time to the localtime of where
the process is running. utc
is the inverse of localtime
.
now
will retrieve the current date and time.
STRFTIME
extracts specific values from dates.
With a value such as a birth date, we often don't want any time information, and
it is probable that it'll be stored as 00:00:00
anyhow. We don't want to
necessarily retrieve time information, so we can use STRFTIME
to retrieve only
portions of the date that are important to us.
-- get the year, month, and day of each employee
SELECT
birthdate
, STRFTIME('%Y', birthdate) AS year
, STRFTIME('%m', birthdate) AS month
, STRFTIME('%d', birthdate) AS day
FROM employees
SELECT DATE('now')
This can be valuable to determine how long ago some event occurred, or to retrieve records that were created or updated within a specific period of time.
-- date
SELECT STRFTIME('%Y %m %d', 'now')
-- time
SELECT STRFTIME('%H %M %S %s', 'now')
-- compute employees' ages
SELECT
birthdate
,STRFTIME('%Y', birthdate) AS year
,STRFTIME('%m', birthdate) AS month
,STRFTIME('%d', birthdate) AS day
,DATE(('now') - birthdate) AS age
FROM employees
Case statements in SQL mimic if-else statements in general programming.
Case statements can be used in any clause that accepts a valid expression, e.g.
WHERE
ORDER BY
HAVING
and in statements:
SELECT
INSERT
UPDATE
DELETE
CASE [input_expression]
WHEN c1 THEN e1
WHEN c2 THEN e2
ELSE e3
END column_name
e.g.
SELECT
employee_id
,firstname
,lastname
,CASE city
WHEN 'Calgary' THEN 'Calgary'
ELSE 'Other'
END city
FROM Employees
ORDER BY lastname, firstname;
As with programming languages, CASE
in SQL will short-circuit on a true
statement, and no further statements will be evaluated. If ELSE
is omitted and
no WHEN
statements evaluate to true, NULL
will be returned.
By omitting the optional input expression that directly follows the CASE
keyword one can use the WHEN
statements to evaluate and compare values:
CASE
WHEN result_expression [...n]
[ELSE else_result]
END
e.g.:
SELECT
track_id
,name
,bytes
,CASE
WHEN bytes < 30000 THEN 'small'
WHEN bytes >= 30001 AND bytes <= 50000 THEN 'medium'
WHEN bytes > 50000 THEN 'large'
ELSE 'Other'
END bytes_category
FROM Tracks;
THEN
can return another column, not only text or hard-coded values.
Views are stored queries. With a view, one can add or remove columns without changing the schema. They can also be used to encapsulate queries.
Views persist until the database connection is closed.
CREATE [TEMP] VIEW [IF NOT EXISTS] view_name(column-name-list) AS
SELECT ...
IF NOT EXISTS
can be used to determine whether the view should be created or
not based on the name provided when creating the view.
-- create a view called my_view where...
CREATE VIEW my_view AS
-- there are 6 columns
SELECT
r.region_description
,t.territory_description
,e.lastname
,e.firstname
,e.hire_date
,e.reports_to
-- obtained by selecting from 4 tables, joined using an INNER JOIN
FROM Regions r
INNER JOIN Territories t ON r.region_id = t.region_id
INNER JOIN EmployeeTerritories et ON t.territory_id = et.territory_id
INNER_JOIN Employees e on et.employee_id = e.employee_id
To actually use the view, one treats it as if it were an existing table:
SELECT *
FROM my_view
DROP VIEW my_view;
e.g.:
SELECT
COUNT(territory_description)
,firstname
,lastname
FROM my_view
GROUP BY lastname, firstname;
If a view is no longer required, one should clean up by DROP
ping the view.
Views can be useful in environments where you're unable to write data to tables.
Data profiling is the process of examining data for completeness and accuracy in order to understand the data before querying it.
- understanding how many rows of data are in the table
- when data was last updated
- evaluating the data types of each column
- evaluating the number of distinct values in a column
- evaluating the number of NULL vs non-NULL values
- descriptive statistics:
- maximums and minimums
- averages
- standard deviation for column values
The reason one profiles their data is to understand what their data is so that they can test it. Without profiling data first, one would find it more difficult to determine when queries are not returning expected results.
These are the rules around data in an organisation.
- understand your read and write priveleges
- cleaning up data while experimenting and evaluating queries
- what is the process followed when changes are made, e.g. a model requires updating, and its associated tables
- data understanding
- business understanding
- profiling
- start with
SELECT
- test
- format and comment
- review
- start with
Understanding the data you'll be working with is the most important step. This includes:
- understanding relationships, i.e. entity diagrams
- evaluating columns that have NULL values
- how are string values inserted, e.g. generated, or user input
Without understanding the subject of focus, writing queries may take more time.
- understanding where data joins are
- differentiating integers from strings
- investing time to understand the subject matter will help later during analysis
It's important to be able to distinguish between business requirements and the data required to solve that requirement.
Understanding the problem at a core level is important:
- which customers do we want to study?
- should we omit certain customers based on some criteria?
- what data is important when drilling down on a specific entity?
Profiling data is about drilling down into the details of your data.
This includes:
- creating a data model and mapping the fields and tables you need
- considering joins and calculations
- understanding where there may be any data quality or format issues
Especially if you're unfamiliar with the data, start small. Every statement
starts with a SELECT
.
If using subqueries, start with the innermost query, and work out from there.
- don't want until the end to test queries
- test queries after each join or filter
- evaluate the results
- start small and evaluate step-by-step when troubleshooting queries
- use the correct formatting and indentation
- comment strategically
- clean code and comments help when handing off or revisiting code
- always review old queries
- evaluate that queries are in alignment with business rules