86 changes: 86 additions & 0 deletions 02_activities/assignments/assignment-3/assignment_3.md
@@ -0,0 +1,86 @@
# Data Visualization

## Assignment 3: Final Project

### Requirements:
- We will finish this class by giving you the chance to use what you have learned in a practical context, by creating data visualizations from raw data.
- Choose a dataset of interest from the [City of Toronto’s Open Data Portal](https://www.toronto.ca/city-government/data-research-maps/open-data/) or [Ontario’s Open Data Catalogue](https://data.ontario.ca/).
- Using Python and one other data visualization software (Excel or free alternative, Tableau Public, any other tool you prefer), create two distinct visualizations from your dataset of choice.
- For each visualization, describe and justify:
> What software did you use to create your data visualization?

> Who is your intended audience?

> What information or message are you trying to convey with your visualization?

> What aspects of design did you consider when making your visualization? How did you apply them? With what elements of your plots?

> How did you ensure that your data visualizations are reproducible? If the tool you used to make your data visualization is not reproducible, how will this impact your data visualization?

> How did you ensure that your data visualization is accessible?

> Who are the individuals and communities who might be impacted by your visualization?

> How did you choose which features of your chosen dataset to include or exclude from your visualization?

> What ‘underwater labour’ contributed to your final data visualization product?

- This assignment is intentionally open-ended - you are free to create static or dynamic data visualizations, maps, or whatever form of data visualization you think best communicates your information to your audience of choice!
- Total word count should not exceed **1000 words**.

## Appendix: Code

All files required for Assignment 3 are organized within the `assignment-3/` directory.

The complete and commented Python code used to generate Visualization 1 (line chart) is provided in:

- `assignment-3/visualization_1_python/visualization_1_code.py`

This script reads the raw TTC LRT delay dataset, performs data cleaning and aggregation, and generates the final visualization saved as a high-resolution PNG.

The data used to create Visualization 2 (bar chart) was generated programmatically using Python and exported as a CSV file:

- `assignment-3/bar_top10_stations.csv`

The bar chart itself was created in Microsoft Excel using this summary dataset.


### Why am I doing this assignment?:
- This ongoing assignment ensures active participation in the course, and assesses the learning outcomes:
    * Create and customize data visualizations from start to finish in Python
    * Apply general design principles to create accessible and equitable data visualizations
    * Use data visualization to tell a story
- This would be a great project to include in your GitHub Portfolio – put in the effort to make it something worthy of showing prospective employers!

### Rubric:

| Component | Scoring | Requirement |
|-------------------|----------|-----------------------------------------------------------------------------|
| Data Visualizations | Complete/Incomplete | - Data visualizations are distinct from each other<br>- Data visualizations are clearly identified<br>- Different sources/rationales (text with two images of data, if visualizations are labeled)<br>- High-quality visuals (high resolution and clear data)<br>- Data visualizations follow best practices of accessibility |
| Written Explanations | Complete/Incomplete | - All questions from assignment description are answered for each visualization<br>- Explanations are supported by course content or scholarly sources, where needed |
| Code | Complete/Incomplete | - All code is included as an appendix with your final submissions<br>- Code is clearly commented and reproducible |

## Submission Information

🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

### Submission Parameters:
* Submission Due Date: `23:59 - 02/02/2026`
* The branch name for your repo should be: `assignment-3`
* What to submit for this assignment:
    * A folder/directory containing:
        * This file (assignment_3.md)
        * Two data visualizations
        * Two markdown files (one per visualization) with their written descriptions.
        * Link to your dataset of choice.
        * Complete and commented code as an appendix (for your visualization made with Python, and for the other, if relevant)
* What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/visualization/pull/<pr_id>`
    * Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

Checklist:
- [ ] Create a branch called `assignment-3`.
- [ ] Ensure that the repository is public.
- [ ] Review [the PR description guidelines](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md#guidelines-for-pull-request-descriptions) and adhere to them.
- [ ] Verify that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.
5 changes: 5 additions & 0 deletions 02_activities/assignments/assignment-3/dataset_link.md
@@ -0,0 +1,5 @@
## Dataset

**Title:** TTC LRT Delay Data
**Source:** City of Toronto Open Data Portal
**Link:** https://open.toronto.ca/dataset/ttc-lrt-delay-data/
@@ -0,0 +1,31 @@
## Visualization 1: TTC LRT Daily Total Delay Minutes Over Time (Line Chart)
![TTC LRT Daily Total Delay Minutes Over Time](visualization_1_line.png)



**What software did you use to create your data visualization?**
This visualization was created using Python, specifically the pandas library for data manipulation and matplotlib for plotting.

**Who is your intended audience?**
The intended audience includes transit planners, operations analysts and policy stakeholders interested in understanding temporal patterns in TTC LRT service delays.

**What information or message are you trying to convey with your visualization?**
The visualization conveys how the total number of delay minutes experienced by TTC LRT services varies over time, highlighting periods of increased disruption and enabling identification of temporal trends in service reliability.

**What aspects of design did you consider when making your visualization? How did you apply them? With what elements of your plots?**
A line chart was selected to emphasize change over time. Dates were placed on the horizontal axis and total delay minutes on the vertical axis to align with standard temporal visualization conventions. Minimal styling, a single line, and clear axis labels were used to maintain clarity and reduce visual clutter.

**How did you ensure that your data visualizations are reproducible? If the tool you used to make your data visualization is not reproducible, how will this impact your data visualization?**
The visualization is fully reproducible. The Python script reads the raw CSV dataset, applies deterministic transformations and saves the output as a static image. Any user with access to the dataset and script can regenerate the visualization.

**How did you ensure that your data visualization is accessible?**
Accessibility was considered by using clear axis labels, a descriptive title, and sufficient contrast, and by avoiding reliance on color to encode meaning. The chart avoids unnecessary visual complexity and is readable in grayscale.
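
The "sufficient contrast" claim can be spot-checked numerically. The following is a minimal sketch, assuming the line is drawn in matplotlib's default first color (`#1f77b4`, since the script sets no color) on a white background, and assuming the WCAG 2.1 non-text contrast threshold of 3:1 is the relevant target. The helper names `relative_luminance` and `contrast_ratio` are illustrative and are not part of the submitted script.

```python
def relative_luminance(hex_color):
    """WCAG 2.x relative luminance of an sRGB color given as '#rrggbb'."""
    channels = [int(hex_color[i:i + 2], 16) / 255 for i in (1, 3, 5)]
    # Undo the sRGB gamma encoding for each channel.
    linear = [
        c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4
        for c in channels
    ]
    r, g, b = linear
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(foreground, background):
    """Contrast ratio (lighter + 0.05) / (darker + 0.05), from 1 to 21."""
    lum_a = relative_luminance(foreground)
    lum_b = relative_luminance(background)
    lighter, darker = max(lum_a, lum_b), min(lum_a, lum_b)
    return (lighter + 0.05) / (darker + 0.05)


LINE_COLOR = "#1f77b4"   # assumption: matplotlib default line color
BACKGROUND = "#ffffff"   # assumption: white figure background

ratio = contrast_ratio(LINE_COLOR, BACKGROUND)
print(f"Line vs. background contrast: {ratio:.2f}:1")
print("Meets 3:1 target:", ratio >= 3.0)
```

A check like this only covers the line against the background; text size, tick density, and alternative text would still need to be reviewed separately.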

**Who are the individuals and communities who might be impacted by your visualization?**
The visualization may affect TTC riders, particularly those who rely on LRT services for daily transportation, as well as planners responsible for maintaining equitable and reliable transit service.

**How did you choose which features of your chosen dataset to include or exclude from your visualization?**
The Date and Min Delay variables were selected to capture the temporal burden of delays. Other operational variables, such as vehicle identifiers, were excluded to maintain focus on system-level delay trends.

**What ‘underwater labour’ contributed to your final data visualization product?**
Inspecting the dataset structure, converting date fields, validating aggregation choices, and iteratively testing visualization outputs contributed to the final product.
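
One way the aggregation could be validated is sketched below. It is an illustration under stated assumptions, not the check that was actually run: the sample rows are hypothetical, the function name `aggregate_daily` is invented for this example, and only the `Date` and `Min Delay` column names and the parse, drop, group, and sum steps mirror the plotting script.

```python
import pandas as pd

# Hypothetical sample rows; the real script reads ttc_lrt_delays.csv instead.
sample = pd.DataFrame(
    {
        "Date": ["2025-12-08", "2025-12-08", "2025-12-09", "not a date"],
        "Min Delay": [5, 12, 7, 3],
    }
)


def aggregate_daily(df):
    """Return (daily totals, number of rows dropped for unparseable dates)."""
    parsed = df.assign(Date=pd.to_datetime(df["Date"], errors="coerce"))
    dropped = int(parsed["Date"].isna().sum())
    valid = parsed.dropna(subset=["Date"])
    daily = (
        valid.groupby("Date", as_index=False)["Min Delay"]
        .sum()
        .rename(columns={"Min Delay": "Total Delay Minutes"})
        .sort_values("Date")
    )
    # Sanity check: grouping must not create or lose delay minutes.
    assert daily["Total Delay Minutes"].sum() == valid["Min Delay"].sum()
    return daily, dropped


daily, dropped = aggregate_daily(sample)
print(daily)
print("Rows dropped because the date could not be parsed:", dropped)
```

Reporting the number of dropped rows makes it visible when `errors="coerce"` silently discards records, which would otherwise lower the plotted totals without any warning.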
@@ -0,0 +1,11 @@
Station,Total Delay Minutes
HUMBER COLLEGE STOP,1081
FINCH WEST FWLRT STATI,428
JANE FWLRT STOP,231
DRIFTWOOD STOP,223
FINCH WEST TO SENTINEL,202
ALBION STOP,145
HUMBER COLLEGE STOP (A,122
ROWNTREE MILLS STOP (A,119
FINCH AVENUE AND HIGHW,114
MILVAN RUMIKE STOP,101
@@ -0,0 +1,18 @@
import pandas as pd
from pathlib import Path

base_dir = Path(__file__).resolve().parent.parent  # directory one level above this script's folder
csv_path = base_dir / "ttc_lrt_delays.csv"  # raw TTC LRT delay dataset

df = pd.read_csv(csv_path)

station_summary = (
    df.groupby("Station", as_index=False)["Min Delay"]
    .sum()  # total delay minutes per station
    .rename(columns={"Min Delay": "Total Delay Minutes"})
    .sort_values("Total Delay Minutes", ascending=False)
    .head(10)  # keep the ten stations with the largest totals
)

out_csv = Path(__file__).resolve().parent / "bar_top10_stations.csv"  # written next to this script
station_summary.to_csv(out_csv, index=False)
@@ -0,0 +1,30 @@
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path

base_dir = Path(__file__).resolve().parent.parent  # directory one level above this script's folder
csv_path = base_dir / "ttc_lrt_delays.csv"  # raw TTC LRT delay dataset

df = pd.read_csv(csv_path)

df["Date"] = pd.to_datetime(df["Date"], errors="coerce")  # unparseable dates become NaT
df = df.dropna(subset=["Date"])  # drop rows whose date could not be parsed

daily_delay = (
    df.groupby("Date", as_index=False)["Min Delay"]
    .sum()  # total delay minutes per calendar day
    .rename(columns={"Min Delay": "Total Delay Minutes"})
    .sort_values("Date")  # chronological order for the line chart
)

plt.figure()
plt.plot(daily_delay["Date"], daily_delay["Total Delay Minutes"])
plt.title("TTC LRT Daily Total Delay Minutes")
plt.xlabel("Date")
plt.ylabel("Total delay minutes")
plt.xticks(rotation=45)  # rotate date labels so they do not overlap
plt.tight_layout()

out_png = Path(__file__).resolve().parent / "visualization_1_line.png"  # written next to this script
plt.savefig(out_png, dpi=300)  # 300 dpi for a high-resolution PNG
plt.close()
@@ -0,0 +1,31 @@
## Visualization 2: Top 10 TTC LRT Stations by Total Delay Minutes (Bar Chart)

![Top 10 TTC LRT Stations by Total Delay Minutes](visualization_2_bar_top10_stations.png)


**What software did you use to create your data visualization?**
This visualization was created using Microsoft Excel. The underlying summary data was generated using Python and exported as a CSV file.

**Who is your intended audience?**
The intended audience includes transit operations staff and infrastructure planners seeking to identify locations associated with higher cumulative delay burden.

**What information or message are you trying to convey with your visualization?**
The visualization highlights the ten TTC LRT stations with the highest total delay minutes, drawing attention to locations that may warrant targeted operational or infrastructure interventions.

**What aspects of design did you consider when making your visualization? How did you apply them? With what elements of your plots?**
A bar chart was chosen to support direct comparison across categorical values. Stations were ordered by total delay minutes to facilitate rapid interpretation. Axis labels include units, and a single color was used so that color differences would not be misread as encoding an additional variable.

**How did you ensure that your data visualizations are reproducible? If the tool you used to make your data visualization is not reproducible, how will this impact your data visualization?**
While the Excel visualization itself is not programmatically reproducible, reproducibility was supported by generating the aggregated dataset using Python and saving it as a CSV file. This ensures transparency in data processing and allows the chart to be recreated if needed.
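
If a fully scripted version were wanted, the chart could in principle be regenerated in Python from the same summary data. The following is a minimal sketch only and is not how the submitted chart was made (that chart was built in Excel). The function name `plot_top_stations`, the output file name, the figure size, and the grey fill are illustrative assumptions; the data rows are copied from `bar_top10_stations.csv`, including its truncated station names.

```python
import io
import tempfile
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Rows copied from bar_top10_stations.csv (station names are truncated in the source).
SUMMARY_CSV = """Station,Total Delay Minutes
HUMBER COLLEGE STOP,1081
FINCH WEST FWLRT STATI,428
JANE FWLRT STOP,231
DRIFTWOOD STOP,223
FINCH WEST TO SENTINEL,202
ALBION STOP,145
HUMBER COLLEGE STOP (A,122
ROWNTREE MILLS STOP (A,119
FINCH AVENUE AND HIGHW,114
MILVAN RUMIKE STOP,101
"""


def plot_top_stations(summary, out_png):
    """Save a horizontal bar chart with the largest total at the top."""
    # barh draws the first row at the bottom, so sort ascending.
    ordered = summary.sort_values("Total Delay Minutes", ascending=True)
    fig, ax = plt.subplots(figsize=(8, 5))
    ax.barh(ordered["Station"], ordered["Total Delay Minutes"], color="0.3")
    ax.set_title("Top 10 TTC LRT Stations by Total Delay Minutes")
    ax.set_xlabel("Total delay minutes")
    ax.set_ylabel("Station")
    fig.tight_layout()
    fig.savefig(out_png, dpi=300)
    plt.close(fig)
    return ordered


summary = pd.read_csv(io.StringIO(SUMMARY_CSV))
out_png = Path(tempfile.gettempdir()) / "bar_top10_stations_sketch.png"
ordered = plot_top_stations(summary, out_png)
print("Saved chart to", out_png)
```

A scripted version like this would make both visualizations regenerable from the raw data, at the cost of no longer demonstrating a second tool as the assignment requires.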

**How did you ensure that your data visualization is accessible?**
Station names are clearly labeled, numeric values are scaled appropriately and units are explicitly stated. The visualization avoids excessive decoration and does not rely on color alone to convey information.

**Who are the individuals and communities who might be impacted by your visualization?**
The visualization may affect communities served by the identified stations, particularly riders who experience repeated delays, as well as decision-makers responsible for service planning and equity.

**How did you choose which features of your chosen dataset to include or exclude from your visualization?**
The Station and Min Delay variables were selected to quantify delay burden by location. Other variables, such as delay codes or vehicle numbers, were excluded to keep the focus on spatial patterns.

**What ‘underwater labour’ contributed to your final data visualization product?**
Aggregating delay minutes by station, ranking and filtering results, exporting summary data, and iterating on chart orientation and labeling contributed to the final visualization.
27 changes: 27 additions & 0 deletions 02_activities/assignments/assignment_2.md
@@ -12,7 +12,27 @@
```
Your answer...

## Good Data Visualization
Life Expectancy – Our World in Data
https://ourworldindata.org/life-expectancy

### Why this visualization is good
This visualization is good because it is clear, accurate, and easy to interpret. First, it uses appropriate chart types for the task (showing trends over time and comparisons across countries). Matching chart type to the question improves comprehension and reduces confusion.

Second, the design supports readability: axes are labeled, scales are consistent and the layout avoids unnecessary decoration. This helps viewers interpret the data without being distracted by non-data ink.

Third, it supports exploration and storytelling. Interactive controls let the viewer compare countries and time periods without overloading the screen. This “overview first, details on demand” approach helps users learn from the visualization without clutter.

## Bad Data Visualization
Nightmarish Pie Charts – Chandoo.org
https://chandoo.org/wp/nightmarish-pie-charts/

### Why this visualization is bad
This visualization is bad because it makes the data difficult to interpret and compare. First, the use of a pie chart with a very large number of categories creates extreme visual clutter. Many slices are too small to distinguish, making it impossible to accurately compare values.

Second, the visualization relies heavily on color without clear structure. Similar colors are repeated across many slices, and there is no meaningful visual hierarchy. This increases cognitive load and forces the viewer to constantly shift attention between the chart and the legend.

Third, pie charts are generally ineffective for precise comparison because humans struggle to judge angles accurately. When combined with a high number of categories, this limitation becomes more severe, increasing the risk of misinterpretation.



@@ -22,7 +42,14 @@
- How could this data visualization have been improved?
```
Your answer...
### How it could be improved (good one)
Add clearer cues for first-time users about how to interact with the visualization, such as short instructions or highlighted controls to improve usability for less experienced users.
Another improvement: ensure all color choices are fully accessible for users with color vision deficiencies by offering alternative color palettes.


### How it could be improved (bad one)
Replace the 3D pie chart with a bar chart, which supports accurate comparison using lengths on a common scale.
Another improvement: Remove 3D effects and simplify the design. A flat chart with direct labels would improve clarity and reduce cognitive load.


