Commit a7fa70d: Initial commit

Gchism94 authored Feb 27, 2024 (0 parents)

Showing 11 changed files with 351 additions and 0 deletions.
58 changes: 58 additions & 0 deletions .github/ISSUE_TEMPLATE/hw-02-grading.md
---
name: HW 2 Grading
about: Feedback
title: Homework 2 Feedback
labels: ''
assignees: ''
---

Dear [STUDENT GITHUB NAME] -- Below is the feedback for your assignment. Please review carefully, and stop by office hours if you have any questions about the feedback.

---

## Feedback

### Part 1: Exploratory Data Analysis (EDA): `[_ / 40 points]`

- Data Overview: `[_ / 5 points]`
- Univariate Analysis: `[_ / 10 points]`
- Bivariate Analysis: `[_ / 10 points]`
- Missing Data and Outliers: `[_ / 5 points]`
- Additional Visualizations: `[_ / 5 points]`
- Interpretation of EDA: `[_ / 5 points]`
- Feedback: [Insert feedback here.]

### Part 2: Data Preprocessing: `[_ / 40 points]`

- Handling Missing Values: `[_ / 10 points]`
- Dealing with Outliers: `[_ / 10 points]`
- Feature Engineering: `[_ / 10 points]`
- Data Transformation: `[_ / 5 points]`
- Data Reduction Techniques: `[_ / 5 points]`
- Feedback: [Insert feedback here.]


### Overall: `[_ / 20 points]`

- `[_ / 5]` - Doesn't require reproduction: the .md file is in the repo, figures show up in the .md file on GitHub, and there are no egregious errors that would require reproducing the entire analysis to follow it.
- `[_ / 3]` - Code style: Follows the rule of six, spaces around operators, spaces after commas, lines not too long, etc.
- `[_ / 3]` - Code smell: Messy / unnecessarily complex code, difficult to follow, unaddressed warnings, etc.
- `[_ / 3]` - Quality and quantity of commits: No uninformative/nonsense text in commit messages, entire assignment not committed in a single commit.
- `[_ / 3]` - Correct repository organization: data is in a `data/` folder if applicable, no stray files, files are named correctly.
- `[_ / 3]` - Font size, organization: No crazy large text for narrative, questions answered in order and identified, easy to follow.

## Late penalty

- [ ] On time: No penalty
- [ ] Late, but same day (before midnight): -10 points
- [ ] Late, but next day: -20 points
49 changes: 49 additions & 0 deletions .github/workflows/check-assignment.yml
on: push
name: Check Assignment

jobs:
  check-allowed-files:
    runs-on: ubuntu-latest
    container:
      image: python:3.9-slim
    steps:
      - name: Checkout
        uses: actions/checkout@v2
      - name: Check Files
        run: python check_allowed_files.py "hw-02.qmd" "hw-02.ipynb" "README.md" "data/*" "images/*" "check_allowed_files.py" "requirements.txt"

  check-renders:
    env:
      GITHUB_PAT: ${{ secrets.GH_PAT }}
    runs-on: ubuntu-latest
    container:
      image: python:3.9-slim
    steps:
      - name: Checkout
        uses: actions/checkout@v2
      - name: Set up Python and Quarto environment
        run: |
          apt-get update && apt-get install -y wget
          wget -q -O quarto.deb https://github.com/quarto-dev/quarto-cli/releases/download/v1.1.189/quarto-1.1.189-linux-amd64.deb
          dpkg -i quarto.deb
          python -m pip install --upgrade pip
      - name: Install Python dependencies
        run: pip install -r requirements.txt
      - name: Check if .qmd file exists
        id: check_qmd
        run: |
          echo "QMD_EXISTS=false" >> $GITHUB_ENV
          if [ -f hw-02.qmd ]; then
            echo "QMD_EXISTS=true" >> $GITHUB_ENV
          fi
      - name: Render .qmd to HTML
        if: env.QMD_EXISTS == 'true'
        run: quarto render hw-02.qmd
      - name: Render .ipynb to HTML
        if: env.QMD_EXISTS == 'false'
        run: quarto render hw-02.ipynb
      - name: Create artifacts
        uses: actions/upload-artifact@v2
        with:
          name: hw-02-html
          path: hw-02.html
4 changes: 4 additions & 0 deletions .gitignore
.Rproj.user
.Rhistory
.RData
.Ruserdata
161 changes: 161 additions & 0 deletions README.md
# hw-02

For any exercise where you’re writing code, insert a code chunk and make
sure to label the chunk. Use a short and informative label. For any
exercise where you’re creating a plot, make sure to label all axes,
legends, etc. and give it an informative title. For any exercise where
you’re including a description and/or interpretation, use full
sentences. Make a commit at least after finishing each exercise, or
better yet, more frequently. Push your work regularly to GitHub, and make sure
all checks pass.

---

# Exploratory Data Analysis and Data Preprocessing Exercise

Welcome to your Exploratory Data Analysis and Data Preprocessing Exercise. This assignment is crucial for anyone diving into the field of data mining and data analysis. Your main objective is to conduct a thorough exploratory data analysis (EDA) followed by careful data preprocessing on a provided dataset. You will use Python and its powerful libraries, NumPy and Pandas, to uncover insights and prepare the data for any subsequent modeling tasks.

You will be expected to read in a dataset from the [TidyTuesday 2023 datasets](https://github.com/rfordatascience/tidytuesday/tree/master/data/2023) for the purpose of exploratory data analysis and preprocessing.

### Objective
Investigate the relationship between regional socioeconomic categories and the allocation of daily hours across countries. Analyze how the uncertainty in these allocations correlates with regional demographics and population sizes.

### Dataset

- [**Global Human Day**](https://github.com/rfordatascience/tidytuesday/tree/master/data/2023/2023-09-12)
  - Use at least the `all_countries.csv` and `country_regions.csv` datasets.

The completion of this exercise is divided into two main parts:

- **Part 1: Exploratory Data Analysis**
- **Part 2: Data Preprocessing**

Before you start, ensure you have the latest versions of `Python`, `NumPy`, and `Pandas` installed on your system. Good documentation and commenting are mandatory so that your code is easy to understand.

---

## Part 1: Exploratory Data Analysis

In this section, you will perform an exploratory data analysis on the provided dataset. You will identify patterns, detect outliers, and generate insights based on your findings.

### Task 1: Data Overview
- Load the dataset into a Pandas DataFrame and display the first few rows.
- Provide a basic description of the dataset, including its shape, columns, and data types.

<details>
<summary><h3><b>Hint</b></h3></summary>

- Use functions like `.head()`, `.shape`, `.columns`, and `.dtypes` to get an overview of your DataFrame.
- Remember that `.info()` can be used to get a concise summary of the DataFrame including the non-null count and type of each column.

</details>
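
As a minimal sketch, this could look like the following (the `data/all_countries.csv` path is an assumption; point `read_csv()` at wherever you saved the file):

```python
import pandas as pd

# Load the dataset (path is an assumption -- adjust to your repo layout)
all_countries = pd.read_csv("data/all_countries.csv")

# First few rows
print(all_countries.head())

# Shape, columns, and data types
print(all_countries.shape)
print(all_countries.columns)
print(all_countries.dtypes)

# Concise summary: non-null count and dtype of each column
all_countries.info()
```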

### Task 2: Univariate Analysis
- For numerical features, calculate descriptive statistics and create histograms.
- For categorical features, count unique values and create bar plots.

<details>
<summary><h3><b>Hint</b></h3></summary>

- Use `.describe()` for a quick statistical summary of the numerical features.
- Utilize `matplotlib` or `seaborn` libraries to create histograms (`hist()` or `sns.histplot()`).
- For categorical data, `value_counts()` can help in understanding the distribution of classes, and you can plot the results using `bar()` or `sns.countplot()`.

</details>
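
One possible sketch, assuming the DataFrame from Task 1 and columns named `hoursPerDay` (numerical) and `Category` (categorical); check `.columns` and substitute the actual names:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

all_countries = pd.read_csv("data/all_countries.csv")

# Descriptive statistics for the numerical features
print(all_countries.describe())

# Histogram of a numerical feature (column name is an assumption)
sns.histplot(data=all_countries, x="hoursPerDay")
plt.title("Distribution of hours per day")
plt.show()

# Unique-value counts and a bar plot for a categorical feature
print(all_countries["Category"].value_counts())
sns.countplot(data=all_countries, y="Category")
plt.title("Observations per category")
plt.show()
```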

### Task 3: Bivariate Analysis
- Choose three pairs of numerical variables and create scatter plots to explore their relationships.
- Create boxplots for one numerical variable grouped by a categorical variable.

<details>
<summary><h3><b>Hint</b></h3></summary>

- When creating scatter plots with `plt.scatter()` or `sns.scatterplot()`, it might be helpful to color points by a third categorical variable using the hue parameter in Seaborn.
- Use `sns.boxplot()` to create boxplots. Consider using the hue parameter if you have sub-categories within your categorical variable.

</details>
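
A sketch of one such pair, again assuming column names (`hoursPerDay`, `uncertaintyCombined`, `Category`) that you should verify against the real data:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

all_countries = pd.read_csv("data/all_countries.csv")

# Scatter plot of two numerical variables, colored by a categorical one
sns.scatterplot(
    data=all_countries, x="hoursPerDay", y="uncertaintyCombined", hue="Category"
)
plt.title("Uncertainty vs. hours per day, by category")
plt.show()

# Boxplot of a numerical variable grouped by a categorical variable
sns.boxplot(data=all_countries, x="Category", y="hoursPerDay")
plt.xticks(rotation=45, ha="right")
plt.title("Hours per day by category")
plt.tight_layout()
plt.show()
```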

### Task 4: Missing Data and Outliers
- Identify any missing values in the dataset.
- Detect outliers in the numerical features using an appropriate method (e.g., Z-score, IQR).

<details>
<summary><h3><b>Hint</b></h3></summary>

- The `.isnull()` method chained with `.sum()` can help identify missing values.
- Consider using the `scipy.stats` module for Z-score computation, or the IQR (interquartile range), the difference between the first and third quartiles of your data distribution, for outlier detection.

</details>
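
For example, a sketch that applies both ideas to a single numerical column (the column name is an assumption):

```python
import pandas as pd
from scipy import stats

all_countries = pd.read_csv("data/all_countries.csv")

# Count missing values per column
print(all_countries.isnull().sum())

# IQR-based outlier detection on one numerical column
col = all_countries["hoursPerDay"]
q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1
outliers = all_countries[(col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)]
print(f"{len(outliers)} rows flagged by the 1.5 * IQR rule")

# Alternative: Z-scores via scipy.stats
z_scores = stats.zscore(col.dropna())
print((abs(z_scores) > 3).sum(), "values with |z| > 3")
```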

## Part 2: Data Preprocessing

This section will focus on cleaning and preparing the dataset for modeling. You will correct any issues you found during the EDA phase.

### Task 1: Handling Missing Values
- Choose appropriate methods to handle the missing data (e.g., imputation, removal).

<details>
<summary><h3><b>Hint</b></h3></summary>

- Imputation methods could involve using `.fillna()` with measures like mean (`data.mean()`) for numerical columns and mode (`data.mode().iloc[0]`) for categorical columns.
- For removal, `.dropna()` is straightforward but consider the impact on your dataset size.

</details>
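
A minimal sketch of both strategies (column names are assumptions; apply them to the columns where you actually found gaps):

```python
import pandas as pd

all_countries = pd.read_csv("data/all_countries.csv")

# Impute a numerical column with its mean
all_countries["hoursPerDay"] = all_countries["hoursPerDay"].fillna(
    all_countries["hoursPerDay"].mean()
)

# Impute a categorical column with its mode
all_countries["Category"] = all_countries["Category"].fillna(
    all_countries["Category"].mode().iloc[0]
)

# Or drop rows with any remaining missing values -- check the size impact
before = len(all_countries)
all_countries = all_countries.dropna()
print(f"Dropped {before - len(all_countries)} rows")
```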

### Task 2: Dealing with Outliers
- Treat or remove the outliers identified earlier based on your chosen methodology.

<details>
<summary><h3><b>Hint</b></h3></summary>

- For outlier removal, you may use boolean indexing based on Z-scores or IQR to filter your data.
- If you don't want to remove outliers, consider transforming them using methods such as log transformation.

</details>
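
A sketch of both options on one assumed column:

```python
import numpy as np
import pandas as pd

all_countries = pd.read_csv("data/all_countries.csv")

# Option 1: remove rows outside the 1.5 * IQR fences (column name is an assumption)
col = all_countries["hoursPerDay"]
q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1
trimmed = all_countries[col.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Option 2: keep the rows but compress the scale with a log transform
all_countries["log_hours"] = np.log1p(all_countries["hoursPerDay"])
```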

### Task 3: Feature Engineering
- Create at least one new feature that could be useful for a data mining task.

<details>
<summary><h3><b>Hint</b></h3></summary>

- Think about the domain knowledge related to your dataset that could suggest new features. For instance, if you have date-time information, extracting the day of the week could be useful.
- Also, combining features, if relevant, to create ratios or differences can often reveal useful insights.

</details>
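
For instance, a hypothetical feature that combines two assumed columns (`hoursPerDay` and `population`) into a population-weighted total:

```python
import pandas as pd

all_countries = pd.read_csv("data/all_countries.csv")

# Hypothetical engineered feature: total person-hours spent on an activity
# across the whole population (both column names are assumptions)
all_countries["total_person_hours"] = (
    all_countries["hoursPerDay"] * all_countries["population"]
)

# A simple ratio feature: the share of the 24-hour day an activity takes up
all_countries["share_of_day"] = all_countries["hoursPerDay"] / 24
```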

### Task 4: Data Transformation
- Standardize or normalize numerical features.
- Perform any additional transformations you deem necessary (e.g., encoding categorical variables, binning, etc.).

<details>
<summary><h3><b>Hint</b></h3></summary>

- For scaling, `StandardScaler` or `MinMaxScaler` from `sklearn.preprocessing` can be applied to numerical features.
- For skewed data, a log transformation with `np.log1p()` (i.e., log(1 + x)) can help rein in long tails.
- Use `pd.get_dummies()` or `LabelEncoder`/`OneHotEncoder` from `sklearn.preprocessing` for encoding categorical variables.

</details>
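
A sketch combining scaling and one-hot encoding, assuming missing values were already handled in Task 1 and that the named columns exist:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

all_countries = pd.read_csv("data/all_countries.csv").dropna()

# Standardize numerical features (column names are assumptions)
num_cols = ["hoursPerDay", "uncertaintyCombined"]
all_countries[num_cols] = StandardScaler().fit_transform(all_countries[num_cols])

# One-hot encode a categorical feature
all_countries = pd.get_dummies(all_countries, columns=["Category"], prefix="cat")
```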

---

**Deliverables:**
- A Jupyter Notebook (with Quarto YAML configuration) containing all code and visualizations.
- A written report summarizing your findings from the EDA, the decisions you made during preprocessing, and the rationale behind your choices.

**Submission Guidelines:**
- Push your Jupyter Notebook to your GitHub repository.
- Ensure your commit messages are descriptive.
- Submit the link to your GitHub repository on the course submission page.

**Grading Rubric:**
Your work will be evaluated based on the following criteria:
- Correctness and completeness of the code.
- Quality and clarity of the visualizations and summary report.
- Proper use of comments and documentation in the code.
- Adherence to the submission guidelines.

**Points Distribution:**
Each task is allocated a specific number of points. Points will be awarded based on the completeness and correctness of the work submitted. Be sure to follow best practices in data analysis and provide interpretations for your findings and decisions during preprocessing.

Good luck, and may your insights be profound!
21 changes: 21 additions & 0 deletions check_allowed_files.py
# check_allowed_files.py

import sys
import glob

# Get the list of allowed-file patterns from the command-line arguments
patterns_to_check = sys.argv[1:]

# Collect every file in the repository that matches an allowed pattern
found_files = []

# Loop over each pattern and record any matching files
for pattern in patterns_to_check:
    found_files.extend(glob.glob(pattern))

# If no files match any of the allowed patterns, exit with an error
if not found_files:
    print("Error: No allowed files found in the repository.")
    sys.exit(1)
else:
    print(f"Allowed files found: {found_files}")
1 change: 1 addition & 0 deletions data/ADD_DATA_HERE

52 changes: 52 additions & 0 deletions hw-02.qmd
---
title: "HW 02"
author: "INSERT YOUR NAME HERE"
format:
html:
embed-resources: true
toc: true
jupyter: python3
---

# Exploratory Data Analysis

## 1 - Data Overview

```{python}
#| label: label-me-1
```

## 2 - Univariate Analysis

```{python}
#| label: label-me-2
```

## 3 - Bivariate Analysis


## 4 - Missing Data and Outliers

# Data Preprocessing

## 1 - Handling Missing Values

```{python}
#| label: label-me-3
```

## 2 - Dealing with Outliers

```{python}
#| label: label-me-4
```

## 3 - Feature Engineering


## 4 - Data Transformation

Binary file added images/2018_case_numbers_final.jpg
3 changes: 3 additions & 0 deletions images/README.md
Source:

`2018_case_numbers_final.jpg`: https://lymediseaseassociation.org/resources/2018-reported-lyme-cases-top-15-states/
Binary file added images/plot-10-90-1.png
2 changes: 2 additions & 0 deletions requirements.txt
jupyter
nbformat
