Commit 499d391

Ed Rubin authored and committed: fix old-exam link

1 parent f337852 commit 499d391

File tree: 5 files changed (+112, -1 lines changed)

.gitignore (+1 -1)
@@ -7,7 +7,7 @@ lecture/01*
 projects/project*

 # Exams
-exam/*
+exam/exam*


 # Project-specific files

exam/past-class/inclass-23.pdf (28 KB; binary file not shown)

exam/past-class/inclass-24.pdf (32.7 KB; binary file not shown)

exam/past-home/home-23.md (+50)

@@ -0,0 +1,50 @@
# 524/424 Final Exam: Take-Home Portion

## Big picture

You are going to build a few statistical models to predict individual babies' birthweights using a host of parental data.

## Data

The data (contained in [`data-final.csv`](https://github.com/edrubin/EC524W23/blob/master/exam/take-home/data-final.csv)) come from a random subset of 10,000 births in the United States during 2021.

*Note:* If you're having trouble downloading the file with the link above, try [this link](https://minhaskamal.github.io/DownGit/#/home?url=https://github.com/edrubin/EC524W23/blob/626cf8b30a770055c04ea48174f67f1441bf8ee5/exam/take-home/data-final.csv) or download it from Canvas.

I downloaded the data from the [National Bureau of Economic Research (NBER)'s server](https://www.nber.org/research/data/vital-statistics-natality-birth-data). They provide a nice [codebook](https://data.nber.org/nvss/natality/code/nat2021us.html).

The original data come from the [CDC's National Vital Statistics System (NVSS)](https://www.cdc.gov/nchs/nvss/birth_methods.htm#anchor_1551744577970), which also has a [nice codebook](https://www.cdc.gov/nchs/data_access/vitalstatsonline.htm) (see the *User's Guide* for 2021).

You are going to be predicting `dbwt` (birthweight, in grams) using the 224 other variables in the dataset. Make sure you at least skim the codebook: some of the variables are going to be more helpful than others. You will also find that some are missing many observations, and you probably do not always want to impute the missing values.

**Warning:** Because there are 10,000 observations and many predictors, some of your models may take a little while to run. Be careful with how many hyperparameters you are trying... and which ones you try.

## Tasks

**[01]** (10 points) **Visualize** Read through the codebooks (linked above) to get a sense of the dataset's features. Once you understand the variables: Create three visualizations of the data that show some interesting insights.

- *Why?* You should always visualize your data, both before and after analyzing it. Start the exam by making a few figures to understand the data. You can always make better figures after you finish the other steps.
- *What?* Your figures should be well labeled and aesthetically pleasing.

**[02]** (10 points) **Old-fashioned linear regression** Now run a regression with several variables that you anticipate will be important for predicting birthweight. Report your cross-validated estimate for test performance (let's stick with MSE).

*Questions:*

- Does this linear-regression model seem good?
- Does it seem like you did a good job of choosing variables?
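For reference, the "cross-validated estimate of test MSE" step might look like the following. This is only an illustrative sketch, not the course's required R/Quarto workflow: it uses Python with scikit-learn on synthetic stand-in data (the actual natality data are not bundled here).

```python
# Sketch: estimate test MSE for a linear regression via 5-fold cross-validation.
# All data below are synthetic stand-ins, not the exam's natality dataset.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(524)
X = rng.normal(size=(500, 5))                    # stand-in predictors
beta = np.array([3.0, -2.0, 0.5, 0.0, 1.0])      # stand-in coefficients
y = X @ beta + rng.normal(scale=1.0, size=500)   # stand-in outcome

# scikit-learn reports *negative* MSE, so flip the sign at the end
neg_mse = cross_val_score(
    LinearRegression(), X, y, cv=5, scoring="neg_mean_squared_error"
)
cv_mse = -neg_mse.mean()
print(f"CV estimate of test MSE: {cv_mse:.3f}")
```

The same idea carries over to R (e.g., a `vfold_cv` resampling in tidymodels or a manual fold loop): fit on the training folds, score MSE on the held-out fold, and average.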
**[03]** (10 points) **New-fashioned linear regression** Now try a penalized version of linear regression. Again, report your CV-based MSE.

*Questions:*

- Did the penalized model beat your OLS model?
- Did the penalized model choose similar variables to your OLS model?
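A minimal sketch of the penalized step, again in Python/scikit-learn on synthetic data (an assumption for illustration; lasso is one of several penalized options such as ridge or elasticnet). The penalty is chosen by cross-validation, and the fitted coefficients show which variables the lasso "keeps":

```python
# Sketch: lasso regression with the penalty (alpha) chosen by 5-fold CV.
# Synthetic data with three truly irrelevant predictors, so the lasso
# has something to shrink toward zero.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(524)
X = rng.normal(size=(500, 6))
beta = np.array([3.0, -2.0, 1.0, 0.0, 0.0, 0.0])  # last three are pure noise
y = X @ beta + rng.normal(scale=1.0, size=500)

lasso = LassoCV(cv=5).fit(X, y)                   # CV over a grid of alphas
print("chosen alpha:", lasso.alpha_)
print("coefficients:", lasso.coef_.round(2))      # noise terms shrink toward 0
```

Comparing `lasso.coef_` against your OLS coefficients is one way to answer the "similar variables" question.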
**[04]** (10 points) **Going nonlinear** Try an ensemble of trees: either random forest or boosted trees. Report your CV-based MSE.

*Bonus:* Does the ensemble "value" the same variables as the penalized model (in terms of variable importance)?
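The variable-importance comparison in the bonus could be sketched as follows. Again a hedged Python/scikit-learn illustration on synthetic data, not the course's prescribed toolchain:

```python
# Sketch: fit a random forest and inspect its variable-importance scores,
# which you could line up against the variables a penalized model kept.
# Synthetic data: only x0 and x1 actually matter.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(524)
X = rng.normal(size=(500, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=1.0, size=500)

rf = RandomForestRegressor(n_estimators=200, random_state=524).fit(X, y)
for name, imp in zip(["x0", "x1", "x2", "x3"], rf.feature_importances_):
    print(f"{name}: {imp:.3f}")  # impurity-based importances; they sum to 1
```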
**[05]** (10 points) **Summary**

- Which model performed best? Would you say it is *significantly* better than the other models? Explain your answer.
- Does the best model's type (OLS, penalized regression, tree ensemble) suggest anything about this setting? Explain.

exam/past-home/home-24.md (+61)

@@ -0,0 +1,61 @@
# 524/424 Final Exam: Take-Home Portion

## Admin

### Optional

As discussed in class, this portion of the exam is **optional**. If you choose not to submit this portion, your grade for the final exam will be based solely on the in-class portion. If you submit this exam, it will count for 25% of your final-exam grade.

### Academic honesty

You **are not** allowed to work with anyone else. Working with *anyone* else will be considered cheating. You will receive a zero for **both** parts of the final exam and will fail the class.

You *can* use online materials (including ChatGPT and Copilot), books, notes, solutions, *etc*. However, you still must put all of your answers **in your own words**. Copying other people's (and chatbots') words is also considered cheating.

Ngan and Ed **will not** help you debug your code. Please do not ask.

### Instructions

**Due** Upload your answers to [Canvas](https://canvas.uoregon.edu/) *before* 10:15 **am** (Pacific) on Friday, 14 June 2024.

**Important** You **must** submit your answers as an HTML or PDF file, built from an RMarkdown (`.Rmd`) or Quarto (`.qmd`) file (you can also submit a link to an HTML page if you prefer that route).

## Prompts

Let's end where we began: [predicting house prices](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/) (as we did in the first two problem sets). Specifically, let's see if you can beat your old score using all of your fancy new prediction knowledge and ML skills.

## Getting started

**[01]** (10 points) **Visualize** Make sure you remember all of the variables in the dataset. Once you understand/recall the variables: Create three visualizations of the data that show some interesting insights. These figures should be publication quality: well labeled, aesthetically pleasing, and insightful.

*Why?* Visualization is good practice: you should always visualize your data before and after analyzing it. Start the exam by making a few figures to understand the data. You can always make better figures after you finish the other steps.

**[02]** (10 points) **Better regression?** In the past we used fairly simplistic imputation approaches for missing data. This time, use a more "sophisticated" approach for imputation. Then run your original regression model. Predict onto the test set and report your score.

*Questions:*

- Did the fancier imputation approach improve your model?
- Why would "better" imputation matter?
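One possible reading of "sophisticated" imputation is model-based imputation, where each incomplete column is predicted from the others. The sketch below illustrates the idea in Python with scikit-learn's `IterativeImputer` on synthetic data; this is an assumption about the intended approach (R users might reach for `mice` or a tidymodels recipe step instead), not the required method.

```python
# Sketch: model-based imputation (IterativeImputer) vs. simple mean imputation.
# Synthetic data in which column 2 is almost perfectly predictable from
# column 0, so the model-based imputer should recover missing values well.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

rng = np.random.default_rng(524)
X = rng.normal(size=(300, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=300)   # col 2 tracks col 0
X_missing = X.copy()
X_missing[rng.random(300) < 0.3, 2] = np.nan     # knock out ~30% of col 2

X_mean = SimpleImputer(strategy="mean").fit_transform(X_missing)
X_iter = IterativeImputer(random_state=524).fit_transform(X_missing)

# The model-based imputations should sit much closer to the truth.
print("mean-imputation error:", np.abs(X_mean[:, 2] - X[:, 2]).mean())
print("iterative-imputation error:", np.abs(X_iter[:, 2] - X[:, 2]).mean())
```

The gap between the two errors is one concrete answer to "why would better imputation matter": mean-filling ignores information the other predictors carry about the missing values.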
39+
40+
**[03]** (10 points) **Better-er regression?** Repeat **[02]** but this time use a lasso regression model. Report your score.
41+
42+
*Questions:*
43+
44+
- Did this approach improve your model?
45+
- Did the lasso model choose similar variables to your OLS model?
46+
47+
**[04]** (10 points) **Going nonlinear?** Now use a random forest for the prediction. You need to tune it. Also: Keep the variable importance scores.
48+
49+
*Questions:*
50+
51+
- Which hyperparameters did you tune?
52+
- Did the random forest beat your penalized regression model? Report your score.
53+
- Did the variable importance from the random forest match the variables chosen by your penalized regression model?
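The tuning step could be sketched as a small cross-validated grid search. As before, this is a hedged Python/scikit-learn illustration on synthetic data (the hyperparameter grid here is an arbitrary example, not a recommendation); in R you would tune, e.g., `mtry` and tree depth with tidymodels or `ranger`.

```python
# Sketch: tune a random forest with a cross-validated grid search over
# max_features (mtry-style) and tree depth. Synthetic stand-in data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(524)
X = rng.normal(size=(300, 6))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=300)

grid = GridSearchCV(
    RandomForestRegressor(n_estimators=100, random_state=524),
    param_grid={"max_features": [2, 4, 6], "max_depth": [3, None]},
    cv=3,
    scoring="neg_mean_squared_error",
)
grid.fit(X, y)
print("best hyperparameters:", grid.best_params_)
print("CV MSE:", -grid.best_score_)
```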
**[05]** (10 points) **Summary** Answer the following questions:

- Which model performed best?
- Would you say the "best" model is *significantly* better than the other models? Explain your answer.
- What could make your model better?

**[Bonus]** (Optional; 5 points) Use a (tuned) boosted tree model. Report your score and compare it to the other models.
