regressions team #13

Open
Tracked by #11
donbowen opened this issue Feb 22, 2023 · 12 comments

Comments

@donbowen
Member

No description provided.

@donbowen
Member Author

We didn't talk about this today. Please reply with a link to your file with tables!

@jum223 @XiaozheZhangLehigh

@donbowen
Member Author

donbowen commented Feb 22, 2023

@mromano224 @SebastianStoneham

What about regressions? These would be easy to implement from the bank tract data! We can talk on monday about how!

$$y = a \cdot \mathrm{TractStat} + b \cdot \mathrm{BOW} + c \cdot (\mathrm{TractStat} \times \mathrm{BOW}) + e$$

--> c would show issues.

y1 = Denial Rate
y2 = log(# applications)

TractStat1 = Hispanic%
TractStat2 = Hispanic > median Hisp% (from census level)

Reg 1: DenialRate = a + b*H + c*{BOW==1} + d*H*{BOW==1} + e
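
A minimal sketch of how Reg 1 could be estimated with the statsmodels formula API. The data here is synthetic and the column names (`hisp_rate` for H, `is_bow` for the BOW indicator) are assumptions for illustration, not the names in the project files. The coefficient on the `hisp_rate:is_bow` term plays the role of `d` in Reg 1.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols as sm_ols

# Synthetic stand-in for the bank-tract data (all values are made up).
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "hisp_rate": rng.uniform(0, 100, n),  # H: Hispanic share of the tract, in percent
    "is_bow": rng.integers(0, 2, n),      # 1 if the row is a BOW row, else 0
})

# Build y with a known interaction (0.002) so the estimate can be sanity-checked.
demo["denial_rate"] = (
    0.20
    + 0.001 * demo["hisp_rate"]
    + 0.02 * demo["is_bow"]
    + 0.002 * demo["hisp_rate"] * demo["is_bow"]
    + rng.normal(0, 0.05, n)
)

# "hisp_rate * is_bow" expands to hisp_rate + is_bow + hisp_rate:is_bow,
# which matches the three right-hand-side terms of Reg 1.
reg1 = sm_ols("denial_rate ~ hisp_rate * is_bow", data=demo).fit()
print(reg1.params)
```

The same formula with `log_num_apps` on the left, or with the above-median indicator in place of `hisp_rate`, would give the other columns of the plan.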

@jum223
Collaborator

jum223 commented Feb 22, 2023

You can find the tables we produced in my branch called juan4, the files are in code folder and they are called TablesAZ and TablesCA.

@donbowen
Member Author

Cool, I see some numbers you should show Matt Monday in your existing tables...

  • Quintiles_loan_table ... please add denial rates and avg (approved) loan size... useful stats to know
  • Quintiles_loan_table3.iloc[:,-3:] (we only need to see the last 3 cols)
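
One way the requested columns could be added, sketched on made-up loan-level rows. The column names (`quintile`, `denied`, `loan_amount`) are hypothetical and will differ from whatever TablesAZ and TablesCA actually use.

```python
import pandas as pd

# Made-up loan-level rows; column names are hypothetical.
loans = pd.DataFrame({
    "quintile":    [1, 1, 2, 2, 2, 3],
    "denied":      [0, 1, 0, 0, 1, 0],
    "loan_amount": [100, 250, 300, 200, 400, 150],
})

by_q = loans.groupby("quintile")
approved = loans[loans["denied"] == 0]  # average loan size uses approved loans only

table = pd.DataFrame({
    "num_applications": by_q.size(),
    "num_denied": by_q["denied"].sum(),
    "denial_rate": by_q["denied"].mean(),
    "avg_approved_loan": approved.groupby("quintile")["loan_amount"].mean(),
})

# Show only the last three columns, as with Quintiles_loan_table3.iloc[:, -3:]
last_three = table.iloc[:, -3:]
print(last_three)
```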

@donbowen
Member Author

@jum223 @XiaozheZhangLehigh @mromano224 @SebastianStoneham

What's the status here? Please be prepared with which tables (and which numbers specifically in the tables) you want to show. Please reply with a link to the files with the tables you want to show.

@donbowen
Member Author

[Image: photo of the whiteboard with the regression plan]

donbowen changed the title from "Tables team" to "regressions team" on Feb 28, 2023

@donbowen
Member Author

You need to update the regression. See the picture above.

Reg 1 uses y1 and x1, 2 uses y1 x2, and so on

@mromano224

@mromano224
Collaborator

Just updated the regressions in regression_1 @donbowen

@donbowen
Member Author

Better. Still not right.

  1. model_names = ['m1', 'm2', 'm3', 'm4'] is not informative when the y variable is changing from one column to another. I think if you delete that, it will show the y variable name for each column. If not, update the model_names.
  2. Second: Sigh... that's not the regression... look at the bottom of the picture. See it? That's the regression. Or look at this comment further up the thread, where @annakharv46 transcribed the photo for you.
  3. In the plan for the regressions from the comment/whiteboard, I called x1 and x2 H. H is either
    • Fraction hispanic (not high_hispanic )
    • Fraction hispanic > median(fraction hispanic)

@mromano224
Collaborator

another update... please lmk, sorry for the confusion @donbowen

@donbowen
Member Author

donbowen commented Feb 28, 2023

Am I looking at the right file?

Good job with the column names... it helped me figure out the key problem.

  1. Obviously still not the formula... it's missing 2 variables in each column. I know why now (next point)
  2. You need to include all rows in your data for the regressions. A single regression is supposed to have a variable indicating whether or not the row is about BOW. You obviously can't do this on your mini datasets that have only BOW or only Competitors.
  3. You didn't ABCD and print your data, but I bet your "hisp_over_med" variable wasn't only 0s and 1s.
  4. No need to type the regressions twice and print them by themselves and then all together at the end.
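
Points 2 and 3 can be checked on a tiny made-up frame, sketched here. The `which_bank` values and the `is_bow` name are assumptions for illustration; the idea is one dataset with all rows, a 0/1 BOW indicator, and a 0/1 above-median indicator computed over the full sample.

```python
import pandas as pd

# Tiny made-up frame; the which_bank values and the is_bow name are hypothetical.
bank_tract_demo = pd.DataFrame({
    "which_bank": ["BOW", "BOW", "Competitor", "Competitor"],
    "hisp_rate":  [10.0, 60.0, 25.0, 80.0],
})

# One dataset with ALL rows, plus an indicator for whether the row is about BOW.
bank_tract_demo["is_bow"] = (bank_tract_demo["which_bank"] == "BOW").astype(int)

# Median taken over the full sample; the comparison is stored as 0/1.
median_hisp = bank_tract_demo["hisp_rate"].median()
bank_tract_demo["hisp_over_med"] = (bank_tract_demo["hisp_rate"] > median_hisp).astype(int)

# Look at the data before running anything.
print(bank_tract_demo)
print(bank_tract_demo["hisp_over_med"].value_counts())

# Both indicators should contain only 0s and 1s, and both groups should be present.
assert set(bank_tract_demo["hisp_over_med"].unique()) <= {0, 1}
assert bank_tract_demo["is_bow"].nunique() == 2
```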

Here, just restart the file with this (some issues fixed, others I left pointers to.)

import pandas as pd
import numpy as np
from statsmodels.formula.api import ols as sm_ols
from statsmodels.iolib.summary2 import summary_col 

bank_tract = pd.read_csv('../input_data_clean/bank_tract_clean_WITH_CENSUS.csv')

# adjust this next line to drop the BMO rows
bank_tract = bank_tract.query('which_bank != "BMO"')

# create vars
bank_tract['hisp_rate']     = (bank_tract['HispanicLatinoPop'] / bank_tract['Tot.Pop']) * 100
bank_tract['hisp_over_med'] = (bank_tract['hisp_rate'] > bank_tract['hisp_rate'].median()).astype(int)  # 0/1, not True/False
bank_tract['log_num_apps']  = np.log(bank_tract['num_applications'])

# skip all the one-off regressions (just show them all together...)

# regressions

# define the regression models (YOU'LL NEED TO MAKE ONE MORE VARIABLE ABOVE, AND THEN UPDATE THESE TO MATCH FORMULA)
# every model uses ALL rows (bank_tract), not a BOW-only or competitor-only subset
m1 = sm_ols('denial_rate ~ hisp_rate', data=bank_tract).fit()
m2 = sm_ols('denial_rate ~ hisp_over_med', data=bank_tract).fit()
m3 = sm_ols('log_num_apps ~ hisp_rate', data=bank_tract).fit()
m4 = sm_ols('log_num_apps ~ hisp_over_med', data=bank_tract).fit()

# set up the formatting for the table
info_dict = {'No. observations': lambda x: f"{int(x.nobs):d}"}
float_format = '%0.3f'

# UPDATE THIS AS NEEDED: 
regressor_order = ['Intercept', 'hisp_rate', 'hisp_over_med']

# UPDATE THE COLUMN NAMES (just using the y variable in each column works)
table = summary_col(results=[m1, m2, m3, m4],
                    model_names=['denial_rate', 'denial_rate',
                                 'log_num_apps', 'log_num_apps'],
                    regressor_order=regressor_order,
                    float_format=float_format,
                    info_dict=info_dict,
                    stars=True)  

table.title = 'OLS Regressions'

# print the table
print(table)

@vrg223
Collaborator

vrg223 commented Mar 1, 2023

Regression Analysis:
-Based on reg2.ipynb
In the linear regressions for CA and AZ, the models for log(number of applications) return higher R-squareds (in the 70s) than the models for denial rates, so the regressors explain more of the variation in the number of applications.
Models 1 and 2, which regress denial rates on the Hispanic rate of the population, return a much lower R-squared. This tells us that the Hispanic share of the population explains little of the variation in denial rates; the relationship is not one-to-one.
hisp_rate: -0.00075*** implies a very small but negative relationship between denial rates and the Hispanic population in AZ. The same goes for hisp_rate: -0.00115*** in CA.
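
To put the two coefficients on a concrete scale, here is a small worked calculation. It assumes `hisp_rate` is measured in percentage points (0 to 100, as in the starter script above) and that the coefficient is read as the change in the denial-rate variable per one-point change in `hisp_rate`. The units of `denial_rate` in reg2.ipynb are not stated in this thread, and if the model includes an interaction term the coefficient applies only to the baseline group.

```python
# Coefficients quoted in the comment above.
coefs = {"AZ": -0.00075, "CA": -0.00115}

# Hypothetical comparison: a tract whose hisp_rate is 10 percentage points higher.
delta_hisp_rate = 10

# In a linear model the predicted change in y is coefficient * change in x.
predicted_change = {state: coef * delta_hisp_rate for state, coef in coefs.items()}

for state, change in predicted_change.items():
    print(f"{state}: predicted change in denial_rate = {change:.4f}")
```

Under those assumptions, a 10-point difference in `hisp_rate` corresponds to a predicted difference of about -0.0075 in AZ and -0.0115 in CA, in whatever units `denial_rate` uses.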
