❗ IMPORTANT: This repo is under active development and is not yet in production.
If an edit needs to made to anything in the
glue
directory, there is a specific process to be found here.
This repository contains code to identify and flag sales that may be non-arms-length transactions. A non-arms-length sale occurs when the buyer and seller have a relationship that might influence the transaction price, leading to a sale that doesn't reflect the true market value of the property. These sales can distort our analyses and models, since they don't adhere to the principle of an open and competitive market.
The workflow for sale flagging is as follows:
- A manual initial run of
manual_flagging/initial_flagging.py
instantiates all tables and flags all specified sales as either outliers or non-outliers. - Next,
glue/sales_val_flagging.py
flags all new, unflagged sales. This script is automated such that it runs on a schedule (e.g. monthly). - If an error occurs or we want to update the methodology on previously-flagged sales,
manual_flagging/manual_update.py
is used to select a subset of sales to re-flag. All sales have a version number that is incremented on update. When utilizing our sales views, we pull the flag data with the highest version value to keep it up-to-date.
On the left, we see the normal workflow of the process. Represented on the right is the use of manual_update.py
to update/re-flag sales.
graph TD
A{{No sales are flagged}}
B[Run initial_flagging.py locally]
C[Flags added to sales<br>via flagging.py with<br>Version = 1]
D[Flags joined to<br>default.vw_pin_sale]
A --> B
B -- Sales pulled from within<br>specified time window --> C
C -- Results saved to S3<br>with unique run ID --> D
E{{Some sales need re-flagging}}
F[Subset sales in yaml, run<br>manual_update.py locally]
G[If sale already flagged<br>increment Version += 1]
H{If sale unflagged,<br>assign Version = 1}
I[Flags update existing<br>default.vw_pin_sale records]
E --> F
F --> G
F --> H
H --> G
G -- Results saved to S3<br>with new run ID --> I
All flagging runs populate 3 Athena tables with metadata, flag results, and other information. These tables can be used to determine why an individual sale was flagged as an outlier. The structure of the tables is:
erDiagram
flag }|--|| metadata : describes
flag }|--|{ parameter : describes
flag }|--|{ group_mean : describes
flag {
string meta_sale_document_num PK
date rolling_window
bigint sv_is_outlier
bigint sv_is_ptax_outlier
bigint sv_is_heuristic_outlier
string sv_outlier_type
string run_id FK
bigint version PK
}
metadata {
string run_id PK
string long_commit_sha
string short_commit_sha
string run_timestamp
string run_type
}
parameter {
string run_id PK
bigint sales_flagged
timestamp earliest_data_ingest
timestamp latest_data_ingest
bigint short_term_owner_threshold
arraystring iso_forest_cols
arraystring res_stat_groups
arraystring condo_stat_groups
arraybigint dev_bounds
bigint rolling_window
string date_floor
bigint min_group_thresh
}
group_mean {
bigint group_size
double mean_price
double mean_price_sqft
string run_id PK
string group PK
}
The flagging model uses group means to determine the statistical deviation of sales, and flags them beyond a certain threshold. Group means are constructed using a rolling window strategy.
The current implementation uses a 12 month rolling window. This means that for any sale, the "group" contains all sales within the same month, along with all sales from the previous 11 months. This 12 month window can be changed by editing the config files: manual_flagging/yaml/
and glue/sales_val_flagging.json
. Additional notes on the rolling window implementation:
- We take every sale in the same month of the sale date, along with all sale data from the previous N months. This window contains roughly 1 year of data.
- This process starts with an
.explode()
call. Example here. - It ends by subsetting to the
original_observation
data. Example here. - Corresponding functions in Mansueto's flagging model accommodate this rolling window integration, these functions are defined in each of the flagging functions, one in
manual_flagging/src/flagging_rolling.py
, and one for the Glue job inglue/flagging_script_glue/*.py
.
This repository manages the configurations, scripts, and details for an AWS Glue Job. It's essential to maintain consistency and version control for all changes related to the job. Therefore, specific procedures have been established.
- DO NOT modify the Glue job script, its associated flagging python script, or any of its job details directly via the AWS Console.
- All changes to these components should originate from this repository. This ensures that every modification is tracked and version-controlled.
- The only advisable actions in the AWS Console concerning this Glue job are:
- Running the job
- To test a change to the Glue job script or the flagging script, make an edit on a branch and open a pull request. Our GitHub Actions configuration will deploy a staging version of your job, named
ci_<your-branch-name>_sales_val_flagging
, that you can run to test your changes. See the Modifying the Glue job section below for more details.
The Glue job and its flagging script are written in Python, while the job details and settings are defined in a Terraform configuration file. These files can be edited to modify the Glue job script, its flagging script, or its job settings.
- Locate the desired files to edit:
- Glue script:
glue/sales_val_flagging.py
- Flagging script:
glue/flagging_script_glue/flagging.py
- Job details/settings:
main.tf
, under the resource blockaws_glue_job.sales_val_flagging
(see the Terraform AWS provider docs for details)
- Glue script:
- Any changes to these files should be made in the following sequence:
- Make a new git branch for your changes.
- Edit the files as necessary.
- Open a pull request for your changes against the
main
branch. A GitHub Actions workflow calleddeploy-terraform
will deploy a staging version of your job namedci_<your-branch-name>_sales_val_flagging
that you can run to test your changes.- By default, this configuration will deploy an empty version of the
sale.flag
table, which simulates an environment in which there are no preexisting flags prior to a run. - If you would like to test your job against a subset of the production data, copy your data subset from the production job bucket to the bucket created by Terraform for your job (or leave the new bucket empty to simulate running the job when no flags exist). Then, run the crawler created by Terraform for your PR in order to populate the staging version of the
sale.flag
database that your staging job uses. If you're having trouble finding your staging bucket, job, or crawler, check the GitHub Actions output for the first successful run of your PR and look for the Terraform output displaying the IDs of these resources.
- By default, this configuration will deploy an empty version of the
- If you need to make further changes, push commits to your branch and GitHub Actions will deploy the changes to the staging job and its associated resources.
- Once you're happy with your changes, request review on your PR.
- Once your PR is approved, merge it into
main
. A GitHub Actions workflow calledcleanup-terraform
will delete the staging resources that were created for your branch, while a separatedeploy-terraform
run will deploy your changes to the production job and its associated resources.