Update DB with TY2022 data #25

Merged — 21 commits merged on Jan 19, 2024

Changes from 20 commits
1 change: 1 addition & 0 deletions .github/ISSUE_TEMPLATE/release-database.md
@@ -20,6 +20,7 @@ labels: release
- [ ] Using the command line, grab the final compressed database file from the temporary directory (found at `db_path` after running `data-raw/create_db.R`) and move it to the project directory. Rename the file `ptaxsim-<TAX_YEAR>.<MAJOR VERSION>.<MINOR VERSION>.db.bz2`
- [ ] Decompress the database file for local testing using `pbzip2`. The typical command will be something like `pbzip2 -d -k ptaxsim-2021.0.2.db.bz2`
- [ ] Rename the decompressed local database file to `ptaxsim.db` for local testing. This is the file name that the unit tests and vignettes expect
- [ ] Use [sqldiff](https://www.sqlite.org/sqldiff.html) or a similar tool to compare the new database file to the previous version. Ensure that the changes are expected
- [ ] Restart R. Then run the unit tests (`devtools::test()` in the console) and vignettes (`pkgdown::build_site()` in the console) locally
- [ ] Knit the `README.Rmd` file to update the database link at the top of the README. The link is pulled from the `ptaxsim.db` file's `metadata` table
- [ ] If necessary, update the database diagrams in the README with any new fields or tables
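The `sqldiff` comparison step in the checklist above might look like the following (file names assumed from the checklist's `ptaxsim-<TAX_YEAR>.<MAJOR>.<MINOR>.db` naming convention):

```shell
# Compare the new database to the previous release after decompressing both.
# --summary prints per-table counts of inserted/updated/deleted rows, which
# is usually enough to confirm the changes are expected
sqldiff --summary ptaxsim-2021.0.4.db ptaxsim-2022.0.0.db
```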
6 changes: 4 additions & 2 deletions DESCRIPTION
@@ -22,7 +22,7 @@ Imports:
glue,
RSQLite,
utils
RoxygenNote: 7.2.3
RoxygenNote: 7.3.0
Suggests:
arrow,
covr,
@@ -36,8 +36,10 @@ Suggests:
httr,
knitr,
lintr,
noctua,
odbc,
openxlsx,
pdftools,
pkgdown,
prettymapr,
purrr,
@@ -60,4 +62,4 @@ Remotes:
paleolimbot/geoarrow,
ropensci/tabulizer
Config/Requires_DB_Version: 2021.0.4
Config/Wants_DB_Version: 2021.0.4
Config/Wants_DB_Version: 2022.0.0
2 changes: 1 addition & 1 deletion R/tax_bill.R
@@ -193,7 +193,7 @@ tax_bill <- function(year_vec,

# Calculate the exemption effect by subtracting the exempt amount from
# the total taxable EAV
dt[, agency_tax_rate := agency_total_ext / agency_total_eav]
dt[, agency_tax_rate := agency_total_ext / as.numeric(agency_total_eav)]
Comment from the PR author (Member):
Fix for a weird integer overflow using the int64 type and 0 values. Coercing to numeric solves it fine 🤷

Comment from a Collaborator:
I've encountered this before with `ptaxsim`; it's an annoying int64 quirk.
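A minimal sketch of the quirk discussed in this thread (an assumed reproduction, not code from the PR; exact behavior depends on the `bit64` version in use):

```r
library(bit64)
library(data.table)

dt <- data.table(
  agency_total_ext = c(100, 200),
  agency_total_eav = as.integer64(c(0, 50))
)

# Division involving an integer64 column that contains 0s has been reported
# to produce surprising overflow-like results; coercing the integer64
# column to numeric before dividing sidesteps the problem entirely
dt[, rate_bad  := agency_total_ext / agency_total_eav]
dt[, rate_good := agency_total_ext / as.numeric(agency_total_eav)]
```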

dt[, tax_amt_exe := exe_total * agency_tax_rate]
dt[, tax_amt_pre_exe := round(eav * agency_tax_rate, 2)]
dt[, tax_amt_post_exe := round(tax_amt_pre_exe - tax_amt_exe, 2)]
6 changes: 3 additions & 3 deletions README.Rmd
@@ -555,9 +555,9 @@ erDiagram

## Notes and caveats

- Currently, the per-district tax calculations for properties in the Red-Purple Modernization (RPM) TIF are slightly flawed. However, the total tax bill per PIN is still accurate. See issue [#11](#11) for more information.
- Special Service Area (SSA) rates must be calculated manually when creating counterfactual bills. See issue [#31](#31) for more information.
- In rare instances, a TIF can have multiple `agency_num` identifiers (usually there's only one per TIF). The `tif_crosswalk` table determines what the "main" `agency_num` is for each TIF and pulls the name and TIF information using that identifier. See issue [#39](#39) for more information.
- Currently, the per-district tax calculations for properties in the Red-Purple Modernization (RPM) TIF are slightly flawed. However, the total tax bill per PIN is still accurate. See issue [#4](https://github.com/ccao-data/ptaxsim/issues/4) for more information.
- Special Service Area (SSA) rates must be calculated manually when creating counterfactual bills. See issue [#3](https://github.com/ccao-data/ptaxsim/issues/3) for more information.
- In rare instances, a TIF can have multiple `agency_num` identifiers (usually there's only one per TIF). The `tif_crosswalk` table determines what the "main" `agency_num` is for each TIF and pulls the name and TIF information using that identifier. See issue [GitLab #39](https://gitlab.com/ccao-data-science---modeling/packages/ptaxsim/-/issues/39) for more information.
- PTAXSIM is relatively memory-efficient and can calculate every district line-item for every tax bill for the last 15 years (roughly 350 million rows). However, the memory required for this calculation is substantial (around 100 GB).
- PTAXSIM's accuracy is measured automatically with an [integration test](tests/testthat/test-accuracy.R). The test takes a random sample of 1 million PINs, calculates the total bill for each PIN, and compares it to the real total bill.
- This repository contains an edited version of PTAXSIM's commit history. Historical Git LFS and other data files (.csv, .xlsx, etc.) were removed in the transition to GitHub. The most current version of these files is available starting in commit [1f06639](https://github.com/ccao-data/ptaxsim/commit/1f06639d98a720999222579b7ff61bcce061f1ec). If you need the historical LFS files for any reason, please visit the [GitLab archive](https://gitlab.com/ccao-data-science---modeling/packages/ptaxsim) of this repository.
21 changes: 12 additions & 9 deletions README.md
@@ -37,13 +37,13 @@ Table of Contents
> installation](#database-installation) for details.
>
> [**Link to PTAXSIM
> database**](https://ccao-data-public-us-east-1.s3.amazonaws.com/ptaxsim/ptaxsim-2021.0.4.db.bz2)
> (DB version: 2021.0.4; Last updated: 2023-04-28 23:40:05)
> database**](https://ccao-data-public-us-east-1.s3.amazonaws.com/ptaxsim/ptaxsim-2022.0.0.db.bz2)
> (DB version: 2022.0.0; Last updated: 2024-01-19 04:40:35)

PTAXSIM is an R package/database to approximate Cook County property tax
bills. It uses real assessment, exemption, TIF, and levy data to
generate historic, line-item tax bills (broken out by taxing district)
for any property from 2006 to 2021. Given some careful assumptions and
for any property from 2006 to 2022. Given some careful assumptions and
data manipulation, it can also provide hypothetical, but factually
grounded, answers to questions such as:

@@ -173,9 +173,9 @@ database:

1. Download the compressed database file from the CCAO’s public S3
bucket. [Link
here](https://ccao-data-public-us-east-1.s3.amazonaws.com/ptaxsim/ptaxsim-2021.0.4.db.bz2).
here](https://ccao-data-public-us-east-1.s3.amazonaws.com/ptaxsim/ptaxsim-2022.0.0.db.bz2).
2. (Optional) Rename the downloaded database file by removing the
version number, i.e. ptaxsim-2021.0.4.db.bz2 becomes
version number, i.e. ptaxsim-2022.0.0.db.bz2 becomes
`ptaxsim.db.bz2`.
3. Decompress the downloaded database file. The file is compressed
using [bzip2](https://sourceware.org/bzip2/).
@@ -863,15 +863,18 @@ erDiagram

- Currently, the per-district tax calculations for properties in the
Red-Purple Modernization (RPM) TIF are slightly flawed. However, the
total tax bill per PIN is still accurate. See issue [\#4](https://github.com/ccao-data/ptaxsim/issues/4) for
more information.
total tax bill per PIN is still accurate. See issue
[\#4](https://github.com/ccao-data/ptaxsim/issues/4) for more
information.
- Special Service Area (SSA) rates must be calculated manually when
creating counterfactual bills. See issue [\#3](https://github.com/ccao-data/ptaxsim/issues/3) for more
creating counterfactual bills. See issue
[\#3](https://github.com/ccao-data/ptaxsim/issues/3) for more
information.
- In rare instances, a TIF can have multiple `agency_num` identifiers
(usually there’s only one per TIF). The `tif_crosswalk` table
determines what the “main” `agency_num` is for each TIF and pulls the
name and TIF information using that identifier. See archived issue [\#39](https://gitlab.com/ccao-data-science---modeling/packages/ptaxsim/-/issues/39)
name and TIF information using that identifier. See issue [GitLab
\#39](https://gitlab.com/ccao-data-science---modeling/packages/ptaxsim/-/issues/39)
for more information.
- PTAXSIM is relatively memory-efficient and can calculate every
district line-item for every tax bill for the last 15 years (roughly
3 changes: 3 additions & 0 deletions data-raw/agency/Agency Rate Report 2022.xlsx
Git LFS file not shown
20 changes: 11 additions & 9 deletions data-raw/agency/agency.R
@@ -45,7 +45,6 @@ file_names <- list.files(




# agency_fund ------------------------------------------------------------------

# Load the detail sheet from each agency file. This includes the levy and rate
@@ -64,7 +63,7 @@ agency_fund <- map_dfr(file_names, function(file) {
"loss", "loss_percent", "fund_loss"
))) %>%
rename_with(~"levy_plus_loss", any_of(c(
"levy_and_loss", "fund_levy_plus_loss"
"levy_and_loss", "fund_levy_plus_loss", "levy_loss"
))) %>%
rename_with(~"rate_ceiling", any_of(c(
"ceiling", "rate_ceiling", "fund_rate_ceiling"
@@ -189,7 +188,7 @@ arrow::write_parquet(
# EAV, final extension, and much more
agency <- map_dfr(file_names, function(file) {
message("Reading: ", file)
readxl::read_xlsx(file) %>%
readxl::read_xlsx(file, sheet = 1) %>%
set_names(snakecase::to_snake_case(names(.))) %>%
mutate(
across(
@@ -235,9 +234,12 @@ agency <- map_dfr(file_names, function(file) {
"reduction_percent", "reduction_factor", "clerk_reduction_factor"
))) %>%
rename_with(~"total_non_cap_ext", any_of(c(
"total_non_cap_ext", "final_non_cap_ext"
"total_non_cap_ext", "final_non_cap_ext", "total_non_cap_extension"
))) %>%
rename_with(~"total_ext", any_of(c(
"total_ext", "final_ext",
"grand_total_ext"
))) %>%
rename_with(~"total_ext", any_of(c("total_ext", "final_ext"))) %>%
# Select, order, and rename columns
select(
year,
@@ -281,7 +283,7 @@ agency <- map_dfr(file_names, function(file) {
0,
cty_cook_eav
),
across(starts_with("cty_"), replace_na, 0),
across(starts_with("cty_"), ~ replace_na(.x, 0)),
# Make all percentages decimals
across(
c(pct_burden, reduction_pct),
@@ -296,20 +298,20 @@ agency <- map_dfr(file_names, function(file) {
arrange(year, agency_num) %>%
# Coerce columns to expected types
mutate(
across(c(year), as.character),
across(c(year), ~ as.character(.x)),
across(
c(
lim_numerator, lim_denominator, prior_eav:cty_total_eav,
total_levy, total_max_levy, total_reduced_levy, total_final_levy
),
as.integer64
~ as.integer64(.x)
),
across(
c(
lim_rate, pct_burden, total_prelim_rate, total_final_rate,
reduction_pct, total_non_cap_ext, total_ext
),
as.double
~ as.double(.x)
)
)
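Context for the `across()` changes in this hunk: passing extra arguments through `across(...)` (e.g. `across(cols, replace_na, 0)`) is deprecated as of dplyr 1.1.0, so the diff switches to the supported anonymous-function form. A toy sketch:

```r
library(dplyr)
library(tidyr)

df <- tibble(cty_cook_eav = c(1, NA), cty_total_eav = c(NA, 2))

# Deprecated: across(starts_with("cty_"), replace_na, 0)
# Preferred anonymous-function style, as used in the diff:
df %>% mutate(across(starts_with("cty_"), ~ replace_na(.x, 0)))
```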

4 changes: 2 additions & 2 deletions data-raw/agency/tif_agency_names.csv
Git LFS file not shown
27 changes: 17 additions & 10 deletions data-raw/cpi/cpi.R
@@ -1,7 +1,7 @@
library(arrow)
library(dplyr)
library(miniUI)
library(tabulizer)
library(pdftools)
library(tidyr)
library(stringr)

@@ -14,27 +14,34 @@ row_to_names <- function(df) {
# The goal of this script is to create a data frame of Consumer Price Indices
# CPI-U used by PTELL to calculate/cap property tax extensions
# We can load the historical CPIs from a PDF provided by the State of Illinois
# https://tax.illinois.gov/content/dam/soi/en/web/tax/localgovernments/property/documents/cpihistory.pdf # nolint

# Paths for local raw data storage and remote storage on S3
remote_bucket <- Sys.getenv("S3_REMOTE_BUCKET")
remote_path <- file.path(remote_bucket, "cpi", "part-0.parquet")

# Extract the table only (no headers), then manually assign header
cpi_ext <- extract_areas(file = "data-raw/cpi/cpihistory.pdf")[[1]]
cpi <- as_tibble(cpi_ext[, c(1, 2, 4, 5, 6)])
cpi <- setNames(cpi, c("year", "cpi", "ptell_cook", "comments", "levy_year"))
cpi <- pdftools::pdf_text(pdf = "data-raw/cpi/cpihistory.pdf") %>%
str_extract(., regex("1991.*", dotall = TRUE)) %>%
str_remove_all(., "\\(5 % for Cook\\)") %>%
str_split(., "\n") %>%
unlist() %>%
tibble(vals = `.`) %>%
mutate(vals = str_squish(vals)) %>%
separate_wider_delim(
col = vals,
names = c("year", "cpi", "pct", "ptell_cook", "levy_year", "year_paid"),
delim = " ", too_few = "align_start", too_many = "drop"
)

# Merge Cook rate into main column
cpi <- cpi %>%
mutate(
across(c(year, levy_year), as.character),
across(c(cpi), as.numeric),
across(c(ptell_cook, comments), readr::parse_number),
ptell_cook = ifelse(!is.na(comments), comments, ptell_cook),
across(c(ptell_cook), readr::parse_number),
ptell_cook = ptell_cook / 100
) %>%
select(-comments) %>%
filter(year != "1991") %>%
filter(year != "1991", year != "", year != "CPI") %>%
select(-pct, -year_paid) %>%
arrange(year)
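The new `pdftools`-based parsing above can be illustrated on a fake two-line extract (values invented; the real input is the Illinois CPI history PDF, and the column names are taken from the script):

```r
library(dplyr)
library(stringr)
library(tidyr)

# Fake stand-in for one page of text returned by pdftools::pdf_text()
raw <- "1992 140.3 3.1 0.031 1993 1995\n1993 144.5 3.0 0.030 1994 1996"

cpi_demo <- str_split(raw, "\n") %>%
  unlist() %>%
  tibble(vals = .) %>%
  # Collapse runs of whitespace so a single space is a reliable delimiter
  mutate(vals = str_squish(vals)) %>%
  separate_wider_delim(
    col = vals,
    names = c("year", "cpi", "pct", "ptell_cook", "levy_year", "year_paid"),
    delim = " ", too_few = "align_start", too_many = "drop"
  )
```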

# Write to S3
Binary file modified data-raw/cpi/cpihistory.pdf
Binary file not shown.
2 changes: 1 addition & 1 deletion data-raw/create_db.R
@@ -37,7 +37,7 @@ db_send_queries <- function(conn, sql) {
# changes. This is checked against Config/Requires_DB_Version in the DESCRIPTION
# file via check_db_version(). Schema is:
# "MAX_YEAR_OF_DATA.MAJOR_VERSION.MINOR_VERSION"
db_version <- "2021.0.4"
db_version <- "2022.0.0"

# Set the package version required to use this database. This is checked against
# Version in the DESCRIPTION file. Basically, we have a two-way check so that
4 changes: 2 additions & 2 deletions data-raw/eq_factor/eq_factor.csv
Git LFS file not shown
33 changes: 32 additions & 1 deletion data-raw/pin/pin.R
Original file line number Diff line number Diff line change
@@ -2,6 +2,7 @@ library(arrow)
library(DBI)
library(dplyr)
library(geoarrow)
library(noctua)
library(odbc)
library(sf)
library(tidyr)
@@ -27,6 +28,10 @@ ccaodata <- dbConnect(
.connection_string = Sys.getenv("DB_CONFIG_CCAODATA")
)

# Establish a connection to the Data Department's Athena data warehouse. We'll
# values from here to fill in any missing values from the legacy system
ccaoathena <- dbConnect(noctua::athena())

# Pull AV and class from the Clerk and HEAD tables, giving preference to values
# from the Clerk table in case of mismatch (except for property class).
# These tables are pulled from the AS/400 and will be pulled from iasWorld
@@ -82,13 +87,38 @@ pin <- dbGetQuery(
tax_bill_total = tidyr::replace_na(tax_bill_total, 0)
)

# Pull AVs from Athena to fill in any missingness from the legacy system
pin_athena <- dbGetQuery(
ccaoathena,
"
SELECT DISTINCT
pin,
year,
mailed_tot,
certified_tot,
board_tot
FROM default.vw_pin_value
WHERE year >= '2006'
"
) %>%
mutate(
across(c(year, pin), as.character),
across(c(ends_with("_tot")), as.integer)
)

pin_fill <- pin %>%
# There are a few (less than 100) rows with Clerk AVs split for the same PIN.
# Sum to get 1 record per PIN, then keep the record with the highest AV
group_by(year, pin) %>%
mutate(av_clerk = sum(av_clerk)) %>%
ungroup() %>%
distinct(year, pin, .keep_all = TRUE) %>%
left_join(pin_athena, by = c("year", "pin")) %>%
mutate(
av_board = ifelse(is.na(av_board), board_tot, av_board),
av_certified = ifelse(is.na(av_certified), certified_tot, av_certified),
av_mailed = ifelse(is.na(av_mailed), mailed_tot, av_mailed)
) %>%
# A few (less than 500) values are missing from the mailed assessment stage
# AV column. We can replace any missing mailed value with certified value
# from the same year. Only 2 board/certified values are missing, and both are
@@ -97,7 +127,8 @@ pin_fill <- pin %>%
av_board = ifelse(is.na(av_board), 0L, av_board),
av_certified = ifelse(is.na(av_certified), 0L, av_certified),
av_mailed = ifelse(is.na(av_mailed), av_certified, av_mailed)
)
) %>%
select(-ends_with("_tot"))
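The Athena backfill added in this hunk follows a join-then-coalesce pattern: join the warehouse values, use them wherever the legacy column is `NA`, then drop the helper columns. A minimal sketch with toy data (not real PINs):

```r
library(dplyr)

legacy <- tibble(pin = c("1", "2"), year = "2022", av_board = c(100, NA))
athena <- tibble(pin = c("1", "2"), year = "2022", board_tot = c(100L, 250L))

filled <- legacy %>%
  left_join(athena, by = c("year", "pin")) %>%
  # Prefer the legacy value; fall back to the warehouse value when missing
  mutate(av_board = ifelse(is.na(av_board), board_tot, av_board)) %>%
  select(-ends_with("_tot"))
```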

# Write to S3
arrow::write_dataset(
9 binary files not shown.
4 changes: 2 additions & 2 deletions data-raw/sample_tax_bills/agency_name_match.csv
Git LFS file not shown